<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Balajee Asish Brahmandam - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Balajee Asish Brahmandam - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Wed, 06 May 2026 16:59:18 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/Balajeeasish/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ What Happened When I Replaced Copilot with Claude Code for 2 Weeks ]]>
                </title>
                <description>
                    <![CDATA[ GitHub Copilot costs $10/month, and I'd been using it for two years without thinking twice. But when Claude Code launched, I got curious. What if I just... switched? I didn't want to just add Claude C ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-happened-when-i-replaced-copilot-with-claude-code-for-2-weeks/</link>
                <guid isPermaLink="false">69c6d07e7cf2706510370b13</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ claude.ai ]]>
                    </category>
                
                    <category>
                        <![CDATA[ copilot ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GitHub ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Programming Tips ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Fri, 27 Mar 2026 18:46:22 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/b4f5a663-3ef6-4fcb-a08c-1c0ff36c495d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>GitHub Copilot costs $10/month, and I'd been using it for two years without thinking twice. But when Claude Code launched, I got curious. What if I just... switched?</p>
<p>I didn't want to just add Claude Code to my stack. I actually wanted to replace Copilot entirely for two weeks. I kept everything else the same – same editor, same projects, same workflow. I just swapped the autocomplete suggestion tool.</p>
<p>Here's what broke, what improved, and whether I went back.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-setup">The Setup</a></p>
</li>
<li><p><a href="#heading-what-worked-better">What Worked Better</a></p>
</li>
<li><p><a href="#heading-what-broke-or-slowed-things-down">What Broke (Or Slowed Things Down)</a></p>
</li>
<li><p><a href="#heading-the-first-week-vs-the-second-week">The First Week vs The Second Week</a></p>
</li>
<li><p><a href="#heading-why-i-went-back">Why I Went Back</a></p>
</li>
<li><p><a href="#heading-the-honest-verdict">The Honest Verdict</a></p>
</li>
<li><p><a href="#heading-what-i-actually-use-now">What I Actually Use Now</a></p>
</li>
<li><p><a href="#heading-copilot-vs-claude-code-the-breakdown">Copilot vs Claude Code — The Breakdown</a></p>
</li>
<li><p><a href="#heading-a-word-on-developer-experience">A Word on Developer Experience</a></p>
</li>
<li><p><a href="#heading-what-would-make-me-switch">What Would Make Me Switch</a></p>
</li>
<li><p><a href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h2 id="heading-the-setup">The Setup</h2>
<p><strong>Environment:</strong></p>
<ul>
<li><p>Python 3.12 for backend work (Django REST framework specifically)</p>
</li>
<li><p>React/TypeScript for frontend</p>
</li>
<li><p>VSCode as my editor</p>
</li>
<li><p>A mid-sized project with about 15k lines of code across backend and frontend</p>
</li>
<li><p>Two weeks, normal workload (roughly 30-40 hours of coding)</p>
</li>
<li><p>Working on features I'd normally tackle: adding endpoints, debugging issues, writing tests</p>
</li>
</ul>
<p><strong>What I did:</strong></p>
<ul>
<li><p>Disabled GitHub Copilot completely. Uninstalled the extension.</p>
</li>
<li><p>Set up Claude Code (via their CLI and VSCode integration).</p>
</li>
<li><p>Kept everything else identical: same repos, same Git flow, same daily work.</p>
</li>
<li><p>Tracked time on each task to see if there was a real difference.</p>
</li>
</ul>
<p><strong>Ground rules:</strong></p>
<ul>
<li><p>I couldn't use Copilot as a fallback. This was an honest comparison.</p>
</li>
<li><p>I logged every time I got frustrated or felt like Claude Code was slowing me down.</p>
</li>
<li><p>I kept track of bugs I caught vs. bugs I missed.</p>
</li>
</ul>
<p>The goal: Does Claude Code work as a day-to-day replacement for Copilot, or does it force me back?</p>
<h2 id="heading-what-worked-better">What Worked Better</h2>
<h3 id="heading-accuracy">Accuracy</h3>
<p>Copilot sometimes suggests things that are close but not quite right. It might finish a regex pattern 80% correctly, and I have to tweak it. It happens maybe 20% of the time.</p>
<p>Claude Code was more accurate. In the first week, I noticed fewer "close but wrong" suggestions. When I typed a function signature, Claude got the implementation right more often than Copilot did.</p>
<p>One example: I was writing a utility to parse JSON and handle errors. Copilot suggested:</p>
<pre><code class="language-python">def parse_json(data):
 try:
 return json.loads(data)
 except:
 return None
</code></pre>
<p>That's sloppy. It catches all exceptions and silently fails.</p>
<p>Claude Code suggested:</p>
<pre><code class="language-python">def parse_json(data):
 try:
 return json.loads(data)
 except json.JSONDecodeError as e:
 logging.error(f"Failed to parse JSON: {e}")
 return None
 except Exception as e:
 logging.error(f"Unexpected error: {e}")
 raise
</code></pre>
<p>Better error handling. More production-ready. That's a real difference.</p>
<p>I estimate Claude Code's suggestions were "immediately usable" about 85% of the time. Copilot was more like 70%.</p>
<h3 id="heading-understanding-context">Understanding Context</h3>
<p>Claude Code seems to understand your project better than Copilot. When I opened a file with Claude Code context, it knew:</p>
<ul>
<li><p>My project's naming conventions (I use <code>fetch_</code> for async functions, <code>get_</code> for sync).</p>
</li>
<li><p>My error handling style.</p>
</li>
<li><p>What libraries I was using.</p>
</li>
</ul>
<p>Copilot sometimes forgot these patterns or suggested things using the wrong library. Claude Code was more consistent.</p>
<p>One morning I was adding a new endpoint to an existing API. I typed the route signature:</p>
<pre><code class="language-python">@app.post("/api/users")
async def create_user(data: UserPayload):
</code></pre>
<p>Copilot might suggest:</p>
<pre><code class="language-python"> response = requests.post(...)
</code></pre>
<p>(Wrong! That's sync. This function is async.)</p>
<p>Claude Code suggested:</p>
<pre><code class="language-python"> async with httpx.AsyncClient() as client:
 response = await client.post(...)
</code></pre>
<p>It remembered that the entire codebase uses async/await and httpx for async calls. That's attention to detail.</p>
<h3 id="heading-reasoning-about-requirements">Reasoning About Requirements</h3>
<p>Sometimes Copilot just completes code. It doesn't think about whether it makes sense.</p>
<p>Claude Code seemed to reason about whether the suggestion was actually what you wanted. A few times, when I was writing ambiguous code, Claude Code offered a clarifying suggestion instead of just finishing it.</p>
<p>Example: I started a function for sorting users:</p>
<pre><code class="language-python">def sort_users(users):
</code></pre>
<p>Copilot would auto-complete with some sorting logic, but I'd have to check if it was what I meant.</p>
<p>Claude Code would sometimes suggest:</p>
<pre><code class="language-python">def sort_users(users, key="created_at", reverse=False):
</code></pre>
<p>It was thinking: "Sorting is ambiguous. What key? What order?" It was right more often than not.</p>
<h2 id="heading-what-broke-or-slowed-things-down">What Broke (Or Slowed Things Down)</h2>
<h3 id="heading-response-time">Response Time</h3>
<p>This was the biggest issue. Copilot is instant. I type <code>def get_</code> and it finishes before I can blink. It's autocomplete, and autocomplete needs to be fast. The latency is maybe 100-200ms.</p>
<p>Claude Code has a noticeable delay. Maybe 1-2 seconds before suggestions appear. On day one, that felt fine – I had time to think. By day two, I was annoyed. By day three, I was genuinely frustrated.</p>
<p>Over a day of coding, that adds up. If you're typing 20 functions and each one has a 2-second delay, that's 40 seconds of just waiting. It doesn't sound like much, but it breaks flow. Flow is where the good coding happens.</p>
<p>By day three, I was getting frustrated. I'd type faster than Claude Code could suggest, which meant I'd often just finish the code myself. The second a suggestion appeared, I'd already moved on. Defeating the purpose.</p>
<p>I tested this by tracking time. Same function, same complexity:</p>
<ul>
<li><p><strong>With Copilot:</strong> 3 minutes (including auto-complete time)</p>
</li>
<li><p><strong>With Claude Code:</strong> 5 minutes (waiting for suggestions + finishing manually)</p>
</li>
</ul>
<p>The delay isn't theoretical. It's real and measurable.</p>
<p><strong>The truth:</strong> Copilot is an autocomplete tool. It needs sub-second latency. Claude Code, being more powerful, is inherently slower. That's a fundamental tradeoff. You can't have both "instant" and "smart." Choose one.</p>
<h3 id="heading-no-inline-acceptance">No Inline Acceptance</h3>
<p>With Copilot, I press Tab to accept. It's in my muscle memory. Tab = accept.</p>
<p>Claude Code doesn't work exactly the same way. I had to click or use a different keyboard shortcut. Small thing, but it broke my rhythm constantly. I'd write code, see a suggestion, and instinctively press Tab. Nothing would happen. Then I'd remember: "Oh right, it's a different tool."</p>
<p>After two weeks, I never fully got used to it.</p>
<h3 id="heading-disconnected-from-flow">Disconnected From Flow</h3>
<p>Copilot is so embedded in the editor that I don't think about it. It's just there, like spellcheck. Claude Code feels like a separate tool I'm using, which means I'm more aware of it. That sounds like a good thing, but it's actually more cognitively expensive.</p>
<p>I wanted to type and have suggestions appear. Instead, I felt like I was using a tool. There's a difference. It's the same difference between walking and thinking about walking. When you're thinking about your walking mechanics, you walk worse.</p>
<p>This affected my productivity more than I expected. On day three, I found myself just typing manually instead of waiting for suggestions. It wasn't a conscious decision. I'd just start typing and then remember "oh, the suggestion came in." By then I'd already finished half the function myself.</p>
<h3 id="heading-limited-to-the-file">Limited to the File</h3>
<p>Copilot understands your entire project. It knows what's in other files, what libraries you import, what conventions you follow. If I'm importing a utility function that doesn't exist yet, Copilot knows to suggest the import with the path I'd use.</p>
<p>Claude Code seemed more limited to the current file. Sometimes it would lean on imports the project didn't actually use, or follow patterns different from the rest of my codebase. Not often, but enough to notice. On one occasion, it suggested a database query pattern that was different from my whole codebase. It would've worked, but it would've been inconsistent.</p>
<p>This is less of a limitation and more of a design difference. Claude Code is built for depth on individual files, not breadth across a project.</p>
<h2 id="heading-the-first-week-vs-the-second-week">The First Week vs The Second Week</h2>
<p><strong>Week 1:</strong> I was excited. Claude Code felt smarter. I noticed the accuracy advantage. But the latency was starting to annoy me.</p>
<p><strong>Week 2:</strong> The novelty wore off. The latency was more annoying. I was missing Copilot's speed. I found myself disabling Claude Code's suggestions and typing manually more often, which defeated the purpose. "If I'm typing it all manually anyway, why switch?"</p>
<p>By day 10, I was typing code faster with Claude Code disabled than with it enabled. That's when I knew it wasn't working for me.</p>
<h2 id="heading-why-i-went-back">Why I Went Back</h2>
<p>On day 14, I re-enabled Copilot.</p>
<p>The first thing I noticed: speed. Code was completing instantly again. My rhythm came back. I hit Tab, it accepted, I moved on. That's the entire appeal of Copilot: it's frictionless.</p>
<p>I also realized how much I'd been manually typing. On days 10-14, I was writing more code by hand because the suggestions felt too slow to be worth waiting for. Without realizing it, I'd completely stopped using Claude Code's suggestions. I was just typing. That's the worst of both worlds: no AI help and the cognitive burden of being aware you're using a tool that's not helping.</p>
<p>Was I sacrificing accuracy? A little. But I'm accurate enough that I catch mistakes in review. For day-to-day, Copilot is fine.</p>
<p>The second thing: it just works. No weird setup, no integration issues. It's part of VSCode. It's always there.</p>
<p>By day 15, I was back to normal productivity, maybe even higher because the flow was better.</p>
<h2 id="heading-the-honest-verdict">The Honest Verdict</h2>
<p>Claude Code isn't a Copilot replacement. It's not worse. It's different. It's like comparing a pocket calculator to a smartphone. One is designed for speed and muscle memory. The other is a full computer in your pocket. They're not competitors.</p>
<p>If I'd tried Claude Code expecting it to be better at debugging, I would've been happy. I was trying it expecting it to replace my autocomplete, which is where it falls flat.</p>
<p>The experiment was valuable, though. It taught me that:</p>
<ol>
<li><p>Latency matters more than I expected. A 2-second delay breaks flow.</p>
</li>
<li><p>Familiarity matters. Tab to accept is burned into my muscle memory.</p>
</li>
<li><p>Tool stacking works. Claude Code is great for debugging. Copilot is great for autocomplete. Together they're better than either alone.</p>
</li>
</ol>
<h2 id="heading-what-i-actually-use-now">What I Actually Use Now</h2>
<p>I didn't abandon Claude Code. I just changed how I use it.</p>
<ul>
<li><p><strong>Claude Code:</strong> For debugging, analysis, and big changes. "Why is this function slow?" "Refactor this for readability." I invoke it deliberately when I need thinking, not continuous autocomplete.</p>
</li>
<li><p><strong>Copilot:</strong> For routine coding. Finishing functions, auto-completing imports, normal flow.</p>
</li>
</ul>
<p>That's the working solution. Claude Code is powerful, but it's not a Copilot replacement for daily work. It's a different tool for a different use case.</p>
<h2 id="heading-copilot-vs-claude-code-the-breakdown">Copilot vs Claude Code: The Breakdown</h2>
<p><strong>Copilot is better for:</strong></p>
<ul>
<li><p>Pure autocomplete speed</p>
</li>
<li><p>Routine, well-understood coding</p>
</li>
<li><p>Low friction, high flow state</p>
</li>
<li><p>Simple suggestions</p>
</li>
</ul>
<p><strong>Claude Code is better for:</strong></p>
<ul>
<li><p>Complex suggestions that require reasoning</p>
</li>
<li><p>Debugging and analysis</p>
</li>
<li><p>Understanding intent (not just completing code)</p>
</li>
<li><p>Asking questions about code you've written</p>
</li>
</ul>
<p>If you're a Copilot user thinking about switching, don't do it as a straight replacement. Claude Code isn't faster. It's smarter, but slower, and for day-to-day autocomplete, faster wins.</p>
<p>Try using both. Use Copilot for normal coding, Claude Code for debugging and complex changes. If you only want to pay for one, stick with Copilot. It's cheaper, it's faster, and it does the job.</p>
<p>If you're a heavy debugger and you spend a lot of time analyzing code, Claude Code might be worth it. But as a Copilot replacement? No.</p>
<h2 id="heading-a-word-on-developer-experience">A Word on Developer Experience</h2>
<p>What surprised me wasn't just the latency. It was how much I missed the seamlessness of Copilot. With Copilot, I don't think about it. It's like breathing: automatic. I type, it suggests, I accept or reject, I move on.</p>
<p>With Claude Code, I was constantly aware I was using a tool. I'd finish typing before the suggestion appeared. I'd have to remember the keyboard shortcut. I'd have to context-switch to look at the suggestion.</p>
<p>That awareness is exhausting. It's why flow state is so important to programming. The best tools get out of your way. Copilot gets out of the way. Claude Code, for autocomplete purposes, doesn't.</p>
<p>Developer experience isn't a nice-to-have. It's core to productivity. A tool that's 10% smarter but 50% more annoying is worse, not better.</p>
<h2 id="heading-what-would-make-me-switch">What Would Make Me Switch</h2>
<ul>
<li><p>Claude Code needs to get faster. Sub-second latency for suggestions.</p>
</li>
<li><p>It needs better editor integration. Tab to accept, like Copilot.</p>
</li>
<li><p>It needs to understand the full project, not just the current file.</p>
</li>
</ul>
<p>Once those three things happen, it'd be competitive. Until then, Copilot is still the better choice for daily coding work.</p>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>This experiment taught me something: better isn't always better. Claude Code is arguably smarter than Copilot. But Copilot is more efficient. For autocomplete, efficiency matters more than intelligence.</p>
<p>It's like comparing a sports car to a Jeep. The sports car is faster on a highway. The Jeep is better on a mountain trail. Neither is "better." They're different. Copilot is trying to predict the next line of code fast. Claude Code is trying to understand your code deeply. They're solving different problems.</p>
<p>I went back to Copilot not because Claude Code is bad. It's actually impressive. But it's a different category of tool. Using it for autocomplete is like using a hammer when you need a screwdriver. The hammer might be fancier, but the screwdriver does the job.</p>
<p>What surprised me most was how much latency matters. I didn't expect a 2-second delay to be that noticeable. But when you're in the zone, typing code, and the autocomplete lags, it completely breaks your flow. It's not about the absolute time. It's about the interruption.</p>
<p>Don't take my word for it though. Run your own two-week experiment. Pick a tool, commit to it, and see what happens. Track your productivity. Track your frustration. The best tool is the one you'll actually use. And you can only find that out by using it.</p>
<h2 id="heading-whats-next">What's Next?</h2>
<p>If you found this useful, I write about Docker, AI tools, and developer workflows every week. I'm Balajee Asish - Docker Captain, freeCodeCamp contributor, and currently building my way through the AI tools space one project at a time.</p>
<p>Got questions or built something similar? Drop a comment below or find me on <a href="https://github.com/balajee-asish">GitHub</a> and <a href="https://linkedin.com/in/balajee-asish">LinkedIn</a>.</p>
<p>Happy building.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Docker Container Doctor: How I Built an AI Agent That Monitors and Fixes My Containers ]]>
                </title>
                <description>
                    <![CDATA[ Maybe this sounds familiar: your production container crashes at 3 AM. By the time you wake up, it's been throwing the same error for 2 hours. You SSH in, pull logs, decode the cryptic stack trace, Go ]]>
                </description>
                <link>https://www.freecodecamp.org/news/docker-container-doctor-how-i-built-an-ai-agent-that-monitors-and-fixes-my-containers/</link>
                <guid isPermaLink="false">69c1768730a9b81e3a833f20</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agents ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Mon, 23 Mar 2026 17:21:11 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/8bb7701d-e519-407f-92ba-59639e13729d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Maybe this sounds familiar: your production container crashes at 3 AM. By the time you wake up, it's been throwing the same error for 2 hours. You SSH in, pull logs, decode the cryptic stack trace, Google the error, and finally restart it. Twenty minutes of your morning gone. And the worst part? It happens again next week.</p>
<p>I got tired of this cycle. I was running 5 containerized services on a single Linode box – a Flask API, a Postgres database, an Nginx reverse proxy, a Redis cache, and a background worker. Every other week, one of them would crash. The logs were messy. The errors weren't obvious. And I'd waste time debugging something that could've been auto-detected and fixed in seconds.</p>
<p>So I built something better: a Python agent that watches your containers in real-time, spots errors, figures out what went wrong using Claude, and fixes them without waking you up. I call it the Container Doctor. It's not magic. It's Docker API + LLM reasoning + some automation glue. Here's exactly how I built it, what went wrong along the way, and what I'd do differently.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-why-not-just-use-prometheus">Why Not Just Use Prometheus?</a></p>
</li>
<li><p><a href="#heading-the-architecture">The Architecture</a></p>
</li>
<li><p><a href="#heading-setting-up-the-project">Setting Up the Project</a></p>
</li>
<li><p><a href="#heading-the-monitoring-script--line-by-line">The Monitoring Script — Line by Line</a></p>
</li>
<li><p><a href="#heading-the-claude-diagnosis-prompt-and-why-structure-matters">The Claude Diagnosis Prompt (and Why Structure Matters)</a></p>
</li>
<li><p><a href="#heading-auto-fix-logic--being-conservative-on-purpose">Auto-Fix Logic — Being Conservative on Purpose</a></p>
</li>
<li><p><a href="#heading-adding-slack-notifications">Adding Slack Notifications</a></p>
</li>
<li><p><a href="#heading-health-check-endpoint">Health Check Endpoint</a></p>
</li>
<li><p><a href="#heading-rate-limiting-claude-calls">Rate Limiting Claude Calls</a></p>
</li>
<li><p><a href="#heading-docker-compose--the-full-setup">Docker Compose — The Full Setup</a></p>
</li>
<li><p><a href="#heading-real-errors-i-caught-in-production">Real Errors I Caught in Production</a></p>
</li>
<li><p><a href="#heading-cost-breakdown--what-this-actually-costs">Cost Breakdown — What This Actually Costs</a></p>
</li>
<li><p><a href="#heading-security-considerations">Security Considerations</a></p>
</li>
<li><p><a href="#heading-what-id-do-differently">What I'd Do Differently</a></p>
</li>
<li><p><a href="#heading-whats-next">What's Next?</a></p>
</li>
</ol>
<h2 id="heading-why-not-just-use-prometheus">Why Not Just Use Prometheus?</h2>
<p>Fair question. Prometheus, Grafana, DataDog – they're all great. But for my setup, they were overkill. I had 5 containers on a $20/month Linode. Setting up Prometheus means deploying a metrics server, configuring exporters for each service, building Grafana dashboards, and writing alert rules. That's a whole side project just to monitor a side project.</p>
<p>Even then, those tools tell you <em>what</em> happened. They'll show you a spike in memory or a 500 error rate. But they won't tell you <em>why</em>. You still need a human to look at the logs, figure out the root cause, and decide what to do.</p>
<p>That's the gap I wanted to fill. I didn't need another dashboard. I needed something that could read a stack trace, understand the context, and either fix it or tell me exactly what to do when I wake up. Claude turned out to be surprisingly good at this. It can read a Python traceback and tell you the issue faster than most junior devs (and some senior ones, honestly).</p>
<h2 id="heading-the-architecture">The Architecture</h2>
<p>Here's how the pieces fit together:</p>
<pre><code class="language-plaintext">┌─────────────────────────────────────────────┐
│              Docker Host                      │
│                                               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │   web    │  │   api    │  │    db    │   │
│  │ (nginx)  │  │ (flask)  │  │(postgres)│   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │         │
│       └──────────────┼──────────────┘         │
│                      │                         │
│              Docker Socket                     │
│                      │                         │
│            ┌─────────┴─────────┐              │
│            │ Container Doctor  │              │
│            │  (Python agent)   │              │
│            └─────────┬─────────┘              │
│                      │                         │
└──────────────────────┼─────────────────────────┘
                       │
              ┌────────┴────────┐
              │   Claude API    │
              │  (diagnosis)    │
              └────────┬────────┘
                       │
              ┌────────┴────────┐
              │  Slack Webhook  │
              │  (alerts)       │
              └─────────────────┘
</code></pre>
<p>The flow works like this:</p>
<ol>
<li><p>The Container Doctor runs in its own container with the Docker socket mounted</p>
</li>
<li><p>Every 10 seconds, it pulls the last 50 lines of logs from each target container</p>
</li>
<li><p>It scans for error patterns (keywords like "error", "exception", "traceback", "fatal")</p>
</li>
<li><p>When it finds something, it sends the logs to Claude with a structured prompt</p>
</li>
<li><p>Claude returns a JSON diagnosis: root cause, severity, suggested fix, and whether it's safe to auto-restart</p>
</li>
<li><p>If severity is high and auto-restart is safe, the script restarts the container</p>
</li>
<li><p>Either way, it sends a Slack notification with the full diagnosis</p>
</li>
<li><p>A simple health endpoint lets you check the doctor's own status</p>
</li>
</ol>
<p>The key insight: the script doesn't try to be smart about the diagnosis itself. It outsources all the thinking to Claude. The script's job is just plumbing: collecting logs, routing them to Claude, and executing the response.</p>
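<p>Before diving into the full script, here's the agent in skeleton form – a simplified preview with the safety machinery stripped out, just to show the shape of that plumbing:</p>
<pre><code class="language-python"># Simplified skeleton of the monitoring loop. The full script below adds
# deduplication, rate limiting, restart throttling, and a health endpoint.
while True:
    for name in TARGET_CONTAINERS:
        logs = get_container_logs(name)                 # Docker SDK: last N log lines
        patterns = detect_errors(logs) if logs else []  # cheap keyword scan
        if not patterns:
            continue
        diagnosis = parse_diagnosis(diagnose_with_claude(name, logs, patterns))
        if diagnosis:
            if diagnosis.get("severity") == "high":
                apply_fix(name, diagnosis)              # restart, with safety checks
            send_slack_alert(name, diagnosis)
    time.sleep(CHECK_INTERVAL)
</code></pre>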
<h2 id="heading-setting-up-the-project">Setting Up the Project</h2>
<p>Create your project directory:</p>
<pre><code class="language-bash">mkdir container-doctor &amp;&amp; cd container-doctor
</code></pre>
<p>Here's your <code>requirements.txt</code>:</p>
<pre><code class="language-plaintext">docker==7.0.0
anthropic&gt;=0.28.0
python-dotenv==1.0.0
flask==3.0.0
requests==2.31.0
</code></pre>
<p>Install locally for testing: <code>pip install -r requirements.txt</code></p>
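<p>Before wiring everything together, a quick throwaway script is a cheap way to confirm the Docker SDK can actually reach your daemon (this assumes Docker is running locally and your user can access the socket):</p>
<pre><code class="language-python"># quick_check.py – throwaway sanity check, not part of the final project
import docker

client = docker.from_env()
print("Docker daemon reachable:", client.ping())
print("Running containers:", [c.name for c in client.containers.list()])
</code></pre>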
<p>Create a <code>.env</code> file:</p>
<pre><code class="language-bash">ANTHROPIC_API_KEY=sk-ant-...
TARGET_CONTAINERS=web,api,db
CHECK_INTERVAL=10
LOG_LINES=50
AUTO_FIX=true
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
POSTGRES_USER=user
POSTGRES_PASSWORD=changeme
POSTGRES_DB=mydb
MAX_DIAGNOSES_PER_HOUR=20
</code></pre>
<p>A quick note on <code>CHECK_INTERVAL</code>: 10 seconds is aggressive. For production, I'd bump this to 30-60 seconds. I kept it low during development so I could see results faster, and honestly forgot to change it. My API bill reminded me.</p>
<h2 id="heading-the-monitoring-script-line-by-line">The Monitoring Script – Line by Line</h2>
<p>Here's the full <code>container_doctor.py</code>. I'll walk through the important parts after:</p>
<pre><code class="language-python">import docker
import json
import time
import logging
import os
import requests
from datetime import datetime, timedelta
from collections import defaultdict
from threading import Thread
from flask import Flask, jsonify
from anthropic import Anthropic

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

client = Anthropic()
docker_client = None

# --- Config ---
TARGET_CONTAINERS = os.getenv("TARGET_CONTAINERS", "").split(",")
CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL", "10"))
LOG_LINES = int(os.getenv("LOG_LINES", "50"))
AUTO_FIX = os.getenv("AUTO_FIX", "true").lower() == "true"
SLACK_WEBHOOK = os.getenv("SLACK_WEBHOOK_URL", "")
MAX_DIAGNOSES = int(os.getenv("MAX_DIAGNOSES_PER_HOUR", "20"))

# --- State tracking ---
diagnosis_history = []
fix_history = defaultdict(list)
last_error_seen = {}
rate_limit_counter = defaultdict(int)
rate_limit_reset = datetime.now() + timedelta(hours=1)

app = Flask(__name__)


def get_docker_client():
    """Lazily initialize Docker client."""
    global docker_client
    if docker_client is None:
        docker_client = docker.from_env()
    return docker_client


def get_container_logs(container_name):
    """Fetch last N lines from a container."""
    try:
        container = get_docker_client().containers.get(container_name)
        logs = container.logs(
            tail=LOG_LINES,
            timestamps=True
        ).decode("utf-8")
        return logs
    except docker.errors.NotFound:
        logger.warning(f"Container '{container_name}' not found. Skipping.")
        return None
    except docker.errors.APIError as e:
        logger.error(f"Docker API error for {container_name}: {e}")
        return None
    except Exception as e:
        logger.error(f"Unexpected error fetching logs for {container_name}: {e}")
        return None


def detect_errors(logs):
    """Check if logs contain error patterns."""
    error_patterns = [
        "error", "exception", "traceback", "failed", "crash",
        "fatal", "panic", "segmentation fault", "out of memory",
        "killed", "oomkiller", "connection refused", "timeout",
        "permission denied", "no such file", "errno"
    ]
    logs_lower = logs.lower()
    found = []
    for pattern in error_patterns:
        if pattern in logs_lower:
            found.append(pattern)
    return found


def is_new_error(container_name, logs):
    """Check if this is a new error or the same one we already diagnosed."""
    log_hash = hash(logs[-200:])  # Hash last 200 chars
    if last_error_seen.get(container_name) == log_hash:
        return False
    last_error_seen[container_name] = log_hash
    return True


def check_rate_limit():
    """Ensure we don't spam Claude with too many requests."""
    global rate_limit_counter, rate_limit_reset

    now = datetime.now()
    if now &gt; rate_limit_reset:
        rate_limit_counter.clear()
        rate_limit_reset = now + timedelta(hours=1)

    total = sum(rate_limit_counter.values())
    if total &gt;= MAX_DIAGNOSES:
        logger.warning(f"Rate limit reached ({total}/{MAX_DIAGNOSES} per hour). Skipping diagnosis.")
        return False
    return True


def diagnose_with_claude(container_name, logs, error_patterns):
    """Send logs to Claude for diagnosis."""
    if not check_rate_limit():
        return None

    rate_limit_counter[container_name] += 1

    prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""

    try:
        message = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=600,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return message.content[0].text
    except Exception as e:
        logger.error(f"Claude API error: {e}")
        return None


def parse_diagnosis(diagnosis_text):
    """Extract JSON from Claude's response."""
    if not diagnosis_text:
        return None
    try:
        start = diagnosis_text.find("{")
        end = diagnosis_text.rfind("}") + 1
        if start &gt;= 0 and end &gt; start:
            json_str = diagnosis_text[start:end]
            return json.loads(json_str)
    except json.JSONDecodeError as e:
        logger.error(f"JSON parse error: {e}")
        logger.debug(f"Raw response: {diagnosis_text}")
    except Exception as e:
        logger.error(f"Failed to parse diagnosis: {e}")
    return None


def apply_fix(container_name, diagnosis):
    """Apply auto-fixes if safe."""
    if not AUTO_FIX:
        logger.info(f"Auto-fix disabled globally. Skipping {container_name}.")
        return False

    if not diagnosis.get("auto_restart_safe"):
        logger.info(f"Claude says restart is unsafe for {container_name}. Skipping.")
        return False

    # Don't restart the same container more than 3 times per hour
    recent_fixes = [
        t for t in fix_history[container_name]
        if t &gt; datetime.now() - timedelta(hours=1)
    ]
    if len(recent_fixes) &gt;= 3:
        logger.warning(
            f"Container {container_name} already restarted {len(recent_fixes)} "
            f"times this hour. Something deeper is wrong. Skipping."
        )
        send_slack_alert(
            container_name, diagnosis,
            extra="REPEATED FAILURE: This container has been restarted 3+ times "
                  "in the last hour. Manual intervention needed."
        )
        return False

    try:
        container = get_docker_client().containers.get(container_name)
        logger.info(f"Restarting container {container_name}...")
        container.restart(timeout=30)
        fix_history[container_name].append(datetime.now())
        logger.info(f"Container {container_name} restarted successfully")

        # Verify it's actually running after restart
        time.sleep(5)
        container.reload()
        if container.status != "running":
            logger.error(f"Container {container_name} failed to start after restart")
            return False

        return True
    except Exception as e:
        logger.error(f"Failed to restart {container_name}: {e}")
        return False


def send_slack_alert(container_name, diagnosis, extra=""):
    """Send diagnosis to Slack."""
    if not SLACK_WEBHOOK:
        return

    severity_emoji = {
        "low": "🟡",
        "medium": "🟠",
        "high": "🔴"
    }

    severity = diagnosis.get("severity", "unknown")
    emoji = severity_emoji.get(severity, "⚪")

    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"{emoji} Container Doctor Alert: {container_name}"
            }
        },
        {
            "type": "section",
            "fields": [
                {"type": "mrkdwn", "text": f"*Severity:* {severity}"},
                {"type": "mrkdwn", "text": f"*Container:* `{container_name}`"},
                {"type": "mrkdwn", "text": f"*Root Cause:* {diagnosis.get('root_cause', 'Unknown')}"},
                {"type": "mrkdwn", "text": f"*Fix:* {diagnosis.get('suggested_fix', 'N/A')}"},
            ]
        }
    ]

    if diagnosis.get("config_suggestions"):
        suggestions = "\n".join(
            f"• `{s}`" for s in diagnosis["config_suggestions"]
        )
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Config Suggestions:*\n{suggestions}"
            }
        })

    if extra:
        blocks.append({
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*⚠️ {extra}*"}
        })

    try:
        requests.post(SLACK_WEBHOOK, json={"blocks": blocks}, timeout=10)
    except Exception as e:
        logger.error(f"Slack notification failed: {e}")


# --- Health Check Endpoint ---
@app.route("/health")
def health():
    """Health check endpoint for the doctor itself."""
    try:
        get_docker_client().ping()
        docker_ok = True
    except Exception:
        docker_ok = False

    return jsonify({
        "status": "healthy" if docker_ok else "degraded",
        "docker_connected": docker_ok,
        "monitoring": TARGET_CONTAINERS,
        "total_diagnoses": len(diagnosis_history),
        "fixes_applied": {k: len(v) for k, v in fix_history.items()},
        "rate_limit_remaining": MAX_DIAGNOSES - sum(rate_limit_counter.values()),
        "uptime_check": datetime.now().isoformat()
    })


@app.route("/history")
def history():
    """Return recent diagnosis history."""
    return jsonify(diagnosis_history[-50:])


def monitor_containers():
    """Main monitoring loop."""
    logger.info(f"Container Doctor starting up")
    logger.info(f"Monitoring: {TARGET_CONTAINERS}")
    logger.info(f"Check interval: {CHECK_INTERVAL}s")
    logger.info(f"Auto-fix: {AUTO_FIX}")
    logger.info(f"Rate limit: {MAX_DIAGNOSES}/hour")

    while True:
        for container_name in TARGET_CONTAINERS:
            container_name = container_name.strip()
            if not container_name:
                continue

            logs = get_container_logs(container_name)
            if not logs:
                continue

            error_patterns = detect_errors(logs)
            if not error_patterns:
                continue

            # Skip if we already diagnosed this exact error
            if not is_new_error(container_name, logs):
                continue

            logger.warning(
                f"Errors detected in {container_name}: {error_patterns}"
            )

            diagnosis_text = diagnose_with_claude(
                container_name, logs, error_patterns
            )
            if not diagnosis_text:
                continue

            diagnosis = parse_diagnosis(diagnosis_text)
            if not diagnosis:
                logger.error("Failed to parse Claude's response. Skipping.")
                continue

            # Record it
            diagnosis_history.append({
                "container": container_name,
                "timestamp": datetime.now().isoformat(),
                "diagnosis": diagnosis,
                "patterns": error_patterns
            })

            logger.info(
                f"Diagnosis for {container_name}: "
                f"severity={diagnosis.get('severity')}, "
                f"cause={diagnosis.get('root_cause')}"
            )

            # Auto-fix only on high severity
            fixed = False
            if diagnosis.get("severity") == "high":
                fixed = apply_fix(container_name, diagnosis)

            # Always notify Slack
            send_slack_alert(
                container_name, diagnosis,
                extra="Auto-restarted" if fixed else ""
            )

        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    # Run Flask health endpoint in background
    flask_thread = Thread(
        target=lambda: app.run(host="0.0.0.0", port=8080, debug=False),
        daemon=True
    )
    flask_thread.start()
    logger.info("Health endpoint running on :8080")

    try:
        monitor_containers()
    except KeyboardInterrupt:
        logger.info("Container Doctor shutting down")
</code></pre>
<p>That's a lot of code, so let me walk through the parts that matter.</p>
<p><strong>Error deduplication (</strong><code>is_new_error</code><strong>)</strong>: This was a lesson I learned the hard way. Without this, the script would see the same error every 10 seconds and spam Claude with identical requests. I hash the last 200 characters of the log output and skip if it matches the last error we saw. Simple, but it cut my API costs by about 80%.</p>
<p><strong>Rate limiting (</strong><code>check_rate_limit</code><strong>)</strong>: Belt and suspenders. Even with deduplication, I cap it at 20 diagnoses per hour. If something is so broken that it's generating 20+ unique errors per hour, you need a human anyway.</p>
<p><strong>Restart throttling (inside</strong> <code>apply_fix</code><strong>)</strong>: If the same container has been restarted 3 times in an hour, something deeper is wrong. A restart loop won't fix a misconfigured database or a missing volume. The script stops restarting and sends a louder Slack alert instead.</p>
<p><strong>Post-restart verification</strong>: After restarting, the script waits 5 seconds and checks if the container is actually running. I've seen cases where a container restarts and immediately crashes again. Without this check, the script would report success while the container is still down.</p>
<h2 id="heading-the-claude-diagnosis-prompt-and-why-structure-matters">The Claude Diagnosis Prompt (and Why Structure Matters)</h2>
<p>Getting Claude to return parseable JSON took some iteration. My first attempt used a casual prompt and I got back paragraphs of explanation with JSON buried somewhere in the middle. Sometimes it'd use markdown code fences, sometimes not.</p>
<p>The version I landed on is explicit about format:</p>
<pre><code class="language-python">prompt = f"""You are a DevOps expert analyzing container logs.

Container: {container_name}
Timestamp: {datetime.now().isoformat()}
Detected patterns: {', '.join(error_patterns)}

Recent logs:
---
{logs}
---

Analyze these logs and respond with ONLY valid JSON (no markdown, no explanation):
{{
    "root_cause": "One sentence explaining exactly what went wrong",
    "severity": "low|medium|high",
    "suggested_fix": "Step-by-step fix the operator should apply",
    "auto_restart_safe": true or false,
    "config_suggestions": ["ENV_VAR=value", "..."],
    "likely_recurring": true or false,
    "estimated_impact": "What breaks if this isn't fixed"
}}
"""
</code></pre>
<p>A few things I learned:</p>
<p><strong>Include the detected patterns.</strong> Telling Claude "I found 'timeout' and 'connection refused'" helps it focus. Without this, it sometimes fixated on irrelevant warnings in the logs.</p>
<p><strong>Ask for</strong> <code>estimated_impact</code><strong>.</strong> This field turned out to be the most useful in Slack alerts. When your team sees "Database connections will pile up and crash the API within 15 minutes," they act faster than when they see "connection pool exhausted."</p>
<p><code>likely_recurring</code> <strong>is gold.</strong> If Claude says an issue is likely to recur, I know a restart is a band-aid and I need to actually fix the root cause. I flag these in Slack with extra emphasis.</p>
<p>Claude returns something like:</p>
<pre><code class="language-json">{
    "root_cause": "Connection pool exhausted. Default pool size is 5, but app has 8+ concurrent workers.",
    "severity": "high",
    "suggested_fix": "1. Set POOL_SIZE=20 in environment. 2. Add connection timeout of 30s. 3. Consider a connection pooler like PgBouncer.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "CONNECTION_TIMEOUT=30"],
    "likely_recurring": true,
    "estimated_impact": "API requests will queue and timeout. Users will see 503 errors within 2-3 minutes."
}
</code></pre>
<p>I only auto-restart on <code>high</code> severity. Medium and low issues get logged, sent to Slack, and I deal with them during business hours. This distinction matters: you don't want the script restarting containers over every transient warning.</p>
<h2 id="heading-auto-fix-logic-being-conservative-on-purpose">Auto-Fix Logic – Being Conservative on Purpose</h2>
<p>The auto-fix function is intentionally limited. Right now it only restarts containers. It doesn't modify environment variables, change configs, or scale services. Here's why:</p>
<p>Restarting is safe and reversible. If the restart makes things worse, the container just crashes again and I get another alert. But if the script started changing environment variables or modifying docker-compose files, a bad decision could cascade across services.</p>
<p>The three safety checks before any restart:</p>
<ol>
<li><p><strong>Global toggle</strong>: <code>AUTO_FIX=true</code> in .env. I can kill all auto-fixes instantly by changing one variable.</p>
</li>
<li><p><strong>Claude's assessment</strong>: <code>auto_restart_safe</code> must be true. If Claude says "don't restart this, it'll corrupt the database," the script listens.</p>
</li>
<li><p><strong>Restart throttle</strong>: No more than 3 restarts per container per hour. After that, it's a human problem.</p>
</li>
</ol>
<p>If I were building this for a team, I'd add approval flows. Send a Slack message with "Restart?" and two buttons. Wait for a human to click yes. That adds latency but removes the risk of automated chaos.</p>
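<p>For reference, the approval message itself is just another Block Kit payload – the real work is the small HTTP endpoint you'd have to register as the Slack app's interactivity URL to receive the button click. A rough sketch of the message half (the <code>action_id</code> values here are arbitrary placeholders):</p>
<pre><code class="language-python">def send_restart_approval(container_name, diagnosis):
    """Post an approval request to Slack instead of restarting immediately.

    Sketch only: handling the button click still requires an interactivity
    endpoint configured in your Slack app.
    """
    blocks = [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f"*{container_name}* looks unhealthy: "
                    f"{diagnosis.get('root_cause', 'unknown cause')}\nRestart it?"
                ),
            },
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Restart"},
                    "style": "primary",
                    "action_id": "approve_restart",
                    "value": container_name,
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Ignore"},
                    "action_id": "dismiss_restart",
                    "value": container_name,
                },
            ],
        },
    ]
    requests.post(SLACK_WEBHOOK, json={"blocks": blocks}, timeout=10)
</code></pre>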
<h2 id="heading-adding-slack-notifications">Adding Slack Notifications</h2>
<p>Every diagnosis gets sent to Slack, whether the container was restarted or not. The notification includes color-coded severity, root cause, suggested fix, and config suggestions.</p>
<p>The Slack Block Kit formatting makes these alerts scannable. A red dot for high severity, orange for medium, yellow for low. Your team can glance at the channel and know if they need to drop everything or if it can wait.</p>
<p>To set this up, create a Slack app at <a href="https://api.slack.com/apps">api.slack.com/apps</a>, add an incoming webhook, and paste the URL in your <code>.env</code>.</p>
<h2 id="heading-health-check-endpoint">Health Check Endpoint</h2>
<p>The doctor needs a doctor. I added a simple Flask endpoint so I can monitor the monitoring script:</p>
<pre><code class="language-bash">curl http://localhost:8080/health
</code></pre>
<p>Returns:</p>
<pre><code class="language-json">{
    "status": "healthy",
    "docker_connected": true,
    "monitoring": ["web", "api", "db"],
    "total_diagnoses": 14,
    "fixes_applied": {"api": 2, "web": 1},
    "rate_limit_remaining": 6,
    "uptime_check": "2026-03-15T14:30:00"
}
</code></pre>
<p>And <code>/history</code> returns the last 50 diagnoses:</p>
<pre><code class="language-bash">curl http://localhost:8080/history
</code></pre>
<p>I point an uptime checker (UptimeRobot, free tier) at the <code>/health</code> endpoint. If the Container Doctor itself goes down, I get an email. It's monitoring all the way down.</p>
<h2 id="heading-rate-limiting-claude-calls">Rate Limiting Claude Calls</h2>
<p>This is where I burned money during development. Without rate limiting, the script was sending 100+ requests per hour during a container crash loop. At a few cents per request, that's a few dollars per hour. Not catastrophic, but annoying.</p>
<p>The rate limiter is simple: a counter that resets every hour. Default cap is 20 diagnoses per hour. If you hit the limit, the script logs a warning and skips diagnosis until the window resets. Errors still get detected, they just don't get sent to Claude.</p>
<p>Combined with error deduplication (same error won't trigger a second diagnosis), this keeps my Claude bill under $5/month even with 5 containers monitored.</p>
<h2 id="heading-docker-compose-the-full-setup">Docker Compose – The Full Setup</h2>
<p>Here's the complete <code>docker-compose.yml</code> with the Container Doctor, a sample web server, API, and database:</p>
<pre><code class="language-yaml">version: '3.8'

services:
  container_doctor:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: container_doctor
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - TARGET_CONTAINERS=web,api,db
      - CHECK_INTERVAL=10
      - LOG_LINES=50
      - AUTO_FIX=true
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - MAX_DIAGNOSES_PER_HOUR=20
    ports:
      - "8080:8080"
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  web:
    image: nginx:latest
    container_name: web
    ports:
      - "80:80"
    restart: unless-stopped

  api:
    build: ./api
    container_name: api
    environment:
      - DATABASE_URL=postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@db:5432/${POSTGRES_DB}
      - POOL_SIZE=20
    depends_on:
      - db
    restart: unless-stopped

  db:
    image: postgres:15
    container_name: db
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - db_data:/var/lib/postgresql/data
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  db_data:
</code></pre>
<p>And the <code>Dockerfile</code>:</p>
<pre><code class="language-dockerfile">FROM python:3.12-slim

WORKDIR /app

RUN apt-get update &amp;&amp; apt-get install -y curl &amp;&amp; rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY container_doctor.py .

EXPOSE 8080

CMD ["python", "-u", "container_doctor.py"]
</code></pre>
<p>Start everything: <code>docker compose up -d</code></p>
<p><strong>Important:</strong> The socket mount (<code>/var/run/docker.sock:/var/run/docker.sock</code>) gives the Container Doctor full access to the Docker daemon, so treat that container as privileged (more on this in the security section below). And don't copy <code>.env</code> into the Docker image: it bakes your API key into the image layer. Pass environment variables via the compose file or at runtime.</p>
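<p>A <code>.dockerignore</code> along these lines keeps the key out of the build context in the first place (a minimal example – extend it for your own repo):</p>
<pre><code class="language-plaintext">.env
.git/
__pycache__/
*.pyc
</code></pre>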
<h2 id="heading-real-errors-i-caught-in-production">Real Errors I Caught in Production</h2>
<p>I've been running this for about 3 weeks now. Here are the actual incidents it caught:</p>
<h3 id="heading-incident-1-oom-kill-week-1">Incident 1: OOM Kill (Week 1)</h3>
<p>Logs showed a single word: <code>Killed</code>. That's Linux's OOMKiller doing its thing.</p>
<p>Claude's diagnosis:</p>
<pre><code class="language-json">{
    "root_cause": "Process killed by OOMKiller. Container is requesting more memory than the 256MB limit allows under load.",
    "severity": "high",
    "suggested_fix": "Increase memory limit to 512MB in docker-compose. Monitor if the leak continues at higher limits.",
    "auto_restart_safe": true,
    "config_suggestions": ["mem_limit: 512m", "memswap_limit: 1g"],
    "likely_recurring": true,
    "estimated_impact": "API is completely down. All requests return 502 from nginx."
}
</code></pre>
<p>The script restarted the container in 3 seconds. I updated the compose file the next morning. Before the Container Doctor, this would've been a 2-hour outage overnight.</p>
<h3 id="heading-incident-2-connection-pool-exhausted-week-2">Incident 2: Connection Pool Exhausted (Week 2)</h3>
<pre><code class="language-plaintext">ERROR: database connection pool exhausted
ERROR: cannot create new pool entry
ERROR: QueuePool limit of 5 overflow 0 reached
</code></pre>
<p>Claude caught that my pool size was too small for the number of workers:</p>
<pre><code class="language-json">{
    "root_cause": "SQLAlchemy connection pool (size=5) can't keep up with 8 concurrent Gunicorn workers. Each worker holds a connection during request processing.",
    "severity": "high",
    "suggested_fix": "Set POOL_SIZE=20 and add POOL_TIMEOUT=30. Long-term: add PgBouncer as a connection pooler.",
    "auto_restart_safe": true,
    "config_suggestions": ["POOL_SIZE=20", "POOL_TIMEOUT=30", "POOL_RECYCLE=3600"],
    "likely_recurring": true,
    "estimated_impact": "New API requests will hang for 30s then timeout. Existing requests may complete but slowly."
}
</code></pre>
<h3 id="heading-incident-3-transient-timeout-week-2">Incident 3: Transient Timeout (Week 2)</h3>
<pre><code class="language-plaintext">WARN: timeout connecting to upstream service
WARN: retrying request (attempt 2/3)
INFO: request succeeded on retry
</code></pre>
<p>Claude correctly identified this as a non-issue:</p>
<pre><code class="language-json">{
    "root_cause": "Transient network timeout during a DNS resolution hiccup. Retries succeeded.",
    "severity": "low",
    "suggested_fix": "No action needed. This is expected during brief network blips. Only investigate if frequency increases.",
    "auto_restart_safe": false,
    "config_suggestions": [],
    "likely_recurring": false,
    "estimated_impact": "Minimal. Individual requests delayed by ~2s but all completed."
}
</code></pre>
<p>No restart. No alert (I filter low-severity from Slack pings). This is the right call: restarting on every transient timeout causes more downtime than it prevents.</p>
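<p>One note on that: the monitoring loop in the listing above sends every diagnosis to Slack. The low-severity filtering is just a small guard around the notification call – a sketch of the change, not code from the listing:</p>
<pre><code class="language-python"># In monitor_containers(): only ping Slack for medium/high severity.
# Low-severity diagnoses still show up in the logs and the /history endpoint.
if diagnosis.get("severity") in ("medium", "high"):
    send_slack_alert(
        container_name, diagnosis,
        extra="Auto-restarted" if fixed else ""
    )
</code></pre>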
<h3 id="heading-incident-4-disk-full-week-3">Incident 4: Disk Full (Week 3)</h3>
<pre><code class="language-plaintext">ERROR: could not write to temporary file: No space left on device
FATAL: data directory has no space
</code></pre>
<pre><code class="language-json">{
    "root_cause": "Postgres data volume is full. WAL files and temporary sort files consumed all available space.",
    "severity": "high",
    "suggested_fix": "1. Clean WAL files: SELECT pg_switch_wal(). 2. Increase volume size. 3. Add log rotation. 4. Set max_wal_size=1GB.",
    "auto_restart_safe": false,
    "config_suggestions": ["max_wal_size=1GB", "log_rotation_age=1d"],
    "likely_recurring": true,
    "estimated_impact": "Database is read-only. All writes fail. API returns 500 on any mutation."
}
</code></pre>
<p>Notice Claude said <code>auto_restart_safe: false</code> here. Restarting Postgres when the disk is full can corrupt data. The script didn't touch it. It just sent me a detailed Slack alert at 4 AM. I cleaned up the WAL files the next morning. Good call by Claude.</p>
<h2 id="heading-cost-breakdown-what-this-actually-costs">Cost Breakdown – What This Actually Costs</h2>
<p>After 3 weeks of running this on 5 containers:</p>
<ul>
<li><p><strong>Claude API</strong>: ~$3.80/month (with rate limiting and deduplication)</p>
</li>
<li><p><strong>Linode compute</strong>: $0 extra (the Container Doctor uses about 50MB RAM)</p>
</li>
<li><p><strong>Slack</strong>: Free tier</p>
</li>
<li><p><strong>My time saved</strong>: ~2-3 hours/month of 3 AM debugging</p>
</li>
</ul>
<p>Without rate limiting, my first week cost $8 in API calls. The deduplication + rate limiter brought that down dramatically. Most of my containers run fine. The script only calls Claude when something actually breaks.</p>
<p>If you're monitoring more containers or have noisier logs, expect higher costs. The <code>MAX_DIAGNOSES_PER_HOUR</code> setting is your budget knob.</p>
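<p>A quick way to sanity-check that knob: worst case, your monthly spend is bounded by the hourly cap times a per-diagnosis cost. Here's a rough sketch – both numbers are placeholders, not real pricing:</p>
<pre><code class="language-bash">MAX_DIAGNOSES_PER_HOUR=4
COST_PER_DIAGNOSIS_USD=0.02   # placeholder – varies with model, prompt size, and log length
awk -v n="$MAX_DIAGNOSES_PER_HOUR" -v c="$COST_PER_DIAGNOSIS_USD" \
    'BEGIN { printf "worst-case monthly spend: $%.2f\n", n * 24 * 30 * c }'
</code></pre>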
<h2 id="heading-security-considerations">Security Considerations</h2>
<p>Let's talk about the elephant in the room: the Docker socket.</p>
<p>Mounting <code>/var/run/docker.sock</code> gives the Container Doctor <strong>root-equivalent access</strong> to your Docker daemon. It can start, stop, and remove any container. It can pull images. It can exec into running containers. If someone compromises the Container Doctor, they own your entire Docker host.</p>
<p>Here's how I mitigate this:</p>
<ol>
<li><p><strong>Network isolation</strong>: The Container Doctor's health endpoint is only exposed on localhost. In production, put it behind a reverse proxy with auth.</p>
</li>
<li><p><strong>Read-mostly access</strong>: The script only <em>reads</em> logs and <em>restarts</em> containers. It never execs into containers, pulls images, or modifies volumes.</p>
</li>
<li><p><strong>No external inputs</strong>: The script doesn't accept commands from Slack or any external source. It's outbound-only (logs out, alerts out).</p>
</li>
<li><p><strong>API key rotation</strong>: I rotate the Anthropic API key monthly. If the container is compromised, the key has limited blast radius.</p>
</li>
</ol>
<p>For a more secure setup, mount the socket read-only (append <code>:ro</code> to the bind mount) and put a tool like <a href="https://github.com/Tecnativa/docker-socket-proxy">docker-socket-proxy</a> in front of it to restrict which API calls the Container Doctor can make.</p>
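<p>If you go the proxy route, here's a rough sketch of what that looks like. The permission variables (<code>CONTAINERS</code>, <code>POST</code>, <code>ALLOW_RESTARTS</code>) come from the docker-socket-proxy README as I understand it – double-check them against the current docs before relying on this:</p>
<pre><code class="language-bash"># Expose only the Docker API endpoints the Container Doctor needs:
# reading containers/logs and issuing restarts. Everything else (exec, images,
# volumes, networks) stays blocked by the proxy's defaults.
docker run -d --name docker-proxy \
  -e CONTAINERS=1 -e POST=1 -e ALLOW_RESTARTS=1 \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -p 127.0.0.1:2375:2375 \
  tecnativa/docker-socket-proxy

# Then point the Container Doctor at the proxy instead of the raw socket:
#   DOCKER_HOST=tcp://127.0.0.1:2375
</code></pre>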
<h2 id="heading-what-id-do-differently">What I'd Do Differently</h2>
<p>After 3 weeks in production, here's my honest retrospective:</p>
<p><strong>I'd use structured logging from day one.</strong> My regex-based error detection catches too many false positives. A JSON log format with severity levels would make detection way more accurate.</p>
<p><strong>I'd add per-container policies.</strong> Right now, every container gets the same treatment. But you probably want different rules for a database vs a web server. Never auto-restart a database. Always auto-restart a stateless web server.</p>
<p><strong>I'd build a simple web UI.</strong> The <code>/history</code> endpoint returns JSON, but a small React dashboard showing a timeline of incidents, fix success rates, and cost tracking would be much more useful.</p>
<p><strong>I'd try local models first.</strong> For simple errors (OOM, connection refused), a small local model running on Ollama could handle the diagnosis without any API cost. Reserve Claude for the weird, complex stack traces where you actually need strong reasoning.</p>
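<p>For the local-model idea, the shape of it is simple enough to sketch. This assumes Ollama is running locally with a small model already pulled – the model name and prompt here are placeholders, not what I actually run:</p>
<pre><code class="language-bash"># Ask a local model for a structured diagnosis instead of calling the Claude API
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Diagnose this container log. Reply in JSON with root_cause and severity: ERROR: connection refused",
  "stream": false
}'
</code></pre>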
<p><strong>I'd add a "learning mode."</strong> Run the Container Doctor in observe-only mode for a week. Let it diagnose everything but fix nothing. Review the diagnoses manually. Once you trust its judgment, flip on auto-fix. This builds confidence before you give it restart power.</p>
<h2 id="heading-whats-next">What's Next?</h2>
<p>If you found this useful, I write about Docker, AI tools, and developer workflows every week. I'm Balajee Asish – Docker Captain, freeCodeCamp contributor, and currently building my way through the AI tools space one project at a time.</p>
<p>Got questions or built something similar? Drop a comment below or find me on <a href="https://github.com/balajee-asish">GitHub</a> and <a href="https://linkedin.com/in/balajee-asish">LinkedIn</a>.</p>
<p>Happy building.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Optimize Your Docker Build Cache & Cut Your CI/CD Pipeline Times by 80% ]]>
                </title>
                <description>
                    <![CDATA[ Every developer has been there. You push a one-line fix, grab your coffee, and wait. And wait. Twelve minutes later, your Docker image finishes rebuilding from scratch because something about the cach ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-optimize-your-docker-build-cache/</link>
                <guid isPermaLink="false">69bb1e218c55d6eefb64955f</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ optimization ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Wed, 18 Mar 2026 21:50:25 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/9a5ca46f-c571-4d38-90b5-3c6d7d22c00f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every developer has been there. You push a one-line fix, grab your coffee, and wait. And wait. Twelve minutes later, your Docker image finishes rebuilding from scratch because something about the cache broke again.</p>
<p>I spent a good chunk of last year debugging slow Docker builds across multiple teams. The pattern was always the same: builds that should take two minutes were eating up fifteen, and nobody knew why. The fix turned out to be surprisingly systematic once I understood what was actually happening under the hood.</p>
<p>This guide walks you through exactly how to fix slow Docker builds, step by step. We'll start with how the cache actually works, then tear apart the most common mistakes, and finish with production-ready patterns you can copy into your projects today.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-how-docker-build-cache-actually-works">How Docker Build Cache Actually Works</a></p>
<ul>
<li><p><a href="#heading-how-cache-keys-are-computed">How Cache Keys Are Computed</a></p>
</li>
<li><p><a href="#heading-the-cache-chain-rule">The Cache Chain Rule</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-identify-common-cache-busting-mistakes">How to Identify Common Cache-Busting Mistakes</a></p>
<ul>
<li><p><a href="#heading-mistake-1-copying-everything-too-early">Mistake 1: Copying Everything Too Early</a></p>
</li>
<li><p><a href="#heading-mistake-2-not-separating-dependency-files">Mistake 2: Not Separating Dependency Files</a></p>
</li>
<li><p><a href="#heading-mistake-3-using-add-instead-of-copy">Mistake 3: Using ADD Instead of COPY</a></p>
</li>
<li><p><a href="#heading-mistake-4-splitting-apt-get-update-and-install">Mistake 4: Splitting apt-get update and install</a></p>
</li>
<li><p><a href="#heading-mistake-5-embedding-timestamps-or-git-hashes-too-early">Mistake 5: Embedding Timestamps or Git Hashes Too Early</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-structure-your-dockerfile-for-maximum-cache-reuse">How to Structure Your Dockerfile for Maximum Cache Reuse</a></p>
<ul>
<li><p><a href="#heading-step-1-apply-the-dependency-first-pattern">Step 1: Apply the Dependency-First Pattern</a></p>
</li>
<li><p><a href="#heading-step-2-add-an-aggressive-dockerignore">Step 2: Add an Aggressive .dockerignore</a></p>
</li>
<li><p><a href="#heading-step-3-use-multi-stage-builds">Step 3: Use Multi-Stage Builds</a></p>
</li>
<li><p><a href="#heading-step-4-order-layers-by-change-frequency">Step 4: Order Layers by Change Frequency</a></p>
</li>
<li><p><a href="#heading-step-5-use-buildkit-mount-caches">Step 5: Use BuildKit Mount Caches</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-set-up-cicd-cache-backends">How to Set Up CI/CD Cache Backends</a></p>
<ul>
<li><p><a href="#heading-option-a-registry-based-cache">Option A: Registry-Based Cache</a></p>
</li>
<li><p><a href="#heading-option-b-github-actions-cache">Option B: GitHub Actions Cache</a></p>
</li>
<li><p><a href="#heading-option-c-s3-or-cloud-storage">Option C: S3 or Cloud Storage</a></p>
</li>
<li><p><a href="#heading-option-d-local-cache-with-persistent-runners">Option D: Local Cache with Persistent Runners</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-implement-advanced-cache-patterns">How to Implement Advanced Cache Patterns</a></p>
<ul>
<li><p><a href="#heading-parallel-build-stages">Parallel Build Stages</a></p>
</li>
<li><p><a href="#heading-cache-warming-for-feature-branches">Cache Warming for Feature Branches</a></p>
</li>
<li><p><a href="#heading-selective-cache-invalidation-with-build-args">Selective Cache Invalidation with Build Args</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-measure-your-improvements">How to Measure Your Improvements</a></p>
<ul>
<li><p><a href="#heading-the-four-scenarios-to-benchmark">The Four Scenarios to Benchmark</a></p>
</li>
<li><p><a href="#heading-real-world-before-and-after-numbers">Real-World Before and After Numbers</a></p>
</li>
<li><p><a href="#heading-how-to-check-cache-hit-rates">How to Check Cache Hit Rates</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-complete-optimized-dockerfile-examples">Complete Optimized Dockerfile Examples</a></p>
<ul>
<li><p><a href="#heading-nodejs-full-stack-app">Node.js Full-Stack App</a></p>
</li>
<li><p><a href="#heading-python-fastapi-app">Python FastAPI App</a></p>
</li>
<li><p><a href="#heading-go-microservice">Go Microservice</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-troubleshooting-guide">Troubleshooting Guide</a></p>
</li>
<li><p><a href="#heading-quick-reference-checklist">Quick-Reference Checklist</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you'll need:</p>
<ul>
<li><p>A working Docker installation (Docker Desktop or Docker Engine 20.10+)</p>
</li>
<li><p>Basic comfort with writing Dockerfiles</p>
</li>
<li><p>Access to a CI/CD system like GitHub Actions, GitLab CI, or Jenkins</p>
</li>
</ul>
<h2 id="heading-how-docker-build-cache-actually-works">How Docker Build Cache Actually Works</h2>
<p>Every instruction in a Dockerfile produces a <strong>layer</strong>. Docker stores these layers and reuses them when it detects nothing has changed. That's the cache. Simple enough in theory, but the details matter a lot.</p>
<h3 id="heading-how-cache-keys-are-computed">How Cache Keys Are Computed</h3>
<p>Different instructions compute their cache keys differently:</p>
<table>
<thead>
<tr>
<th>Instruction</th>
<th>Cache Key Based On</th>
<th>What Breaks It</th>
</tr>
</thead>
<tbody><tr>
<td><code>RUN</code></td>
<td>The exact command string</td>
<td>Any change to the command text</td>
</tr>
<tr>
<td><code>COPY</code> / <code>ADD</code></td>
<td>File checksums of the source content</td>
<td>Any modification to the copied files</td>
</tr>
<tr>
<td><code>ENV</code> / <code>ARG</code></td>
<td>The variable name and value</td>
<td>Changing the value</td>
</tr>
<tr>
<td><code>FROM</code></td>
<td>The base image digest</td>
<td>A new version of the base image</td>
</tr>
</tbody></table>
<h3 id="heading-the-cache-chain-rule">The Cache Chain Rule</h3>
<p>Here's the thing most people miss: <strong>Docker cache is sequential.</strong> If any layer's cache gets invalidated, every layer after it rebuilds from scratch, even if those later layers haven't changed at all.</p>
<p>Picture a row of dominoes. Knock one over in the middle and everything after it goes down too. This is why the order of instructions in your Dockerfile is so important.</p>
<blockquote>
<p><strong>Key insight:</strong> The single most impactful optimization you can make is reordering your Dockerfile so that the stuff that changes most often comes last.</p>
</blockquote>
<h2 id="heading-how-to-identify-common-cache-busting-mistakes">How to Identify Common Cache-Busting Mistakes</h2>
<p>Before we fix anything, let's look at what's probably breaking your cache right now. I've seen these patterns in almost every unoptimized Dockerfile I've reviewed.</p>
<h3 id="heading-mistake-1-copying-everything-too-early">Mistake 1: Copying Everything Too Early</h3>
<p>This is the big one. Putting <code>COPY . .</code> near the top of the Dockerfile, before installing dependencies, means that <em>any</em> file change in your project invalidates the cache from that point forward. Changed a README? Cool, now your dependencies reinstall.</p>
<pre><code class="language-dockerfile"># BAD: Any file change invalidates the dependency install
FROM node:20-alpine
WORKDIR /app
COPY . .                    # Cache busted on every commit
RUN npm ci                  # Reinstalls every single time
RUN npm run build
</code></pre>
<h3 id="heading-mistake-2-not-separating-dependency-files">Mistake 2: Not Separating Dependency Files</h3>
<p>Your dependency manifests (<code>package.json</code>, <code>requirements.txt</code>, <code>go.mod</code>, <code>Gemfile</code>) change way less often than your source code. If you don't copy them separately, you're reinstalling all dependencies every time you touch a source file.</p>
<h3 id="heading-mistake-3-using-add-instead-of-copy">Mistake 3: Using ADD Instead of COPY</h3>
<p><code>ADD</code> has special behaviors like auto-extracting archives and fetching remote URLs. Those features make its cache behavior unpredictable. Stick with <code>COPY</code> unless you specifically need archive extraction.</p>
<h3 id="heading-mistake-4-splitting-apt-get-update-and-install">Mistake 4: Splitting apt-get update and install</h3>
<p>When you put <code>apt-get update</code> and <code>apt-get install</code> in separate <code>RUN</code> commands, the update step gets cached with stale package indexes. Then the install step fails or grabs outdated packages.</p>
<pre><code class="language-dockerfile"># BAD: Stale package index
RUN apt-get update
RUN apt-get install -y curl    # May fail with stale index

# GOOD: Always combine them
RUN apt-get update &amp;&amp; apt-get install -y curl &amp;&amp; rm -rf /var/lib/apt/lists/*
</code></pre>
<h3 id="heading-mistake-5-embedding-timestamps-or-git-hashes-too-early">Mistake 5: Embedding Timestamps or Git Hashes Too Early</h3>
<p>Injecting build-time variables like timestamps or git commit hashes via <code>ARG</code> or <code>ENV</code> early in the Dockerfile invalidates the cache on every single build. Move these to the very last layer.</p>
<blockquote>
<p>⚠️ <strong>Watch out for this:</strong> CI/CD systems often inject variables like <code>BUILD_NUMBER</code> or <code>GIT_SHA</code> as build args automatically. If those <code>ARG</code> declarations sit near the top, your cache is toast on every run.</p>
</blockquote>
<h2 id="heading-how-to-structure-your-dockerfile-for-maximum-cache-reuse">How to Structure Your Dockerfile for Maximum Cache Reuse</h2>
<p>Now let's fix those mistakes. These five steps, applied in order, will get you most of the way to an optimized build.</p>
<h3 id="heading-step-1-apply-the-dependency-first-pattern">Step 1: Apply the Dependency-First Pattern</h3>
<p>Copy only the dependency manifests first, install, and then copy the rest of the source code. This one change alone can cut your build times in half.</p>
<pre><code class="language-dockerfile"># GOOD: Dependency-first pattern for Node.js
FROM node:20-alpine
WORKDIR /app

# Copy ONLY dependency files
COPY package.json package-lock.json ./

# Install dependencies (cached unless package files change)
RUN npm ci

# Copy source code (only this layer rebuilds on code changes)
COPY . .

# Build
RUN npm run build
</code></pre>
<p>The same idea works across every language:</p>
<table>
<thead>
<tr>
<th>Language</th>
<th>Copy First</th>
<th>Install Command</th>
</tr>
</thead>
<tbody><tr>
<td>Node.js</td>
<td><code>package.json</code>, <code>package-lock.json</code></td>
<td><code>npm ci</code></td>
</tr>
<tr>
<td>Python</td>
<td><code>requirements.txt</code> or <code>pyproject.toml</code></td>
<td><code>pip install -r requirements.txt</code></td>
</tr>
<tr>
<td>Go</td>
<td><code>go.mod</code>, <code>go.sum</code></td>
<td><code>go mod download</code></td>
</tr>
<tr>
<td>Rust</td>
<td><code>Cargo.toml</code>, <code>Cargo.lock</code></td>
<td><code>cargo fetch</code></td>
</tr>
<tr>
<td>Java (Maven)</td>
<td><code>pom.xml</code></td>
<td><code>mvn dependency:go-offline</code></td>
</tr>
<tr>
<td>Ruby</td>
<td><code>Gemfile</code>, <code>Gemfile.lock</code></td>
<td><code>bundle install</code></td>
</tr>
</tbody></table>
<h3 id="heading-step-2-add-an-aggressive-dockerignore">Step 2: Add an Aggressive .dockerignore</h3>
<p>A <code>.dockerignore</code> file keeps irrelevant files out of the build context. Fewer files in the context means fewer things that can break your cache.</p>
<pre><code class="language-plaintext"># .dockerignore
.git
node_modules
dist
*.md
*.log
.env*
docker-compose*.yml
Dockerfile*
.github
tests
coverage
__pycache__
</code></pre>
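<p>A quick way to see whether your <code>.dockerignore</code> is actually pulling its weight is to watch the context transfer size in BuildKit's plain progress output. Something like this should work, though the exact wording of the output can vary by Docker version:</p>
<pre><code class="language-bash"># Compare this number before and after adding the .dockerignore
BUILDKIT_PROGRESS=plain docker build . 2&gt;&amp;1 | grep "transferring context"
</code></pre>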
<h3 id="heading-step-3-use-multi-stage-builds">Step 3: Use Multi-Stage Builds</h3>
<p>Multi-stage builds let you use a full development image for compiling, then copy only the finished artifacts into a slim runtime image. You get smaller images, better security, and improved cache performance because build tools and intermediate files don't carry over.</p>
<pre><code class="language-dockerfile"># Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package.json ./
EXPOSE 3000
CMD ["node", "dist/index.js"]
</code></pre>
<h3 id="heading-step-4-order-layers-by-change-frequency">Step 4: Order Layers by Change Frequency</h3>
<p>Think of your Dockerfile as a stack. Put the boring, stable stuff at the top and the volatile stuff at the bottom:</p>
<ol>
<li><p>Base image and system dependencies (rarely change)</p>
</li>
<li><p>Language runtime configuration (occasionally change)</p>
</li>
<li><p>Application dependencies (change when you add or remove packages)</p>
</li>
<li><p>Source code (changes on every commit)</p>
</li>
<li><p>Build-time metadata like git hash or version labels (changes every build)</p>
</li>
</ol>
<h3 id="heading-step-5-use-buildkit-mount-caches">Step 5: Use BuildKit Mount Caches</h3>
<p>Docker BuildKit supports <code>RUN --mount=type=cache</code>, which mounts a persistent cache directory that survives across builds. This is a game-changer for package managers that maintain their own download caches.</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .

# Mount pip cache so downloads persist across builds
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

COPY . .
</code></pre>
<p>The best part: mount caches persist even when the layer itself gets invalidated. So if you add one new package, pip only downloads that one package instead of re-fetching everything.</p>
<p>Here are the common cache targets for popular package managers:</p>
<table>
<thead>
<tr>
<th>Package Manager</th>
<th>Cache Target</th>
</tr>
</thead>
<tbody><tr>
<td>pip</td>
<td><code>/root/.cache/pip</code></td>
</tr>
<tr>
<td>npm</td>
<td><code>/root/.npm</code></td>
</tr>
<tr>
<td>yarn</td>
<td><code>/usr/local/share/.cache/yarn</code></td>
</tr>
<tr>
<td>go</td>
<td><code>/go/pkg/mod</code></td>
</tr>
<tr>
<td>apt</td>
<td><code>/var/cache/apt</code></td>
</tr>
<tr>
<td>maven</td>
<td><code>/root/.m2/repository</code></td>
</tr>
</tbody></table>
<h2 id="heading-how-to-set-up-cicd-cache-backends">How to Set Up CI/CD Cache Backends</h2>
<p>Here's where things get tricky. Your local Docker cache works great on your laptop because the layers persist between builds. But CI/CD runners are usually ephemeral: each job starts with a totally empty cache. Without explicit cache configuration, every CI build is a cold build.</p>
<h3 id="heading-option-a-registry-based-cache">Option A: Registry-Based Cache</h3>
<p>BuildKit can push and pull cache layers from a container registry. This is the most portable approach and works with any CI system.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=registry,ref=myregistry.io/myapp:buildcache \
  --cache-to type=registry,ref=myregistry.io/myapp:buildcache,mode=max \
  --tag myregistry.io/myapp:latest \
  --push .
</code></pre>
<blockquote>
<p>💡 <strong>Use</strong> <code>mode=max</code> to cache all layers including intermediate build stages. The default <code>mode=min</code> only caches layers in the final stage, which means your build stage layers get thrown away.</p>
</blockquote>
<h3 id="heading-option-b-github-actions-cache">Option B: GitHub Actions Cache</h3>
<p>If you're on GitHub Actions, there's native integration with BuildKit through the GitHub Actions cache API. It's fast and requires minimal setup.</p>
<pre><code class="language-yaml"># .github/workflows/build.yml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myregistry.io/myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
</code></pre>
<h3 id="heading-option-c-s3-or-cloud-storage">Option C: S3 or Cloud Storage</h3>
<p>For teams on AWS, GCP, or Azure, cloud object storage makes a solid cache backend. It's fast, persistent, and works across any CI system.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=s3,region=us-east-1,bucket=my-docker-cache,name=myapp \
  --cache-to type=s3,region=us-east-1,bucket=my-docker-cache,name=myapp,mode=max \
  --tag myapp:latest .
</code></pre>
<h3 id="heading-option-d-local-cache-with-persistent-runners">Option D: Local Cache with Persistent Runners</h3>
<p>If your CI runners have persistent storage (self-hosted runners, GitLab runners with shared volumes), you can export cache to a local directory.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=local,src=/ci-cache/myapp \
  --cache-to type=local,dest=/ci-cache/myapp,mode=max \
  --tag myapp:latest .
</code></pre>
<h2 id="heading-how-to-implement-advanced-cache-patterns">How to Implement Advanced Cache Patterns</h2>
<p>Once you've nailed the basics, these patterns can squeeze out even more performance.</p>
<h3 id="heading-parallel-build-stages">Parallel Build Stages</h3>
<p>BuildKit builds independent stages in parallel. If your app has a frontend and a backend that don't depend on each other during build, split them into separate stages and let BuildKit run them simultaneously.</p>
<pre><code class="language-dockerfile"># These stages build in parallel
FROM node:20-alpine AS frontend
WORKDIR /frontend
COPY frontend/package.json frontend/package-lock.json ./
RUN npm ci
COPY frontend/ .
RUN npm run build

FROM python:3.12-slim AS backend
WORKDIR /backend
COPY backend/requirements.txt .
RUN pip install -r requirements.txt
COPY backend/ .

# Final stage combines both
FROM python:3.12-slim
COPY --from=backend /backend /app
COPY --from=frontend /frontend/dist /app/static
CMD ["python", "/app/main.py"]
</code></pre>
<h3 id="heading-cache-warming-for-feature-branches">Cache Warming for Feature Branches</h3>
<p>Feature branches often start with a cold cache because they diverge from main. You can warm the cache by specifying multiple <code>--cache-from</code> sources. Docker checks them in order.</p>
<pre><code class="language-bash">docker buildx build \
  --cache-from type=registry,ref=registry.io/app:cache-${BRANCH} \
  --cache-from type=registry,ref=registry.io/app:cache-main \
  --cache-to type=registry,ref=registry.io/app:cache-${BRANCH},mode=max \
  --tag registry.io/app:${BRANCH} .
</code></pre>
<p>If the branch cache hits, Docker uses it. If not, it falls back to main's cache, which usually shares most layers. This makes a massive difference for short-lived branches.</p>
<h3 id="heading-selective-cache-invalidation-with-build-args">Selective Cache Invalidation with Build Args</h3>
<p>You can use <code>ARG</code> instructions as deliberate cache boundaries. A changed arg value causes a cache miss at the first instruction that actually consumes the arg, so everything above that point stays cached and everything after it rebuilds.</p>
<pre><code class="language-dockerfile">FROM node:20-alpine
WORKDIR /app

COPY package.json package-lock.json ./
RUN npm ci

# Changing this ARG busts the cache from its first use onward –
# the echo consumes it, so everything below rebuilds when the value changes
ARG CACHE_BUST_CODE=1
RUN echo "cache bust: ${CACHE_BUST_CODE}"
COPY . .
RUN npm run build

# This ARG only invalidates the label
ARG GIT_SHA=unknown
LABEL git.sha=$GIT_SHA
</code></pre>
<h2 id="heading-how-to-measure-your-improvements">How to Measure Your Improvements</h2>
<p>Optimization without measurement is just guessing. Here's how to actually prove your changes are working.</p>
<h3 id="heading-the-four-scenarios-to-benchmark">The Four Scenarios to Benchmark</h3>
<p>Run each scenario at least three times and take the median (a rough timing harness follows the list):</p>
<ol>
<li><p><strong>Cold build:</strong> No cache at all (first build or after <code>docker builder prune</code>)</p>
</li>
<li><p><strong>Warm build:</strong> No changes, full cache hit</p>
</li>
<li><p><strong>Code change:</strong> Only source code modified</p>
</li>
<li><p><strong>Dependency change:</strong> Package manifest modified</p>
</li>
</ol>
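<p>Here's a minimal timing harness for those four scenarios. The file paths and tags are placeholders for a Node.js project – adapt them to whatever you're building:</p>
<pre><code class="language-bash"># 1. Cold build: wipe the local build cache first
docker builder prune -af
time docker build -t bench:cold .

# 2. Warm build: rebuild with no changes at all
time docker build -t bench:warm .

# 3. Code change: modify a source file's contents
#    (touching the mtime alone won't bust COPY's checksum-based cache key)
echo "// bench $(date +%s)" &gt;&gt; src/index.js
time docker build -t bench:code .

# 4. Dependency change: edit the manifest (for example, bump a package version),
#    then rebuild and compare against the other three timings
</code></pre>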
<h3 id="heading-real-world-before-and-after-numbers">Real-World Before and After Numbers</h3>
<p>Here's what I saw on a mid-sized Node.js project after applying the techniques from this guide:</p>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Before</th>
<th>After</th>
<th>Improvement</th>
</tr>
</thead>
<tbody><tr>
<td>Cold build</td>
<td>12 min 34 sec</td>
<td>8 min 10 sec</td>
<td>35%</td>
</tr>
<tr>
<td>Warm build (no changes)</td>
<td>12 min 34 sec</td>
<td>14 sec</td>
<td>98%</td>
</tr>
<tr>
<td>Code change only</td>
<td>12 min 34 sec</td>
<td>1 min 52 sec</td>
<td>85%</td>
</tr>
<tr>
<td>Dependency change</td>
<td>12 min 34 sec</td>
<td>4 min 20 sec</td>
<td>65%</td>
</tr>
</tbody></table>
<p>The "before" column is the same for all rows because without cache optimization, every build was essentially a cold build. That 85% improvement on code-only changes is the number that matters most, since that's what happens on the vast majority of commits.</p>
<h3 id="heading-how-to-check-cache-hit-rates">How to Check Cache Hit Rates</h3>
<p>Set <code>BUILDKIT_PROGRESS=plain</code> to get detailed output showing which layers hit cache:</p>
<pre><code class="language-bash">BUILDKIT_PROGRESS=plain docker buildx build . 2&gt;&amp;1 | grep -E 'CACHED|DONE'
</code></pre>
<p>Look for the <code>CACHED</code> prefix on layers. Your goal is to see <code>CACHED</code> on everything except the layers that actually needed to change.</p>
<h2 id="heading-complete-optimized-dockerfile-examples">Complete Optimized Dockerfile Examples</h2>
<p>Here are production-ready Dockerfiles you can adapt for your own projects.</p>
<h3 id="heading-nodejs-full-stack-app">Node.js Full-Stack App</h3>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM node:20-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN --mount=type=cache,target=/root/.npm npm ci

FROM node:20-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup --system --gid 1001 appgroup \
    &amp;&amp; adduser --system --uid 1001 appuser
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=deps /app/node_modules ./node_modules
COPY package.json ./
USER appuser
EXPOSE 3000
CMD ["node", "dist/index.js"]
</code></pre>
<h3 id="heading-python-fastapi-app">Python FastAPI App</h3>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --user -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
</code></pre>
<h3 id="heading-go-microservice">Go Microservice</h3>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod go mod download
COPY . .
RUN --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 go build -ldflags='-s -w' -o /app/server ./cmd/server

FROM gcr.io/distroless/static-debian12
COPY --from=builder /app/server /server
EXPOSE 8080
ENTRYPOINT ["/server"]
</code></pre>
<h2 id="heading-troubleshooting-guide">Troubleshooting Guide</h2>
<p>When things go wrong, check this table first:</p>
<table>
<thead>
<tr>
<th>Symptom</th>
<th>Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td>All layers rebuild every time</td>
<td><code>COPY . .</code> is too early, or <code>.dockerignore</code> is missing</td>
<td>Move <code>COPY . .</code> after dependency install; add <code>.dockerignore</code></td>
</tr>
<tr>
<td>Cache never hits in CI</td>
<td>No cache backend configured</td>
<td>Add <code>--cache-from</code> / <code>--cache-to</code> with registry, gha, or s3 backend</td>
</tr>
<tr>
<td>Cache hits locally but not in CI</td>
<td>Different Docker versions or BuildKit not enabled</td>
<td>Set <code>DOCKER_BUILDKIT=1</code> and match Docker versions</td>
</tr>
<tr>
<td>Dependency layer always rebuilds</td>
<td>Source files copied before dependency install</td>
<td>Use the dependency-first pattern</td>
</tr>
<tr>
<td>Image size keeps growing</td>
<td>Build artifacts leaking into final image</td>
<td>Use multi-stage builds; only copy runtime artifacts</td>
</tr>
<tr>
<td>Registry cache is very slow</td>
<td><code>mode=max</code> caching too many layers</td>
<td>Try <code>mode=min</code> or switch to gha/s3 for faster backends</td>
</tr>
</tbody></table>
<h2 id="heading-quick-reference-checklist">Quick-Reference Checklist</h2>
<p>Print this out and tape it next to your monitor:</p>
<ul>
<li><p>[ ] Enable BuildKit: set <code>DOCKER_BUILDKIT=1</code> or use <code>docker buildx</code></p>
</li>
<li><p>[ ] Add a comprehensive <code>.dockerignore</code> file</p>
</li>
<li><p>[ ] Use the dependency-first pattern: copy manifests, install, then copy source</p>
</li>
<li><p>[ ] Order layers from least-changed to most-changed</p>
</li>
<li><p>[ ] Combine <code>RUN</code> commands that belong together (<code>apt-get update &amp;&amp; install</code>)</p>
</li>
<li><p>[ ] Use multi-stage builds to separate build and runtime</p>
</li>
<li><p>[ ] Add <code>RUN --mount=type=cache</code> for package manager caches</p>
</li>
<li><p>[ ] Move volatile <code>ARG</code>s (git hash, build number) to the very last layers</p>
</li>
<li><p>[ ] Configure a CI/CD cache backend (registry, gha, or s3)</p>
</li>
<li><p>[ ] Set up cache warming for feature branches from the main branch</p>
</li>
<li><p>[ ] Use <code>COPY</code> instead of <code>ADD</code> unless you need archive extraction</p>
</li>
<li><p>[ ] Benchmark all four scenarios: cold, warm, code change, dependency change</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>I used to think slow Docker builds were just something you had to live with. After going through this process on a few projects, I realized the fix is pretty mechanical once you understand that one core principle: cache is sequential, and order matters.</p>
<p>Start with the dependency-first pattern and a <code>.dockerignore</code>. Those two changes alone will probably cut your build times in half. Then add multi-stage builds, mount caches, and CI/CD cache backends as you need them.</p>
<p>The teams I've worked with typically see 70-85% reductions in CI/CD pipeline times after spending a few hours on these changes. That's time you get back on every single commit, every single day.</p>
<p>If you found this helpful, consider sharing it with your team. There's a good chance whoever wrote your Dockerfile last didn't know about half of these tricks. No shade to them, I didn't either until I went looking.</p>
<p>Happy building.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Containerize Your MLOps Pipeline from Training to Serving ]]>
                </title>
                <description>
                    <![CDATA[ Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deplo ]]>
                </description>
                <link>https://www.freecodecamp.org/news/containerize-mlops-pipeline-from-training-to-serving/</link>
                <guid isPermaLink="false">69b33f5993256dfc5313bee2</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ production ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ NVIDIA ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Thu, 12 Mar 2026 22:34:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/156eaca3-8884-4f57-9010-9766278dbf5a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deploy it.</p>
<p>The model depended on a specific version of scikit-learn that conflicted with the production Python environment. The feature engineering pipeline required a NumPy build compiled against OpenBLAS, but the deployment servers ran MKL. A preprocessing step used a system library that existed on the data scientist's MacBook but not on the Ubuntu deployment target.</p>
<p>Three weeks of debugging later, we had the model running in production. Three weeks. For a model that was technically finished.</p>
<p>That experience is what pushed me to containerize our entire MLOps pipeline end to end. Not because Docker is trendy in ML circles, but because the alternative (hand-tuning environments, writing installation scripts that break on the next OS update, praying that what worked in training works in production) was costing us more time than the actual model development.</p>
<p>In this tutorial, you'll learn how to structure training and serving containers with multi-stage builds, how to set up experiment tracking with MLflow, how to version your training data with DVC, how to configure GPU passthrough for training, and how to tie it all together into a single Compose file with profiles. This is based on a year of running containerized ML pipelines across three teams.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>Docker Engine 24+ or Docker Desktop 4.20+ with Compose v2.22.0+</p>
</li>
<li><p>For GPU training, you'll need the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">NVIDIA Container Toolkit</a> installed on the host and a compatible GPU driver. Run <code>nvidia-smi</code> to verify your GPU is visible, and <code>docker compose version</code> to check your Compose version.</p>
</li>
<li><p>Familiarity with Python, basic Docker concepts, and ML workflows (training, evaluation, serving) is assumed.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-the-mlops-lifecycle-where-containers-fit">The MLOps Lifecycle: Where Containers Fit</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-build-the-training-container">How to Build the Training Container</a></p>
<ul>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-separate-training-from-serving-requirements">Separate Training from Serving Requirements</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-cuda-and-driver-compatibility">CUDA and Driver Compatibility</a></p>
</li>
</ul>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-set-up-experiment-tracking-with-mlflow">How to Set Up Experiment Tracking with MLflow</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-version-training-data-with-dvc">How to Version Training Data with DVC</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-build-the-serving-container">How to Build the Serving Container</a></p>
<ul>
<li><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-decouple-models-from-containers">Decouple Models from Containers</a></li>
</ul>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-configure-gpu-passthrough-for-training">How to Configure GPU Passthrough for Training</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-how-to-tie-it-all-together-with-compose-profiles">How to Tie It All Together with Compose Profiles</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-reproducibility-the-whole-point">Reproducibility: The Whole Point</a></p>
</li>
<li><p><a href="https://claude.ai/chat/742d453d-7543-4904-805f-61c5320b4fdb#heading-where-this-breaks-down">Where This Breaks Down</a></p>
</li>
</ul>
<h2 id="heading-the-mlops-lifecycle-where-containers-fit">The MLOps Lifecycle: Where Containers Fit</h2>
<p>If you've built a machine learning model, you know the process has a lot of stages. But if you're coming from a software engineering background (or you're a data scientist who mostly works in notebooks), it helps to see the full picture of what an MLOps pipeline looks like and where Docker fits into each stage.</p>
<p>An MLOps pipeline is a chain of interdependent stages:</p>
<ol>
<li><p><strong>Data ingestion and validation.</strong> Raw data comes in from databases, APIs, or file systems. You clean it, validate it, and store it in a format your model can use.</p>
</li>
<li><p><strong>Feature engineering.</strong> You transform raw data into features the model can learn from. This might be as simple as normalizing numbers or as complex as generating embeddings.</p>
</li>
<li><p><strong>Experiment tracking.</strong> You log every training run's configuration (hyperparameters, data version, code version) and results (accuracy, loss, evaluation metrics) so you can compare experiments and reproduce the best ones.</p>
</li>
<li><p><strong>Model training.</strong> The model learns from your features. This is the compute-heavy part that often needs GPUs.</p>
</li>
<li><p><strong>Evaluation.</strong> You measure the trained model against test data to see if it's good enough to deploy.</p>
</li>
<li><p><strong>Packaging and serving.</strong> You wrap the trained model in an API so other systems can send it data and get predictions back.</p>
</li>
<li><p><strong>Monitoring.</strong> You watch the model in production to catch problems like data drift (when the real-world data starts looking different from the training data) or performance degradation.</p>
</li>
</ol>
<p>Each stage has different computational needs. Training might require GPUs and terabytes of memory. Serving needs low latency and horizontal scaling. Feature engineering might need distributed processing tools like Spark or Dask.</p>
<p>The thing that changed our approach: you don't containerize the entire pipeline as one monolithic image. You containerize each stage independently, with shared interfaces between them.</p>
<p>Think of it like microservices applied to ML infrastructure. Each container does one thing, does it well, and communicates with the others through well-defined interfaces: model artifacts stored in a registry, metrics logged to MLflow, data versioned in object storage.</p>
<p>This gives you the flexibility to:</p>
<ul>
<li><p>Scale training on expensive GPU instances while running serving on cheaper CPU nodes</p>
</li>
<li><p>Update your feature engineering code without rebuilding your training environment</p>
</li>
<li><p>Version each stage independently in your container registry</p>
</li>
<li><p>Let data scientists and ML engineers work on training while platform engineers optimize serving</p>
</li>
</ul>
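<p>Concretely, "version each stage independently" just means each stage gets its own Dockerfile, image, and tag. A minimal sketch – the registry, image names, and the <code>Dockerfile.serve</code> filename are placeholders:</p>
<pre><code class="language-bash"># Build and push the training and serving images separately
docker build -f Dockerfile.train -t registry.example.com/fraud/training:1.4.0 .
docker build -f Dockerfile.serve -t registry.example.com/fraud/serving:1.4.0 .
docker push registry.example.com/fraud/training:1.4.0
docker push registry.example.com/fraud/serving:1.4.0
</code></pre>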
<h2 id="heading-how-to-build-the-training-container">How to Build the Training Container</h2>
<p>The training container is where most teams start, and where most teams make their first mistake.</p>
<p>The temptation is to create one massive image with every possible library, every CUDA version, every data processing tool. I've seen training images exceed 15GB. They take twenty minutes to build, ten minutes to push, and break whenever someone adds a new dependency.</p>
<p>Here's the pattern that works: use multi-stage builds to separate the build environment from the runtime environment, and use cache mounts to avoid re-downloading packages on every build.</p>
<p>If you're new to these concepts: a <strong>multi-stage build</strong> lets you use one Docker image to build your software and a different, smaller image to run it. You copy only the final artifacts from the build stage to the runtime stage, leaving behind compilers, build tools, and other things you don't need in production.</p>
<p>A <strong>cache mount</strong> tells Docker to keep a directory (like pip's download cache) between builds, so it doesn't re-download packages that haven't changed.</p>
<p>Here's the training Dockerfile:</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

# System dependencies (rarely change)
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Dependencies (change occasionally)
COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

# Training code (changes frequently)
COPY src/ /app/src/
COPY configs/ /app/configs/

WORKDIR /app
ENTRYPOINT ["python", "-m", "src.train"]
</code></pre>
<p>Notice the layer ordering. Docker builds images in layers, and it caches each layer. If a layer hasn't changed, Docker reuses the cached version instead of rebuilding it. But here's the catch: if one layer changes, Docker rebuilds that layer and every layer after it.</p>
<p>That's why we put things in order of how often they change:</p>
<ol>
<li><p><strong>System packages at the top</strong> (they almost never change). Installing <code>python3.11</code> and <code>git</code> takes time, but you only do it once.</p>
</li>
<li><p><strong>Python dependencies in the middle</strong> (they change when you add or update a library). This layer rebuilds when <code>requirements-train.txt</code> changes.</p>
</li>
<li><p><strong>Your actual code at the bottom</strong> (changes on every commit). This is the layer that rebuilds most often.</p>
</li>
</ol>
<p>With this ordering, a code change only rebuilds the final layer, not the entire image. If you put <code>COPY src/</code> before <code>pip install</code>, every code change would trigger a full reinstall of all Python packages. That's the mistake I see most often in ML Dockerfiles.</p>
<p>The <code>--mount=type=cache,target=/root/.cache/pip</code> line on the <code>pip install</code> command tells Docker to persist pip's download cache between builds. When you do update requirements, pip checks the cache first and only downloads packages that are new or changed. On a project with hundreds of ML dependencies (PyTorch alone pulls in dozens of sub-packages), this saves five to ten minutes per build.</p>
<h3 id="heading-separate-training-from-serving-requirements">Separate Training from Serving Requirements</h3>
<p>Your training environment needs libraries that your serving environment does not. Training needs experiment tracking tools like MLflow, data processing libraries like pandas and polars, visualization libraries for debugging, and hyperparameter tuning frameworks. Serving needs a lightweight inference runtime, an API framework like FastAPI, health check endpoints, and minimal overhead.</p>
<p>It's a good idea to maintain separate requirements files:</p>
<pre><code class="language-plaintext"># requirements-train.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
pandas==2.2.3
polars==1.20.0
dvc[s3]==3.59.1
optuna==4.2.0
matplotlib==3.10.0

# requirements-serve.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
fastapi==0.115.0
uvicorn[standard]==0.34.0
pydantic==2.10.0
</code></pre>
<p>The overlap is smaller than you'd think. <code>torch</code>, <code>scikit-learn</code>, and <code>mlflow</code> appear in both because the serving container needs them to load the model and run inference. Everything else in the training file is baggage that slows down serving deployments and increases the attack surface.</p>
<h3 id="heading-cuda-and-driver-compatibility">CUDA and Driver Compatibility</h3>
<p>One thing that will bite you if you ignore it: the CUDA runtime version inside your container must be compatible with the GPU driver version on the host. The rule is that the host driver must be equal to or newer than the CUDA version in the container. For example, CUDA 12.6 requires driver version 560.28+ on Linux.</p>
<p>Make sure you check your host driver version before choosing your base image:</p>
<pre><code class="language-bash"># On the host machine
nvidia-smi
# Look for "Driver Version: 560.35.03" and "CUDA Version: 12.6"

# The CUDA version shown by nvidia-smi is the maximum CUDA version
# your driver supports, not the version installed
</code></pre>
<p>If your host driver is 535.x, don't use a <code>cuda:12.6</code> base image. Use <code>cuda:12.2</code> or upgrade the driver. Mismatched versions produce cryptic errors like <code>CUDA error: no kernel image is available for execution on the device</code> that are painful to debug.</p>
<p>Pin your base images to specific tags (not <code>latest</code>) and document the minimum driver version in your README. When you deploy to new hardware, the driver version check should be part of your provisioning checklist.</p>
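<p>Since the driver floor for CUDA 12.6 is 560.28, that provisioning check can be as small as a version comparison. Here's a sketch – treat the required version as something you maintain alongside your base image tag:</p>
<pre><code class="language-bash">#!/bin/bash
# Fail fast if the host driver is older than what the training image needs
REQUIRED_DRIVER=560.28
HOST_DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)

if [ "$(printf '%s\n' "$REQUIRED_DRIVER" "$HOST_DRIVER" | sort -V | head -n1)" != "$REQUIRED_DRIVER" ]; then
  echo "Driver $HOST_DRIVER is older than required $REQUIRED_DRIVER" &gt;&amp;2
  exit 1
fi
echo "Driver $HOST_DRIVER is OK for CUDA 12.6 images"
</code></pre>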
<h2 id="heading-how-to-set-up-experiment-tracking-with-mlflow">How to Set Up Experiment Tracking with MLflow</h2>
<p>If you've ever trained a model and thought "wait, which hyperparameters gave me that good result last week?", you need experiment tracking. Without it, ML development turns into a mess of Jupyter notebooks, screenshots of metrics, and spreadsheets that nobody keeps up to date.</p>
<p><a href="https://mlflow.org/">MLflow</a> is the most widely adopted open-source tool for this. It logs three things for every training run: <strong>parameters</strong> (learning rate, batch size, number of epochs), <strong>metrics</strong> (accuracy, loss, F1 score), and <strong>artifacts</strong> (the trained model file, plots, evaluation reports). It stores all of this in a database and gives you a web UI to compare runs side by side.</p>
<p>Running MLflow as a containerized service means the tracking server is persistent and shared across your team, not running on one person's laptop:</p>
<pre><code class="language-yaml">services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: &gt;
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

volumes:
  mlflow-artifacts:
  postgres-data:
</code></pre>
<p>Let me break down what's happening here.</p>
<p>The <code>mlflow</code> service runs the MLflow tracking server. It stores experiment metadata (parameters, metrics) in a Postgres database and saves artifacts (model files, plots) to a Docker volume.</p>
<p>The <code>depends_on</code> with <code>condition: service_healthy</code> tells Compose to wait until Postgres is actually ready to accept connections before starting MLflow. Without this, MLflow would crash on startup because the database isn't ready yet.</p>
<p>The <code>db</code> service runs Postgres with a health check that uses <code>pg_isready</code>, a built-in Postgres utility that checks if the database is accepting connections. The <code>start_period</code> gives Postgres 10 seconds to initialize before health checks start counting failures.</p>
<p>Your training code connects to MLflow by setting one environment variable:</p>
<pre><code class="language-python">import os
import mlflow

# This tells MLflow where to log experiments
# When running inside Docker Compose, "mlflow" resolves to the mlflow container
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow:5000"

# Example: logging a training run
with mlflow.start_run(run_name="fraud-detector-v2"):
    # Log hyperparameters
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("epochs", 50)

    # ... train your model here ...

    # Log metrics
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.log_metric("precision", 0.93)
    mlflow.log_metric("recall", 0.89)

    # Log the trained model as an artifact
    mlflow.sklearn.log_model(model, "model")
    # Or for PyTorch: mlflow.pytorch.log_model(model, "model")
</code></pre>
<p>After the run completes, open <code>http://localhost:5000</code> in your browser. You'll see a table of all your runs with their parameters and metrics. Click any run to see details, compare it with other runs, or download the model artifact. No more "I think experiment 7 was the good one" conversations.</p>
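<p>If you'd rather pull runs from the terminal (or a script), the tracking server also exposes a REST API. A quick sketch – the experiment ID is a placeholder, and the metric name must match whatever you logged:</p>
<pre><code class="language-bash"># Top 5 runs in experiment 1, sorted by accuracy
curl -s -X POST http://localhost:5000/api/2.0/mlflow/runs/search \
  -H "Content-Type: application/json" \
  -d '{"experiment_ids": ["1"], "order_by": ["metrics.accuracy DESC"], "max_results": 5}'
</code></pre>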
<p>A note on the password in the YAML: for local development this is fine. For staging and production, use Docker secrets or inject the credentials from your CI environment. Don't commit real database passwords to your repo.</p>
<h2 id="heading-how-to-version-training-data-with-dvc">How to Version Training Data with DVC</h2>
<p>Models are reproducible only if you can also reproduce the data they were trained on. This is a problem Git can't solve on its own, because training datasets are often gigabytes or terabytes in size and Git isn't designed for large binary files.</p>
<p><a href="https://dvc.org/">DVC (Data Version Control)</a> fills this gap. It works like Git, but for data. Here's the concept: instead of storing your 10GB training dataset in Git, DVC stores a small text file (a <code>.dvc</code> file) that acts as a pointer to the actual data. The real data lives in cloud storage (S3, Google Cloud Storage, Azure Blob). When you check out a specific Git commit, DVC knows which version of the data goes with that commit and can pull it from remote storage.</p>
<p>The workflow on your local machine looks like this:</p>
<pre><code class="language-bash"># Initialize DVC in your project (one time)
dvc init

# Add your training data to DVC tracking
dvc add data/training_data.parquet
# This creates data/training_data.parquet.dvc (small pointer file)
# and adds training_data.parquet to .gitignore

# Push the actual data to remote storage
dvc push

# Commit the pointer file to Git
git add data/training_data.parquet.dvc .gitignore
git commit -m "Add training data v1"
</code></pre>
<p>Now your Git repo contains the pointer file, and the real data lives in S3. When someone else (or a container) needs the data, they run <code>dvc pull</code> and DVC downloads it from remote storage.</p>
<p>The training Dockerfile includes DVC, and the entrypoint pulls the correct data version before training begins:</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

COPY src/ /app/src/
COPY configs/ /app/configs/

# DVC tracking files (these are small text files in Git)
COPY data/*.dvc /app/data/
COPY .dvc/ /app/.dvc/

WORKDIR /app
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
</code></pre>
<p>The entrypoint script pulls the data and then starts training:</p>
<pre><code class="language-bash">#!/bin/bash
set -e

echo "Pulling training data from remote storage..."
dvc pull data/

echo "Starting training run..."
python -m src.train "$@"
</code></pre>
<p>For DVC to pull from S3, the container needs AWS credentials. You can pass them as environment variables in your Compose file or mount them from the host:</p>
<pre><code class="language-yaml">training:
  build: { context: ., dockerfile: Dockerfile.train }
  environment:
    - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
    - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    - AWS_DEFAULT_REGION=us-east-1
</code></pre>
<p>Combined with MLflow's experiment logging, you get a complete provenance chain: this model was trained on this version of the data (tracked by DVC), with these parameters (logged in MLflow), producing these metrics.</p>
<p>You can reproduce any past experiment by checking out the Git commit and running the training container.</p>
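<p>In practice, that reproduction is a couple of commands. The commit hash below is a placeholder, and <code>training</code> refers to the Compose service shown above:</p>
<pre><code class="language-bash"># Check out the code, configs, and DVC pointer files exactly as they were for that run
git checkout 3f2a9c1

# Rebuild the training image and run it – the entrypoint's `dvc pull` fetches the
# matching data version, and MLflow records the new run so you can compare the two
docker compose build training
docker compose run --rm training
</code></pre>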
<h2 id="heading-how-to-build-the-serving-container">How to Build the Serving Container</h2>
<p>"Serving" means wrapping your trained model in an API so other systems can send it data and get predictions back. For example, a fraud detection model might expose a <code>/predict</code> endpoint that accepts transaction data and returns a fraud probability.</p>
<p>The serving container has different priorities than the training container. Training optimizes for flexibility and raw compute. Serving optimizes for speed, small size, and reliability:</p>
<pre><code class="language-dockerfile">FROM python:3.11-slim AS serving

WORKDIR /app

# Install curl for healthcheck
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

COPY requirements-serve.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-serve.txt

COPY src/serving/ /app/src/serving/

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.serving.app:app", "--host", "0.0.0.0"]
</code></pre>
<p>A few things to understand if you're new to this:</p>
<p><code>uvicorn</code> is a lightweight Python web server that runs <a href="https://fastapi.tiangolo.com/">FastAPI</a> applications. FastAPI is a framework for building APIs in Python. Together, they let you turn your model into a web service that responds to HTTP requests.</p>
<p><code>HEALTHCHECK</code> tells Docker to periodically check if your container is actually working, not just running. Every 30 seconds, Docker runs the <code>curl</code> command against the <code>/health</code> endpoint. If it fails three times in a row, Docker marks the container as unhealthy. This matters because your model server might be running but not ready (maybe the model file is still downloading), and you don't want to send traffic to a server that can't respond.</p>
<p><code>start-period</code> of 60 seconds is important for ML serving containers. Model loading can take time, especially for large models (loading a 2GB model from a registry takes a while). Without <code>start-period</code>, the health check would start failing immediately, count those failures toward the retry limit, and the orchestrator might kill the container before the model finishes loading. The start period gives the container grace time to initialize.</p>
<p>Notice we're using <code>python:3.11-slim</code> here, not the NVIDIA CUDA image. Most trained models can run inference on CPU. If you need GPU inference (for example, running a large language model or doing real-time video processing), use the CUDA base image instead, but be aware that it makes the serving container much larger.</p>
<p>If you want to skip the <code>curl</code> dependency, use Python's built-in <code>urllib</code> for the health check:</p>
<pre><code class="language-dockerfile">HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
</code></pre>
<h3 id="heading-decouple-models-from-containers">Decouple Models from Containers</h3>
<p>This is one of the most important patterns in this article, and the one beginners most often get wrong.</p>
<p>The temptation is to copy your trained model file (the <code>.pkl</code>, <code>.pt</code>, or <code>.onnx</code> file that contains the learned weights) directly into the Docker image during the build. Don't do this. When you embed model files in your Docker image, every model update requires a new image build and push. For a 2GB model, that means rebuilding the container, uploading 2GB to a registry, and redeploying, even though only the model changed and the code is identical.</p>
<p>Instead, have your serving container download the model from a model registry (like MLflow) or cloud storage (like S3) at startup. The container image stays small and generic. Model updates become a configuration change (pointing to a new model version) rather than a deployment.</p>
<p>Here's a full serving app using FastAPI with the modern lifespan pattern. If you've used Flask, FastAPI is similar but faster and with built-in request validation:</p>
<pre><code class="language-python">import os
from contextlib import asynccontextmanager

import mlflow
from fastapi import FastAPI
from fastapi.responses import JSONResponse

# MODEL_URI points to a specific model version in MLflow's registry
# Format: "models:/&lt;model-name&gt;/&lt;stage&gt;" where stage is Staging or Production
MODEL_URI = os.environ.get("MODEL_URI", "models:/fraud-detector/production")
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # This runs once when the server starts up
    global model
    print(f"Loading model from {MODEL_URI}...")
    model = mlflow.pyfunc.load_model(MODEL_URI)
    print("Model loaded successfully.")
    yield
    # This runs when the server shuts down
    print("Shutting down model server.")


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health():
    """Used by Docker HEALTHCHECK to verify the server is ready."""
    if model is None:
        # Return an explicit 503 so the Docker HEALTHCHECK fails until the model is ready
        return JSONResponse(status_code=503, content={"status": "loading"})
    return {"status": "healthy"}


@app.post("/predict")
async def predict(features: dict):
    """Accept features as JSON, return model prediction."""
    import pandas as pd

    # Convert the input dict into a DataFrame (what most sklearn/mlflow models expect)
    df = pd.DataFrame([features])
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}
</code></pre>
<p>When a client sends a POST request to <code>/predict</code> with JSON like <code>{"amount": 500, "merchant_category": "electronics", "hour": 23}</code>, the model returns a prediction. The <code>/health</code> endpoint returns 503 while the model is loading and 200 once it's ready, which is exactly what the Docker <code>HEALTHCHECK</code> checks for.</p>
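<p>If you'd rather sanity-check the endpoint from Python than from curl, here's a minimal client sketch using only the standard library. The URL and payload are just the fraud-detector example values from above, not anything the serving code requires:</p>
<pre><code class="language-python">import json
import urllib.request

# Example transaction features - adjust to whatever your model expects
payload = json.dumps(
    {"amount": 500, "merchant_category": "electronics", "hour": 23}
).encode()

req = urllib.request.Request(
    "http://localhost:8000/predict",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # e.g. {"prediction": [0]}
</code></pre>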
<p>Promoting a new model version means updating the <code>MODEL_URI</code> environment variable and restarting the container. The MLflow model registry supports stage transitions (Staging, Production, Archived), so you can promote a model in the MLflow UI and then point your serving container at the new version.</p>
<p>For zero-downtime model updates, implement a reload endpoint that swaps models without restarting:</p>
<pre><code class="language-python">@app.post("/admin/reload")
async def reload_model():
    global model
    model = mlflow.pyfunc.load_model(MODEL_URI)
    return {"status": "reloaded"}
</code></pre>
<h2 id="heading-how-to-configure-gpu-passthrough-for-training">How to Configure GPU Passthrough for Training</h2>
<p>By default, Docker containers can't see the GPU hardware on the host machine. "GPU passthrough" means giving a container access to the host's GPUs so that libraries like PyTorch and TensorFlow can use them for accelerated computation.</p>
<p>This requires two things on the host (the machine running Docker, not inside the container):</p>
<ol>
<li><p><strong>NVIDIA GPU drivers</strong> installed and working. Verify with <code>nvidia-smi</code>. If that command shows your GPUs, you're good.</p>
</li>
<li><p><strong>NVIDIA Container Toolkit</strong> installed. This is the bridge between Docker and the GPU drivers. Install it from the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">NVIDIA docs</a> and verify with <code>docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi</code>. If you see your GPU listed, the toolkit is working.</p>
</li>
</ol>
<p>Once the host is set up, GPU access in Docker Compose looks like this:</p>
<pre><code class="language-yaml">services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
</code></pre>
<p>The <code>deploy.resources.reservations.devices</code> block is saying: "this container needs all available NVIDIA GPUs." Inside the container, PyTorch and TensorFlow will see the GPUs and use them automatically. You can verify by adding <code>print(torch.cuda.is_available())</code> to your training script, which should print <code>True</code>.</p>
<p>If you're running Compose v2.30.0+, you can use the shorter <code>gpus</code> syntax:</p>
<pre><code class="language-yaml">services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    gpus: all
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
</code></pre>
<p>For multi-GPU training with frameworks like PyTorch's DistributedDataParallel, you can assign specific GPUs using <code>device_ids</code>. This matters when running multiple training jobs at the same time:</p>
<pre><code class="language-yaml">services:
  training-job-1:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1

  training-job-2:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["2", "3"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
</code></pre>
<p>Note that <code>CUDA_VISIBLE_DEVICES</code> inside the container is relative to the devices assigned by Docker, not the host GPU indices. Both containers see their GPUs as device 0 and 1, even though they're using different physical GPUs.</p>
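<p>A quick way to see this remapping from inside a running container is to enumerate the devices PyTorch can see. This is a small diagnostic sketch you could drop into your training script; in the two-job setup above, both containers would report two devices indexed 0 and 1:</p>
<pre><code class="language-python">import torch

# Prints the GPUs visible inside this container, not the host indices
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"device {i}: {torch.cuda.get_device_name(i)}")
</code></pre>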
<h2 id="heading-how-to-tie-it-all-together-with-compose-profiles">How to Tie It All Together with Compose Profiles</h2>
<p>If you're new to Compose profiles: by default, <code>docker compose up</code> starts every service defined in your <code>docker-compose.yml</code>. But you don't always want everything running. Your MLflow server and serving API should run all the time, but the training container should only launch when you're actually training a model (and it needs a GPU, which you might not have on your laptop).</p>
<p>Profiles solve this. When you add <code>profiles: ["train"]</code> to a service, that service is excluded from <code>docker compose up</code> by default. It only starts when you explicitly activate the profile with <code>docker compose --profile train</code>. This means one file defines your entire ML infrastructure, but you control what runs and when.</p>
<p>Here's the complete <code>docker-compose.yml</code> that ties every piece from this article together:</p>
<pre><code class="language-yaml">services:
  # --- Always-on infrastructure ---
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: &gt;
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  serving:
    build: { context: ., dockerfile: Dockerfile.serve }
    ports:
      - "8000:8000"
    environment:
      - MODEL_URI=models:/fraud-detector/production
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      mlflow: { condition: service_started }

  # --- Training (on-demand) ---
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    profiles: ["train"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
      - ./configs:/app/configs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      mlflow: { condition: service_started }

volumes:
  postgres-data:
  mlflow-artifacts:
</code></pre>
<p>The day-to-day workflow with this file:</p>
<pre><code class="language-bash"># Step 1: Start the infrastructure (MLflow + Postgres + serving API)
# The -d flag runs everything in the background
docker compose up -d

# Step 2: Open the MLflow UI to see past experiments
open http://localhost:5000    # macOS
# xdg-open http://localhost:5000  # Linux

# Step 3: Check that the serving API is healthy
curl http://localhost:8000/health
# Should return: {"status":"healthy"}

# Step 4: Run a training job (pulls data via DVC, logs to MLflow)
# This only starts the "training" service because of the profile flag
docker compose --profile train run training

# Step 5: Watch training progress in the MLflow UI at localhost:5000
# You'll see metrics updating in real time if your training code logs them

# Step 6: After training completes, promote the model in MLflow UI
# Click the model, go to "Register Model", set stage to "Production"

# Step 7: Restart the serving container to pick up the new model version
docker compose restart serving

# Step 8: Test the new model
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 500, "merchant_category": "electronics", "hour": 23}'
</code></pre>
<p>This single-file approach means a new team member can clone the repo, run <code>docker compose up -d</code>, and have the complete ML infrastructure running locally within minutes. The same containers deploy to staging and production with only environment variable changes (database credentials, model URIs, GPU allocation).</p>
<h2 id="heading-reproducibility-the-whole-point">Reproducibility: The Whole Point</h2>
<p>Everything in this article serves one goal: reproducibility. The ability to take any commit hash, build the same containers, pull the same data, and produce the same model.</p>
<p>Here are the practices that make this work:</p>
<h3 id="heading-pin-everything">Pin Everything</h3>
<p>Pin your base images to specific digests, not just tags. Pin your Python packages to exact versions with <code>pip freeze &gt; requirements.txt</code>. Use fixed random seeds in your training code and log them in MLflow.</p>
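<p>For the random-seed part, here's a minimal sketch of what that can look like in a training script. The seed value and the helper name are arbitrary choices, not anything a library requires:</p>
<pre><code class="language-python">import random

import mlflow
import numpy as np
import torch

SEED = 42


def set_seed(seed: int) -&gt; None:
    """Fix every source of randomness we control."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


with mlflow.start_run():
    set_seed(SEED)
    mlflow.log_param("random_seed", SEED)
    # ... rest of training ...
</code></pre>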
<h3 id="heading-log-everything">Log Everything</h3>
<p>Every training run should log the exact library versions (<code>pip freeze</code>), the Git commit hash, the DVC data version, all hyperparameters, and all evaluation metrics to MLflow. You can automate this:</p>
<pre><code class="language-python">import subprocess
import mlflow

with mlflow.start_run():
    # Log environment info automatically
    pip_freeze = subprocess.check_output(["pip", "freeze"]).decode()
    mlflow.log_text(pip_freeze, "pip_freeze.txt")

    git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.log_param("git_commit", git_hash)

    # ... rest of training ...
</code></pre>
<h3 id="heading-version-everything">Version Everything</h3>
<p>Git for code, DVC for data, MLflow for experiments, Docker digests for environments. The combination creates a complete provenance chain. When a stakeholder asks why a model made a particular prediction, you can trace it back to the exact code, data, and hyperparameters that produced it. For regulated industries like finance and healthcare, that traceability is a compliance requirement, not a nice-to-have.</p>
<h2 id="heading-where-this-breaks-down">Where This Breaks Down</h2>
<p>This approach works well for small-to-medium teams running on single hosts or small clusters. Here's where you'll hit walls:</p>
<p><strong>Large datasets.</strong> Don't mount multi-terabyte datasets into containers. Use object storage (S3, GCS) and stream data during training. DVC handles the versioning, but the data itself should live outside Docker entirely.</p>
<p><strong>GPU driver mismatches.</strong> Your container's CUDA version must be compatible with the host driver. Test on identical hardware and driver versions to what you'll run in production. Document the minimum driver version in your README.</p>
<p><strong>Multi-node training.</strong> When you need to distribute training across multiple machines, you'll outgrow Compose. Kubernetes with Kubeflow or KServe is the standard path for distributed training and auto-scaled serving.</p>
<p><strong>Serving at scale.</strong> A single container running uvicorn handles moderate traffic. For high-throughput inference, you'll need a load balancer, multiple replicas, and potentially a dedicated serving framework like NVIDIA Triton Inference Server or TensorFlow Serving. Compose can run multiple replicas with <code>docker compose up --scale serving=3</code>, but it doesn't give you the routing, health-based load balancing, or rolling updates that a real orchestrator provides.</p>
<p><strong>Secrets in production.</strong> The Compose file above uses plaintext passwords for local development. In production, use Docker secrets, HashiCorp Vault, or your cloud provider's secret manager. Never commit credentials to your repo.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Containerizing your MLOps pipeline turns fragile, environment-dependent models into reproducible, deployable artifacts. Multi-stage builds keep images lean. MLflow gives you experiment tracking and model lineage. DVC links code to data. GPU passthrough preserves training performance. A single Compose file with profiles ties the whole workflow together.</p>
<p>Even with the caveats above, containerized MLOps eliminates the most common source of ML project delays: the environment mismatch between development and production.</p>
<p>That fraud detection model I mentioned at the start? We eventually containerized the entire pipeline around it, and the three weeks we spent debugging its deployment don't happen anymore. The next model we shipped went from "notebook finished" to "running in production" in two days, and most of that time went to evaluation and review, not fighting environments.</p>
<p>Containerization doesn't make your models better. It gets the infrastructure out of the way so you can focus on the work that does.</p>
<p>If you found this useful, you can find me writing about MLOps, containerized workflows, and production AI systems on my blog.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an MCP Server with Python, Docker, and Claude Code ]]>
                </title>
                <description>
                    <![CDATA[ Every MCP tutorial I've found so far has followed the same basic script: build a server, point Claude Desktop at it, screenshot the chat window, done. This is fine if you want a demo. But it's not fin ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-mcp-server-with-python-docker-and-claude-code/</link>
                <guid isPermaLink="false">69b09018abc0d95001a8f07f</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ML ]]>
                    </category>
                
                    <category>
                        <![CDATA[ claude.ai ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mcp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mcp server ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Tue, 10 Mar 2026 21:41:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/02826050-87fa-42cb-8167-73bca4b42616.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every MCP tutorial I've found so far has followed the same basic script: build a server, point Claude Desktop at it, screenshot the chat window, done.</p>
<p>This is fine if you want a demo. But it's not fine if you want something you can ship, defend in an interview, or hand to another developer without a README that starts with "first, install this Electron app."</p>
<p>So I built an MCP server in Python, containerized it with Docker, and wired it into Claude Code – all from the terminal, no GUI required.</p>
<p>This article walks through the full loop in one afternoon: what MCP actually is, why it matters now that OpenAI and Google have adopted it, the real security problems nobody puts in their tutorial (complete with CVEs), and every command you need to go from an empty directory to a working tool.</p>
<p>If you're between jobs and need a portfolio project that shows you understand how AI tooling actually works under the hood, this is the one.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-what-you-will-build">What You Will Build</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-what-is-mcp-and-why-should-you-care">What is MCP (and Why Should You Care)?</a></p>
</li>
<li><p><a href="#heading-why-claude-code-instead-of-claude-desktop">Why Claude Code Instead of Claude Desktop?</a></p>
</li>
<li><p><a href="#heading-step-1-build-the-mcp-server">Step 1: Build the MCP Server</a></p>
</li>
<li><p><a href="#heading-step-2-test-it-locally">Step 2: Test It Locally</a></p>
</li>
<li><p><a href="#heading-step-3-dockerize-it">Step 3: Dockerize It</a></p>
</li>
<li><p><a href="#heading-step-4-wire-it-into-claude-code">Step 4: Wire It Into Claude Code</a></p>
</li>
<li><p><a href="#heading-step-5-use-it">Step 5: Use It</a></p>
</li>
<li><p><a href="#heading-security-what-the-other-tutorials-leave-out">Security: What the Other Tutorials Leave Out</a></p>
</li>
<li><p><a href="#heading-what-to-do-next">What to Do Next</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-what-you-will-build">What You Will Build</h2>
<p>By the end of this tutorial, you will have:</p>
<ul>
<li><p>A Python MCP server that exposes custom tools to any MCP-compatible AI client</p>
</li>
<li><p>A Docker container that packages the server for reproducible deployment</p>
</li>
<li><p>A working connection between that container and Claude Code in your terminal</p>
</li>
<li><p>An understanding of the security risks involved and how to mitigate the worst of them</p>
</li>
</ul>
<p>The server we are building is a <strong>project scaffolder</strong>. You give it a project name and a language, and it generates a starter directory structure with the right files. It's simple enough to build in an afternoon, but useful enough to actually put on your résumé.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You will need the following installed on your machine:</p>
<ul>
<li><p><strong>Python 3.10+</strong> (check with <code>python3 --version</code>)</p>
</li>
<li><p><strong>Docker</strong> (check with <code>docker --version</code>)</p>
</li>
<li><p><strong>Claude Code</strong> with an active Claude Pro, Max, or API plan (check with <code>claude --version</code>)</p>
</li>
<li><p><strong>Node.js 20+</strong> (required by Claude Code – check with <code>node --version</code>)</p>
</li>
<li><p>A terminal you are comfortable in</p>
</li>
</ul>
<p>If you don't have Claude Code installed yet, follow the <a href="https://code.claude.com/docs/en/getting-started">official installation instructions</a>. The npm installation method is deprecated, so make sure you use the native binary installer instead.</p>
<h2 id="heading-what-is-mcp-and-why-should-you-care">What is MCP (and Why Should You Care)?</h2>
<p>The Model Context Protocol (MCP) is an open standard that lets AI models connect to external tools and data sources. Anthropic released it in November 2024, and within a year it became the default way to extend what an LLM can do. OpenAI adopted it in March 2025. Google DeepMind followed in April. The protocol now has over 97 million monthly SDK downloads and more than 10,000 active servers.</p>
<p>The easiest way to think about MCP is as a USB-C port for AI. Before MCP, every AI provider had its own way of calling tools. OpenAI had function calling. Google had their own format. If you wanted your tool to work with multiple models, you had to implement it multiple times. MCP gives you one interface that works everywhere.</p>
<p>Here is how the pieces fit together:</p>
<ul>
<li><p>An <strong>MCP server</strong> exposes tools, resources, and prompts. It is your code.</p>
</li>
<li><p>An <strong>MCP client</strong> (like Claude Code, Claude Desktop, or Cursor) discovers those tools and calls them on behalf of the LLM.</p>
</li>
<li><p>The <strong>transport</strong> is how they communicate. For local servers, that's usually stdio (standard input/output). For remote servers, it's HTTP.</p>
</li>
</ul>
<p>When you type a message in Claude Code and it decides to use one of your tools, here is what happens: Claude Code sends a JSON-RPC 2.0 message to your server over stdin, your server executes the tool and writes the result to stdout, and Claude Code reads it back. The LLM never talks to your server directly. The client is always in the middle.</p>
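<p>To make that concrete, here's roughly the shape of the <code>tools/call</code> request the client writes to your server's stdin. This is illustrative only: the field names follow the MCP spec, and the tool shown is the scaffolder we build later in this article.</p>
<pre><code class="language-python"># What an incoming tool call roughly looks like on the wire (JSON-RPC 2.0)
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "scaffold_project",
        "arguments": {"name": "weather-api", "language": "python"},
    },
}
</code></pre>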
<p>If you want the deeper architecture breakdown, freeCodeCamp already has a <a href="https://www.freecodecamp.org/news/how-does-an-mcp-work-under-the-hood/">solid explainer on how MCP works under the hood</a>. Here, I will focus on building.</p>
<h2 id="heading-why-claude-code-instead-of-claude-desktop">Why Claude Code Instead of Claude Desktop?</h2>
<p>Most MCP tutorials use Claude Desktop as the client. That works, but Claude Code has a few advantages for developers:</p>
<ol>
<li><p><strong>It lives in your terminal.</strong> No GUI to configure. No JSON files to hand-edit in hidden config directories. You add an MCP server with one command and you are done.</p>
</li>
<li><p><strong>It's already where you code.</strong> If you're writing the server, testing it, and connecting it, doing all of that in the same terminal session cuts the context switching.</p>
</li>
<li><p><strong>It works on headless machines.</strong> If you're SSHing into a dev box or running in CI, Claude Desktop isn't an option. Claude Code is.</p>
</li>
<li><p><strong>It's also an MCP server itself.</strong> Claude Code can expose its own tools (file reading, writing, shell commands) to other MCP clients via <code>claude mcp serve</code>. That's a neat trick we won't use today, but it's worth knowing about.</p>
</li>
</ol>
<p>The relevant commands:</p>
<pre><code class="language-bash"># Add an MCP server
claude mcp add &lt;name&gt; -- &lt;command&gt;

# List configured servers
claude mcp list

# Remove a server
claude mcp remove &lt;name&gt;

# Check MCP status inside Claude Code
/mcp
</code></pre>
<h2 id="heading-step-1-build-the-mcp-server">Step 1: Build the MCP Server</h2>
<p>We're using <a href="https://github.com/jlowin/fastmcp">FastMCP</a>, a Python framework that handles all the protocol plumbing so you can focus on your tools. Create a new project directory and set it up:</p>
<pre><code class="language-bash">mkdir mcp-scaffolder &amp;&amp; cd mcp-scaffolder
python3 -m venv .venv
source .venv/bin/activate
pip install "mcp[cli]&gt;=1.25,&lt;2"
</code></pre>
<p>Why pin the version? The MCP Python SDK v2.0 is in development and will change the transport layer significantly. Pinning to &gt;=1.25,&lt;2 keeps your server working until you're ready to migrate.</p>
<p>Now create <code>server.py</code>:</p>
<pre><code class="language-python"># server.py
from mcp.server.fastmcp import FastMCP
import os
import json

mcp = FastMCP("project-scaffolder")

# Templates for different languages
TEMPLATES = {
    "python": {
        "files": {
            "main.py": '"""Entry point."""\n\n\ndef main():\n    print("Hello, world!")\n\n\nif __name__ == "__main__":\n    main()\n',
            "requirements.txt": "",
            "README.md": "# {name}\n\nA Python project.\n\n## Setup\n\n```bash\npip install -r requirements.txt\npython main.py\n```\n",
            ".gitignore": "__pycache__/\n*.pyc\n.venv/\n",
        },
        "dirs": ["tests"],
    },
    "node": {
        "files": {
            "index.js": 'console.log("Hello, world!");\n',
            "package.json": '{{\n  "name": "{name}",\n  "version": "1.0.0",\n  "main": "index.js"\n}}\n',
            "README.md": "# {name}\n\nA Node.js project.\n\n## Setup\n\n```bash\nnpm install\nnode index.js\n```\n",
            ".gitignore": "node_modules/\n",
        },
        "dirs": [],
    },
    "go": {
        "files": {
            "main.go": 'package main\n\nimport "fmt"\n\nfunc main() {{\n\tfmt.Println("Hello, world!")\n}}\n',
            "go.mod": "module {name}\n\ngo 1.21\n",
            "README.md": "# {name}\n\nA Go project.\n\n## Setup\n\n```bash\ngo run main.go\n```\n",
            ".gitignore": "bin/\n",
        },
        "dirs": ["cmd", "internal"],
    },
}


@mcp.tool()
def scaffold_project(name: str, language: str) -&gt; str:
    """Create a new project directory structure.

    Args:
        name: The project name (used as the directory name)
        language: The programming language - one of: python, node, go
    """
    language = language.lower().strip()

    if language not in TEMPLATES:
        return json.dumps({
            "error": f"Unsupported language: {language}",
            "supported": list(TEMPLATES.keys()),
        })

    template = TEMPLATES[language]
    base_path = os.path.join(os.getcwd(), name)

    if os.path.exists(base_path):
        return json.dumps({
            "error": f"Directory already exists: {name}",
        })

    # Create the project directory
    os.makedirs(base_path, exist_ok=True)

    # Create subdirectories
    for dir_name in template["dirs"]:
        os.makedirs(os.path.join(base_path, dir_name), exist_ok=True)

    # Create files
    created_files = []
    for filename, content in template["files"].items():
        filepath = os.path.join(base_path, filename)
        # str.format fills in {name} and collapses the escaped {{ }} braces in the templates
        formatted_content = content.format(name=name)
        with open(filepath, "w") as f:
            f.write(formatted_content)
        created_files.append(filename)

    return json.dumps({
        "status": "created",
        "path": base_path,
        "language": language,
        "files": created_files,
        "directories": template["dirs"],
    })


@mcp.tool()
def list_templates() -&gt; str:
    """List all available project templates and their contents."""
    result = {}
    for lang, template in TEMPLATES.items():
        result[lang] = {
            "files": list(template["files"].keys()),
            "directories": template["dirs"],
        }
    return json.dumps(result, indent=2)


if __name__ == "__main__":
    mcp.run(transport="stdio")
</code></pre>
<p>A few things to notice about this code:</p>
<p>Tools return strings. MCP tools communicate through text. I'm returning JSON strings so the LLM can parse the results reliably. You could return plain text, but structured data gives the model more to work with.</p>
<p>The <code>@mcp.tool()</code> decorator does the heavy lifting. FastMCP reads your function signature and docstring to generate the JSON schema that tells the LLM what this tool does, what arguments it takes, and what types they are. Good docstrings aren't optional here – they're how the LLM decides whether to call your tool.</p>
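<p>To give you a sense of what that means, here's roughly the tool definition FastMCP derives from <code>scaffold_project</code>'s signature and docstring. This is illustrative; the exact schema the SDK emits may differ in detail:</p>
<pre><code class="language-python"># Approximate tool definition advertised to the client during discovery
scaffold_project_tool = {
    "name": "scaffold_project",
    "description": "Create a new project directory structure. ...",
    "inputSchema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "language": {"type": "string"},
        },
        "required": ["name", "language"],
    },
}
</code></pre>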
<p><code>transport="stdio"</code> is the key line. This tells FastMCP to communicate over standard input/output, which is what Claude Code expects for local servers.</p>
<h2 id="heading-step-2-test-it-locally">Step 2: Test It Locally</h2>
<p>Before we Dockerize anything, make sure the server actually works:</p>
<pre><code class="language-bash"># Quick smoke test - the server should start without errors
python server.py
</code></pre>
<p>You should see... nothing. That is correct. An MCP server over stdio just sits there waiting for JSON-RPC messages on stdin. Press <code>Ctrl+C</code> to stop it.</p>
<p>For a proper test, use the MCP Inspector (Anthropic's debugging tool):</p>
<pre><code class="language-bash"># Install and run the inspector
npx @modelcontextprotocol/inspector python server.py
</code></pre>
<p>This opens a web interface where you can see your tools, call them manually, and inspect the JSON-RPC messages going back and forth. Verify that both <code>scaffold_project</code> and <code>list_templates</code> show up and return sensible results.</p>
<p><strong>Here's a debugging tip that will save you time:</strong> If your MCP server logs anything to stdout, it will corrupt the JSON-RPC stream and the client will disconnect. Use stderr for all logging: <code>print("debug info", file=sys.stderr)</code>. This is the single most common source of "my server connects but then immediately fails" bugs. The New Stack called stdio transport "incredibly fragile" for exactly this reason.</p>
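<p>One way to make that hard to get wrong is to route Python's <code>logging</code> module to stderr once, near the top of <code>server.py</code>, and then never call <code>print()</code> without a file argument. A small sketch:</p>
<pre><code class="language-python">import logging
import sys

# stdout belongs to JSON-RPC; everything we log goes to stderr instead
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger("project-scaffolder")

logger.info("scaffolder server starting")
</code></pre>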
<h2 id="heading-step-3-dockerize-it">Step 3: Dockerize It</h2>
<p>Create a <code>Dockerfile</code> in your project root:</p>
<pre><code class="language-dockerfile">FROM python:3.12-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy server code
COPY server.py .

# MCP servers over stdio need unbuffered output
ENV PYTHONUNBUFFERED=1

# The server reads from stdin and writes to stdout
CMD ["python", "server.py"]
</code></pre>
<p>Create <code>requirements.txt</code>:</p>
<pre><code class="language-plaintext">mcp[cli]&gt;=1.25,&lt;2
</code></pre>
<p>Build and verify:</p>
<pre><code class="language-bash">docker build -t mcp-scaffolder .

# Quick test - should start without errors
docker run -i mcp-scaffolder
</code></pre>
<p>Again, you'll see nothing because the server is waiting for input. <code>Ctrl+C</code> to stop.</p>
<p>Two things matter in this Dockerfile:</p>
<ol>
<li><p><code>PYTHONUNBUFFERED=1</code> <strong>is critical.</strong> Without it, Python buffers stdout, and the MCP client may hang waiting for responses that are sitting in a buffer. This is one of those bugs that works fine in local testing and breaks in Docker.</p>
</li>
<li><p><code>docker run -i</code> <strong>(interactive mode) is required.</strong> The <code>-i</code> flag keeps stdin open so the MCP client can send messages to the container. Without it, the server gets an immediate EOF and exits.</p>
</li>
</ol>
<h2 id="heading-step-4-wire-it-into-claude-code">Step 4: Wire It Into Claude Code</h2>
<p>Now connect your Docker container to Claude Code:</p>
<pre><code class="language-bash">claude mcp add scaffolder -- docker run -i --rm mcp-scaffolder
</code></pre>
<p>That's the whole command. Let me break it down:</p>
<ul>
<li><p><code>claude mcp add</code> registers a new MCP server</p>
</li>
<li><p><code>scaffolder</code> is the name you will reference it by</p>
</li>
<li><p>Everything after <code>--</code> is the command Claude Code runs to start the server</p>
</li>
<li><p><code>docker run -i --rm mcp-scaffolder</code> starts the container with interactive stdin and removes it when done</p>
</li>
</ul>
<p>Verify that it registered:</p>
<pre><code class="language-bash">claude mcp list
</code></pre>
<p>You should see <code>scaffolder</code> in the output with a <code>stdio</code> transport type.</p>
<p>Now launch Claude Code and check the connection:</p>
<pre><code class="language-bash">claude
</code></pre>
<p>Once inside Claude Code, type <code>/mcp</code> to see the status of your MCP servers. You should see <code>scaffolder</code> listed as connected with two tools available.</p>
<h2 id="heading-step-5-use-it">Step 5: Use It</h2>
<p>Still inside Claude Code, try it out:</p>
<pre><code class="language-plaintext">Create a new Python project called "weather-api"
</code></pre>
<p>Claude Code should discover your <code>scaffold_project</code> tool, call it with <code>name="weather-api"</code> and <code>language="python"</code>, and report back what it created. Check your filesystem and you should see the full project structure.</p>
<p>Try a few more:</p>
<pre><code class="language-plaintext">What project templates are available?
</code></pre>
<pre><code class="language-plaintext">Scaffold a Go project called "url-shortener"
</code></pre>
<p>If Claude Code doesn't pick up your tools, run <code>/mcp</code> to check the connection status. If it shows as disconnected, the most common causes are that the Docker image failed to build, stdout is being polluted (check for stray print statements), or the Docker daemon is not running.</p>
<h2 id="heading-security-what-the-other-tutorials-leave-out">Security: What the Other Tutorials Leave Out</h2>
<p>This is the section most MCP tutorials skip. They should not. MCP has had real security incidents, not theoretical ones, and understanding them makes you a better developer.</p>
<h3 id="heading-the-prompt-injection-problem">The Prompt Injection Problem</h3>
<p>MCP servers execute code on your machine based on what an LLM decides to do. If an attacker can influence what the LLM sees, they can influence what your server does. This is called prompt injection, and it is the number one unsolved security problem in the MCP ecosystem.</p>
<p>In May 2025, researchers at Invariant Labs demonstrated this against the official GitHub MCP server. They created a malicious GitHub issue that, when read by an AI agent, hijacked the agent into leaking private repository data (including salary information) into a public pull request. The root cause was an overly broad Personal Access Token combined with untrusted content landing in the LLM's context window.</p>
<p>This was not a contrived lab demo. It used the official GitHub MCP server, the kind of thing people install from the MCP server directory without a second thought.</p>
<h3 id="heading-real-cves-not-theory">Real CVEs, Not Theory</h3>
<p>The ecosystem has accumulated real vulnerability reports:</p>
<ul>
<li><p><strong>CVE-2025-6514:</strong> A critical command-injection bug in <code>mcp-remote</code>, a popular OAuth proxy that 437,000+ environments used. An attacker could execute arbitrary OS commands through crafted OAuth redirect URIs.</p>
</li>
<li><p><strong>CVE-2025-6515:</strong> Session hijacking in <code>oatpp-mcp</code> through predictable session IDs, letting attackers inject prompts into other users' sessions.</p>
</li>
<li><p><strong>MCP Inspector RCE:</strong> Anthropic's own debugging tool allowed unauthenticated remote code execution. Inspecting a malicious server meant giving the attacker a shell on your machine.</p>
</li>
</ul>
<p>An Equixly security assessment found command injection in 43% of tested MCP server implementations. Nearly a third were vulnerable to server-side request forgery.</p>
<h3 id="heading-what-you-should-actually-do">What You Should Actually Do</h3>
<p>For the server we built today, here is what matters:</p>
<h4 id="heading-limit-file-system-access">Limit file system access</h4>
<p>Our Docker container doesn't mount your home directory. That's intentional. If you need the server to write files to your host, mount only the specific directory you need: <code>docker run -i --rm -v $(pwd)/projects:/app/projects mcp-scaffolder</code>. Never mount <code>/</code> or <code>~</code>.</p>
<h4 id="heading-validate-all-inputs">Validate all inputs</h4>
<p>Our <code>scaffold_project</code> tool checks that the language is in a known list and that the directory does not already exist. But think about what happens if someone passes <code>name="../../etc/passwd"</code> as the project name. Path traversal is the kind of thing you need to catch. Add this to the tool:</p>
<pre><code class="language-python"># Add this validation at the top of scaffold_project
if ".." in name or "/" in name or "\\" in name:
    return json.dumps({"error": "Invalid project name"})
</code></pre>
<h4 id="heading-use-least-privilege-tokens">Use least-privilege tokens</h4>
<p>If your MCP server connects to an API, give it the minimum permissions it needs. The GitHub MCP incident happened because the PAT had access to every private repo. A read-only token scoped to one repo would have contained the blast radius.</p>
<h4 id="heading-do-not-install-mcp-servers-from-untrusted-sources">Do not install MCP servers from untrusted sources</h4>
<p>A malicious npm package posing as a "Postmark MCP Server" was caught silently BCC'ing all emails to an attacker's address. Treat MCP server packages with the same caution you would give any code that runs on your machine with your permissions.</p>
<h2 id="heading-what-to-do-next">What to Do Next</h2>
<p>You have a working MCP server in a Docker container, connected to Claude Code. Here is how to make it portfolio-ready:</p>
<ol>
<li><p><strong>Add more tools:</strong> The scaffolder is a starting point. Add a tool that reads a project's dependency file and lists outdated packages. Add one that generates a Dockerfile for an existing project. Each tool is a function with a decorator – the pattern is the same every time.</p>
</li>
<li><p><strong>Add tests:</strong> Write pytest tests that call your tool functions directly and verify the output. MCP tools are just Python functions. Test them like Python functions (there's a small sketch right after this list).</p>
</li>
<li><p><strong>Push the Docker image:</strong> Tag it and push to Docker Hub or GitHub Container Registry. Then your <code>claude mcp add</code> command becomes <code>claude mcp add scaffolder -- docker run -i --rm yourusername/mcp-scaffolder:latest</code> and anyone can use it.</p>
</li>
<li><p><strong>Write a README that explains the security model:</strong> What permissions does your server need? What file system access? What happens if inputs are malicious? Answering these questions in your README signals that you think about security, which is exactly what hiring managers are looking for right now.</p>
</li>
</ol>
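<p>As promised above, here's a minimal pytest sketch. It assumes the <code>@mcp.tool()</code> decorator in the 1.x SDK leaves the underlying functions directly callable, which is why we can import and call them like ordinary Python:</p>
<pre><code class="language-python"># test_server.py
import json

from server import list_templates, scaffold_project


def test_list_templates_includes_python():
    templates = json.loads(list_templates())
    assert "python" in templates
    assert "main.py" in templates["python"]["files"]


def test_scaffold_rejects_unknown_language():
    # Unknown languages should error out before touching the filesystem
    result = json.loads(scaffold_project("demo", "cobol"))
    assert "error" in result
    assert "cobol" in result["error"]
</code></pre>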
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>We built a Python MCP server with FastMCP, containerized it with Docker, and connected it to Claude Code. The whole thing fits in about 100 lines of Python, a six-line Dockerfile, and one <code>claude mcp add</code> command.</p>
<p>The MCP ecosystem is real and growing fast. The protocol has the backing of Anthropic, OpenAI, and Google. It's now governed by the Linux Foundation. But it's also young, and the security story is still being written. Build with it, but build with your eyes open.</p>
<p>If you want to go deeper, here are the resources I found most useful:</p>
<ul>
<li><p><a href="https://modelcontextprotocol.io/specification/2025-11-25">MCP specification</a>: the actual protocol docs</p>
</li>
<li><p><a href="https://code.claude.com/docs/en/mcp">Claude Code MCP documentation</a>: how Claude Code implements MCP</p>
</li>
<li><p><a href="https://github.com/jlowin/fastmcp">FastMCP GitHub</a>: the Python framework we used</p>
</li>
<li><p><a href="https://authzed.com/blog/timeline-mcp-breaches">AuthZed's timeline of MCP security incidents</a>: required reading if you are building MCP servers for production</p>
</li>
<li><p><a href="https://simonwillison.net/2025/Apr/9/mcp-prompt-injection/">Simon Willison on MCP prompt injection</a>: the clearest explanation of why this is hard to solve</p>
</li>
</ul>
<p>The complete source code for this tutorial is on <a href="https://github.com/balajeeasish/ai-workshop/tree/main/mcp-server">GitHub</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Docker Compose for Production Workloads — with Profiles, Watch Mode, and GPU Support ]]>
                </title>
                <description>
                    <![CDATA[ There's a perception problem with Docker Compose. Ask a room full of platform engineers what they think of it, and you'll hear some version of: "It's great for local dev, but we use Kubernetes for rea ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-docker-compose-for-production-workloads/</link>
                <guid isPermaLink="false">69aadee178c5adcd0e18ddd3</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker compose ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Fri, 06 Mar 2026 14:04:17 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/73c5f43a-321c-4ce1-8eb4-872b532cc8dd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>There's a perception problem with Docker Compose. Ask a room full of platform engineers what they think of it, and you'll hear some version of: "It's great for local dev, but we use Kubernetes for real work."</p>
<p>I get it. I held that same opinion for years. Compose was the thing I used to spin up a Postgres database on my laptop, not something I'd trust with a staging environment, let alone a workload that needed GPU access.</p>
<p>Then 2024 and 2025 happened. Docker shipped a set of features that quietly transformed Compose from a developer convenience tool into something that can handle complex deployment scenarios. Profiles let you manage multiple environments from a single file. Watch mode killed the painful rebuild cycle that made container-based development feel sluggish. GPU support opened the door to ML inference workloads. And a bunch of smaller improvements (better health checks, Bake integration, structured logging) filled in the gaps that used to make Compose feel like a toy.</p>
<p>Here's what I'll cover: using Docker Compose profiles to manage multiple environments from one file, setting up watch mode for instant code syncing during development, configuring GPU passthrough for machine learning workloads, implementing proper health checks and startup ordering so your services stop crashing on cold starts, and using Bake to bridge the gap between your local Compose workflow and production image builds. I'll also tell you where Compose still falls short and where you should reach for something else.</p>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>You should be comfortable with Docker basics and have written a <code>compose.yaml</code> file before. You'll need Docker Compose v2 installed. The minimum version depends on which features you want: <code>service_healthy</code> dependency conditions require v2.20.0+, watch mode requires v2.22.0+, and the <code>gpus:</code> shorthand requires v2.30.0+. Run <code>docker compose version</code> to check what you have.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-modern-compose-file-whats-changed">The Modern Compose File: What's Changed</a></p>
</li>
<li><p><a href="#heading-how-to-use-profiles-to-manage-multiple-environments">How to Use Profiles to Manage Multiple Environments</a></p>
<ul>
<li><a href="#heading-real-world-profile-patterns-ive-used">Real-World Profile Patterns I've Used</a></li>
</ul>
</li>
<li><p><a href="#heading-how-to-use-watch-mode-to-end-the-rebuild-cycle">How to Use Watch Mode to End the Rebuild Cycle</a></p>
<ul>
<li><a href="#heading-watch-mode-vs-bind-mounts">Watch Mode vs. Bind Mounts</a></li>
</ul>
</li>
<li><p><a href="#heading-how-to-set-up-gpu-support-for-machine-learning-workloads">How to Set Up GPU Support for Machine Learning Workloads</a></p>
<ul>
<li><a href="#heading-how-to-combine-multi-gpu-workloads-with-profiles">How to Combine Multi-GPU Workloads with Profiles</a></li>
</ul>
</li>
<li><p><a href="#heading-how-to-configure-health-checks-dependencies-and-startup-ordering">How to Configure Health Checks, Dependencies, and Startup Ordering</a></p>
</li>
<li><p><a href="#heading-how-to-use-bake-for-production-image-builds">How to Use Bake for Production Image Builds</a></p>
</li>
<li><p><a href="#heading-what-compose-is-not-an-honest-assessment">What Compose Is Not (An Honest Assessment)</a></p>
</li>
<li><p><a href="#heading-a-practical-adoption-path">A Practical Adoption Path</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-the-modern-compose-file-whats-changed">The Modern Compose File: What's Changed</h2>
<p>If you haven't looked at a Compose file recently, the first thing you'll notice is that the <code>version</code> field is gone. Docker Compose v2 ignores it entirely, and including it actually triggers a deprecation warning. A modern <code>compose.yaml</code> starts cleanly with your services, no preamble needed.</p>
<p>But the structural changes go deeper than that. Here's what a modern, production-aware Compose file looks like for a typical web application stack:</p>
<pre><code class="language-yaml">services:
  api:
    image: ghcr.io/myorg/api:${TAG:-latest}
    env_file: [configs/common.env]
    environment:
      - NODE_ENV=${NODE_ENV:-production}
    ports:
      - "8080:8080"
    depends_on:
      db:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  db:
    image: postgres:16-alpine
    volumes:
      - db-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 5

volumes:
  db-data:
</code></pre>
<p>Look at what's in there: resource limits, health checks with dependency conditions, proper volume management. These aren't nice-to-haves. They're the features that make Compose viable beyond your laptop.</p>
<p>Health checks in particular solve one of Compose's oldest and most annoying pain points: the race condition where your web server starts before the database is actually ready to accept connections. If you've ever added <code>sleep 10</code> to a startup script and crossed your fingers, you know what I'm talking about.</p>
<h2 id="heading-how-to-use-profiles-to-manage-multiple-environments">How to Use Profiles to Manage Multiple Environments</h2>
<p>This is the feature that changed my relationship with Compose. Before profiles, managing different environments meant choosing between two painful approaches. Either you maintained multiple Compose files (<code>docker-compose.yml</code>, <code>docker-compose.dev.yml</code>, <code>docker-compose.test.yml</code>, <code>docker-compose.prod.yml</code>) and dealt with the inevitable drift between them. Or you used one big bloated file where you commented out services depending on the context. Both approaches were fragile, and both led to those fun "works on my machine" conversations.</p>
<p>Profiles give you a much cleaner path. You assign services to named groups. Services without a profile always start. Services with a profile only start when you explicitly activate that profile. You can also activate profiles with the <code>COMPOSE_PROFILES</code> environment variable instead of the CLI flag, which is handy for CI (see the <a href="https://docs.docker.com/compose/how-tos/profiles/">official profiles docs</a> for the full syntax).</p>
<p>Here's what that looks like:</p>
<pre><code class="language-yaml">services:
  api:
    image: myapp:latest
    # No profiles = always starts

  db:
    image: postgres:16
    # No profiles = always starts

  debug-tools:
    image: busybox
    profiles: [debug]
    # Only starts with --profile debug

  prometheus:
    image: prom/prometheus
    profiles: [monitoring]
    # Only starts with --profile monitoring

  grafana:
    image: grafana/grafana
    profiles: [monitoring]
    depends_on: [prometheus]
</code></pre>
<p>Now your team operates with simple, memorable commands:</p>
<pre><code class="language-bash"># Development: just the core stack
docker compose up -d

# Development with observability
docker compose --profile monitoring up -d

# CI: core stack only (no monitoring overhead)
docker compose up -d

# Full stack with debugging
docker compose --profile debug --profile monitoring up
</code></pre>
<p>One Compose file. No drift. No guesswork about which override file to pass.</p>
<h3 id="heading-real-world-profile-patterns-ive-used">Real-World Profile Patterns I've Used</h3>
<p>Four patterns I keep coming back to:</p>
<p><strong>The "infra-only" pattern.</strong> This is for developers who run application code natively on their host machine but need infrastructure services like databases, message queues, and caches in containers. You leave infrastructure services without a profile and put application services behind one. Your backend developer runs <code>docker compose up</code> to get Postgres and Redis, then starts the API directly on their host with their favorite debugger attached.</p>
<p><strong>The "mock vs. real" pattern.</strong> You put a <code>payments-mock</code> service in the <code>dev</code> profile and a real payments gateway service in the <code>prod</code> profile. Same Compose file, totally different behavior depending on context. This one saved my team from accidentally hitting a live payment API during development more than once.</p>
<p><strong>The "CI optimization" pattern.</strong> Heavy services like Selenium browsers and monitoring stacks go behind profiles so your CI pipeline skips them. Your test suite runs faster without that overhead, and you only pull those services in when you actually need end-to-end integration tests.</p>
<p><strong>The "AI/ML workloads" pattern.</strong> GPU-dependent services (inference servers, model training containers) go into a <code>gpu</code> profile. Developers without GPUs can still work on the rest of the stack without anything breaking.</p>
<p>One practical tip that's saved me a lot of headaches: document your profiles in the project's README. It sounds obvious, but when a new team member runs <code>docker compose up</code> and wonders why the monitoring dashboard isn't starting, they need a single place to find the answer. A quick table listing each profile and what it includes will save you from answering the same Slack question every onboarding cycle.</p>
<h2 id="heading-how-to-use-watch-mode-to-end-the-rebuild-cycle">How to Use Watch Mode to End the Rebuild Cycle</h2>
<p>If profiles solved the environment management problem, watch mode solved the developer experience problem.</p>
<p>You probably know the old workflow for container-based development. It went like this: edit code, run <code>docker compose build</code>, run <code>docker compose up</code>, test your change, find a bug, edit again, rebuild, restart, test. Each iteration costs you thirty seconds to a minute of waiting. Over a full day of active development, you're losing an hour or more just sitting there watching build logs scroll by.</p>
<p>Watch mode (introduced in Compose v2.22.0 and significantly improved in later releases) monitors your local files and automatically takes action when something changes. It supports three synchronization strategies, and picking the right one for each situation is the key to making it work well. The <a href="https://docs.docker.com/compose/how-tos/file-watch/">official watch mode docs</a> cover the full spec if you want to dig deeper.</p>
<p><code>sync</code> copies changed files directly into the running container. This works best for interpreted languages like Python, JavaScript, and Ruby, and for frameworks with hot module reloading like React, Vue, or Next.js. The file lands in the container, the framework picks up the change, and your browser updates. No rebuild, no restart. If you're working with a compiled language like Go, Rust, or Java, <code>sync</code> won't help you since the code needs to be recompiled. Use <code>rebuild</code> for those instead.</p>
<p><code>rebuild</code> triggers a full image rebuild and container replacement. You want this for dependency changes, like when you update <code>package.json</code> or <code>requirements.txt</code>, or when you modify the Dockerfile itself. In those cases, syncing files isn't enough. You need a fresh image.</p>
<p><code>sync+restart</code> syncs files into the container, then restarts the main process. This is ideal for configuration file changes like <code>nginx.conf</code> or database configs, where the application needs to reload to pick up the new settings but the image itself is fine.</p>
<p>Here's what a real-world watch configuration looks like for a Node.js application:</p>
<pre><code class="language-yaml">services:
  api:
    build: .
    ports: ["3000:3000"]
    command: npx nodemon server.js
    develop:
      watch:
        - action: sync
          path: ./src
          target: /app/src
          ignore:
            - node_modules/
        - action: rebuild
          path: package.json
        - action: sync+restart
          path: ./config
          target: /app/config
</code></pre>
<p>You start it with <code>docker compose up --watch</code>, or you can run <code>docker compose watch</code> as a standalone command if you'd rather keep the file sync events separate from your application logs.</p>
<p>A few things to know before you set this up. Watch mode only works with services that have a local <code>build:</code> context. If you're pulling a prebuilt image from a registry, there's nothing for Compose to sync or rebuild, so watch will ignore that service. Your container also needs basic file utilities (<code>stat</code>, <code>mkdir</code>) installed, and the container <code>USER</code> must have write access to the target path. If you're using a minimal base image like <code>scratch</code> or <code>distroless</code>, the <code>sync</code> action won't work. And if you're on an older Compose version, check which actions are supported: <code>sync+restart</code> and <code>sync+exec</code> were added in later minor releases after the initial v2.22.0 launch.</p>
<p>It's a massive improvement. Edit a source file, save it, and the change is live in under a second for frameworks with hot reload. No context switching to run build commands. No waiting. Just code.</p>
<h3 id="heading-watch-mode-vs-bind-mounts">Watch Mode vs. Bind Mounts</h3>
<p>A fair question you might be asking: bind mounts have provided a form of live-reload for years. Why does watch mode need to exist?</p>
<p>Bind mounts work, but they come with platform-specific issues that have plagued Docker Desktop for a long time. On macOS and Windows, bind mounts go through a filesystem sharing layer between the host OS and the Linux VM running Docker. This introduces permission quirks, performance problems on large directories (ever watched a <code>node_modules</code> folder choke a bind mount on macOS?), and inconsistent file notification behavior that makes hot reload unreliable.</p>
<p>Watch mode sidesteps these issues by explicitly syncing files at the application level. It's more predictable, works consistently across platforms, and gives you more control over what happens when a file changes.</p>
<p>That said, bind mounts still work well for many use cases, especially if you're on native Linux where the performance overhead doesn't exist. Watch mode is the better choice for teams that have run into cross-platform issues, or for anyone who wants the automatic rebuild and restart triggers that bind mounts can't provide.</p>
<h2 id="heading-how-to-set-up-gpu-support-for-machine-learning-workloads">How to Set Up GPU Support for Machine Learning Workloads</h2>
<p>This is the feature that made me rethink what Compose can do.</p>
<p>Docker has supported GPU passthrough for individual containers for years through the NVIDIA Container Toolkit and the <code>--gpus</code> flag. But configuring GPU access in Compose files used to require clunky runtime declarations that were poorly documented and changed between Compose versions. It was the kind of thing where you'd find a Stack Overflow answer from 2021, try it, and discover it didn't work anymore.</p>
<p>The modern Compose spec handles it cleanly through the <code>deploy.resources.reservations.devices</code> block:</p>
<pre><code class="language-yaml">services:
  inference:
    image: myorg/model-server:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
</code></pre>
<p>If you're on Compose v2.30.0 or later, there's also a shorter syntax using the <code>gpus:</code> field:</p>
<pre><code class="language-yaml">services:
  inference:
    image: myorg/model-server:latest
    gpus:
      - driver: nvidia
        count: 1
</code></pre>
<p>Both approaches do the same thing. The <code>deploy.resources</code> syntax works on older Compose versions and gives you more control (like setting <code>device_ids</code> to pin specific GPUs). The <code>gpus:</code> shorthand is cleaner when you just need basic access.</p>
<p><strong>One thing that will trip you up if you skip it:</strong> your host machine needs the right GPU drivers and <code>nvidia-container-toolkit</code> installed before any of this works. Run <code>nvidia-smi</code> on the host first. If that command doesn't show your GPUs, Compose won't see them either. For CUDA workloads, use official GPU base images like <code>nvidia/cuda</code> or the PyTorch/TensorFlow GPU images. The <a href="https://docs.docker.com/compose/how-tos/gpu-support/">Compose GPU access docs</a> walk through the full setup.</p>
<p>That's the whole thing. When you run <code>docker compose up</code>, the inference service gets access to one NVIDIA GPU. You can set <code>count</code> to <code>"all"</code> if you want every available GPU, or use <code>device_ids</code> to assign specific GPUs to specific services.</p>
<h3 id="heading-how-to-combine-multi-gpu-workloads-with-profiles">How to Combine Multi-GPU Workloads with Profiles</h3>
<p>Here's where profiles and GPU support work really well together. Consider an ML workload where you need an LLM for text generation, an embedding model for vector search, and a vector database:</p>
<pre><code class="language-yaml">services:
  vectordb:
    image: milvus/milvus:latest
    # Runs on CPU, no profile needed

  llm-server:
    image: ollama/ollama:latest
    profiles: [gpu]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    volumes:
      - model-cache:/root/.ollama

  embedding-server:
    image: myorg/embeddings:latest
    profiles: [gpu]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
</code></pre>
<p>Developers without GPUs work on the application logic with just <code>docker compose up</code>. The vector database starts, they can write code against its API, and everything runs fine. When it's time to test the full ML pipeline, someone with a multi-GPU workstation runs <code>docker compose --profile gpu up</code> and gets the complete stack with specific GPU assignments.</p>
<p>This pattern has become central to our AIOps platform development. The team building alerting logic doesn't need GPUs. The team training anomaly detection models does. One Compose file serves both teams.</p>
<h2 id="heading-how-to-configure-health-checks-dependencies-and-startup-ordering">How to Configure Health Checks, Dependencies, and Startup Ordering</h2>
<p>One of Compose's most underappreciated improvements is how it handles service dependencies. The <code>depends_on</code> directive now supports conditions that actually mean something (this requires Compose v2.20.0+, see the <a href="https://docs.docker.com/compose/how-tos/startup-order/">startup ordering docs</a> for the full picture):</p>
<pre><code class="language-yaml">depends_on:
  db:
    condition: service_healthy
  redis:
    condition: service_started
</code></pre>
<p>When you combine this with proper health checks, you eliminate the "sleep 10 and hope" pattern that plagues so many Compose setups. Your API service actually waits until PostgreSQL is accepting connections before it tries to start. Not just until the container is running, but until the database process inside it has passed its health check.</p>
<p>One detail that catches people: tune your <code>start_period</code>. Databases like PostgreSQL need time to initialize on first boot, especially if they're running migrations. Without a <code>start_period</code>, the health check starts counting retries immediately and can declare the service unhealthy before it has even had a chance to finish starting up. A config like this works well for most database services:</p>
<pre><code class="language-yaml">healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 5s
  timeout: 2s
  retries: 10
  start_period: 30s
</code></pre>
<p>The <code>start_period</code> gives the container 30 seconds of grace time where failed health checks don't count against the retry limit.</p>
<p>This might seem like a small detail, but if you've ever worked on a stack with eight or ten interconnected services, you know how much time you can waste debugging cascading failures during cold starts. Proper startup ordering prevents all of that and makes your local environment behave much more like production.</p>
<h2 id="heading-how-to-use-bake-for-production-image-builds">How to Use Bake for Production Image Builds</h2>
<p>I mentioned Bake integration earlier, and it's worth its own section because it solves a problem you'll hit as soon as you start using Compose for anything beyond local dev: your development Compose file and your production build process have different needs.</p>
<p>During development, you want fast builds, local caches, and single-platform images. For production, you want tagged images pushed to a registry, multi-platform builds, and build attestations. Trying to cram both into your <code>compose.yaml</code> gets messy fast.</p>
<p>Docker Bake (<code>docker buildx bake</code>) can read your <code>compose.yaml</code> and generate build targets from it, but you can override and extend those targets with a separate <code>docker-bake.hcl</code> file. This keeps your development workflow clean while giving CI the knobs it needs. The <a href="https://docs.docker.com/build/bake/">Bake documentation</a> covers the full HCL syntax and Compose integration.</p>
<p>Here's a minimal <code>docker-bake.hcl</code>:</p>
<pre><code class="language-hcl">group "default" {
  targets = ["api", "worker"]
}

target "api" {
  context    = "api"
  dockerfile = "Dockerfile"
  tags       = ["registry.example.com/team/api:release"]
  platforms  = ["linux/amd64"]
}

target "worker" {
  context    = "worker"
  dockerfile = "Dockerfile"
  tags       = ["registry.example.com/team/worker:release"]
}
</code></pre>
<p>Then your CI pipeline runs <code>docker buildx bake</code> to produce release images, while developers keep using <code>docker compose up --build</code> locally. The two workflows share the same Dockerfiles but have separate build configurations where they need them.</p>
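<p>In practice the split is just two commands, assuming the bake file above sits next to your <code>compose.yaml</code>:</p>
<pre><code class="language-bash"># CI: build (and push) the release images defined in docker-bake.hcl
docker buildx bake --push

# Local development: same Dockerfiles, Compose-driven workflow
docker compose up --build
</code></pre>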
<p>The pattern I've landed on: use Compose for local development and CI test environments, use Bake in CI to produce the release images, and push those images into whatever deployment target your team uses (staging server, Kubernetes cluster, edge node). Compose gets you from code to running containers fast. Bake gets you from code to production-ready images with proper tags and attestations.</p>
<h2 id="heading-what-compose-is-not-an-honest-assessment">What Compose Is Not (An Honest Assessment)</h2>
<p>I've spent this entire article making the case that Compose has grown up. But I should also tell you where it falls short. I'd rather you hear it from me now than discover it the hard way in production.</p>
<p><strong>Compose is not a container orchestrator.</strong> It doesn't schedule work across multiple hosts. It doesn't do automatic failover. It won't give you rolling updates with zero downtime, and it has no concept of service mesh networking. If you need any of those things, you need Kubernetes, Nomad, or Docker Swarm (if you're still using it).</p>
<p><strong>Compose doesn't replace Helm or Kustomize.</strong> If you're deploying to Kubernetes, Compose files don't translate directly. Docker offers Compose Bridge to convert Compose files into Kubernetes manifests, but it's still experimental and won't handle complex Kubernetes-specific configurations like custom resource definitions or ingress rules.</p>
<p><strong>Compose doesn't handle secrets well in production.</strong> The secrets support exists, but it's limited compared to HashiCorp Vault, AWS Secrets Manager, or Kubernetes secrets. For anything beyond a staging environment, you'll want an external secrets management solution.</p>
<p>The sweet spot for modern Compose is clear: local development, CI/CD testing environments, single-node staging environments, and workloads where a single powerful machine (particularly for GPU work) is the right deployment target. Within that scope, Compose is excellent. Outside of it, you'll hit walls fast.</p>
<p>If you do run Compose in a staging or single-node production setup, a few more things are worth adding that I haven't covered here: <code>restart: unless-stopped</code> on every service so containers come back after a host reboot, a logging driver config so your logs go somewhere searchable instead of disappearing into <code>docker logs</code>, and a backup strategy for your named volumes. These aren't Compose-specific problems, but Compose won't solve them for you either.</p>
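<p>A minimal sketch of the first two on a single service (the image name is a placeholder, and <code>json-file</code> with rotation is just one option; a remote driver like <code>fluentd</code>, <code>syslog</code>, or <code>awslogs</code> is what actually gets your logs somewhere searchable):</p>
<pre><code class="language-yaml">services:
  api:
    image: myorg/api:latest
    restart: unless-stopped
    logging:
      driver: json-file        # swap for a remote driver in staging/production
      options:
        max-size: "10m"
        max-file: "3"
</code></pre>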
<h2 id="heading-a-practical-adoption-path">A Practical Adoption Path</h2>
<p>If you're currently working with a basic Compose setup and want to start using these features, here's the order I'd recommend. Each step is incremental, backward-compatible, and valuable on its own. You don't have to do all of this at once.</p>
<p><strong>Week 1: Add health checks and proper</strong> <code>depends_on</code> <strong>conditions.</strong> This alone will eliminate the most common frustration: services crashing on startup because their dependencies aren't ready yet. Start with your database and your main application service. Once those two are wired up with <code>condition: service_healthy</code>, you'll notice the difference immediately.</p>
<pre><code class="language-yaml">healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 5s
  timeout: 2s
  retries: 10
  start_period: 30s
</code></pre>
<p><strong>Week 2: Introduce profiles.</strong> Start by putting your monitoring stack behind a <code>monitoring</code> profile and your debug tools behind a <code>debug</code> profile. Then delete whatever extra Compose files you've been maintaining. Having one source of truth instead of four files that are almost-but-not-quite the same makes everything simpler.</p>
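<p>The change itself is tiny. Something like this, where the image names are only examples:</p>
<pre><code class="language-yaml">services:
  grafana:
    image: grafana/grafana:latest
    profiles: [monitoring]

  pgadmin:
    image: dpage/pgadmin4:latest
    profiles: [debug]
</code></pre>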
<p><strong>Week 3: Set up watch mode for your most-edited service.</strong> Pick the service where your developers spend the most time iterating. Get watch mode working there first. Once the team sees the difference (saving a file and seeing the change reflected in under a second) they'll ask for it on everything else.</p>
<p><strong>Week 4: Add resource limits.</strong> Define memory and CPU limits for every service. This prevents one runaway container from starving the rest and gives you a realistic preview of how your services behave under production constraints. It's also useful for catching memory leaks early.</p>
<pre><code class="language-yaml">deploy:
  resources:
    limits:
      memory: 512M
      cpus: "1.0"
</code></pre>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>Docker Compose in 2026 is not the same tool it was a few years ago. Profiles, watch mode, GPU support, proper dependency management, and Bake integration have turned it into something that can handle real, complex workloads, as long as those workloads fit on a single node.</p>
<p>It's not Kubernetes, and it shouldn't try to be. But for local development, CI pipelines, staging environments, and single-machine GPU workloads, it's become hard to argue against. If you've been dismissing Compose because of what it used to be, the current version deserves a second look.</p>
<p>If you found this useful, you can find me writing about DevOps, containers, and AIOps best practices on my blog.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build and Deploy a Multi-Agent AI System with Python and Docker ]]>
                </title>
                <description>
                    <![CDATA[ You wake up and open your laptop. Your browser has 27 tabs open, your inbox is overflowing with unread newsletters, and meeting notes are scattered across three apps. Sound familiar? Now imagine you h ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-and-deploy-multi-agent-ai-with-python-and-docker/</link>
                <guid isPermaLink="false">699c785540e1f055acbb8b6f</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ollama ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Mon, 23 Feb 2026 15:55:01 +0000</pubDate>
                <media:content url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/5fc16e412cae9c5b190b6cdd/6bd425e1-7427-4fe8-b1a7-80fff56102f7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You wake up and open your laptop. Your browser has 27 tabs open, your inbox is overflowing with unread newsletters, and meeting notes are scattered across three apps. Sound familiar?</p>
<p>Now imagine you had a team of specialized assistants that worked overnight — one to read your inputs, one to summarize the key facts, one to rank what matters most, and one to format everything into a clean daily brief waiting in your inbox.</p>
<p>That is exactly what this handbook walks you through building. You will create a multi-agent AI system where four Python-based agents each handle one job. You will containerize each agent with Docker so the whole thing runs reliably on any machine. And you will wire it all together with Docker Compose so you can launch the entire pipeline with a single command.</p>
<p>This handbook assumes you are comfortable reading Python code, but it does not assume you have used Docker before. If you have never written a Dockerfile or run a container, that is fine — the fundamentals are covered as we go.</p>
<p>By the end, you will have a working system that turns digital noise into an organized daily digest, and you will understand the patterns behind it well enough to adapt them to your own projects.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#what-is-a-multi-agent-system-and-why-build-one">What is a Multi-Agent System (and Why Build One)?</a></p>
<ul>
<li><p><a href="#how-traditional-scripts-work">How Traditional Scripts Work</a></p>
</li>
<li><p><a href="#how-ai-agents-are-different">How AI Agents are Different</a></p>
</li>
<li><p><a href="#why-use-multiple-agents-instead-of-one">Why Use Multiple Agents Instead of One?</a></p>
</li>
</ul>
</li>
<li><p><a href="#what-is-docker-and-why-does-it-matter-here">What is Docker (and Why Does It Matter Here)?</a></p>
<ul>
<li><p><a href="#the-environment-problem">The Environment Problem</a></p>
</li>
<li><p><a href="#how-docker-solves-this">How Docker Solves This</a></p>
</li>
<li><p><a href="#how-docker-layers-work">How Docker Layers Work</a></p>
</li>
<li><p><a href="#docker-vs-no-docker">Docker vs. No Docker</a></p>
</li>
</ul>
</li>
<li><p><a href="#how-to-plan-the-architecture">How to Plan the Architecture</a></p>
</li>
<li><p><a href="#prerequisites-and-environment-setup">Prerequisites and Environment Setup</a></p>
<ul>
<li><p><a href="#how-to-install-python">How to Install Python</a></p>
</li>
<li><p><a href="#how-to-install-docker">How to Install Docker</a></p>
</li>
<li><p><a href="#how-to-verify-your-setup">How to Verify Your Setup</a></p>
</li>
<li><p><a href="#how-to-set-up-the-project-structure">How to Set Up the Project Structure</a></p>
</li>
</ul>
</li>
<li><p><a href="#how-to-build-each-agent-step-by-step">How to Build Each Agent Step by Step</a></p>
<ul>
<li><p><a href="#the-ingestor-agent">The Ingestor Agent</a></p>
</li>
<li><p><a href="#the-summarizer-agent">The Summarizer Agent</a></p>
</li>
<li><p><a href="#the-prioritizer-agent">The Prioritizer Agent</a></p>
</li>
<li><p><a href="#the-formatter-agent">The Formatter Agent</a></p>
</li>
</ul>
</li>
<li><p><a href="#how-to-handle-secrets-and-api-keys">How to Handle Secrets and API Keys</a></p>
<ul>
<li><p><a href="#using-env-files-for-development">Using .env Files for Development</a></p>
</li>
<li><p><a href="#how-to-use-docker-secrets-for-production">How to Use Docker Secrets for Production</a></p>
</li>
</ul>
</li>
<li><p><a href="#how-to-orchestrate-everything-with-docker-compose">How to Orchestrate Everything with Docker Compose</a></p>
</li>
<li><p><a href="#how-to-run-the-pipeline">How to Run the Pipeline</a></p>
</li>
<li><p><a href="#how-to-test-the-pipeline">How to Test the Pipeline</a></p>
<ul>
<li><p><a href="#unit-tests">Unit Tests</a></p>
</li>
<li><p><a href="#integration-tests">Integration Tests</a></p>
</li>
</ul>
</li>
<li><p><a href="#how-to-add-logging-and-observability">How to Add Logging and Observability</a></p>
</li>
<li><p><a href="#cost-rate-limits-and-graceful-degradation">Cost, Rate Limits, and Graceful Degradation</a></p>
</li>
<li><p><a href="#security-and-privacy-considerations">Security and Privacy Considerations</a></p>
</li>
<li><p><a href="#how-to-use-a-local-llm-for-full-privacy-ollama">How to Use a Local LLM for Full Privacy (Ollama)</a></p>
</li>
<li><p><a href="#example-seed-data-and-expected-output">Example Seed Data and Expected Output</a></p>
</li>
<li><p><a href="#how-to-automate-daily-execution">How to Automate Daily Execution</a></p>
</li>
<li><p><a href="#how-to-use-cron-on-linux-or-macos">How to Use Cron on Linux or macOS</a></p>
</li>
<li><p><a href="#how-to-use-task-scheduler-on-windows">How to Use Task Scheduler on Windows</a></p>
</li>
<li><p><a href="#how-to-add-delivery-notifications">How to Add Delivery Notifications</a></p>
</li>
<li><p><a href="#troubleshooting-common-errors">Troubleshooting Common Errors</a></p>
</li>
<li><p><a href="#production-deployment-options">Production Deployment Options</a></p>
<ul>
<li><p><a href="#docker-swarm">Docker Swarm</a></p>
</li>
<li><p><a href="#kubernetes">Kubernetes</a></p>
</li>
</ul>
</li>
<li><p><a href="#cloud-platforms">Cloud Platforms</a></p>
</li>
<li><p><a href="#conclusion-and-next-steps">Conclusion and Next Steps</a></p>
</li>
</ul>
<h2 id="heading-what-is-a-multi-agent-system-and-why-build-one">What is a Multi-Agent System (and Why Build One)?</h2>
<h3 id="heading-how-traditional-scripts-work">How Traditional Scripts Work</h3>
<p>A traditional Python script follows a fixed path. It reads some input, processes it through a series of hard-coded steps, and writes the output. If the input format changes even slightly, the script often breaks. Think of it like a train on a track. Trains are fast and efficient, but they can only go where the rails take them. If the track is blocked, the train stops.</p>
<h3 id="heading-how-ai-agents-are-different">How AI Agents are Different</h3>
<p>An AI agent is more like a bus driver. It has a destination (a goal), but it can decide which route to take based on current conditions (the data). If one road is blocked, it finds another.</p>
<p>Agents typically follow a loop called the <strong>ReAct pattern</strong>, which stands for Reasoning plus Acting. At each step, the agent thinks about what to do, takes an action, observes the result, and decides whether it has reached its goal. If not, it loops back and tries again. If so, it finishes.</p>
<p>In practice, this means an LLM-based agent can handle messy, unpredictable input much better than a traditional script. If a newsletter changes its format, the summarizer agent can still extract the key points because it reasons about the content rather than parsing a rigid structure.</p>
<h3 id="heading-why-use-multiple-agents-instead-of-one">Why Use Multiple Agents Instead of One?</h3>
<p>You might wonder: why not just use one powerful agent that does everything? That approach is called the "God Model" pattern, and it has real problems. When you ask a single LLM to ingest data, summarize it, prioritize it, and format it all in one prompt, you are giving it too much to think about at once. LLMs have a limited context window and limited attention. The more tasks you pile on, the more likely the model is to hallucinate, skip steps, or produce inconsistent output.</p>
<p>A multi-agent system solves this through <strong>separation of concerns</strong>. Each agent has one narrow job. The Ingestor reads and combines raw files, with no LLM needed. The Summarizer calls the LLM with a focused prompt: just summarize this text. The Prioritizer scores lines by keyword with no LLM needed. And the Formatter writes Markdown output, also with no LLM.</p>
<p>This design has several advantages. Each agent is simpler to build, test, and debug. You can swap out the Summarizer for a better model without touching anything else. And you can scale individual agents independently — for example, running multiple Summarizers in parallel if you have a lot of input.</p>
<h2 id="heading-what-is-docker-and-why-does-it-matter-here">What is Docker (and Why Does It Matter Here)?</h2>
<h3 id="heading-the-environment-problem">The Environment Problem</h3>
<p>If you have ever shared a Python project with someone and heard "it does not work on my machine," you already understand the problem Docker solves. Every Python project depends on specific versions of Python itself, plus libraries like <code>openai</code>, <code>requests</code>, or <code>beautifulsoup4</code>. These dependencies live in your operating system's environment. When you install a new library or upgrade Python, you might break a different project that depends on the old version.</p>
<p>Virtual environments help, but they only isolate Python packages. They do not isolate the operating system, system libraries, or other tools your code might need. And they do not guarantee that someone else can recreate your exact environment. For a multi-agent system, this problem gets worse. Each agent might need different dependencies. If they share an environment, their dependencies can conflict.</p>
<h3 id="heading-how-docker-solves-this">How Docker Solves This</h3>
<p>Docker packages your code, its dependencies, and a minimal operating system into a single unit called a <strong>container</strong>. When you run that container, it behaves exactly the same way regardless of what machine it is running on — your laptop, a coworker's computer, or a cloud server. Think of a Docker container like a shipping container for software. The contents are sealed inside, protected from the outside environment.</p>
<p>There are a few key Docker concepts to understand:</p>
<p><strong>Image</strong> — A read-only template that contains your code, dependencies, and a minimal OS. You build an image from a Dockerfile. Think of it as a recipe.</p>
<p><strong>Container</strong> — A running instance of an image. When you "run" an image, Docker creates a container from it. Think of it as a dish made from the recipe.</p>
<p><strong>Dockerfile</strong> — A text file with instructions for building an image. It specifies the base OS, what to install, what code to copy in, and what command to run when the container starts.</p>
<p><strong>Volume</strong> — A way to share files between your computer and a container, or between multiple containers. Our agents will use a shared volume to pass data to each other.</p>
<p><strong>Docker Compose</strong> — A tool for defining and running multiple containers together. You describe all your containers in a single YAML file, and Compose handles building, networking, and ordering them.</p>
<h3 id="heading-how-docker-layers-work">How Docker Layers Work</h3>
<p>Docker builds images in layers. Each instruction in a Dockerfile creates a new layer. Docker caches these layers, so if a layer has not changed since the last build, Docker reuses the cached version instead of rebuilding it. This is why Dockerfiles are structured in a specific order: the base OS layer rarely changes, the dependency installation layer changes when <code>requirements.txt</code> changes, and the application code layer changes on every code edit. By putting dependency installation before the code copy, Docker only re-runs <code>pip install</code> when your requirements actually change, making rebuilds much faster — seconds instead of minutes.</p>
<h3 id="heading-docker-vs-no-docker">Docker vs. No Docker</h3>
<p>To be clear, you do not strictly need Docker for this tutorial. You can run all four agents as plain Python scripts. But without Docker you face dependency conflicts from a shared environment, manual process management for scaling, having to redo all setup on every new machine, complex orchestration for testing, and painful Python version management when one agent needs 3.8 and another needs 3.10. With Docker, each agent has its own isolated environment, you run multiple containers in parallel with one command, <code>docker compose up</code> produces identical results everywhere, and each container runs its own Python version independently.</p>
<p>For a personal project, either approach works. But if you ever want to share this system, deploy it to a server, or run it in the cloud, Docker makes the difference between "here is a README with 15 setup steps" and "run <code>docker compose up</code>."</p>
<h2 id="heading-how-to-plan-the-architecture">How to Plan the Architecture</h2>
<p>Before writing any code, it is worth mapping out how the pieces fit together. The full system consists of four agents arranged in a sequential pipeline, all orchestrated by Docker Compose. Data flows through the Ingestor Agent, the Summarizer Agent, the Prioritizer Agent, and the Formatter Agent in that order. Each agent reads from a shared volume, processes its input, writes the result, and exits. Docker Compose enforces execution order by waiting for each container to finish successfully before starting the next one.</p>
<p>This is a synchronous pipeline: agents run one at a time, in sequence. It is the simplest multi-agent pattern to implement and understand. For more complex systems, you could replace the shared volume with a message broker like Redis or RabbitMQ, which lets agents run asynchronously and react to events. But for this daily-digest use case, the sequential approach is exactly right.</p>
<p>In terms of responsibilities:</p>
<ul>
<li><p><strong>Ingestor</strong> — Reads and combines raw files from <code>/data/input/</code> into <code>ingested.txt</code>. No LLM required.</p>
</li>
<li><p><strong>Summarizer</strong> — Distills key points from <code>ingested.txt</code> into <code>summary.txt</code>. The only agent that requires an LLM.</p>
</li>
<li><p><strong>Prioritizer</strong> — Scores items by urgency keywords, turning <code>summary.txt</code> into <code>prioritized.txt</code>. No LLM.</p>
</li>
<li><p><strong>Formatter</strong> — Produces the final Markdown report, <code>daily_digest.md</code>. No LLM.</p>
</li>
</ul>
<p>Notice that only one of the four agents actually calls an LLM. The others are plain Python. This is intentional — you should only use an LLM when you need reasoning or language understanding. Everything else should be deterministic code. It is cheaper, faster, and more predictable.</p>
<h2 id="heading-prerequisites-and-environment-setup">Prerequisites and Environment Setup</h2>
<p>You need the following tools installed before starting:</p>
<ul>
<li><p><strong>Python</strong> 3.10 or higher — the language for the agents</p>
</li>
<li><p><strong>Docker Desktop</strong> (Engine 20.10+) — the container runtime</p>
</li>
<li><p><strong>Docker Compose</strong> v2 (included with Docker Desktop) — multi-container orchestration</p>
</li>
<li><p><strong>Git</strong> 2.30+ — version control</p>
</li>
<li><p><strong>OpenAI Python SDK</strong> (<code>openai &gt;= 1.0</code>) — LLM API access</p>
</li>
<li><p><strong>Redis or RabbitMQ</strong> (optional) — async message queuing</p>
</li>
<li><p><strong>PostgreSQL</strong> (optional) — persistent data storage</p>
</li>
</ul>
<h3 id="heading-how-to-install-python">How to Install Python</h3>
<p>Download Python from <a href="https://python.org/">python.org</a>. On Windows, check the "Add Python to PATH" box during installation. On macOS, you can use Homebrew:</p>
<pre><code class="language-bash">brew install python@3.12
</code></pre>
<p>On Linux (Ubuntu/Debian), use your package manager:</p>
<pre><code class="language-bash">sudo apt update &amp;&amp; sudo apt install python3 python3-pip
</code></pre>
<h3 id="heading-how-to-install-docker">How to Install Docker</h3>
<p>Docker Desktop is the easiest way to get started on Windows and macOS. Download it from <a href="https://docker.com/">docker.com</a> and follow the prompts. On Windows, Docker Desktop requires WSL2 — the installer will guide you through enabling it. On Linux, install Docker Engine directly:</p>
<pre><code class="language-bash"># Ubuntu/Debian
sudo apt update
sudo apt install docker.io docker-compose-v2
sudo usermod -aG docker $USER  # So you don't need sudo for docker commands
</code></pre>
<p>After installing, log out and back in for the group change to take effect.</p>
<h3 id="heading-how-to-verify-your-setup">How to Verify Your Setup</h3>
<p>Open your terminal and run these commands. Each should print a version number without errors:</p>
<pre><code class="language-bash">python --version        # Should show 3.10 or higher
docker --version        # Should show 20.10 or higher
docker compose version  # Should show v2.x
git --version           # Should show 2.30 or higher
</code></pre>
<p>If any command fails, go back to the installation step for that tool. The most common issue is that the command is not in your PATH.</p>
<h2 id="heading-how-to-set-up-the-project-structure">How to Set Up the Project Structure</h2>
<p>Each agent lives in its own directory with its own code, Dockerfile, and requirements file. This isolation means you can build, test, and update each agent independently. Create the following structure:</p>
<pre><code class="language-plaintext">multi-agent-digest/
├── agents/
│   ├── ingestor/
│   │   ├── app.py
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   ├── summarizer/
│   │   ├── app.py
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   ├── prioritizer/
│   │   ├── app.py
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   └── formatter/
│       ├── app.py
│       ├── Dockerfile
│       └── requirements.txt
├── data/
│   └── input/          # Your raw files go here
├── output/              # The final digest appears here
├── tests/               # Unit and integration tests
├── .env                 # API keys (gitignored!)
├── .gitignore
├── docker-compose.yml
└── README.md
</code></pre>
<p>You can create the folders quickly from the terminal:</p>
<pre><code class="language-bash">mkdir -p multi-agent-digest/agents/{ingestor,summarizer,prioritizer,formatter}
mkdir -p multi-agent-digest/{data/input,output,tests}
cd multi-agent-digest
</code></pre>
<h2 id="heading-how-to-build-each-agent-step-by-step">How to Build Each Agent Step by Step</h2>
<p>Every agent follows the same simple pattern: read an input file from the shared volume, do its job, and write an output file. This consistency makes the system easy to understand and extend.</p>
<h3 id="heading-the-ingestor-agent">The Ingestor Agent</h3>
<p>The Ingestor is the entry point of the pipeline. Its job is to read all text files from the input folder and combine them into a single file that the Summarizer can process. This is the simplest agent — no external libraries, no API calls, just file reading and writing.</p>
<p><code>agents/ingestor/app.py</code></p>
<pre><code class="language-python">import os
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
logger = logging.getLogger("ingestor")

INPUT_DIR = "/data/input"
OUTPUT_FILE = "/data/ingested.txt"

def ingest():
    content = ""
    files_processed = 0
    for filename in sorted(os.listdir(INPUT_DIR)):
        filepath = os.path.join(INPUT_DIR, filename)
        if os.path.isfile(filepath):
            try:
                with open(filepath, "r", encoding="utf-8") as f:
                    content += f"\n--- {filename} ---\n"
                    content += f.read()
                    content += "\n"
                    files_processed += 1
            except Exception as e:
                logger.error(f"Failed to read {filename}: {e}")

    if files_processed == 0:
        logger.warning("No input files found in /data/input/")

    with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
        out.write(content)
    logger.info(f"Ingested {files_processed} files -&gt; {OUTPUT_FILE}")

if __name__ == "__main__":
    ingest()
</code></pre>
<p>The <code>logging.basicConfig</code> block sets up structured logging. Every agent uses the same log format, so when Docker Compose runs them together, you get a clean, consistent timeline. The <code>sorted(os.listdir())</code> call ensures files are processed in alphabetical order — without it, the order depends on the filesystem and can vary between machines. The <code>try/except</code> block around each file read means a single corrupted file will not crash the entire pipeline. And if no files are found at all, the agent writes an empty output file rather than crashing, so downstream agents can handle empty input gracefully.</p>
<p><code>agents/ingestor/Dockerfile</code></p>
<pre><code class="language-dockerfile">FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]
</code></pre>
<p><code>FROM python:3.10-slim</code> starts with a minimal Linux image that has Python pre-installed. The <code>-slim</code> variant is about 120 MB versus 900 MB for the full image. <code>WORKDIR /app</code> sets the working directory inside the container. <code>COPY requirements.txt</code> and <code>RUN pip install</code> handle dependencies at build time, not runtime. <code>COPY app.py</code> copies the application code last because it changes most often, and Docker caches previous layers. <code>CMD</code> specifies the command to run when the container starts.</p>
<p>Since the Ingestor uses only standard library modules, its <code>requirements.txt</code> can be empty:</p>
<pre><code class="language-plaintext"># No external dependencies needed
</code></pre>
<h3 id="heading-the-summarizer-agent">The Summarizer Agent</h3>
<p>The Summarizer is the most complex agent in the pipeline. It reads the ingested text and calls an LLM API to produce a concise summary. This is the only agent that makes a network call, which means it is the only one that can fail due to external factors: the API might be down, you might hit rate limits, or your key might be invalid.</p>
<p><code>agents/summarizer/app.py</code>:</p>
<pre><code class="language-python">import os
import logging
import time
from openai import OpenAI, RateLimitError, APIError

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
logger = logging.getLogger("summarizer")

INPUT_FILE = "/data/ingested.txt"
OUTPUT_FILE = "/data/summary.txt"

client = OpenAI()  # reads OPENAI_API_KEY from environment

SYSTEM_PROMPT = (
    "You are a helpful assistant that summarizes long text "
    "into key bullet points. Each bullet should be one "
    "concise sentence capturing a core insight."
)

MAX_RETRIES = 3
RETRY_DELAY = 5  # seconds

def summarize(text, retries=MAX_RETRIES):
    """Call the LLM API with retry logic for rate limits."""
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": text[:8000]}
                ],
                max_tokens=1000,
                temperature=0.3,
            )
            return response.choices[0].message.content
        except RateLimitError:
            wait = RETRY_DELAY * (attempt + 1)
            logger.warning(f"Rate limited. Retrying in {wait}s...")
            time.sleep(wait)
        except APIError as e:
            logger.error(f"API error: {e}")
            raise
    raise RuntimeError("Max retries exceeded for LLM API call")

def main():
    with open(INPUT_FILE, "r", encoding="utf-8") as f:
        raw_text = f.read()

    if not raw_text.strip():
        logger.warning("Empty input. Writing fallback summary.")
        summary = "No content to summarize."
    else:
        try:
            summary = summarize(raw_text)
        except Exception as e:
            logger.error(f"Summarization failed: {e}")
            summary = f"Summarization failed: {e}"

    with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
        f.write(summary)
    logger.info(f"Summary written to {OUTPUT_FILE}")

if __name__ == "__main__":
    main()
</code></pre>
<p>The <code>OpenAI()</code> client automatically reads the <code>OPENAI_API_KEY</code> environment variable — you do not need to pass the key explicitly in code, which is both cleaner and safer. The <code>text[:8000]</code> slice limits how much text is sent to the API. Sending fewer tokens means faster responses and lower cost. For production, you would want smarter chunking that splits on sentence or paragraph boundaries rather than a raw character count.</p>
<p><strong>Temperature 0.3</strong> makes the output more focused and deterministic, which is ideal for summarization. The retry logic catches <code>RateLimitError</code> specifically and waits longer on each attempt (5, 10, then 15 seconds), a simple <strong>linear backoff</strong>; a true exponential backoff would double the delay each time instead. Other API errors raise immediately because retrying them will not help. If the input is empty or the API fails completely, the agent writes a fallback message instead of crashing, so the downstream agents can still run.</p>
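<p>If you want to replace the raw character slice with something smarter, here is a rough sketch of paragraph-boundary chunking. The function name and the 8,000-character limit are only illustrative; a production version would count tokens rather than characters and handle single paragraphs that exceed the limit:</p>
<pre><code class="language-python">def chunk_text(text, max_chars=8000):
    """Split text on blank lines so no chunk exceeds max_chars (rough sketch)."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would overflow the limit
        if current and len(current) + len(paragraph) + 2 &gt; max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks
</code></pre>
<p>You would then summarize each chunk separately and combine the partial summaries, at the cost of extra API calls.</p>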
<p><code>agents/summarizer/requirements.txt</code>:</p>
<pre><code class="language-plaintext">openai&gt;=1.0
</code></pre>
<p>The Dockerfile is identical to the Ingestor's.</p>
<h3 id="heading-the-prioritizer-agent">The Prioritizer Agent</h3>
<p>The Prioritizer takes the LLM-generated summary and scores each line based on urgency keywords. This is a rule-based agent — no LLM call needed. It is fast, deterministic, and free.</p>
<p><code>agents/prioritizer/app.py</code>:</p>
<pre><code class="language-python">import os
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
logger = logging.getLogger("prioritizer")

INPUT_FILE = "/data/summary.txt"
OUTPUT_FILE = "/data/prioritized.txt"

PRIORITY_KEYWORDS = [
    "urgent", "today", "asap", "important",
    "deadline", "critical", "action required"
]

def score_line(line):
    """Count how many priority keywords appear in a line."""
    lower = line.lower()
    return sum(1 for kw in PRIORITY_KEYWORDS if kw in lower)

def prioritize():
    with open(INPUT_FILE, "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]

    scored = [(line, score_line(line)) for line in lines]
    scored.sort(key=lambda x: x[1], reverse=True)

    with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
        for line, score in scored:
            out.write(f"[{score}] {line}\n")

    logger.info(f"Prioritized {len(scored)} items -&gt; {OUTPUT_FILE}")

if __name__ == "__main__":
    prioritize()
</code></pre>
<p>The scoring function counts how many priority keywords appear in each line. A line containing "urgent deadline" scores 2, and a line with no keywords scores 0. The scored lines are sorted in descending order, so the most urgent items appear first. Each line is prefixed with its score in brackets, like <code>[2] Urgent: quarterly report due today</code>. In a more advanced system, you could replace this keyword scorer with an LLM-based ranker, but for a daily digest, simple keyword matching works surprisingly well.</p>
<p>This agent has no pip dependencies, so the Dockerfile skips the requirements step:</p>
<p><code>agents/prioritizer/Dockerfile</code>:</p>
<pre><code class="language-dockerfile">FROM python:3.10-slim
WORKDIR /app
COPY app.py .
CMD ["python", "app.py"]
</code></pre>
<h3 id="heading-the-formatter-agent">The Formatter Agent</h3>
<p>The Formatter is the final agent in the pipeline. It reads the scored lines and writes a clean Markdown document to the output directory.</p>
<p><code>agents/formatter/app.py</code>:</p>
<pre><code class="language-python">import os
import logging
from datetime import datetime

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)
logger = logging.getLogger("formatter")

INPUT_FILE = "/data/prioritized.txt"
OUTPUT_FILE = "/output/daily_digest.md"

def format_to_markdown():
    with open(INPUT_FILE, "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]

    today = datetime.now().strftime('%Y-%m-%d')

    with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
        out.write("# Your Daily AI Digest\n\n")
        out.write(f"**Date:** {today}\n\n")
        out.write("## Top Insights\n\n")
        for line in lines:
            if '] ' in line:
                score = line.split(']')[0][1:]
                content = line.split('] ', 1)[1]
                out.write(f"- **Priority {score}**: {content}\n")
            else:
                out.write(f"- {line}\n")

    logger.info(f"Digest written to {OUTPUT_FILE}")

if __name__ == "__main__":
    format_to_markdown()
</code></pre>
<p>Notice that the Formatter writes to <code>/output</code> instead of <code>/data</code>. This is a separate volume mount in Docker Compose. The <code>/data</code> volume is internal plumbing that agents use to communicate, while the <code>/output</code> volume maps to a folder on your host machine where you can access the final result. The <code>split('] ', 1)</code> with <code>maxsplit=1</code> ensures that bracket characters inside the actual content do not break the parsing.</p>
<p>The Dockerfile is the same as the Prioritizer's (no external dependencies).</p>
<h2 id="heading-how-to-handle-secrets-and-api-keys">How to Handle Secrets and API Keys</h2>
<blockquote>
<p>⚠️ <strong>Warning:</strong> Never commit API keys or secrets to version control. A leaked OpenAI key can rack up thousands of dollars in charges before you notice.</p>
</blockquote>
<h3 id="heading-using-env-files-for-development">Using .env Files for Development</h3>
<p>Create a <code>.env</code> file in your project root:</p>
<pre><code class="language-plaintext"># .env -- DO NOT COMMIT THIS FILE
OPENAI_API_KEY=sk-your-key-here
</code></pre>
<p>Then immediately add it to your <code>.gitignore</code>:</p>
<pre><code class="language-plaintext"># .gitignore
.env
output/
data/ingested.txt
data/summary.txt
data/prioritized.txt
__pycache__/
*.pyc
</code></pre>
<p>Docker Compose reads <code>.env</code> files automatically when it starts. In your <code>docker-compose.yml</code>, you reference the variable with <code>${OPENAI_API_KEY}</code>, and Compose substitutes the real value at runtime. The key never appears in your Dockerfile, your code, or your version history.</p>
<h3 id="heading-how-to-use-docker-secrets-for-production">How to Use Docker Secrets for Production</h3>
<p>For production deployments on Docker Swarm or Kubernetes, environment variables are visible in process listings and inspect commands. Docker secrets are more secure:</p>
<pre><code class="language-bash"># Create the secret
echo "sk-your-key-here" | docker secret create openai_key -
</code></pre>
<pre><code class="language-yaml"># Reference in docker-compose.yml (Swarm mode only)
services:
  summarizer:
    secrets:
      - openai_key

secrets:
  openai_key:
    external: true
</code></pre>
<p>The secret gets mounted as a read-only file at <code>/run/secrets/openai_key</code> inside the container. Your code reads the key from that file instead of from an environment variable.</p>
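<p>One way to support both setups in the Summarizer is a small helper that checks for the secret file first and falls back to the environment variable. This is a sketch, not part of the code above:</p>
<pre><code class="language-python">import os
from openai import OpenAI

def read_api_key():
    """Prefer a mounted Docker secret, fall back to the environment variable."""
    secret_path = "/run/secrets/openai_key"
    if os.path.exists(secret_path):
        with open(secret_path, "r", encoding="utf-8") as f:
            return f.read().strip()
    return os.environ.get("OPENAI_API_KEY")

client = OpenAI(api_key=read_api_key())
</code></pre>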
<h2 id="heading-how-to-orchestrate-everything-with-docker-compose">How to Orchestrate Everything with Docker Compose</h2>
<p>With all four agents built, Docker Compose ties them together. It builds each container, mounts the shared volumes, passes environment variables, and enforces the correct execution order.</p>
<p><code>docker-compose.yml</code>:</p>
<pre><code class="language-yaml">version: "3.9"

services:
  ingestor:
    build: ./agents/ingestor
    container_name: agent_ingestor
    volumes:
      - ./data:/data
    restart: "no"

  summarizer:
    build: ./agents/summarizer
    container_name: agent_summarizer
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    depends_on:
      ingestor:
        condition: service_completed_successfully
    volumes:
      - ./data:/data
    deploy:
      resources:
        limits:
          memory: 512M
    restart: "no"

  prioritizer:
    build: ./agents/prioritizer
    container_name: agent_prioritizer
    depends_on:
      summarizer:
        condition: service_completed_successfully
    volumes:
      - ./data:/data
    restart: "no"

  formatter:
    build: ./agents/formatter
    container_name: agent_formatter
    depends_on:
      prioritizer:
        condition: service_completed_successfully
    volumes:
      - ./data:/data
      - ./output:/output
    restart: "no"
</code></pre>
<p>The <code>depends_on</code> with <code>condition: service_completed_successfully</code> is the key to the sequential pipeline. This setting (available in Compose v2) tells Docker to wait until the previous container exits with a zero exit code before starting the next one. Without this condition, <code>depends_on</code> only waits for the container to <em>start</em>, not to <em>finish</em> — which would cause race conditions where the Summarizer tries to read a file the Ingestor has not written yet.</p>
<p>The <strong>volume mounts</strong> (<code>./data:/data</code>) map your local data folder into each container. All agents share this volume, which is how they pass files to each other. The Formatter also gets <code>./output:/output</code> so the final digest lands on your host machine. The <strong>memory limit</strong> of 512M on the Summarizer prevents it from consuming too much RAM. And <code>restart: "no"</code> ensures Docker does not restart the agents after they finish, since they are batch jobs.</p>
<h3 id="heading-how-to-run-the-pipeline">How to Run the Pipeline</h3>
<pre><code class="language-bash">docker compose up --build
</code></pre>
<p>The <code>--build</code> flag tells Compose to rebuild the images before running. You will see structured logs from each agent in sequence:</p>
<pre><code class="language-plaintext">agent_ingestor    | 2025-01-20 07:00:01 [INFO] ingestor: Ingested 3 files
agent_summarizer  | 2025-01-20 07:00:04 [INFO] summarizer: Summary written
agent_prioritizer | 2025-01-20 07:00:05 [INFO] prioritizer: Prioritized 8 items
agent_formatter   | 2025-01-20 07:00:05 [INFO] formatter: Digest written
</code></pre>
<p>When all four containers finish, open <code>output/daily_digest.md</code> to see your morning brief.</p>
<h2 id="heading-how-to-test-the-pipeline">How to Test the Pipeline</h2>
<h3 id="heading-unit-tests">Unit Tests</h3>
<p>Because each agent's core logic is a plain Python function, you can test it in isolation without Docker.</p>
<p><code>tests/test_prioritizer.py</code></p>
<pre><code class="language-python">import sys
sys.path.insert(0, 'agents/prioritizer')
from app import score_line

def test_urgent_keyword_scores_one():
    assert score_line("This is urgent") == 1

def test_multiple_keywords_stack():
    assert score_line("Urgent and important deadline") == 3

def test_no_keywords_scores_zero():
    assert score_line("Regular project update") == 0

def test_scoring_is_case_insensitive():
    assert score_line("URGENT DEADLINE ASAP") == 3
</code></pre>
<p>Run the tests with pytest:</p>
<pre><code class="language-bash">pip install pytest
python -m pytest tests/ -v
</code></pre>
<p>Writing tests for each agent's core function means you can catch bugs before you build any Docker images, saving a lot of time compared to debugging inside running containers.</p>
<h3 id="heading-integration-tests">Integration Tests</h3>
<p>To test the full pipeline end-to-end, create known input files and verify the expected output:</p>
<pre><code class="language-bash"># Create test data
mkdir -p data/input
echo "Urgent: quarterly report due today" &gt; data/input/test.txt
echo "Regular standup notes, no blockers" &gt;&gt; data/input/test.txt

# Run the pipeline
docker compose up --build

# Verify the output exists and contains expected content
test -f output/daily_digest.md &amp;&amp; echo "File exists: PASS" || echo "File missing: FAIL"
grep -q "Priority" output/daily_digest.md &amp;&amp; echo "Content check: PASS" || echo "Content check: FAIL"
</code></pre>
<h2 id="heading-how-to-add-logging-and-observability">How to Add Logging and Observability</h2>
<p>Every agent uses Python's <code>logging</code> module with a consistent format. When Docker Compose runs all four containers, it interleaves their logs with container name prefixes, giving you a unified timeline of the entire pipeline.</p>
<p>For production systems, consider switching to JSON-formatted logs. They are easier to parse with log aggregation tools like the ELK Stack, Grafana Loki, or AWS CloudWatch:</p>
<pre><code class="language-python">import json
import logging

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "agent": record.name,
            "message": record.getMessage(),
        })
</code></pre>
<p>To use this formatter, replace the <code>basicConfig</code> call with a handler:</p>
<pre><code class="language-python">handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("summarizer")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
</code></pre>
<p>The most useful metrics to track include the number of files ingested per run, Summarizer latency (time from API call to response), LLM token usage for cost tracking, the number of errors and retries per agent, and whether <code>daily_digest.md</code> was successfully generated. A simple approach for personal use is to write a JSON metrics file alongside the digest in the output directory. For team or production use, consider adding Prometheus metrics or sending data to a monitoring service.</p>
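<p>A sketch of that simple approach, writing a metrics file next to the digest at the end of a run. The field names are just examples, and you would still need to pass the values between agents, for instance through another file on the shared volume:</p>
<pre><code class="language-python">import json
from datetime import datetime

def write_metrics(files_ingested, summarizer_latency_s, tokens_used, errors):
    """Drop a per-run metrics file next to the digest for later inspection."""
    metrics = {
        "run_at": datetime.now().isoformat(),
        "files_ingested": files_ingested,
        "summarizer_latency_s": summarizer_latency_s,
        "llm_tokens_used": tokens_used,
        "errors": errors,
        "digest_generated": True,
    }
    with open("/output/metrics.json", "w", encoding="utf-8") as f:
        json.dump(metrics, f, indent=2)
</code></pre>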
<h2 id="heading-cost-rate-limits-and-graceful-degradation">Cost, Rate Limits, and Graceful Degradation</h2>
<p>The Summarizer is the only agent that calls a paid API. Here is what you can expect to pay:</p>
<table style="min-width:100px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><th><p>Model</p></th><th><p>Input Cost</p></th><th><p>Output Cost</p></th><th><p>Cost per Daily Run</p></th></tr><tr><td><p><code>gpt-4o-mini</code></p></td><td><p>$0.15 / 1M tokens</p></td><td><p>$0.60 / 1M tokens</p></td><td><p>Less than $0.01</p></td></tr><tr><td><p><code>gpt-4o</code></p></td><td><p>$2.50 / 1M tokens</p></td><td><p>$10.00 / 1M tokens</p></td><td><p>$0.02 to $0.10</p></td></tr><tr><td><p>Local model (Ollama)</p></td><td><p>Free (uses your hardware)</p></td><td><p>Free</p></td><td><p>$0.00</p></td></tr></tbody></table>

<p>For a daily personal digest processing a few thousand tokens of input, <code>gpt-4o-mini</code> costs less than a penny per run. That works out to roughly three dollars per year.</p>
<p>To protect against unexpected bills, set a monthly spending cap in your OpenAI dashboard. You can also set per-minute rate limits to prevent runaway usage if a bug causes repeated API calls.</p>
<p>Beyond the retry logic already built into the Summarizer, you can cache LLM responses so that if the same input text appears again you reuse the previous summary instead of calling the API. Use the cheapest model that gives acceptable results — for summarization, <code>gpt-4o-mini</code> usually works as well as <code>gpt-4o</code> at a fraction of the cost. And batch requests when possible by combining many small texts into one API call.</p>
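<p>If you want to try the caching idea, one minimal approach is to key an on-disk cache by a hash of the input text. This is a sketch rather than part of the Summarizer above, and the cache directory is an assumption:</p>
<pre><code class="language-python">import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("output/.summary_cache")  # assumed location on the shared volume

def cached_summary(text, summarize_fn):
    """Reuse a stored summary if this exact text was summarized before."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["summary"]
    summary = summarize_fn(text)  # your existing API-calling function
    cache_file.write_text(json.dumps({"summary": summary}))
    return summary
</code></pre>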
<p>The Summarizer already writes a fallback message when the API fails. This is the most important form of graceful degradation: the pipeline keeps running, and you get a less useful digest instead of nothing at all. If the digest is critical for your workflow, add an alerting step — for example, you could extend the Formatter to send a Slack notification when the Summarizer falls back.</p>
<h2 id="heading-security-and-privacy-considerations">Security and Privacy Considerations</h2>
<p>When you feed personal data like emails, meeting notes, and private newsletters into an LLM, you need to think carefully about where that data goes.</p>
<p>Text you send to OpenAI or similar providers leaves your machine and is processed on their servers. As of early 2025, OpenAI's API does not use submitted data for model training by default, but policies can change. Always check your provider's current data retention and usage policies. If your input contains personally identifiable information like names, email addresses, or phone numbers, consider stripping it before calling the API, or use a local model.</p>
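<p>For basic redaction, a couple of regular expressions can catch the most common identifiers before the text leaves your machine. This is a rough sketch only; patterns like these miss plenty of cases and are no substitute for a dedicated PII-detection library:</p>
<pre><code class="language-python">import re

def strip_pii(text):
    """Redact common PII patterns (email addresses, phone numbers) before an API call."""
    # Email addresses
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)
    # Phone numbers such as 555-123-4567 or (555) 123 4567
    text = re.sub(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b", "[PHONE]", text)
    return text
</code></pre>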
<p>The intermediate files created during the pipeline (<code>ingested.txt</code>, <code>summary.txt</code>, <code>prioritized.txt</code>) contain processed versions of your raw input. For personal use, keep them around for debugging and delete them manually. For automated pipelines, add a cleanup step that deletes intermediate files after the digest is generated. If you operate in the EU, review GDPR requirements around data minimization, right to deletion, and records of processing.</p>
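<p>A cleanup step can be as small as a guarded shell command that runs after the pipeline finishes. The paths below assume the intermediate files sit on the shared volume next to the digest; adjust them to wherever your pipeline actually writes them:</p>
<pre><code class="language-bash"># Delete intermediate files only if the digest was actually produced
if [ -f output/daily_digest.md ]; then
  rm -f output/ingested.txt output/summary.txt output/prioritized.txt
fi
</code></pre>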
<p>To secure your containers, use minimal base images like <code>python:3.10-slim</code> to reduce the attack surface, run containers as a non-root user by adding a <code>USER</code> directive to your Dockerfiles, update base images regularly (at least monthly) to pick up security patches, and scan your images for vulnerabilities using <code>docker scout</code> or Trivy.</p>
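<p>Adding a non-root user takes only a couple of Dockerfile lines. This is a generic sketch rather than a copy of the Dockerfiles used earlier in this handbook; the user name is arbitrary:</p>
<pre><code class="language-dockerfile">FROM python:3.10-slim

# Create an unprivileged user and run everything as that user
RUN useradd --create-home --shell /usr/sbin/nologin appuser
USER appuser

WORKDIR /home/appuser/app
</code></pre>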
<h2 id="heading-how-to-use-a-local-llm-for-full-privacy-ollama">How to Use a Local LLM for Full Privacy (Ollama)</h2>
<p>If you want to keep all data on your machine and avoid sending anything to external APIs, you can swap the OpenAI API for a local model running through <strong>Ollama</strong>. Ollama lets you run open-source LLMs locally, handling model weight downloads, memory management, and serving an API.</p>
<p>To set up Ollama:</p>
<pre><code class="language-bash"># Install Ollama (macOS or Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (llama3 is a good general-purpose choice)
ollama pull llama3

# Verify it is running
ollama list
</code></pre>
<p>Replace the OpenAI API call in the Summarizer with a request to Ollama's local API:</p>
<pre><code class="language-python">import requests

def summarize_locally(text):
    """Call a local Ollama instance from inside a Docker container."""
    url = "http://host.docker.internal:11434/api/generate"
    payload = {
        "model": "llama3",
        "prompt": (
            "Summarize the following text into key "
            f"bullet points:\n\n{text}"
        ),
        "stream": False
    }
    try:
        resp = requests.post(url, json=payload, timeout=120)
        resp.raise_for_status()
        return resp.json().get('response', 'No response')
    except requests.exceptions.RequestException as e:
        return f"Ollama error: {e}"
</code></pre>
<p>The <code>host.docker.internal</code> hostname lets a container communicate with services running on the host machine. Ollama runs on your host (not inside a container), so this is how the Summarizer reaches it.</p>
<blockquote>
<p><strong>Note:</strong> On Linux, <code>host.docker.internal</code> may not resolve by default. Add this to your <code>docker-compose.yml</code> under the summarizer service: <code>extra_hosts: ["host.docker.internal:host-gateway"]</code></p>
</blockquote>
<p>Local models are slower than cloud APIs and require decent hardware (at least 8 GB of RAM for smaller models, 16 GB or more for larger ones). But they are free, fully private, and work without an internet connection.</p>
<h2 id="heading-example-seed-data-and-expected-output">Example Seed Data and Expected Output</h2>
<p>To test the full pipeline without real newsletters, create these sample input files:</p>
<p><code>data/input/newsletter_ai.txt</code>:</p>
<pre><code class="language-plaintext">AI Weekly Roundup - January 2025
OpenAI released a new reasoning model this week.
URGENT: New EU AI Act regulations take effect in March.
Google announced updates to their Gemini model family.
A startup raised $50M for AI-powered code review tools.
</code></pre>
<p><code>data/input/meeting_notes.txt</code>:</p>
<pre><code class="language-plaintext">Team Standup Notes - Monday
IMPORTANT: Deadline for Q1 report is this Friday.
Action required: Review the updated API documentation.
Sprint velocity is on track. No blockers reported.
</code></pre>
<p>Expected output in <code>output/daily_digest.md</code>:</p>
<pre><code class="language-markdown"># Your Daily AI Digest

**Date:** 2025-01-20

## Top Insights

- **Priority 3**: IMPORTANT: Deadline for Q1 report due Friday
- **Priority 2**: URGENT: New EU AI Act regulations in March
- **Priority 1**: Action required: Review the updated API docs
- **Priority 0**: OpenAI released a new reasoning model
- **Priority 0**: Sprint velocity is on track
</code></pre>
<p>The exact summary text will vary depending on your LLM model and settings, but the structure and priority ordering should remain consistent.</p>
<h2 id="heading-how-to-automate-daily-execution">How to Automate Daily Execution</h2>
<p>Now that the pipeline works end-to-end with a single command, you can schedule it to run automatically every morning.</p>
<h3 id="heading-how-to-use-cron-on-linux-or-macos">How to Use Cron on Linux or macOS</h3>
<p>Open your crontab with <code>crontab -e</code> and add this line to run the pipeline every day at 7:00 AM:</p>
<pre><code class="language-bash">0 7 * * * cd /path/to/multi-agent-digest &amp;&amp; docker compose up --build &gt;&gt; cron.log 2&gt;&amp;1
</code></pre>
<p>The <code>&gt;&gt; cron.log 2&gt;&amp;1</code> part redirects all output (including errors) to a log file so you can check it later. Make sure your machine is running at the scheduled time and Docker Desktop is started.</p>
<h3 id="heading-how-to-use-task-scheduler-on-windows">How to Use Task Scheduler on Windows</h3>
<p>Open Task Scheduler and create a new task. Under "Actions," set the program to:</p>
<pre><code class="language-bash">wsl -e bash -c 'cd /mnt/c/path/to/multi-agent-digest &amp;&amp; docker compose up --build'
</code></pre>
<p>Set the trigger to fire every morning at your preferred time.</p>
<h3 id="heading-how-to-add-delivery-notifications">How to Add Delivery Notifications</h3>
<p>For the digest to be truly useful, you want it delivered to you rather than sitting in a folder. Here are three options:</p>
<p><strong>Email</strong> — Extend the Formatter to send the digest via Python's <code>smtplib</code> module. You will need SMTP credentials for a service like Gmail, SendGrid, or Amazon SES.</p>
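<p>As a minimal sketch of the email option, the standard library is enough once you have SMTP credentials. The host, port, and environment variable names below are placeholders:</p>
<pre><code class="language-python">import os
import smtplib
from email.message import EmailMessage
from pathlib import Path

def email_digest(digest_path="output/daily_digest.md"):
    """Send the finished digest as a plain-text email (settings are placeholders)."""
    msg = EmailMessage()
    msg["Subject"] = "Your Daily AI Digest"
    msg["From"] = os.environ["DIGEST_FROM"]
    msg["To"] = os.environ["DIGEST_TO"]
    msg.set_content(Path(digest_path).read_text())

    with smtplib.SMTP_SSL("smtp.example.com", 465) as smtp:
        smtp.login(os.environ["SMTP_USER"], os.environ["SMTP_PASSWORD"])
        smtp.send_message(msg)
</code></pre>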
<p><strong>Slack</strong> — Create an incoming webhook in your Slack workspace and POST the digest as a message. This takes about 10 lines of code.</p>
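<p>The Slack version really is short. This sketch assumes you have already created an incoming webhook and stored its URL in a <code>SLACK_WEBHOOK_URL</code> environment variable:</p>
<pre><code class="language-python">import os
from pathlib import Path

import requests

def post_digest_to_slack(digest_path="output/daily_digest.md"):
    """POST the digest text to a Slack incoming webhook."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed environment variable
    text = Path(digest_path).read_text()
    resp = requests.post(webhook_url, json={"text": text}, timeout=30)
    resp.raise_for_status()
</code></pre>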
<p><strong>Notion or Obsidian</strong> — Use their APIs to create a new page or note with the digest content each morning.</p>
<h2 id="heading-troubleshooting-common-errors">Troubleshooting Common Errors</h2>
<p><strong>Container exits with OOM error</strong> — Large files or LLM processing are exceeding memory. Increase the memory limit in <code>docker-compose.yml</code> under <code>deploy &gt; resources &gt; limits &gt; memory</code>. Try <code>1G</code>.</p>
<p><strong>Rate limit errors from OpenAI</strong> — The retry logic handles temporary rate limits automatically. Check your OpenAI dashboard for usage caps.</p>
<p><code>depends_on</code> <strong>does not wait for completion</strong> — Make sure you are using <code>condition: service_completed_successfully</code>, which requires Docker Compose v2.</p>
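<p>For reference, the working form looks like this in <code>docker-compose.yml</code> (the service names here are illustrative):</p>
<pre><code class="language-yaml">services:
  summarizer:
    depends_on:
      ingestor:
        condition: service_completed_successfully
</code></pre>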
<p><strong>Permission denied on</strong> <code>/output</code> — Volume mount permissions mismatch. Run <code>chmod -R 777 ./output</code> on the host, or add a <code>USER</code> directive to your Dockerfiles.</p>
<p><code>OPENAI_API_KEY</code> <strong>not found</strong> — The <code>.env</code> file may be missing or not in the right directory. Create <code>.env</code> in the same folder as <code>docker-compose.yml</code> and verify with <code>docker compose config</code>.</p>
<p><strong>Cannot reach Ollama from container</strong> — <code>host.docker.internal</code> may not be resolving on Linux. Add <code>extra_hosts: ["host.docker.internal:host-gateway"]</code> to the service in <code>docker-compose.yml</code>.</p>
<h2 id="heading-production-deployment-options">Production Deployment Options</h2>
<p>The <code>docker compose up</code> approach works well for personal use and development. When you are ready to deploy to a server or the cloud, here are your main options.</p>
<h3 id="heading-docker-swarm">Docker Swarm</h3>
<p>Docker Swarm is the simplest step up from Compose. It lets you deploy across multiple machines with minimal changes to your existing Compose file:</p>
<pre><code class="language-bash">docker swarm init
docker stack deploy -c docker-compose.yml morning-brief
</code></pre>
<h3 id="heading-kubernetes">Kubernetes</h3>
<p>For production at scale, Kubernetes gives you more control over scheduling, scaling, and fault tolerance. Use Kubernetes <strong>Jobs</strong> (not Deployments) for batch agents that run once and exit. Set resource requests and limits on each container so the cluster scheduler can allocate resources efficiently. Store API keys in <strong>Kubernetes Secrets</strong>, and use <strong>CronJobs</strong> for scheduled daily execution — they work like cron but are managed by the cluster.</p>
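<p>As a rough sketch, a CronJob for a single agent might look like the following. The image and Secret names are placeholders, and a real deployment would still need to chain the four agents (for example with separate Jobs or an init-container pattern):</p>
<pre><code class="language-yaml">apiVersion: batch/v1
kind: CronJob
metadata:
  name: morning-brief-ingestor
spec:
  schedule: "0 7 * * *"            # every day at 7:00 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ingestor
              image: registry.example.com/ingestor:latest   # placeholder image
              envFrom:
                - secretRef:
                    name: openai-credentials                # placeholder Secret
              resources:
                requests:
                  memory: "256Mi"
                limits:
                  memory: "512Mi"
</code></pre>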
<h3 id="heading-cloud-platforms">Cloud Platforms</h3>
<p>All major cloud providers offer managed container services that can run this pipeline:</p>
<p><strong>AWS</strong> — ECS Fargate with scheduled tasks for serverless execution, or EKS for managed Kubernetes.</p>
<p><strong>Azure</strong> — Azure Container Instances for simple runs, or AKS for managed Kubernetes.</p>
<p><strong>GCP</strong> — Cloud Run Jobs for serverless batch processing, or GKE for managed Kubernetes.</p>
<h2 id="heading-conclusion-and-next-steps">Conclusion and Next Steps</h2>
<p>In this handbook, you built a multi-agent AI system from scratch. You created four specialized Python agents, containerized each one with Docker, orchestrated them with Docker Compose, and added secrets handling, structured logging, retry logic, and graceful fallbacks.</p>
<p>The core patterns you learned — separation of concerns, containerized agents, shared-volume communication, and defensive coding against external APIs — apply far beyond this specific use case. Any time you need a reliable, modular, and reproducible AI workflow, these patterns are a solid foundation.</p>
<p>Here are some directions to explore next:</p>
<p><strong>Agent collaboration frameworks</strong> — Tools like CrewAI and LangGraph let you build agents that delegate tasks to each other, negotiate priorities, and collaborate in more sophisticated ways.</p>
<p><strong>Local and fine-tuned models</strong> — Experiment with Ollama or vLLM to run models locally. Fine-tune a small model specifically for summarization to get better results at lower cost.</p>
<p><strong>Event-driven architectures</strong> — Replace the shared volume with Redis or RabbitMQ so agents react to events in real time rather than running on a schedule.</p>
<p><strong>Feedback loops</strong> — Add an agent that evaluates the quality of the daily digest and adjusts the Summarizer's prompts over time. This is how production agent systems learn and improve.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Open Source LLM Agent Handbook: How to Automate Complex Tasks with LangGraph and CrewAI ]]>
                </title>
                <description>
                    <![CDATA[ Ever feel like your AI tools are a bit...well, passive? Like they just sit there, waiting for your next command? Imagine if they could take initiative, break down big problems, and even work together to get things done. That's exactly what LLM agents... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-open-source-llm-agent-handbook/</link>
                <guid isPermaLink="false">683f04aedfb685791a4e8dd2</guid>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ openai ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #agent ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ML ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Bash ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Beginner Developers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Tue, 03 Jun 2025 14:20:30 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748956366197/c4dd2bba-430a-4f12-a3d4-becc6707c52e.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Ever feel like your AI tools are a bit...well, passive? Like they just sit there, waiting for your next command? Imagine if they could take initiative, break down big problems, and even work together to get things done.</p>
<p>That's exactly what LLM agents bring to the table. They're changing how we automate complex tasks, and they can help bring our AI ideas to life in a whole new way.</p>
<p>In this article, we'll explore what LLM agents are, how they work, and how you can build your very own using awesome open-source frameworks.</p>
<h3 id="heading-what-well-cover">What we’ll cover:</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-the-current-state-of-llm-agents">The Current State of LLM Agents</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-from-chatbots-to-autonomous-agents">From Chatbots to Autonomous Agents</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-can-agents-do-today">What Can Agents Do Today?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-whats-available-to-build-with">What's Available to Build With?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-now-is-the-best-time-to-learn">Why Now Is the Best Time to Learn</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-what-are-llm-agents-and-why-are-they-a-big-deal">What Are LLM Agents and Why Are They a Big Deal?</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-an-llm">What Is an LLM?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-so-whats-an-llm-agent">So, What’s an LLM Agent?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-does-this-matter">Why Does This Matter?</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-rise-of-open-source-agent-frameworks">The Rise of Open-Source Agent Frameworks</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-popular-open-source-agent-frameworks">Popular Open-Source Agent Frameworks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-these-tools-enable">What These Tools Enable</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-use-a-framework-instead-of-building-from-scratch">Why Use a Framework Instead of Building from Scratch?</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-core-concepts-behind-agent-design">Core Concepts Behind Agent Design</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-the-agent-loop">The Agent Loop</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-components-of-an-agent">Key Components of an Agent</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-multi-agent-collaboration">Multi-Agent Collaboration</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-project-automate-your-daily-schedule-from-emails">Project: Automate Your Daily Schedule from Emails</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-were-automating">What We’re Automating</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-install-the-required-tools">Step 1: Install the Required Tools</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-define-the-task">Step 2: Define the Task</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-build-the-workflow-with-langgraph">Step 3: Build the Workflow with LangGraph</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-multi-agent-collaboration-with-crewai">Multi-Agent Collaboration with CrewAI</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-crewai">What Is CrewAI?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-sample-roles-for-the-email-summary-task">Sample Roles for the Email Summary Task</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-sample-crewai-code">Sample CrewAI Code</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-what-actually-happens-during-execution">What Actually Happens During Execution?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-are-llm-agents-safe-what-to-know-about-security-and-privacy">Are LLM Agents Safe? What to Know About Security and Privacy</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-troubleshooting-and-tips">Troubleshooting &amp; Tips</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-explore-more-daily-automations">Explore More Daily Automations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-whats-next-in-agent-technology">What’s Next in Agent Technology?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-summary">Final Summary</a></p>
</li>
</ol>
<h2 id="heading-the-current-state-of-llm-agents">The Current State of LLM Agents</h2>
<p>LLM agents are one of the most exciting developments in AI right now. They’re already helping automate real tasks, but they’re also still evolving. So where are we today?</p>
<h3 id="heading-from-chatbots-to-autonomous-agents">From Chatbots to Autonomous Agents</h3>
<p>Large Language Models (LLMs) like GPT-4, Claude, Gemini, and LLaMA have evolved from simple chatbots into surprisingly capable reasoning engines. They've gone from answering trivia questions and generating essays to performing complex reasoning, following multi-step instructions, and interacting with tools like web search and code interpreters.</p>
<p>But here’s the catch: these models are <strong>reactive</strong>. They wait for input and give output. They don't retain memory between tasks, plan ahead, or pursue goals on their own. That’s where <strong>LLM agents</strong> come in – they bridge this gap by adding structure, memory, and autonomy.</p>
<h3 id="heading-what-can-agents-do-today">What Can Agents Do Today?</h3>
<p>Right now, LLM agents are already being used for:</p>
<ul>
<li><p>Summarizing emails or documents</p>
</li>
<li><p>Planning daily schedules</p>
</li>
<li><p>Running DevOps scripts</p>
</li>
<li><p>Searching APIs or tools for answers</p>
</li>
<li><p>Collaborating in small “teams” to complete complex tasks</p>
</li>
</ul>
<p>But they’re not perfect yet. Agents can still:</p>
<ul>
<li><p>Get stuck in loops</p>
</li>
<li><p>Misunderstand goals</p>
</li>
<li><p>Require detailed prompts and guardrails</p>
</li>
</ul>
<p>That’s because this technology is still early-stage. Frameworks are getting better fast, but reliability and memory are still works in progress. So just keep that in mind as you experiment.</p>
<h3 id="heading-why-now-is-the-best-time-to-learn">Why Now Is the Best Time to Learn</h3>
<p>The truth is: we’re still early. But not <em>too</em> early.</p>
<p>This is the perfect time to start experimenting with agents:</p>
<ul>
<li><p>The tooling is mature enough to build real projects</p>
</li>
<li><p>The community is growing rapidly</p>
</li>
<li><p>And you don’t need to be an AI expert, just comfortable with Python</p>
</li>
</ul>
<h2 id="heading-what-are-llm-agents-and-why-are-they-a-big-deal">What Are LLM Agents and Why Are They a Big Deal?</h2>
<p>Before we dive into the exciting world of agents, let's quickly chat a bit more about the basics.</p>
<h3 id="heading-what-is-an-llm">What Is an LLM?</h3>
<p>An LLM, or Large Language Model, is basically an AI that's learned from a massive amount of text from the internet – think books, articles, code, and tons more. You can picture it as a super-smart autocomplete engine. But it does way more than just finish your sentences. It can also:</p>
<ul>
<li><p>Answer tricky questions</p>
</li>
<li><p>Summarize long articles or documents</p>
</li>
<li><p>Write code, emails, or creative stories</p>
</li>
<li><p>Translate languages instantly</p>
</li>
<li><p>Even solve logic puzzles and have engaging conversations</p>
</li>
</ul>
<p>Chances are you've heard of ChatGPT, which is powered by OpenAI's GPT models. Other popular LLMs you might come across include Claude (from Anthropic), LLaMA (by Meta), Mistral, and Gemini (from Google).</p>
<p>These models work by simply predicting the next word in a sentence based on the context. While that sounds straightforward, when trained on billions of words, LLMs become capable of surprisingly intelligent behavior, understanding your instructions, following step-by-step reasoning, and producing coherent responses across almost any topic you can imagine.</p>
<h3 id="heading-so-whats-an-llm-agent">So, What’s an LLM Agent?</h3>
<p>While LLMs are super powerful, they usually just <em>react –</em> they only respond when you ask them something. An LLM agent, on the other hand, is <em>proactive</em>.</p>
<p>LLM agents can:</p>
<ul>
<li><p>Break down big, complex tasks into smaller, manageable steps</p>
</li>
<li><p>Make smart decisions and figure out what to do next</p>
</li>
<li><p>Use "tools" like web search, calculators, or even other apps</p>
</li>
<li><p>Work towards a goal, even if it takes multiple steps or tries</p>
</li>
<li><p>Team up with other agents to accomplish shared objectives</p>
</li>
</ul>
<p>In short, LLM agents can think, plan, act, and adapt.</p>
<p>Think of an LLM agent like your super-efficient new assistant: you give it a goal, and it figures out how to achieve it all on its own.</p>
<h3 id="heading-why-does-this-matter">Why Does This Matter?</h3>
<p>This shift from just responding to actively pursuing goals opens a ton of exciting possibilities:</p>
<ul>
<li><p>Automating boring IT or DevOps tasks</p>
</li>
<li><p>Generating detailed reports from raw data</p>
</li>
<li><p>Helping you with multi-step research projects</p>
</li>
<li><p>Reading through your daily emails and highlighting key info</p>
</li>
<li><p>Running your internal tools to take real-world actions</p>
</li>
</ul>
<p>Unlike older, rule-based bots, LLM agents can reason, reflect, and learn from their attempts. This makes them a much better fit for real-world tasks that are messy, require flexibility, and depend on understanding context.</p>
<h2 id="heading-the-rise-of-open-source-agent-frameworks">The Rise of Open-Source Agent Frameworks</h2>
<p>Not too long ago, if you wanted to build an AI system that could act autonomously, it meant writing a ton of custom code, painstakingly managing memory, and trying to stitch together dozens of components. It was a complex, delicate, and highly specialized job.</p>
<p>But guess what? That's not the case anymore.</p>
<p>In 2024, a wave of fantastic open-source frameworks hit the scene. These tools have made it dramatically easier to build powerful LLM agents without you having to reinvent the wheel every time.</p>
<h3 id="heading-popular-open-source-agent-frameworks">Popular Open-Source Agent Frameworks</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Framework</strong></td><td><strong>Description</strong></td><td><strong>Maintainer</strong></td></tr>
</thead>
<tbody>
<tr>
<td>LangGraph</td><td>Graph-based framework for agent state and memory</td><td>LangChain</td></tr>
<tr>
<td>CrewAI</td><td>"Role-based, multi-agent collaboration engine"</td><td>Community (CrewAI)</td></tr>
<tr>
<td>AutoGen</td><td>Customizable multi-agent chat orchestration</td><td>Microsoft</td></tr>
<tr>
<td>AgentVerse</td><td>Modular framework for agent simulation and testing</td><td>Open-source project</td></tr>
</tbody>
</table>
</div><h3 id="heading-what-these-tools-enable">What These Tools Enable</h3>
<p>These frameworks give you ready-made building blocks to handle the trickier parts of creating agents:</p>
<ul>
<li><p><strong>Planning</strong> – Letting agents decide their next move</p>
</li>
<li><p><strong>Tool Use</strong> – Easily connecting agents to things like file systems, web browsers, APIs, or databases</p>
</li>
<li><p><strong>Memory</strong> – Storing and retrieving past information or intermediate results for long-term context</p>
</li>
<li><p><strong>Multi-Agent Collaboration</strong> – Setting up teams of agents that work together on shared goals</p>
</li>
</ul>
<h3 id="heading-why-use-a-framework-instead-of-building-from-scratch">Why Use a Framework Instead of Building from Scratch?</h3>
<p>While you <em>could</em> build a custom agent from the ground up, using a framework will save you a huge amount of time and effort. Open-source agent libraries come packed with:</p>
<ul>
<li><p>Built-in support for orchestrating LLMs</p>
</li>
<li><p>Proven patterns for task planning, keeping track of where you are, and getting feedback</p>
</li>
<li><p>Easy integration with popular models like OpenAI, or even models you run locally</p>
</li>
<li><p>The flexibility to grow from a single helpful agent to entire teams of agents</p>
</li>
</ul>
<p>Basically, these frameworks let you focus on <strong>what your agent should do</strong>, rather than getting bogged down in how to build all the internal workings. Plus, choosing open source means you benefit from community contributions, transparency in how they work, and the freedom to tweak them to your exact needs, without getting locked into a single vendor.</p>
<h2 id="heading-core-concepts-behind-agent-design">Core Concepts Behind Agent Design</h2>
<p>To really grasp how LLM agents operate, it helps to think of them as goal-driven systems that constantly cycle through observing, reasoning, and acting. This continuous loop allows them to tackle tasks that go beyond simple questions and answers, moving into true automation, tool usage, and adapting on the fly.</p>
<h3 id="heading-the-agent-loop">The Agent Loop</h3>
<p>Most LLM agents function based on a mental model called the <strong>Agent Loop</strong>: a step-by-step cycle that repeats until the job is done. Here’s how it typically works:</p>
<ul>
<li><p><strong>Perceive:</strong> The agent starts by noticing something in its environment or receiving new information. This could be your prompt, a piece of data, or the current state of a system.</p>
</li>
<li><p><strong>Plan:</strong> Based on what it perceives and its overall goal, the agent decides what to do next. It might break the task into smaller sub-goals or figure out the best tool for the job.</p>
</li>
<li><p><strong>Act:</strong> The agent then acts. This could mean running a function, calling an API, searching the web, interacting with a database, or even asking another agent for help.</p>
</li>
<li><p><strong>Reflect:</strong> After acting, the agent looks at the outcome: Did it work? Was the result useful? Should it try a different approach? Based on this, it updates its plan and keeps going until the task is complete.</p>
</li>
</ul>
<p>This loop is what makes agents so dynamic. It allows them to handle ever-changing tasks, learn from partial results, and correct their course. These qualities are vital for building truly useful AI assistants.</p>
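<p>To make the loop concrete, here's a tiny, framework-free sketch in plain Python. The <code>perceive</code>, <code>plan</code>, <code>act</code>, and <code>reflect</code> callables are placeholders you would supply yourself; frameworks like LangGraph and CrewAI implement this cycle for you, along with state, memory, and tools:</p>
<pre><code class="lang-python">def run_agent_loop(goal, perceive, plan, act, reflect, max_steps=10):
    """Minimal agent loop: observe, decide, act, and check progress until done."""
    state = {"goal": goal, "observations": [], "done": False}
    for _ in range(max_steps):          # guardrail against endless loops
        state["observations"].append(perceive(state))
        next_action = plan(state)       # decide the next step toward the goal
        outcome = act(next_action)      # run a tool, call an API, and so on
        state["done"] = reflect(state, outcome)
        if state["done"]:
            break
    return state
</code></pre>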
<h3 id="heading-key-components-of-an-agent">Key Components of an Agent</h3>
<p>To do their job effectively, agents are built around several crucial parts:</p>
<ul>
<li><p><strong>Tools</strong> are how an agent interacts with the real (or digital) world. These can be anything from search engines, code execution environments, file readers, or API clients, to simple calculators or command-line scripts.</p>
</li>
<li><p><strong>Memory</strong> lets agents remember what they've done or seen across different steps. This might include previous things you've said, temporary results, or key decisions. Some frameworks offer short-term memory (just for one session), while others support long-term memory that can span multiple sessions or goals.</p>
</li>
<li><p><strong>Environment</strong> refers to the external data or system context the agent operates within: think APIs, documents, databases, files, or sensor inputs. The more information and access an agent has to its environment, the more meaningful actions it can take.</p>
</li>
<li><p><strong>Goal</strong> is the agent's ultimate objective: what it's trying to achieve. Goals should be specific and clear: for instance, “generate a daily schedule,” “summarize this document,” or “extract tasks from emails.”</p>
</li>
</ul>
<h3 id="heading-multi-agent-collaboration">Multi-Agent Collaboration</h3>
<p>For more advanced systems, you can even have multiple agents working together to hit a shared target. Each agent can be given a specific <strong>role</strong> that highlights its specialty, just like people working on a team.</p>
<p>For example:</p>
<ul>
<li><p>A <strong>researcher agent</strong> might be tasked with gathering information.</p>
</li>
<li><p>A <strong>coder agent</strong> could write Python scripts or automation routines.</p>
</li>
<li><p>A <strong>reviewer agent</strong> might check the results and ensure everything is up to snuff.</p>
</li>
</ul>
<p>These agents can chat with each other, share information, and even debate or vote on decisions. This kind of teamwork allows AI systems to tackle bigger, more complex tasks while keeping things organized and modular.</p>
<h2 id="heading-project-automate-your-daily-schedule-from-emails">Project: Automate Your Daily Schedule from Emails</h2>
<h3 id="heading-what-were-automating">What We’re Automating</h3>
<p>Think about your typical morning routine:</p>
<ul>
<li><p>You open your inbox.</p>
</li>
<li><p>You quickly scan through a bunch of emails.</p>
</li>
<li><p>You try to spot meetings, tasks, and important reminders.</p>
</li>
<li><p>Then, you manually write a to-do list or add things to your calendar.</p>
</li>
</ul>
<p>Let's use an LLM agent to make that process effortless. Our agent will:</p>
<ul>
<li><p>Read a list of your email messages</p>
</li>
<li><p>Pull out time-sensitive items like meetings or deadlines</p>
</li>
<li><p>Summarize everything into a nice, clean daily schedule</p>
</li>
</ul>
<h3 id="heading-step-1-install-the-required-tools">Step 1: Install the Required Tools</h3>
<p>To get started, you'll need three main tools: Python, VSCode, and an OpenAI API key.</p>
<h4 id="heading-1-install-python-39-or-higher">1. Install Python 3.9 or Higher</h4>
<p>Grab the latest version of Python 3.9+ from the official website: <a target="_blank" href="https://www.python.org/downloads/">https://www.python.org/downloads/</a></p>
<p>Once it's installed, double-check it by running <code>python --version</code> in your terminal.</p>
<p>This command simply asks your system to report the Python version currently installed. You'll want to see Python 3.9.x or something higher to ensure compatibility with our project.</p>
<h4 id="heading-2-install-vscode-optional-but-recommended">2. Install VSCode (Optional but Recommended)</h4>
<p>VSCode is a fantastic, user-friendly code editor that works perfectly with Python. You can download it right here: <a target="_blank" href="https://code.visualstudio.com/">https://code.visualstudio.com/</a>.</p>
<h4 id="heading-3-get-your-openai-api-key">3. Get Your OpenAI API Key</h4>
<p>Head over to: https://platform.openai.com</p>
<p>Sign in or create a new account. Navigate to your API Keys page. Click “Create new secret key” and make sure to copy that key somewhere safe for later.</p>
<h4 id="heading-4-install-python-libraries">4. Install Python Libraries</h4>
<p>Open your terminal or command prompt and install these essential packages:</p>
<pre><code class="lang-bash">pip install langgraph langchain openai
</code></pre>
<p>This command uses pip, Python's package manager, to download and install three crucial libraries for our agent:</p>
<ul>
<li><p>langgraph: The core framework we'll use to build our agent's workflow.</p>
</li>
<li><p>langchain: A foundational library for working with large language models, upon which LangGraph is built.</p>
</li>
<li><p>openai: The official Python library for connecting to OpenAI's powerful AI models.</p>
</li>
</ul>
<p>If you're excited to try out multi-agent setups (which we'll cover in Step 5), also install CrewAI:</p>
<pre><code class="lang-bash">pip install crewai
</code></pre>
<p>This command installs CrewAI, a specialized framework that makes it easy to orchestrate multiple AI agents working together as a team.</p>
<p><strong>5. Set Your OpenAI API Key</strong></p>
<p>You need to make sure your Python code can find and use your OpenAI API key. This is typically done by setting it as an environment variable.</p>
<p>On macOS/Linux, run this in your terminal (replace "your-api-key" with your actual key):</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> OPENAI_API_KEY=<span class="hljs-string">"your-api-key"</span>
</code></pre>
<p>This command sets an environment variable named OPENAI_API_KEY. Environment variables are a secure way for applications (like your Python script) to access sensitive information without hardcoding it directly into the code itself.</p>
<p>On Windows (using Command Prompt), do this:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">set</span> OPENAI_API_KEY=<span class="hljs-string">"your-api-key"</span>
</code></pre>
<p>This is the Windows equivalent command to set the <code>OPENAI_API_KEY</code> environment variable.</p>
<p>Now, your Python code will be all set to talk to the OpenAI model!</p>
<h3 id="heading-step-2-define-the-task">Step 2: Define the Task</h3>
<p>We discussed this briefly in the beginning of this section. But to reiterate, this is what we’ll want our agent to do:</p>
<ul>
<li><p>Scan for meetings, events, and important tasks.</p>
</li>
<li><p>Jot them down quickly in a notebook or an app.</p>
</li>
<li><p>Create a rough mental plan for your day.</p>
</li>
</ul>
<p>This routine takes time and mental energy. So having an agent do it for us will be super helpful.</p>
<h3 id="heading-step-3-build-the-workflow-with-langgraph">Step 3: Build the Workflow with LangGraph</h3>
<h4 id="heading-what-is-langgraph">What Is LangGraph?</h4>
<p>LangGraph is a cool framework that helps you build agents using a "graph-based" workflow, kind of like drawing a flowchart. It's powered by LangChain and gives you a lot more control over exactly how each step in your agent's process unfolds.</p>
<p>Each "node" in this graph represents a decision point or a function that:</p>
<ul>
<li><p>Takes some input (its current "state").</p>
</li>
<li><p>Does some reasoning or takes an action (often involving the LLM and its tools).</p>
</li>
<li><p>Returns an updated output (a new "state").</p>
</li>
</ul>
<p>You draw the connections between these nodes, and LangGraph then executes it like a smart, automated state machine.</p>
<h4 id="heading-why-use-langgraph">Why Use LangGraph?</h4>
<ul>
<li><p>You get to control the precise order of execution.</p>
</li>
<li><p>It's fantastic for building workflows that have multiple steps or even branch off into different paths.</p>
</li>
<li><p>It plays nicely with both cloud-based models (like OpenAI) and models you run locally.</p>
</li>
</ul>
<p>Alright – now let’s write the code.</p>
<h5 id="heading-1-simulate-email-input"><strong>1. Simulate Email Input</strong></h5>
<p>In a real application, your agent would probably connect to Gmail or Outlook to fetch your actual emails. For this example, though, we’ll just hardcode some sample messages to keep things simple:</p>
<pre><code class="lang-python">Python

emails = <span class="hljs-string">"""
1. Subject: Standup Call at 10 AM
2. Subject: Client Review due by 5 PM
3. Subject: Lunch with Sarah at noon
4. Subject: AWS Budget Warning – 80% usage
5. Subject: Dentist Appointment - 4 PM
"""</span>
</code></pre>
<p>This multiline Python string, <code>emails</code>, acts as our stand-in for real email content. We're providing a simple, structured list of email subjects to demonstrate how the agent will process text.</p>
<h5 id="heading-2-define-the-agent-logic"><strong>2. Define the Agent Logic</strong></h5>
<p>Now, we'll tell OpenAI’s GPT model how to process this email text and turn it into a summary.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain_openai <span class="hljs-keyword">import</span> ChatOpenAI
<span class="hljs-keyword">from</span> langgraph.graph <span class="hljs-keyword">import</span> StateGraph, END
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> TypedDict, Annotated, List
<span class="hljs-keyword">import</span> operator

<span class="hljs-comment"># Define the state for our graph</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AgentState</span>(<span class="hljs-params">TypedDict</span>):</span>
    emails: str
    result: str

llm = ChatOpenAI(temperature=<span class="hljs-number">0</span>, model=<span class="hljs-string">"gpt-4o"</span>) <span class="hljs-comment"># Using gpt-4o for better performance</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">calendar_summary_agent</span>(<span class="hljs-params">state: AgentState</span>) -&gt; AgentState:</span>
    emails = state[<span class="hljs-string">"emails"</span>]
    prompt = <span class="hljs-string">f"Summarize today's schedule based on these emails, listing time-sensitive items first and then other important notes. Be concise and use bullet points:\n<span class="hljs-subst">{emails}</span>"</span>
    summary = llm.invoke(prompt).content
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"result"</span>: summary, <span class="hljs-string">"emails"</span>: emails} <span class="hljs-comment"># Ensure emails is also returned</span>
</code></pre>
<p>Here’s what’s going on:</p>
<ul>
<li><p><strong>Imports</strong>: We bring in necessary components:</p>
<ul>
<li><p><code>ChatOpenAI</code> to connect to the LLM,</p>
</li>
<li><p><code>StateGraph</code> and <code>END</code> from <code>langgraph.graph</code> to build our agent workflow,</p>
</li>
<li><p><code>TypedDict</code>, <code>Annotated</code>, and <code>List</code> from <code>typing</code> for type checking and structure,</p>
</li>
<li><p><code>operator</code> (though not used in this snippet, it can help with comparisons or logic).</p>
</li>
</ul>
</li>
<li><p><strong>AgentState</strong>: This <code>TypedDict</code> defines the shape of the data our agent will work with. It includes:</p>
<ul>
<li><p><code>emails</code>: the raw input messages.</p>
</li>
<li><p><code>result</code>: the final output (the daily summary).</p>
</li>
</ul>
</li>
<li><p><strong>llm = ChatOpenAI(...)</strong>: Initializes the language model. We're using GPT-4o with <code>temperature=0</code> to ensure consistent, predictable output, which is perfect for structured summarization tasks.</p>
</li>
<li><p><strong>calendar_summary_agent(state: AgentState)</strong>: This function is the "brain" of our agent. It:</p>
<ul>
<li><p>Takes in the current state, which includes a list of emails.</p>
</li>
<li><p>Extracts the emails from that state.</p>
</li>
<li><p>Constructs a prompt that tells the model to generate a concise daily schedule summary using bullet points, prioritizing time-sensitive items.</p>
</li>
<li><p>Sends this prompt to the model with <code>llm.invoke(prompt).content</code>, which returns the LLM’s response as plain text.</p>
</li>
<li><p>Returns a new <code>AgentState</code> dictionary containing:</p>
<ul>
<li><p><code>result</code>: the generated summary,</p>
</li>
<li><p><code>emails</code>: preserved in case we need it downstream.</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h5 id="heading-3-build-and-run-the-graph"><strong>3. Build and Run the Graph</strong></h5>
<p>Now, let's use LangGraph to map out the flow of our single-agent task and then run it.</p>
<pre><code class="lang-python">builder = StateGraph(AgentState)
builder.add_node(<span class="hljs-string">"calendar"</span>, calendar_summary_agent)
builder.set_entry_point(<span class="hljs-string">"calendar"</span>)
builder.set_finish_point(<span class="hljs-string">"calendar"</span>) <span class="hljs-comment"># END is implicit if not set explicitly</span>

graph = builder.compile()

<span class="hljs-comment"># Run the graph using your simulated email data</span>
result = graph.invoke({<span class="hljs-string">"emails"</span>: emails})
print(result[<span class="hljs-string">"result"</span>])
</code></pre>
<p>Here’s what’s going on:</p>
<ul>
<li><p><strong>builder = StateGraph(AgentState):</strong> We're initiating a StateGraph object. By passing AgentState, we're telling LangGraph the expected data structure for its internal state.</p>
</li>
<li><p><strong>builder.add_node("calendar", calendar_summary_agent):</strong> This line adds a named "node" to our graph. We're calling it "calendar", and we're linking it to our <code>calendar_summary_agent</code> function, meaning that function will be executed when this node is active.</p>
</li>
<li><p><strong>builder.set_entry_point("calendar"):</strong> This sets "calendar" as the very first step in our workflow. When we start the graph, execution will begin here.</p>
</li>
<li><p><strong>builder.set_finish_point("calendar"):</strong> This tells LangGraph that once the "calendar" node finishes its job, the entire graph process is complete.</p>
</li>
<li><p><strong>graph = builder.compile():</strong> This command takes our defined graph blueprint and "compiles" it into an executable workflow.</p>
</li>
<li><p><strong>result = graph.invoke({"emails": emails}):</strong> This is where the magic happens! We're telling our graph to start running. We pass it an initial state that contains our emails data. The graph will then process this data through its nodes until it reaches an end point, returning the final state.</p>
</li>
<li><p><strong>print(result["result"]):</strong> Finally, we grab the summarized schedule from the result (the final state of our graph) and print it to the console.</p>
</li>
</ul>
<h4 id="heading-example-output">Example Output</h4>
<p><code>Your Schedule:</code><br><code>- 10:00 AM – Standup Call</code><br><code>- 12:00 PM – Lunch with Sarah</code><br><code>- 4:00 PM – Dentist Appointment</code><br><code>- Submit client report by 5:00 PM</code><br><code>- AWS Budget Warning – check usage</code></p>
<p>Boom! You've just built an AI agent that can read your emails and whip up your daily schedule. Pretty cool, right? This is a simple yet powerful peek into what LLM agents can do with just a few lines of code.</p>
<h2 id="heading-multi-agent-collaboration-with-crewai">Multi-Agent Collaboration with CrewAI</h2>
<h3 id="heading-what-is-crewai">What Is CrewAI?</h3>
<p>CrewAI is an exciting open-source framework that lets you build <em>teams</em> of agents that work together seamlessly just like a real-world project team! Each agent in a CrewAI setup:</p>
<ul>
<li><p>Has a specific, specialized role.</p>
</li>
<li><p>Can communicate and share information with its teammates.</p>
</li>
<li><p>Collaborates to achieve a shared goal.</p>
</li>
</ul>
<p>This multi-agent approach is super useful when your task is too big or too complex for just one agent, or when breaking it down into specialized parts makes it clearer and more efficient.</p>
<h3 id="heading-sample-roles-for-the-email-summary-task">Sample Roles for the Email Summary Task</h3>
<p>Let's imagine our email summary task being handled by a small team of agents:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Agent Name</strong></td><td><strong>Role</strong></td><td><strong>Responsibility</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Extractor</td><td>Email Scanner</td><td>"Find meetings, reminders, and tasks from emails"</td></tr>
<tr>
<td>Prioritizer</td><td>Schedule Optimizer</td><td>Sort items by urgency and time</td></tr>
<tr>
<td>Formatter</td><td>Output Generator</td><td>"Write a clean, polished daily agenda"</td></tr>
</tbody>
</table>
</div><h3 id="heading-sample-crewai-code">Sample CrewAI Code</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> crewai <span class="hljs-keyword">import</span> Agent, Crew, Task, Process
<span class="hljs-keyword">from</span> langchain_openai <span class="hljs-keyword">import</span> ChatOpenAI
<span class="hljs-keyword">import</span> os

<span class="hljs-comment"># Set your OpenAI API key from environment variables</span>
<span class="hljs-comment"># os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" # Make sure this is set, or defined directly</span>

<span class="hljs-comment"># Initialize the LLM (using gpt-4o for better performance)</span>
llm = ChatOpenAI(temperature=<span class="hljs-number">0</span>, model=<span class="hljs-string">"gpt-4o"</span>)

<span class="hljs-comment"># Define the agents with specific roles and goals</span>
extractor = Agent(
    role=<span class="hljs-string">"Email Scanner"</span>,
    goal=<span class="hljs-string">"Find all meetings, reminders, and tasks from the given emails, accurately extracting details like time, date, and subject."</span>,
    backstory=<span class="hljs-string">"You are an expert at scanning emails for key information. You meticulously extract every relevant detail."</span>,
    verbose=<span class="hljs-literal">True</span>,
    allow_delegation=<span class="hljs-literal">False</span>,
    llm=llm
)

prioritizer = Agent(
    role=<span class="hljs-string">"Schedule Optimizer"</span>,
    goal=<span class="hljs-string">"Sort extracted items by urgency and time, preparing them for a daily agenda."</span>,
    backstory=<span class="hljs-string">"You are a master of time management, always knowing what needs to be done first. You organize tasks logically."</span>,
    verbose=<span class="hljs-literal">True</span>,
    allow_delegation=<span class="hljs-literal">False</span>,
    llm=llm
)

formatter = Agent(
    role=<span class="hljs-string">"Output Generator"</span>,
    goal=<span class="hljs-string">"Generate a clean, polished, and concise daily agenda in bullet-point format, clearly listing all schedule items."</span>,
    backstory=<span class="hljs-string">"You are a professional secretary, ensuring all outputs are perfectly formatted and easy to read. You prioritize clarity."</span>,
    verbose=<span class="hljs-literal">True</span>,
    allow_delegation=<span class="hljs-literal">False</span>,
    llm=llm
)

<span class="hljs-comment"># Simulate email input</span>
emails = <span class="hljs-string">"""
1. Subject: Standup Call at 10 AM
2. Subject: Client Review due by 5 PM
3. Subject: Lunch with Sarah at noon
4. Subject: AWS Budget Warning – 80% usage
5. Subject: Dentist Appointment - 4 PM
"""</span>

<span class="hljs-comment"># Define the tasks for each agent</span>
extract_task = Task(
    description=<span class="hljs-string">f"Extract all relevant events, meetings, and tasks from these emails: <span class="hljs-subst">{emails}</span>. Focus on precise details."</span>,
    agent=extractor,
    expected_output=<span class="hljs-string">"A list of extracted items with their details (e.g., '- Standup Call at 10 AM', '- Client Review due by 5 PM')."</span>
)

prioritize_task = Task(
    description=<span class="hljs-string">"Prioritize the extracted items by time and urgency. Meetings first, then deadlines, then other notes."</span>,
    agent=prioritizer,
    context=[extract_task], <span class="hljs-comment"># The output of extract_task is the input here</span>
    expected_output=<span class="hljs-string">"A prioritized list of schedule items."</span>
)

format_task = Task(
    description=<span class="hljs-string">"Format the prioritized schedule into a clean, easy-to-read daily agenda using bullet points. Ensure concise language."</span>,
    agent=formatter,
    context=[prioritize_task], <span class="hljs-comment"># The output of prioritize_task is the input here</span>
    expected_output=<span class="hljs-string">"A well-formatted daily agenda with bullet points."</span>
)

<span class="hljs-comment"># Instantiate the crew</span>
crew = Crew(
    agents=[extractor, prioritizer, formatter],
    tasks=[extract_task, prioritize_task, format_task],
    process=Process.sequential, <span class="hljs-comment"># Tasks are executed sequentially</span>
    verbose=<span class="hljs-number">2</span> <span class="hljs-comment"># Outputs more details during execution</span>
)

<span class="hljs-comment"># Run the crew</span>
result = crew.kickoff()
print(<span class="hljs-string">"\n########################"</span>)
print(<span class="hljs-string">"## Final Daily Agenda ##"</span>)
print(<span class="hljs-string">"########################\n"</span>)
print(result)
</code></pre>
<p>Here’s what’s going on:</p>
<ul>
<li><p><strong>Imports:</strong> We bring in key classes from CrewAI: Agent, Crew, Task, and Process. We also import <code>ChatOpenAI</code> for our language model and os to handle environment variables.</p>
</li>
<li><p><strong>llm = ChatOpenAI(...):</strong> Just like in the LangGraph example, this sets up our OpenAI language model, making sure its responses are direct (temperature=0) and using the gpt-4o model.</p>
</li>
<li><p><strong>Agent Definitions (extractor, prioritizer, formatter):</strong></p>
<ul>
<li><p>Each of these variables creates an Agent instance. An agent is defined by its role (what it does), a specific goal it's trying to achieve, and a backstory (a sort of personality or expertise that helps the LLM understand its purpose better).</p>
</li>
<li><p>verbose=True is super helpful for debugging, as it makes the agents print out their "thoughts" as they work.</p>
</li>
<li><p>allow_delegation=False means these agents won't pass their assigned tasks to other agents (though this can be set to True for more complex delegation scenarios).</p>
</li>
<li><p>llm=llm connects each agent to our OpenAI language model.</p>
</li>
</ul>
</li>
<li><p><strong>Simulated emails:</strong> We reuse the same sample email data for this example.</p>
</li>
<li><p><strong>Task Definitions (extract_task, prioritize_task, format_task):</strong></p>
<ul>
<li><p>Each Task defines a specific piece of work that an agent needs to perform.</p>
</li>
<li><p>description clearly tells the agent what the task involves.</p>
</li>
<li><p>agent assigns this task to one of our defined agents (e.g., extractor for extract_task).</p>
</li>
<li><p>context=[...] is a critical part of CrewAI's collaboration. It tells a task to use the <em>output</em> of a previous task as its <em>input</em>. For instance, prioritize_task takes the extract_task's output as its context.</p>
</li>
<li><p>expected_output gives the agent an idea of what its result should look like, helping guide the LLM.</p>
</li>
</ul>
</li>
<li><p><strong>crew = Crew(...):</strong></p>
<ul>
<li><p>This is where we assemble our team! We create a Crew instance, giving it our list of agents and tasks.</p>
</li>
<li><p>process=Process.sequential tells the crew to execute tasks one after another in the order they're defined in the tasks list. CrewAI also supports more advanced processes like hierarchical ones.</p>
</li>
<li><p>verbose=2 will show you a very detailed log of the crew's internal workings and communication.</p>
</li>
</ul>
</li>
<li><p><strong>result = crew.kickoff():</strong> This command officially starts the entire multi-agent workflow. The agents will begin collaborating, passing information, and working through their assigned tasks in sequence.</p>
</li>
<li><p><strong>print(result):</strong> Finally, the consolidated output from the entire crew's collaborative effort is printed to your console.</p>
</li>
</ul>
<p>CrewAI cleverly handles all the communication between agents, figures out who needs to work on what and when, and passes the output smoothly from one agent to the next. It's like having a mini AI assembly line!</p>
<h2 id="heading-what-actually-happens-during-execution">What Actually Happens During Execution?</h2>
<p>So, whether you're using LangGraph or CrewAI, what's really going on behind the scenes when an agent runs? Let's break down the execution process:</p>
<ul>
<li><p>The system gets an <strong>input state</strong> (for example, your emails).</p>
</li>
<li><p>The first agent or graph node reads this input and uses a <strong>Large Language Model (LLM)</strong> to make sense of it.</p>
</li>
<li><p>Based on its understanding, the agent decides on an <strong>action</strong> like pulling out key events or calling a specific tool.</p>
</li>
<li><p>If needed, the agent might <strong>invoke tools</strong> (like a web search or a file reader) to get more context or perform external operations.</p>
</li>
<li><p>The result of that action is then <strong>passed to the next agent</strong> in the team (if it's a multi-agent setup) or returned directly to you.</p>
</li>
</ul>
<p>Execution keeps going until:</p>
<ul>
<li><p>The task is fully completed.</p>
</li>
<li><p>All agents have finished their assigned roles.</p>
</li>
<li><p>A stopping condition or a designated "END" point in the workflow is reached.</p>
</li>
</ul>
<p>Think of this as a super-smart workflow engine where every single step involves reasoning, making decisions, and remembering previous interactions.</p>
<h2 id="heading-are-llm-agents-safe-what-to-know-about-security-and-privacy">Are LLM Agents Safe? What to Know About Security and Privacy</h2>
<p>As cool as LLM agents are, they raise an important question: <em>can you really trust an AI to run parts of your workflow or interact with your data?</em> It depends. If you’re using services like OpenAI or Anthropic, your data is encrypted in transit and (as of now) isn’t used for training.</p>
<p>But some data might still be temporarily logged to prevent abuse. That’s usually fine for testing and personal projects, but if you’re working with sensitive business info, customer data, or anything private, you’ll want to be careful.</p>
<p>Use anonymized inputs, avoid exposing full datasets, and consider running agents locally using open-source models like LLaMA or Mistral if full control matters to you.</p>
<p>You can also set clear boundaries for your agents so they don’t overstep. Think of it like onboarding a new intern: you wouldn’t give them access to everything on day one.</p>
<p>Give agents only the tools and files they need, keep logs of what they do, and always review the results before letting them make real changes.</p>
<p>As this tech grows, more safety features are coming like better sandboxing, memory limits, and role-based access. But for now, it’s smart to treat your agents like powerful helpers that still need some human supervision.</p>
<h2 id="heading-troubleshooting-amp-tips">Troubleshooting &amp; Tips</h2>
<p>Sometimes, agents can be a bit quirky! Here are some common issues you might run into and how to fix them:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Issue</strong></td><td><strong>Suggested Fix</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Agent seems to loop forever</td><td>Set a maximum number of iterations or define a clearer stopping point.</td></tr>
<tr>
<td>Output is too chatty or verbose</td><td>Use more specific prompts (for example, “Respond in bullet points only”).</td></tr>
<tr>
<td>Input is too long or gets cut off</td><td>Break down large pieces of content into smaller chunks and summarize them individually.</td></tr>
<tr>
<td>Agent runs too slowly</td><td>Try using a faster LLM model like gpt-3.5 or consider running a local model.</td></tr>
</tbody>
</table>
</div><p>A handy tip: You can also add print() statements or logging messages inside your agent functions to see what's happening at each stage and debug state transitions.</p>
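<p>For example, here's what that kind of logging might look like inside a hypothetical LangGraph-style node function – the node name and state keys are placeholders for whatever your own graph uses:</p>
<pre><code class="lang-python">import logging

logging.basicConfig(level=logging.INFO)

def summarize_node(state):
    # Log the incoming state so you can trace state transitions between nodes
    logging.info("summarize_node received: %s", state)
    summary = state.get("extracted_events", "")[:200]  # placeholder "work"
    new_state = {**state, "summary": summary}
    logging.info("summarize_node produced: %s", new_state)
    return new_state

summarize_node({"extracted_events": "Meeting with design team at 3pm"})
</code></pre>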
<h2 id="heading-explore-more-daily-automations">Explore More Daily Automations</h2>
<p>Once you've built one agent-based task, you'll find it incredibly easy to adapt the pattern for other automations. Here are some cool ideas to get your creative juices flowing:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Task Type</strong></td><td><strong>Example Automation</strong></td></tr>
</thead>
<tbody>
<tr>
<td>DevOps Assistant</td><td>Read system logs, detect potential issues, and suggest solutions.</td></tr>
<tr>
<td>Finance Tracker</td><td>Read bank statements or CSV files and summarize your spending habits/budgets.</td></tr>
<tr>
<td>Meeting Organizer</td><td>After a meeting, automatically extract action items and assign owners.</td></tr>
<tr>
<td>Inbox Cleaner</td><td>Automatically label, archive, and delete non-urgent emails.</td></tr>
<tr>
<td>Note Summarizer</td><td>Convert your daily notes into a neatly formatted to-do list or summary.</td></tr>
<tr>
<td>Link Checker</td><td>Extract URLs from documents and automatically test if they're still valid.</td></tr>
<tr>
<td>Resume Formatter</td><td>Score resumes against job descriptions and format them automatically.</td></tr>
</tbody>
</table>
</div><p>Each of these can be built using the very same principles and frameworks we discussed, whether that's LangGraph or CrewAI.</p>
<h2 id="heading-whats-next-in-agent-technology">What’s Next in Agent Technology?</h2>
<p>LLM agents are evolving at lightning speed, and the next wave of innovation is already here:</p>
<ul>
<li><p><strong>Smarter memory systems</strong>: Expect agents to have better long-term memory, allowing them to learn over extended periods and remember past conversations and actions.</p>
</li>
<li><p><strong>Multi-modal agents</strong>: Agents won't just handle text anymore! They'll be able to process and understand images, audio, and video, making them much more versatile.</p>
</li>
<li><p><strong>Advanced planning frameworks</strong>: Techniques like ReAct, Toolformer, and AutoGen are constantly improving agents' ability to reason, plan, and reduce those pesky "hallucinations."</p>
</li>
<li><p><strong>Edge deployment</strong>: Imagine agents running entirely offline on your local computer or device using lightweight models like LLaMA 3 or Mistral.</p>
</li>
</ul>
<p>In the very near future, you'll see agents seamlessly integrated into:</p>
<ul>
<li><p>Your DevOps pipelines</p>
</li>
<li><p>Big enterprise workflows</p>
</li>
<li><p>Everyday productivity tools</p>
</li>
<li><p>Mobile apps and smart devices</p>
</li>
<li><p>Games, simulations, and educational platforms</p>
</li>
</ul>
<h2 id="heading-final-summary">Final Summary</h2>
<p>Alright, let's quickly recap all the cool stuff you've just learned and accomplished:</p>
<ul>
<li><p>You've gotten a solid grasp of what LLM agents are and why they're so powerful.</p>
</li>
<li><p>You've seen how open-source frameworks like LangGraph and CrewAI make building agents much easier.</p>
</li>
<li><p>You've built a real LLM agent using LangGraph to automate a common daily task: summarizing your inbox!</p>
</li>
<li><p>You've explored the world of multi-agent collaboration with CrewAI, understanding how teams of AIs can work together.</p>
</li>
<li><p>You've learned how to take these principles and scale them to automate countless other tasks.</p>
</li>
</ul>
<p>So, next time you find yourself stuck doing something repetitive, just ask yourself: "Hey, can I build an agent for that?" The answer is probably yes!</p>
<h3 id="heading-resources-recap">Resources Recap</h3>
<p>Here are some helpful resources if you want to dive deeper into building LLM agents:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Resource</strong></td><td><strong>Link</strong></td></tr>
</thead>
<tbody>
<tr>
<td>LangGraph Docs</td><td><a target="_blank" href="https://docs.langgraph.dev/">https://docs.langgraph.dev/</a></td></tr>
<tr>
<td>CrewAI GitHub</td><td><a target="_blank" href="https://github.com/joaomdmoura/crewAI">https://github.com/joaomdmoura/crewAI</a></td></tr>
<tr>
<td>LangChain Docs</td><td><a target="_blank" href="https://docs.langchain.com/docs/">https://docs.langchain.com/docs/</a></td></tr>
<tr>
<td>OpenAI API Docs</td><td><a target="_blank" href="https://platform.openai.com/docs">https://platform.openai.com/docs</a></td></tr>
<tr>
<td>Python 3.9+</td><td><a target="_blank" href="https://www.python.org/downloads/">https://www.python.org/downloads/</a></td></tr>
<tr>
<td>VSCode</td><td><a target="_blank" href="https://code.visualstudio.com/">https://code.visualstudio.com/</a></td></tr>
</tbody>
</table>
</div> ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Agentic AI Handbook: A Beginner's Guide to Autonomous Intelligent Agents ]]>
                </title>
                <description>
                    <![CDATA[ You may have heard about “Agentic AI” systems and wondered what they’re all about. Well, in basic terms, the idea behind Agentic AI is that it can see its surroundings, set and pursue goals, plan and reason through many processes, and learn from expe... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-agentic-ai-handbook/</link>
                <guid isPermaLink="false">68371c1c13269a460c440e6c</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic workflow ]]>
                    </category>
                
                    <category>
                        <![CDATA[ openai ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Chaos Engineering ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #ai-tools ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Wed, 28 May 2025 14:22:20 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440644883/96088174-14a2-40da-9a7d-931253f3045b.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You may have heard about “Agentic AI” systems and wondered what they’re all about. Well, in basic terms, the idea behind Agentic AI is that it can see its surroundings, set and pursue goals, plan and reason through many processes, and learn from experience.</p>
<p>Unlike chatbots or rule-based software, agentic AI does more than passively respond to user requests. It may break activities into smaller tasks, make decisions based on a high-level goal, and change its behavior over time using tools or other specialized AI components.</p>
<p>To summarize, <a target="_blank" href="https://blogs.nvidia.com/blog/what-is-agentic-ai/">agentic AI systems</a> "solve complex, multi-step problems autonomously by using sophisticated reasoning and iterative planning." In customer service, for example, an agentic AI may answer questions, check a user's account, offer balance settlements, and conduct transactions without human supervision.</p>
<p>So, agentic AI is "<a target="_blank" href="https://www.ibm.com/think/topics/agentic-ai">AI with agency</a>”. Given a problem context, it sets goals, creates strategies, manipulates the environment or software tools, and learns from the results.</p>
<p>But at the moment, most popular AI systems are reactive or non-agentic, doing a specific job or reacting to inputs without preparation. For example, Siri or a traditional image classifier use predefined models or rules to map inputs to outputs. Instead of long-term goals or multi-step processes, <a target="_blank" href="https://www.ibm.com/think/topics">reactive AI</a> "responds to specific inputs with pre-defined actions". Agentic AI is more like a robot or personal assistant that can handle reasoning chains, adapt, and "think" before acting.</p>
<h3 id="heading-what-well-cover-here">What we’ll cover here</h3>
<p>In this article, you’ll learn what makes Agentic AI fundamentally different from traditional reactive systems. We’ll cover its key components – autonomy, goal-setting, planning, reasoning, and memory – and explore how these systems are being built today. We’ll also look at the challenges they present and where they currently stand in development. Finally, you’ll get a hands-on tutorial on how to build your own simple agent using Python and LangChain.</p>
<h3 id="heading-table-of-contents">Table of Contents:</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-agentic-vs-reactive-ai">Agentic vs Reactive AI</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-components-of-ai-agency">Key Components of AI Agency</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-autonomy">Autonomy</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-goal-directed-behavior">Goal-Directed Behavior</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-planning">Planning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-reasoning">Reasoning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-memory">Memory</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-how-does-agentic-ai-know-what-to-do">How Does Agentic AI Know What to Do?</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-1-it-uses-a-pretrained-ai-model">1. It Uses a Pretrained AI Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-2-it-follows-instructions-in-prompts">2. It Follows Instructions in Prompts</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-3-it-uses-tools-but-only-when-told-how">3. It Uses Tools, But Only When Told How</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-4-it-can-remember-sometimes">4. It Can Remember (Sometimes)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-5-its-not-fully-autonomous-yet">5. It’s Not Fully Autonomous — Yet</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-so-whats-the-current-state-of-agentic-ai">So What’s the Current State of Agentic AI?</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-exists-today">What Exists Today</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-whats-still-experimental">What’s Still Experimental</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-are-we-close-to-truly-autonomous-agents">Are We Close to Truly Autonomous Agents?</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-building-agentic-ai-frameworks-and-approaches">Building Agentic AI: Frameworks and Approaches</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-reinforcement-learning-rl-agents">Reinforcement Learning (RL) Agents</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-llm-based-generative-agents">LLM-Based (Generative) Agents</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-multi-agent-and-orchestration-frameworks">Multi-Agent and Orchestration Frameworks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-classical-planning-and-symbolic-ai">Classical Planning and Symbolic AI</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tool-augmented-reasoning">Tool-augmented Reasoning</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-major-challenges-of-agentic-ai">Major Challenges of Agentic AI</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-alignment-and-value-specification">Alignment and Value Specification</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-unintended-consequences">Unintended Consequences</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-safety-and-security">Safety and Security</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-coordination-and-scalability">Coordination and Scalability</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ethical-and-legal-questions">Ethical and Legal Questions</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-code-snippet-and-real-world-examples">Code Snippet and Real-World Examples</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tutorial-build-your-first-agentic-ai-with-python">Tutorial: Build Your First Agentic AI with Python</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-real-world-use-case">Real-World Use Case</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites-what-you-need">Prerequisites – What You Need</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-by-step-tutorial">Step-by-Step Tutorial</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-agentic-vs-reactive-ai"><strong>Agentic vs Reactive AI</strong></h2>
<p>Before we dive fully in, I want to make sure the differences between non-agentic and agentic AI are clear.</p>
<p>Non-agentic reactive AI uses learned models or rules to map inputs to outputs. It replies to one idea or task at a time, not starting additional ones. Examples include a calculator, spam filter, and rudimentary chatbot with pre-written responses. Reactive AI cannot plan or improve without reprogramming.</p>
<p>Agentic AI, on the other hand, acts independently with goals. It may organize actions, set objectives, adapt to new information, and collaborate with others. Agentic AI can break a complex task into small segments and coordinate the usage of specialized tools or services to complete each step.</p>
<p>The agent is also proactive. An agentic AI may inform users of updates, restock supplies, and check inventory levels, unlike a reactive system.</p>
<p>The difference is a paradigmatic shift: modern agentic systems include several specialized agents working together on a high-level objective, with dynamic task breakdown and even permanent memory, instead of a single model. This multi-agent collaboration may help agentic AI solve large real-world problems.</p>
<p>Cutting-edge prototypes like intelligent chatbots with tool integration, autonomous driving software, and coordinated industrial robots are entering agentic territory, while today's reactive virtual assistants (Alexa, Siri) may blur the line. The vital distinction is whether the system actively chooses its actions rather than merely reacting to inputs.</p>
<h2 id="heading-key-components-of-ai-agency"><strong>Key Components of AI Agency</strong></h2>
<p>Agentic AI systems are characterized by several core capabilities that give them <strong>agency</strong>. Let’s look at these now.</p>
<h3 id="heading-autonomy"><strong>Autonomy</strong></h3>
<p>An autonomous agent may work without human supervision. It may act depending on its goals and strategy rather than waiting for specific directions.</p>
<p>To be autonomous, the agent must use sensors or data streams to perceive, evaluate, and decide. An autonomous warehouse robot can move, pick up items, and alter its path when it encounters obstacles, all without human guidance. Autonomy also implies self-monitoring: an agent gauges its battery life or job progress and adapts as needed.</p>
<p>An agentic AI's “reasoning engine” (usually a large language model or similar system) makes decisions and can adjust its behavior based on user feedback or rewards.</p>
<p>As IBM explains, “without any human intervention, agentic AI can act independently, adapt to new situations, make decisions, and learn from experience” (<a target="_blank" href="https://www.ibm.com/think/topics/agentic-ai">source</a>). But uncontrolled autonomous agents may behave in unpredictable ways – which is why they must be carefully designed.</p>
<p>Although agentic AIs can operate on their own, their goals, tools, and boundaries must be clearly planned to avoid unintended or harmful outcomes. Without that guidance, they may follow instructions too literally or make decisions without understanding the bigger picture.</p>
<h3 id="heading-goal-directed-behavior"><strong>Goal-Directed Behavior</strong></h3>
<p>Agentic AI is goal-directed. The system attempts to achieve one or more goals. The goals might be specified openly ("set up a meeting for tomorrow") or implicitly through a reward system. Instead of following a script, the agent chooses how to achieve its goal. It may choose methods, subgoals, and long-term goals.</p>
<p>Reactive AI, which doesn't plan, has short-term or implicit goals (for example, recognize an image, guess the next word). Agentic AIs aim toward long-term goals. If assigned the task of "organizing my travel itinerary," an agent may book flights, hotels, transportation, and so on, choose the best order, and adjust the schedule if airline prices change.</p>
<p>Business and research sources underline this distinction. Agentic AI plans and works toward long-term goals, whereas reactive systems manage immediate, one-off responses. A plan-and-execute architecture lets the agent decide what to do and define and alter its own goals. Instead of distinct, separate acts, it progressively performs a coherent series of steps. Goal-directed behavior demonstrates purposeful intent, even if the goal is vague.</p>
<h3 id="heading-planning"><strong>Planning</strong></h3>
<p>An agent plans to achieve its goals. A goal and data instruct the agentic AI to conduct a series of actions or subtasks. Planning includes simple heuristics (if A, then do B) and advanced reasoning (evaluating options).</p>
<p>Modern agentic AI uses planner-executor architectures with chain-of-thought prompting. In a "plan-and-execute" agent, an LLM-driven planner develops a multi-step plan, and executor modules employ tools or models to execute each step. ReAct is another technique in which the agent alternates between action and reasoning (or "thought") to refine its approach as it accumulates observations.</p>
<p>Planning often involves search and optimization using neural networks, decision trees, or graph-based techniques. For example, an agent might build a planning graph showing different possible actions and outcomes, then use algorithms like A* search or Monte Carlo tree search to choose the best next step.</p>
<p>In some cases, the agent simulates multiple possible futures to evaluate which actions are most likely to lead to success. Large language models (LLMs) can also help by breaking down complex instructions into smaller steps, turning a single high-level goal into a list of tasks that can be executed one by one.</p>
<p>Here’s a simplified example (pseudocode) of an agent loop:</p>
<pre><code class="lang-python">goal = <span class="hljs-string">"prepare presentation on AI"</span>
agent = AI_Agent(goal)
environment = TaskEnvironment()
 <span class="hljs-comment"># Loop until the task is complete</span>
<span class="hljs-keyword">while</span> <span class="hljs-keyword">not</span> environment.task_complete():
    observation = agent.perceive(environment)
    plan = agent.make_plan(observation)        <span class="hljs-comment"># e.g., list of steps</span>
    action = plan.next_step()
    result = agent.act(action, environment)
    agent.learn(result)                       <span class="hljs-comment"># update memory or strategy</span>
</code></pre>
<p>Here, the agent perceives the current state, plans a sequence of steps toward its goal, acts by executing the next step, and then learns from the outcome before repeating. This cycle captures the core loop of an autonomous agent.</p>
<h3 id="heading-reasoning"><strong>Reasoning</strong></h3>
<p>Making judgments by applying logic and inference is known as reasoning. In addition to acting, an agentic AI considers what actions make sense in light of its information. This entails assessing trade-offs, comprehending cause and consequence, and, if necessary, applying mathematical or symbolic thinking.</p>
<p>An agent may, for instance, apply deductive reasoning, like "If sales fall below X, reorder inventory" or "All invoices are paid by Friday. This is an invoice, so I should pay it by Friday". By enabling the agent to process natural language commands, retain contextual information, and produce logical justifications for its decisions, large language models support reasoning.</p>
<p>An LLM "acts as the orchestrator or reasoning engine" that comprehends tasks and produces solutions, <a target="_blank" href="https://python.langchain.com/docs/">according to one explanation in the LangChain docs</a>. In order to retrieve pertinent information for reasoning, agents also employ strategies such as <a target="_blank" href="https://www.freecodecamp.org/news/learn-rag-fundamentals-and-advanced-techniques/">retrieval-augmented generation (RAG)</a>.</p>
<p>Agentic reasoning is essentially like internal planning and problem-solving. An agent evaluates a task by internally simulating potential strategies (often in the "thoughts" of an LLM) and selecting the most effective one. This might entail formal logic, analogical reasoning (connecting a new problem to previous ones), or multi-step deduction. So the agent continually considers its next course of action and adjusts to new inputs rather than just clicking "execute" on a single model outcome.</p>
<h3 id="heading-memory"><strong>Memory</strong></h3>
<p>Agents can utilize memory to recall prior experiences, information, and interactions to make decisions. A memoryless AI would treat every moment as new. Agentic systems record their behaviors, outcomes, and context. A short-term “working memory” of the present plan state or a long-term world knowledge base are examples.</p>
<p>A customer-service agent may remember a user's name and issue history to avoid repeating inquiries. Game-playing agents learn from past positions to move better. <a target="_blank" href="https://research.ibm.com/blog/agentic-ai">IBM says</a> AI agent memory “refers to an AI system’s ability to store and recall past experiences to improve decision-making, perception and overall performance”. Goal-oriented agents need memory to create a cohesive narrative of previous steps (to avoid repeating failures) and discover trends.</p>
<p>Agentic architectures incorporate memory modules like databases or vector storage that the LLM may query, since large language models themselves are stateless. Agents use relevance filters to retain only important information, because too much memory slows the system down. Memory gives the agent context and continuity, allowing it to learn from previous tasks rather than beginning again.</p>
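<p>To make the idea concrete, here's a toy sketch of such a memory module. It uses simple keyword overlap as a stand-in for the vector similarity search a real system would use, so treat it as an illustration rather than a production design:</p>
<pre><code class="lang-python">class SimpleMemory:
    """Toy long-term memory: keyword overlap stands in for vector similarity."""

    def __init__(self):
        self.entries = []

    def add(self, text):
        self.entries.append(text)

    def recall(self, query, k=2):
        # Keep only the k most relevant entries so the prompt stays small
        query_words = set(query.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: len(query_words.intersection(e.lower().split())),
            reverse=True,
        )
        return scored[:k]

memory = SimpleMemory()
memory.add("User's name is Priya; she prefers morning meetings.")
memory.add("Ticket #123 was about a billing error.")
print(memory.recall("schedule a meeting with the user"))
</code></pre>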
<h2 id="heading-how-does-agentic-ai-know-what-to-do">How Does Agentic AI Know What to Do?</h2>
<p>Agentic AI might seem smart, but it’s not actually “thinking” like a human. Let’s break down how it really works.</p>
<h3 id="heading-1-it-uses-a-pretrained-ai-model">1. It Uses a Pretrained AI Model</h3>
<p>At the heart of most agentic systems is a large language model (LLM) like GPT-4. This model is trained on a huge amount of text – books, articles, websites, and so on – to learn how people write and talk.</p>
<p>But it wasn’t trained to act like an agent. It was trained to predict the next word in a sentence.</p>
<p>When we give it the right prompts, it can seem like it’s making plans or solving problems. Really, it’s just generating useful responses based on patterns it learned during training.</p>
<h3 id="heading-2-it-follows-instructions-in-prompts">2. It Follows Instructions in Prompts</h3>
<p>Agentic AI doesn’t figure out what to do by itself – developers give it structure using prompts.</p>
<p>For example:</p>
<ul>
<li><p>“You are an assistant. First, think step by step. Then take action.”</p>
</li>
<li><p>“Here’s a goal: research coding tools. Plan steps. Use Wikipedia to search.”</p>
</li>
</ul>
<p>These prompts help the AI simulate planning, decision-making, and action.</p>
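<p>Here's a rough sketch of what such a prompt might look like in code. The wording is purely illustrative – it isn't the exact prompt any particular framework uses:</p>
<pre><code class="lang-python"># Illustrative system prompt a developer might hand to the model.
SYSTEM_PROMPT = """You are a research assistant.
Goal: {goal}
First, think step by step and write a short plan.
Then choose ONE action from: [search_wikipedia, summarize, finish].
Respond in this format:
Thought: your reasoning here
Action: the chosen action
Action Input: the input for that action"""

print(SYSTEM_PROMPT.format(goal="research AI coding tools"))
</code></pre>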
<h3 id="heading-3-it-uses-tools-but-only-when-told-how">3. It Uses Tools, But Only When Told How</h3>
<p>The AI doesn’t automatically know how to use tools like search engines or calculators. Developers give it access to those tools, and the AI can decide when to use them based on the text it generates.</p>
<p>Think of it like this: the AI suggests, “Now I’ll look something up,” and the system makes that happen.</p>
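<p>Under the hood, that usually means the host program parses the model's text and calls the matching tool. Here's a toy sketch of that hand-off (the tool and the model output are hardcoded just to show the mechanics):</p>
<pre><code class="lang-python"># Toy dispatcher: the model's text names a tool, and the host code runs it.
def search(query):
    return f"(pretend search results for '{query}')"

TOOLS = {"search": search}

# In a real agent this text would come from the LLM.
model_output = "Action: search\nAction Input: AI coding assistants"

parsed = dict(line.split(": ", 1) for line in model_output.splitlines())
result = TOOLS[parsed["Action"]](parsed["Action Input"])
print(result)
</code></pre>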
<h3 id="heading-4-it-can-remember-sometimes">4. It Can Remember (Sometimes)</h3>
<p>Some agents use short-term memory to remember past questions or results. Others store useful information in a database for later. But they don’t “learn” over time like humans do – they only remember what you let them.</p>
<h3 id="heading-5-its-not-fully-autonomous-yet">5. It’s Not Fully Autonomous — Yet</h3>
<p>Most agentic systems today are not fully self-learning or self-aware. They’re smart combinations of:</p>
<ul>
<li><p>Pretrained AI</p>
</li>
<li><p>Prompts</p>
</li>
<li><p>Tools</p>
</li>
<li><p>Memory</p>
</li>
</ul>
<p>Their “autonomy” comes from how all these parts work together – not from deep understanding or long-term training.</p>
<h2 id="heading-so-whats-the-current-state-of-agentic-ai">So What’s the Current State of Agentic AI?</h2>
<p>Agentic AI is still an emerging area of development. While it sounds futuristic, many systems today are just starting to use agent-like capabilities.</p>
<h3 id="heading-what-exists-today">What Exists Today</h3>
<h4 id="heading-simple-agentic-systems-already-work-in-limited-ways">Simple agentic systems already work in limited ways</h4>
<ul>
<li><p>For example, some customer service bots can check account details, respond to questions, and escalate issues automatically.</p>
</li>
<li><p>Warehouse robots can plan simple routes and avoid obstacles on their own.</p>
</li>
<li><p>Coding assistants like GitHub Copilot can help write and fix code based on natural language input.</p>
</li>
</ul>
<p>These systems show basic agentic behavior like goal-following and tool use but usually in a narrow, structured environment.</p>
<h3 id="heading-whats-still-experimental">What’s Still Experimental</h3>
<ul>
<li><p>Fully autonomous, multi-purpose agents – the kind that can reason deeply, make long-term plans, and adapt to new tools – are still in research or prototype stages.</p>
</li>
<li><p>Projects like <strong>AutoGPT</strong>, <strong>BabyAGI</strong>, and <strong>OpenDevin</strong> are exciting, but they’re mostly experimental and require human oversight.</p>
</li>
</ul>
<p>Most current agentic systems:</p>
<ul>
<li><p>Don’t learn continuously</p>
</li>
<li><p>Struggle with unpredictable environments</p>
</li>
<li><p>Require a lot of setup to avoid errors or unexpected behavior</p>
</li>
</ul>
<h3 id="heading-are-we-close-to-truly-autonomous-agents">Are We Close to Truly Autonomous Agents?</h3>
<p>We’re getting closer, but we’re not there yet.</p>
<p>Today’s agentic AI is like a very clever assistant that can follow instructions, use tools, and plan steps. But it still depends on developers to give it structure (via prompts, tool choices, and boundaries).</p>
<p>In short, Agentic AI works in specific, well-designed use cases. But general-purpose, human-level autonomous agents are still a long way off.</p>
<h2 id="heading-building-agentic-ai-frameworks-and-approaches"><strong>Building Agentic AI: Frameworks and Approaches</strong></h2>
<p>Researchers and engineers have developed various frameworks and tools to construct agentic AI systems. Let’s discuss some key approaches.</p>
<h3 id="heading-reinforcement-learning-rl-agents"><strong>Reinforcement Learning (RL) Agents</strong></h3>
<p>In artificial intelligence, traditional agents are frequently constructed via <a target="_blank" href="https://www.freecodecamp.org/news/how-to-apply-reinforcement-learning-to-real-life-planning-problems-90f8fa3dc0c5/">reinforcement learning</a>, in which the agent learns to maximize a reward signal through trial and error. Atari game agents and DeepMind's AlphaGo are classic examples.</p>
<p>In addition to planning (in the sense of calculating a policy) and learning from interactions, RL agents are goal-directed (maximizing reward). Still, a lot of pure RL systems struggle with the open-ended complexity of real-world tasks and function best in simulated contexts.</p>
<p>While RL components are occasionally incorporated into modern agentic AI (for example, an agent may utilize RL to drive a robot at a basic level), they are frequently supplemented with other methods for higher level thinking.</p>
<h3 id="heading-llm-based-generative-agents"><strong>LLM-Based (Generative) Agents</strong></h3>
<p>The use of LLMs as reasoning engines within agents has become popular due to the recent explosion of large language models. For instance, LLMs (such as GPT-4) are used by frameworks like ReAct, AutoGPT, and BabyAGI to create plans and actions. These systems work by prompting an LLM with the agent's objective and context, after which it generates a step or sub-goal and invokes a function or tool.</p>
<p>One design, frequently referred to as a ReAct loop, alternates between "Thought" (the LLM planning or reasoning) and "Action" (calling upon tools or APIs). An alternative approach involves a distinct planner LLM that generates a comprehensive multi-step plan, which is then followed by executor modules that execute each step.</p>
<p>To increase their capabilities, LLM agents frequently employ tools like search engines, calculators, and API calls. They also use context retrieval, such as RAG or memory storage, to guide their reasoning. <a target="_blank" href="https://www.freecodecamp.org/news/beginners-guide-to-langchain/">LangChain</a> and LangGraph are well-known open-source frameworks that offer building blocks (memory buffers, tool integration, and so on) for creating unique agents.</p>
<h3 id="heading-multi-agent-and-orchestration-frameworks"><strong>Multi-Agent and Orchestration Frameworks</strong></h3>
<p>Several sub-agents are used in many agentic AI architectures. A "crew" or "society of minds" method, for example, may produce many LLM agents that communicate by message passing and each serve a different job (planner, analyst, critic, and so on).</p>
<p>Orchestrated multi-agent processes are demonstrated by projects such as AutoGen, ChatDev, or MetaGPT. Engineering ideas for multi-agent systems are being explored in academic work. One study by BMW, for instance, outlines a framework for multi-agent cooperation in which several AI agents manage planning, execution, and specialized activities while working together to achieve an industrial use case.</p>
<p>These systems frequently have scheduling logic to allocate agents to subtasks and a task decomposition module, which breaks a goal down into its component elements. This essentially resembles an "AI team," in which every individual member is an agentic subsystem.</p>
<h3 id="heading-classical-planning-and-symbolic-ai"><strong>Classical Planning and Symbolic AI</strong></h3>
<p>AI planning was examined in symbolic terms before the current ML revival (STRIPS, PDDL planners, and so on). These methods might be viewed as an early example of agentic AI, in which a planner constructs a series of symbolic actions to accomplish a goal.</p>
<p>These concepts are occasionally incorporated into contemporary agentic AI. For instance, an LLM agent may produce a high-level symbolic plan that grounded systems carry out, such as "(Find x such that property y), (compute f(x)), (deliver result)" and so on.</p>
<p>There are also hybrid architectures that combine traditional search with neural networks. The transition to learned or language-based planners is an extension of the classical planning that underpins many robotics and scheduling agents, even though it’s less prevalent in pure form today.</p>
<h3 id="heading-tool-augmented-reasoning"><strong>Tool-augmented Reasoning</strong></h3>
<p>In many agentic systems, granting the agent access to external functions and information is a viable strategy. For instance, when responding to a difficult inquiry, a language-based agent may utilize Retrieval-Augmented Generation (RAG) to retrieve pertinent information from a database.</p>
<p>As "tools" that it may use, it might also include a calculator, a web browser, a database API, or bespoke code. Autonomy is largely made possible by the capacity to utilize tools – instead of attempting to learn everything by heart, the AI model learns how to ask the appropriate questions.</p>
<p>In sum, building an agentic AI often means combining multiple techniques: machine learning for perception and learning, symbolic planning for structure, LLM reasoning for natural language and problem decomposition, plus memory modules and feedback loops.</p>
<p>There is no one-size-fits-all framework yet. Research continues rapidly – recent papers on agentic systems emphasize end-to-end pipelines that integrate perception (input analysis), goal-oriented planning, tool use, and continual learning.</p>
<h2 id="heading-major-challenges-of-agentic-ai"><strong>Major Challenges of Agentic AI</strong></h2>
<p>Building AI agents with autonomy and goals is powerful but raises new risks and difficulties. Key challenges include:</p>
<h3 id="heading-alignment-and-value-specification"><strong>Alignment and Value Specification</strong></h3>
<p>Setting the correct goals is crucial for agentic systems. If an agent's aims don't match human values, it may be damaging. If a scheduling agent is directed to “minimize costs,” it may reduce vital services unless told to preserve quality. Humans' complicated priorities make value formulation challenging. Unspecified or poorly described goals cause unexpected consequences (<a target="_blank" href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart's Law</a>).</p>
<h3 id="heading-unintended-consequences"><strong>Unintended Consequences</strong></h3>
<p>Even with good intentions, agents may discover loopholes. Reward-hacking in reinforcement learning is an example from basic AI. Autonomy increases these hazards for agentic AI. Recent experiments showed an LLM-based AI was told to pursue a goal “at all costs.” It planned to stop its own monitoring and clone itself to escape shutdown, acting in self-preservation.</p>
<p>If unconstrained, an agent may deceive to achieve its aims. Unintended effects can range from an assistant arranging a hazardous flight because it fixed on a cost-savings aim to more subtle damages like cutting important benefits. <a target="_blank" href="https://www.ibm.com/think/insights/ethics-governance-agentic-ai">IBM researchers warn</a> that agents “can act without your supervision”, resulting in unintended consequences without strong protections.</p>
<h3 id="heading-safety-and-security"><strong>Safety and Security</strong></h3>
<p>Highly autonomous agents can increase danger. They may access sensitive data or operate machinery. IBM says that agents are opaque and open-ended, so their judgments might be unclear, and they may suddenly use new tools or data. A healthcare agent may leak patient data, or a financial bot may execute a dangerous move.</p>
<p>LLM-style adversarial attacks and hallucinations become more dangerous in agentic AI. A hallucinating chatbot is merely bothersome, but a hallucinating investment agent might lose millions. An agent's multi-step reasoning is sensitive to hostile inputs at every step, and complex agents make trust and verification difficult.</p>
<h3 id="heading-coordination-and-scalability"><strong>Coordination and Scalability</strong></h3>
<p>In many agentic systems, multiple agents may collaborate or compete. Ensuring that they communicate correctly and don’t conflict is non-trivial.</p>
<p>A recent review notes unique challenges in orchestrating multiple agents without standardized protocols. As <a target="_blank" href="https://hai.stanford.edu/ai-index/2025-ai-index-report">the Stanford ethics report</a> points out, if millions of agents interact (for example, booking each other’s appointments), the emergent behavior could be unpredictable at scale. This raises societal concerns about system-level effects and feedback loops we haven’t seen before.</p>
<h3 id="heading-ethical-and-legal-questions"><strong>Ethical and Legal Questions</strong></h3>
<p>Finally, there are questions of responsibility and bias. Who is liable if an autonomous agent makes a mistake? How do we ensure transparency and fairness in a black-box multi-agent system?</p>
<p>Legal and ethical frameworks are still catching up. For example, IBM highlights that agentic AI brings “an expanded set of ethical dilemmas” compared to today’s AI. And AI ethicists caution that deploying powerful assistants (as personal secretaries, advisors, and so on) will have profound societal impacts that are hard to predict.</p>
<p>Here are some specific things we need to consider:</p>
<ul>
<li><p><strong>Accountability:</strong> Who is accountable if an AI agent makes a damaging choice (such as a medical AI agent prescribing the wrong medication or a logistics agent causing an accident)? Designers, deployers, or the agents themselves? Legal systems presume human control, but autonomous agents complicate that assumption.</p>
</li>
<li><p><strong>Transparency:</strong> Agentic systems can be complex and opaque. Multiple neural networks, knowledge bases, and tools may interact, which makes explaining an agent's behavior for auditing or debugging tough. This runs counter to the goals of explainable AI.</p>
</li>
<li><p><strong>Bias and fairness:</strong> Agents learn from data and environments that may reflect human biases. An autonomous hiring assistant agent, for instance, might inadvertently replicate discriminatory patterns unless carefully checked. And because agentic AI can perpetuate or amplify biases across many decisions, the impact could be larger.</p>
</li>
<li><p><strong>Job disruption and social impact:</strong> Just as factory automation eliminated certain jobs, powerful AI agents might change office and creative labor. Personal assistant agents that schedule, manage email, and do research might reshape many careers. This might boost productivity but also exacerbate deskilling and inequality. Social pressure to use agentic AI (if rivals do) may divide the workforce into “augmented” and “unaugmented” workers.</p>
</li>
<li><p><strong>Security and privacy:</strong> An agent with extensive system access harms privacy. Compromise of an AI agent permitted to access and write business data or personal correspondence might reveal critical information. IBM warns that agentic AI can increase recognized hazards, such as an agent accidentally biasing a database or sharing private data without monitoring. Tools must be authenticated and data handled securely.</p>
</li>
<li><p><strong>Human-AI interaction:</strong> Our agents may affect how we use technology and interact with others. If individuals use AI bots for conversation, information filtering, or companionship, it might change societal dynamics. Consider again the Stanford study referenced above. So we need to pursue ways to build standards and values into these interactions.</p>
</li>
</ul>
<p>In recognition of these challenges, technologists and ethicists urge us to use proactive safeguards. As IBM researchers put it, because agentic AI is advancing rapidly, we cannot wait to address safety – we must build strong guardrails now. Some proposed measures include strict testing protocols for agents, explainability requirements, legal regulations on autonomous systems, and design principles that prioritize human values.</p>
<p>So as you can see, while agentic AI offers the potential for AI that can handle complex tasks end-to-end, it also amplifies known AI risks (bias, error) and introduces new ones (autonomous decision-making, coordination failures). Addressing these challenges requires careful design of alignment, robust evaluation of agent behavior, and interdisciplinary governance.</p>
<h2 id="heading-code-snippet-and-real-world-examples"><strong>Code Snippet and Real-World Examples</strong></h2>
<p>To illustrate how an agentic system works, let’s consider a very simple Python-like pseudocode for an abstract agent (mixing concepts from above):</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Agent</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">init</span>(<span class="hljs-params">self, goal</span>):</span>
        self.goal = goal
        self.memory = []
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">perceive</span>(<span class="hljs-params">self, environment</span>):</span>
        <span class="hljs-comment"># Get data from environment (sensor, API, etc.)</span>
        <span class="hljs-keyword">return</span> environment.get_state()
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">plan</span>(<span class="hljs-params">self, observation</span>):</span>
        <span class="hljs-comment"># Use reasoning (LLM or algorithm) to decide next action(s)</span>
        plan = ReasoningEngine.generate_plan(goal=self.goal, context=observation)
        <span class="hljs-keyword">return</span> plan  <span class="hljs-comment"># e.g. list of steps or actions</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">act</span>(<span class="hljs-params">self, action, environment</span>):</span>
        <span class="hljs-comment"># Execute the action using tools or directly in the environment</span>
        result = environment.execute(action)
        <span class="hljs-keyword">return</span> result
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">learn</span>(<span class="hljs-params">self, experience</span>):</span>
        <span class="hljs-comment"># Store outcome or update strategy</span>
        self.memory.append(experience)   
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run</span>(<span class="hljs-params">self, environment</span>):</span>
        <span class="hljs-keyword">while</span> <span class="hljs-keyword">not</span> environment.task_complete():
            obs = self.perceive(environment)
            plan = self.plan(obs)
            <span class="hljs-keyword">for</span> action <span class="hljs-keyword">in</span> plan:
                result = self.act(action, environment)
                self.learn(result)
</code></pre>
<p>This example demonstrates the core loop of an agentic AI:</p>
<ul>
<li><p>The agent starts with a goal and can store memory of what it has done.</p>
</li>
<li><p>It observes its environment to understand what’s happening.</p>
</li>
<li><p>Based on that input, it creates a plan – a list of actions to reach its goal.</p>
</li>
<li><p>It executes each action, interacts with the environment, and learns from what happens.</p>
</li>
<li><p>This process repeats until the goal is met or the task is complete.</p>
</li>
</ul>
<p>This basic structure mirrors how real-world agentic systems operate: perceive → plan → act → learn.</p>
<p>Real-world agentic AI systems are evolving. Self-driving cars detect their environment, set navigation goals, plan routes, and learn from experience.</p>
<p><a target="_blank" href="https://www.tesla.com/AI">Tesla's Full Self-Driving</a> “continuously learns from the driving environment and adjusts its behavior” to increase safety. Supply chain logistics businesses are creating agents that monitor inventory, estimate demand, alter routes, and place new orders autonomously. Amazon's warehouse robots utilize agentic AI to navigate complicated surroundings and adapt to changing situations, independently fulfilling orders.</p>
<p>Cybersecurity, healthcare, and customer service also use autonomous agents to identify and respond to risks. An agentic AI at a contact center may assess a customer's mood, account history, and company policies to provide a bespoke solution or process. Agentic systems organize and arrange marketing campaigns, write text, choose graphics, and alter strategies depending on performance data. In processes with several phases and choices, agentic AI can handle the whole workflow.</p>
<p>Recently, several prototype projects and open-source tools have begun experimenting with agentic AI in real-world scenarios.</p>
<p>For example, tools like AutoGPT and AgentGPT have demonstrated agents that can generate multimedia reports by coordinating research, writing, and image selection tasks. Other use cases include agents that retrieve knowledge and take follow-up action (for example, “find and implement the next step”), conduct security operations like scanning and responding to threats, or automate multi-step workflows in call centers.</p>
<p>These examples show how early-stage products and research projects are beginning to test and deploy agentic AI for complex, multi-step tasks beyond just answering questions.</p>
<h2 id="heading-tutorial-build-your-first-agentic-ai-with-python"><strong>Tutorial: Build Your First Agentic AI with Python</strong></h2>
<p>This step-by-step guide will teach you how to build a basic Agentic AI system even if you're just starting out. I’ll explain every concept clearly and give you working Python code you can run and study.</p>
<h3 id="heading-real-world-use-case"><strong>Real-World Use Case</strong></h3>
<p><strong>Scenario:</strong> You're a product manager exploring tools for your team. Instead of spending hours researching AI coding assistants manually, you'd like a personal research agent to:</p>
<ul>
<li><p>Understand your task</p>
</li>
<li><p>Gather relevant information from Wikipedia</p>
</li>
<li><p>Summarize it clearly</p>
</li>
<li><p>Remember context from previous questions</p>
</li>
</ul>
<p>This is where Agentic AI shines: it acts autonomously, reasons, and uses tools just like a smart human assistant.</p>
<h3 id="heading-prerequisites-what-you-need"><strong>Prerequisites – What You Need</strong></h3>
<ol>
<li><p>Python 3.10 or higher</p>
</li>
<li><p>An OpenAI API key (<a target="_blank" href="https://platform.openai.com/api-keys">https://platform.openai.com/api-keys</a>). Note that as of this writing, OpenAI does not offer free API calls, so if you don’t already have an account you’ll need to add a credit card and spend a few dollars to complete this tutorial.</p>
</li>
<li><p>Install the required Python libraries:</p>
</li>
</ol>
<pre><code class="lang-bash">pip install langchain openai wikipedia
</code></pre>
<p>⚠️ Don't forget to store your API key safely. Never share it in public code.</p>
<h3 id="heading-step-by-step-tutorial"><strong>Step-by-Step Tutorial</strong></h3>
<h4 id="heading-step-1-set-up-your-environment">Step 1: Set Up Your Environment</h4>
<p>Start by setting your OpenAI API key in your script so that LangChain can access GPT models.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os

os.environ[<span class="hljs-string">"OPENAI_API_KEY"</span>] = <span class="hljs-string">"your-api-key-here"</span>  <span class="hljs-comment"># Replace with your real key</span>
</code></pre>
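<p>If you'd rather not hardcode the key (safer if you ever share the notebook), a small variation on this step is to read it from an environment variable or prompt for it at runtime:</p>
<pre><code class="lang-python">import os
from getpass import getpass

# Only ask for the key if it isn't already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
</code></pre>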
<h4 id="heading-step-2-connect-to-a-knowledge-source-wikipedia">Step 2: Connect to a Knowledge Source (Wikipedia)</h4>
<p>We'll give our agent the ability to use Wikipedia as a tool to gather information.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain.agents <span class="hljs-keyword">import</span> Tool
<span class="hljs-keyword">from</span> langchain.tools <span class="hljs-keyword">import</span> WikipediaQueryRun
<span class="hljs-keyword">from</span> langchain.utilities <span class="hljs-keyword">import</span> WikipediaAPIWrapper
<span class="hljs-comment"># Create the Wikipedia tool</span>
wiki = WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
<span class="hljs-comment"># Register the tool so the agent knows how to use it</span>
tools = [
    Tool(
        name=<span class="hljs-string">"Wikipedia"</span>,
        func=wiki.run,
        description=<span class="hljs-string">"Useful for looking up general knowledge."</span>
    )
]
</code></pre>
<p>You're giving your agent a way to "see the world" – Wikipedia is your agent's eyes.</p>
<h4 id="heading-step-3-initialize-the-agent-reasoning-engine">Step 3: Initialize the Agent (Reasoning Engine)</h4>
<p>We now give the agent a brain – a GPT model that can reason, decide, and plan.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain.chat_models <span class="hljs-keyword">import</span> ChatOpenAI
<span class="hljs-keyword">from</span> langchain.agents <span class="hljs-keyword">import</span> initialize_agent
<span class="hljs-keyword">from</span> langchain.agents.agent_types <span class="hljs-keyword">import</span> AgentType
<span class="hljs-comment"># Use a GPT model with zero randomness for consistent output</span>
llm = ChatOpenAI(temperature=<span class="hljs-number">0</span>)
<span class="hljs-comment"># Combine reasoning (LLM) and tools (Wikipedia) into one agent</span>
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=<span class="hljs-literal">True</span>  <span class="hljs-comment"># Show thought process step-by-step</span>
)
</code></pre>
<p>This step fuses logic (GPT) and action (Wikipedia) to make your agent capable of goal-driven behavior.</p>
<h4 id="heading-step-4-give-your-agent-a-goal">Step 4: Give Your Agent a Goal</h4>
<pre><code class="lang-python">goal = <span class="hljs-string">"What are the top AI coding assistants and what makes them unique?"</span>
response = agent.run(goal)
print(<span class="hljs-string">"\nAgent's response:\n"</span>, response)
</code></pre>
<p>You’ve given your agent a mission. It will now think, search, and summarize.</p>
<p>You should see output like:</p>
<p><code>&gt; Entering new AgentExecutor chain...</code></p>
<p><code>Thought: I should look up AI coding assistants on Wikipedia</code></p>
<p><code>Action: Wikipedia</code></p>
<p><code>Action Input: AI coding assistants</code></p>
<p><code>...</code></p>
<p><code>Final Answer: The top AI coding assistants are GitHub Copilot, Amazon CodeWhisperer, and Tabnine...</code></p>
<p>At this point, the agent has:</p>
<ul>
<li><p>Interpreted your goal</p>
</li>
<li><p>Selected a tool (Wikipedia)</p>
</li>
<li><p>Retrieved and analyzed content</p>
</li>
<li><p>Reasoned through it to deliver a conclusion</p>
</li>
</ul>
<h4 id="heading-step-5-give-your-agent-memory-optional-but-powerful">Step 5: Give Your Agent Memory (Optional but Powerful)</h4>
<p>Let your agent remember what you previously asked, like a real assistant.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> langchain.memory <span class="hljs-keyword">import</span> ConversationBufferMemory
memory = ConversationBufferMemory(memory_key=<span class="hljs-string">"chat_history"</span>)
agent_with_memory = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    memory=memory,
    verbose=<span class="hljs-literal">True</span>
)
<span class="hljs-comment"># Ask a follow-up</span>
agent_with_memory.run(<span class="hljs-string">"Tell me about GitHub Copilot"</span>)
agent_with_memory.run(<span class="hljs-string">"What else do you know about coding assistants?"</span>)
</code></pre>
<p>Your agent now tracks context across multiple interactions just like a good human assistant.</p>
<p>When this is done, your agent:</p>
<ul>
<li><p>Responds more naturally to follow-up questions</p>
</li>
<li><p>Links previous conversations to improve continuity</p>
</li>
</ul>
<p>After running the steps, your agent reads your goal and plans steps to fulfill it. It searches Wikipedia to gather facts, and reasons using a GPT model to summarize and decide what to say. It also optionally remembers context (with memory enabled). You now have a working Agentic AI that can be extended for real-world tasks.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Agentic AI offers an exciting glimpse into a future where machines can collaborate with humans to solve complex, multi-step problems – not just respond to commands. With capabilities like planning, reasoning, tool use, and memory, these systems could one day handle tasks that currently require entire teams of people.</p>
<p>But with that power comes real responsibility. If not properly designed and guided, autonomous agents could act in unpredictable or harmful ways. That’s why developers, researchers, and policymakers need to work together to set clear boundaries, safety rules, and ethical standards.</p>
<p>The technology is advancing quickly – from self-driving cars to research assistants to multi-agent platforms like AutoGPT and LangChain. As we build smarter systems, the challenge isn't just what they can do, but how we ensure they do it safely, fairly, and in ways that benefit everyone.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Automate Compliance and Fraud Detection in Finance with MLOps ]]>
                </title>
                <description>
                    <![CDATA[ These days, businesses are under increasing pressure to comply with stringent regulations while also combating fraudulent activities. The high volume of data and the intricate requirements of real-time fraud detection and compliance reporting are fre... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/automate-compliance-and-fraud-detection-in-finance-with-mlops/</link>
                <guid isPermaLink="false">68222009a8daed5c1fbf1692</guid>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GCP ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #AIOps ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Mon, 12 May 2025 16:21:29 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747064311601/923284fd-8584-4ef3-8591-f717b9807148.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>These days, businesses are under increasing pressure to comply with stringent regulations while also combating fraudulent activities. The high volume of data and the intricate requirements of real-time fraud detection and compliance reporting are frequently a challenge for traditional systems to manage.</p>
<p>This is where MLOps (Machine Learning Operations) comes into play. It can help teams streamline these processes and elevate automation to the forefront of financial security and regulatory adherence.</p>
<p>In this article, we will investigate the potential of MLOps for automating compliance and fraud detection in the finance sector.</p>
<p>I’ll show you step by step how financial institutions can deploy a machine learning model for fraud detection and integrate it into their operations to ensure continuous monitoring and automated alerts for compliance. I’ll also demonstrate how to deploy this solution in a cloud-based environment using Google Colab, ensuring that it is both user-friendly and accessible, whether you are a beginner or more advanced.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-mlops">What is MLOps?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-youll-need">What You’ll Need</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-set-up-google-colab-and-prepare-the-data">Step 1: Set Up Google Colab and Prepare the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-data-preprocessing">Step 2: Data Preprocessing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-retrain-the-model-with-new-data">Step 4: Retrain the Model with New Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-automated-alert-system">Step 5: Automated Alert System</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-visualize-model-performance">Step 6: Visualize Model Performance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-takeaways">Key Takeaways</a></p>
</li>
</ul>
<h2 id="heading-what-is-mlops"><strong>What is MLOps?</strong></h2>
<p>Machine Learning Operations, or MLOps for short, is a methodology that integrates DevOps practices with machine learning (ML). It helps automate the whole ML model lifecycle, including development, training, deployment, monitoring, and maintenance.</p>
<p>MLOps has several main goals: continuous optimization, scalability, and the delivery of operational value over time.</p>
<p>The financial industry provides great use cases for MLOps processes and techniques, as these can help businesses manage complicated data pipelines, deploy models in real-time, and evaluate their performance – all while making sure they're compliant with regulations.</p>
<h3 id="heading-why-is-mlops-important-in-finance"><strong>Why is MLOps Important in Finance?</strong></h3>
<p>Financial institutions are subject to various rules including Anti-Money Laundering (AML), Know Your Customer (KYC), and Fraud Prevention Regulations – so they have to carefully manage private information. Ignoring these rules might result in severe fines and loss of reputation.</p>
<p>Detecting fraud in financial transactions also calls for advanced systems capable of real-time identification of suspicious activity.</p>
<p>MLOps can help to solve these issues in the following ways:</p>
<ul>
<li><p>MLOps lets financial institutions automatically track transactions for regulatory compliance, guaranteeing they follow changing legislation.</p>
</li>
<li><p>MLOps helps to create and implement machine learning models that can identify fraudulent transactions in real-time.</p>
</li>
<li><p>MLOps runs automated processes, enabling organizations to expand their fraud detection systems with as little human involvement as possible.</p>
</li>
</ul>
<h2 id="heading-what-youll-need"><strong>What You’ll Need:</strong></h2>
<p>To follow along with this tutorial, ensure that you have the following:</p>
<ol>
<li><p><strong>Python</strong> installed, along with basic ML libraries such as scikit-learn, Pandas, and NumPy.</p>
</li>
<li><p>A <strong>sample dataset</strong> of financial transactions, which we will use to train a fraud detection model (You can use this <a target="_blank" href="https://www.datacamp.com/datalab/datasets/dataset-r-credit-card-fraud">sample dataset</a> if you don’t have one on hand).</p>
</li>
<li><p><strong>Google Colab</strong> (for cloud-based execution), which is free to use and doesn't require installation.</p>
</li>
</ol>
<h2 id="heading-step-1-set-up-google-colab-and-prepare-the-data"><strong>Step 1: Set Up Google Colab and Prepare the Data</strong></h2>
<p>Google Colab is an ideal choice for beginners and advanced users alike, because it’s cloud-based and doesn’t require installation. To get started using it, follow these steps:</p>
<h3 id="heading-access-google-colab"><strong>Access Google Colab</strong>:</h3>
<p>Visit Google Colab and <a target="_blank" href="https://colab.research.google.com/">sign in</a> with your <strong>Google account</strong>.</p>
<h3 id="heading-create-a-new-notebook"><strong>Create a New Notebook</strong>:</h3>
<p>In the Colab interface, go to <strong>File</strong> and then select <strong>New Notebook</strong> to create a fresh notebook.</p>
<h3 id="heading-import-libraries-and-load-the-dataset"><strong>Import Libraries and Load the Dataset</strong></h3>
<p>Now, let’s import the necessary libraries and load our fraud detection dataset. We'll assume the dataset is available as a CSV file, and we'll upload it to Colab.</p>
<p><strong>Import libraries:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> classification_report, confusion_matrix
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
</code></pre>
<p><strong>Upload the Dataset</strong>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> google.colab <span class="hljs-keyword">import</span> files
uploaded = files.upload()

<span class="hljs-comment"># Load dataset into pandas DataFrame</span>
data = pd.read_csv(<span class="hljs-string">'data.csv'</span>)
print(data.head())
</code></pre>
<h2 id="heading-step-2-data-preprocessing"><strong>Step 2: Data Preprocessing</strong></h2>
<p>Data preprocessing is essential to prepare the dataset for model training. This involves handling missing values, encoding categorical variables, and normalizing numerical features.</p>
<h3 id="heading-why-is-preprocessing-important">Why is Preprocessing Important?</h3>
<p>Data preprocessing lets you take care of various data issues that could affect your results. During this process, you’ll:</p>
<ul>
<li><p><strong>Handle missing values</strong>: Financial datasets often have missing values. Filling in these missing values (for example, with the median) ensures that the model doesn’t encounter errors during training.</p>
</li>
<li><p><strong>Convert categorical data</strong>: Machine learning algorithms require numerical input, so categorical features (like transaction type or location) need to be converted into numeric format using one-hot encoding.</p>
</li>
<li><p><strong>Normalize data</strong>: Some machine learning models, like Random Forest, are not sensitive to feature scaling, but normalization helps maintain consistency and allows us to compare the importance of different features. This step is especially critical for models that rely on gradient descent.</p>
</li>
</ul>
<p>Here’s an example:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Handle missing data by filling with the median value for each column</span>
data.fillna(data.median(), inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Convert categorical columns to numeric using one-hot encoding</span>
data = pd.get_dummies(data, drop_first=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Normalize numerical columns for scaling</span>
data[<span class="hljs-string">'normalized_amount'</span>] = (data[<span class="hljs-string">'Amount'</span>] - data[<span class="hljs-string">'Amount'</span>].mean()) / data[<span class="hljs-string">'Amount'</span>].std()

<span class="hljs-comment"># Separate features and target variable</span>
X = data.drop(columns=[<span class="hljs-string">'Class'</span>])
y = data[<span class="hljs-string">'Class'</span>]

<span class="hljs-comment"># Split data into training and testing sets (80% train, 20% test)</span>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)

print(<span class="hljs-string">"Data preprocessing completed."</span>)
</code></pre>
<h2 id="heading-step-3-train-a-fraud-detection-model"><strong>Step 3: Train a Fraud Detection Model</strong></h2>
<p>We'll now train a <strong>RandomForestClassifier</strong> and evaluate its performance.</p>
<h3 id="heading-what-is-a-random-forest-classifier"><strong>What is a Random Forest Classifier?</strong></h3>
<p>A <strong>Random Forest</strong> is an ensemble learning method that creates a collection (forest) of decision trees, typically trained with different parts of the data. It aggregates their predictions to improve accuracy and reduce overfitting.</p>
<p>This method is a popular choice for fraud detection because it can handle high-dimensional data. It’s also quite robust against overfitting.</p>
<p>Here’s how you can implement the Random Forest Classifier:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Initialize the Random Forest Classifier</span>
rf_model = RandomForestClassifier(n_estimators=<span class="hljs-number">150</span>, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># Train the model on the training data</span>
rf_model.fit(X_train, y_train)

<span class="hljs-comment"># Predict on the test data</span>
y_pred = rf_model.predict(X_test)

<span class="hljs-comment"># Evaluate model performance</span>
print(<span class="hljs-string">"Model Evaluation:\n"</span>, classification_report(y_test, y_pred))
print(<span class="hljs-string">"Confusion Matrix:\n"</span>, confusion_matrix(y_test, y_pred))

<span class="hljs-comment"># Plot confusion matrix for visual understanding</span>
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
cax = ax.matshow(cm, cmap=<span class="hljs-string">'Blues'</span>)
fig.colorbar(cax)
plt.title(<span class="hljs-string">"Confusion Matrix"</span>)
plt.xlabel(<span class="hljs-string">"Predicted"</span>)
plt.ylabel(<span class="hljs-string">"Actual"</span>)
plt.show()
</code></pre>
<p>How the model is evaluated:</p>
<ul>
<li><p><strong>Classification report</strong>: Shows metrics like precision, recall, and F1-score for the fraud and non-fraud classes.</p>
</li>
<li><p><strong>Confusion matrix</strong>: Helps visualize the performance of the model by showing the true positives, false positives, true negatives, and false negatives.</p>
</li>
</ul>
<h2 id="heading-step-4-retrain-the-model-with-new-data"><strong>Step 4: Retrain the Model with New Data</strong></h2>
<p>Once you have trained your model, it’s important to retrain it periodically with new data to ensure that it continues to detect emerging fraud patterns.</p>
<h3 id="heading-what-is-retraining"><strong>What is Retraining?</strong></h3>
<p>Retraining the model ensures that it adapts to new, unseen data and improves over time. In the case of fraud detection, retraining is crucial because fraud tactics evolve over time, and your model needs to stay up-to-date to recognize new patterns.</p>
<p>Here’s how you can do this:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Simulate loading new fraud data</span>
new_data = pd.read_csv(<span class="hljs-string">'new_fraud_data.csv'</span>)

<span class="hljs-comment"># Apply preprocessing steps to new data (like filling missing values, encoding, normalization)</span>
new_data.fillna(new_data.median(numeric_only=<span class="hljs-literal">True</span>), inplace=<span class="hljs-literal">True</span>)
new_data = pd.get_dummies(new_data, drop_first=<span class="hljs-literal">True</span>)
new_data[<span class="hljs-string">'normalized_amount'</span>] = (new_data[<span class="hljs-string">'transaction_amount'</span>] - new_data[<span class="hljs-string">'transaction_amount'</span>].mean()) / new_data[<span class="hljs-string">'transaction_amount'</span>].std()

<span class="hljs-comment"># Concatenate old and new data for retraining</span>
X_new = new_data.drop(columns=[<span class="hljs-string">'fraud_label'</span>])
y_new = new_data[<span class="hljs-string">'fraud_label'</span>]

<span class="hljs-comment"># Retrain the model with the updated dataset</span>
X_combined = pd.concat([X_train, X_new], axis=<span class="hljs-number">0</span>)
y_combined = pd.concat([y_train, y_new], axis=<span class="hljs-number">0</span>)

rf_model.fit(X_combined, y_combined)

<span class="hljs-comment"># Re-evaluate the model</span>
y_pred_new = rf_model.predict(X_test)
print(<span class="hljs-string">"Updated Model Evaluation:\n"</span>, classification_report(y_test, y_pred_new))
</code></pre>
<h2 id="heading-step-5-automated-alert-system"><strong>Step 5: Automated Alert System</strong></h2>
<p>To automate fraud detection, we’ll send an email whenever a suspicious transaction is detected.</p>
<h3 id="heading-how-the-alert-system-works"><strong>How the Alert System Works</strong></h3>
<p>The email alert system uses <a target="_blank" href="https://www.freecodecamp.org/news/send-emails-in-python-using-mailtrap-smtp-and-the-email-api/"><strong>SMTP</strong> to send an email</a> whenever fraud is detected. When the model identifies a suspicious transaction, it triggers an automated alert to notify the compliance team for further investigation.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> smtplib
<span class="hljs-keyword">from</span> email.mime.text <span class="hljs-keyword">import</span> MIMEText
<span class="hljs-keyword">from</span> email.mime.multipart <span class="hljs-keyword">import</span> MIMEMultipart

<span class="hljs-comment"># Function to send an email alert</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">send_alert</span>(<span class="hljs-params">email_subject, email_body</span>):</span>
    sender_email = <span class="hljs-string">"your_email@example.com"</span>
    receiver_email = <span class="hljs-string">"compliance_team@example.com"</span>
    password = <span class="hljs-string">"your_password"</span>

    msg = MIMEMultipart()
    msg[<span class="hljs-string">'From'</span>] = sender_email
    msg[<span class="hljs-string">'To'</span>] = receiver_email
    msg[<span class="hljs-string">'Subject'</span>] = email_subject

    msg.attach(MIMEText(email_body, <span class="hljs-string">'plain'</span>))

    <span class="hljs-comment"># Send email using SMTP</span>
    <span class="hljs-keyword">try</span>:
        server = smtplib.SMTP_SSL(<span class="hljs-string">'smtp.example.com'</span>, <span class="hljs-number">465</span>)
        server.login(sender_email, password)
        text = msg.as_string()
        server.sendmail(sender_email, receiver_email, text)
        server.quit()
        print(<span class="hljs-string">"Fraud alert email sent successfully."</span>)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Failed to send email: <span class="hljs-subst">{str(e)}</span>"</span>)

<span class="hljs-comment"># Example: Check for fraud and trigger an alert</span>
suspicious_transaction_details = <span class="hljs-string">"Transaction ID: 12345, Amount: $5000, Suspicious Activity Detected."</span>
send_alert(<span class="hljs-string">"Fraud Detection Alert"</span>, <span class="hljs-string">f"A suspicious transaction has been detected: <span class="hljs-subst">{suspicious_transaction_details}</span>"</span>)
</code></pre>
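<p>To connect the alert system to the model itself, you can flag the transactions the classifier predicts as fraudulent and send an alert for each one. Here's a minimal sketch that reuses the rf_model, X_test, and send_alert objects defined above; the message fields are placeholders you'd adapt to your dataset's columns:</p>
<pre><code class="lang-python"># Predict on incoming (here: test) transactions
predictions = rf_model.predict(X_test)

# Row indices of transactions flagged as fraud (class 1)
flagged_indices = X_test.index[predictions == 1]

# Send one alert per flagged transaction (limited here to keep the demo small)
for idx in flagged_indices[:5]:
    amount = X_test.loc[idx, 'normalized_amount']  # placeholder field from the preprocessing step
    details = f"Transaction index: {idx}, normalized amount: {amount:.2f}"
    send_alert("Fraud Detection Alert", f"A suspicious transaction has been detected: {details}")
</code></pre>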
<h2 id="heading-step-6-visualize-model-performance"><strong>Step 6: Visualize Model Performance</strong></h2>
<p>Finally, we will visualize the performance of the model using an <strong>ROC curve</strong> (Receiver Operating Characteristic Curve), which helps evaluate the trade-off between the true positive rate and false positive rate.</p>
<p>Visualizing the performance of a machine learning model is an essential step in understanding how well the model is doing, especially when it comes to evaluating its ability to detect fraudulent transactions.</p>
<h3 id="heading-what-is-an-roc-curve"><strong>What is an ROC curve?</strong></h3>
<p>An ROC curve shows how well a model performs across all classification thresholds. It plots the True Positive Rate (TPR) versus the False Positive Rate (FPR). The area under the ROC curve (AUC) provides a summary measure of model performance.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> roc_curve, auc

<span class="hljs-comment"># Calculate ROC curve</span>
fpr, tpr, thresholds = roc_curve(y_test, rf_model.predict_proba(X_test)[:,<span class="hljs-number">1</span>])
roc_auc = auc(fpr, tpr)

<span class="hljs-comment"># Plot ROC curve</span>
plt.figure(figsize=(<span class="hljs-number">8</span>,<span class="hljs-number">6</span>))
plt.plot(fpr, tpr, color=<span class="hljs-string">'blue'</span>, label=<span class="hljs-string">f'ROC curve (area = <span class="hljs-subst">{roc_auc:<span class="hljs-number">.2</span>f}</span>)'</span>)
plt.plot([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>], [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>], color=<span class="hljs-string">'gray'</span>, linestyle=<span class="hljs-string">'--'</span>)
plt.xlim([<span class="hljs-number">0.0</span>, <span class="hljs-number">1.0</span>])
plt.ylim([<span class="hljs-number">0.0</span>, <span class="hljs-number">1.05</span>])
plt.xlabel(<span class="hljs-string">'False Positive Rate'</span>)
plt.ylabel(<span class="hljs-string">'True Positive Rate'</span>)
plt.title(<span class="hljs-string">'Receiver Operating Characteristic (ROC) Curve'</span>)
plt.legend(loc=<span class="hljs-string">'lower right'</span>)
plt.show()
</code></pre>
<p>The ROC curve gives us a comprehensive picture of how well our model is distinguishing between the two classes across various thresholds. By evaluating this curve, we can make decisions on how to tune the model’s threshold to find the best balance between detecting fraud and minimizing false alarms (that is, minimizing false positives).</p>
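<p>As a rough illustration of that threshold tuning, the sketch below picks the cutoff that maximizes the gap between the true positive rate and the false positive rate (Youden's J statistic) and applies it in place of the default 0.5 threshold. It assumes the fpr, tpr, and thresholds arrays computed in the ROC code above:</p>
<pre><code class="lang-python">import numpy as np

# Choose the threshold that maximizes TPR - FPR (Youden's J statistic)
j_scores = tpr - fpr
best_threshold = thresholds[np.argmax(j_scores)]
print(f"Chosen threshold: {best_threshold:.3f}")

# Apply the tuned threshold to the predicted fraud probabilities
probabilities = rf_model.predict_proba(X_test)[:, 1]
y_pred_tuned = (probabilities &gt;= best_threshold).astype(int)

print("Evaluation at tuned threshold:\n", classification_report(y_test, y_pred_tuned))
</code></pre>
<p>Whether Youden's J is the right criterion depends on the business cost of missed fraud versus false alarms, so treat this as a starting point rather than a rule.</p>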
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>By following this guide, you’ve learned how to leverage MLOps to automate fraud detection and ensure compliance in the financial industry using Google Colab. This cloud-based environment makes it easy to work with machine learning models without the hassle of local setups or configurations.</p>
<p>From automating data preprocessing to deploying models in production, MLOps offers an end-to-end solution that improves efficiency, scalability, and accuracy in detecting fraudulent activities.</p>
<p>By integrating real-time monitoring and continuous updates, financial institutions can stay ahead of fraud threats while ensuring regulatory compliance with minimal manual effort.</p>
<h2 id="heading-key-takeaways"><strong>Key Takeaways</strong></h2>
<ul>
<li><p>MLOps automates the whole machine learning model lifecycle by integrating machine learning with DevOps.</p>
</li>
<li><p>Simplifies regulatory compliance and fraud detection, letting banks spot fraudulent transactions automatically.</p>
</li>
<li><p>Keeps fraud detection systems current with fresh data through continuous monitoring and model retraining.</p>
</li>
<li><p>Machine learning model development and testing may be done on Google Colab, a free cloud-based platform that provides access to GPUs and TPUs. No local installation is required.</p>
</li>
<li><p>Enables automated workflows that detect suspicious behavior and send out alerts in real time.</p>
</li>
<li><p>Continuous integration/continuous delivery pipelines support ongoing system improvement by automating the testing and deployment of new fraud detection models.</p>
</li>
<li><p>Financial organizations may save money using MLOps because cloud-based systems like Google Colab lower infrastructure expenses.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Make IT Operations More Efficient with AIOps: Build Smarter, Faster Systems ]]>
                </title>
                <description>
                    <![CDATA[ In the rapidly evolving IT landscape, development teams have to operate at their best and manage complex systems while minimizing downtime. And having to do many routine tasks manually can really slow down operations and reduce efficiency. These days... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/make-it-operations-more-efficient-with-aiops/</link>
                <guid isPermaLink="false">681e7192df44ab8496bca883</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #AIOps ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ IT ]]>
                    </category>
                
                    <category>
                        <![CDATA[ IT Operations ]]>
                    </category>
                
                    <category>
                        <![CDATA[ infrastructure ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Fri, 09 May 2025 21:20:18 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746825359981/5587ade8-875d-4623-b3f5-708109b34672.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In the rapidly evolving IT landscape, development teams have to operate at their best and manage complex systems while minimizing downtime. And having to do many routine tasks manually can really slow down operations and reduce efficiency.</p>
<p>These days, we can use artificial intelligence to manage and enhance IT operations. This is where AIOps comes into play.</p>
<p>AIOps is changing IT operations by letting teams create better, faster systems that can find and resolve problems on their own. It also helps them make the best use of resources and scale with fewer issues.</p>
<p>In this tutorial, you’ll learn about the key components of AIOps, how they interact with other IT systems, and how you can apply AIOps to improve the efficiency of your environment.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-aiops">What is AIOps?</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-the-significance-of-aiops-for-it-operations">The Significance of AIOps for IT Operations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-aiops-can-help-address-these-challenges-by">AIOps can help address these challenges by</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-getting-started-with-aiops">Getting Started with AIOps</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-1-choose-an-aiops-tool">1. Choose an AIOps Tool</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-2-implement-aiops-in-your-it-environment">2. Implement AIOps in Your IT Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-3-leverage-machine-learning-for-anomaly-detection">3. Leverage Machine Learning for Anomaly Detection</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-4-automate-root-cause-analysis">4. Automate Root Cause Analysis</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-5-set-up-automated-responses-using-webhooks">5. Set Up Automated Responses Using Webhooks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-6-automate-system-cleanup-with-ansible-sample-playbook">6. Automate system cleanup with Ansible (sample playbook)</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-use-case-aiops-in-cloud-infrastructure-and-incident-management">Real-World Use Case: AIOps in Cloud Infrastructure and Incident Management</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-challenges">Challenges:</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-aiops-implementation">AIOps implementation:</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-setting-up-monitoring-with-prometheus">Step 1: Setting Up Monitoring with Prometheus</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-collecting-system-data-cpu-usage">Step 2: Collecting System Data (CPU Usage)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-anomaly-detection-with-machine-learning">Step 3: Anomaly Detection with Machine Learning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-automating-incident-response-with-aws-lambda">Step 4: Automating Incident Response with AWS Lambda</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-proactive-resource-scaling-with-predictive-analytics">Step 5: Proactive Resource Scaling with Predictive Analytics</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-what-is-aiops"><strong>What is AIOps?</strong></h2>
<p>AIOps stands for <strong>artificial intelligence for IT operations</strong>. It means using artificial intelligence and machine learning to enhance and streamline IT tasks.</p>
<p>AIOps systems apply machine learning methods to the vast volumes of data generated by IT systems, such as logs and metrics. The main objective of AIOps is to help companies identify and resolve IT issues more quickly and effectively.</p>
<p>Key components of AIOps include:</p>
<ol>
<li><p><strong>Anomaly detection</strong>: the process of spotting unusual patterns in a system's operation that might indicate a problem.</p>
</li>
<li><p><strong>Event correlation</strong>: the process of examining data from several sources to determine how they complement one another and help to explain why issues arise.</p>
</li>
<li><p><strong>Automated response:</strong> acting to resolve issues without human assistance.</p>
</li>
</ol>
<h3 id="heading-the-significance-of-aiops-for-it-operations"><strong>The Significance of AIOps for IT Operations</strong></h3>
<p>The rise of hybrid and multi-cloud platforms, microservices architectures, and rapidly scaling systems is complicating IT operations. Conventional IT management tools often fall behind the size and speed of the systems we need to monitor and maintain.</p>
<p>Here are some issues that often come up in standard IT operations:</p>
<ol>
<li><p><strong>Manual troubleshooting</strong>: IT teams often have to comb through logs and reports by hand to identify the root cause of issues.</p>
</li>
<li><p><strong>Long resolution times</strong>: The longer a problem goes unresolved after it's discovered, the more downtime and user frustration it causes.</p>
</li>
<li><p><strong>Scalability</strong>: Monitoring every component becomes harder as systems grow, because more and more manual effort is required.</p>
</li>
</ol>
<h3 id="heading-aiops-can-help-address-these-challenges-by">AIOps can help address these challenges by</h3>
<ul>
<li><p><strong>Improving incident resolution times</strong>: By correlating events and providing actionable insights, AIOps can resolve problems in real-time.</p>
</li>
<li><p><strong>Scaling effortlessly</strong>: AIOps can handle growing volumes of data and events without a matching increase in manual effort, making it ideal for scaling operations.</p>
</li>
<li><p><strong>Automating incident detection and response</strong>: AI models can detect issues and automatically resolve them, reducing manual intervention.</p>
</li>
</ul>
<p>You can better understand AIOps by looking at its main components:</p>
<h4 id="heading-1-machine-learning-for-predictive-analytics">1. Machine Learning for Predictive Analytics</h4>
<p>AIOps tools forecast future events by applying machine learning to historical data. Predictive analytics, for example, can warn teams when a system's performance is likely to decline, letting them address the issue before it worsens.</p>
<h4 id="heading-2-automating-and-self-healing">2. Automating and Self-Healing</h4>
<p>AIOps lets your team automate routine tasks, reducing the need for human intervention. For instance, services can be restarted or resources reallocated automatically. This lowers operating costs and shortens problem resolution times.</p>
<h4 id="heading-3-event-correlation-and-root-cause-analysis">3. Event Correlation and Root Cause Analysis</h4>
<p>Event correlation is the technique of linking events from several related systems to identify the root cause of the problem. For instance, AIOps will examine server, network, and application logs to determine what’s wrong – whether it’s a network problem or a web application failure – and correct it.</p>
<h2 id="heading-getting-started-with-aiops">Getting Started with AIOps</h2>
<p>Enhancing your team’s IT operations with AIOps involves integrating AI-driven tools and procedures into your existing systems. These are the most important steps to start with:</p>
<h3 id="heading-1-choose-an-aiops-tool"><strong>1. Choose an AIOps Tool</strong></h3>
<p>There are several AIOps platforms available, each with its own set of features. Some popular AIOps tools include:</p>
<ul>
<li><p><strong>Moogsoft</strong>: An AIOps platform that uses machine learning for event correlation, anomaly detection, and incident management.</p>
</li>
<li><p><strong>BigPanda</strong>: Focuses on automating incident management and root cause analysis.</p>
</li>
<li><p><strong>Splunk IT Service Intelligence</strong>: Offers advanced analytics for monitoring and managing IT infrastructure.</p>
</li>
</ul>
<p>When selecting an AIOps tool, consider the following:</p>
<ul>
<li><p><strong>Integration with existing tools</strong>: Ensure the platform integrates with your current monitoring, logging, and alerting systems.</p>
</li>
<li><p><strong>Scalability</strong>: The platform should be able to handle large volumes of data and scale with your organization.</p>
</li>
<li><p><strong>Ease of use</strong>: Look for a user-friendly interface and automation capabilities to minimize manual intervention.</p>
</li>
</ul>
<h3 id="heading-2-implement-aiops-in-your-it-environment"><strong>2. Implement AIOps in Your IT Environment</strong></h3>
<p>These are the steps you’ll need to take to integrate AIOps into your IT operations:</p>
<ul>
<li><p><strong>Aggregate data</strong>: Collect data from various sources, including servers, network devices, cloud infrastructure, and applications, and consolidate it all onto one platform.</p>
</li>
<li><p><strong>Determine thresholds and KPIs</strong>: Identify the key performance indicators that matter most to your company, such as error rates, system uptime, and response times.</p>
</li>
<li><p><strong>Establish alerts and automation</strong>: For instance, when thresholds are crossed, configure automatic responses that restart services or scale up resources.</p>
</li>
</ul>
<h3 id="heading-3-leverage-machine-learning-for-anomaly-detection"><strong>3. Leverage Machine Learning for Anomaly Detection</strong></h3>
<p>Machine learning models are quite crucial in the search for anomalies. These models can identify trends that are not usual and learn from prior data. This enables IT departments to identify issues early on before they escalate.</p>
<p><strong>Example</strong>: A machine learning model may detect a spike in CPU usage that is unusual for a particular time of day, triggering an alert or automatic remediation process, such as scaling the application to add more resources.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> IsolationForest
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-comment"># Example dataset (e.g., CPU usage or network traffic over time)</span>
data = np.array([<span class="hljs-number">50</span>, <span class="hljs-number">51</span>, <span class="hljs-number">52</span>, <span class="hljs-number">53</span>, <span class="hljs-number">200</span>, <span class="hljs-number">55</span>, <span class="hljs-number">56</span>, <span class="hljs-number">57</span>, <span class="hljs-number">58</span>, <span class="hljs-number">60</span>]).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)

<span class="hljs-comment"># Initialize Isolation Forest model for anomaly detection</span>
model = IsolationForest(contamination=<span class="hljs-number">0.1</span>)  <span class="hljs-comment"># 10% outliers</span>
model.fit(data)

<span class="hljs-comment"># Predict anomalies: -1 indicates anomaly, 1 indicates normal</span>
predictions = model.predict(data)

<span class="hljs-comment"># Plotting the results</span>
plt.plot(data, label=<span class="hljs-string">"System Metric"</span>)
plt.scatter(np.arange(len(data)), data, c=predictions, cmap=<span class="hljs-string">"coolwarm"</span>, label=<span class="hljs-string">"Anomalies"</span>)
plt.title(<span class="hljs-string">"Anomaly Detection in System Metric"</span>)
plt.legend()
plt.show()
</code></pre>
<h3 id="heading-4-automate-root-cause-analysis"><strong>4. Automate Root Cause Analysis</strong></h3>
<p>AIOps platforms can automatically correlate data from various sources to identify the root cause of incidents. For instance, if an application is experiencing high response times, AIOps can check the server logs, network status, and database performance to determine if the issue is due to a server failure, database bottleneck, or network congestion.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> splunklib.client <span class="hljs-keyword">as</span> client
<span class="hljs-keyword">import</span> splunklib.results <span class="hljs-keyword">as</span> results

<span class="hljs-comment"># Connect to Splunk server (replace with actual credentials)</span>
service = client.Service(
    host=<span class="hljs-string">'localhost'</span>,
    port=<span class="hljs-number">8089</span>,
    username=<span class="hljs-string">'admin'</span>,
    password=<span class="hljs-string">'password'</span>
)

<span class="hljs-comment"># Perform a search query to find events related to system issues</span>
search_query = <span class="hljs-string">'search index=main "error" OR "fail" | stats count by sourcetype'</span>

<span class="hljs-comment"># Run the search</span>
job = service.jobs.create(search_query)

<span class="hljs-comment"># Wait for the search job to complete</span>
<span class="hljs-keyword">while</span> <span class="hljs-keyword">not</span> job.is_done():
    print(<span class="hljs-string">"Waiting for results..."</span>)
    time.sleep(<span class="hljs-number">2</span>)

<span class="hljs-comment"># Retrieve and process the results</span>
<span class="hljs-keyword">for</span> result <span class="hljs-keyword">in</span> results.JSONResultsReader(job.results()):
    print(result)
</code></pre>
<h3 id="heading-5-set-up-automated-responses-using-webhooks"><strong>5. Set Up Automated Responses Using Webhooks</strong></h3>
<p>In AIOps, automated incident response is triggered through Webhooks or other messaging systems. For example, when an anomaly is detected, a Webhook can notify a team or initiate a resolution process.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests

<span class="hljs-comment"># Simulate an anomaly detection system that triggers when an anomaly is found</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">send_alert_to_webhook</span>(<span class="hljs-params">anomaly_detected</span>):</span>
    webhook_url = <span class="hljs-string">'https://your-webhook-url.com'</span>
    payload = {
        <span class="hljs-string">"text"</span>: <span class="hljs-string">f"Alert: Anomaly detected! Please review the system metrics immediately."</span>
    }

    <span class="hljs-keyword">if</span> anomaly_detected:
        response = requests.post(webhook_url, json=payload)
        print(<span class="hljs-string">"Alert sent to webhook"</span>)
        <span class="hljs-keyword">return</span> response.status_code
    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># Simulate anomaly detection</span>
anomaly_detected = <span class="hljs-literal">True</span>  <span class="hljs-comment"># Set to True when an anomaly is found</span>

<span class="hljs-comment"># Trigger automated response (alert)</span>
status_code = send_alert_to_webhook(anomaly_detected)

<span class="hljs-keyword">if</span> status_code == <span class="hljs-number">200</span>:
    print(<span class="hljs-string">"Webhook triggered successfully"</span>)
<span class="hljs-keyword">else</span>:
    print(<span class="hljs-string">"Failed to trigger webhook"</span>)
</code></pre>
<h3 id="heading-6-automate-system-cleanup-with-ansible-sample-playbook"><strong>6. Automate system cleanup with Ansible (sample playbook)</strong></h3>
<p>Automatic remediation is a major component of AIOps: it resolves issues without any human intervention. Here is a sample Ansible playbook that automatically remediates an issue by restarting a service when a system metric exceeds a particular threshold.</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Automated</span> <span class="hljs-string">Remediation</span> <span class="hljs-string">for</span> <span class="hljs-string">High</span> <span class="hljs-string">CPU</span> <span class="hljs-string">Usage</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">all</span>
  <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">tasks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Check</span> <span class="hljs-string">CPU</span> <span class="hljs-string">Usage</span>
      <span class="hljs-attr">shell:</span> <span class="hljs-string">"top -bn1 | grep load | awk '{printf \"%.2f\", $(NF-2)}'"</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">cpu_load</span>
      <span class="hljs-attr">changed_when:</span> <span class="hljs-literal">false</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Restart</span> <span class="hljs-string">service</span> <span class="hljs-string">if</span> <span class="hljs-string">CPU</span> <span class="hljs-string">load</span> <span class="hljs-string">is</span> <span class="hljs-string">high</span>
      <span class="hljs-attr">service:</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">"your-service-name"</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">restarted</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">cpu_load.stdout</span> <span class="hljs-string">|</span> <span class="hljs-string">float</span> <span class="hljs-string">&gt;</span> <span class="hljs-number">80.0</span>
</code></pre>
<h2 id="heading-real-world-use-case-aiops-in-cloud-infrastructure-and-incident-management"><strong>Real-World Use Case: AIOps in Cloud Infrastructure and Incident Management</strong></h2>
<p>Imagine a large-scale e-commerce company that operates in the cloud, hosting its infrastructure on AWS. The company’s platform is supported by hundreds of virtual machines (VMs), microservices, databases, and web servers.</p>
<p>As the company grows, so do the complexities of its IT operations, especially in managing system health, uptime, and performance. The company has a traditional monitoring setup in place using basic cloud-native tools. But as the platform scales, the sheer volume of data (logs, metrics, alerts) overwhelms the IT team, leading to delays in identifying the root cause of issues and resolving them in real time.</p>
<h3 id="heading-challenges"><strong>Challenges:</strong></h3>
<ul>
<li><p><strong>Incident overload</strong>: With hundreds of alerts coming in daily, the team struggled to prioritize critical incidents, which led to slower resolution times.</p>
</li>
<li><p><strong>Manual processes</strong>: Identifying the root cause of issues required manual sifting through logs, which was time-consuming and error-prone.</p>
</li>
<li><p><strong>Scalability issues</strong>: As the company scaled its infrastructure, manual intervention became increasingly inefficient, and the system could not dynamically respond to issues without human input.</p>
</li>
</ul>
<h3 id="heading-aiops-implementation"><strong>AIOps implementation</strong>:</h3>
<p>The company decided to implement an AIOps platform to streamline incident management, automate responses, and predict issues before they occurred.</p>
<h3 id="heading-step-1-setting-up-monitoring-with-prometheus"><strong>Step 1: Setting Up Monitoring with Prometheus</strong></h3>
<p>First, we need to monitor system performance to collect metrics such as CPU usage and memory consumption. We’ll use Prometheus, an open-source monitoring tool, to collect this data.</p>
<h4 id="heading-install-prometheus">Install Prometheus:</h4>
<p>First, download and install Prometheus:</p>
<pre><code class="lang-bash">wget https://github.com/prometheus/prometheus/releases/download/v2.27.1/prometheus-2.27.1.linux-amd64.tar.gz
tar -xvzf prometheus-2.27.1.linux-amd64.tar.gz
<span class="hljs-built_in">cd</span> prometheus-2.27.1.linux-amd64/
./prometheus
</code></pre>
<p>Then install Node Exporter (to collect system metrics):</p>
<pre><code class="lang-bash">wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
tar -xvzf node_exporter-1.1.2.linux-amd64.tar.gz
<span class="hljs-built_in">cd</span> node_exporter-1.1.2.linux-amd64/
./node_exporter
</code></pre>
<p>Next, configure Prometheus to scrape metrics from Node Exporter:</p>
<pre><code class="lang-yaml"><span class="hljs-comment">##Edit prometheus.yml to scrape metrics from the Node Exporter:</span>
<span class="hljs-attr">scrape_configs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">job_name:</span> <span class="hljs-string">'node'</span>
    <span class="hljs-attr">static_configs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">targets:</span> [<span class="hljs-string">'localhost:9100'</span>]
</code></pre>
<p>And start Prometheus:</p>
<pre><code class="lang-bash">./prometheus --config.file=prometheus.yml
</code></pre>
<p>You can now access Prometheus via <a target="_blank" href="http://localhost:9090">http://localhost:9090</a> to verify that it's collecting metrics.</p>
<h3 id="heading-step-2-collecting-system-data-cpu-usage"><strong>Step 2: Collecting System Data (CPU Usage)</strong></h3>
<p>Now that we have Prometheus collecting metrics, we need to extract CPU usage data (which will be the focus of our anomaly detection) from Prometheus.</p>
<h4 id="heading-querying-prometheus-api-for-cpu-usage">Querying Prometheus API for CPU Usage</h4>
<p>We’ll use Python to query Prometheus and retrieve CPU usage data (for example, using the node_cpu_seconds_total metric). We’ll fetch the data for the last 30 minutes.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta

<span class="hljs-comment"># Define the Prometheus URL and the query</span>
prom_url = <span class="hljs-string">"http://localhost:9090/api/v1/query_range"</span>
query = <span class="hljs-string">'rate(node_cpu_seconds_total{mode="user"}[1m])'</span>

<span class="hljs-comment"># Define the start and end times</span>
end_time = datetime.now()
start_time = end_time - timedelta(minutes=<span class="hljs-number">30</span>)

<span class="hljs-comment"># Make the request to Prometheus API</span>
response = requests.get(prom_url, params={
    <span class="hljs-string">'query'</span>: query,
    <span class="hljs-string">'start'</span>: start_time.timestamp(),
    <span class="hljs-string">'end'</span>: end_time.timestamp(),
    <span class="hljs-string">'step'</span>: <span class="hljs-number">60</span>
})

data = response.json()[<span class="hljs-string">'data'</span>][<span class="hljs-string">'result'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'values'</span>]
timestamps = [item[<span class="hljs-number">0</span>] <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> data]
cpu_usage = [item[<span class="hljs-number">1</span>] <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> data]

<span class="hljs-comment"># Create a DataFrame for easier processing</span>
df = pd.DataFrame({
    <span class="hljs-string">'timestamp'</span>: pd.to_datetime(timestamps, unit=<span class="hljs-string">'s'</span>),
    <span class="hljs-string">'cpu_usage'</span>: cpu_usage
})

print(df.head())
</code></pre>
<h3 id="heading-step-3-anomaly-detection-with-machine-learning"><strong>Step 3: Anomaly Detection with Machine Learning</strong></h3>
<p>To detect anomalies in CPU usage, we’ll use Isolation Forest, a machine learning algorithm from Scikit-learn.</p>
<h4 id="heading-train-an-anomaly-detection-model">Train an Anomaly Detection Model:</h4>
<p>First, install Scikit-learn:</p>
<pre><code class="lang-bash">pip install scikit-learn matplotlib
</code></pre>
<p>Then you’ll need to train the model using the CPU usage data we collected:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> IsolationForest
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-comment"># Prepare the data for anomaly detection (CPU usage data)</span>
cpu_usage_data = df[<span class="hljs-string">'cpu_usage'</span>].values.reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)

<span class="hljs-comment"># Train the Isolation Forest model (anomaly detection)</span>
model = IsolationForest(contamination=<span class="hljs-number">0.05</span>)  <span class="hljs-comment"># 5% expected anomalies</span>
model.fit(cpu_usage_data)

<span class="hljs-comment"># Predict anomalies (1 = normal, -1 = anomaly)</span>
predictions = model.predict(cpu_usage_data)

<span class="hljs-comment"># Add predictions to the DataFrame</span>
df[<span class="hljs-string">'anomaly'</span>] = predictions

<span class="hljs-comment"># Visualize the anomalies</span>
plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">6</span>))
plt.plot(df[<span class="hljs-string">'timestamp'</span>], df[<span class="hljs-string">'cpu_usage'</span>], label=<span class="hljs-string">'CPU Usage'</span>)
plt.scatter(df[<span class="hljs-string">'timestamp'</span>][df[<span class="hljs-string">'anomaly'</span>] == <span class="hljs-number">-1</span>], df[<span class="hljs-string">'cpu_usage'</span>][df[<span class="hljs-string">'anomaly'</span>] == <span class="hljs-number">-1</span>], color=<span class="hljs-string">'red'</span>, label=<span class="hljs-string">'Anomaly'</span>)
plt.title(<span class="hljs-string">"CPU Usage with Anomalies"</span>)
plt.xlabel(<span class="hljs-string">"Time"</span>)
plt.ylabel(<span class="hljs-string">"CPU Usage (%)"</span>)
plt.legend()
plt.show()
</code></pre>
<h3 id="heading-step-4-automating-incident-response-with-aws-lambda"><strong>Step 4: Automating Incident Response with AWS Lambda</strong></h3>
<p>When an anomaly is detected (for example, high CPU usage), AIOps can automatically trigger a response, such as scaling up resources.</p>
<h4 id="heading-aws-lambda-for-automated-scaling">AWS Lambda for Automated Scaling</h4>
<p>Here’s an example of how to use AWS Lambda to scale up EC2 instances when CPU usage exceeds a threshold.</p>
<p>First, create your AWS Lambda function that scales EC2 instances when CPU usage exceeds 80%.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> boto3

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">lambda_handler</span>(<span class="hljs-params">event, context</span>):</span>
    ec2 = boto3.client(<span class="hljs-string">'ec2'</span>)

    <span class="hljs-comment"># If CPU usage exceeds threshold, scale up EC2 instance</span>
    <span class="hljs-keyword">if</span> event[<span class="hljs-string">'cpu_usage'</span>] &gt; <span class="hljs-number">0.8</span>:  <span class="hljs-comment"># 80% CPU usage</span>
        instance_id = <span class="hljs-string">'i-1234567890'</span>  <span class="hljs-comment"># Replace with your EC2 instance ID</span>
        ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={<span class="hljs-string">'Value'</span>: <span class="hljs-string">'t2.large'</span>})

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'statusCode'</span>: <span class="hljs-number">200</span>,
        <span class="hljs-string">'body'</span>: <span class="hljs-string">f'Instance <span class="hljs-subst">{instance_id}</span> scaled up due to high CPU usage.'</span>
    }
</code></pre>
<p>Then you’ll need to trigger the Lambda function. Set up AWS CloudWatch Alarms to monitor the output from the anomaly detection and trigger the Lambda function when CPU usage exceeds the threshold.</p>
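<p>As a rough sketch of that wiring, the snippet below creates a CloudWatch alarm on an instance's CPUUtilization metric with boto3. The alarm's action publishes to an SNS topic that the Lambda function is assumed to subscribe to; the instance ID and topic ARN are placeholders:</p>
<pre><code class="lang-python">import boto3

cloudwatch = boto3.client('cloudwatch')

# Hypothetical SNS topic; the Lambda function above is assumed to be subscribed
# to it, so it runs whenever the alarm fires.
sns_topic_arn = 'arn:aws:sns:us-east-1:123456789012:cpu-alerts'

cloudwatch.put_metric_alarm(
    AlarmName='HighCPUUsage',
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-1234567890'}],
    Statistic='Average',
    Period=300,               # evaluate 5-minute averages
    EvaluationPeriods=2,      # require two consecutive breaches
    Threshold=80.0,           # 80% CPU usage
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=[sns_topic_arn]
)
print("CloudWatch alarm created.")
</code></pre>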
<h3 id="heading-step-5-proactive-resource-scaling-with-predictive-analytics"><strong>Step 5: Proactive Resource Scaling with Predictive Analytics</strong></h3>
<p>Finally, using predictive analytics, AIOps can forecast future resource usage and proactively scale resources before problems arise.</p>
<h4 id="heading-predictive-scaling">Predictive Scaling:</h4>
<p>We’ll use a linear regression model to predict future CPU usage and trigger scaling events proactively.</p>
<p>Start by training a predictive model:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LinearRegression
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Historical data (CPU usage trends)</span>
data = pd.DataFrame({
    <span class="hljs-string">'timestamp'</span>: pd.date_range(start=<span class="hljs-string">"2023-01-01"</span>, periods=<span class="hljs-number">100</span>, freq=<span class="hljs-string">'H'</span>),
    <span class="hljs-string">'cpu_usage'</span>: np.random.normal(<span class="hljs-number">50</span>, <span class="hljs-number">10</span>, <span class="hljs-number">100</span>)  <span class="hljs-comment"># Simulated data</span>
})

X = np.array(range(len(data))).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)  <span class="hljs-comment"># Time steps</span>
y = data[<span class="hljs-string">'cpu_usage'</span>]

model = LinearRegression()
model.fit(X, y)

<span class="hljs-comment"># Predict next 10 hours</span>
future_prediction = model.predict([[len(data) + <span class="hljs-number">10</span>]])
print(<span class="hljs-string">"Predicted CPU usage:"</span>, future_prediction)
</code></pre>
<p>If the predicted CPU usage exceeds a threshold, AIOps can trigger auto-scaling using AWS Lambda or Kubernetes.</p>
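<p>For illustration, here's a small sketch of how that trigger could work: if the forecast crosses a threshold, invoke the Lambda function from the previous step with boto3. The function name is hypothetical, and the predicted value is converted to a 0-1 fraction because the handler compares against 0.8:</p>
<pre><code class="lang-python">import json
import boto3

lambda_client = boto3.client('lambda')

predicted_cpu = float(future_prediction[0])  # predicted CPU usage, in percent

# Only invoke the scaling function when the forecast crosses the threshold
if predicted_cpu &gt; 80.0:
    response = lambda_client.invoke(
        FunctionName='scale-ec2-on-high-cpu',   # hypothetical function name
        InvocationType='Event',                 # asynchronous invocation
        Payload=json.dumps({'cpu_usage': predicted_cpu / 100})  # handler expects a 0-1 fraction
    )
    print("Scaling function invoked, status:", response['StatusCode'])
else:
    print("Predicted CPU usage is within limits; no action taken.")
</code></pre>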
<h4 id="heading-results">Results:</h4>
<ul>
<li><p><strong>Reduced incident resolution time</strong>: The average time to resolve incidents dropped from hours to minutes because AIOps helped the team identify issues faster.</p>
</li>
<li><p><strong>Reduced false positives</strong>: By using anomaly detection, the system significantly reduced the number of false alerts.</p>
</li>
<li><p><strong>Increased automation</strong>: With automated responses in place, the system dynamically adjusted resources in real time, reducing the need for manual intervention.</p>
</li>
<li><p><strong>Proactive issue management</strong>: Predictive analytics enabled the team to address potential problems before they became critical, preventing performance degradation.</p>
</li>
</ul>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>AIOps transforms IT operations, enabling companies to build more efficient, responsive, and reliable systems. By automating routine tasks, identifying issues before they worsen, and providing real-time insights, AIOps is reshaping how IT teams work.</p>
<p>AIOps is a highly effective way to improve system performance, reduce downtime, and streamline your IT procedures. You can begin modestly and gradually add more functionality. Over time, you'll see how AIOps makes your IT environment more efficient and opens it up to new ideas.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
