mlops - freeCodeCamp.org

How to Create a GPU-Optimized Machine Image with HashiCorp Packer on GCP

Rasheedat Atinuke Jamiu — Wed, 22 Apr 2026 20:30:00 +0000

Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers, DCGM, apply OS‑level GPU tuning, and fight dependency issues. Same old ritual every single time, wasting expensive cloud credits and getting frustrated before actual work begins.

In this article, you'll build a reusable GPU-optimized machine image using Packer, pre-loaded with NVIDIA drivers, CUDA Toolkit, NVIDIA Container Toolkit, DCGM, and system-level GPU tuning like persistence mode.

Prerequisites
Project Setup
Step 1: Install Packer
Step 2: Set Up Project Directory
Step 3: Install Packer's Plugins
Step 4: Define Your Source
Step 5: Writing the Build Template
Step 6: Writing the GPU Provisioning Script
Step 7:Assembling and Running the Build
Step 8: Test the Image and Verify the GPU Stack
Conclusion
References

Prerequisites

HashiCorp Packer >= 1.9
Google Compute Packer plugin (installed via packer init)
Optionally, the AWS Packer plugin can be used for EC2 builds by adding an amazon-ebs source to node.pkr.hcl
GCP project with Compute Engine API enabled (or AWS account with EC2 access)
GCP authentication (gcloud auth application-default login) or AWS credentials
Access to an NVIDIA GPU instance type (For example, A100, H100, L4 on GCP; p4d, p5, G6 on AWS)

Project Setup

Step 1: Install Packer

To get started, you'll install Packer with the steps below if you're on macOS (or you can follow the official documentation for Linux and Windows installation guides).

First, you'll install the official Packer formula from the terminal.

Install the HashiCorp tap, a repository of all Hashicorp packages.

$ brew tap hashicorp/tap

Now, install Packer with hashicorp/tap/packer.

$ brew install hashicorp/tap/packer

Step 2: Set Up Project Directory

With Packer installed, you'll create your project directory. For clean code and separation of concerns, your project directory should look like the below. Go ahead and create these files in your packer_demo folder using the command below:

mkdir -p packer_demo/script && touch packer_demo/{build.pkr.hcl,source.pkr.hcl,variable.pkr.hcl,local.pkr.hcl,plugins.pkr.hcl,values.pkrvars.hcl} packer_demo/script/base.sh

Your file directory should look like this:

packer_demo
├── build.pkr.hcl                 # Build pipeline — provisioner ordering
├── source.pkr.hcl                # GCP source definition (googlecompute)
├── variable.pkr.hcl              # Variable definitions with defaults
├── local.pkr.hcl                 # Local values
├── plugins.pkr.hcl                # Packer plugin requirements
├── values.pkrvars.hcl             # variable values (copy and customize)
├── script/
│   ├── base.sh                  # requirement script

Step 3: Install Packer's Plugins

In your plugins.pkr.hcl file,, define your plugins in the packer block. The packer {} block contains Packer settings, including specifying a required plugin version. You'll find the required_plugins block in the Packer block, which specifies all the plugins required by the template to build your image. If you're on Azure or AWS, you can check for the latest plugin here.

packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = "~> 1"
    }
  }
}

Then, initialize your Packer plugin with the command below:

packer init .

Step 4: Define Your Source

With your plugin initialized, you can now define your source block. The source block configures a specific builder plugin, which is then invoked by a build block. Source blocks contain your project ID, the zone where your machine will be created, the source_image_family (think of this as your base image, such as Debian, Ubuntu, and so on), and your source_image_project_id.

In GCP, each has an image project ID, such as "ubuntu-os-cloud" for Ubuntu. You'll set the machine type to a GPU machine type because you're building your base image for a GPU machine, so the machine on which it will be created needs to be able to run your commands.

source "googlecompute" "gpu-node" {
  project_id              = var.project_id
  zone                    = var.zone
  source_image_family     = var.image_family
  source_image_project_id = var.image_project_id
  ssh_username            = var.ssh_username
  machine_type            = var.machine_type



  image_name        = var.image_name
  image_description = var.image_description

  disk_size           = var.disk_size
  on_host_maintenance = "TERMINATE"

  tags = ["gpu-node"]

}

Setting on_host_maintenance = "TERMINATE" on Google Cloud Compute Engine ensures that a VM instance stops instead of live-migrating during infrastructure maintenance. This is important when using GPUs or specialized hardware that can't migrate, preventing data corruption.

You'll define all your variables in the variable.pkr.hcl file, and set the values in the values.pkrvars.hcl. Remember to always add your values.pkrvars.hcl file to Gitignore.

variable "image_name" {
  type        = string
  description = "The name of the resulting image"
}

variable "image_description" {
  type        = string
  description = "Description of the image"
}

variable "project_id" {
  type        = string
  description = "The GCP project ID where the image will be created"
}

variable "image_family" {
  type        = string
  description = "The image family to which the resulting image belongs"
}

variable "image_project_id" {
  type        = list(string)
  description = "The project ID(s) to search for the source image"
}

variable "zone" {
  type        = string
  description = "The GCP zone where the build instance will be created"
}

variable "ssh_username" {
  type        = string
  description = "The SSH username to use for connecting to the instance"
}
variable "machine_type" {
  type        = string
  description = "The machine type to use for the build instance"
}

variable "cuda_version" {
  type        = string
  description = "CUDA toolkit version"
  default     = "13.1"
}

variable "driver_version" {
  type        = string
  description = "NVIDIA driver version"
  default     = "590.48.01"
}

variable "disk_size" {
  type        = number
  description = "Boot disk size in GB"
  default     = 50
}

values.pkrvars.hcl

image_name        = "base-gpu-image-{{timestamp}}"
image_description = "Ubuntu 24.04 LTS with gpu drivers and health checks"
project_id        = "your gcp project id"
image_family      = "ubuntu-2404-lts-amd64"
image_project_id  = ["ubuntu-os-cloud"]
zone              = "us-central1-a"
ssh_username      = "packer"
machine_type      = "g2-standard-4"
disk_size        = 50
driver_version   = "590.48.01"
cuda_version      = "13.1"

Step 5: Writing the Build Template

Create build.pkr.hcl. The build block creates a temporary instance, runs provisioners, and produces an image.

Provisioners in this template are organized as follows:

First provisioner runs system updates and upgrades.
Second provisioner reboots the instance (expect_disconnect = true).
Third provisioner waits for the instance to come back (pause_before), then runs script/base.sh. This provisioner sets max_retries to handle transient SSH timeouts and pass environment variables for DRIVER_VERSION and CUDA_VERSION.

Lastly, you have the post-processor to tell you the image ID and completion status:

build {
  sources = ["source.googlecompute.gpu-node"]

  provisioner "shell" {
    inline = [
      "set -e",
      "sudo apt update",
      "sudo apt -y dist-upgrade"
    ]
  }

  provisioner "shell" {
    expect_disconnect = true
    inline            = ["sudo reboot"]
  }

  # Base: NVIDIA drivers, CUDA, DCGM
  provisioner "shell" {
    pause_before = "60s"
    script       = "script/base.sh"
    max_retries  = 2
    environment_vars = [
      "DRIVER_VERSION=${var.driver_version}",
      "CUDA_VERSION=${var.cuda_version}"
    ]
  }

  post-processor "shell-local" {
    inline = [
      "echo '=== Image Build Complete ==='",
      "echo 'Image ID: ${build.ID}'",
      "date"
    ]
  }
}

Step 6: Writing the GPU Provisioning Script

Now we'll go through the base script, and break down some parts of it.

Section 1: Pre-Installation (Kernel Headers)

Before installing NVIDIA drivers, the system needs kernel headers and build tools. The NVIDIA driver compiles a kernel module during installation via DKMS, so if the headers for your running kernel aren't present, the build will fail silently, and the driver won't load on boot.

log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

Section 2: Installing NVIDIA's Apt Repository

This snippet downloads and installs NVIDIA’s official keyring package based on your OS Linux distribution, which adds the trusted signing keys needed for the system to verify CUDA packages.

log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/\({DISTRO}/\){ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq

Section 3: Pinning NVIDIA Drivers Version

Pinning the NVIDIA driver to a specific version ensures that the system always installs and keeps using exactly that driver version, even when newer drivers appear in the repository.

NVIDIA drivers are tightly coupled with CUDA toolkit versions, Kernel versions, and container runtimes like Docker or NVIDIA Container Toolkit

A mismatch, such as the system auto‑upgrading to a newer driver, can cause CUDA to stop working, break GPU acceleration, or make the machine image inconsistent across deployments.

log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

Section 4: Installing the Driver

The libnvidia-compute installs only the compute‑related user‑space libraries (CUDA driver components), while the nvidia-dkms-open; installs the open‑source NVIDIA kernel module, built locally via DKMS.

Together, these two packages give you a fully functional CUDA driver environment without any GUI or graphics dependencies.

Here, we're using NVIDIA’s compute‑only driver stack using the open‑source kernel modules, as it deliberately avoids installing any display-related components, which you don't need.

This method provides an installation module based on DKMS that's better aligned with Linux distros, as it's lightweight, and compute-focused.

log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

Section 5: CUDA Toolkit Installation

This part of the script installs the CUDA Toolkit for the specified version and then makes sure that CUDA’s executables and libraries are available system‑wide for every user and every shell session.

It adds CUDA binaries to PATH, so commands like nvcc, cuda-gdb, and cuda-memcheck work without specifying full paths. It also adds CUDA libraries to LD_LIBRARY_PATH, so applications can find CUDA’s shared libraries at runtime.

log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat <<'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig

Section 6: NVIDIA Container Toolkit

This block installs the NVIDIA Container Toolkit and configures it so that containers (Docker or containerd) can access the GPU safely and correctly. It’s a critical step for Kubernetes GPU nodes, Docker GPU workloads, and any system that needs GPU acceleration inside containers.

log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi

Section 7: Installing DCGM (Data Center GPU Manager)

This section covers the installation and validation of NVIDIA DCGM (Data Center GPU Manager), which is NVIDIA’s official management and telemetry framework for data center GPUs.

It offers health monitoring and diagnostics, telemetry (including temperature, clocks, power, and utilization), error reporting, and integration with Kubernetes, Prometheus, and monitoring agents. Your GPU monitoring stack relies on this.

The script extracts the installed version and checks that it meets the minimum required version for NVIDIA driver 590+. Then it enforces the version requirement. This prevents a mismatch between the GPU driver and DCGM, which would break monitoring and health checks. It also enables fabric manager for NVLink/NVswitches, if you're on a Multi‑GPU topologies like A100/H100 DGX or multi‑GPU servers.

log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager

DCGM_VER=\((dpkg -s datacenter-gpu-manager 2>/dev/null | awk '/^Version:/{print \)2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=\((echo "\){DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=\((echo "\){DCGM_VER}" | cut -d. -f2)
if [[ "\({DCGM_MAJOR}" -lt 4 ]] || { [[ "\){DCGM_MAJOR}" -eq 4 ]] && [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi

Section 8: Enabling Persistence Mode

The NVIDIA driver normally unloads itself when the GPU is idle. When a new workload starts, the driver must reload, reinitialize the GPU, and set up memory mappings. This adds a delay of a few hundred milliseconds to several seconds, depending on the GPU and system.

Enabling nvidia‑persistenced keeps the NVIDIA driver loaded in memory even when no GPU workloads are running.

log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

Section 9: System Tuning for GPU Compute Workloads

This block applies a set of system‑level performance and stability tunings that are standard for high‑performance GPU servers, Kubernetes GPU nodes, and ML/AI workloads.

Each line targets a specific bottleneck or instability pattern that appears in real GPU production environments.

Swap and memory behavior: Disabling swap and setting vm.swappiness=0 prevents the kernel from pushing GPU‑bound processes into swap. GPU workloads are extremely sensitive to latency, and swapping can cause CUDA context resets and GPU driver timeouts.
Hugepages for large memory allocations: Setting vm.nr_hugepages=2048 allocates a pool of hugepages, which reduces TLB pressure for large contiguous memory allocations.

CUDA, NCCL, and deep‑learning frameworks frequently allocate large buffers, and hugepages reduce page‑table overhead, improving memory bandwidth and lowering latency for large tensor operations. This is especially useful on multi‑GPU servers.
CPU frequency governor: Installing cpupower and forcing the CPU governor to performance ensures the CPU stays at maximum frequency instead of scaling down.

GPU workloads often become CPU‑bound during Data preprocessing, Kernel launches, and NCCL communication. Keeping CPUs at full speed reduces jitter and improves throughput.
NUMA and topology tools: Installing numactl, libnuma-dev, and hwloc provides tools for pinning processes to NUMA nodes, understanding CPU–GPU affinity, and optimizing multi‑GPU placement.
Disabling irqbalance: Stopping and disabling irqbalance it lets the NVIDIA driver manage interrupt affinity. For GPU servers, irqbalance can incorrectly move GPU interrupts to suboptimal CPUs, causing higher latency and lower throughput.

log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system

Full base.sh script here:

#!/bin/bash
set -euo pipefail

log()   { echo "[BASE] $1"; }
error() { echo "[BASE][ERROR] $1" >&2; exit 1; }

###############################################################
###############################################################
[[ -z "${DRIVER_VERSION:-}" ]] && error "DRIVER_VERSION is not set."
[[ -z "${CUDA_VERSION:-}"   ]] && error "CUDA_VERSION is not set."

log "DRIVER_VERSION : ${DRIVER_VERSION}"
log "CUDA_VERSION   : ${CUDA_VERSION}"

DISTRO=\((. /etc/os-release && echo "\){ID}${VERSION_ID}" | tr -d '.')
ARCH="x86_64"

export DEBIAN_FRONTEND=noninteractive

###############################################################
# 1. System update
###############################################################
log "Updating system packages..."
sudo apt-get update -qq
sudo apt-get upgrade -qq -y

###############################################################
# 2. Pre-installation — kernel headers
#    Source: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
###############################################################
log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

###############################################################
# 3. NVIDIA CUDA Network Repository
###############################################################
log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/\({DISTRO}/\){ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq

###############################################################
# 4. Pin driver version BEFORE installation (590+ requirement)
###############################################################
log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

###############################################################
# 5. Compute-only (headless) driver — Open Kernel Modules
#    Source: NVIDIA Driver Installation Guide — Compute-only System (Open Kernel Modules)
#
#    libnvidia-compute  = compute libraries only (no GL/Vulkan/display)
#    nvidia-dkms-open   = open-source kernel module built via DKMS
#
#    Open kernel modules are the NVIDIA-recommended choice for
#    Ampere, Hopper, and Blackwell data centre GPUs (A100, H100, etc.)
###############################################################
log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

###############################################################
# 6. CUDA Toolkit
###############################################################
log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat <<'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig

###############################################################
# 7. NVIDIA Container Toolkit
#    Required for GPU workloads in Docker / containerd / Kubernetes
###############################################################
log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi

###############################################################
# 8. DCGM — DataCenter GPU Manager
###############################################################
log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager
 
DCGM_VER=\((dpkg -s datacenter-gpu-manager 2>/dev/null | awk '/^Version:/{print \)2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=\((echo "\){DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=\((echo "\){DCGM_VER}" | cut -d. -f2)
if [[ "\({DCGM_MAJOR}" -lt 4 ]] || { [[ "\){DCGM_MAJOR}" -eq 4 ]] && [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi

###############################################################
# 9. NVIDIA Persistence Daemon
#    Keeps the driver loaded between jobs — reduces cold-start
#    latency on the first CUDA call in each new workload
###############################################################
log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

###############################################################
# 10. System tuning for GPU compute workloads
###############################################################
log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system

###############################################################
# Done
###############################################################
log "============================================"
log "Base layer provisioning complete."
log "  OS      : ${DISTRO}"
log "  Driver  : ${DRIVER_VERSION} (open kernel modules, compute-only)"
log "  CUDA    : cuda-toolkit-${CUDA_VERSION}"
log "  DCGM    : ${DCGM_VER}"
log "============================================"

Step 7: Assembling and Running the Build

Validate the template first, then run the build. Validation catches syntax or variable errors early, so the build doesn’t start on a broken config.

packer validate -var-file=values.pkrvars.hcl .

If validation succeeds, you’ll see a short confirmation like The configuration is valid.. After that, start the build. You should expect the process to create a temporary VM, run your provisioners, and produce an image:

packer build -var-file=values.pkrvars.hcl .

The build typically takes 15–20 minutes, depending on network speed and package installs. Watch the Packer log for three key checkpoints:

Instance creation — confirms the temporary VM was provisioned.
Provisioner output — shows each script step (updates, reboot, script/base.sh) and any errors.
Image creation — indicates the build finished and an image artifact was written.

If the build fails, copy the failing provisioner’s log lines and re-run the build after fixing the script or variables. For quick troubleshooting, re-run the failing provisioner locally on a matching test VM to iterate faster.

googlecompute.gpu-node: output will be in this color.

==> googlecompute.gpu-node: Checking image does not exist...
==> googlecompute.gpu-node: Creating temporary RSA SSH key for instance...
==> googlecompute.gpu-node: no persistent disk to create
==> googlecompute.gpu-node: Using image: ubuntu-2404-noble-amd64-v20260225
==> googlecompute.gpu-node: Creating instance...
==> googlecompute.gpu-node: Loading zone: us-central1-a
==> googlecompute.gpu-node: Loading machine type: g2-standard-4
==> googlecompute.gpu-node: Requesting instance creation...
==> googlecompute.gpu-node: Waiting for creation operation to complete...
==> googlecompute.gpu-node: Instance has been created!
==> googlecompute.gpu-node: Waiting for the instance to become running...
==> googlecompute.gpu-node: IP: 34.58.58.214
==> googlecompute.gpu-node: Using SSH communicator to connect: 34.58.58.214
==> googlecompute.gpu-node: Waiting for SSH to become available...
systemd-logind.service
==> googlecompute.gpu-node:  systemctl restart unattended-upgrades.service
==> googlecompute.gpu-node:
==> googlecompute.gpu-node: No containers need to be restarted.
==> googlecompute.gpu-node:
==> googlecompute.gpu-node: User sessions running outdated binaries:
==> googlecompute.gpu-node:  packer @ session #1: sshd[1535]
==> googlecompute.gpu-node:  packer @ user manager service: systemd[1540]
==> googlecompute.gpu-node: Pausing 1m0s before the next provisioner...
==> googlecompute.gpu-node: Provisioning with shell script: script/base.sh
==> googlecompute.gpu-node: [BASE] DRIVER_VERSION : 590.48.01
==> googlecompute.gpu-node: [BASE] CUDA_VERSION   : 13.1
==> googlecompute.gpu-node: [BASE] Updating system packages...
==> googlecompute.gpu-node: [BASE] Installing kernel headers and build tools...
==> googlecompute.gpu-node: [BASE] Installing CUDA Toolkit 13.1...
==> googlecompute.gpu-node: [BASE] Installing DCGM...
==> googlecompute.gpu-node: [BASE] Enabling nvidia-persistenced...
==> googlecompute.gpu-node: [BASE] Applying system tuning...
==> googlecompute.gpu-node: vm.swappiness=0
==> googlecompute.gpu-node: vm.nr_hugepages=2048
==> googlecompute.gpu-node: Setting cpu: 0
==> googlecompute.gpu-node: Error setting new values. Common errors:
==> googlecompute.gpu-node: [BASE] ============================================
==> googlecompute.gpu-node: [BASE] Base layer provisioning complete.
==> googlecompute.gpu-node: [BASE]   OS      : ubuntu2404
==> googlecompute.gpu-node: [BASE]   Driver  : 590.48.01 (open kernel modules, compute-only)
==> googlecompute.gpu-node: [BASE]   CUDA    : cuda-toolkit-13.1
==> googlecompute.gpu-node: [BASE]   DCGM    : 1:3.3.9
==> googlecompute.gpu-node: [BASE] ============================================
==> googlecompute.gpu-node: Deleting instance...
==> googlecompute.gpu-node: Instance has been deleted!
==> googlecompute.gpu-node: Creating image...
==> googlecompute.gpu-node: Deleting disk...
==> googlecompute.gpu-node: Disk has been deleted!
==> googlecompute.gpu-node: Running post-processor:  (type shell-local)
==> googlecompute.gpu-node (shell-local): Running local shell script: 
==> googlecompute.gpu-node (shell-local): === Image Build Complete ===
==> googlecompute.gpu-node (shell-local): Image ID: packer-69b6c2ee-883a-3602-7bb5-059f1ba27c8b
==> googlecompute.gpu-node (shell-local): Sun Mar 15 15:50:09 WAT 2026
Build 'googlecompute.gpu-node' finished after 17 minutes 55 seconds.

==> Wait completed after 17 minutes 55 seconds

==> Builds finished. The artifacts of successful builds are:
--> googlecompute.gpu-node: A disk image was created in the 'my_project-00000' project: base-gpu-image-1773585134

Step 8: Test the Image and Verify the GPU Stack

Confirm the image exists in the GCP Console: Compute → Storage → Images and locate your newly created OS image.

Create a test VM from the image:

gcloud compute instances create my-gpu-vm \
  --machine-type=g2-standard-4 \
  --accelerator=count=1,type=nvidia-l4 \
  --image=base-gpu-image-1772718104 \
  --image-project=YOUR_PROJECT_ID \
  --boot-disk-size=50GB \
  --maintenance-policy=TERMINATE \
  --restart-on-failure \
  --zone=us-central1-a

Created [https://www.googleapis.com/compute/v1/projects/my-project-000/zones/us-central1-a/instances/my-gpu-vm].
NAME       ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP    EXTERNAL_IP      STATUS
my-gpu-vm  us-central1-a  g2-standard-4               10.128.15.227  104.154.184.217  RUNNING

Once the instance is RUNNING, verify the NVIDIA driver and GPU are visible:

The nvidia-smi output confirms:

Driver 590.48.01 loaded
CUDA 13.1 available
Persistence Mode is On
The L4 GPU is detected with 23GB VRAM
Zero ECC errors
No running processes (clean idle state).

This is exactly what a healthy base image should look like. Notice Disp.A: Off? That confirms our compute-only driver choice is working — no display adapter is active.

Confirm the installed CUDA toolkit by running. nvcc --version. You can see that version 13.1 was installed as specified.

Let's confirm DCGM installation by running dcgmi discovery -l. Successful output indicates DCGM is running and communicating with the driver.

Conclusion

You now have a production‑grade, GPU‑optimized base image that includes the NVIDIA compute‑only driver built with open kernel modules, DCGM for monitoring, and the CUDA Toolkit. You also applied OS‑level tuning tailored to GPU compute workloads, providing a consistent, reproducible environment with no manual setup.

From here, you can extend the build by adding an application‑layer script to install frameworks such as PyTorch, TensorFlow, or vLLM, or create an instance template that uses this image to scale your GPU infrastructure.

The full Packer project includes additional scripts for training and inference workloads that you can use to extend your image.

References

NVIDIA Driver Installation Guide (Ubuntu): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/
NVIDIA CUDA Toolkit Documentation: https://docs.nvidia.com/cuda/
NVIDIA Container Toolkit Installation Guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
NVIDIA DCGM Documentation: https://docs.nvidia.com/datacenter/dcgm/latest/index.html
NVIDIA Persistence Daemon: https://docs.nvidia.com/deploy/driver-persistence/index.html
HashiCorp Packer Documentation: https://developer.hashicorp.com/packer/docs
Packer Google Compute Builder: https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute

Model Packaging Tools Every MLOps Engineer Should Know

Temitope Oyedele — Mon, 06 Apr 2026 15:00:08 +0000

Most machine learning deployments don’t fail because the model is bad. They fail because of packaging.

Teams often spend months fine-tuning models (adjusting hyperparameters and improving architectures) only to hit a wall when it’s time to deploy. Suddenly, the production system can’t even read the model file. Everything breaks at the handoff between research and production.

The good news? If you think about packaging from the start, you can save up to 60% of the time usually spent during deployment. That’s because you avoid the common friction between the experimental environment and the production system.

In this guide, we’ll walk through eleven essential tools every MLOps engineer should know. To keep things clear, we’ll group them into three stages of a model’s lifecycle:

Serialization: how models are stored and transferred
Bundling & Serving: how models are deployed and run
Registry: how models are tracked and versioned

Model Serialization Formats
Model Bundling and Serving Tools
Model Registries
Conclusion

Model Serialization Formats

Serialization is simply the process of turning a trained model into a file that can be stored and moved around. It’s the first step in the pipeline, and it matters more than people think. The format you choose determines how your model will be loaded later in production.

So, you want something that either works across different frameworks or is optimized for the environment where your model will eventually run.

Below are some of the most common tools in this space:

1. ONNX (Open Neural Network Exchange)

ONNX is basically the common language for model serialization. It lets you train a model in one framework, like PyTorch, and then deploy it somewhere else without running into compatibility issues. It also performs well across different types of hardware.

ONNX separates your training framework from your inference runtime and allows hardware-level optimizations like quantization and graph fusion. It’s also widely supported across cloud platforms and edge devices.

Key considerations: This format makes it possible to decouple training from deployment, while still enabling performance optimizations across different hardware setups.

When to use it: Use ONNX when you need portability – especially if different teams or environments are involved.

2. TorchScript

TorchScript lets you compile PyTorch models into a format that can run without Python. That means you can deploy it in environments like C++ or mobile without carrying the full Python runtime.

It supports two approaches: tracing (recording execution with sample inputs) and scripting (capturing full control flow).

Key considerations: Its biggest advantage is removing the Python dependency, which helps reduce latency and makes it suitable for more constrained environments.

When to use it: Best for high-performance systems where Python would be too heavy or introduce security concerns.

3. TensorFlow SavedModel

SavedModel is TensorFlow’s native format. It stores everything – the computation graph, weights, and serving logic – in a single directory.

It’s also the standard input format for TensorFlow Serving, TFLite, and Google Cloud AI Platform.

Key considerations: It keeps everything within the TensorFlow ecosystem intact, so you don’t lose any part of the model when moving to production.

When to use it: If your project is built on TensorFlow, this is the default and safest choice.

4. Pickle and Joblib

Pickle is Python’s built-in way of saving objects, and Joblib builds on top of it to better handle large arrays and models.

These are commonly used for scikit-learn pipelines, XGBoost models, and other traditional ML setups.

Key considerations: They’re simple and convenient, but come with real trade-offs. Pickle can execute arbitrary code when loading, which makes it unsafe in untrusted environments. It’s also tightly coupled to Python versions and library dependencies, so models can break when moved across environments.

When to use it: Best suited for controlled environments where everything runs in the same Python stack, such as internal tools, quick prototypes, or batch jobs.

It’s especially practical when you’re working with classical ML models and don’t need cross-language support or long-term portability. Avoid it for production systems that require security, reproducibility, or deployment across different environments.

5. Safetensors

Safetensors is a newer format developed by Hugging Face. It’s designed to be safe, fast, and straightforward.

It avoids arbitrary code execution and allows efficient loading directly from disk.

Key considerations: It’s both memory-efficient and secure, which makes it a strong alternative to older formats like Pickle.

When to use it: Ideal for modern workflows where speed and safety are important.

Model Bundling and Serving Tools

Once your model is saved, the next step is making it usable in production. That means wrapping it in a way that can handle requests and connect it to the rest of your system.

1. BentoML

BentoML allows you to define your model service in Python – including preprocessing, inference, and postprocessing – and package everything into a single unit called a “Bento.”

This bundle includes the model, code, dependencies, and even Docker configuration.

Key considerations: It simplifies deployment by packaging everything into one consistent artifact that can run anywhere.

When to use it: Great when you want to ship your model and all its logic together as one deployable unit.

2. NVIDIA Triton Inference Server

Triton is NVIDIA’s production-grade inference server. It supports multiple model formats like ONNX, TorchScript, TensorFlow, and more.

It’s built for performance, using features like dynamic batching and concurrent execution to fully utilize GPUs.

Key considerations: It delivers high throughput and efficiently uses hardware, especially GPUs, while supporting models from different frameworks.

When to use it: Best for large-scale deployments where performance, low latency, and GPU usage are critical.

3. TorchServe

TorchServe is the official serving tool for PyTorch, developed with AWS.

It packages models into a MAR file, which includes weights, code, and dependencies, and provides APIs for managing models in production.

Key considerations: It offers built-in features for versioning, batching, and management without needing to build everything from scratch.

When to use it: A solid choice for deploying PyTorch models in a standard production setup.

Model Registries

A model registry is essentially your source of truth. It stores your models, tracks versions, and manages their lifecycle from experimentation to production.

Without one, things quickly become messy and hard to track.

1. MLflow Model Registry

MLflow is one of the most widely used MLOps platforms. Its registry helps manage model versions and track their progression through stages like Staging and Production.

It also links models back to the experiments that created them.

Key considerations: It provides strong lifecycle management and makes it easier to track and audit models.

When to use it: Ideal for teams that need structured workflows and clear governance.

2. Hugging Face Hub

The Hugging Face Hub is one of the largest platforms for sharing and managing models.

It supports both public and private repositories, along with dataset versioning and interactive demos.

Key considerations: It offers a huge library of models and makes collaboration very easy.

When to use it: Perfect for projects involving transformers, generative AI, or anything that benefits from sharing and discovery.

3. Weights and Biases

Weights & Biases combines experiment tracking with a model registry.

It connects each model directly to the training run that produced it.

Key considerations: It gives you full traceability, so you always know how a model was created.

When to use it: Best when you want a strong link between experimentation and production artifacts.

Conclusion

Machine learning systems rarely fail because the models are bad. They fail because the path to production is fragile.

Packaging is what connects research to production. If that connection is weak, even great models won’t make it into real use.

Choosing the right tools across serialization, serving, and registry layers makes systems easier to deploy and maintain. Formats like ONNX and Safetensors improve portability and safety. Tools like Triton and BentoML help with reliable serving. Registries like MLflow and Hugging Face Hub keep everything organized.

The main idea is simple: don’t leave deployment as something to figure out later.

When packaging is planned early, teams move faster and avoid a lot of unnecessary problems.

In practice, success in MLOps isn’t just about building models. It’s about making sure they actually run in the real world.

How to Use MLflow to Manage Your Machine Learning Lifecycle

Temitope Oyedele — Mon, 23 Mar 2026 18:52:44 +0000

Training machine learning models usually starts out being organized and ends up in absolute chaos.

We’ve all been there: dozens of experiments scattered across random notebooks, and model files saved as model_v2_final_FINAL.pkl because no one is quite sure which version actually worked.

Once you move from a solo project to a team, or try to push something to production, that "organized chaos" quickly becomes a serious bottleneck.

Solving this mess requires more than just better naming conventions: it requires a way to standardize how we track and hand off our work. This is the specific gap MLflow was built to fill.

Originally released by the team at Databricks in 2018, it has become a standard open-source platform for managing the entire machine learning lifecycle. It acts as a central hub where your experiments, code, and models live together, rather than being tucked away in forgotten folders.

In this tutorial, we'll cover the core philosophy behind MLflow and how its modular architecture solves the 'dependency hell' of machine learning. We'll break down the four primary pillars of Tracking, Projects, Models, and the Model Registry, and walk through a practical implementation of each so you can move your projects from local notebooks to a production-ready lifecycle.

Prerequisites:

To get the most out of this tutorial, you should have:

Basic Python proficiency: Comfort with context managers (with statements) and decorators.
Machine Learning fundamentals: A general understanding of training/testing splits and model evaluation metrics (like accuracy or loss).
Local Environment: Python 3.8+ installed. Familiarity with pip or conda for installing packages is helpful.

MLflow Architecture: The Big Picture

To understand why MLflow is so effective, you have to look at how it's actually put together. MLflow isn't one giant or rigid tool. It’s a modular system designed around four loosely coupled components that are its core pillars.

This is a big deal because it means you don’t have to commit to the entire ecosystem at once. If you only need to track experiments and don't care about the other features, you can just use that part and ignore the rest.

To make this a bit more concrete, here is how those pieces map to things you probably already use:

MLflow Tracking: Logs experiments, metrics, and parameters. (Think: Git commits for ML runs)
MLflow Projects: Packages code for reproducibility. (Think: A Docker image for ML code)
MLflow Models: A standard format for multiple frameworks. (Think: A universal adapter)
Model Registry: Handles versioning and governing models. (Think: A CI/CD pipeline for models)

Architecturally, you can think of MLflow in two layers: the Client and the Server.

The Client is where you spend most of your time. It’s your training script or your Jupyter notebook where you log metrics or register a model.

The Server is the brain in the background that handles the storage. It consists of a Tracking Server, a Backend Store (usually a database like PostgreSQL), and an Artifact Store. That’s the place where big files like model weights live, such as S3 or GCS.

This separation is why MLflow is so flexible. You can start with everything running locally on your laptop using just your file system. When you're ready to scale up to a larger team, you can swap that out for a centralized server and cloud storage with almost no changes to your actual code. It grows with your project instead of forcing you to start over once things get serious.

Now, let's look at each of these four pillars of MLflow so you understand how they work.

Understanding MLflow Tracking

For most teams, the Tracking component is the front door to MLflow. Its job is simple: it acts as a digital lab notebook that records everything happening during a training run.

Instead of you frantically trying to remember what your learning rate was or where you saved that accuracy plot, MLflow just sits in the background and logs it for you.

The core unit here is the run. Think of a run as a single execution of your training code. During that run, the architecture captures four specific types of information:

Parameters: Your inputs, like batch size or the number of trees in a forest.
Metrics: Your outputs, like accuracy or loss, which can be tracked over time.
Artifacts: The "heavy" stuff, such as model weights, confusion matrices, or images.
Tags and Metadata: Context like which developer ran the code and which Git commit was used.

A Tracking Example

Seeing this in practice is the best way to understand how the architecture actually works. You don't need to rebuild your entire pipeline – you just wrap your training logic in a context manager.

Here is what a basic integration looks like in Python:

import mlflow 
import mlflow.sklearn 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import accuracy_score 

# This block opens the run and keeps things organized
with mlflow.start_run():    
    # Log parameters    
    mlflow.log_param("n_estimators", 100)    
    mlflow.log_param("max_depth", 5)    
    
    # Train the model    
    model = RandomForestClassifier(n_estimators=100, max_depth=5)    
    model.fit(X_train, y_train)    
    
    # Log metrics    
    accuracy = accuracy_score(y_test, model.predict(X_test))    
    mlflow.log_metric("accuracy", accuracy)    
    
    # Log the model as an artifact    
    mlflow.sklearn.log_model(model, "random_forest_model")

The mlflow.start_run() context manager creates a new run and automatically closes it when the block exits. Everything logged inside that block is associated with that run and stored in the Backend Store.

Where Does the Data Actually Go?

When you’re just starting out on your laptop, MLflow keeps things simple by creating a local ./mlruns directory. The real power shows up when you move to a team environment and point everyone to a centralized Tracking Server.

The system splits the data based on how "heavy" it is. Your structured data (parameters and metrics) is small and needs to be searchable, so it goes into a SQL database like PostgreSQL. Your unstructured data (the actual model files or large plots) is too bulky for a database. The architecture ships that off to an Artifact Store like Amazon S3 or Google Cloud Storage.

Why Bother with This Setup?

Relying on "vibes" and messy naming conventions is a recipe for disaster once your project grows. It might work for a day or two, but it falls apart the moment you need to compare twenty different versions of a model.

By separating the tracking into its own architectural pillar, MLflow gives you a queryable history. Instead of digging through old notebooks, you can just hop into the UI, filter for the best results, and see exactly which configuration got you there. It takes the guesswork out of the "science" part of data science.

Understanding MLflow Projects

You can train the most accurate model in the world, but if your colleague can’t reproduce your results on their machine, that model isn't worth much.

This is where MLflow Projects come in. They solve the reproducibility headache by providing a standard way to package your code, your dependencies, and your entry points into one neat bundle.

Think of an MLflow Project as a directory (or a Git repo) with a special "instruction manual" at its root called an MLproject file. This file tells anyone (or any server) exactly what environment is needed and how to kick off the execution.

The MLproject File

Instead of sending someone a long README with installation steps, you just give them this file. Here is what a typical MLproject setup looks like for a training pipeline:

name: my_ml_project
conda_env: conda.yaml

entry_points:
  train:
    parameters:
      learning_rate: {type: float, default: 0.01}
      epochs: {type: int, default: 50}
      data_path: {type: str}
    command: "python train.py --lr {learning_rate} --epochs {epochs} --data {data_path}"
  
  evaluate:
    parameters:
      model_path: {type: str}
    command: "python evaluate.py --model {model_path}"

The conda_env line points to a conda.yaml file that lists the exact Python packages and versions your code needs. If you want even more isolation, MLflow supports Docker environments too.

The beauty of this setup is the simplicity. Anyone with MLflow installed can run your entire project with a single command:

mlflow run . -P learning_rate=0.001 -P epochs=100 -P data_path=./data/train.csv

Why this Actually Matters

MLflow Projects really shine in two specific scenarios. The first is onboarding. A new team member can clone your repo and be up and running in minutes, rather than spending their entire first day debugging library version conflicts.

The second is CI/CD. Because these projects are triggered programmatically, they fit perfectly into automated retraining pipelines. When reproducibility is non-negotiable, having a "single source of truth" for how to run your code makes life a lot easier for everyone involved.

Understanding the MLflow Model Registry

Tracking experiments tells you which model is the "winner," but the Model Registry is where you actually manage that winner’s journey from your notebook to a live production environment.

Think of it as the governance layer. It handles versioning, stage management, and creates a clear audit trail so you never have to guess which model is currently running in the wild.

The Registry uses a few simple concepts to keep things organized:

Registered Model: This is the overall name for your project, like CustomerChurnPredictor.
Model Version: Every time you push a new iteration, MLflow auto-increments the version (v1, v2, and so on).
Stage: These are labels like Staging, Production, or Archived. They tell your team exactly where a model stands in its lifecycle.
Annotations: These are just notes and tags. They’re great for documenting why a specific version was promoted or what its quirks are.

Moving a Model through the Pipeline

In a real-world workflow, you don't just "deploy" a file. You transition it through stages. Here's how that looks using the MLflow Client:

Python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# First, we register the model from a run that went well
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/random_forest_model",
    name="CustomerChurnPredictor"
)

# Then, we move Version 1 to Staging so the QA team can look at it
client.transition_model_version_stage(
    name="CustomerChurnPredictor",
    version=1,
    stage="Staging"
)

# Once everything checks out, we promote it to Production
client.transition_model_version_stage(
    name="CustomerChurnPredictor",
    version=1,
    stage="Production"
)

Why Does This Matter?

The Model Registry solves a problem that usually gets messy the moment a team grows: knowing exactly which version is live, who approved it, and what it was compared against. Without this, that information usually ends up buried in Slack threads or outdated spreadsheets.

It also makes rollbacks incredibly painless. If Version 3 starts acting up in production, you don't need to redeploy your entire stack. You can just transition Version 2 back to the "Production" stage in the registry. Since your serving infrastructure is built to always pull the "Production" tag, it will automatically swap back to the stable version.

How the Components Fit Together

To see how all of this actually works in the real world, it helps to walk through a typical workflow from start to finish. It's essentially a relay race where each component hands off the baton to the next one.

It starts with a data scientist running a handful of experiments. Every time they hit run, MLflow Tracking is in the background taking notes. It logs metrics and saves model artifacts into the Backend Store automatically. At this stage, everything is about exploration and finding that one winner.

Once that best run is identified, the model gets officially registered in the Model Registry. This is where the team takes over. They can hop into the UI to check the annotations, review the evaluation results, and move the model into Staging. After it passes a few more validation tests, it gets the green light and is promoted to Production.

When it is time to actually serve the model, the deployment system simply asks the Registry for the current Production version. This happens whether you are using Kubernetes, a cloud endpoint, or MLflow’s built-in server.

Because the MLproject file handled the dependencies and the MLflow Models format handled the framework details, the serving infrastructure does not have to care if the model was built with Scikit-learn or PyTorch. The hand-off is smooth because all the necessary info is already there.

This flow is what turns MLflow from a collection of useful utilities into a full MLOps platform. It connects the messy experimental phase of data science to the rigid world of production software.

Wrapping Up

At the end of the day, MLflow architecture is built to stay out of your way. It doesn't force you to change how you write your code or which libraries you use. Instead, it just provides the structure needed to make your machine learning projects reproducible and easier to manage as a team.

Whether you're just trying to get away from naming files model_final_v2.pkl or you are building a complex CI/CD pipeline for your models, understanding these four pillars is the best place to start. The best way to learn is to just fire up a local tracking server and start logging. You will probably find that once you have that "source of truth" for your experiments, you will never want to go back to the old way of doing things.

How to Build an End-to-End ML Platform Locally: From Experiment Tracking to CI/CD

Sandeep Bharadwaj Mannapur — Tue, 17 Mar 2026 20:33:56 +0000

Machine learning projects don’t end at training a model in a Jupyter notebook. The hard part is the “last mile”: turning that notebook model into something you can run reliably, update safely, and trust over time.

Most ML systems fail in production for boring (and painful) reasons: the training code and the serving code drift apart, input data changes shape, a “small” preprocessing tweak breaks predictions, or the model silently degrades because real-world behavior shifts. None of these problems are solved by a better algorithm, they’re solved by engineering: repeatable pipelines, validation, versioning, monitoring, and automated checks.

In this hands-on handbook, you’ll build a complete mini ML platform on your local machine, an end-to-end project that takes a model from training to deployment with the core “last mile” infrastructure in place. We’ll use a fraud detection example (predicting fraudulent transactions), but the same workflow works for churn prediction or any binary classification problem. Everything runs locally (no cloud required), and every step is copy-paste runnable so you can follow along and verify outputs as you go.

By the end, you'll have a production-ready ML pipeline running on your machine – from training the model to serving predictions, with the infrastructure to test, monitor, and iterate with confidence. And yes, we'll do it in a hands-on manner with code snippets you can copy-paste and run. Let's dive in!

📦 Get the Complete Code
All code from this handbook is available in a ready-to-run repository:
Repository: https://github.com/sandeepmb/freecodecamp-local-ml-platform
Clone it and follow along, or use it as a reference implementation.

Project Overview and Setup
Build a Simple Model and API (The Naive Approach)
- Train a Quick Model
- Serve Predictions with FastAPI
Where the Naive Approach Breaks
Add Experiment Tracking and Model Registry with MLflow
Ensure Feature Consistency with Feast
Add Data Validation with Great Expectations
- Define Expectations
- Integrate Validation into FastAPI
Monitor Model Performance and Data Drift
Automate Testing and Deployment with CI/CD
Incident Response Playbook
How to Put It All Together
What’s Next: Scale to Production
Conclusion
References

Project Overview and Setup

Before we jump into coding, let's set the stage. Our use-case is credit card fraud detection – a binary classification problem where we predict whether a transaction is fraudulent (is_fraud = 1) or legitimate (is_fraud = 0). This is a common ML task and a good proxy for production ML challenges because fraud patterns can change over time (allowing us to discuss model drift), and bad input data (for example, malformed transaction info) can cause serious issues if not handled properly.

Tech Stack

We will use Python-based tools that are popular in MLOps but still beginner-friendly:

Tool	Purpose	Why We Chose It
MLflow	Experiment tracking and model registry	Open-source, widely adopted, great UI
Feast	Feature store for consistent feature serving	Production-grade, runs locally, same API for offline/online
FastAPI	High-performance web framework for serving predictions	Fast, automatic docs, modern Python
Great Expectations	Data validation framework	Declarative expectations, great reports
Evidently	Monitoring for data drift and model decay	Beautiful reports, easy to integrate
Docker	Containerization for environment consistency	Industry standard, works everywhere
GitHub Actions	CI/CD automation	Free for public repos, tight GitHub integration

Let me explain each tool briefly:

MLflow is an open-source platform designed to manage the ML lifecycle. It provides experiment tracking (logging parameters, metrics, and artifacts), a model registry (versioning models with aliases), and model serving capabilities. We'll use it to ensure our experiments are reproducible and our models are versioned.

Feast (Feature Store) is an open-source feature store that helps manage and serve features consistently between training and inference. This prevents a common problem called "training-serving skew" where the features used in production differ slightly from those used in training, causing silent accuracy degradation.

FastAPI is a modern, fast web framework for building APIs with Python. It's known for being easy to use, efficient, and producing automatic interactive documentation. We'll use it to serve our model predictions.

Great Expectations is an open-source tool for data quality testing. It allows us to define "expectations" on data (like "amount should be positive" or "hour should be between 0 and 23") and test incoming data against them.

Evidently is an open-source library for monitoring data and model performance over time. It can detect data drift (when input distributions change) and model decay (when accuracy drops).

Docker ensures the same environment and dependencies in development and deployment, avoiding the classic "works on my machine" problem.

GitHub Actions provides CI/CD automation. An efficient CI/CD pipeline helps integrate and deploy changes faster and with fewer errors.

💡 Mental Model: Think of this as building a "safety net" around your ML model. Each tool we add catches a different failure mode, like defensive driving for machine learning.

Prerequisites

You'll need:

Python 3.9+ installed on your machine
Docker Desktop installed and running
GitHub account (if you want to try the CI/CD pipeline)
Basic familiarity with Python and ML concepts (what training and prediction mean)

You don't need MLOps or Kubernetes experience. Everything will be done locally with just Python and Docker – no cloud and no Kubernetes needed.

Project Structure

Let's set up a basic project structure on your local machine. Open your terminal and run:

# Create project directory and subfolders
mkdir ml-platform-tutorial && cd ml-platform-tutorial
mkdir -p data models src tests feature_repo

# Set up a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

Your project structure should look like this:

ml-platform-tutorial/
├── data/              # Training and test datasets
├── models/            # Saved model files
├── src/               # Source code
├── tests/             # Test files
├── feature_repo/      # Feast feature repository
├── venv/              # Virtual environment
└── requirements.txt   # Dependencies

Next, create a requirements.txt with all the necessary libraries:

# requirements.txt

# Core ML libraries
pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0

# Experiment tracking and model registry
mlflow==2.10.0

# Feature store
feast==0.36.0

# API framework
fastapi==0.109.0
uvicorn==0.27.0
httpx==0.26.0

# Data validation
great-expectations==0.18.8

# Monitoring
evidently==0.7.20

# Testing
pytest==8.0.0
pytest-cov==4.1.0

# Utilities
pyarrow==15.0.0
pydantic==2.6.0

📌 Version Note: Exact versions are pinned to ensure reproducibility. Newer versions may work, but all examples were tested with the versions listed here.

Install the dependencies:

pip install -r requirements.txt

This might take a few minutes as it installs all the packages. Once complete, we're ready to start building our project step by step.

Checkpoint: You should have a project folder with data/, models/, src/, tests/, and feature_repo/ directories, and an activated virtual environment with all dependencies installed. Verify by running python -c "import mlflow; import feast; import fastapi; print('All imports successful!')".

Figure 1: The Complete ML Platform We'll Build

Don't worry if this looks complex, we'll build each component step by step, starting with the simplest piece and connecting them together.

1. Build a Simple Model and API (The Naive Approach)

To illustrate why we need all these tools, let's start by building a naive ML system without any MLOps infrastructure. We'll train a simple model and deploy it quickly, then observe what problems arise. This "naive approach" is how most ML projects start – and understanding its limitations will motivate the solutions we implement later.

1.1 Train a Quick Model

First, we need some data. For simplicity, we'll generate a synthetic dataset for fraud detection so that we don't rely on any external data files. The dataset will have features like:

amount: Transaction amount in dollars
hour: Hour of the day (0-23) when the transaction occurred
day_of_week: Day of the week (0=Monday, 6=Sunday)
merchant_category: Type of merchant (grocery, restaurant, retail, online, travel)
is_fraud: Label indicating if the transaction is fraudulent (1) or legitimate (0)

We will simulate that only ~2% of transactions are fraud, which is an imbalance typical in real fraud data. This imbalance is important because it affects how we evaluate our model.

Create src/generate_data.py:

# src/generate_data.py
"""
Generate synthetic fraud detection dataset.

This script creates realistic-looking transaction data where fraudulent
transactions have different patterns than legitimate ones:
- Fraud tends to have higher amounts
- Fraud tends to occur late at night
- Fraud is more common for online and travel merchants
"""
import pandas as pd
import numpy as np

def generate_transactions(n_samples=10000, fraud_ratio=0.02, seed=42):
    """
    Generate synthetic fraud detection dataset.
    
    Args:
        n_samples: Total number of transactions to generate
        fraud_ratio: Proportion of fraudulent transactions (default 2%)
        seed: Random seed for reproducibility
    
    Returns:
        DataFrame with transaction features and fraud labels
    
    Fraud transactions have different patterns:
    - Higher amounts (mean \(245 vs \)33 for legit)
    - Late night hours (0-5, 23)
    - More likely to be online or travel merchants
    """
    np.random.seed(seed)
    n_fraud = int(n_samples * fraud_ratio)
    n_legit = n_samples - n_fraud

    # Legitimate transactions: normal shopping patterns
    # - Amounts follow a log-normal distribution (most small, some large)
    # - Hours are uniformly distributed throughout the day
    # - Merchant categories weighted toward everyday shopping
    legit = pd.DataFrame({
        "amount": np.random.lognormal(mean=3.5, sigma=1.2, size=n_legit),  # ~$33 average
        "hour": np.random.randint(0, 24, size=n_legit),
        "day_of_week": np.random.randint(0, 7, size=n_legit),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_legit,
            p=[0.30, 0.25, 0.25, 0.15, 0.05]  # Weighted toward everyday shopping
        ),
        "is_fraud": 0
    })
    
    # Fraudulent transactions: suspicious patterns
    # - Higher amounts (fraudsters go big)
    # - Late night hours (less scrutiny)
    # - More online and travel (easier to exploit)
    fraud = pd.DataFrame({
        "amount": np.random.lognormal(mean=5.5, sigma=1.5, size=n_fraud),  # ~$245 average
        "hour": np.random.choice([0, 1, 2, 3, 4, 5, 23], size=n_fraud),  # Late night
        "day_of_week": np.random.randint(0, 7, size=n_fraud),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_fraud,
            p=[0.05, 0.05, 0.10, 0.60, 0.20]  # Weighted toward online/travel
        ),
        "is_fraud": 1
    })
    
    # Combine and shuffle
    df = pd.concat([legit, fraud], ignore_index=True)
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    
    return df

if __name__ == "__main__":
    # Generate dataset
    print("Generating synthetic fraud detection dataset...")
    df = generate_transactions(n_samples=10000, fraud_ratio=0.02)
    
    # Split into train (80%) and test (20%)
    train_df = df.sample(frac=0.8, random_state=42)
    test_df = df.drop(train_df.index)
    
    # Save to CSV files
    train_df.to_csv("data/train.csv", index=False)
    test_df.to_csv("data/test.csv", index=False)
    
    # Print summary statistics
    print(f"\nDataset generated successfully!")
    print(f"Training set: {len(train_df):,} transactions")
    print(f"Test set: {len(test_df):,} transactions")
    print(f"Overall fraud ratio: {df['is_fraud'].mean():.2%}")
    print(f"\nLegitimate transactions - Average amount: ${df[df['is_fraud']==0]['amount'].mean():.2f}")
    print(f"Fraudulent transactions - Average amount: ${df[df['is_fraud']==1]['amount'].mean():.2f}")
    print(f"\nMerchant category distribution (fraud):")
    print(df[df['is_fraud']==1]['merchant_category'].value_counts(normalize=True))

Run the data generation script:

python src/generate_data.py

You should see output like:

Generating synthetic fraud detection dataset...

Dataset generated successfully!
Training set: 8,000 transactions
Test set: 2,000 transactions
Overall fraud ratio: 2.00%

Legitimate transactions - Average amount: $33.45
Fraudulent transactions - Average amount: $245.67

Merchant category distribution (fraud):
online        0.60
travel        0.20
retail        0.10
restaurant    0.05
grocery       0.05

Now you have data/train.csv and data/test.csv with ~8000 training and ~2000 testing transactions.

Why This Matters: The synthetic data has realistic patterns — fraud is rare (2%), high-value, late-night, and concentrated in certain merchant categories. These patterns give our model something to learn.

Now, let's train a quick model. We'll use a simple Random Forest classifier from scikit-learn to predict is_fraud. In this naive version, we won't do much feature engineering – just label encode the categorical merchant_category and feed everything to the model.

Create src/train_naive.py:

# src/train_naive.py
"""
Train a fraud detection model - NAIVE VERSION.

This script demonstrates the "quick and dirty" approach to ML:
- No experiment tracking
- No model versioning
- Just train and save to a pickle file

We'll improve on this in later sections.
"""
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score,
    confusion_matrix,
    classification_report
)

def main():
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    print(f"Training samples: {len(train_df):,}")
    print(f"Test samples: {len(test_df):,}")
    print(f"Training fraud ratio: {train_df['is_fraud'].mean():.2%}")
    
    # Encode the categorical feature
    # We need to save the encoder to use the same mapping at inference time
    print("\nEncoding categorical features...")
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    print(f"Merchant category mapping: {dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))}")
    
    # Prepare features and labels
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    # Train a Random Forest classifier
    print("\nTraining Random Forest model...")
    model = RandomForestClassifier(
        n_estimators=100,      # Number of trees
        max_depth=10,          # Maximum depth of each tree
        random_state=42,       # For reproducibility
        n_jobs=-1              # Use all CPU cores
    )
    model.fit(X_train, y_train)
    print("Training complete!")
    
    # Evaluate on test data
    print("\n" + "="*50)
    print("MODEL EVALUATION")
    print("="*50)
    
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"F1-score:  {f1_score(y_test, y_pred):.4f}")
    
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(f"  True Negatives:  {cm[0][0]:,} (correctly identified legitimate)")
    print(f"  False Positives: {cm[0][1]:,} (legitimate flagged as fraud)")
    print(f"  False Negatives: {cm[1][0]:,} (fraud missed - DANGEROUS!)")
    print(f"  True Positives:  {cm[1][1]:,} (correctly caught fraud)")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
    
    # Feature importance
    print("\nFeature Importance:")
    for name, importance in sorted(
        zip(feature_cols, model.feature_importances_),
        key=lambda x: x[1],
        reverse=True
    ):
        print(f"  {name}: {importance:.4f}")
    
    # Save the model and encoder together
    print("\nSaving model to models/model.pkl...")
    with open("models/model.pkl", "wb") as f:
        pickle.dump((model, encoder), f)
    
    print("\nModel trained and saved successfully!")
    print("\nWARNING: This naive approach has several problems:")
    print("  - No record of hyperparameters or metrics")
    print("  - No model versioning")
    print("  - No way to reproduce this exact model")
    print("  - We'll fix these issues in the following sections!")

if __name__ == "__main__":
    main()

Run the training script:

python src/train_naive.py

You should see output similar to:

Loading data...
Training samples: 8,000
Test samples: 2,000
Training fraud ratio: 2.00%

Encoding categorical features...
Merchant category mapping: {'grocery': 0, 'online': 1, 'restaurant': 2, 'retail': 3, 'travel': 4}

Training Random Forest model...
Training complete!

==================================================
MODEL EVALUATION
==================================================

Accuracy:  0.9820
Precision: 0.7273
Recall:    0.6154
F1-score:  0.6667

Confusion Matrix:
  True Negatives:  1,956 (correctly identified legitimate)
  False Positives: 4 (legitimate flagged as fraud)
  False Negatives: 32 (fraud missed - DANGEROUS!)
  True Positives:  8 (correctly caught fraud)

Feature Importance:
  amount: 0.5423
  hour: 0.2156
  merchant_encoded: 0.1345
  day_of_week: 0.1076

Important observation: You'll see ~98% accuracy but a lower F1-score (around 0.5-0.7). With only 2% fraud, accuracy is extremely misleading! A model that always predicts "not fraud" would achieve 98% accuracy while catching zero fraud. This is why we focus on F1-score, precision, and recall for imbalanced classification problems.

💡 If you're new to imbalanced classification, remember: high accuracy can be meaningless when the positive class is rare.

The script outputs a file models/model.pkl containing both the trained model and the label encoder (we need both for inference).

Checkpoint: You should now have:

data/train.csv (~8,000 rows)
data/test.csv (~2,000 rows)
models/model.pkl (trained model + encoder)

The model should show ~98% accuracy but F1 around 0.5-0.7. Verify the files exist: ls -la data/ models/

1.2 Serve Predictions with FastAPI

Now that we have a model, let's deploy it as an API so that clients can get predictions. We'll use FastAPI because it's straightforward, very fast, and produces automatic interactive documentation.

FastAPI is known for:

Easy to use: Pythonic syntax with type hints
High performance: One of the fastest Python frameworks
Automatic documentation: Swagger UI out of the box
Data validation: Using Pydantic models

Create src/serve_naive.py:

# src/serve_naive.py
"""
Serve fraud detection model as a REST API - NAIVE VERSION.

This is a simple API that:
1. Loads the trained model at startup
2. Accepts transaction data via POST request
3. Returns fraud prediction

We'll improve this with validation, monitoring, and better
model loading in later sections.
"""
import pickle
from fastapi import FastAPI
from pydantic import BaseModel, Field
from typing import Optional

# Load the trained model and encoder at startup
# This is loaded once when the server starts, not on every request
print("Loading model...")
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)
print("Model loaded successfully!")

# Create the FastAPI application
app = FastAPI(
    title="Fraud Detection API",
    description="""
    Predict whether a credit card transaction is fraudulent.
    
    This API accepts transaction details and returns:
    - Whether the transaction is predicted to be fraud
    - The probability of fraud (0.0 to 1.0)
    
    **Note:** This is the naive version without validation or monitoring.
    """,
    version="1.0.0"
)

# Define the input schema using Pydantic
# This provides automatic validation and documentation
class Transaction(BaseModel):
    """Schema for a transaction to be evaluated for fraud."""
    amount: float = Field(
        ..., 
        description="Transaction amount in dollars",
        example=150.00
    )
    hour: int = Field(
        ..., 
        description="Hour of the day (0-23)",
        example=14
    )
    day_of_week: int = Field(
        ..., 
        description="Day of week (0=Monday, 6=Sunday)",
        example=3
    )
    merchant_category: str = Field(
        ..., 
        description="Type of merchant",
        example="online"
    )

class PredictionResponse(BaseModel):
    """Schema for the prediction response."""
    is_fraud: bool = Field(description="Whether the transaction is predicted as fraud")
    fraud_probability: float = Field(description="Probability of fraud (0.0 to 1.0)")
    
@app.post("/predict", response_model=PredictionResponse)
def predict(transaction: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Takes transaction details and returns a fraud prediction
    along with the probability score.
    """
    # Convert the request to a dictionary
    data = transaction.dict()
    
    # Encode the merchant category using the same encoder from training
    # This ensures consistency between training and serving
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        # Handle unknown merchant categories
        # In production, we'd want better handling here
        data["merchant_encoded"] = 0
    
    # Prepare features in the same order as training
    X = [[
        data["amount"],
        data["hour"],
        data["day_of_week"],
        data["merchant_encoded"]
    ]]
    
    # Get prediction and probability
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0][1]  # Probability of class 1 (fraud)
    
    return PredictionResponse(
        is_fraud=bool(prediction),
        fraud_probability=round(float(probability), 4)
    )

@app.get("/health")
def health_check():
    """
    Health check endpoint.
    
    Returns the status of the API. Useful for:
    - Load balancer health checks
    - Kubernetes liveness probes
    - Monitoring systems
    """
    return {
        "status": "healthy",
        "model_loaded": model is not None
    }

@app.get("/")
def root():
    """Root endpoint with API information."""
    return {
        "message": "Fraud Detection API",
        "version": "1.0.0",
        "docs": "/docs",
        "health": "/health"
    }

A few important things to note about this code:

Pydantic Models: We use BaseModel to define the expected input JSON schema. FastAPI automatically validates incoming requests against this schema.
Type Hints: The type hints (float, int, str) provide both documentation and runtime validation.
Feature Encoding: On each request, we encode the merchant category using the same LabelEncoder we saved from training. This ensures consistency between training and serving.
Health Endpoint: The /health endpoint is standard practice for production APIs - it allows load balancers and monitoring systems to check if the service is running.

To run this API, use Uvicorn (an ASGI server):

uvicorn src.serve_naive:app --reload --host 0.0.0.0 --port 8000

The --reload flag enables auto-reload during development (the server restarts when you change code).

You should see:

Loading model...
Model loaded successfully!
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process

Now open your browser and go to http://localhost:8000/docs. You'll see the Swagger UI – an auto-generated interactive documentation where you can test the API directly from your browser!

Test the API using curl in another terminal:

# Test with a legitimate-looking transaction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}'

Expected response:

{"is_fraud": false, "fraud_probability": 0.02}

# Test with a suspicious transaction (high amount, late night, online)
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 500.0, "hour": 3, "day_of_week": 1, "merchant_category": "online"}'

Expected response:

{"is_fraud": true, "fraud_probability": 0.78}

We have a working model served as an API! In a real scenario, we could now integrate this API with a payment processing frontend, mobile app, or any system that needs fraud predictions.

But before we celebrate, let's examine this naive approach for potential pitfalls...

Checkpoint: Your API should be running at http://localhost:8000. The Swagger UI at /docs should show both endpoints (/predict and /health). Test with curl or the Swagger UI to verify predictions are returned.

2. Where the Naive Approach Breaks

Our quick-and-dirty ML pipeline works on the surface: it can train a model and serve predictions. However, hidden problems will emerge if we try to maintain or scale this system in production.

This section is critical: understanding these issues will motivate the solutions we implement in the following sections. Let's go through the problems one by one.

Problem 1: No Experiment Tracking (Reproducibility)

Try this thought experiment: Run train_naive.py again with different hyperparameters (change n_estimators to 200, or max_depth to 15). Would you be able to exactly reproduce the previous model's results if someone asked?

Probably not. Currently, we have no record of:

Which hyperparameters we used
What metrics we achieved
What version of the data we trained on
What library versions were installed
When the training happened
Who ran the training

Three months from now, if your manager asks "How was this model trained? Can you reproduce the results?" – you'd be in trouble. You might have the code, but you don't know which version of the code, which parameters, or which data produced the model that's currently in production.

Experiment tracking is the practice of logging all these details (code versions, parameters, metrics, data versions, artifacts) so experiments can be compared and replicated. Our naive approach lacks this entirely, making our results hard to trust or build upon.

Problem 2: Model Versioning and Deployment Chaos

We trained one model and saved it as model.pkl. Now consider this scenario:

You train a new model with different hyperparameters
You overwrite model.pkl with the new model
You deploy it to production
Users start complaining about more false positives
You want to roll back to the previous model
Problem: The previous model was overwritten and is gone forever

There's no systematic versioning. Questions you cannot answer:

Which model version is currently in production?
What were the metrics for model v1 vs v2?
When was each model trained and by whom?
Can we instantly roll back if the new model performs worse?
What changed between versions?

Without version control for models, you're flying blind. Imagine deploying code without Git – that's what we're doing with our model.

Problem 3: No Data Validation – Garbage In, Garbage Out

Right now, our API will accept any input and try to make a prediction. Let's see what happens with bad data.

Create a test script src/test_bad_data.py:

# src/test_bad_data.py
"""Test what happens when we send garbage data to the API."""
import requests

BASE_URL = "http://localhost:8000"

print("Testing API with various bad inputs...\n")

# Test 1: Negative amount
print("Test 1: Negative amount")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -500.0,        # Negative amount - impossible!
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 2: Invalid hour
print("Test 2: Hour = 25 (should be 0-23)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 25,              # Invalid hour!
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 3: Invalid day of week
print("Test 3: day_of_week = 10 (should be 0-6)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 10,       # Invalid day!
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 4: Unknown merchant category
print("Test 4: Unknown merchant category")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "unknown_category"  # Not in training data!
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 5: All bad at once
print("Test 5: Everything wrong")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -1000.0,
    "hour": 99,
    "day_of_week": 15,
    "merchant_category": "totally_fake"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

print("Observation: The API happily accepts ALL garbage and returns predictions!")
print("This is dangerous - bad data leads to bad predictions with no warning.")

Run it (make sure your API is still running):

python src/test_bad_data.py

You'll see something like:

Testing API with various bad inputs...

Test 1: Negative amount
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.15}

Test 2: Hour = 25 (should be 0-23)
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.08}

...

Observation: The API happily accepts ALL garbage and returns predictions!

The API accepts garbage and returns predictions with no warning! In production, this could mean:

Incorrect predictions based on impossible data
Fraud going undetected because of malformed input
Legitimate transactions blocked based on corrupted data
No way to debug why predictions are wrong

As the saying goes: "Garbage in, garbage out." But even worse – we don't even know garbage went in!

Problem 4: Model Drift – Performance Decay Over Time

Here's a scenario that happens in every production ML system:

January: You train your model on historical fraud data. It achieves 98% accuracy and 0.67 F1-score. Everyone's happy.
February: The model is deployed and working well. Fraud is being caught.
March: Fraudsters adapt. They start using different patterns – smaller amounts, different merchant categories, different times of day.
April: Your model's accuracy has dropped from 98% to 85%. F1-score dropped from 0.67 to 0.35. Fraud is slipping through.
May: A major fraud incident occurs. Investigation reveals the model has been underperforming for 2 months.

The problem: Nobody noticed for 2 months because there was no monitoring.

This phenomenon is called data drift (when input data distributions change) or concept drift (when the relationship between inputs and outputs changes). Both are inevitable in real-world systems.

Without monitoring:

You don't know when performance degrades
You don't know why performance degrades
You can't take corrective action until users complain
By then, significant damage may have occurred

Problem 5: No CI/CD or Deployment Safety

Our "deployment process" was literally:

SSH into the server (or run locally)
Run python src/train_naive.py
Copy model.pkl to the right place
Restart the API
Hope for the best

There's:

No automated testing: A typo could break everything
No staging environment: We test directly in production
No gradual rollout: 100% of traffic hits the new model immediately
No rollback capability: If something breaks, we have to manually fix it
No audit trail: Who deployed what and when?

This is how production incidents happen. A rushed deployment at 5 PM on Friday breaks the fraud detection system, and nobody notices until Monday when fraud losses have spiked.

Figure 2: Problems with the Naive Approach

Summary: What We Need to Fix

Our simple ML service is missing critical infrastructure. Here's the mapping of problems to solutions:

Problem	Impact	Solution	Section
No experiment tracking	Can't reproduce or compare models	MLflow Tracking	3
No model versioning	Can't roll back or audit	MLflow Registry	3
No feature consistency	Training-serving skew	Feast Feature Store	4
No data validation	Garbage predictions	Great Expectations	5
No monitoring	Drift goes unnoticed	Evidently	6
No CI/CD	Risky deployments	GitHub Actions + Docker	7

The good news: We can fix each of these by incrementally adding components to our pipeline. Each tool addresses a specific problem, and together they form a robust ML platform.

Let's start fixing these issues, one by one.

3. Add Experiment Tracking and Model Registry with MLflow

What breaks without this: You can't reproduce yesterday's results, can't compare experiments, and can't roll back when a new model fails in production.

Our first fix addresses Problems 1 and 2: experiment reproducibility and model versioning.

MLflow is an open-source platform designed to manage the ML lifecycle. We'll use two of its key components:

MLflow Tracking: Log experiments (parameters, metrics, artifacts) so you can compare runs and reproduce results
MLflow Model Registry: Version your models with aliases (champion, challenger) and manage the deployment lifecycle

Why This Matters: Without tracking, ML is guesswork. With MLflow, every run is logged with parameters, metrics, and artifacts. You can compare runs side-by-side, understand what actually improved your model, and reproduce any past experiment. The Model Registry adds governance – you know exactly which model is in production and can roll back in seconds.

3.1 How to Set Up the MLflow Tracking Server

MLflow can log experiments to a local directory by default, but to use the full UI and model registry, it's best to run the MLflow tracking server.

Open a new terminal (keep it separate from your API terminal) and run:

# Create a directory for MLflow data
mkdir -p mlruns

# Start the MLflow server
mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns

Let's break down these parameters:

--host 0.0.0.0: Listen on all network interfaces
--port 5000: Run on port 5000
--backend-store-uri sqlite:///mlflow.db: Store experiment metadata in a SQLite database (for production, you'd use PostgreSQL or MySQL)
--default-artifact-root ./mlruns: Store model artifacts (files) in the mlruns directory

You should see:

[INFO] Starting gunicorn 21.2.0
[INFO] Listening at: http://0.0.0.0:5000

Now open your browser and navigate to http://localhost:5000. You'll see the MLflow UI – it should be empty initially since we haven't logged any experiments yet.

3.2 How to Log Experiments in Code

Now let's modify our training script to log everything to MLflow. Create src/train_mlflow.py:

# src/train_mlflow.py
"""
Train fraud detection model with MLflow experiment tracking.

This script demonstrates proper ML experiment tracking:
- Log all hyperparameters
- Log all metrics (train and test)
- Log the trained model as an artifact
- Register the model in the Model Registry

Compare this to train_naive.py to see the difference!
"""
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    roc_auc_score
)
import pickle
from datetime import datetime

# Configure MLflow to use our tracking server
mlflow.set_tracking_uri("http://localhost:5000")

# Create or get the experiment
# All runs will be grouped under this experiment name
mlflow.set_experiment("fraud-detection")

def load_and_preprocess_data():
    """Load and preprocess the training and test data."""
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    # Encode categorical feature
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    # Prepare features
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    return X_train, y_train, X_test, y_test, encoder

def train_and_log_model(
    n_estimators: int = 100,
    max_depth: int = 10,
    min_samples_split: int = 2,
    min_samples_leaf: int = 1
):
    """
    Train a model and log everything to MLflow.
    
    Args:
        n_estimators: Number of trees in the forest
        max_depth: Maximum depth of each tree
        min_samples_split: Minimum samples required to split a node
        min_samples_leaf: Minimum samples required at a leaf node
    """
    X_train, y_train, X_test, y_test, encoder = load_and_preprocess_data()
    
    # Start an MLflow run - everything logged will be associated with this run
    with mlflow.start_run():
        # Add a descriptive run name
        run_name = f"rf_est{n_estimators}_depth{max_depth}_{datetime.now().strftime('%H%M%S')}"
        mlflow.set_tag("mlflow.runName", run_name)
        
        # Log all hyperparameters
        # These are the "knobs" we can tune
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_samples_split", min_samples_split)
        mlflow.log_param("min_samples_leaf", min_samples_leaf)
        mlflow.log_param("model_type", "RandomForestClassifier")
        
        # Log data information
        mlflow.log_param("train_samples", len(X_train))
        mlflow.log_param("test_samples", len(X_test))
        mlflow.log_param("fraud_ratio", float(y_train.mean()))
        mlflow.log_param("n_features", X_train.shape[1])
        
        # Train the model
        print(f"\nTraining model: n_estimators={n_estimators}, max_depth={max_depth}")
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X_train, y_train)
        
        # Evaluate and log metrics for BOTH train and test sets
        # This helps detect overfitting
        for dataset_name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
            y_pred = model.predict(X)
            y_prob = model.predict_proba(X)[:, 1]
            
            # Calculate all metrics
            accuracy = accuracy_score(y, y_pred)
            precision = precision_score(y, y_pred, zero_division=0)
            recall = recall_score(y, y_pred, zero_division=0)
            f1 = f1_score(y, y_pred, zero_division=0)
            roc_auc = roc_auc_score(y, y_prob)
            
            # Log metrics with dataset prefix
            mlflow.log_metric(f"{dataset_name}_accuracy", accuracy)
            mlflow.log_metric(f"{dataset_name}_precision", precision)
            mlflow.log_metric(f"{dataset_name}_recall", recall)
            mlflow.log_metric(f"{dataset_name}_f1", f1)
            mlflow.log_metric(f"{dataset_name}_roc_auc", roc_auc)
            
            print(f"  {dataset_name.upper()} - Accuracy: {accuracy:.4f}, F1: {f1:.4f}, ROC-AUC: {roc_auc:.4f}")
        
        # Log feature importance
        for feature, importance in zip(
            ["amount", "hour", "day_of_week", "merchant_encoded"],
            model.feature_importances_
        ):
            mlflow.log_metric(f"importance_{feature}", importance)
        
        # Log the model to MLflow AND register it in the Model Registry
        # This creates a new version of the model automatically
        print("\nRegistering model in MLflow Model Registry...")
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name="fraud-detection-model",
            input_example=X_train.iloc[:5]  # Example input for documentation
        )
        
        # Save and log the encoder as a separate artifact
        # We need this for inference
        with open("encoder.pkl", "wb") as f:
            pickle.dump(encoder, f)
        mlflow.log_artifact("encoder.pkl")
        
        # Get the run ID for reference
        run_id = mlflow.active_run().info.run_id
        print(f"\nMLflow Run ID: {run_id}")
        print(f"View this run: http://localhost:5000/#/experiments/1/runs/{run_id}")
        
        return model, encoder

def run_experiment_sweep():
    """
    Run multiple experiments with different hyperparameters.
    
    This demonstrates how MLflow helps compare different configurations.
    """
    print("="*60)
    print("RUNNING HYPERPARAMETER EXPERIMENT SWEEP")
    print("="*60)
    
    # Define different configurations to try
    experiments = [
        {"n_estimators": 50, "max_depth": 5},
        {"n_estimators": 100, "max_depth": 10},
        {"n_estimators": 100, "max_depth": 15},
        {"n_estimators": 200, "max_depth": 10},
        {"n_estimators": 200, "max_depth": 20},
    ]
    
    for i, params in enumerate(experiments, 1):
        print(f"\n--- Experiment {i}/{len(experiments)} ---")
        train_and_log_model(**params)
    
    print("\n" + "="*60)
    print("EXPERIMENT SWEEP COMPLETE!")
    print("="*60)
    print("\nView all experiments at: http://localhost:5000")
    print("Compare runs to find the best hyperparameters!")

if __name__ == "__main__":
    run_experiment_sweep()

This script:

Connects to MLflow: mlflow.set_tracking_uri("http://localhost:5000")
Creates an experiment: mlflow.set_experiment("fraud-detection")
Logs parameters: All hyperparameters and data info
Logs metrics: Accuracy, precision, recall, F1, ROC-AUC for both train and test sets
Logs the model: Saves the trained model as an artifact
Registers the model: Adds it to the Model Registry with automatic versioning

Run the experiment sweep:

python src/train_mlflow.py

You'll see output for each experiment:

============================================================
RUNNING HYPERPARAMETER EXPERIMENT SWEEP
============================================================

--- Experiment 1/5 ---
Loading data...
Training model: n_estimators=50, max_depth=5
  TRAIN - Accuracy: 0.9821, F1: 0.6545, ROC-AUC: 0.9234
  TEST - Accuracy: 0.9795, F1: 0.5714, ROC-AUC: 0.8956

Registering model in MLflow Model Registry...
MLflow Run ID: abc123...

--- Experiment 5/5 ---
Training model: n_estimators=200, max_depth=20
  TRAIN - Accuracy: 0.9856, F1: 0.7123, ROC-AUC: 0.9567
  TEST - Accuracy: 0.9810, F1: 0.6667, ROC-AUC: 0.9234

============================================================
EXPERIMENT SWEEP COMPLETE!
============================================================

All 5 runs are now logged to MLflow with full metrics comparison available in the UI.

Now refresh the MLflow UI at http://localhost:5000. You'll see:

Experiments tab: Shows the "fraud-detection" experiment with 5 runs
Each run: Shows parameters, metrics, and artifacts
Compare: You can select multiple runs and compare them side-by-side
Models tab: Shows "fraud-detection-model" with 5 versions

MLflow Tracking UI: Compare runs, metrics, and models at a glance

3.3 How to Use the Model Registry

The Model Registry provides a central hub for managing model versions and their lifecycle stages.

In the MLflow UI:

Click the "Models" tab in the top navigation
Click "fraud-detection-model"
You'll see all 5 versions listed with their metrics

Model Aliases: MLflow now uses aliases instead of stages. If you've seen older tutorials using "Staging" and "Production" stages, aliases are the newer, more flexible approach.

@champion: The production model serving live traffic
@challenger: Candidate model being tested
You can create custom aliases like @baseline, @latest and so on.

Assign an alias:

Open MLflow UI → Models → fraud-detection-model
Click on the version you want to promote
Click "Add Alias"
Enter champion and save

Now you've assigned the @champion alias to your best model. Your API will load whichever version has this alias, making rollbacks as simple as moving the alias to a different version.

Figure 3: MLflow Model Lifecycle — From Training to Production

3.4 Update API to Load from Registry

Now let's update our API to load the champion model from the MLflow Registry instead of a pickle file. Create src/serve_mlflow.py:

# src/serve_mlflow.py
"""
Serve fraud detection model from MLflow Model Registry.

This version loads the @champion model from MLflow, which means:
- Always serves the latest @champion model
- Can roll back by changing the @champion alias
- No manual file copying needed
"""
import mlflow
import mlflow.sklearn
import pickle
import os
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")

print("Loading model from MLflow Model Registry...")

# Load the champion model from the registry
# This automatically gets whichever version has the @champion alias
try:
    model = mlflow.sklearn.load_model("models:/fraud-detection-model@champion")
    print("Successfully loaded champion model from MLflow!")
except Exception as e:
    print(f"Error loading from MLflow: {e}")
    print("Make sure you've assigned the @champion alias to a model in the MLflow UI")
    raise

# Load the encoder (saved as an artifact)
# In a real system, you might also version this in MLflow
with open("encoder.pkl", "rb") as f:
    encoder = pickle.load(f)
print("Encoder loaded successfully!")

app = FastAPI(
    title="Fraud Detection API (MLflow)",
    description="""
    Fraud detection API that loads models from MLflow Model Registry.
    
    This version always serves the model with the @champion alias.
    To update the model:
    1. Train a new model with train_mlflow.py
    2. Compare metrics in MLflow UI
    3. Promote the best model to Production
    4. Restart this API
    
    To roll back: Move the @champion alias to a previous version in MLflow UI.
    """,
    version="2.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount in dollars", example=150.00)
    hour: int = Field(..., description="Hour of the day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Monday, 6=Sunday)", example=3)
    merchant_category: str = Field(..., description="Type of merchant", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    model_source: str = "MLflow Production"

@app.post("/predict", response_model=PredictionResponse)
def predict(tx: Transaction):
    """Predict whether a transaction is fraudulent using the champion model."""
    data = tx.dict()
    
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        data["merchant_encoded"] = 0
    
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        model_source="MLflow Production"
    )

@app.get("/health")
def health():
    return {"status": "healthy", "model_source": "MLflow Registry"}

@app.get("/model-info")
def model_info():
    """Get information about the currently loaded model."""
    return {
        "registry": "MLflow",
        "model_name": "fraud-detection-model",
        "alias": "champion",
        "tracking_uri": "http://localhost:5000"
    }

Stop your old API (Ctrl+C) and start this new one:

uvicorn src.serve_mlflow:app --reload --host 0.0.0.0 --port 8000

Now deploying a new model is a controlled, auditable process:

Train new model → Automatically registered as new version
Compare metrics → Use MLflow UI to compare with current Production
Set as champion → Assign @champion alias in MLflow UI
Restart API → Loads new Production model
Roll back if needed → Move @champion alias to previous version

Checkpoint:

MLflow UI (http://localhost:5000) should show the "fraud-detection" experiment with 5 runs
The "Models" tab should show "fraud-detection-model" with 5 versions
One version should have @champion alias
The API should load and serve @champion model

4. Ensure Feature Consistency with Feast

⚠️ First time hearing about feature stores? Don't worry.
You don't need to master every Feast detail on the first read.
Focus on why feature consistency matters — you can revisit the implementation later.
Key takeaway: Training and serving must compute features the same way, or your model silently fails.

What breaks without this: Your model sees different feature values in production than it saw during training. Accuracy drops silently. This is called "training-serving skew" and it's one of the most common causes of ML system failures.

One subtle but critical issue in ML systems is training-serving skew – when data transformations at training time differ from inference time. Even small discrepancies can severely degrade performance.

Why This Matters: Imagine you're computing "average transaction amount per merchant category" as a feature. During training, you compute it using pandas in a notebook. During serving, you compute it using SQL in a different system. Small differences in how these computations handle edge cases (nulls, rounding, time windows) cause the model to see different features in production than it was trained on.

The result? Silent failures where accuracy drops but nothing errors out. Your model is making predictions based on features it's never seen before, and you have no idea.

In our naive implementation, we did handle one simple case: we saved the LabelEncoder to ensure merchant_category is encoded the same way in training and serving. But imagine if we had more complex feature engineering:

Rolling averages over time windows
User-level aggregations
Cross-feature interactions
Real-time features from streaming data

Maintaining consistency manually becomes impossible.

4.1 What is Feast and Why Use It?

In production ML platforms, teams use a feature store to guarantee feature consistency between training and serving. Feast is one popular open-source option.

In this tutorial, we use Feast not because you must, but because it makes the training-serving contract explicit and teachable. The principles apply whether you use Feast, Tecton, Featureform, or a custom solution.

Feast provides:

Capability	Description
Single source of truth	Define features once, use everywhere
Offline/online consistency	Same features for training and serving
Point-in-time correctness	Prevents data leakage in training
Low-latency serving	Millisecond feature retrieval
Feature versioning	Track changes to feature definitions

How Feast works:

Define features in Python code (feature definitions)
Materialize features from your data sources to the online store
Retrieve features using the same API for both training (offline) and serving (online)

This ensures that training and serving use exactly the same feature computation logic.

4.2 Install and Initialize Feast

We already installed Feast via requirements.txt. Now let's initialize a feature repository.

# Navigate to the feature_repo directory
cd feature_repo

# Initialize Feast (this creates template files)
feast init . --minimal

# Go back to project root
cd ..

This creates the basic Feast structure:

feature_repo/
├── feature_store.yaml    # Feast configuration
└── __init__.py

4.3 Define Feature Definitions

First, let's create the Feast configuration file:

# feature_repo/feature_store.yaml
project: fraud_detection
registry: ../data/registry.db
provider: local
online_store:
  type: sqlite
  path: ../data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 3

This configuration:

Names our project "fraud_detection"
Uses SQLite for the online store (for production, you'd use Redis or DynamoDB)
Uses local files for the offline store (for production, you'd use BigQuery or Snowflake)

Now create the feature definitions:

# feature_repo/features.py
"""
Feast feature definitions for fraud detection.

This file defines:
- Entities: The keys we use to look up features (merchant_category)
- Data Sources: Where the raw feature data comes from (Parquet file)
- Feature Views: The features themselves and their schemas

The key insight: These definitions are the SINGLE SOURCE OF TRUTH.
Both training and serving use these exact definitions.
"""
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# =============================================================================
# ENTITIES
# =============================================================================
# An entity is the "key" we use to look up features.
# For merchant-level features, the entity is merchant_category.

merchant = Entity(
    name="merchant_category",
    description="Merchant category for the transaction (for example, 'online', 'grocery')",
    value_type=ValueType.STRING,
)

# =============================================================================
# DATA SOURCES
# =============================================================================
# Data sources tell Feast where to find the raw feature data.
# For local development, we use a Parquet file.
# For production, this could be BigQuery, Snowflake, S3, etc.

merchant_stats_source = FileSource(
    name="merchant_stats_source",
    path="../data/merchant_features.parquet",  # We'll create this file
    timestamp_field="event_timestamp",       # Required for point-in-time joins
)

# =============================================================================
# FEATURE VIEWS
# =============================================================================
# A Feature View defines a group of related features.
# It specifies:
# - Which entity the features are for
# - The schema (names and types of features)
# - Where the data comes from
# - How long features are valid (TTL)

merchant_stats_fv = FeatureView(
    name="merchant_stats",
    description="Aggregated statistics per merchant category",
    entities=[merchant],
    ttl=timedelta(days=7),  # Features are valid for 7 days
    schema=[
        Field(name="avg_amount", dtype=Float32, description="Average transaction amount"),
        Field(name="transaction_count", dtype=Int64, description="Number of transactions"),
        Field(name="fraud_rate", dtype=Float32, description="Historical fraud rate"),
    ],
    source=merchant_stats_source,
    online=True,  # Enable online serving (low-latency retrieval)
)

4.4 Materialize Features to Online Store

Now we need to:

Compute the features from our training data
Save them in a format Feast can read
Apply the Feast definitions
Materialize features to the online store

Create src/prepare_feast_features.py:

# src/prepare_feast_features.py
"""
Prepare feature data for Feast.

This script:
1. Computes aggregated merchant features from training data
2. Saves them in Parquet format (Feast's offline store format)
3. Applies Feast feature definitions
4. Materializes features to the online store for low-latency serving

Run this whenever your training data changes or you want to refresh features.
"""
import pandas as pd
import numpy as np
from datetime import datetime
import subprocess
import os

def compute_merchant_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute aggregated features by merchant category.
    
    THIS IS THE SINGLE SOURCE OF TRUTH FOR FEATURE COMPUTATION.
    
    Both training and serving will use features computed by this exact logic.
    Any change here automatically applies everywhere.
    
    Args:
        df: Transaction DataFrame with columns: amount, merchant_category, is_fraud
        
    Returns:
        DataFrame with computed features per merchant category
    """
    print("Computing merchant-level features...")
    
    # Group by merchant category and compute aggregates
    stats = df.groupby('merchant_category').agg({
        'amount': ['mean', 'count'],
        'is_fraud': 'mean'
    }).reset_index()
    
    # Flatten column names
    stats.columns = ['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']
    
    # Add timestamp for Feast (required for point-in-time correct joins)
    stats['event_timestamp'] = datetime.now()
    
    # Convert types to match Feast schema
    stats['avg_amount'] = stats['avg_amount'].astype('float32')
    stats['transaction_count'] = stats['transaction_count'].astype('int64')
    stats['fraud_rate'] = stats['fraud_rate'].astype('float32')
    
    return stats

def main():
    print("="*60)
    print("FEAST FEATURE PREPARATION")
    print("="*60)
    
    # Load training data
    print("\n1. Loading training data...")
    train_df = pd.read_csv('data/train.csv')
    print(f"   Loaded {len(train_df):,} transactions")
    
    # Compute merchant features
    print("\n2. Computing merchant features...")
    merchant_features = compute_merchant_features(train_df)
    
    print("\n   Computed features:")
    print(merchant_features.to_string(index=False))
    
    # Save as Parquet (required format for Feast file source)
    print("\n3. Saving features to Parquet...")
    os.makedirs('data', exist_ok=True)
    output_path = 'data/merchant_features.parquet'
    merchant_features.to_parquet(output_path, index=False)
    print(f"   Saved to {output_path}")
    
    # Apply Feast feature definitions
    print("\n4. Applying Feast feature definitions...")
    try:
        result = subprocess.run(
            ['feast', 'apply'],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Feature definitions applied successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error applying Feast: {e.stderr}")
        raise
    
    # Materialize features to online store
    print("\n5. Materializing features to online store...")
    try:
        result = subprocess.run(
            ['feast', 'materialize-incremental', datetime.now().isoformat()],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Features materialized successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error materializing: {e.stderr}")
        raise
    
    print("\n" + "="*60)
    print("FEAST FEATURE PREPARATION COMPLETE!")
    print("="*60)
    print("\nYou can now:")
    print("  - Retrieve features for training: get_training_features()")
    print("  - Retrieve features for serving: get_online_features()")
    print("  - View feature stats: feast feature-views list")

if __name__ == "__main__":
    main()

Run the feature preparation:

python src/prepare_feast_features.py

You should see:

============================================================
FEAST FEATURE PREPARATION
============================================================

1. Loading training data... 8,000 transactions
2. Computing merchant features...
   grocery: avg=$31.24, fraud_rate=0.85%
   online: avg=$98.45, fraud_rate=4.87%
   restaurant: avg=$28.12, fraud_rate=0.50%
   retail: avg=$45.67, fraud_rate=1.02%
   travel: avg=$156.23, fraud_rate=4.18%
3. Saving to data/merchant_features.parquet ✓
4. Applying Feast definitions... ✓
5. Materializing to online store... ✓

FEAST FEATURE PREPARATION COMPLETE!

4.5 Retrieve Features for Training and Serving

Now let's create utilities to retrieve features consistently for both training and serving:

# src/feast_features.py
"""
Feast feature retrieval for training and serving.

This module provides functions to retrieve features from Feast:
- get_training_features(): For offline training (historical features)
- get_online_features(): For real-time serving (low-latency)

IMPORTANT: Both functions use the SAME feature definitions,
ensuring consistency between training and serving.
"""
import pandas as pd
from feast import FeatureStore
from datetime import datetime

# Initialize Feast store (points to our feature_repo)
store = FeatureStore(repo_path="feature_repo")

def get_training_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Get features for training using Feast's offline store.
    
    Uses point-in-time correct joins to prevent data leakage.
    This means features are looked up as of the time each transaction occurred,
    not as of "now" - preventing you from accidentally using future data.
    
    Args:
        df: DataFrame with at least 'merchant_category' column
        
    Returns:
        DataFrame with original columns plus Feast features
    """
    print("Retrieving training features from Feast offline store...")
    
    # Prepare entity dataframe with timestamps
    # Each row needs: entity key(s) + event_timestamp
    entity_df = df[['merchant_category']].copy()
    entity_df['event_timestamp'] = datetime.now()  # See note below
    entity_df = entity_df.drop_duplicates()
    
    # ⚠️ Simplification: For clarity, we use the current timestamp here.
    # In real systems, this would be the actual event time of each transaction.
    
    # Retrieve historical features
    # Feast handles the point-in-time join automatically
    training_data = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
    ).to_df()
    
    # Merge features back with original dataframe
    result = df.merge(
        training_data[['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']],
        on='merchant_category',
        how='left'
    )
    
    print(f"Retrieved features for {len(entity_df)} unique merchants")
    return result

def get_online_features(merchant_category: str) -> dict:
    """
    Get features for real-time serving using Feast's online store.
    
    This is optimized for low-latency retrieval (milliseconds).
    Use this in your prediction API for real-time inference.
    
    Args:
        merchant_category: The merchant category to look up
        
    Returns:
        Dictionary with feature names and values
    """
    # Retrieve from online store (low-latency)
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": merchant_category}],
    ).to_dict()
    
    # Format the response
    return {
        'merchant_avg_amount': feature_vector['avg_amount'][0],
        'merchant_tx_count': feature_vector['transaction_count'][0],
        'merchant_fraud_rate': feature_vector['fraud_rate'][0],
    }

def get_online_features_batch(merchant_categories: list) -> pd.DataFrame:
    """
    Get features for multiple merchants at once (batch serving).
    
    More efficient than calling get_online_features() in a loop.
    
    Args:
        merchant_categories: List of merchant categories to look up
        
    Returns:
        DataFrame with features for each merchant
    """
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": mc} for mc in merchant_categories],
    ).to_df()
    
    return feature_vector

if __name__ == "__main__":
    # Test the feature retrieval functions
    print("="*60)
    print("TESTING FEAST FEATURE RETRIEVAL")
    print("="*60)
    
    # Test offline retrieval (for training)
    print("\n1. Testing OFFLINE feature retrieval (for training)...")
    train_df = pd.read_csv('data/train.csv').head(10)
    enriched = get_training_features(train_df)
    print("\n   Sample enriched training data:")
    print(enriched[['amount', 'merchant_category', 'avg_amount', 'fraud_rate']].head())
    
    # Test online retrieval (for serving)
    print("\n2. Testing ONLINE feature retrieval (for serving)...")
    for category in ['online', 'grocery', 'travel', 'restaurant', 'retail']:
        features = get_online_features(category)
        print(f"   {category}: avg_amount=${features['merchant_avg_amount']:.2f}, "
              f"fraud_rate={features['merchant_fraud_rate']:.2%}")
    
    # Test batch retrieval
    print("\n3. Testing BATCH online retrieval...")
    batch_features = get_online_features_batch(['online', 'grocery', 'travel'])
    print(batch_features)
    
    print("\n" + "="*60)
    print("FEAST FEATURE RETRIEVAL TEST COMPLETE!")
    print("="*60)

Test the feature retrieval:

python src/feast_features.py

You should see:

============================================================
TESTING FEAST FEATURE RETRIEVAL
============================================================

1. Testing OFFLINE feature retrieval (for training)...
Retrieving training features from Feast offline store...
Retrieved features for 5 unique merchants

   Sample enriched training data:
   amount merchant_category  avg_amount  fraud_rate
    45.23           grocery       31.24      0.0085
   123.45            online       98.45      0.0487
    ...

2. Testing ONLINE feature retrieval (for serving)...
   online: avg_amount=$98.45, fraud_rate=4.87%
   grocery: avg_amount=$31.24, fraud_rate=0.85%
   travel: avg_amount=$156.23, fraud_rate=4.18%
   restaurant: avg_amount=$28.12, fraud_rate=0.50%
   retail: avg_amount=$45.67, fraud_rate=1.02%

3. Testing BATCH online retrieval...
  merchant_category  avg_amount  transaction_count  fraud_rate
               online       98.45               1234      0.0487
              grocery       31.24               2345      0.0085
               travel      156.23                478      0.0418

Why Feast Over Custom Code?

Aspect	Custom Code	Feast
Consistency	Manual effort to keep in sync	Automatic - same definitions everywhere
Point-in-time correctness	Must implement yourself	Built-in
Online serving	Must build your own cache	Built-in online store
Feature versioning	Not supported	Built-in
Scalability	Limited	Production-ready (BigQuery, Redis, etc.)
Team collaboration	Difficult	Feature registry with documentation
Monitoring	Manual	Built-in feature statistics

💡 Mental Model: Treat feature definitions like database schemas.
You wouldn't compute a column one way in your application and a different way in your reports. Features deserve the same discipline — define once, use everywhere.

Checkpoint: After running prepare_feast_features.py, you should have:

data/merchant_features.parquet (computed features)
data/registry.db (Feast registry)
data/online_store.db (SQLite online store)

Running python src/feast_features.py should successfully retrieve features for all merchant categories.

5. Add Data Validation with Great Expectations

What breaks without this: Your API accepts garbage input (negative amounts, invalid hours) and returns meaningless predictions. Worse, you have no idea it happened.

Recall that our API currently trusts input blindly. We saw how garbage data produces a prediction with no warning. Great Expectations is an open-source tool for data quality testing – defining rules (expectations) and testing data against them.

Why This Matters: Data validation acts as a gatekeeper. Bad data is rejected before it can harm predictions. As the saying goes, "Garbage in, garbage out" – feeding unreliable data yields unreliable results. With validation, we transform this to "Garbage in, error out" – much better for debugging and reliability.

5.1 Define Expectations

What are reasonable expectations for our transaction data? Based on domain knowledge:

Field	Expectation	Reason
`amount`	Positive (> 0)	Negative transactions don't make sense
`amount`	Below $50,000	Extremely large amounts are outliers/errors
`hour`	0-23 inclusive	Valid hours in a day
`day_of_week`	0-6 inclusive	Valid days (Mon=0, Sun=6)
`merchant_category`	One of known categories	Must match training data
All fields	Not null	Required for prediction

Create src/data_validation.py:

# src/data_validation.py
"""
Data validation for fraud detection.

This module provides functions to validate input data BEFORE making predictions.
Invalid data is rejected with clear error messages.

The key insight: It's better to reject bad input than to make garbage predictions.
"""
import pandas as pd
from typing import Dict, List, Any, Optional

# Define the valid merchant categories (must match training data!)
VALID_CATEGORIES = ["grocery", "restaurant", "retail", "online", "travel"]

def validate_transaction(data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Validate a single transaction for fraud prediction.
    
    Checks all business rules and data quality requirements.
    Returns a dictionary with 'valid' (bool) and 'errors' (list).
    
    Args:
        data: Dictionary with transaction fields
        
    Returns:
        {"valid": bool, "errors": list of error messages}
        
    Example:
        >>> validate_transaction({"amount": -100, "hour": 25, ...})
        {"valid": False, "errors": ["amount must be positive", "hour must be 0-23"]}
    """
    errors = []
    
    # ==========================================================================
    # Amount Validation
    # ==========================================================================
    amount = data.get("amount")
    if amount is None:
        errors.append("amount is required")
    elif not isinstance(amount, (int, float)):
        errors.append(f"amount must be a number (got {type(amount).__name__})")
    elif amount <= 0:
        errors.append("amount must be positive")
    elif amount > 50000:
        errors.append(f"amount exceeds maximum allowed value of \(50,000 (got \){amount:,.2f})")
    
    # ==========================================================================
    # Hour Validation
    # ==========================================================================
    hour = data.get("hour")
    if hour is None:
        errors.append("hour is required")
    elif not isinstance(hour, int):
        errors.append(f"hour must be an integer (got {type(hour).__name__})")
    elif not (0 <= hour <= 23):
        errors.append(f"hour must be between 0 and 23 (got {hour})")
    
    # ==========================================================================
    # Day of Week Validation
    # ==========================================================================
    day = data.get("day_of_week")
    if day is None:
        errors.append("day_of_week is required")
    elif not isinstance(day, int):
        errors.append(f"day_of_week must be an integer (got {type(day).__name__})")
    elif not (0 <= day <= 6):
        errors.append(f"day_of_week must be between 0 (Monday) and 6 (Sunday) (got {day})")
    
    # ==========================================================================
    # Merchant Category Validation
    # ==========================================================================
    category = data.get("merchant_category")
    if category is None:
        errors.append("merchant_category is required")
    elif not isinstance(category, str):
        errors.append(f"merchant_category must be a string (got {type(category).__name__})")
    elif category not in VALID_CATEGORIES:
        errors.append(
            f"merchant_category must be one of {VALID_CATEGORIES} (got '{category}')"
        )
    
    return {
        "valid": len(errors) == 0,
        "errors": errors
    }

def validate_batch(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Validate a batch of transactions using Great Expectations.
    
    This is useful for validating training data or batch prediction requests.
    Uses Great Expectations for more sophisticated validation.
    
    Args:
        df: DataFrame with transaction data
        
    Returns:
        Dictionary with validation results
    """
    import great_expectations as gx
    
    # Convert to Great Expectations dataset
    ge_df = gx.from_pandas(df)
    
    results = []
    
    # Amount expectations
    r = ge_df.expect_column_values_to_be_between(
        'amount', min_value=0.01, max_value=50000, mostly=0.99
    )
    results.append(('amount_range', r.success, r.result))
    
    # Hour expectations
    r = ge_df.expect_column_values_to_be_between(
        'hour', min_value=0, max_value=23
    )
    results.append(('hour_range', r.success, r.result))
    
    # Day of week expectations
    r = ge_df.expect_column_values_to_be_between(
        'day_of_week', min_value=0, max_value=6
    )
    results.append(('day_range', r.success, r.result))
    
    # Merchant category expectations
    r = ge_df.expect_column_values_to_be_in_set(
        'merchant_category', VALID_CATEGORIES
    )
    results.append(('category_valid', r.success, r.result))
    
    # No nulls in critical fields
    for col in ['amount', 'hour', 'day_of_week', 'merchant_category']:
        r = ge_df.expect_column_values_to_not_be_null(col)
        results.append((f'{col}_not_null', r.success, r.result))
    
    # Summarize results
    passed = sum(1 for _, success, _ in results if success)
    total = len(results)
    
    return {
        'success': passed == total,
        'passed': passed,
        'total': total,
        'pass_rate': passed / total,
        'details': {name: {'passed': success, 'result': result} 
                   for name, success, result in results}
    }

if __name__ == "__main__":
    print("="*60)
    print("TESTING DATA VALIDATION")
    print("="*60)
    
    # Test single transaction validation
    print("\n1. Single Transaction Validation")
    print("-"*40)
    
    test_cases = [
        {
            "name": "Valid transaction",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Negative amount",
            "data": {"amount": -100.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Invalid hour",
            "data": {"amount": 50.0, "hour": 25, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Unknown merchant",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "unknown"}
        },
        {
            "name": "Everything wrong",
            "data": {"amount": -999, "hour": 99, "day_of_week": 15, "merchant_category": "fake"}
        },
    ]
    
    for tc in test_cases:
        result = validate_transaction(tc["data"])
        status = "PASS" if result["valid"] else "FAIL"
        print(f"\n{tc['name']}: {status}")
        if result["errors"]:
            for error in result["errors"]:
                print(f"  - {error}")
    
    # Test batch validation
    print("\n\n2. Batch Validation with Great Expectations")
    print("-"*40)
    
    train_df = pd.read_csv('data/train.csv')
    results = validate_batch(train_df)
    
    print(f"\nTraining data validation: {results['passed']}/{results['total']} checks passed")
    print(f"Pass rate: {results['pass_rate']:.1%}")
    
    if not results['success']:
        print("\nFailed checks:")
        for name, detail in results['details'].items():
            if not detail['passed']:
                print(f"  - {name}")

When to Use Which Validation Approach

Approach	Use Case	Latency	When to Use
Custom Python (`validate_transaction`)	Real-time API requests	<1ms	Every prediction request
Great Expectations	Batch data quality	Seconds	Training data, periodic audits, CI/CD

We use both in this tutorial because they serve different purposes:

Custom validation is your runtime gatekeeper — fast enough for every request
Great Expectations is your batch auditor — thorough checks on datasets

5.2 Integrate Validation into FastAPI

Now let's update our API to reject invalid input with clear error messages:

# src/serve_validated.py
"""
Serve fraud detection model with input validation.

This version adds data validation BEFORE making predictions:
- Invalid inputs are rejected with HTTP 400 and clear error messages
- Valid inputs are processed and predictions returned

This is much safer than the naive version which accepted garbage.
"""
import pickle
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.data_validation import validate_transaction

# Load model
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)

app = FastAPI(
    title="Fraud Detection API (Validated)",
    description="""
    Fraud detection API with input validation.
    
    All inputs are validated before prediction:
    - amount: Must be positive and below $50,000
    - hour: Must be 0-23
    - day_of_week: Must be 0-6
    - merchant_category: Must be one of: grocery, restaurant, retail, online, travel
    
    Invalid inputs return HTTP 400 with detailed error messages.
    """,
    version="3.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount (must be positive)", example=150.00)
    hour: int = Field(..., description="Hour of day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Mon, 6=Sun)", example=3)
    merchant_category: str = Field(..., description="Merchant type", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    validation_passed: bool = True

class ValidationErrorResponse(BaseModel):
    detail: dict

@app.post("/predict", response_model=PredictionResponse, responses={400: {"model": ValidationErrorResponse}})
def predict(tx: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Input is validated before prediction. Invalid inputs return HTTP 400.
    """
    data = tx.dict()
    
    # VALIDATE INPUT BEFORE MAKING PREDICTION
    validation = validate_transaction(data)
    
    if not validation["valid"]:
        raise HTTPException(
            status_code=400,
            detail={
                "message": "Validation failed",
                "errors": validation["errors"],
                "input": data
            }
        )
    
    # Input is valid - make prediction
    data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        validation_passed=True
    )

@app.get("/health")
def health():
    return {"status": "healthy", "validation": "enabled"}

Start the validated API:

uvicorn src.serve_validated:app --reload --host 0.0.0.0 --port 8000

Now test with bad data:

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": -500, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}'

Response (HTTP 400):

{
  "detail": {
    "message": "Validation failed",
    "errors": [
      "amount must be positive",
      "hour must be between 0 and 23 (got 25)",
      "day_of_week must be between 0 (Monday) and 6 (Sunday) (got 10)",
      "merchant_category must be one of ['grocery', 'restaurant', 'retail', 'online', 'travel'] (got 'fake')"
    ],
    "input": {"amount": -500, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}
  }
}

This is a huge improvement! Instead of silently accepting garbage and returning meaningless predictions, we now:

Reject invalid input immediately
Provide clear, actionable error messages
Return the original input for debugging
Use proper HTTP status codes (400 for client error)

Checkpoint: Your validated API should:

Accept valid transactions and return predictions
Reject invalid transactions with HTTP 400 and detailed error messages
Show validation errors for each invalid field

6. Monitor Model Performance and Data Drift

What breaks without this: Your model's accuracy drops from 98% to 70% over two months. Nobody notices until customers complain. By then, significant damage has occurred.

Even with a great model and clean input data, time can be an enemy. Model performance can decline as real-world data evolves – this is known as model drift or model decay.

Why This Matters: In traditional software, you monitor CPU, memory, error rates, and response times. In ML, you must also monitor:

Data quality (are inputs within expected ranges?)
Model performance (is accuracy holding up?)
Data drift (has input distribution changed?)
Prediction drift (has the distribution of predictions changed?)

Without monitoring, your model could be silently failing for weeks before anyone notices. By then, significant damage may have occurred – fraud slipping through, good customers blocked, revenue lost.

6.1 The Four Pillars of ML Observability

Pillar	What to Monitor	Why It Matters
Data Quality	Are inputs valid? Nulls? Outliers?	Bad data causes bad predictions
Model Performance	Accuracy, precision, recall, F1	Is the model still working?
Data Drift	Has input distribution changed from training?	Model may not generalize to new data
Prediction Drift	Has prediction distribution changed?	May indicate data or concept drift

6.2 Build a Drift Monitor with Evidently

Evidently is an open-source library specifically designed for ML monitoring. It can detect drift, generate reports, and integrate with monitoring systems.

Create src/monitoring.py:

# src/monitoring.py
"""
Model monitoring with Evidently.

This module provides tools to:
1. Detect data drift between training and production data
2. Generate detailed HTML reports
3. Track drift over time
4. Alert when drift exceeds thresholds

In production, you would run drift checks periodically (hourly, daily)
and alert when significant drift is detected.
"""
import pandas as pd
import numpy as np
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric
)
from datetime import datetime
from typing import List, Dict, Any, Optional

class DriftMonitor:
    """
    Monitor for detecting data drift between reference (training) and current data.
    
    Implementation Note: We use two approaches here:
    1. Scipy's KS-test — A lightweight statistical method that works anywhere (our fallback)
    2. Evidently — A full-featured library with beautiful reports (our primary tool)
    
    The KS-test is included as defensive coding — if Evidently fails to generate 
    a report, we still get drift detection.
    
    Usage:
        monitor = DriftMonitor(training_data)
        result = monitor.check_drift(production_data)
        if result['drift_detected']:
            alert("Drift detected!")
    """
    
    def __init__(self, reference_data: pd.DataFrame, feature_columns: Optional[List[str]] = None):
        """
        Initialize the drift monitor with reference (training) data.
        
        Args:
            reference_data: The training data to compare against
            feature_columns: Columns to monitor (default: all numeric columns)
        """
        self.reference = reference_data
        self.feature_columns = feature_columns or reference_data.select_dtypes(
            include=[np.number]
        ).columns.tolist()
        self.history: List[Dict[str, Any]] = []
        
        print(f"Drift monitor initialized with {len(self.reference):,} reference samples")
        print(f"Monitoring columns: {self.feature_columns}")
    
    def check_drift(self, current_data: pd.DataFrame, threshold: float = 0.1) -> Dict[str, Any]:
        """
        Check for drift between reference and current data.
        
        Args:
            current_data: Current/production data to check
            threshold: Drift share threshold for alerting (default 10%)
            
        Returns:
            Dictionary with drift results
        """
        from scipy import stats
        
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        # Simple statistical drift detection using KS test
        drifted_columns = []
        for col in self.feature_columns:
            statistic, p_value = stats.ks_2samp(
                ref_subset[col].dropna(),
                cur_subset[col].dropna()
            )
            if p_value < 0.05:  # 5% significance level
                drifted_columns.append(col)
        
        n_features = len(self.feature_columns)
        n_drifted = len(drifted_columns)
        drift_share = n_drifted / n_features if n_features > 0 else 0
        
        result = {
            'timestamp': datetime.now().isoformat(),
            'drift_detected': n_drifted > 0,
            'drift_share': drift_share,
            'drifted_columns': drifted_columns,
            'n_features': n_features,
            'n_drifted': n_drifted,
            'current_samples': len(current_data),
            'threshold': threshold,
            'alert': drift_share > threshold
        }
        
        self.history.append(result)
        
        return result
    
    def generate_report(self, current_data: pd.DataFrame, output_path: str = "drift_report.html"):
        """
        Generate a detailed HTML drift report using Evidently.
        
        Opens in browser for visual inspection of drift patterns.
        """
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        try:
            report = Report(metrics=[DataDriftPreset()])
            report.run(reference_data=ref_subset, current_data=cur_subset)
            
            # Save HTML report
            with open(output_path, 'w') as f:
                f.write(report.show(mode='inline').data)
            
            print(f"Drift report saved to {output_path}")
            print(f"Open this file in a browser to view detailed visualizations.")
        except Exception as e:
            print(f"Could not generate Evidently report: {e}")
            print(f"Using simplified drift detection instead.")
    
    def get_alerts(self, threshold: float = 0.1) -> List[Dict[str, Any]]:
        """
        Get all alerts from history where drift exceeded threshold.
        """
        return [
            {
                'timestamp': r['timestamp'],
                'severity': 'HIGH' if r['drift_share'] > 0.3 else 'MEDIUM',
                'drift_share': r['drift_share'],
                'message': f"Drift detected: {r['drift_share']:.1%} of features drifted",
                'drifted_columns': r['drifted_columns']
            }
            for r in self.history
            if r['drift_share'] > threshold
        ]
    
    def summary(self) -> Dict[str, Any]:
        """Get summary statistics from monitoring history."""
        if not self.history:
            return {"message": "No drift checks performed yet"}
        
        drift_shares = [r['drift_share'] for r in self.history]
        alerts = [r for r in self.history if r['alert']]
        
        return {
            'total_checks': len(self.history),
            'total_alerts': len(alerts),
            'avg_drift_share': np.mean(drift_shares),
            'max_drift_share': np.max(drift_shares),
            'first_check': self.history[0]['timestamp'],
            'last_check': self.history[-1]['timestamp']
        }


def simulate_drift_scenarios():
    """
    Demonstrate drift detection with different scenarios.
    
    This simulates what happens when production data differs from training data.
    """
    from src.generate_data import generate_transactions
    
    print("="*70)
    print("DRIFT DETECTION SIMULATION")
    print("="*70)
    
    # Load reference (training) data
    print("\n1. Loading reference data (training set)...")
    reference = pd.read_csv('data/train.csv')
    feature_cols = ['amount', 'hour', 'day_of_week']
    
    # Initialize drift monitor
    monitor = DriftMonitor(reference, feature_cols)
    
    # Scenario 1: Similar data (should show minimal drift)
    print("\n" + "-"*70)
    print("SCENARIO 1: Test data (similar distribution)")
    print("-"*70)
    test_data = pd.read_csv('data/test.csv')
    result = monitor.check_drift(test_data)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 2: Fraud spike (10% fraud instead of 2%)
    print("\n" + "-"*70)
    print("SCENARIO 2: Fraud spike (10% fraud rate instead of 2%)")
    print("-"*70)
    fraud_spike = generate_transactions(n_samples=2000, fraud_ratio=0.10, seed=101)
    result = monitor.check_drift(fraud_spike)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 3: Amount inflation (everything costs more)
    print("\n" + "-"*70)
    print("SCENARIO 3: Amount inflation (2x multiplier)")
    print("-"*70)
    inflated = test_data.copy()
    inflated['amount'] = inflated['amount'] * 2
    result = monitor.check_drift(inflated)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 4: Time shift (more late-night transactions)
    print("\n" + "-"*70)
    print("SCENARIO 4: Time shift (mostly late-night transactions)")
    print("-"*70)
    night_shift = test_data.copy()
    night_shift['hour'] = np.random.choice([0, 1, 2, 3, 22, 23], size=len(night_shift))
    result = monitor.check_drift(night_shift)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Generate detailed report for the most drifted scenario
    print("\n" + "-"*70)
    print("GENERATING DETAILED REPORT")
    print("-"*70)
    monitor.generate_report(night_shift, "drift_report.html")
    
    # Print summary
    print("\n" + "-"*70)
    print("MONITORING SUMMARY")
    print("-"*70)
    summary = monitor.summary()
    print(f"  Total checks: {summary['total_checks']}")
    print(f"  Total alerts: {summary['total_alerts']}")
    print(f"  Average drift share: {summary['avg_drift_share']:.1%}")
    print(f"  Maximum drift share: {summary['max_drift_share']:.1%}")
    
    # Print alerts
    alerts = monitor.get_alerts()
    if alerts:
        print(f"\n  Alerts ({len(alerts)}):")
        for alert in alerts:
            print(f"    [{alert['severity']}] {alert['message']}")
    
    print("\n" + "="*70)
    print("DRIFT DETECTION SIMULATION COMPLETE")
    print("="*70)
    print("\nOpen drift_report.html in your browser to see detailed visualizations!")


if __name__ == "__main__":
    simulate_drift_scenarios()

Run the drift simulation:

python src/monitoring.py

You'll see output showing how drift detection works in different scenarios. Then open drift_report.html in your browser to see beautiful visualizations of the drift patterns.

6.3 Production Monitoring Strategy

In a production environment, you would:

Log all predictions to a database or data warehouse
Run drift checks periodically (hourly for high-traffic systems, daily for lower traffic)
Set up alerts when drift exceeds thresholds (integrate with PagerDuty, Slack, etc.)
Trigger retraining if drift is severe or sustained
Create dashboards to track drift over time (Grafana, Datadog, etc.)

Checkpoint: Running python src/monitoring.py should:

Show minimal drift for similar data (test set)
Show significant drift for modified data (fraud spike, inflation, time shift)
Generate an HTML report that you can view in your browser

7. Automate Testing and Deployment with CI/CD

What breaks without this: A typo in your code breaks the API. You deploy on Friday at 5 PM. Nobody notices until Monday. Fraud losses spike over the weekend.

CI/CD (Continuous Integration/Continuous Deployment) ensures reliable, repeatable releases. As JFrog notes: "A strong CI/CD pipeline enables ML teams to build robust, bug-free models more quickly and efficiently."

Why This Matters: In ML, changes aren't just code – they're also data and models. CI/CD ensures that when you change training logic, data preprocessing, or hyperparameters, tests verify the change doesn't break anything before it reaches production. It's the difference between deploying with confidence and deploying with crossed fingers.

7.1 Write Tests for Data and Model

Create tests/test_data_and_model.py:

# tests/test_data_and_model.py
"""
Tests for data quality and model performance.

These tests run in CI/CD to ensure:
1. Data meets quality requirements
2. Model meets performance thresholds
3. No regressions are introduced

Run with: pytest tests/test_data_and_model.py -v
"""
import pandas as pd
import pickle
import pytest
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

class TestDataQuality:
    """Tests for training data quality."""
    
    @pytest.fixture
    def train_data(self):
        return pd.read_csv("data/train.csv")
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_train_data_has_expected_columns(self, train_data):
        """Training data must have all required columns."""
        required_columns = {"amount", "hour", "day_of_week", "merchant_category", "is_fraud"}
        actual_columns = set(train_data.columns)
        missing = required_columns - actual_columns
        assert not missing, f"Missing columns: {missing}"
    
    def test_train_data_not_empty(self, train_data):
        """Training data must have rows."""
        assert len(train_data) > 0, "Training data is empty"
        assert len(train_data) >= 1000, f"Training data too small: {len(train_data)} rows"
    
    def test_no_negative_amounts(self, train_data):
        """Transaction amounts must be non-negative."""
        negative_count = (train_data["amount"] < 0).sum()
        assert negative_count == 0, f"Found {negative_count} negative amounts"
    
    def test_amounts_reasonable(self, train_data):
        """Transaction amounts should be within reasonable bounds."""
        max_amount = train_data["amount"].max()
        assert max_amount <= 100000, f"Max amount {max_amount} exceeds reasonable limit"
    
    def test_hours_valid(self, train_data):
        """Hours must be 0-23."""
        invalid = train_data[(train_data["hour"] < 0) | (train_data["hour"] > 23)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid hours"
    
    def test_days_valid(self, train_data):
        """Days of week must be 0-6."""
        invalid = train_data[(train_data["day_of_week"] < 0) | (train_data["day_of_week"] > 6)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid days"
    
    def test_merchant_categories_valid(self, train_data):
        """Merchant categories must be from known set."""
        valid_categories = {"grocery", "restaurant", "retail", "online", "travel"}
        actual_categories = set(train_data["merchant_category"].unique())
        invalid = actual_categories - valid_categories
        assert not invalid, f"Invalid merchant categories: {invalid}"
    
    def test_fraud_ratio_reasonable(self, train_data):
        """Fraud ratio should be realistic (between 0.1% and 50%)."""
        fraud_ratio = train_data["is_fraud"].mean()
        assert 0.001 <= fraud_ratio <= 0.5, f"Fraud ratio {fraud_ratio:.2%} is unrealistic"
    
    def test_no_nulls_in_critical_columns(self, train_data):
        """Critical columns must not have null values."""
        critical = ["amount", "hour", "day_of_week", "merchant_category", "is_fraud"]
        for col in critical:
            null_count = train_data[col].isnull().sum()
            assert null_count == 0, f"Column {col} has {null_count} null values"


class TestModelPerformance:
    """Tests for model performance thresholds."""
    
    @pytest.fixture
    def model_and_encoder(self):
        with open("models/model.pkl", "rb") as f:
            return pickle.load(f)
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_model_loads_successfully(self, model_and_encoder):
        """Model file must load without errors."""
        model, encoder = model_and_encoder
        assert model is not None, "Model is None"
        assert encoder is not None, "Encoder is None"
    
    def test_model_can_predict(self, model_and_encoder, test_data):
        """Model must be able to make predictions."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        predictions = model.predict(X)
        assert len(predictions) == len(X), "Prediction count mismatch"
    
    def test_accuracy_threshold(self, model_and_encoder, test_data):
        """Model accuracy must be at least 90%."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        accuracy = model.score(X, y)
        assert accuracy >= 0.90, f"Accuracy {accuracy:.2%} below 90% threshold"
    
    def test_f1_threshold(self, model_and_encoder, test_data):
        """Model F1-score must be at least 0.3 (sanity check for imbalanced data)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        f1 = f1_score(y, y_pred)
        assert f1 >= 0.3, f"F1-score {f1:.2f} below 0.3 threshold"
    
    def test_precision_not_zero(self, model_and_encoder, test_data):
        """Model precision must be greater than 0 (catches at least some fraud)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        precision = precision_score(y, y_pred, zero_division=0)
        assert precision > 0, "Model has zero precision (predicts no fraud)"
    
    def test_recall_not_zero(self, model_and_encoder, test_data):
        """Model recall must be greater than 0 (catches at least some fraud)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        recall = recall_score(y, y_pred, zero_division=0)
        assert recall > 0, "Model has zero recall (misses all fraud)"

Create tests/test_api.py:

# tests/test_api.py
"""
Tests for the FastAPI prediction service.

These tests ensure the API:
1. Returns correct responses for valid inputs
2. Rejects invalid inputs with proper error messages
3. Health check works

Run with: pytest tests/test_api.py -v
Note: Requires the API to be running on localhost:8000
"""
import pytest
import httpx

BASE_URL = "http://localhost:8000"

class TestPredictionEndpoint:
    """Tests for the /predict endpoint."""
    
    def test_valid_prediction_returns_200(self):
        """Valid input should return HTTP 200 with prediction."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        assert "is_fraud" in data
        assert "fraud_probability" in data
        assert isinstance(data["is_fraud"], bool)
        assert 0 <= data["fraud_probability"] <= 1
    
    def test_high_risk_transaction(self):
        """High-risk transaction should have higher fraud probability."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 500.0,
            "hour": 3,  # Late night
            "day_of_week": 1,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        # High-risk transactions should have elevated probability
        # (not asserting exact value as model may vary)
        assert data["fraud_probability"] >= 0.0
    
    def test_negative_amount_rejected(self):
        """Negative amount should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": -100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
        assert "errors" in response.json()["detail"]
    
    def test_invalid_hour_rejected(self):
        """Invalid hour should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 25,  # Invalid
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_invalid_merchant_rejected(self):
        """Unknown merchant category should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "unknown_category"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_missing_field_rejected(self):
        """Missing required field should be rejected."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14
            # Missing day_of_week and merchant_category
        }, timeout=10)
        
        assert response.status_code == 422  # Pydantic validation error


class TestHealthEndpoint:
    """Tests for the /health endpoint."""
    
    def test_health_returns_200(self):
        """Health endpoint should return 200."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        assert response.status_code == 200
    
    def test_health_returns_healthy_status(self):
        """Health endpoint should indicate healthy status."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        data = response.json()
        assert data["status"] == "healthy"

Run tests locally:

# Run data and model tests (API not needed)
pytest tests/test_data_and_model.py -v

# Run API tests (requires API to be running)
pytest tests/test_api.py -v

7.2 GitHub Actions Workflow

⚠️ Note for Production Teams
In real ML teams, you typically don't retrain full models inside CI — it's slow and resource-intensive.
Here we do it to keep everything local, reproducible, and self-contained for learning.
Production pipelines usually separate training (scheduled jobs) from testing (CI/CD).

Create .github/workflows/ci.yml:

# .github/workflows/ci.yml
name: ML Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: 'pip'
      
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      
      - name: Generate training data
        run: python src/generate_data.py
      
      - name: Train model
        run: python src/train_naive.py
      
      - name: Run data quality tests
        run: pytest tests/test_data_and_model.py -v --tb=short
      
      - name: Build Docker image
        run: docker build -t fraud-detection-api .
      
      - name: Run container for API tests
        run: |
          docker run -d -p 8000:8000 --name test-api fraud-detection-api
          sleep 10  # Wait for API to start
          curl -f http://localhost:8000/health || exit 1
      
      - name: Run API tests
        run: pytest tests/test_api.py -v --tb=short
      
      - name: Cleanup
        if: always()
        run: docker stop test-api || true

7.3 Dockerize the Application

Create Dockerfile:

# Dockerfile
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ src/
COPY models/ models/
COPY data/ data/

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the API
CMD ["uvicorn", "src.serve_validated:app", "--host", "0.0.0.0", "--port", "8000"]

Create .dockerignore:

# .dockerignore
venv/
__pycache__/
*.pyc
.git/
.github/
mlruns/
*.db
*.html
.pytest_cache/

Build and run locally:

# Build the Docker image
docker build -t fraud-detection-api .

# Run the container
docker run -p 8000:8000 fraud-detection-api

# Test it
curl http://localhost:8000/health

Checkpoint:

All tests pass: pytest tests/test_data_and_model.py -v
Docker image builds successfully
Container runs and responds to health checks

8. Incident Response Playbook

When things go wrong in production (and they will), you need a plan. This section provides playbooks for common ML incidents.

Scenario: False Positive Spike

Symptoms: Your fraud model suddenly flags 40% of legitimate transactions as fraud, blocking customers and overwhelming your manual review team.

Severity: HIGH - Direct customer impact

Phase 1: Mitigation (0-5 minutes)

Acknowledge the incident - Notify stakeholders that you're aware and responding
Roll back to previous model - In MLflow UI, move the @champion alias to the previous model version
Restart the API - docker restart fraud-api or redeploy
Verify - Check that false positive rate has returned to normal
Communicate - "Issue detected and mitigated. Investigating root cause."

Phase 2: Diagnosis (5-60 minutes)

Check drift report - Run python src/monitoring.py with recent production data
Check data validation logs - Did upstream data format change?
Check recent deployments - Was there a new model or code deployed recently?
Compare metrics - What's different between the rolled-back and problematic model?

Example root causes:

Upstream system sent amounts in cents instead of dollars
New merchant category appeared that wasn't in training data
Holiday shopping patterns differed significantly from training data

Phase 3: Remediation (1-24 hours)

Fix the root cause - Add validation for the edge case, or update training data
Retrain if needed - Include new patterns in training data
Add test case - Prevent this from happening again
Document - Add to runbook for future reference

Scenario: Gradual Performance Decay

Symptoms: Monitoring shows fraud recall dropping 2% per week over a month. No sudden failures, just slow degradation.

Severity: MEDIUM - Gradual impact, time to respond

Response:

Investigate drift report - Look for gradual distribution changes
```
python src/monitoring.py
```
Collect recent labeled data - Get confirmed fraud cases from the past month
Analyze patterns - What's different about recent fraud?
- New attack vectors?
- Different time patterns?
- New merchant categories?
Retrain on combined data - Include both old and new patterns
```
python src/train_mlflow.py
```
Deploy via canary - Route 10% of traffic to the new model first
- Monitor metrics for 1-2 days
- If metrics improve, increase to 50%, then 100%
- If metrics worsen, roll back
Set up recurring retraining - Schedule weekly or monthly retraining

Scenario: Upstream Data Schema Change

Symptoms: API starts returning 500 errors. Logs show KeyError: 'merchant_category'.

Severity: HIGH - Service is down

Response:

Check error logs - Identify the exact error
```
KeyError: 'merchant_category'
```
Check upstream data - Did the field name change?
- merchant_category -> category
- amount -> transaction_amount

Immediate fix - Add field name mapping

# Quick fix in API
if 'category' in data and 'merchant_category' not in data:
    data['merchant_category'] = data['category']

Long-term fix - Add validation that catches schema changes

required_fields = ['amount', 'hour', 'day_of_week', 'merchant_category']
missing = [f for f in required_fields if f not in data]
if missing:
    raise ValidationError(f"Missing fields: {missing}")

Add integration test - Test with upstream system in CI/CD

9. How to Put It All Together

Let's step back and appreciate what we've built. Our initial naive system has transformed into a local ML platform with production-grade components.

💡 Mental Model: Each tool in this stack is a "catch net" for a specific failure mode:

MLflow catches "which model is this?"

Feast catches "are features consistent?"

Great Expectations catches "is this data valid?"

Evidently catches "has the world changed?"

CI/CD catches "did we break something?"

Together, they form defense-in-depth for ML systems.

Component	Tool	Problem Solved
Experiment Tracking	MLflow	Every run logged, reproducible
Model Registry	MLflow	Versioned models, rollback capability
Feature Store	Feast	Consistent features, no training-serving skew
Data Validation	Great Expectations	Bad data rejected with clear errors
Monitoring	Evidently	Drift detected before it causes problems
Containerization	Docker	Environment consistency everywhere
CI/CD	GitHub Actions	Automated testing and safe deployments

The Complete Workflow

Here's how all the pieces work together in practice:

Data arrives - New transaction data comes in from upstream systems
Validation gate - Great Expectations rules check data quality. Bad data is rejected with clear error messages before it can cause harm.
Feature computation - Feast computes features using the same definitions for both training and serving. No more training-serving skew.
Training - When you retrain, MLflow logs all parameters, metrics, and artifacts. Every experiment is reproducible and comparable.
Model registry - Trained models are automatically versioned. You can compare metrics, promote the best to Production, and roll back if needed.
Serving - FastAPI loads the @champion model from MLflow. Each request is validated, features are retrieved from Feast, and predictions are returned.
Monitoring - Evidently checks for drift periodically. If input distributions change significantly, alerts are triggered.
Retraining loop - When drift is detected, you retrain on new data, compare metrics, and promote if better. The cycle continues.
CI/CD safety net - All code changes go through automated tests. Docker ensures environment consistency. Nothing reaches production without passing the pipeline.

10. What's Next: Scale to Production

This project runs locally, but the principles and tools extend directly to production deployments. Here's how each component scales:

Scaling Feast for Production

We used Feast with local SQLite stores. For production:

Component	Local	Production
Online Store	SQLite	Redis, DynamoDB, or PostgreSQL
Offline Store	Parquet files	BigQuery, Snowflake, or Redshift
Feature Server	Embedded	Dedicated Feast serving cluster

Benefits at scale:

Sub-10ms feature retrieval
Horizontal scaling for high throughput
Feature monitoring and statistics
Point-in-time joins at petabyte scale

Scaling MLflow for Production

Component	Local	Production
Backend Store	SQLite	PostgreSQL or MySQL
Artifact Store	Local filesystem	S3, GCS, or Azure Blob
Tracking Server	Single instance	Load-balanced cluster

Kubernetes Deployment

When you outgrow Docker Compose:

KServe or Seldon for serverless model serving with auto-scaling
Horizontal Pod Autoscaler to scale based on CPU/memory/custom metrics
Canary deployments to safely roll out new models (route 10% traffic first)
GPU scheduling for inference-heavy models

Advanced Monitoring

Expand observability with:

Prometheus + Grafana for real-time dashboards
OpenTelemetry for distributed tracing
PagerDuty/Slack integration for alerts
Labeled data collection for continuous model evaluation

A/B Testing and Multi-Armed Bandits

How to Use the Model Registry:

Serve multiple models concurrently (champion vs challengers)
Route traffic dynamically based on context
Collect metrics for each model variant
Automatically promote the best performer

Conclusion

Congratulations on building a production-ready ML system on your local machine!

What we assembled here is a microcosm of real-world ML platforms:

We started with just a model saved to a pickle file
We ended up with MLOps best practices: experiment tracking, model versioning, feature stores, data validation, monitoring, containerization, and CI/CD

The tools we used are production-grade:

MLflow powers ML platforms at companies like Microsoft, Facebook, and Databricks
Feast is used by companies like Gojek, Shopify, and Robinhood
FastAPI is one of the fastest Python web frameworks
Great Expectations is used at companies like GitHub and Shopify
Evidently is used for monitoring ML in production at scale

The principles apply at any scale:

Always track experiments
Always version models
Always validate data
Always monitor for drift
Always containerize for consistency
Always automate testing

Next Steps You Can Try

Deploy to the cloud - Push your Docker container to AWS ECS, Google Cloud Run, or Azure Container Instances
Add model explainability - Use SHAP or LIME to explain individual predictions
Implement A/B testing - Serve multiple models and compare performance
Add feature importance monitoring - Track how feature importance changes over time
Set up real-time alerting - Connect Evidently to Slack or PagerDuty
Implement continuous training - Automatically retrain when drift is detected
Add bias and fairness monitoring - Ensure your model treats all groups fairly

Remember that productionizing ML is an iterative process. There's always another layer of robustness to add, another edge case to handle, another metric to track. But with the foundation you've built here, you're well on your way to taking models from promising notebook experiments to deployed, monitored, and maintainable production applications.

Happy building, and may your models be accurate and your pipelines resilient!

Get the Complete Code

The entire project from this handbook is available as a public GitHub repository:

🔗 github.com/sandeepmb/freecodecamp-local-ml-platform

The repository includes:

All source code (src/ directory)
Test files (tests/ directory)
Feast feature definitions (feature_repo/)
Docker and CI/CD configuration
Ready-to-run scripts

Quick Start:

git clone https://github.com/sandeepmb/freecodecamp-local-ml-platform.git
cd freecodecamp-local-ml-platform
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python src/generate_data.py
python src/train_naive.py

References

MLflow Documentation - Experiment tracking and model registry
Feast Documentation - Feature store
Feast Quickstart - Getting started with Feast
FastAPI Documentation - Modern Python web framework
Great Expectations - Data validation
Evidently AI Documentation - ML monitoring
CI/CD for Machine Learning (JFrog) - CI/CD best practices
Training-Serving Skew Explained - Understanding skew
Docker Documentation - Containerization
GitHub Actions Documentation - CI/CD automation

How to Containerize Your MLOps Pipeline from Training to Serving

Balajee Asish Brahmandam — Thu, 12 Mar 2026 22:34:01 +0000

Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deploy it.

The model depended on a specific version of scikit-learn that conflicted with the production Python environment. The feature engineering pipeline required a NumPy build compiled against OpenBLAS, but the deployment servers ran MKL. A preprocessing step used a system library that existed on the data scientist's MacBook but not on the Ubuntu deployment target.

Three weeks of debugging later, we had the model running in production. Three weeks. For a model that was technically finished.

That experience is what pushed me to containerize our entire MLOps pipeline end to end. Not because Docker is trendy in ML circles, but because the alternative (hand-tuning environments, writing installation scripts that break on the next OS update, praying that what worked in training works in production) was costing us more time than the actual model development.

In this tutorial, you'll learn how to structure training and serving containers with multi-stage builds, how to set up experiment tracking with MLflow, how to version your training data with DVC, how to configure GPU passthrough for training, and how to tie it all together into a single Compose file with profiles. This is based on a year of running containerized ML pipelines across three teams.

Prerequisites

Docker Engine 24+ or Docker Desktop 4.20+ with Compose v2.22.0+
For GPU training, you'll need the NVIDIA Container Toolkit installed on the host and a compatible GPU driver. Run nvidia-smi to verify your GPU is visible, and docker compose version to check your Compose version.
Familiarity with Python, basic Docker concepts, and ML workflows (training, evaluation, serving) is assumed.

The MLOps Lifecycle: Where Containers Fit

If you've built a machine learning model, you know the process has a lot of stages. But if you're coming from a software engineering background (or you're a data scientist who mostly works in notebooks), it helps to see the full picture of what an MLOps pipeline looks like and where Docker fits into each stage.

An MLOps pipeline is a chain of interdependent stages:

Data ingestion and validation. Raw data comes in from databases, APIs, or file systems. You clean it, validate it, and store it in a format your model can use.
Feature engineering. You transform raw data into features the model can learn from. This might be as simple as normalizing numbers or as complex as generating embeddings.
Experiment tracking. You log every training run's configuration (hyperparameters, data version, code version) and results (accuracy, loss, evaluation metrics) so you can compare experiments and reproduce the best ones.
Model training. The model learns from your features. This is the compute-heavy part that often needs GPUs.
Evaluation. You measure the trained model against test data to see if it's good enough to deploy.
Packaging and serving. You wrap the trained model in an API so other systems can send it data and get predictions back.
Monitoring. You watch the model in production to catch problems like data drift (when the real-world data starts looking different from the training data) or performance degradation.

Each stage has different computational needs. Training might require GPUs and terabytes of memory. Serving needs low latency and horizontal scaling. Feature engineering might need distributed processing tools like Spark or Dask.

The thing that changed our approach: you don't containerize the entire pipeline as one monolithic image. You containerize each stage independently, with shared interfaces between them.

Think of it like microservices applied to ML infrastructure. Each container does one thing, does it well, and communicates with the others through well-defined interfaces: model artifacts stored in a registry, metrics logged to MLflow, data versioned in object storage.

This gives you the flexibility to:

Scale training on expensive GPU instances while running serving on cheaper CPU nodes
Update your feature engineering code without rebuilding your training environment
Version each stage independently in your container registry
Let data scientists and ML engineers work on training while platform engineers optimize serving

How to Build the Training Container

The training container is where most teams start, and where most teams make their first mistake.

The temptation is to create one massive image with every possible library, every CUDA version, every data processing tool. I've seen training images exceed 15GB. They take twenty minutes to build, ten minutes to push, and break whenever someone adds a new dependency.

Here's the pattern that works: use multi-stage builds to separate the build environment from the runtime environment, and use cache mounts to avoid re-downloading packages on every build.

If you're new to these concepts: a multi-stage build lets you use one Docker image to build your software and a different, smaller image to run it. You copy only the final artifacts from the build stage to the runtime stage, leaving behind compilers, build tools, and other things you don't need in production.

A cache mount tells Docker to keep a directory (like pip's download cache) between builds, so it doesn't re-download packages that haven't changed.

Here's the training Dockerfile:

# syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

# System dependencies (rarely change)
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl && \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Dependencies (change occasionally)
COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

# Training code (changes frequently)
COPY src/ /app/src/
COPY configs/ /app/configs/

WORKDIR /app
ENTRYPOINT ["python", "-m", "src.train"]

Notice the layer ordering. Docker builds images in layers, and it caches each layer. If a layer hasn't changed, Docker reuses the cached version instead of rebuilding it. But here's the catch: if one layer changes, Docker rebuilds that layer and every layer after it.

That's why we put things in order of how often they change:

System packages at the top (they almost never change). Installing python3.11 and git takes time, but you only do it once.
Python dependencies in the middle (they change when you add or update a library). This layer rebuilds when requirements-train.txt changes.
Your actual code at the bottom (changes on every commit). This is the layer that rebuilds most often.

With this ordering, a code change only rebuilds the final layer, not the entire image. If you put COPY src/ before pip install, every code change would trigger a full reinstall of all Python packages. That's the mistake I see most often in ML Dockerfiles.

The --mount=type=cache,target=/root/.cache/pip line on the pip install command tells Docker to persist pip's download cache between builds. When you do update requirements, pip checks the cache first and only downloads packages that are new or changed. On a project with hundreds of ML dependencies (PyTorch alone pulls in dozens of sub-packages), this saves five to ten minutes per build.

Separate Training from Serving Requirements

Your training environment needs libraries that your serving environment does not. Training needs experiment tracking tools like MLflow, data processing libraries like pandas and polars, visualization libraries for debugging, and hyperparameter tuning frameworks. Serving needs a lightweight inference runtime, an API framework like FastAPI, health check endpoints, and minimal overhead.

It's a good idea to maintain separate requirements files:

# requirements-train.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
pandas==2.2.3
polars==1.20.0
dvc[s3]==3.59.1
optuna==4.2.0
matplotlib==3.10.0

# requirements-serve.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
fastapi==0.115.0
uvicorn[standard]==0.34.0
pydantic==2.10.0

The overlap is smaller than you'd think. torch and scikit-learn appear in both because the model needs them for inference. Everything else in the training file is baggage that slows down serving deployments and increases the attack surface.

CUDA and Driver Compatibility

One thing that will bite you if you ignore it: the CUDA runtime version inside your container must be compatible with the GPU driver version on the host. The rule is that the host driver must be equal to or newer than the CUDA version in the container. For example, CUDA 12.6 requires driver version 560.28+ on Linux.

Make sure you check your host driver version before choosing your base image:

# On the host machine
nvidia-smi
# Look for "Driver Version: 560.35.03" and "CUDA Version: 12.6"

# The CUDA version shown by nvidia-smi is the maximum CUDA version
# your driver supports, not the version installed

If your host driver is 535.x, don't use a cuda:12.6 base image. Use cuda:12.2 or upgrade the driver. Mismatched versions produce cryptic errors like CUDA error: no kernel image is available for execution on the device that are painful to debug.

Pin your base images to specific tags (not latest) and document the minimum driver version in your README. When you deploy to new hardware, the driver version check should be part of your provisioning checklist.

How to Set Up Experiment Tracking with MLflow

If you've ever trained a model and thought "wait, which hyperparameters gave me that good result last week?", you need experiment tracking. Without it, ML development turns into a mess of Jupyter notebooks, screenshots of metrics, and spreadsheets that nobody keeps up to date.

MLflow is the most widely adopted open-source tool for this. It logs three things for every training run: parameters (learning rate, batch size, number of epochs), metrics (accuracy, loss, F1 score), and artifacts (the trained model file, plots, evaluation reports). It stores all of this in a database and gives you a web UI to compare runs side by side.

Running MLflow as a containerized service means the tracking server is persistent and shared across your team, not running on one person's laptop:

services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

volumes:
  mlflow-artifacts:
  postgres-data:

Let me break down what's happening here.

The mlflow service runs the MLflow tracking server. It stores experiment metadata (parameters, metrics) in a Postgres database and saves artifacts (model files, plots) to a Docker volume.

The depends_on with condition: service_healthy tells Compose to wait until Postgres is actually ready to accept connections before starting MLflow. Without this, MLflow would crash on startup because the database isn't ready yet.

The db service runs Postgres with a health check that uses pg_isready, a built-in Postgres utility that checks if the database is accepting connections. The start_period gives Postgres 10 seconds to initialize before health checks start counting failures.

Your training code connects to MLflow by setting one environment variable:

import os
import mlflow

# This tells MLflow where to log experiments
# When running inside Docker Compose, "mlflow" resolves to the mlflow container
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow:5000"

# Example: logging a training run
with mlflow.start_run(run_name="fraud-detector-v2"):
    # Log hyperparameters
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("epochs", 50)

    # ... train your model here ...

    # Log metrics
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.log_metric("precision", 0.93)
    mlflow.log_metric("recall", 0.89)

    # Log the trained model as an artifact
    mlflow.sklearn.log_model(model, "model")
    # Or for PyTorch: mlflow.pytorch.log_model(model, "model")

After the run completes, open http://localhost:5000 in your browser. You'll see a table of all your runs with their parameters and metrics. Click any run to see details, compare it with other runs, or download the model artifact. No more "I think experiment 7 was the good one" conversations.

A note on the password in the YAML: for local development this is fine. For staging and production, use Docker secrets or inject the credentials from your CI environment. Don't commit real database passwords to your repo.

How to Version Training Data with DVC

Models are reproducible only if you can also reproduce the data they were trained on. This is a problem Git can't solve on its own, because training datasets are often gigabytes or terabytes in size and Git isn't designed for large binary files.

DVC (Data Version Control) fills this gap. It works like Git, but for data. Here's the concept: instead of storing your 10GB training dataset in Git, DVC stores a small text file (a .dvc file) that acts as a pointer to the actual data. The real data lives in cloud storage (S3, Google Cloud Storage, Azure Blob). When you check out a specific Git commit, DVC knows which version of the data goes with that commit and can pull it from remote storage.

The workflow on your local machine looks like this:

# Initialize DVC in your project (one time)
dvc init

# Add your training data to DVC tracking
dvc add data/training_data.parquet
# This creates data/training_data.parquet.dvc (small pointer file)
# and adds training_data.parquet to .gitignore

# Push the actual data to remote storage
dvc push

# Commit the pointer file to Git
git add data/training_data.parquet.dvc .gitignore
git commit -m "Add training data v1"

Now your Git repo contains the pointer file, and the real data lives in S3. When someone else (or a container) needs the data, they run dvc pull and DVC downloads it from remote storage.

The training Dockerfile includes DVC, and the entrypoint pulls the correct data version before training begins:

# syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl && \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

COPY src/ /app/src/
COPY configs/ /app/configs/

# DVC tracking files (these are small text files in Git)
COPY data/*.dvc /app/data/
COPY .dvc/ /app/.dvc/

WORKDIR /app
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]

The entrypoint script pulls the data and then starts training:

#!/bin/bash
set -e

echo "Pulling training data from remote storage..."
dvc pull data/

echo "Starting training run..."
python -m src.train "$@"

For DVC to pull from S3, the container needs AWS credentials. You can pass them as environment variables in your Compose file or mount them from the host:

training:
  build: { context: ., dockerfile: Dockerfile.train }
  environment:
    - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
    - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    - AWS_DEFAULT_REGION=us-east-1

Combined with MLflow's experiment logging, you get a complete provenance chain: this model was trained on this version of the data (tracked by DVC), with these parameters (logged in MLflow), producing these metrics.

You can reproduce any past experiment by checking out the Git commit and running the training container.

How to Build the Serving Container

"Serving" means wrapping your trained model in an API so other systems can send it data and get predictions back. For example, a fraud detection model might expose a /predict endpoint that accepts transaction data and returns a fraud probability.

The serving container has different priorities than the training container. Training optimizes for flexibility and raw compute. Serving optimizes for speed, small size, and reliability:

FROM python:3.11-slim AS serving

WORKDIR /app

# Install curl for healthcheck
RUN apt-get update && apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*

COPY requirements-serve.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-serve.txt

COPY src/serving/ /app/src/serving/

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.serving.app:app", "--host", "0.0.0.0"]

A few things to understand if you're new to this:

uvicorn is a lightweight Python web server that runs FastAPI applications. FastAPI is a framework for building APIs in Python. Together, they let you turn your model into a web service that responds to HTTP requests.

HEALTHCHECK tells Docker to periodically check if your container is actually working, not just running. Every 30 seconds, Docker runs the curl command against the /health endpoint. If it fails three times in a row, Docker marks the container as unhealthy. This matters because your model server might be running but not ready (maybe the model file is still downloading), and you don't want to send traffic to a server that can't respond.

start-period of 60 seconds is important for ML serving containers. Model loading can take time, especially for large models (loading a 2GB model from a registry takes a while). Without start-period, the health check would start failing immediately, count those failures toward the retry limit, and the orchestrator might kill the container before the model finishes loading. The start period gives the container grace time to initialize.

Notice we're using python:3.11-slim here, not the NVIDIA CUDA image. Most trained models can run inference on CPU. If you need GPU inference (for example, running a large language model or doing real-time video processing), use the CUDA base image instead, but be aware that it makes the serving container much larger.

If you want to skip the curl dependency, use Python's built-in urllib for the health check:

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

Decouple Models from Containers

This is one of the most important patterns in this article, and the one beginners most often get wrong.

The temptation is to copy your trained model file (the .pkl, .pt, or .onnx file that contains the learned weights) directly into the Docker image during the build. Don't do this. When you embed model files in your Docker image, every model update requires a new image build and push. For a 2GB model, that means rebuilding the container, uploading 2GB to a registry, and redeploying, even though only the model changed and the code is identical.

Instead, have your serving container download the model from a model registry (like MLflow) or cloud storage (like S3) at startup. The container image stays small and generic. Model updates become a configuration change (pointing to a new model version) rather than a deployment.

Here's a full serving app using FastAPI with the modern lifespan pattern. If you've used Flask, FastAPI is similar but faster and with built-in request validation:

import os
from contextlib import asynccontextmanager

import mlflow
from fastapi import FastAPI

# MODEL_URI points to a specific model version in MLflow's registry
# Format: "models://" where stage is Staging or Production
MODEL_URI = os.environ.get("MODEL_URI", "models:/fraud-detector/production")
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # This runs once when the server starts up
    global model
    print(f"Loading model from {MODEL_URI}...")
    model = mlflow.pyfunc.load_model(MODEL_URI)
    print("Model loaded successfully.")
    yield
    # This runs when the server shuts down
    print("Shutting down model server.")


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health():
    """Used by Docker HEALTHCHECK to verify the server is ready."""
    if model is None:
        return {"status": "loading"}, 503
    return {"status": "healthy"}


@app.post("/predict")
async def predict(features: dict):
    """Accept features as JSON, return model prediction."""
    import pandas as pd

    # Convert the input dict into a DataFrame (what most sklearn/mlflow models expect)
    df = pd.DataFrame([features])
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}

When a client sends a POST request to /predict with JSON like {"amount": 500, "merchant_category": "electronics", "hour": 23}, the model returns a prediction. The /health endpoint returns 503 while the model is loading and 200 once it's ready, which is exactly what the Docker HEALTHCHECK checks for.

Promoting a new model version means updating the MODEL_URI environment variable and restarting the container. The MLflow model registry supports stage transitions (Staging, Production, Archived), so you can promote a model in the MLflow UI and then point your serving container at the new version.

For zero-downtime model updates, implement a reload endpoint that swaps models without restarting:

@app.post("/admin/reload")
async def reload_model():
    global model
    model = mlflow.pyfunc.load_model(MODEL_URI)
    return {"status": "reloaded"}

How to Configure GPU Passthrough for Training

By default, Docker containers can't see the GPU hardware on the host machine. "GPU passthrough" means giving a container access to the host's GPUs so that libraries like PyTorch and TensorFlow can use them for accelerated computation.

This requires two things on the host (the machine running Docker, not inside the container):

NVIDIA GPU drivers installed and working. Verify with nvidia-smi. If that command shows your GPUs, you're good.
NVIDIA Container Toolkit installed. This is the bridge between Docker and the GPU drivers. Install it from the NVIDIA docs and verify with docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi. If you see your GPU listed, the toolkit is working.

Once the host is set up, GPU access in Docker Compose looks like this:

services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000

The deploy.resources.reservations.devices block is saying: "this container needs all available NVIDIA GPUs." Inside the container, PyTorch and TensorFlow will see the GPUs and use them automatically. You can verify by adding print(torch.cuda.is_available()) to your training script, which should print True.

If you're running Compose v2.30.0+, you can use the shorter gpus syntax:

services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    gpus: all
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000

For multi-GPU training with frameworks like PyTorch's DistributedDataParallel, you can assign specific GPUs using device_ids. This matters when running multiple training jobs at the same time:

services:
  training-job-1:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1

  training-job-2:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["2", "3"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1

Note that CUDA_VISIBLE_DEVICES inside the container is relative to the devices assigned by Docker, not the host GPU indices. Both containers see their GPUs as device 0 and 1, even though they're using different physical GPUs.

How to Tie It All Together with Compose Profiles

If you're new to Compose profiles: by default, docker compose up starts every service defined in your docker-compose.yml. But you don't always want everything running. Your MLflow server and serving API should run all the time, but the training container should only launch when you're actually training a model (and it needs a GPU, which you might not have on your laptop).

Profiles solve this. When you add profiles: ["train"] to a service, that service is excluded from docker compose up by default. It only starts when you explicitly activate the profile with docker compose --profile train. This means one file defines your entire ML infrastructure, but you control what runs and when.

Here's the complete docker-compose.yml that ties every piece from this article together:

services:
  # --- Always-on infrastructure ---
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  serving:
    build: { context: ., dockerfile: Dockerfile.serve }
    ports:
      - "8000:8000"
    environment:
      - MODEL_URI=models:/fraud-detector/production
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      mlflow: { condition: service_started }

  # --- Training (on-demand) ---
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    profiles: ["train"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
      - ./configs:/app/configs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      mlflow: { condition: service_started }

volumes:
  postgres-data:
  mlflow-artifacts:

The day-to-day workflow with this file:

# Step 1: Start the infrastructure (MLflow + Postgres + serving API)
# The -d flag runs everything in the background
docker compose up -d

# Step 2: Open the MLflow UI to see past experiments
open http://localhost:5000    # macOS
# xdg-open http://localhost:5000  # Linux

# Step 3: Check that the serving API is healthy
curl http://localhost:8000/health
# Should return: {"status":"healthy"}

# Step 4: Run a training job (pulls data via DVC, logs to MLflow)
# This only starts the "training" service because of the profile flag
docker compose --profile train run training

# Step 5: Watch training progress in the MLflow UI at localhost:5000
# You'll see metrics updating in real time if your training code logs them

# Step 6: After training completes, promote the model in MLflow UI
# Click the model, go to "Register Model", set stage to "Production"

# Step 7: Restart the serving container to pick up the new model version
docker compose restart serving

# Step 8: Test the new model
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 500, "merchant_category": "electronics", "hour": 23}'

This single-file approach means a new team member can clone the repo, run docker compose up -d, and have the complete ML infrastructure running locally within minutes. The same containers deploy to staging and production with only environment variable changes (database credentials, model URIs, GPU allocation).

Reproducibility: The Whole Point

Everything in this article serves one goal: reproducibility. The ability to take any commit hash, build the same containers, pull the same data, and produce the same model.

Here are the practices that make this work:

Pin Everything

Pin your base images to specific digests, not just tags. Pin your Python packages to exact versions with pip freeze > requirements.txt. Use fixed random seeds in your training code and log them in MLflow.

Log Everything

Every training run should log the exact library versions (pip freeze), the Git commit hash, the DVC data version, all hyperparameters, and all evaluation metrics to MLflow. You can automate this:

import subprocess
import mlflow

with mlflow.start_run():
    # Log environment info automatically
    pip_freeze = subprocess.check_output(["pip", "freeze"]).decode()
    mlflow.log_text(pip_freeze, "pip_freeze.txt")

    git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.log_param("git_commit", git_hash)

    # ... rest of training ...

Version Everything

Git for code, DVC for data, MLflow for experiments, Docker digests for environments. The combination creates a complete provenance chain. When a stakeholder asks why a model made a particular prediction, you can trace it back to the exact code, data, and hyperparameters that produced it. For regulated industries like finance and healthcare, that traceability is a compliance requirement, not a nice-to-have.

Where This Breaks Down

This approach works well for small-to-medium teams running on single hosts or small clusters. Here's where you'll hit walls:

Large datasets. Don't mount multi-terabyte datasets into containers. Use object storage (S3, GCS) and stream data during training. DVC handles the versioning, but the data itself should live outside Docker entirely.

GPU driver mismatches. Your container's CUDA version must be compatible with the host driver. Test on identical hardware and driver versions to what you'll run in production. Document the minimum driver version in your README.

Multi-node training. When you need to distribute training across multiple machines, you'll outgrow Compose. Kubernetes with Kubeflow or KServe is the standard path for distributed training and auto-scaled serving.

Serving at scale. A single container running uvicorn handles moderate traffic. For high-throughput inference, you'll need a load balancer, multiple replicas, and potentially a dedicated serving framework like NVIDIA Triton Inference Server or TensorFlow Serving. Compose can run multiple replicas with docker compose up --scale serving=3, but it doesn't give you the routing, health-based load balancing, or rolling updates that a real orchestrator provides.

Secrets in production. The Compose file above uses plaintext passwords for local development. In production, use Docker secrets, HashiCorp Vault, or your cloud provider's secret manager. Never commit credentials to your repo.

Conclusion

Containerizing your MLOps pipeline turns fragile, environment-dependent models into reproducible, deployable artifacts. Multi-stage builds keep images lean. MLflow gives you experiment tracking and model lineage. DVC links code to data. GPU passthrough preserves training performance. A single Compose file with profiles ties the whole workflow together.

That fraud detection model I mentioned at the start? We eventually containerized the entire pipeline around it. The next model we shipped went from "notebook finished" to "running in production" in two days instead of three weeks. Most of that time was spent on evaluation and review, not fighting environments.

Containerization doesn't make your models better. It gets the infrastructure out of the way so you can focus on the work that does.

But even with these caveats, containerized MLOps eliminates the most common source of ML project delays: environment mismatch between development and production. The three weeks we spent debugging that fraud detection model deployment? That doesn't happen anymore.

If you found this useful, you can find me writing about MLOps, containerized workflows, and production AI systems on my blog.

Learn MLOps with MLflow and Databricks

Beau Carnes — Thu, 05 Mar 2026 14:53:59 +0000

As the industry standard for managing the machine learning life cycle, MLflow provides the necessary architecture to build systems that are both reproducible and scalable.

We just posted a course on the freeCodeCamp.org YouTube channel that will help you master the art of taking machine learning models out of the research phase and into a real production environment with this new end-to-end course on MLflow.

The curriculum begins with the fundamentals of experiment tracking, explaining why moving beyond basic Jupyter notebooks is critical for professional workflows. You will learn how to properly manage model parameters, metrics, and decision history so that every model pushed to production is fully auditable and traceable.

This course also covers LLM ops. You will discover how to use the prompt registry to version templates, manage different model providers through the AI Gateway, and implement LLM-as-a-judge for automated prompt evaluation. By integrating these tools with Databricks and Hugging Face, you will gain the hands-on expertise needed to serve and monitor complex models in an enterprise setting.

Watch the full course over at freeCodeCamp.org to start building production-ready ML systems today (5-hour watch).

How to Build End-to-End Machine Learning Lineage

Kuriko — Thu, 16 Oct 2025 13:43:13 +0000

Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance.

While many services for tracking ML lineage exist, creating a comprehensive and manageable lineage often proves complicated.

In this article, I’ll walk you through integrating a comprehensive ML lineage solution for an ML application deployed on serverless AWS Lambda, covering the end-to-end pipeline stages:

ETL pipeline
Data drift detection
Preprocessing
Model tuning
Risk and fairness evaluation.

What is Machine Learning Lineage?
What We’ll Build
- The System Architecture - AI Pricing for Retailers
- The ML Lineage
Workflow in Action
Step 1: Initiating a DVC Project
Step 2: The ML Lineage
Step 3: Deploying the DVC Project
Step 4: Configuring Scheduled Run with Prefect
Step 5: Deploying the Application
- Test in Local
Conclusion

Prerequisites:

Knowledge of key Machine Learning / Deep Learning concepts including the full lifecycle: data handling, model training, tuning, and validation.
Proficiency in Python, with experience using major ML libraries.
Basic understanding of DevOps principles.

Tools we’ll use:

Here is a summary of the tools we’re going to use to track the ML lineage:

DVC: An open-source version system for data. Used to track the ML lineage.
AWS S3: A secure object storage service from AWS. Used as a remote storage.
Evently AI: An open-source ML and LLM observability framework. Used to detect data drift.
Prefect: A workflow orchestration engine. Used to manage the schedule run of the lineage.

What is Machine Learning Lineage?

Machine learning (ML) lineage is a framework for tracking and understanding the complete lifecycle of a machine learning model.

It contains information at different levels such as:

Code: The scripts, libraries, and configurations for model training.
Data: The original data, transformations, and features.
Experiments: Training runs, hyperparameter tuning results.
Models: The trained models and their versions.
Predictions: The outputs of deployed models.

ML lineage is essential for multiple reasons:

Reproducibility: Recreate the same model and prediction for validation.
Root cause analysis: Trace back to the data, code, or configuration change when a model fails in production.
Compliance: Some regulated industries require proof of model training to ensure fairness, transparency, and adherence to laws like GDPR and the EU AI Act.

What We’ll Build

In this project, I’ll integrate an ML lineage into this price prediction system built on AWS Lambda architecture using DVC, an open-source version control system for ML applications.

The below diagram illustrates the system architecture and the ML lineage we’ll integrate:

Figure A: A comprehensive ML lineage for an ML application on serverless Lambda (Created by Kuriko IWAI)

The System Architecture: AI Pricing for Retailers

The system operates as a containerized, serverless microservice designed to provide optimal price recommendations to maximize retailer sales.

Its core intelligence comes from AI models trained on historical purchase data to predict the quantity of the product sold at various prices, allowing sellers to determine the best price.

For consistent deployment, the prediction logic and its dependencies are packaged into a Docker container image and stored in AWS ECR (Elastic Container Registry).

The prediction is then served by an AWS Lambda function, which retrieves and runs the container from ECR and exposes the result via AWS API Gateway for the Flask application to consume.

If you want to see how to build this from the ground up, you can follow along with my tutorial How to Build a Machine Learning System on Serverless Architecture.

The ML Lineage

In the system, GitHub handles the code lineage, while DVC captures the lineage of:

Data (blue boxes): ETL and preprocessing.
Experiments (light orange): Hyperparamters tuning and validation.
Models and Prediction (dark orange): Final model artifacts and prediction results.

DVC tracks the lineage through separate stages, from data extraction to fairness testing (yellow rows in Figure A).

For each stage, DVC uses an MD5 or SHA256 hash to track and push metadata like artifacts, metrics, and reports to its remote on AWS S3.

The pipeline incorporates Evently AI to handle data drift tests, which are essential for identifying shifts in data distributions that could compromise the model's generalization capabilities in production.

Only models that successfully pass both the data drift and fairness tests can serve predictions via the AWS API gateway (red box in Figure A).

Lastly, this entire lineage process is triggered weekly by the open-source workflow scheduler, Prefect.

Prefect prompts DVC to check for updates in data and scripts, and executes the full lineage process if changes are detected.

Workflow in Action

The building process involves five main steps:

Initiate a DVC project
Define the lineage stages with the DVC script dvc.yaml and corresponding Python script
Deploy the DVC project
Configure scheduled run with Prefect
Deploy the application

Let’s walk through each step together.

Step 1: Initiating a DVC Project

The first step is to initiate a DVC project:

$dvc init

This command automatically creates a .dvc directory at the root of the project folder:

.
.dvc/
│
└── cache/         # [.gitignore] store dvc caches (cached actual data files)
└── tmp/           # [.gitignore]
└── .gitignore     # gitignore cache, tmp, and config.local
└── config         # dvc config for production
└── config.local   # [.gitignore] dvc config for local

DVC maintains a fast, lightweight Git repository by separating the original data in large files from the repository.

The process involves caching the original data in the local .dvc/cache directory, creating a small .dvc metadata file which contains an MD5 hash and a link to the original data file path, pushing only the small metadata files to Git, and pushing the original data to the DVC remote.

Step 2: The ML Lineage

Next, we’ll configure the ML lineage with the following stages:

etl_pipeline: Extract, clean, impute the original data and perform feature engineering.
data_drift_check: Run data drift tests. If they fail, the system exits.
preprocess: Create training, validation, and test datasets.
tune_primary_model: Tune hyperparameters and train the model.
inference_primary_model: Perform inference on the test dataset.
assess_model_risk: Runs risk and fairness tests.

Each stage requires defining the DVC command and its corresponding Python script.

Let’s get started.

Stage 1: The ETL Pipeline

The first stage is to extract, clean, impute the original data, and perform feature engineering.

DVC Configuration

We’ll create the dvc.yaml file at the root of the project directory and add the etl_pipeline stage:

dvc.yaml

stages:
  etl_pipeline:
    # the main command dvc will run in this stage
    cmd: python src/data_handling/etl_pipeline.py

    # dependencies necessary to run the main command
    deps:
      - src/data_handling/etl_pipeline.py
      - src/data_handling/
      - src/_utils/

    # output paths for dvc to track
    outs:
      - data/original_df.parquet
      - data/processed_df.parquet

The dvc.yaml file defines a sequence of steps (stages) using sections like:

cmd: The shell command to be executed for that stage
deps: Dependencies that need to run the cmd
prams: Default parameters for the cmd defined in the params.yaml file
metrics: The metrics files to track
reports: The report files to track
plots: The DVC plot files for visualization
outs: The output files produced by the cmd, which DVC will track

The configuration helps DVC ensure reproducibility by explicitly listing dependencies, outputs, and the commands of each stage. It also helps it manage the lineage by establishing a Directed Acyclic Graph (DAG) of the workflow, linking each stage to the next.

Python Scripts

Next, let’s add Python scripts, ensuring the data is stored using the file paths specified in the outs section of the dvc.yaml file:

src/data_handling/etl_pipeline.py:

import os
import argparse

import src.data_handling.scripts as scripts
from src._utils import main_logger

def etl_pipeline():
    # extract the entire data
    df = scripts.extract_original_dataframe()

    # load perquet file
    ORIGINAL_DF_PATH = os.path.join('data', 'original_df.parquet')
    df.to_parquet(ORIGINAL_DF_PATH, index=False) # dvc tracked

    # transform
    df = scripts.structure_missing_values(df=df)
    df = scripts.handle_feature_engineering(df=df)

    PROCESSED_DF_PATH = os.path.join('data', 'processed_df.parquet')
    df.to_parquet(PROCESSED_DF_PATH, index=False) # dvc tracked
    return df

# for dvc execution
if __name__ == '__main__':  
    parser = argparse.ArgumentParser(description="run etl pipeline")
    parser.add_argument('--stockcode', type=str, default='', help="specific stockcode to process. empty runs full pipeline.")
    parser.add_argument('--impute', action='store_true', help="flag to create imputation values")
    args = parser.parse_args()

    etl_pipeline(stockcode=args.stockcode, impute_stockcode=args.impute)

Outputs

The original and structured data in Pandas’ DataFrames are stored in the DVC cache:

data/original_df.parquet
data/processed_df.parquet

Stage 2: The Data Drift Check

Before jumping into preprocessing, we’ll run data drift tests to ensure any notable drift is in the data. To do this, we’ll use EventlyAI, an open-source ML and LLM observability framework.

What is Data Drift?

Data drift refers to any changes in the statistical properties like the mean, variance, or distribution of the data that the model is trained on.

There are three main types of data drift:

Covariate Drift (Feature Drift): A change in the input feature distribution.
Prior Probability Drift (Label Drift): A change in the target variable distribution.
Concept Drift: A change in the relationship between the input data and the target variable.

Data drift compromises the model's generalization capabilities over time, making its detection after deployment crucial.

DVC Configuration

We’ll add the data_drift_check stage right after the etl_pipeline stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
     # the main command dvc will run in this stage
    cmd: >
      python src/data_handling/report_data_drift.py
      data/processed/processed_df.csv 
      data/processed_df_${params.stockcode}.parquet
      reports/data_drift_report_${params.stockcode}.html
      metrics/data_drift_${params.stockcode}.json
      ${params.stockcode}

    # default values to the parameters (defined in the param.yaml file)
    params:
      - params.stockcode

    # dependencies necessary to run the main command
    deps:
      - src/data_handling/report_data_drift.py
      - src/

    # output file pathes for dvc to track
    plots:
      - reports/data_drift_report_${params.stockcode}.html:

    metrics:
      - metrics/data_drift_${params.stockcode}.json:
          type: json

Then, add default values to the parameters passed to the DVC command:

params.yaml:

params:
  stockcode:  OF CHOICE>

Python Scripts

After generating an API token from the EventlyAI workplace, we’ll add a Python script to detect data drift and store the results in the metrics variable:

src/data_handling/report_data_drift.py:

import os
import sys
import json
import pandas as pd
import datetime
from dotenv import load_dotenv

from evidently import Dataset, DataDefinition, Report
from evidently.presets import DataDriftPreset
from evidently.ui.workspace import CloudWorkspace

import src.data_handling.scripts as scripts
from src._utils import main_logger


if __name__ == '__main__':
    # initiate evently cloud workspace
    load_dotenv(override=True)
    ws = CloudWorkspace(token=os.getenv('EVENTLY_API_TOKEN'), url='https://app.evidently.cloud')

    # retrieve evently project
    project = ws.get_project('EVENTLY AI PROJECT ID')

    # retrieve paths from the command line args
    REFERENCE_DATA_PATH = sys.argv[1]
    CURRENT_DATA_PATH = sys.argv[2]
    REPORT_OUTPUT_PATH = sys.argv[3]
    METRICS_OUTPUT_PATH = sys.argv[4]
    STOCKCODE = sys.argv[5]

    # create folders if not exist
    os.makedirs(os.path.dirname(REPORT_OUTPUT_PATH), exist_ok=True)
    os.makedirs(os.path.dirname(METRICS_OUTPUT_PATH), exist_ok=True)

    # extract datasets
    reference_data_full = pd.read_csv(REFERENCE_DATA_PATH)
    reference_data_stockcode = reference_data_full[reference_data_full['stockcode'] == STOCKCODE]
    current_data_stockcode = pd.read_parquet(CURRENT_DATA_PATH)

    # define data schema
    nums, cats = scripts.categorize_num_cat_cols(df=reference_data_stockcode)
    for col in nums: current_data_stockcode[col] = pd.to_numeric(current_data_stockcode[col], errors='coerce')

    schema = DataDefinition(numerical_columns=nums, categorical_columns=cats)

    # define evently dataset w/ the data schema
    eval_data_1 = Dataset.from_pandas(reference_data_stockcode, data_definition=schema)
    eval_data_2 = Dataset.from_pandas(current_data_stockcode, data_definition=schema)

    # execute drift detection
    report = Report(metrics=[DataDriftPreset()])
    data_eval = report.run(reference_data=eval_data_1, current_data=eval_data_2)
    data_eval.save_html(REPORT_OUTPUT_PATH)

    # create metrics for dvc tracking
    report_dict = json.loads(data_eval.json())
    num_drifts = report_dict['metrics'][0]['value']['count']
    shared_drifts = report_dict['metrics'][0]['value']['share']
    metrics = dict(
        drift_detected=bool(num_drifts > 0.0), num_drifts=num_drifts, shared_drifts=shared_drifts,
        num_cols=nums,
        cat_cols=cats,
        stockcode=STOCKCODE,
        timestamp=datetime.datetime.now().isoformat(),
    )

    # load metrics file
    with open(METRICS_OUTPUT_PATH, 'w') as f:
        json.dump(metrics, f, indent=4)
        main_logger.info(f'... drift metrics saved to {METRICS_OUTPUT_PATH}... ')

    # stop the system if data drift is found
    if num_drifts > 0.0: sys.exit('❌ FATAL: data drift detected. stopping pipeline')

If data drift is found, the script immediately exits using the final sys.exit command.

Outputs

The script generates two files that DVC will track:

reports/data_drift_report.html: The data drift report in a HTML file.
metrics/data_drift.json: The data drift metics in a JSON file including drift results along with feature columns and a timestamp:

metrics/data_drift.json:

{
    "drift_detected": false,
    "num_drifts": 0.0,
    "shared_drifts": 0.0,
    "num_cols": [
        "invoiceno",
        "invoicedate",
        "unitprice",
        "product_avg_quantity_last_month",
        "product_max_price_all_time",
        "unitprice_vs_max",
        "unitprice_to_avg",
        "unitprice_squared",
        "unitprice_log"
    ],
    "cat_cols": [
        "stockcode",
        "customerid",
        "country",
        "year",
        "year_month",
        "day_of_week",
        "is_registered"
    ],
    "timestamp": "2025-10-07T00:24:29.899495"
}

The drift test results are also available on the Evently workplace dashboard for further analysis:

Figure B. Screenshot of the Evently workspace dashboard

Stage 3: Preprocessing

If no data drift is detected, the linage moves onto the preprocessing stage.

DVC Configuration

We’ll add the preprocess stage right after the data_drift_check stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    cmd: >
      python src/data_handling/preprocess.py --target_col ${params.target_col} --should_scale ${params.should_scale} --verbose ${params.verbose}

    deps:
      - src/data_handling/preprocess.py
      - src/data_handling/
      - src/_utils

    # params from params.yaml
    params:
      - params.target_col
      - params.should_scale
      - params.verbose

    outs:
      # train, val, test datasets
      - data/x_train_df.parquet
      - data/x_val_df.parquet
      - data/x_test_df.parquet
      - data/y_train_df.parquet
      - data/y_val_df.parquet
      - data/y_test_df.parquet

      # preprocessed input datasets
      - data/x_train_processed.parquet
      - data/x_val_processed.parquet
      - data/x_test_processed.parquet

      # trained preprocessor and human readable feature names for shap analysis
      - preprocessors/column_transformer.pkl
      - preprocessors/feature_names.json

And then add default values of the parameters used in the cmd:

params.yaml:

params:
  target_col: "quantity"
  should_scale: True
  verbose: False

Python Scripts

Next, we’ll add a Python script to create training, validation, and test datasets and preprocess input data:

import os
import argparse
import json
import joblib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import src.data_handling.scripts as scripts
from src._utils import main_logger

def preprocess(stockcode: str = '', target_col: str = 'quantity', should_scale: bool = True, verbose: bool = False):
    # initiate metrics to track (dvc)
    DATA_DRIFT_METRICS_PATH = os.path.join('metrics', f'data_drift_{args.stockcode}.json')

    if os.path.exists(DATA_DRIFT_METRICS_PATH):
        with open(DATA_DRIFT_METRICS_PATH, 'r') as f:
            metrics = json.load(f)
    else: metrics = dict()

    # load processed df from dvc cache
    PROCESSED_DF_PATH = os.path.join('data', 'processed_df.parquet')
    df = pd.read_parquet(PROCESSED_DF_PATH)

    # categorize num and cat columns
    num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
    if verbose: main_logger.info(f'num_cols: {num_cols} \ncat_cols: {cat_cols}')

    # structure cat cols
    if cat_cols:
        for col in cat_cols: df[col] = df[col].astype('string')

    # initiate preprocessor (either load from the dvc cache or create from scratch)
    PREPROCESSOR_PATH = os.path.join('preprocessors', 'column_transformer.pkl')
    try:
        preprocessor = joblib.load(PREPROCESSOR_PATH)
    except:
        preprocessor = scripts.create_preprocessor(num_cols=num_cols if should_scale else [], cat_cols=cat_cols)

    # creates train, val, test datasets
    y = df[target_col]
    X = df.copy().drop(target_col, axis='columns')

    # split
    test_size, random_state = 50000, 42
    X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, shuffle=False)
    X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, random_state=random_state, shuffle=False)

    # store train, val, test datasets (dvc track)
    X_train.to_parquet('data/x_train_df.parquet', index=False)
    X_val.to_parquet('data/x_val_df.parquet', index=False)
    X_test.to_parquet('data/x_test_df.parquet', index=False)
    y_train.to_frame(name=target_col).to_parquet('data/y_train_df.parquet', index=False)
    y_val.to_frame(name=target_col).to_parquet('data/y_val_df.parquet', index=False)
    y_test.to_frame(name=target_col).to_parquet('data/y_test_df.parquet', index=False)

    # preprocess
    X_train = preprocessor.fit_transform(X_train)
    X_val = preprocessor.transform(X_val)
    X_test = preprocessor.transform(X_test)

    # store preprocessed input data (dvc track)
    pd.DataFrame(X_train).to_parquet(f'data/x_train_processed.parquet', index=False)
    pd.DataFrame(X_val).to_parquet(f'data/x_val_processed.parquet', index=False)
    pd.DataFrame(X_test).to_parquet(f'data/x_test_processed.parquet', index=False)

    # save feature names (dvc track) for shap
    with open('preprocessors/feature_names.json', 'w') as f:
        feature_names = preprocessor.get_feature_names_out()
        json.dump(feature_names.tolist(), f)

    return  X_train, X_val, X_test, y_train, y_val, y_test, preprocessor


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='run data preprocessing')
    parser.add_argument('--stockcode', type=str, default='', help='specific stockcode')
    parser.add_argument('--target_col', type=str, default='quantity', help='the target column name')
    parser.add_argument('--should_scale', type=bool, default=True, help='flag to scale numerical features')
    parser.add_argument('--verbose', type=bool, default=False, help='flag for verbose logging')
    args = parser.parse_args()

    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = preprocess(
        target_col=args.target_col,
        should_scale=args.should_scale,
        verbose=args.verbose,
        stockcode=args.stockcode,
    )

Outputs

This stage generates the necessary datasets for both model training and inference:

Input features:

data/x_train_df.parquet
data/x_val_df.parquet
data/x_test_df.parquet

Preprocessed input features:

data/x_train_processed_df.parquet
data/x_val_processed_df.parquet
data/x_test_processed_df.parquet

Target variables:

data/y_train_df.parquet
data/y_val_df.parquet
data/y_test_df.parquet

The preprocessor and human-readable feature names are also stored in cache for inference and SHAP feature impact analysis later:

preprocessors/column_transformer.pk
preprocessors/feature_names.json

Lastly, DVC adds the preprocess_status , x_train_processed_path, and preprocessor_path to the data summary metrics file data.json created in Step 2 to track the end-to-end process of Steps 2 and 3:

metrics/data.json:

{
    "drift_detected": false,
    "num_drifts": 0.0,
    "shared_drifts": 0.0,
    "num_cols": [
        "invoiceno",
        "invoicedate",
        "unitprice",
        "product_avg_quantity_last_month",
        "product_max_price_all_time",
        "unitprice_vs_max",
        "unitprice_to_avg",
        "unitprice_squared",
        "unitprice_log"
    ],
    "cat_cols": [
        "stockcode",
        "customerid",
        "country",
        "year",
        "year_month",
        "day_of_week",
        "is_registered"
    ],
    "timestamp": "2025-10-07T00:24:29.899495",

    # updates
    "preprocess_status": "completed",
    "x_train_processed_path": "data/x_train_processed_85123A.parquet",
    "preprocessor_path": "preprocessors/column_transformer.pkl"
}

Next, let’s move onto the model/experiment lineage.

Stage 4: Tuning the Model

Now that we’ve created the datasets, we’ll tune and train the primary model. It’s a multi-layered feedforward network on PyTorch, using training and validation datasets created in the preprocess stage.

DVC Configuration

First, we’ll add the tuning_primary_model stage right after the preprocess stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    ### 
  tune_primary_model:
    cmd: >
      python src/model/torch_model/main.py
      data/x_train_processed_${params.stockcode}.parquet
      data/x_val_processed_${params.stockcode}.parquet
      data/y_train_df_${params.stockcode}.parquet
      data/y_val_df_${params.stockcode}.parquet
      ${tuning.should_local_save}
      ${tuning.grid}
      ${tuning.n_trials}
      ${tuning.num_epochs}
      ${params.stockcode}

    deps:
      - src/model/torch_model/main.py
      - src/data_handling/
      - src/model/
      - src/_utils/

    params:
      - params.stockcode
      - tuning.n_trials
      - tuning.grid
      - tuning.should_local_save

    outs:
      - models/production/dfn_best_${params.stockcode}.pth # dvc track

    metrics:
      - metrics/dfn_val_${params.stockcode}.json: # dvc track

Then we’ll add default values to the parameters:

params.yaml:

params:
  target_col: "quantity"
  should_scale: True
  verbose: False

tuning:
  n_trials: 100
  num_epochs: 3000
  should_local_save: False
  grid: False

Python Scripts

Next, we’ll add the Python scripts to tune the model using Bayesian optimization and then train the optimal model on the complete X_train and y_train datasets created in the preprocess stage.

src/model/torch_model/main.py:

import os
import sys
import json
import datetime
import pandas as pd
import torch
import torch.nn as nn

import src.model.torch_model.scripts as scripts


def tune_and_train(
        X_train, X_val, y_train, y_val,
        stockcode: str = '',
        should_local_save: bool = True,
        grid: bool = False,
        n_trials: int = 50,
        num_epochs: int = 3000
    ) -> tuple[nn.Module, dict]:

    # perform bayesian optimization
    best_dfn, best_optimizer, best_batch_size, best_checkpoint = scripts.bayesian_optimization(
        X_train, X_val, y_train, y_val, n_trials=n_trials, num_epochs=num_epochs
    )

    # save the model artifact (dvc track)
    DFN_FILE_PATH = os.path.join('models', 'production', f'dfn_best_{stockcode}.pth' if stockcode else 'dfn_best.pth')
    os.makedirs(os.path.dirname(DFN_FILE_PATH), exist_ok=True)
    torch.save(best_checkpoint, DFN_FILE_PATH)

    return best_dfn, best_checkpoint



def track_metrics_by_stockcode(X_val, y_val, best_model, checkpoint: dict, stockcode: str):
    MODEL_VAL_METRICS_PATH = os.path.join('metrics', f'dfn_val_{stockcode}.json')
    os.makedirs(os.path.dirname(MODEL_VAL_METRICS_PATH), exist_ok=True)

    # validate the tuned model
    _, mse, exp_mae, rmsle = scripts.perform_inference(model=best_model, X=X_val, y=y_val)
    model_version = f"dfn_{stockcode}_{os.getpid()}"
    metrics = dict(
        stockcode=stockcode,
        mse_val=mse,
        mae_val=exp_mae,
        rmsle_val=rmsle,
        model_version=model_version,
        hparams=checkpoint['hparams'],
        optimizer=checkpoint['optimizer_name'],
        batch_size=checkpoint['batch_size'],
        lr=checkpoint['lr'],
        timestamp=datetime.datetime.now().isoformat()
    )
    # store the validation results (dvc track)
    with open(MODEL_VAL_METRICS_PATH, 'w') as f:
        json.dump(metrics, f, indent=4)
        main_logger.info(f'... validation metrics saved to {MODEL_VAL_METRICS_PATH} ...')


if __name__ == '__main__':
    # fetch command arg values
    X_TRAIN_PATH = sys.argv[1]
    X_VAL_PATH = sys.argv[2]
    Y_TRAIN_PATH = sys.argv[3]
    Y_VAL_PATH = sys.argv[4]
    SHOULD_LOCAL_SAVE = sys.argv[5] == 'True'
    GRID = sys.argv[6] == 'True'
    N_TRIALS = int(sys.argv[7])
    NUM_EPOCHS = int(sys.argv[8])
    STOCKCODE = str(sys.argv[9])

    # extract training and validation datasets from dvc cache
    X_train, X_val = pd.read_parquet(X_TRAIN_PATH), pd.read_parquet(X_VAL_PATH)
    y_train, y_val = pd.read_parquet(Y_TRAIN_PATH), pd.read_parquet(Y_VAL_PATH)

    # tuning
    best_model, checkpoint = tune_and_train(
        X_train, X_val, y_train, y_val,
        stockcode=STOCKCODE, should_local_save=SHOULD_LOCAL_SAVE, grid=GRID, n_trials=N_TRIALS, num_epochs=NUM_EPOCHS
    )

    # metrics tracking
    track_metrics_by_stockcode(X_val, y_val, best_model=best_model, checkpoint=checkpoint, stockcode=STOCKCODE)

Outputs

The stage generates two files:

models/production/dfn_best.pth: Includes model artifacts and checkpoint like the optimal hyperparameter set.
metrics/dfn_val.json: Contains tuning results, model version, timestamp, and validation results for MSE, MAE, and RMSLE:

metrics/dfn_val.json:

{
    "stockcode": "85123A",
    "mse_val": 0.6137686967849731,
    "mae_val": 9.092489242553711,
    "rmsle_val": 0.6953299045562744,
    "model_version": "dfn_85123A_35604",
    "hparams": {
        "num_layers": 4,
        "batch_norm": false,
        "dropout_rate_layer_0": 0.13765888061300502,
        "n_units_layer_0": 184,
        "dropout_rate_layer_1": 0.5509872409359128,
        "n_units_layer_1": 122,
        "dropout_rate_layer_2": 0.2408753527744403,
        "n_units_layer_2": 35,
        "dropout_rate_layer_3": 0.03451842588822594,
        "n_units_layer_3": 224,
        "learning_rate": 0.026240673135104406,
        "optimizer": "adamax",
        "batch_size": 64
    },
    "optimizer": "adamax",
    "batch_size": 64,
    "lr": 0.026240673135104406,
    "timestamp": "2025-10-07T00:31:08.700294"
}

Stage 5: Performing Inference

After the model tuning phase is complete, we’ll configure the test inference for a final evaluation.

The final evaluation uses the MSE, MAE, and RMSLE metrics, as well as SHAP for feature impact and interpretability analysis.

SHAP (SHapley Additive exPlanations) is a framework for quantifying how much each feature contributes to a model’s prediction by using the concept of Shapley values from game theory.

The SHAP values are leveraged for future EDA and feature engineering.

DVC Configuration

First, we’ll add the inference_primary_model stage to the DVC configuration.

This stage has the plots section where DVC will track and version the generated visualization files on the SHAP values.

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    ### 
  tune_primary_model:
    ### 
  inference_primary_model:
    cmd: >
      python src/model/torch_model/inference.py
      data/x_test_processed_${params.stockcode}.parquet
      data/y_test_df_${params.stockcode}.parquet
      models/production/dfn_best_${params.stockcode}.pth
      ${params.stockcode}
      ${tracking.sensitive_feature_col}
      ${tracking.privileged_group}

    deps:
      - src/model/torch_model/inference.py
      - models/production/
      - src/

    params:
      - params.stockcode
      - tracking.sensitive_feature_col
      - tracking.privileged_group

    metrics:
      - metrics/dfn_inf_${params.stockcode}.json: # dvc track
          type: json

    plots:
      # shap summary / beeswarm plot for global interpretability
      - reports/dfn_shap_summary_${params.stockcode}.json:
          template: simple
          x: shap_value
          y: feature_name
          title: SHAP Beeswarm Plot

      # shap mean absolute vals - feature importance bar plot
      - reports/dfn_shap_mean_abs_${params.stockcode}.json:
          template: bar
          x: mean_abs_shap
          y: feature_name
          title: Mean Absolute SHAP Importance

    outs:
      - data/dfn_inference_results_${params.stockcode}.parquet
      - reports/dfn_raw_shap_values_${params.stockcode}.parquet # save raw shap vals for detailed analysis later

Python Scripts

Next, we’ll add scripts where the trained model performs inference:

src/model/torch_model/inference.py:

import os
import sys
import json
import datetime
import numpy as np
import pandas as pd
import torch
import shap

import src.model.torch_model.scripts as scripts
from src._utils import main_logger


if __name__ == '__main__':
    # load test dataset
    X_TEST_PATH = sys.argv[1]
    Y_TEST_PATH = sys.argv[2]
    X_test, y_test = pd.read_parquet(X_TEST_PATH), pd.read_parquet(Y_TEST_PATH)

    # create X_test w/ column names for shap analysis and sensitive feature tracking
    X_test_with_col_names = X_test.copy()
    FEATURE_NAMES_PATH = os.path.join('preprocessors', 'feature_names.json')
    try:
        with open(FEATURE_NAMES_PATH, 'r') as f: feature_names = json.load(f)
    except FileNotFoundError: feature_names = X_test.columns.tolist()
    if len(X_test_with_col_names.columns) == len(feature_names): X_test_with_col_names.columns = feature_names

    # reconstruct the optimal model tuned in the previous stage
    MODEL_PATH = sys.argv[3]
    checkpoint = torch.load(MODEL_PATH)
    model = scripts.load_model(checkpoint=checkpoint)

    # perform inference
    y_pred, mse, exp_mae, rmsle = scripts.perform_inference(model=model, X=X_test, y=y_test, batch_size=checkpoint['batch_size'])

    # create result df w/ y_pred, y_true, and sensitive features
    STOCKCODE = sys.argv[4]
    SENSITIVE_FEATURE = sys.argv[5]
    PRIVILEGED_GROUP = sys.argv[6]
    inference_df = pd.DataFrame(y_pred.cpu().numpy().flatten(), columns=['y_pred'])
    inference_df['y_true'] = y_test
    inference_df[SENSITIVE_FEATURE] = X_test_with_col_names[f'cat__{SENSITIVE_FEATURE}_{str(PRIVILEGED_GROUP)}'].astype(bool)
    inference_df.to_parquet(path=os.path.join('data', f'dfn_inference_results_{STOCKCODE}.parquet'))

    # record inference metrics
    MODEL_INF_METRICS_PATH = os.path.join('metrics', f'dfn_inf_{STOCKCODE}.json')
    os.makedirs(os.path.dirname(MODEL_INF_METRICS_PATH), exist_ok=True)
    model_version = f"dfn_{STOCKCODE}_{os.getpid()}"
    inf_metrics = dict(
        stockcode=STOCKCODE,
        mse_inf=mse,
        mae_inf=exp_mae,
        rmsle_inf=rmsle,
        model_version=model_version,
        hparams=checkpoint['hparams'],
        optimizer=checkpoint['optimizer_name'],
        batch_size=checkpoint['batch_size'],
        lr=checkpoint['lr'],
        timestamp=datetime.datetime.now().isoformat()
    )
    with open(MODEL_INF_METRICS_PATH, 'w') as f: # dvc track
        json.dump(inf_metrics, f, indent=4)
        main_logger.info(f'... inference metrics saved to {MODEL_INF_METRICS_PATH} ...')


    ## shap analysis
    # compute shap vals
    model.eval()

    # prepare backgdound data
    X_test_tensor = torch.from_numpy(X_test.values.astype(np.float32)).to(device_type)

    # take the small samples from x_test as background
    background = X_test_tensor[np.random.choice(X_test_tensor.shape[0], 100, replace=False)].to(device_type)

    # define deepexplainer
    explainer = shap.DeepExplainer(model, background)

    # compute shap vals
    shap_values = explainer.shap_values(X_test_tensor) # outputs = numpy array or tensor

    # convert shap array to pandas df
    if isinstance(shap_values, list): shap_values = shap_values[0]
    if isinstance(shap_values, torch.Tensor): shap_values = shap_values.cpu().numpy()
    shap_values = shap_values.squeeze(axis=-1) # type: ignore
    shap_df = pd.DataFrame(shap_values, columns=feature_names)

    # shap raw data (dvc track)
    RAW_SHAP_OUT_PATH = os.path.join('reports', f'dfn_raw_shap_values_{STOCKCODE}.parquet')
    os.makedirs(os.path.dirname(RAW_SHAP_OUT_PATH), exist_ok=True)
    shap_df.to_parquet(RAW_SHAP_OUT_PATH, index=False)
    main_logger.info(f'... shap values saved to {RAW_SHAP_OUT_PATH} ...')

    # bar plot of mean abs shap vals (dvc report)
    mean_abs_shap = shap_df.abs().mean().sort_values(ascending=False)
    shap_mean_abs_df = pd.DataFrame({'feature_name': feature_names, 'mean_abs_shap': mean_abs_shap.values })
    MEAN_ABS_SHAP_PATH = os.path.join('reports', f'dfn_shap_mean_abs_{STOCKCODE}.json')
    shap_mean_abs_df.to_json(MEAN_ABS_SHAP_PATH, orient='records', indent=4)

Outputs

This stage generates five output files:

data/dfn_inference_result_${params_stockcode}.parquet: Stores prediction results, labeled targets, and any columns with sensitive features like gender, age, income, and more. I’ll use this file for the fairness test in the last stage.
metrics/dfn_inf.json: Stores evaluation metrics and tuning results:

{
    "stockcode": "85123A",
    "mse_inf": 0.6841545701026917,
    "mae_inf": 11.5866117477417,
    "rmsle_inf": 0.7423332333564758,
    "model_version": "dfn_85123A_35834",
    "hparams": {
        "num_layers": 4,
        "batch_norm": false,
        "dropout_rate_layer_0": 0.13765888061300502,
        "n_units_layer_0": 184,
        "dropout_rate_layer_1": 0.5509872409359128,
        "n_units_layer_1": 122,
        "dropout_rate_layer_2": 0.2408753527744403,
        "n_units_layer_2": 35,
        "dropout_rate_layer_3": 0.03451842588822594,
        "n_units_layer_3": 224,
        "learning_rate": 0.026240673135104406,
        "optimizer": "adamax",
        "batch_size": 64
    },
    "optimizer": "adamax",
    "batch_size": 64,
    "lr": 0.026240673135104406,
    "timestamp": "2025-10-07T00:31:12.946405"
}

reports/dfn_shap_mean_abs.json: Stores the mean SHAP values:

[
    {
        "feature_name":"num__invoicedate",
        "mean_abs_shap":0.219255722
    },
    {
        "feature_name":"num__unitprice",
        "mean_abs_shap":0.1069829418
    },
    {
        "feature_name":"num__product_avg_quantity_last_month",
        "mean_abs_shap":0.1021453096
    },
    {
        "feature_name":"num__product_max_price_all_time",
        "mean_abs_shap":0.0855356899
    },
...
]

reports/dfn_shap_summary.json: Contains the data points necessary to draw the beeswarm/bar plots.
reports/dfn_raw_shap_values.parquet: Stores raw SHAP values.

Stage 6: Assessing Model Risk and Fairness

The last stage is to assess risk and fairness of the final inference results.

The Fairness Testing

Fairness testing in ML is the process of systematically evaluating a model’s predictions to ensure they are not unfairly biased toward specific groups defined by sensitive attributes like race and gender.

In this project, we’ll use the registration status is_registered column as a sensitive feature and make sure the Mean Outcome Difference (MOD) is within the specified threshold of 0.1.

The MOD is calculated as the absolute difference between the mean prediction values of the privileged (registered) and unprivileged (unregistered) groups.

DVC Configuration

First, we’ll add the assess_model_risk stage right after the inference_primary_model stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    ### 
  tune_primary_model:
    ### 
  inference_primary_model:
    ###
  assess_model_risk:
    cmd: >
      python src/model/torch_model/assess_risk_and_fairness.py
      data/dfn_inference_results_${params.stockcode}.parquet
      metrics/dfn_risk_fairness_${params.stockcode}.json
      ${tracking.sensitive_feature_col}
      ${params.stockcode}
      ${tracking.privileged_group}
      ${tracking.mod_threshold}

    deps:
      - src/model/torch_model/assess_risk_and_fairness.py
      - src/_utils/
      - data/dfn_inference_results_${params.stockcode}.parquet # ensure the result df as dependency

    params:
      - params.stockcode
      - tracking.sensitive_feature_col
      - tracking.privileged_group
      - tracking.mod_threshold

    metrics:
      - metrics/dfn_risk_fairness_${params.stockcode}.json:
          type: json

Then we’ll add default values to the parameters:

param.yaml:

params:
  target_col: "quantity"
  should_scale: True
  verbose: False

tuning:
  n_trials: 100
  num_epochs: 3000
  should_local_save: False
  grid: False

# adding default values to the tracking metrics
tracking:
  sensitive_feature_col: "is_registered"
  privileged_group: 1 # member
  mod_threshold: 0.1

Python Script

The corresponding Python script contains the calculate_fairness_metrics function which performs the risk and fairness assessment:

src/model/torch_model/assess_risk_and_fairness.py:

import os
import json
import datetime
import argparse
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_log_error

from src._utils import main_logger


def calculate_fairness_metrics(
        df: pd.DataFrame,
        sensitive_feature_col: str,
        label_col: str = 'y_true',
        prediction_col: str = 'y_pred',
        privileged_group: int = 1,
        mod_threshold: float = 0.1,
    ) -> dict:

    metrics = dict()
    unprivileged_group = 0 if privileged_group == 1 else 1

    ## 1. risk assessment - predictive performance metrics by group
    for group, name in zip([unprivileged_group, privileged_group], ['unprivileged', 'privileged']):
        subset = df[df[sensitive_feature_col] == group]
        if len(subset) == 0: continue

        y_true = subset[label_col].values
        y_pred = subset[prediction_col].values

        metrics[f'mse_{name}'] = float(mean_squared_error(y_true, y_pred)) # type: ignore
        metrics[f'mae_{name}'] = float(mean_absolute_error(y_true, y_pred)) # type: ignore
        metrics[f'rmsle_{name}'] = float(root_mean_squared_log_error(y_true, y_pred)) # type: ignore

        # mean prediction (outcome disparity component)
        metrics[f'mean_prediction_{name}'] = float(y_pred.mean()) # type: ignore

    ## 2. bias assessment - fairness metrics
    # absolute mean error difference
    mae_diff = metrics.get('mae_unprivileged', 0) - metrics.get('mae_privileged', 0)
    metrics['mae_diff'] = float(mae_diff)

    # mean outcome difference
    mod = metrics.get('mean_prediction_unprivileged', 0) - metrics.get('mean_prediction_privileged', 0)
    metrics['mean_outcome_difference'] = float(mod)
    metrics['is_mod_acceptable'] = 1 if abs(mod) <= mod_threshold else 0

    return metrics


def main():
    parser = argparse.ArgumentParser(description='assess bias and fairness metrics on model inference results.')
    parser.add_argument('inference_file_path', type=str, help='parquet file path to the inference results w/ y_true, y_pred, and sensitive feature cols.')
    parser.add_argument('metrics_output_path', type=str, help='json file path to save the metrics output.')
    parser.add_argument('sensitive_feature_col', type=str, help='column name of sensitive features')
    parser.add_argument('stockcode', type=str)
    parser.add_argument('privileged_group', type=int, default=1)
    parser.add_argument('mod_threshold', type=float, default=.1)
    args = parser.parse_args()

    try:
        # load inf df
        df_inference = pd.read_parquet(args.inference_file_path)
        LABEL_COL = 'y_true'
        PREDICTION_COL = 'y_pred'
        SENSITIVE_COL = args.sensitive_feature_col

        # compute fairness metrics
        metrics = calculate_fairness_metrics(
            df=df_inference,
            sensitive_feature_col=SENSITIVE_COL,
            label_col=LABEL_COL,
            prediction_col=PREDICTION_COL,
            privileged_group=args.privileged_group,
            mod_threshold=args.mod_threshold,
        )

        # add items to metrics
        metrics['model_version'] = f'dfn_{args.stockcode}_{os.getpid()}'
        metrics['sensitive_feature'] = args.sensitive_feature_col
        metrics['privileged_group'] = args.privileged_group
        metrics['mod_threshold'] = args.mod_threshold
        metrics['stockcode'] = args.stockcode
        metrics['timestamp'] = datetime.datetime.now().isoformat()

        # load metrics (dvc track)
        with open(args.metrics_output_path, 'w') as f:
            json_metrics = { k: (v if pd.notna(v) else None) for k, v in metrics.items() }
            json.dump(json_metrics, f, indent=4)

    except Exception as e:
        main_logger.error(f'... an error occurred during risk and fairness assessment: {e} ...')
        exit(1)

if __name__ == '__main__':
    main()

Outputs

The final stage generates a metrics file which contains test results and model version:

metrics/dfn_risk_fairness.json:

{
    "mse_unprivileged": 3.5370739412593575,
    "mae_unprivileged": 1.48263614013523,
    "rmsle_unprivileged": 0.6080000224747837,
    "mean_prediction_unprivileged": 1.8507767915725708,
    "mae_diff": 1.48263614013523,
    "mean_outcome_difference": 1.8507767915725708,
    "is_mod_acceptable": 1,
    "model_version": "dfn_85123A_35971",
    "sensitive_feature": "is_registered",
    "privileged_group": 1,
    "mod_threshold": 0.1,
    "timestamp": "2025-10-07T00:31:15.998590"
}

That’s all for the lineage configuration. Now, we’ll test it in local.

Test in Local

We’ll run the entire ML lineage with this command:

$dvc repro -f

-f forces DVC to rerun all the stages with or without any updates.

The command will automatically create the dvc.lock file at the root of the project directory:

schema: '2.0'
stages:
  etl_pipeline_full:
    cmd: python src/data_handling/etl_pipeline.py
    deps:
    - path: src/_utils/
      hash: md5
      md5: ae41392532188d290395495f6827ed00.dir
      size: 15870
      nfiles: 10
    - path: src/data_handling/
      hash: md5
      md5: a8a61a4b270581a7c387d51e416f4e86.dir
      size: 95715
...

The dvc.lock file must be published in Git to make sure DVC will load the latest files:

$git add dvc.lock .dvc dvc.yaml params.yaml
$git commit -m'updated dvc config'
$git push

Step 3: Deploying the DVC Project

Next, we’ll deploy the DVC project to ensure the AWS Lambda function can access the cached files in production.

We’ll start by configuring the DVC remote where the cached files are stored.

DVC offers various storage types like AWS S3 and Google Cloud. We’ll use AWS S3 for this project but your choice depend on the project ecosystem, your familiarity with the tool, and any resource constraints.

First, we’ll create a new S3 bucket in the selected AWS region:

$aws s3 mb s3:///  --region

Make sure the IAM role has the following permissions: s3:ListBucket, s3:GetObject, s3:PutObject, and s3:DeleteObject.

Then, add theURI of the S3 bucket to the DVC remote:

$dvc remote add -d  ss3:///

Next, push the cache files to the DVC remote:

$dvc push

Now, all cache files are stored in the S3 bucket:

Figure C. Screenshot of the DVC remote in AWS S3 bucket

As shown in Figure A, this deployment step is necessary for the AWS Lambda function to access the DVC cache in production.

Step 4: Configuring Scheduled Run with Prefect

The next step is to configure the scheduled run of the entire lineage with Prefect.

Prefect is an open-source workflow orchestration tool for building, scheduling, and monitoring pipelines. It uses a concept called a work pool to effectively decouple the orchestration logic from the execution infrastructure.

Then, the work pool serves as a standardized base configuration by running a Docker container image to guarantee a consistent execution environment for all flows.

Configuring the Docker Image Registry

The first step is to configure the Docker image registry for the Prefect work pool:

For local deployment: A container registry in the Docker Hub.
For production deployment: AWS ECR.

For local deployment, we’ll first authenticate the Docker client:

$docker login

And grant a user permission to run Docker commands without sudo:

$sudo dscl . -append /Groups/docker GroupMembership $USER

For production deployment, we’ll create a new ECR:

$aws ecr create-repository --repository-name  --region

(Make sure the IAM role has access to this new ECR URI.)

Configure Prefect Tasks and Flows

Next, we’ll configure the Prefect task and flow in the project:

The Prefect task executes the dvc repro and dvc push commands
The Prefect flow weekly executes the Prefect task.

src/prefect_flows.py:

import os
import sys
import subprocess
from datetime import timedelta, datetime
from dotenv import load_dotenv
from prefect import flow, task
from prefect.schedules import Schedule
from prefect_aws import AwsCredentials

from src._utils import main_logger

# add project root to the python path - enabling prefect to find the script
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

# define the prefect task
@task(retries=3, retry_delay_seconds=30)
def run_dvc_pipeline():
    # execute the dvc pipeline 
    result = subprocess.run(["dvc", "repro"], capture_output=True, text=True, check=True)

    # push the updated data
    subprocess.run(["dvc", "push"], check=True)


# define the prefect flow
@flow(name="Weekly Data Pipeline")
def weekly_data_flow():
    run_dvc_pipeline()

if __name__ == '__main__':
    # docker image registry (either docker hub or aws ecr)
    load_dotenv(override=True)
    ENV = os.getenv('ENV', 'production')
    DOCKER_HUB_REPO = os.getenv('DOCKER_HUB_REPO')
    ECR_FOR_PREFECT_PATH = os.getenv('S3_BUCKET_FOR_PREFECT_PATH')
    image_repo = f'{DOCKER_HUB_REPO}:ml-sales-pred-data-latest' if ENV == 'local' else f'{ECR_FOR_PREFECT_PATH}:latest'

    # define weekly schedule
    weekly_schedule = Schedule(
        interval=timedelta(weeks=1),
        anchor_date=datetime(2025, 9, 29, 9, 0, 0),
        active=True,
    )

    # aws credentials to access ecr
    AwsCredentials(
        aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
        aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
        region_name=os.getenv('AWS_REGION_NAME'),
    ).save('aws', overwrite=True)

    # deploy the prefect flow
    weekly_data_flow.deploy(
        name='weekly-data-flow',
        schedule=weekly_schedule, # schedule
        work_pool_name="wp-ml-sales-pred", # work pool where the docker image (flow) runs
        image=image_repo, # create a docker image at docker hub (local) or ecr (production)
        concurrency_limit=3,
        push=True # push the docker image to the image_repo
    )

Test in Local

Next, we’ll test the workflow locally with the Prefect server:

$uv run prefect server start

$export PREFECT_API_URL="http://127.0.0.1:4200/api"

Run the prefect_flows.py script:

$uv run src/prefect_flows.py

Upon the successful execution, the Prefect dashboard indicates the workflow is scheduled to run:

Figure D. As screenshot of the Prefect dashboard

Step 5: Deploying the Application

The final step is to deploy the entire application as a containerized Lambda by configuring the Dockerfile and the Flask application scripts.

The specific process in this final deployment step depends on the infrastructure.

But the common point is that DVC eliminates the need to store the large Parquet or CSV files directly in the feature store or model store because it caches them as lightweight hashed files.

So, first, we’ll simplify the loading logic of the Flask application script by using the dvc.api framework:

app.py:

### ... the rest components remain the same  ...

import dvc.api

DVC_REMOTE_NAME=


def configure_dvc_for_lambda():
    # set dvc directories to /tmp
    os.environ.update({
        'DVC_CACHE_DIR': '/tmp/dvc-cache',
        'DVC_DATA_DIR': '/tmp/dvc-data',
        'DVC_CONFIG_DIR': '/tmp/dvc-config',
        'DVC_GLOBAL_CONFIG_DIR': '/tmp/dvc-global-config',
        'DVC_SITE_CACHE_DIR': '/tmp/dvc-site-cache'
    })
    for dir_path in ['/tmp/dvc-cache', '/tmp/dvc-data', '/tmp/dvc-config']:
        os.makedirs(dir_path, exist_ok=True)


def load_x_test():
    global X_test
    if not os.environ.get('PYTEST_RUN', False):
        main_logger.info("... loading x_test ...")

        # config dvc directories
        configure_dvc_for_lambda()
        try:
            with dvc.api.open(X_TEST_PATH, remote=DVC_REMOTE_NAME, mode='rb') as fd:
                X_test = pd.read_parquet(fd)
                main_logger.info('✅ successfully loaded x_test via dvc api')
        except Exception as e:
            main_logger.error(f'❌ general loading error: {e}', exc_info=True)


def load_preprocessor():
    global preprocessor
    if not os.environ.get('PYTEST_RUN', False):
        main_logger.info("... loading preprocessor ...")
        configure_dvc_for_lambda()
        try:
            with dvc.api.open(PREPROCESSOR_PATH, remote=DVC_REMOTE_NAME, mode='rb') as fd:
                preprocessor = joblib.load(fd)
                main_logger.info('✅ successfully loaded preprocessor via dvc api')

        except Exception as e:
            main_logger.error(f'❌ general loading error: {e}', exc_info=True)

### ... the rest components remain the same  ...

Then, update the Dockerfile to enable Docker to correctly reference the DVC components:

Dockerfile.lambda.production:

# use an official python runtime
FROM public.ecr.aws/lambda/python:3.12

# set environment variables (adding dvc related env variables)
ENV JOBLIB_MULTIPROCESSING=0
ENV DVC_HOME="/tmp/.dvc"
ENV DVC_CACHE_DIR="/tmp/.dvc/cache"
ENV DVC_REMOTE_NAME="storage"
ENV DVC_GLOBAL_SITE_CACHE_DIR="/tmp/dvc_global"

# copy requirements file and install dependencies
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir dvc dvc-s3

# setup dvc
RUN dvc init --no-scm
RUN dvc config core.no_scm true

# copy the code to the lambda task root
COPY . ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]

Lastly, ensure the large files are ignored from the Docker container image:

.dockerignore:

### ... the rest components remain the same  ...

# dvc cache contains large files
.dvc/cache
.dvcignore

# add all folders that DVC will track
data/
preprocessors/
models/
reports/
metrics/

Test in Local

Finally, we’ll build and test the Docker image:

$docker build -t my-app -f Dockerfile.lambda.local .
$docker run -p 5002:5002 -e ENV=local my-app app.py

Upon the successful configuration, the waitress server will run the Flask application.

After confirming the changes, push the code to Git:

$git add .
$git commit -m'updated dockerfiles and flask app scripts'
$git push

This push command triggers the CI/CD pipeline via GitHub Actions, which generates a Docker container image and pushes it to AWS ECR.

And then after a successful pipeline flow and verification, we can manually run the deployment workflow using GitHub Actions.

And that’s it!

You can learn more here: Integrating the infrastructure CI/CD pipeline to an ML application

All code is available in my GitHub repository.

The mock app is available here.

Conclusion

Building robust ML applications requires comprehensive ML lineage to ensure reliability and traceability.

In this article, you learned how to build an ML lineage by integrating open-source services like DVC and Prefect.

In practice, initial planning matters. Specifically, defining how metrics are tracked and at which stages leads directly to a cleaner, more maintainable code structure and the extensibility in the future.

Moving forward, we can consider adding more stages to the lineage and integrating advanced logic for data drift detection or fairness tests.

This will further ensure continued model performance and data integrity in the production environment.

You can check out my Portfolio / Github.

All images, unless otherwise noted, are by the author.

Learn MLOps by Creating a YouTube Sentiment Analyzer

Beau Carnes — Sat, 14 Jun 2025 20:04:59 +0000

If you’re serious about machine learning and want to break into real-world ML engineering, learning MLOps is one of the best things you can do. It’s what turns experiments into reliable systems. You can train a great model, but without the right pipeline to deploy, monitor, and update it, that model won’t be useful in a real application. MLOps is how companies ship machine learning at scale, and even if you're working solo or on smaller projects, knowing how to build proper ML pipelines will save you tons of time and headaches. Plus, it's one of the most in-demand skill sets in the industry right now.

We just released a new video course on the freeCodeCamp.org YouTube channel that teaches you how to build a full end-to-end MLOps pipeline by working on a real, practical project. You’ll create a system that analyzes the sentiment of YouTube comments in real time using a Chrome extension. This isn’t just another toy example. It’s a full production-ready pipeline that covers everything from data collection to deployment, and it uses real tools that are used in modern ML workflows: MLflow, DVC, Docker, AWS, Flask, and more. The course is taught by Bappy Ahmed, who walks through each step in a clear, practical way, so you actually understand what’s happening.

Here’s what the course covers:

Introduction & Project Planning – Understand the problem and design the architecture of the full pipeline.
Data Collection – Learn how to scrape YouTube comments and prepare the data you'll use to train the sentiment model.
Data Preprocessing & EDA – Clean the data, explore patterns, and prep it for training.
Setup MLflow Server on AWS – Use MLflow to track experiments and manage your models.
Building a Baseline Model – Start simple with a basic model to establish a performance benchmark.
Improving the Model – Experiment with techniques like Bag of Words, TFIDF, adjusting feature sizes, handling class imbalance, and hyperparameter tuning.
Stacking Models – Use ensemble techniques to combine different models for better performance.
Build a Full ML Pipeline Using DVC – Break your code into modular components (data ingestion, preprocessing, model building, etc.) and version everything using DVC.
Model Evaluation and Registration with MLflow – Evaluate performance and keep track of the best models.
Deploy with Flask and Docker – Wrap your model in a Flask API, containerize it with Docker, and prepare it for production.
Create a Chrome Plugin – Build a browser extension that interacts with your deployed model in real time.
CI/CD Deployment on AWS – Automate your deployment so updates happen smoothly and reliably.

By the end, you’ll have a working, deployed MLOps project that shows you understand the full ML lifecycle. This course is perfect for anyone who already knows a bit of machine learning and wants to level up their engineering skills.

You can watch the full course on the freeCodeCamp.org YouTube channel (3-hour watch).

How to Automate Compliance and Fraud Detection in Finance with MLOps

Balajee Asish Brahmandam — Mon, 12 May 2025 16:21:29 +0000

These days, businesses are under increasing pressure to comply with stringent regulations while also combating fraudulent activities. The high volume of data and the intricate requirements of real-time fraud detection and compliance reporting are frequently a challenge for traditional systems to manage.

This is where MLOps (Machine Learning Operations) comes into play. It can help teams streamline these processes and elevate automation to the forefront of financial security and regulatory adherence.

In this article, we will investigate the potential of MLOps for automating compliance and fraud detection in the finance sector.

I’ll show you step by step how financial institutions can deploy a machine learning model for fraud detection and integrate it into their operations to ensure continuous monitoring and automated alerts for compliance. I’ll also demonstrate how to deploy this solution in a cloud-based environment using Google Colab, ensuring that it is both user-friendly and accessible, whether you are a beginner or more advanced.

Here’s what we’ll cover:

What is MLOps?
What You’ll Need
Step 1: Set Up Google Colab and Prepare the Data
Step 2: Data Preprocessing
Step 4: Retrain the Model with New Data
Step 5: Automated Alert System
Step 6: Visualize Model Performance
Conclusion
Key Takeaways

What is MLOps?

Machine Learning Operations, or MLOps for short, is a methodology that integrates DevOps with Machine Learning (ML). The whole machine learning model lifecycle, including development, training, deployment, monitoring, and maintenance, can be automated with its help.

MLOps has several main goals: continuous optimization, scalability, and the delivery of operational value over time.

The financial industry provides great use cases for MLOps processes and techniques, as these can help businesses manage complicated data pipelines, deploy models in real-time, and evaluate their performance – all while making sure they're compliant with regulations.

Why is MLOps Important in Finance?

Financial institutions are subject to various rules including Anti-Money Laundering (AML), Know Your Customer (KYC), and Fraud Prevention Regulations – so they have to carefully manage private information. Ignoring these rules might result in severe fines and loss of reputation.

Detecting fraud in financial transactions also calls for advanced systems capable of real-time identification of suspicious activity.

MLOps can help to solve these issues in the following ways:

MLOps lets financial institutions automatically track transactions for regulatory compliance, guaranteeing they follow changing legislation.
MLOps helps to create and implement machine learning models that can identify fraudulent transactions in real-time.
MLOps runs automated processes, enabling organizations to expand their fraud detection systems with as little human involvement as possible through automation.

What You’ll Need:

To follow along with this tutorial, ensure that you have the following:

Python installed, along with basic ML libraries such as scikit-learn, Pandas, and NumPy.
A sample dataset of financial transactions, which we will use to train a fraud detection model (You can use this sample dataset if you don’t have one on hand).
Google Colab (for cloud-based execution), which is free to use and doesn't require installation.

Step 1: Set Up Google Colab and Prepare the Data

Google Colab is an ideal choice for beginners and advanced users alike, because it’s cloud-based and doesn’t require installation. To start get started using it, follow these steps:

Access Google Colab:

Visit Google Colab and sign-in with your Google account.

Create a New Notebook:

In the Colab interface, go to File and then select New Notebook to create a fresh notebook.

Import Libraries and Load the Dataset

Now, let’s import the necessary libraries and load our fraud detection dataset. We'll assume the dataset is available as a CSV file, and we'll upload it to Colab.

Import libraries:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

Upload the Dataset:

from google.colab import files
uploaded = files.upload()

# Load dataset into pandas DataFrame
data = pd.read_csv('data.csv')
print(data.head())

Step 2: Data Preprocessing

Data preprocessing is essential to prepare the dataset for model training. This involves handling missing values, encoding categorical variables, and normalizing numerical features.

Why is Preprocessing Important?

Data preprocessing lets you take care of various data issues that could affect your results. During this process, you’ll:

Handle missing values: Financial datasets often have missing values. Filling in these missing values (for example, with the median) ensures that the model doesn’t encounter errors during training.
Convert categorical data: Machine learning algorithms require numerical input, so categorical features (like transaction type or location) need to be converted into numeric format using one-hot encoding.
Normalize data: Some machine learning models, like Random Forest, are not sensitive to feature scaling, but normalization helps maintain consistency and allows us to compare the importance of different features. This step is especially critical for models that rely on gradient descent.

Here’s an example:

# Handle missing data by filling with the median value for each column
data.fillna(data.median(), inplace=True)

# Convert categorical columns to numeric using one-hot encoding
data = pd.get_dummies(data, drop_first=True)

# Normalize numerical columns for scaling
data['normalized_amount'] = (data['Amount'] - data['Amount'].mean()) / data['Amount'].std()

# Separate features and target variable
X = data.drop(columns=['Class'])
y = data['Class']

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data preprocessing completed.")

Step 3: Train a Fraud Detection Model

We'll now train a RandomForestClassifier and evaluate its performance.

What is a Random Forest Classifier?

A Random Forest is an ensemble learning method that creates a collection (forest) of decision trees, typically trained with different parts of the data. It aggregates their predictions to improve accuracy and reduce overfitting.

This method is a popular choice for fraud detection because it can handle high-dimensional data. It’s also quite robust against overfitting.

Here’s how you can implement the Random Forest Classifier:

# Initialize the Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=150, random_state=42)

# Train the model on the training data
rf_model.fit(X_train, y_train)

# Predict on the test data
y_pred = rf_model.predict(X_test)

# Evaluate model performance
print("Model Evaluation:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Plot confusion matrix for visual understanding
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
cax = ax.matshow(cm, cmap='Blues')
fig.colorbar(cax)
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

How the model is evaluated:

Classification report: Shows metrics like precision, recall, and F1-score for the fraud and non-fraud classes.
Confusion matrix: Helps visualize the performance of the model by showing the true positives, false positives, true negatives, and false negatives.

Step 4: Retrain the Model with New Data

Once you have trained your model, it’s important to retrain it periodically with new data to ensure that it continues to detect emerging fraud patterns.

What is Retraining?

Retraining the model ensures that it adapts to new, unseen data and improves over time. In the case of fraud detection, retraining is crucial because fraud tactics evolve over time, and your model needs to stay up-to-date to recognize new patterns.

Here’s how you can do this:

# Simulate loading new fraud data
new_data = pd.read_csv('new_fraud_data.csv')

# Apply preprocessing steps to new data (like filling missing values, encoding, normalization)
new_data.fillna(new_data.median(), inplace=True)
new_data = pd.get_dummies(new_data, drop_first=True)
new_data['normalized_amount'] = (new_data['transaction_amount'] - new_data['transaction_amount'].mean()) / new_data['transaction_amount'].std()

# Concatenate old and new data for retraining
X_new = new_data.drop(columns=['fraud_label'])
y_new = new_data['fraud_label']

# Retrain the model with the updated dataset
X_combined = pd.concat([X_train, X_new], axis=0)
y_combined = pd.concat([y_train, y_new], axis=0)

rf_model.fit(X_combined, y_combined)

# Re-evaluate the model
y_pred_new = rf_model.predict(X_test)
print("Updated Model Evaluation:\n", classification_report(y_test, y_pred_new))

Step 5: Automated Alert System

To automate fraud detection, we’ll send an email whenever a suspicious transaction is detected.

How the Alert System Works

The email alert system uses SMTP to send an email whenever fraud is detected. When the model identifies a suspicious transaction, it triggers an automated alert to notify the compliance team for further investigation.

import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

# Function to send an email alert
def send_alert(email_subject, email_body):
    sender_email = "your_email@example.com"
    receiver_email = "compliance_team@example.com"
    password = "your_password"

    msg = MIMEMultipart()
    msg['From'] = sender_email
    msg['To'] = receiver_email
    msg['Subject'] = email_subject

    msg.attach(MIMEText(email_body, 'plain'))

    # Send email using SMTP
    try:
        server = smtplib.SMTP_SSL('smtp.example.com', 465)
        server.login(sender_email, password)
        text = msg.as_string()
        server.sendmail(sender_email, receiver_email, text)
        server.quit()
        print("Fraud alert email sent successfully.")
    except Exception as e:
        print(f"Failed to send email: {str(e)}")

# Example: Check for fraud and trigger an alert
suspicious_transaction_details = "Transaction ID: 12345, Amount: $5000, Suspicious Activity Detected."
send_alert("Fraud Detection Alert", f"A suspicious transaction has been detected: {suspicious_transaction_details}")

Step 6: Visualize Model Performance

Finally, we will visualize the performance of the model using an ROC curve (Receiver Operating Characteristic Curve), which helps evaluate the trade-off between the true positive rate and false positive rate.

Visualizing the performance of a machine learning model is an essential step in understanding how well the model is doing, especially when it comes to evaluating its ability to detect fraudulent transactions.

What is an ROC curve?

An ROC curve shows how well a model performs across all classification thresholds. It plots the True Positive Rate (TPR) versus the False Positive Rate (FPR). The area under the ROC curve (AUC) provides a summary measure of model performance.

from sklearn.metrics import roc_curve, auc

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, rf_model.predict_proba(X_test)[:,1])
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

The ROC curve gives us a comprehensive picture of how well our model is distinguishing between the two classes across various thresholds. By evaluating this curve, we can make decisions on how to tune the model’s threshold to find the best balance between detecting fraud and minimizing false alarms (that is, minimizing false positives).

Conclusion

By following this guide, you’ve learned how to leverage MLOps to automate fraud detection and ensure compliance in the financial industry using Google Colab. This cloud-based environment makes it easy to work with machine learning models without the hassle of local setups or configurations.

From automating data preprocessing to deploying models in production, MLOps offers an end-to-end solution that improves efficiency, scalability, and accuracy in detecting fraudulent activities.

By integrating real-time monitoring and continuous updates, financial institutions can stay ahead of fraud threats while ensuring regulatory compliance with minimal manual effort.

Key Takeaways

MLOps automates the whole machine learning model lifecycle by integrating machine learning with DevOps.
Simplifies regulatory compliance and fraud detection, letting banks spot fraudulent transactions automatically.
Maintains fraud detection systems current with fresh data through constant monitoring and model retraining.
Machine learning model development and testing may be done on Google Colab, a free cloud-based platform that provides access to GPUs and TPUs. No local installation is required.
Allows for automated workflows to detect suspicious behavior and send out alerts in real-time, allowing for fraud detection and alerting.
Continuous integration/continuous delivery pipelines guarantee continuous system improvement by automating the testing and deployment of new fraud detection models.
Financial organizations may save money using MLOps because cloud-based systems like Google Colab lower infrastructure expenses.

How to Make IT Operations More Efficient with AIOps: Build Smarter, Faster Systems

Balajee Asish Brahmandam — Fri, 09 May 2025 21:20:18 +0000

In the rapidly evolving IT landscape, development teams have to operate at their best and manage complex systems while minimizing downtime. And having to do many routine tasks manually can really slow down operations and reduce efficiency.

These days, we can use artificial intelligence to manage and enhance IT operations. This is where AIOps for IT operations comes into play.

AIOps is changing IT operations as it lets teams create better, faster systems that can find and resolve problems on their own. It also helps them make the best use of resources, and grow without as many problems.

In this tutorial, you’ll learn about the key components of AIOps, how they interact with other IT systems, and how you can apply AIOps to improve the efficiency of your environment.

Here’s what we’ll cover:

What is AIOps?
- The Significance of AIOps for IT Operations
- AIOps can help address these challenges by
Getting Started with AIOps
Real-World Use Case: AIOps in Cloud Infrastructure and Incident Management
Conclusion

What is AIOps?

AIOps is artificial intelligence for IT operations. It means enhancing and streamlining IT chores by means of artificial intelligence and machine learning.

AIOps systems examine the vast volumes of data generated by IT systems, such as logs and metrics, while utilizing machine learning methods. The main objective of AIOps is to enable companies to more quickly and effectively identify and resolve IT issues.

Key components of AIOps include:

Anomaly detection: the process of spotting unusual patterns in a system's operation that might indicate a problem.
Event correlation: the process of examining data from several sources to determine how they complement one another and help to explain why issues arise.
Automated response: acting to resolve issues without human assistance.

The Significance of AIOps for IT Operations

The rise of hybrid and multi-cloud platforms, microservices architectures, and systems that can expand quickly are complicating IT operations. Often, conventional IT management tools fall behind the size and speed of the systems that we need to monitor and maintain.

Here are some issues that often come up in standard IT operations:

Manual troubleshooting: IT teams sometimes must comb through logs and reports by hand to identify the root of issues.
Long settlement times: The longer it takes to resolve a problem after discovery, the more downtime and dissatisfied users result.
Scalability: Monitoring all system components becomes more difficult as they grow since more manual labor would be required.

AIOps can help address these challenges by

Improving incident resolution times: By correlating events and providing actionable insights, AIOps can resolve problems in real-time.
Scaling effortlessly: AIOps can handle large volumes of data and events without additional resources, making it ideal for scaling operations
Automating incident detection and response: AI models can detect issues and automatically resolve them, reducing manual intervention.

You can better understand AIOps by looking at its main components:

1. Machine Learning for Predictive Analytics

AIOps tools forecast future events by means of machine learning and examining historical data. Prediction analytics, for example, can inform teams when a system's performance is likely to decline, letting them address the issue before it worsens.

2. Automating and Self-Healing

AIOps lets your team automate daily tasks, eliminating the need for human intervention. Services, for instance, can be restarted, or resources can be relocated. Running the company costs less, and problem resolution takes less time.

3. Event Correlation and Root Cause Analysis

Event correlation is the technique of linking events from several related systems to identify the root cause of the problem. For instance, AIOps will examine server, network, and application logs to determine what’s wrong – whether it’s a network problem or a web application failure – and correct it.

Getting Started with AIOps

Enhancing your team’s IT operations with AIOps involves including tools and procedures run by artificial intelligence in your present system. These are the most crucial actions to start with:

1. Choose an AIOps Tool

There are several AIOps platforms available, each with its own set of features. Some popular AIOps tools include:

Moogsoft: An AIOps platform that uses machine learning for event correlation, anomaly detection, and incident management.
BigPanda: Focuses on automating incident management and root cause analysis.
Splunk IT Service Intelligence: Offers advanced analytics for monitoring and managing IT infrastructure.

When selecting an AIOps tool, consider the following:

Integration with existing tools: Ensure the platform integrates with your current monitoring, logging, and alerting systems.
Scalability: The platform should be able to handle large volumes of data and scale with your organization.
Ease of use: Look for a user-friendly interface and automation capabilities to minimize manual intervention.

2. Implement AIOps in Your IT Environment

These are the steps you’ll need to take to integrate AIOps into your IT operations:

Data aggregation: is the process of collecting data from various sources, including computers, network devices, cloud infrastructure, and applications, and consolidating it all onto one platform.
Determine thresholds and KPIs: Identify the most crucial key performance indicators such as error rates, system uptime, and response for your company.
Establishing alerts and automation: For instance, when thresholds are crossed, configure automatic responses to restart services or raise resource consumption.

3. Leverage Machine Learning for Anomaly Detection

Machine learning models are quite crucial in the search for anomalies. These models can identify trends that are not usual and learn from prior data. This enables IT departments to identify issues early on before they escalate.

Example: A machine learning model may detect a spike in CPU usage that is unusual for a particular time of day, triggering an alert or automatic remediation process, such as scaling the application to add more resources.

import numpy as np
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt

# Example dataset (e.g., CPU usage or network traffic over time)
data = np.array([50, 51, 52, 53, 200, 55, 56, 57, 58, 60]).reshape(-1, 1)

# Initialize Isolation Forest model for anomaly detection
model = IsolationForest(contamination=0.1)  # 10% outliers
model.fit(data)

# Predict anomalies: -1 indicates anomaly, 1 indicates normal
predictions = model.predict(data)

# Plotting the results
plt.plot(data, label="System Metric")
plt.scatter(np.arange(len(data)), data, c=predictions, cmap="coolwarm", label="Anomalies")
plt.title("Anomaly Detection in System Metric")
plt.legend()
plt.show()

4. Automate Root Cause Analysis

AIOps platforms can automatically correlate data from various sources to identify the root cause of incidents. For instance, if an application is experiencing high response times, AIOps can check the server logs, network status, and database performance to determine if the issue is due to a server failure, database bottleneck, or network congestion.

import splunklib.client as client
import splunklib.results as results

# Connect to Splunk server (replace with actual credentials)
service = client.Service(
    host='localhost',
    port=8089,
    username='admin',
    password='password'
)

# Perform a search query to find events related to system issues
search_query = 'search index=main "error" OR "fail" | stats count by sourcetype'

# Run the search
job = service.jobs.create(search_query)

# Wait for the search job to complete
while not job.is_done():
    print("Waiting for results...")
    time.sleep(2)

# Retrieve and process the results
for result in results.JSONResultsReader(job.results()):
    print(result)

5. Set Up Automated Responses Using Webhooks

In AIOps, automated incident response is triggered through Webhooks or other messaging systems. For example, when an anomaly is detected, a Webhook can notify a team or initiate a resolution process.

import requests

# Simulate an anomaly detection system that triggers when an anomaly is found
def send_alert_to_webhook(anomaly_detected):
    webhook_url = 'https://your-webhook-url.com'
    payload = {
        "text": f"Alert: Anomaly detected! Please review the system metrics immediately."
    }

    if anomaly_detected:
        response = requests.post(webhook_url, json=payload)
        print("Alert sent to webhook")
        return response.status_code
    return None

# Simulate anomaly detection
anomaly_detected = True  # Set to True when an anomaly is found

# Trigger automated response (alert)
status_code = send_alert_to_webhook(anomaly_detected)

if status_code == 200:
    print("Webhook triggered successfully")
else:
    print("Failed to trigger webhook")

6. Automate system cleanup with Ansible (sample playbook)

Automatic remediation is a major component of AIOps in resolving issues without any human intervention. Like restarting a service when a system measure exceeds a particular threshold, here is an illustration of an Ansible script that automatically resolves an issue.

- name: Automated Remediation for High CPU Usage
  hosts: all
  become: true
  tasks:
    - name: Check CPU Usage
      shell: "top -bn1 | grep load | awk '{printf \"%.2f\", $(NF-2)}'"
      register: cpu_load
      changed_when: false

    - name: Restart service if CPU load is high
      service:
        name: "your-service-name"
        state: restarted
      when: cpu_load.stdout | float > 80.0

Real-World Use Case: AIOps in Cloud Infrastructure and Incident Management

Imagine a large-scale e-commerce company that operates in the cloud, hosting its infrastructure on AWS. The company’s platform is supported by hundreds of virtual machines (VMs), microservices, databases, and web servers.

As the company grows, so do the complexities of its IT operations, especially in managing system health, uptime, and performance. The company has a traditional monitoring setup in place using basic cloud-native tools. But as the platform scales, the sheer volume of data (logs, metrics, alerts) overwhelms the IT team, leading to delays in identifying the root cause of issues and resolving them in real time.

Challenges:

Incident overload: With hundreds of alerts coming in daily, the team struggled to prioritize critical incidents, which led to slower resolution times.
Manual processes: Identifying the root cause of issues required manual sifting through logs, which was time-consuming and error-prone.
Scalability issues: As the company scaled its infrastructure, manual intervention became increasingly inefficient, and the system could not dynamically respond to issues without human input.

AIOps implementation:

The company decided to implement an AIOps platform to streamline incident management, automate responses, and predict issues before they occurred.

Step 1: Setting Up Monitoring with Prometheus

First, we need to monitor system performance to collect metrics such as CPU usage and memory consumption. We’ll use Prometheus, an open-source monitoring tool, to collect this data.

Install Prometheus:

First, download and install Prometheus:

wget https://github.com/prometheus/prometheus/releases/download/v2.27.1/prometheus-2.27.1.linux-amd64.tar.gz
tar -xvzf prometheus-2.27.1.linux-amd64.tar.gz
cd prometheus-2.27.1.linux-amd64/
./prometheus

Then install Node Exporter (to collect system metrics):

wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
tar -xvzf node_exporter-1.1.2.linux-amd64.tar.gz
cd node_exporter-1.1.2.linux-amd64/
./node_exporter

Next, configure Prometheus to scrape metrics from Node Exporter:

##Edit prometheus.yml to scrape metrics from the Node Exporter:
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

And start Prometheus:

./prometheus --config.file=prometheus.yml

You can now access Prometheus via http://localhost:9090 to verify that it's collecting metrics.

Step 2: Collecting System Data (CPU Usage)

Now that we have Prometheus collecting metrics, we need to extract CPU usage data (which will be the focus of our anomaly detection) from Prometheus.

Querying Prometheus API for CPU Usage

We’ll use Python to query Prometheus and retrieve CPU usage data (for example, using the node_cpu_seconds_total metric). We’ll fetch the data for the last 30 minutes.

import requests
import pandas as pd
from datetime import datetime, timedelta

# Define the Prometheus URL and the query
prom_url = "http://localhost:9090/api/v1/query_range"
query = 'rate(node_cpu_seconds_total{mode="user"}[1m])'

# Define the start and end times
end_time = datetime.now()
start_time = end_time - timedelta(minutes=30)

# Make the request to Prometheus API
response = requests.get(prom_url, params={
    'query': query,
    'start': start_time.timestamp(),
    'end': end_time.timestamp(),
    'step': 60
})

data = response.json()['data']['result'][0]['values']
timestamps = [item[0] for item in data]
cpu_usage = [item[1] for item in data]

# Create a DataFrame for easier processing
df = pd.DataFrame({
    'timestamp': pd.to_datetime(timestamps, unit='s'),
    'cpu_usage': cpu_usage
})

print(df.head())

Step 3: Anomaly Detection with Machine Learning

To detect anomalies in CPU usage, we’ll use Isolation Forest, a machine learning algorithm from Scikit-learn.

Train an Anomaly Detection Model:

First, install Scikit-learn:

pip install scikit-learn matplotlib

Then you’ll need to train the model using the CPU usage data we collected:

from sklearn.ensemble import IsolationForest
import numpy as np
import matplotlib.pyplot as plt

# Prepare the data for anomaly detection (CPU usage data)
cpu_usage_data = df['cpu_usage'].values.reshape(-1, 1)

# Train the Isolation Forest model (anomaly detection)
model = IsolationForest(contamination=0.05)  # 5% expected anomalies
model.fit(cpu_usage_data)

# Predict anomalies (1 = normal, -1 = anomaly)
predictions = model.predict(cpu_usage_data)

# Add predictions to the DataFrame
df['anomaly'] = predictions

# Visualize the anomalies
plt.figure(figsize=(10, 6))
plt.plot(df['timestamp'], df['cpu_usage'], label='CPU Usage')
plt.scatter(df['timestamp'][df['anomaly'] == -1], df['cpu_usage'][df['anomaly'] == -1], color='red', label='Anomaly')
plt.title("CPU Usage with Anomalies")
plt.xlabel("Time")
plt.ylabel("CPU Usage (%)")
plt.legend()
plt.show()

Step 4: Automating Incident Response with AWS Lambda

When an anomaly is detected (for example, high CPU usage), AIOps can automatically trigger a response, such as scaling up resources.

AWS Lambda for Automated Scaling

Here’s an example of how to use AWS Lambda to scale up EC2 instances when CPU usage exceeds a threshold.

First, create your AWS Lambda function that scales EC2 instances when CPU usage exceeds 80%.

import boto3

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')

    # If CPU usage exceeds threshold, scale up EC2 instance
    if event['cpu_usage'] > 0.8:  # 80% CPU usage
        instance_id = 'i-1234567890'  # Replace with your EC2 instance ID
        ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={'Value': 't2.large'})

    return {
        'statusCode': 200,
        'body': f'Instance {instance_id} scaled up due to high CPU usage.'
    }

Then you’ll need to trigger the Lambda function. Set up AWS CloudWatch Alarms to monitor the output from the anomaly detection and trigger the Lambda function when CPU usage exceeds the threshold.

Step 5: Proactive Resource Scaling with Predictive Analytics

Finally, using predictive analytics, AIOps can forecast future resource usage and proactively scale resources before problems arise.

Predictive Scaling:

We’ll use a linear regression model to predict future CPU usage and trigger scaling events proactively.

Start by training a predictive model:

from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd

# Historical data (CPU usage trends)
data = pd.DataFrame({
    'timestamp': pd.date_range(start="2023-01-01", periods=100, freq='H'),
    'cpu_usage': np.random.normal(50, 10, 100)  # Simulated data
})

X = np.array(range(len(data))).reshape(-1, 1)  # Time steps
y = data['cpu_usage']

model = LinearRegression()
model.fit(X, y)

# Predict next 10 hours
future_prediction = model.predict([[len(data) + 10]])
print("Predicted CPU usage:", future_prediction)

If the predicted CPU usage exceeds a threshold, AIOps can trigger auto-scaling using AWS Lambda or Kubernetes.

Results:

Reduced incident resolution time: The average time to resolve incidents dropped from hours to minutes because AIOps helped the team identify issues faster.
Reduced false positives: By using anomaly detection, the system significantly reduced the number of false alerts.
Increased automation: With automated responses in place, the system dynamically adjusted resources in real time, reducing the need for manual intervention.
Proactive issue management: Predictive analytics enabled the team to address potential problems before they became critical, preventing performance degradation.

Conclusion

AIOps transforms IT operations, enabling companies to build more efficient, responsive, and superior systems. By automating routine tasks, identifying issues before they worsen, and providing real-time data, AIOps is altering the function of IT teams.

AIOps is the most effective tool for increasing system speed, reducing downtime, and streamlining your IT procedures. You can begin modestly, and gradually include more functionality. Then you’ll start to see how AIOps opens your IT environment to fresh ideas and increases its efficiency.

mlops - freeCodeCamp.org

How to Create a GPU-Optimized Machine Image with HashiCorp Packer on GCP

Table of Contents

Prerequisites

Project Setup

Step 1: Install Packer

Step 2: Set Up Project Directory

Step 3: Install Packer's Plugins

Step 4: Define Your Source

Step 5: Writing the Build Template

Step 6: Writing the GPU Provisioning Script

Section 1: Pre-Installation (Kernel Headers)

Section 2: Installing NVIDIA's Apt Repository

Section 3: Pinning NVIDIA Drivers Version

Section 4: Installing the Driver

Section 5: CUDA Toolkit Installation

Section 6: NVIDIA Container Toolkit

Section 7: Installing DCGM (Data Center GPU Manager)

Section 8: Enabling Persistence Mode

Section 9: System Tuning for GPU Compute Workloads

Step 7: Assembling and Running the Build

Step 8: Test the Image and Verify the GPU Stack

Conclusion

References

Model Packaging Tools Every MLOps Engineer Should Know

Table Of Contents

Model Serialization Formats

4. Pickle and Joblib

Model Bundling and Serving Tools

Model Registries

Conclusion

How to Use MLflow to Manage Your Machine Learning Lifecycle

Table of Contents:

Prerequisites:

MLflow Architecture: The Big Picture

Understanding MLflow Tracking

A Tracking Example

Where Does the Data Actually Go?

Why Bother with This Setup?

Understanding MLflow Projects

The MLproject File

Why this Actually Matters

Understanding the MLflow Model Registry

Moving a Model through the Pipeline

Why Does This Matter?

How the Components Fit Together

Wrapping Up

How to Build an End-to-End ML Platform Locally: From Experiment Tracking to CI/CD

Table of Contents

Project Overview and Setup

Tech Stack

Prerequisites

Project Structure

1. Build a Simple Model and API (The Naive Approach)

1.1 Train a Quick Model

1.2 Serve Predictions with FastAPI

2. Where the Naive Approach Breaks

Problem 1: No Experiment Tracking (Reproducibility)

Problem 2: Model Versioning and Deployment Chaos

Problem 3: No Data Validation – Garbage In, Garbage Out

Problem 4: Model Drift – Performance Decay Over Time

Problem 5: No CI/CD or Deployment Safety

Summary: What We Need to Fix

3. Add Experiment Tracking and Model Registry with MLflow

3.1 How to Set Up the MLflow Tracking Server

3.2 How to Log Experiments in Code

3.3 How to Use the Model Registry

3.4 Update API to Load from Registry

4. Ensure Feature Consistency with Feast

4.1 What is Feast and Why Use It?

4.2 Install and Initialize Feast

4.3 Define Feature Definitions

4.4 Materialize Features to Online Store

4.5 Retrieve Features for Training and Serving

Why Feast Over Custom Code?

5. Add Data Validation with Great Expectations

5.1 Define Expectations

When to Use Which Validation Approach

5.2 Integrate Validation into FastAPI

6. Monitor Model Performance and Data Drift