<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ mlops - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ mlops - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 01 Jun 2026 18:37:07 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/mlops/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Create a GPU-Optimized Machine Image with HashiCorp Packer on GCP ]]>
                </title>
                <description>
                    <![CDATA[ Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers, DCGM, apply OS‑level GPU tuning, and fight dependency issues. Same old ritual every single time, wasting expensi ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-a-gpu-optimized-machine-image-with-hashicorp-packer-on-gcp/</link>
                <guid isPermaLink="false">69e93606d5f8830e7d9fbad6</guid>
                
                    <category>
                        <![CDATA[ GPU ]]>
                    </category>
                
                    <category>
                        <![CDATA[ VM Image ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GCP ]]>
                    </category>
                
                    <category>
                        <![CDATA[ hashicorp packer ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rasheedat Atinuke Jamiu ]]>
                </dc:creator>
                <pubDate>Wed, 22 Apr 2026 20:30:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/fd393878-fe7c-458a-addf-7cd22d8280ac.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers, DCGM, apply OS‑level GPU tuning, and fight dependency issues. Same old ritual every single time, wasting expensive cloud credits and getting frustrated before actual work begins.</p>
<p>In this article, you'll build a reusable GPU-optimized machine image using Packer, pre-loaded with NVIDIA drivers, CUDA Toolkit, NVIDIA Container Toolkit, DCGM, and system-level GPU tuning like persistence mode.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a href="#heading-step-1-install-packer">Step 1: Install Packer</a></p>
</li>
<li><p><a href="#heading-step-2-set-up-project-directory">Step 2: Set Up Project Directory</a></p>
</li>
<li><p><a href="#heading-step-3-install-packers-plugins">Step 3: Install Packer's Plugins</a></p>
</li>
<li><p><a href="#heading-step-4-define-your-source">Step 4: Define Your Source</a></p>
</li>
<li><p><a href="#heading-step-5-writing-the-build-template">Step 5: Writing the Build Template</a></p>
</li>
<li><p><a href="#heading-step-6-writing-the-gpu-provisioning-script">Step 6: Writing the GPU Provisioning Script</a></p>
<ul>
<li><p><a href="#heading-section-1-pre-installation-kernel-headers">Section 1: Pre-Installation (Kernel Headers)</a></p>
</li>
<li><p><a href="#heading-section-2-installing-nvidias-apt-repository">Section 2: Installing NVIDIA's Apt Repository</a></p>
</li>
<li><p><a href="#heading-section-3-pinning-nvidia-drivers-version">Section 3: Pinning NVIDIA Drivers Version</a></p>
</li>
<li><p><a href="#heading-section-4-installing-the-driver">Section 4: Installing the Driver</a></p>
</li>
<li><p><a href="#heading-section-5-cuda-toolkit-installation">Section 5: CUDA Toolkit Installation</a></p>
</li>
<li><p><a href="#heading-section-6-nvidia-container-toolkit">Section 6: NVIDIA Container Toolkit</a></p>
</li>
<li><p><a href="#heading-section-7-installing-dcgm-data-center-gpu-manager">Section 7: Installing DCGM (Data Center GPU Manager)</a></p>
</li>
<li><p><a href="#heading-section-8-enabling-persistence-mode">Section 8: Enabling Persistence Mode</a></p>
</li>
<li><p><a href="#heading-section-9-system-tuning-for-gpu-compute-workloads">Section 9: System Tuning for GPU Compute Workloads</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-7-assembling-and-running-the-build">Step 7: Assembling and Running the Build</a></p>
</li>
<li><p><a href="#heading-step-8-test-the-image-and-verify-the-gpu-stack">Step 8: Test the Image and Verify the GPU Stack</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><a href="https://www.packer.io/">HashiCorp Packer</a> &gt;= 1.9</p>
</li>
<li><p><a href="https://github.com/hashicorp/packer-plugin-googlecompute">Google Compute Packer plugin</a> (installed via <code>packer init</code>)</p>
</li>
<li><p>Optionally, the <a href="https://github.com/hashicorp/packer-plugin-amazon">AWS Packer plugin</a> can be used for EC2 builds by adding an <code>amazon-ebs</code> source to <code>source.pkr.hcl</code></p>
</li>
<li><p>GCP project with Compute Engine API enabled (or AWS account with EC2 access)</p>
</li>
<li><p>GCP authentication (<code>gcloud auth application-default login</code>) or AWS credentials</p>
</li>
<li><p>Access to an NVIDIA GPU instance type (for example, A100, H100, or L4 on GCP; p4d, p5, or g6 on AWS)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<h3 id="heading-step-1-install-packer">Step 1: Install Packer</h3>
<p>To get started, you'll install Packer with the steps below if you're on macOS (or you can follow the official documentation for Linux and Windows installation <a href="https://developer.hashicorp.com/packer/tutorials/docker-get-started/get-started-install-cli#:~:text=Chocolatey%20on%20Windows-,Linux,-HashiCorp%20officially%20maintains">guides</a>).</p>
<p>First, you'll install the official Packer formula from the terminal.</p>
<p>Install the HashiCorp tap, a repository of all HashiCorp packages.</p>
<pre><code class="language-plaintext">$ brew tap hashicorp/tap
</code></pre>
<p>Now, install Packer with <code>hashicorp/tap/packer</code>.</p>
<pre><code class="language-plaintext">$ brew install hashicorp/tap/packer
</code></pre>
<h3 id="heading-step-2-set-up-project-directory">Step 2: Set Up Project Directory</h3>
<p>With Packer installed, you'll create your project directory. To keep concerns separated, each part of the template lives in its own file. Create these files in a <code>packer_demo</code> folder with the command below:</p>
<pre><code class="language-plaintext">mkdir -p packer_demo/script &amp;&amp; touch packer_demo/{build.pkr.hcl,source.pkr.hcl,variable.pkr.hcl,local.pkr.hcl,plugins.pkr.hcl,values.pkrvars.hcl} packer_demo/script/base.sh
</code></pre>
<p>Your file directory should look like this:</p>
<pre><code class="language-plaintext">packer_demo
├── build.pkr.hcl         # Build pipeline: provisioner ordering
├── source.pkr.hcl        # GCP source definition (googlecompute)
├── variable.pkr.hcl      # Variable definitions with defaults
├── local.pkr.hcl         # Local values
├── plugins.pkr.hcl       # Packer plugin requirements
├── values.pkrvars.hcl    # Variable values (copy and customize)
└── script/
    └── base.sh           # GPU provisioning script
</code></pre>
<h3 id="heading-step-3-install-packers-plugins">Step 3: Install Packer's Plugins</h3>
<p>In your <code>plugins.pkr.hcl</code> file, define your plugins in the <code>packer</code> block. The <code>packer {}</code> block contains Packer settings, including required plugin versions. Inside it, the <code>required_plugins</code> block lists every plugin the template needs to build your image. If you're on Azure or AWS, you can check for the latest plugin <a href="https://developer.hashicorp.com/packer/integrations">here</a>.</p>
<pre><code class="language-hcl">packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = "~&gt; 1"
    }
  }
}
</code></pre>
<p>Then, initialize your Packer plugin with the command below:</p>
<pre><code class="language-plaintext">packer init .
</code></pre>
<h3 id="heading-step-4-define-your-source">Step 4: Define Your Source</h3>
<p>With your plugin initialized, you can now define your source block. The source block configures a specific builder plugin, which is then invoked by a build block. This source block contains your <code>project_id</code>, the <code>zone</code> where the build machine will be created, the <code>source_image_family</code> (think of this as your base image, such as Debian, Ubuntu, and so on), and your <code>source_image_project_id</code>.</p>
<p>In GCP, each OS image family lives in a public image project, such as <code>ubuntu-os-cloud</code> for Ubuntu. Set <code>machine_type</code> to a GPU machine type: because you're building a base image for GPU machines, the temporary build instance must be able to run the GPU setup commands.</p>
<pre><code class="language-hcl">source "googlecompute" "gpu-node" {
  project_id              = var.project_id
  zone                    = var.zone
  source_image_family     = var.image_family
  source_image_project_id = var.image_project_id
  ssh_username            = var.ssh_username
  machine_type            = var.machine_type



  image_name        = var.image_name
  image_description = var.image_description

  disk_size           = var.disk_size
  on_host_maintenance = "TERMINATE"

  tags = ["gpu-node"]

}
</code></pre>
<p>Setting <code>on_host_maintenance = "TERMINATE"</code> tells Compute Engine to stop the VM instead of live-migrating it during host maintenance. VMs with attached GPUs or other specialized hardware can't be live-migrated, so GPU instances need this policy.</p>
<p>You'll define all your variables in the <code>variable.pkr.hcl</code> file and set their values in <code>values.pkrvars.hcl</code>. Remember to add <code>values.pkrvars.hcl</code> to your <code>.gitignore</code>.</p>
<pre><code class="language-hcl">variable "image_name" {
  type        = string
  description = "The name of the resulting image"
}

variable "image_description" {
  type        = string
  description = "Description of the image"
}

variable "project_id" {
  type        = string
  description = "The GCP project ID where the image will be created"
}

variable "image_family" {
  type        = string
  description = "The image family to which the resulting image belongs"
}

variable "image_project_id" {
  type        = list(string)
  description = "The project ID(s) to search for the source image"
}

variable "zone" {
  type        = string
  description = "The GCP zone where the build instance will be created"
}

variable "ssh_username" {
  type        = string
  description = "The SSH username to use for connecting to the instance"
}

variable "machine_type" {
  type        = string
  description = "The machine type to use for the build instance"
}

variable "cuda_version" {
  type        = string
  description = "CUDA toolkit version"
  default     = "13.1"
}

variable "driver_version" {
  type        = string
  description = "NVIDIA driver version"
  default     = "590.48.01"
}

variable "disk_size" {
  type        = number
  description = "Boot disk size in GB"
  default     = 50
}
</code></pre>
<p><code>values.pkrvars.hcl</code></p>
<pre><code class="language-hcl">image_name        = "base-gpu-image-{{timestamp}}"
image_description = "Ubuntu 24.04 LTS with gpu drivers and health checks"
project_id        = "your gcp project id"
image_family      = "ubuntu-2404-lts-amd64"
image_project_id  = ["ubuntu-os-cloud"]
zone              = "us-central1-a"
ssh_username      = "packer"
machine_type      = "g2-standard-4"
disk_size         = 50
driver_version    = "590.48.01"
cuda_version      = "13.1"
</code></pre>
<h3 id="heading-step-5-writing-the-build-template">Step 5: Writing the Build Template</h3>
<p>Create <code>build.pkr.hcl</code>. The <code>build</code> block creates a temporary instance, runs provisioners, and produces an image.</p>
<p>Provisioners in this template are organized as follows:</p>
<ul>
<li><p><strong>First provisioner</strong> runs system updates and upgrades.</p>
</li>
<li><p><strong>Second provisioner</strong> reboots the instance (<code>expect_disconnect = true</code>).</p>
</li>
<li><p><strong>Third provisioner</strong> waits for the instance to come back (<code>pause_before</code>), then runs <code>script/base.sh</code>. This provisioner sets <code>max_retries</code> to handle transient SSH timeouts and passes <code>DRIVER_VERSION</code> and <code>CUDA_VERSION</code> to the script as environment variables.</p>
</li>
</ul>
<p>Lastly, a <code>shell-local</code> post-processor prints the image ID and the completion time:</p>
<pre><code class="language-hcl">build {
  sources = ["source.googlecompute.gpu-node"]

  provisioner "shell" {
    inline = [
      "set -e",
      "sudo apt update",
      "sudo apt -y dist-upgrade"
    ]
  }

  provisioner "shell" {
    expect_disconnect = true
    inline            = ["sudo reboot"]
  }

  # Base: NVIDIA drivers, CUDA, DCGM
  provisioner "shell" {
    pause_before = "60s"
    script       = "script/base.sh"
    max_retries  = 2
    environment_vars = [
      "DRIVER_VERSION=${var.driver_version}",
      "CUDA_VERSION=${var.cuda_version}"
    ]
  }

  post-processor "shell-local" {
    inline = [
      "echo '=== Image Build Complete ==='",
      "echo 'Image ID: ${build.ID}'",
      "date"
    ]
  }
}
</code></pre>
<h3 id="heading-step-6-writing-the-gpu-provisioning-script">Step 6: Writing the GPU Provisioning Script</h3>
<p>Now we'll go through the base script, and break down some parts of it.</p>
<h3 id="heading-section-1-pre-installation-kernel-headers">Section 1: Pre-Installation (Kernel Headers)</h3>
<p>Before installing NVIDIA drivers, the system needs kernel headers and build tools. The NVIDIA driver compiles a kernel module during installation via DKMS, so if the headers for your running kernel aren't present, the module build fails and the driver won't load on boot.</p>
<pre><code class="language-shellscript">log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget
</code></pre>
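<p>A missing header package tends to surface late, as a driver that doesn't load, so an explicit check before the driver install can help. The sketch below assumes the conventional layout where headers are reachable at <code>/lib/modules/$(uname -r)/build</code>. The function name <code>headers_present</code> and its optional root-directory argument are illustrative and not part of the original script.</p>
<pre><code class="language-shellscript">#!/bin/bash
# Illustrative helper (hypothetical name): succeeds when the headers for the
# given kernel release are present under the given root directory.
headers_present() {
  local kernel="$1"
  local root="${2:-/}"
  [ -e "${root%/}/lib/modules/${kernel}/build" ]
}

# Report whether the running kernel has headers installed.
if headers_present "$(uname -r)"; then
  echo "kernel headers found for $(uname -r)"
else
  echo "kernel headers missing for $(uname -r)"
fi
</code></pre>
<p>In a provisioning script, the <code>else</code> branch could abort the build instead of printing a message.</p>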
<h3 id="heading-section-2-installing-nvidias-apt-repository">Section 2: Installing NVIDIA's Apt Repository</h3>
<p>This snippet downloads and installs NVIDIA's official keyring package for your Linux distribution, which adds the repository and the trusted signing keys the system needs to verify CUDA packages.</p>
<pre><code class="language-shellscript">log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq
</code></pre>
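<p>The <code>DISTRO</code> and <code>ARCH</code> values used in the URL are set near the top of the full script. As a rough sketch of how that slug is derived, the helpers below read an os-release style file and build the keyring URL. The function names <code>distro_slug</code> and <code>cuda_keyring_url</code> are hypothetical, and <code>x86_64</code> is assumed as the default architecture.</p>
<pre><code class="language-shellscript">#!/bin/bash
# Illustrative helper (hypothetical name): print the CUDA repo distro slug,
# for example "ubuntu2404", from an os-release style file.
distro_slug() {
  local os_release="${1:-/etc/os-release}"
  # Source in a subshell so ID and VERSION_ID do not leak into the caller.
  ( . "$os_release"; echo "${ID}${VERSION_ID}" | tr -d '.' )
}

# Illustrative helper (hypothetical name): build the keyring package URL.
cuda_keyring_url() {
  local distro="$1"
  local arch="${2:-x86_64}"
  echo "https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/cuda-keyring_1.1-1_all.deb"
}

# Print the URL for the current machine when os-release is available.
if [ -r /etc/os-release ]; then
  cuda_keyring_url "$(distro_slug)"
fi
</code></pre>
<p>On Ubuntu 24.04, <code>ID=ubuntu</code> and <code>VERSION_ID="24.04"</code> combine into <code>ubuntu2404</code>, which matches the directory naming in NVIDIA's repository.</p>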
<h3 id="heading-section-3-pinning-nvidia-drivers-version">Section 3: Pinning NVIDIA Drivers Version</h3>
<p>Pinning the NVIDIA driver to a specific version ensures that the system always installs and keeps using exactly that driver version, even when newer drivers appear in the repository.</p>
<p>NVIDIA drivers are tightly coupled with CUDA toolkit versions, kernel versions, and container runtimes like Docker or the NVIDIA Container Toolkit.</p>
<p>A mismatch, such as the system auto‑upgrading to a newer driver, can cause CUDA to stop working, break GPU acceleration, or make the machine image inconsistent across deployments.</p>
<pre><code class="language-shellscript">log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"
</code></pre>
<h3 id="heading-section-4-installing-the-driver">Section 4: Installing the Driver</h3>
<p>The <code>libnvidia-compute</code> package installs only the compute-related user-space libraries (the CUDA driver components), while <code>nvidia-dkms-open</code> installs the <strong>open-source NVIDIA kernel module</strong>, built locally via DKMS.</p>
<p>Together, these two packages give you a fully functional CUDA driver environment without any GUI or graphics dependencies.</p>
<p>Here, you're using <strong>NVIDIA's compute-only driver stack with the open-source kernel modules</strong>, which deliberately leaves out display-related components that a headless GPU node doesn't need.</p>
<p>This DKMS-based installation method fits well with how Linux distributions package drivers, and it stays lightweight and compute-focused.</p>
<pre><code class="language-shellscript">log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open
</code></pre>
<h3 id="heading-section-5-cuda-toolkit-installation">Section 5: CUDA Toolkit Installation</h3>
<p>This part of the script installs the <strong>CUDA Toolkit</strong> for the specified version and then makes sure that CUDA’s executables and libraries are available system‑wide for every user and every shell session.</p>
<p>It adds CUDA binaries to <code>PATH</code>, so commands like <code>nvcc</code>, <code>cuda-gdb</code>, and <code>compute-sanitizer</code> work without specifying full paths. It also adds CUDA libraries to <code>LD_LIBRARY_PATH</code> and to the dynamic linker cache, so applications can find CUDA's shared libraries at runtime.</p>
<pre><code class="language-shellscript">log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat &lt;&lt;'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig
</code></pre>
<h3 id="heading-section-6-nvidia-container-toolkit">Section 6: NVIDIA Container Toolkit</h3>
<p>This block installs the NVIDIA Container Toolkit and configures it so that containers (Docker or containerd) can access the GPU safely and correctly. It’s a critical step for Kubernetes GPU nodes, Docker GPU workloads, and any system that needs GPU acceleration inside containers.</p>
<pre><code class="language-shellscript">log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi
</code></pre>
<h3 id="heading-section-7-installing-dcgm-data-center-gpu-manager">Section 7: Installing DCGM (Data Center GPU Manager)</h3>
<p>This section covers the installation and validation of NVIDIA DCGM (Data Center GPU Manager), which is NVIDIA’s official management and telemetry framework for data center GPUs.</p>
<p>It offers health monitoring and diagnostics, telemetry (including temperature, clocks, power, and utilization), error reporting, and integration with Kubernetes, Prometheus, and monitoring agents. Your GPU monitoring stack relies on this.</p>
<p>The script extracts the installed version and checks that it meets the <strong>minimum required version</strong> (4.3) for NVIDIA driver 590+, aborting the build if it doesn't. This prevents a mismatch between the GPU driver and DCGM, which would break monitoring and health checks. It also enables Fabric Manager for NVLink/NVSwitch when the service is present, which applies to multi-GPU topologies such as A100/H100 DGX systems or other multi-GPU servers.</p>
<pre><code class="language-shellscript">log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager

DCGM_VER=$(dpkg -s datacenter-gpu-manager 2&gt;/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] &amp;&amp; [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi
</code></pre>
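<p>The major/minor comparison above can also be written as a single version comparison. The sketch below is one alternative, assuming GNU <code>sort</code> with <code>-V</code> (version sort) is available. The helper names <code>strip_epoch</code> and <code>version_ge</code> and the sample version string are made up for illustration.</p>
<pre><code class="language-shellscript">#!/bin/bash
# Illustrative helpers (hypothetical names), assuming GNU sort with -V support.

# Drop a Debian epoch prefix such as "1:" from a package version string.
strip_epoch() {
  echo "$1" | sed 's/^[0-9]*://'
}

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
  local lowest
  lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)"
  [ "$lowest" = "$2" ]
}

# Example with a made-up version string; a real one would come from dpkg -s.
ver="$(strip_epoch '1:4.4.1')"
if version_ge "$ver" "4.3"; then
  echo "DCGM ${ver} meets the 4.3 minimum"
else
  echo "DCGM ${ver} is below the 4.3 minimum"
fi
</code></pre>
<p>On Debian-based systems, <code>dpkg --compare-versions</code> offers the same kind of check without a helper function.</p>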
<h3 id="heading-section-8-enabling-persistence-mode">Section 8: Enabling Persistence Mode</h3>
<p>Without persistence mode, the NVIDIA driver tears down GPU state when the last client process exits. When a new workload starts, the driver must reinitialize the GPU and set up memory mappings again. This adds a delay of a few hundred milliseconds to several seconds, depending on the GPU and system.</p>
<p>Enabling <code>nvidia-persistenced</code> keeps the driver state loaded even when no GPU workloads are running.</p>
<pre><code class="language-shellscript">log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced
</code></pre>
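<p>To confirm the setting took effect, you can query the persistence mode of each GPU with <code>nvidia-smi</code>. The sketch below only shows how such output could be checked: the helper name <code>all_persistent</code> is hypothetical, and the example feeds it sample input so it can run on a machine without a GPU.</p>
<pre><code class="language-shellscript">#!/bin/bash
# Illustrative helper (hypothetical name): reads one persistence-mode value per
# line on stdin and succeeds only if at least one GPU is listed and every GPU
# reports "Enabled".
all_persistent() {
  local line
  local seen=0
  while IFS= read -r line; do
    seen=1
    [ "$line" = "Enabled" ] || return 1
  done
  [ "$seen" -eq 1 ]
}

# Example with sample input. On a GPU host, pipe in the output of:
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
if printf 'Enabled\nEnabled\n' | all_persistent; then
  echo "persistence mode is enabled on all GPUs"
fi
</code></pre>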
<h3 id="heading-section-9-system-tuning-for-gpu-compute-workloads">Section 9: System Tuning for GPU Compute Workloads</h3>
<p>This block applies a set of <strong>system‑level performance and stability tunings</strong> that are standard for high‑performance GPU servers, Kubernetes GPU nodes, and ML/AI workloads.</p>
<p>Each line targets a specific bottleneck or instability pattern that appears in real GPU production environments.</p>
<ul>
<li><p>Swap and memory behavior: Disabling swap and setting <code>vm.swappiness=0</code> prevents the kernel from pushing GPU‑bound processes into swap. GPU workloads are extremely sensitive to latency, and swapping can cause CUDA context resets and GPU driver timeouts.</p>
</li>
<li><p>Hugepages for large memory allocations: Setting <code>vm.nr_hugepages=2048</code> allocates a pool of hugepages, which reduces TLB pressure for large contiguous memory allocations.</p>
<p>CUDA, NCCL, and deep‑learning frameworks frequently allocate large buffers, and hugepages reduce page‑table overhead, improving memory bandwidth and lowering latency for large tensor operations. This is especially useful on multi‑GPU servers.</p>
</li>
<li><p>CPU frequency governor: Installing <code>cpupower</code> and forcing the CPU governor to <code>performance</code> ensures the CPU stays at maximum frequency instead of scaling down.</p>
<p>GPU workloads often become CPU-bound during data preprocessing, kernel launches, and NCCL communication. Keeping CPUs at full speed reduces jitter and improves throughput.</p>
</li>
<li><p>NUMA and topology tools: Installing <code>numactl</code>, <code>libnuma-dev</code>, and <code>hwloc</code> provides tools for pinning processes to NUMA nodes, understanding CPU–GPU affinity, and optimizing multi‑GPU placement.</p>
</li>
<li><p>Disabling irqbalance: Stopping and disabling <code>irqbalance</code> lets the NVIDIA driver manage interrupt affinity. On GPU servers, irqbalance can move GPU interrupts to suboptimal CPUs, causing higher latency and lower throughput.</p>
</li>
</ul>
<pre><code class="language-shell">log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system
</code></pre>
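<p>It helps to know how much memory the hugepage pool sets aside, since reserved hugepages are no longer available for regular allocations. With the 2 MiB hugepage size that is the usual default on x86_64, 2048 pages reserve 2048 × 2 MiB = 4 GiB. The sketch below does the same arithmetic from <code>/proc/meminfo</code>. The helper name <code>hugepage_reserved_mib</code> is hypothetical, and the 2048 kB fallback is an assumption for systems where the file isn't readable.</p>
<pre><code class="language-shellscript">#!/bin/bash
# Illustrative helper (hypothetical name): memory reserved by a hugepage pool,
# in MiB, given the page count and the page size in kB.
hugepage_reserved_mib() {
  local pages="$1"
  local page_kb="$2"
  echo $(( pages * page_kb / 1024 ))
}

# Hugepagesize is reported in kB in /proc/meminfo. Fall back to 2048 kB
# (assumed x86_64 default) when the value cannot be read.
page_kb=""
if [ -r /proc/meminfo ]; then
  page_kb="$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo)"
fi
page_kb="${page_kb:-2048}"

echo "vm.nr_hugepages=2048 reserves $(hugepage_reserved_mib 2048 "$page_kb") MiB"
</code></pre>
<p>If your target machine type has limited RAM, you may want to lower the page count so the pool doesn't crowd out other workloads.</p>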
<p>Full base.sh script here:</p>
<pre><code class="language-shell">#!/bin/bash
set -euo pipefail

log()   { echo "[BASE] $1"; }
error() { echo "[BASE][ERROR] $1" &gt;&amp;2; exit 1; }

###############################################################
# Input validation
###############################################################
[[ -z "${DRIVER_VERSION:-}" ]] &amp;&amp; error "DRIVER_VERSION is not set."
[[ -z "${CUDA_VERSION:-}"   ]] &amp;&amp; error "CUDA_VERSION is not set."

log "DRIVER_VERSION : ${DRIVER_VERSION}"
log "CUDA_VERSION   : ${CUDA_VERSION}"

DISTRO=$(. /etc/os-release &amp;&amp; echo "${ID}${VERSION_ID}" | tr -d '.')
ARCH="x86_64"

export DEBIAN_FRONTEND=noninteractive

###############################################################
# 1. System update
###############################################################
log "Updating system packages..."
sudo apt-get update -qq
sudo apt-get upgrade -qq -y

###############################################################
# 2. Pre-installation — kernel headers
#    Source: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
###############################################################
log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

###############################################################
# 3. NVIDIA CUDA Network Repository
###############################################################
log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq

###############################################################
# 4. Pin driver version BEFORE installation (590+ requirement)
###############################################################
log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

###############################################################
# 5. Compute-only (headless) driver — Open Kernel Modules
#    Source: NVIDIA Driver Installation Guide — Compute-only System (Open Kernel Modules)
#
#    libnvidia-compute  = compute libraries only (no GL/Vulkan/display)
#    nvidia-dkms-open   = open-source kernel module built via DKMS
#
#    Open kernel modules are the NVIDIA-recommended choice for
#    Ampere, Hopper, and Blackwell data centre GPUs (A100, H100, etc.)
###############################################################
log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

###############################################################
# 6. CUDA Toolkit
###############################################################
log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat &lt;&lt;'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig

###############################################################
# 7. NVIDIA Container Toolkit
#    Required for GPU workloads in Docker / containerd / Kubernetes
###############################################################
log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi

###############################################################
# 8. DCGM — DataCenter GPU Manager
###############################################################
log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager
 
DCGM_VER=$(dpkg -s datacenter-gpu-manager 2&gt;/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] &amp;&amp; [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi

###############################################################
# 9. NVIDIA Persistence Daemon
#    Keeps the driver loaded between jobs — reduces cold-start
#    latency on the first CUDA call in each new workload
###############################################################
log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

###############################################################
# 10. System tuning for GPU compute workloads
###############################################################
log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system

###############################################################
# Done
###############################################################
log "============================================"
log "Base layer provisioning complete."
log "  OS      : ${DISTRO}"
log "  Driver  : ${DRIVER_VERSION} (open kernel modules, compute-only)"
log "  CUDA    : cuda-toolkit-${CUDA_VERSION}"
log "  DCGM    : ${DCGM_VER}"
log "============================================"
</code></pre>
<h2 id="heading-step-7-assembling-and-running-the-build">Step 7: Assembling and Running the Build</h2>
<p>Validate the template first, then run the build. Validation catches syntax or variable errors early, so the build doesn’t start on a broken config.</p>
<pre><code class="language-shellscript">packer validate -var-file=values.pkrvars.hcl .
</code></pre>
<p>If validation succeeds, you’ll see a short confirmation such as <code>The configuration is valid.</code> Then start the build. Expect the process to create a temporary VM, run your provisioners, and produce an image:</p>
<pre><code class="language-shellscript">packer build -var-file=values.pkrvars.hcl .
</code></pre>
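<p>If you trigger builds from a script or a CI job, you can enforce the validate-then-build order in code. Here is a minimal Python sketch, assuming <code>packer</code> is on the <code>PATH</code>. The function names <code>packer_commands</code> and <code>validate_then_build</code> are hypothetical helpers for illustration, not part of Packer:</p>
<pre><code class="language-python">import subprocess

VAR_FILE = "values.pkrvars.hcl"  # same var file used in the commands above


def packer_commands(var_file=VAR_FILE, template_dir="."):
    """Return the validate and build commands, in the order they should run."""
    common = [f"-var-file={var_file}", template_dir]
    return [["packer", "validate", *common], ["packer", "build", *common]]


def validate_then_build(var_file=VAR_FILE, template_dir="."):
    """Run validate first; check=True raises before build if validation fails."""
    for command in packer_commands(var_file, template_dir):
        subprocess.run(command, check=True)
</code></pre>
<p>Calling <code>validate_then_build()</code> from the project directory runs both steps. Because <code>check=True</code> raises on a non-zero exit code, a broken template never reaches the build step.</p>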
<p>The build typically takes <strong>15–20 minutes</strong>, depending on network speed and package installs. Watch the Packer log for three key checkpoints:</p>
<ul>
<li><p><strong>Instance creation</strong> — confirms the temporary VM was provisioned.</p>
</li>
<li><p><strong>Provisioner output</strong> — shows each script step (updates, reboot, <code>script/base.sh</code>) and any errors.</p>
</li>
<li><p><strong>Image creation</strong> — indicates the build finished and an image artifact was written.</p>
</li>
</ul>
<p>If the build fails, copy the failing provisioner’s log lines and re-run the build after fixing the script or variables. For quick troubleshooting, re-run the failing provisioner locally on a matching test VM to iterate faster.</p>
<pre><code class="language-plaintext">googlecompute.gpu-node: output will be in this color.

==&gt; googlecompute.gpu-node: Checking image does not exist...
==&gt; googlecompute.gpu-node: Creating temporary RSA SSH key for instance...
==&gt; googlecompute.gpu-node: no persistent disk to create
==&gt; googlecompute.gpu-node: Using image: ubuntu-2404-noble-amd64-v20260225
==&gt; googlecompute.gpu-node: Creating instance...
==&gt; googlecompute.gpu-node: Loading zone: us-central1-a
==&gt; googlecompute.gpu-node: Loading machine type: g2-standard-4
==&gt; googlecompute.gpu-node: Requesting instance creation...
==&gt; googlecompute.gpu-node: Waiting for creation operation to complete...
==&gt; googlecompute.gpu-node: Instance has been created!
==&gt; googlecompute.gpu-node: Waiting for the instance to become running...
==&gt; googlecompute.gpu-node: IP: 34.58.58.214
==&gt; googlecompute.gpu-node: Using SSH communicator to connect: 34.58.58.214
==&gt; googlecompute.gpu-node: Waiting for SSH to become available...
systemd-logind.service
==&gt; googlecompute.gpu-node:  systemctl restart unattended-upgrades.service
==&gt; googlecompute.gpu-node:
==&gt; googlecompute.gpu-node: No containers need to be restarted.
==&gt; googlecompute.gpu-node:
==&gt; googlecompute.gpu-node: User sessions running outdated binaries:
==&gt; googlecompute.gpu-node:  packer @ session #1: sshd[1535]
==&gt; googlecompute.gpu-node:  packer @ user manager service: systemd[1540]
==&gt; googlecompute.gpu-node: Pausing 1m0s before the next provisioner...
==&gt; googlecompute.gpu-node: Provisioning with shell script: script/base.sh
==&gt; googlecompute.gpu-node: [BASE] DRIVER_VERSION : 590.48.01
==&gt; googlecompute.gpu-node: [BASE] CUDA_VERSION   : 13.1
==&gt; googlecompute.gpu-node: [BASE] Updating system packages...
==&gt; googlecompute.gpu-node: [BASE] Installing kernel headers and build tools...
==&gt; googlecompute.gpu-node: [BASE] Installing CUDA Toolkit 13.1...
==&gt; googlecompute.gpu-node: [BASE] Installing DCGM...
==&gt; googlecompute.gpu-node: [BASE] Enabling nvidia-persistenced...
==&gt; googlecompute.gpu-node: [BASE] Applying system tuning...
==&gt; googlecompute.gpu-node: vm.swappiness=0
==&gt; googlecompute.gpu-node: vm.nr_hugepages=2048
==&gt; googlecompute.gpu-node: Setting cpu: 0
==&gt; googlecompute.gpu-node: Error setting new values. Common errors:
==&gt; googlecompute.gpu-node: [BASE] ============================================
==&gt; googlecompute.gpu-node: [BASE] Base layer provisioning complete.
==&gt; googlecompute.gpu-node: [BASE]   OS      : ubuntu2404
==&gt; googlecompute.gpu-node: [BASE]   Driver  : 590.48.01 (open kernel modules, compute-only)
==&gt; googlecompute.gpu-node: [BASE]   CUDA    : cuda-toolkit-13.1
==&gt; googlecompute.gpu-node: [BASE]   DCGM    : 1:3.3.9
==&gt; googlecompute.gpu-node: [BASE] ============================================
==&gt; googlecompute.gpu-node: Deleting instance...
==&gt; googlecompute.gpu-node: Instance has been deleted!
==&gt; googlecompute.gpu-node: Creating image...
==&gt; googlecompute.gpu-node: Deleting disk...
==&gt; googlecompute.gpu-node: Disk has been deleted!
==&gt; googlecompute.gpu-node: Running post-processor:  (type shell-local)
==&gt; googlecompute.gpu-node (shell-local): Running local shell script: 
==&gt; googlecompute.gpu-node (shell-local): === Image Build Complete ===
==&gt; googlecompute.gpu-node (shell-local): Image ID: packer-69b6c2ee-883a-3602-7bb5-059f1ba27c8b
==&gt; googlecompute.gpu-node (shell-local): Sun Mar 15 15:50:09 WAT 2026
Build 'googlecompute.gpu-node' finished after 17 minutes 55 seconds.

==&gt; Wait completed after 17 minutes 55 seconds

==&gt; Builds finished. The artifacts of successful builds are:
--&gt; googlecompute.gpu-node: A disk image was created in the 'my_project-00000' project: base-gpu-image-1773585134
</code></pre>
<h2 id="heading-step-8-test-the-image-and-verify-the-gpu-stack">Step 8: Test the Image and Verify the GPU Stack</h2>
<p>Confirm the image exists in the GCP Console: go to <strong>Compute → Storage → Images</strong> and locate your newly created image.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/90f304eb-3fe7-4304-b2ad-d86701dde607.png" alt="Your Image information on GCP" style="display:block;margin:0 auto" width="1686" height="692" loading="lazy">

<p>Create a test VM from the image:</p>
<pre><code class="language-plaintext">gcloud compute instances create my-gpu-vm \
  --machine-type=g2-standard-4 \
  --accelerator=count=1,type=nvidia-l4 \
  --image=base-gpu-image-1772718104 \
  --image-project=YOUR_PROJECT_ID \
  --boot-disk-size=50GB \
  --maintenance-policy=TERMINATE \
  --restart-on-failure \
  --zone=us-central1-a

Created [https://www.googleapis.com/compute/v1/projects/my-project-000/zones/us-central1-a/instances/my-gpu-vm].
NAME       ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP    EXTERNAL_IP      STATUS
my-gpu-vm  us-central1-a  g2-standard-4               10.128.15.227  104.154.184.217  RUNNING
</code></pre>
<p>Once the instance is <code>RUNNING</code>, SSH into it and run <code>nvidia-smi</code> to verify that the NVIDIA driver and GPU are visible:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/364df8fc-7584-40df-8ab7-b3fe349d5065.png" alt="Output from the Nvidia-SMI command showing Driver and CUDA Version" style="display:block;margin:0 auto" width="1508" height="630" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/0912c303-3bb0-47fa-aa34-1c91ff26874f.png" alt="Image verifying the persistence mode is enabled" style="display:block;margin:0 auto" width="1508" height="80" loading="lazy">

<p><strong>The</strong> <code>nvidia-smi</code> <strong>output confirms:</strong></p>
<ul>
<li><p>Driver 590.48.01 loaded</p>
</li>
<li><p>CUDA 13.1 available</p>
</li>
<li><p>Persistence Mode is <code>On</code></p>
</li>
<li><p>The L4 GPU is detected with 23GB VRAM</p>
</li>
<li><p>Zero ECC errors</p>
</li>
<li><p>No running processes (clean idle state)</p>
</li>
</ul>
<p>This is exactly what a healthy base image should look like. Notice <code>Disp.A: Off</code>? That confirms our compute-only driver choice is working — no display adapter is active.</p>
<p>Confirm the installed CUDA toolkit by running <code>nvcc --version</code>. The output shows that version 13.1 was installed as specified.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/cc744624-9408-4348-88d7-61da04b5e1d0.png" alt="Output from the NVCC -Version command" style="display:block;margin:0 auto" width="1508" height="202" loading="lazy">

<p>Let's confirm DCGM installation by running <code>dcgmi discovery -l</code>. Successful output indicates DCGM is running and communicating with the driver.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/114996c6-1f28-43d4-a3fa-13aa7ccd2c82.png" alt="Output from the dcgmi discovery -l command showing device information" style="display:block;margin:0 auto" width="1508" height="714" loading="lazy">
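<p>The manual checks above can also be scripted so that every new image gets the same smoke test. The following is a rough Python sketch, not part of the original project. It assumes <code>nvidia-smi</code> accepts the <code>--query-gpu</code> fields shown and reports persistence mode as <code>Enabled</code> in CSV output. All function names are hypothetical:</p>
<pre><code class="language-python">import subprocess

# Assumption: these query fields and format flags are supported by your nvidia-smi.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,persistence_mode",
    "--format=csv,noheader",
]


def parse_gpu_rows(csv_text):
    """Turn 'name, driver, persistence' CSV lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        name, driver, persistence = [field.strip() for field in line.split(",")]
        rows.append({"name": name, "driver": driver, "persistence": persistence})
    return rows


def check_gpus(rows, expected_driver):
    """Return a list of readable problems; an empty list means the image looks healthy."""
    problems = []
    if not rows:
        problems.append("no GPU detected")
    for row in rows:
        if row["driver"] != expected_driver:
            problems.append(f"{row['name']}: driver {row['driver']}, expected {expected_driver}")
        if row["persistence"] != "Enabled":
            problems.append(f"{row['name']}: persistence mode is {row['persistence']}")
    return problems


def query_gpus():
    """Run nvidia-smi on the test VM and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)
</code></pre>
<p>On the test VM, <code>check_gpus(query_gpus(), "590.48.01")</code> should return an empty list if the driver version and persistence mode match what the build installed.</p>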

<h2 id="heading-conclusion">Conclusion</h2>
<p>You now have a production‑grade, GPU‑optimized base image that includes the NVIDIA compute‑only driver built with open kernel modules, DCGM for monitoring, and the CUDA Toolkit. You also applied OS‑level tuning tailored to GPU compute workloads, providing a consistent, reproducible environment with no manual setup.</p>
<p>From here, you can extend the build by adding an application‑layer script to install frameworks such as PyTorch, TensorFlow, or vLLM, or create an instance template that uses this image to scale your GPU infrastructure.</p>
<p>The full Packer project includes additional scripts for training and inference workloads that you can use to extend your image.</p>
<h2 id="heading-references"><strong>References</strong></h2>
<ul>
<li><p>NVIDIA Driver Installation Guide (Ubuntu): <a href="https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/">https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/</a></p>
</li>
<li><p>NVIDIA CUDA Toolkit Documentation: <a href="https://docs.nvidia.com/cuda/">https://docs.nvidia.com/cuda/</a></p>
</li>
<li><p>NVIDIA Container Toolkit Installation Guide: <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html">https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html</a></p>
</li>
<li><p>NVIDIA DCGM Documentation: <a href="https://docs.nvidia.com/datacenter/dcgm/latest/index.html">https://docs.nvidia.com/datacenter/dcgm/latest/index.html</a></p>
</li>
<li><p>NVIDIA Persistence Daemon: <a href="https://docs.nvidia.com/deploy/driver-persistence/index.html">https://docs.nvidia.com/deploy/driver-persistence/index.html</a></p>
</li>
<li><p>HashiCorp Packer Documentation: <a href="https://developer.hashicorp.com/packer/docs">https://developer.hashicorp.com/packer/docs</a></p>
</li>
<li><p>Packer Google Compute Builder: <a href="https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute">https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Model Packaging Tools Every MLOps Engineer Should Know ]]>
                </title>
                <description>
                    <![CDATA[ Most machine learning deployments don’t fail because the model is bad. They fail because of packaging. Teams often spend months fine-tuning models (adjusting hyperparameters and improving architecture ]]>
                </description>
                <link>https://www.freecodecamp.org/news/model-packaging-tools-every-mlops-engineer-should-know/</link>
                <guid isPermaLink="false">69d3ca7840c9cabf443c9ce3</guid>
                
                    <category>
                        <![CDATA[ ML ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Temitope Oyedele ]]>
                </dc:creator>
                <pubDate>Mon, 06 Apr 2026 15:00:08 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4fa02714-2cea-4592-813e-a5d5ebaf0842.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most machine learning deployments don’t fail because the model is bad. They fail because of packaging.</p>
<p>Teams often spend months fine-tuning models (adjusting hyperparameters and improving architectures) only to hit a wall when it’s time to deploy. Suddenly, the production system can’t even read the model file. Everything breaks at the handoff between research and production.</p>
<p>The good news? If you think about packaging from the start, you can save up to 60% of the time usually spent during deployment. That’s because you avoid the common friction between the experimental environment and the production system.</p>
<p>In this guide, we’ll walk through eleven essential tools every MLOps engineer should know. To keep things clear, we’ll group them into three stages of a model’s lifecycle:</p>
<ul>
<li><p><strong>Serialization</strong>: how models are stored and transferred</p>
</li>
<li><p><strong>Bundling &amp; Serving</strong>: how models are deployed and run</p>
</li>
<li><p><strong>Registry</strong>: how models are tracked and versioned</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-model-serialization-formats">Model Serialization Formats</a></p>
<ul>
<li><p><a href="#heading-1-onnx-open-neural-network-exchange">1. ONNX (Open Neural Network Exchange)</a></p>
</li>
<li><p><a href="#heading-2-torchscript">2. TorchScript</a></p>
</li>
<li><p><a href="#heading-3-tensorflow-savedmodel">3. TensorFlow SavedModel</a></p>
</li>
<li><p><a href="#heading-4-pickle-and-joblib">4. Pickle and Joblib</a></p>
</li>
<li><p><a href="#heading-5-safetensors">5. Safetensors</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-model-bundling-and-serving-tools">Model Bundling and Serving Tools</a></p>
<ul>
<li><p><a href="#heading-1-bentoml">1. BentoML</a></p>
</li>
<li><p><a href="#heading-2-nvidia-triton-inference-server">2. NVIDIA Triton Inference Server</a></p>
</li>
<li><p><a href="#heading-3-torchserve">3. TorchServe</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-model-registries">Model Registries</a></p>
<ul>
<li><p><a href="#heading-1-mlflow-model-registry">1. MLflow Model Registry</a></p>
</li>
<li><p><a href="#heading-2-hugging-face-hub">2. Hugging Face Hub</a></p>
</li>
<li><p><a href="#heading-3-weights-and-biases">3. Weights &amp; Biases</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-model-serialization-formats">Model Serialization Formats</h2>
<p>Serialization is simply the process of turning a trained model into a file that can be stored and moved around. It’s the first step in the pipeline, and it matters more than people think. The format you choose determines how your model will be loaded later in production.</p>
<p>So, you want something that either works across different frameworks or is optimized for the environment where your model will eventually run.</p>
<p>Below are some of the most common tools in this space:</p>
<h3 id="heading-1-onnx-open-neural-network-exchange"><a href="https://onnx.ai/">1. ONNX (Open Neural Network Exchange)</a></h3>
<p>ONNX is basically the common language for model serialization. It lets you train a model in one framework, like PyTorch, and then deploy it somewhere else without running into compatibility issues. It also performs well across different types of hardware.</p>
<p>ONNX separates your training framework from your inference runtime and allows hardware-level optimizations like quantization and graph fusion. It’s also widely supported across cloud platforms and edge devices.</p>
<p><strong>Key considerations:</strong> This format makes it possible to decouple training from deployment, while still enabling performance optimizations across different hardware setups.</p>
<p><strong>When to use it:</strong> Use ONNX when you need portability, especially if different teams or environments are involved.</p>
<h3 id="heading-2-torchscript"><a href="https://docs.pytorch.org/docs/stable/torch.compiler_api.html">2. TorchScript</a></h3>
<p>TorchScript lets you compile PyTorch models into a format that can run without Python. That means you can deploy it in environments like C++ or mobile without carrying the full Python runtime.</p>
<p>It supports two approaches: tracing (recording execution with sample inputs) and scripting (capturing full control flow).</p>
<p><strong>Key considerations:</strong> Its biggest advantage is removing the Python dependency, which helps reduce latency and makes it suitable for more constrained environments.</p>
<p><strong>When to use it:</strong> Best for high-performance systems where Python would be too heavy or introduce security concerns.</p>
<h3 id="heading-3-tensorflow-savedmodel"><a href="https://www.tensorflow.org/guide/saved_model">3. TensorFlow SavedModel</a></h3>
<p>SavedModel is TensorFlow’s native format. It stores everything (the computation graph, weights, and serving logic) in a single directory.</p>
<p>It’s also the standard input format for TensorFlow Serving, TFLite, and Google Cloud AI Platform.</p>
<p><strong>Key considerations:</strong> It keeps everything within the TensorFlow ecosystem intact, so you don’t lose any part of the model when moving to production.</p>
<p><strong>When to use it:</strong> If your project is built on TensorFlow, this is the default and safest choice.</p>
<h3 id="heading-4-pickle-and-joblib">4. <a href="https://docs.python.org/3/library/pickle.html">Pickle</a> and <a href="https://joblib.readthedocs.io/en/stable/">Joblib</a></h3>
<p>Pickle is Python’s built-in way of saving objects, and Joblib builds on top of it to better handle large arrays and models.</p>
<p>These are commonly used for scikit-learn pipelines, XGBoost models, and other traditional ML setups.</p>
<p><strong>Key considerations:</strong> They’re simple and convenient, but come with real trade-offs. Pickle can execute arbitrary code when loading, which makes it unsafe in untrusted environments. It’s also tightly coupled to Python versions and library dependencies, so models can break when moved across environments.</p>
<p><strong>When to use it:</strong> Best suited for controlled environments where everything runs in the same Python stack, such as internal tools, quick prototypes, or batch jobs.</p>
<p>It’s especially practical when you’re working with classical ML models and don’t need cross-language support or long-term portability. Avoid it for production systems that require security, reproducibility, or deployment across different environments.</p>
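<p>To see why loading an untrusted pickle is risky, consider this small, deliberately harmless sketch. The class name is hypothetical, and the payload only calls <code>print</code>, but the same mechanism lets a tampered file call any importable function:</p>
<pre><code class="language-python">import pickle


class NotReallyAModel:
    """Stands in for a tampered model file (hypothetical example)."""

    def __reduce__(self):
        # pickle calls the returned callable with these arguments at load time.
        # A harmless print is used here; an attacker could return os.system instead.
        return (print, ("this line ran during pickle.loads",))


payload = pickle.dumps(NotReallyAModel())
loaded = pickle.loads(payload)  # the message is printed here, during loading

# The loader did not rebuild the object; it returned whatever the callable returned.
print(type(loaded).__name__)
</code></pre>
<p>Nothing in the loading code looks dangerous, which is the problem: the behavior is decided by the bytes in the file, not by the code that reads it.</p>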
<h3 id="heading-5-safetensors"><a href="https://github.com/huggingface/safetensors">5. Safetensors</a></h3>
<p>Safetensors is a newer format developed by Hugging Face. It’s designed to be safe, fast, and straightforward.</p>
<p>It avoids arbitrary code execution and allows efficient loading directly from disk.</p>
<p><strong>Key considerations:</strong> It’s both memory-efficient and secure, which makes it a strong alternative to older formats like Pickle.</p>
<p><strong>When to use it:</strong> Ideal for modern workflows where speed and safety are important.</p>
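<p>The safety property comes from the file layout: a length prefix, a JSON header describing each tensor, and then raw bytes. The sketch below imitates that layout with only the standard library. It is a simplified illustration, not the <code>safetensors</code> library API: it skips the optional metadata entry and any validation, and the function names are hypothetical:</p>
<pre><code class="language-python">import json


def write_tensors(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} into a safetensors-style blob."""
    header, body, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        body += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte little-endian header length, then the JSON header, then the raw data.
    return len(header_bytes).to_bytes(8, "little") + header_bytes + body


def read_tensors(blob):
    """Parse the blob back using only JSON parsing and byte slicing."""
    header_len = int.from_bytes(blob[:8], "little")
    header = json.loads(blob[8:8 + header_len])
    data = blob[8 + header_len:]
    result = {}
    for name, info in header.items():
        start, end = info["data_offsets"]
        result[name] = (info["dtype"], info["shape"], data[start:end])
    return result
</code></pre>
<p>Reading never calls anything stored in the file, and because each tensor's byte range is known from the header, a loader can fetch only the tensors it needs.</p>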
<h2 id="heading-model-bundling-and-serving-tools">Model Bundling and Serving Tools</h2>
<p>Once your model is saved, the next step is making it usable in production. That means wrapping it in a way that can handle requests and connect it to the rest of your system.</p>
<h3 id="heading-1-bentoml"><a href="https://docs.bentoml.com/en/latest/">1. BentoML</a></h3>
<p>BentoML allows you to define your model service in Python – including preprocessing, inference, and postprocessing – and package everything into a single unit called a “Bento.”</p>
<p>This bundle includes the model, code, dependencies, and even Docker configuration.</p>
<p><strong>Key considerations:</strong> It simplifies deployment by packaging everything into one consistent artifact that can run anywhere.</p>
<p><strong>When to use it:</strong> Great when you want to ship your model and all its logic together as one deployable unit.</p>
<h3 id="heading-2-nvidia-triton-inference-server"><a href="https://github.com/triton-inference-server/server">2. NVIDIA Triton Inference Server</a></h3>
<p>Triton is NVIDIA’s production-grade inference server. It supports multiple model formats like ONNX, TorchScript, TensorFlow, and more.</p>
<p>It’s built for performance, using features like dynamic batching and concurrent execution to fully utilize GPUs.</p>
<p><strong>Key considerations:</strong> It delivers high throughput and efficiently uses hardware, especially GPUs, while supporting models from different frameworks.</p>
<p><strong>When to use it:</strong> Best for large-scale deployments where performance, low latency, and GPU usage are critical.</p>
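<p>Dynamic batching is easier to reason about with a toy version. The sketch below is a conceptual illustration only and is not how Triton is implemented or configured. It groups queued requests so the model is called once per batch rather than once per request, and the function names are hypothetical. A real server also has to decide how long to wait for a batch to fill:</p>
<pre><code class="language-python">def make_batches(pending_requests, max_batch_size):
    """Group queued requests into batches of at most max_batch_size."""
    return [
        pending_requests[start:start + max_batch_size]
        for start in range(0, len(pending_requests), max_batch_size)
    ]


def run_batched(pending_requests, model_fn, max_batch_size=4):
    """Call the model once per batch instead of once per request."""
    results = []
    for batch in make_batches(pending_requests, max_batch_size):
        results.extend(model_fn(batch))
    return results
</code></pre>
<p>With ten queued requests and a maximum batch size of four, the model runs three times instead of ten, which is where the GPU throughput gain comes from.</p>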
<h3 id="heading-3-torchserve"><a href="https://docs.pytorch.org/serve/">3. TorchServe</a></h3>
<p>TorchServe is the official serving tool for PyTorch, developed with AWS.</p>
<p>It packages models into a MAR file, which includes weights, code, and dependencies, and provides APIs for managing models in production.</p>
<p><strong>Key considerations:</strong> It offers built-in features for versioning, batching, and management without needing to build everything from scratch.</p>
<p><strong>When to use it:</strong> A solid choice for deploying PyTorch models in a standard production setup.</p>
<h2 id="heading-model-registries">Model Registries</h2>
<p>A model registry is essentially your source of truth. It stores your models, tracks versions, and manages their lifecycle from experimentation to production.</p>
<p>Without one, things quickly become messy and hard to track.</p>
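<p>As a mental model, a registry boils down to three operations: register a new version, promote a version to a stage, and look up what is currently in a stage. The in-memory class below is a hypothetical sketch of that idea. It does not reflect the API of any of the tools that follow:</p>
<pre><code class="language-python">class ToyRegistry:
    """In-memory stand-in for a model registry (hypothetical, for illustration)."""

    def __init__(self):
        self._versions = {}  # maps a model name to its list of version records

    def register(self, name, artifact_uri):
        """Store a new version and return its version number (1, 2, 3, ...)."""
        records = self._versions.setdefault(name, [])
        records.append({"version": len(records) + 1, "uri": artifact_uri, "stage": "None"})
        return records[-1]["version"]

    def promote(self, name, version, stage):
        """Move one version to a stage; the previous holder of that stage is archived."""
        for record in self._versions[name]:
            if record["stage"] == stage:
                record["stage"] = "Archived"
        self._versions[name][version - 1]["stage"] = stage

    def latest(self, name, stage):
        """Return the artifact URI currently assigned to the given stage, if any."""
        for record in self._versions[name]:
            if record["stage"] == stage:
                return record["uri"]
        return None
</code></pre>
<p>Serving code then asks the registry for whatever is in <code>Production</code> instead of hard-coding a file path, so promoting a new version needs no code change.</p>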
<h3 id="heading-1-mlflow-model-registry"><a href="https://mlflow.org/docs/latest/ml/model-registry/">1. MLflow Model Registry</a></h3>
<p>MLflow is one of the most widely used MLOps platforms. Its registry helps manage model versions and track their progression through stages like Staging and Production.</p>
<p>It also links models back to the experiments that created them.</p>
<p><strong>Key considerations:</strong> It provides strong lifecycle management and makes it easier to track and audit models.</p>
<p><strong>When to use it:</strong> Ideal for teams that need structured workflows and clear governance.</p>
<h3 id="heading-2-hugging-face-hub"><a href="https://huggingface.co/docs/hub/index">2. Hugging Face Hub</a></h3>
<p>The Hugging Face Hub is one of the largest platforms for sharing and managing models.</p>
<p>It supports both public and private repositories, along with dataset versioning and interactive demos.</p>
<p><strong>Key considerations:</strong> It offers a huge library of models and makes collaboration very easy.</p>
<p><strong>When to use it:</strong> Perfect for projects involving transformers, generative AI, or anything that benefits from sharing and discovery.</p>
<h3 id="heading-3-weights-and-biases"><a href="https://docs.wandb.ai/models">3. Weights and Biases</a></h3>
<p>Weights &amp; Biases combines experiment tracking with a model registry.</p>
<p>It connects each model directly to the training run that produced it.</p>
<p><strong>Key considerations:</strong> It gives you full traceability, so you always know how a model was created.</p>
<p><strong>When to use it:</strong> Best when you want a strong link between experimentation and production artifacts.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Machine learning systems rarely fail because the models are bad. They fail because the path to production is fragile.</p>
<p>Packaging is what connects research to production. If that connection is weak, even great models won’t make it into real use.</p>
<p>Choosing the right tools across serialization, serving, and registry layers makes systems easier to deploy and maintain. Formats like ONNX and Safetensors improve portability and safety. Tools like Triton and BentoML help with reliable serving. Registries like MLflow and Hugging Face Hub keep everything organized.</p>
<p>The main idea is simple: don’t leave deployment as something to figure out later.</p>
<p>When packaging is planned early, teams move faster and avoid a lot of unnecessary problems.</p>
<p>In practice, success in MLOps isn’t just about building models. It’s about making sure they actually run in the real world.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use MLflow to Manage Your Machine Learning Lifecycle ]]>
                </title>
                <description>
                    <![CDATA[ Training machine learning models usually starts out being organized and ends up in absolute chaos. We’ve all been there: dozens of experiments scattered across random notebooks, and model files saved  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-mlflow-to-manage-your-machine-learning-lifecycle/</link>
                <guid isPermaLink="false">69c18bfc30a9b81e3a92bbbd</guid>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Temitope Oyedele ]]>
                </dc:creator>
                <pubDate>Mon, 23 Mar 2026 18:52:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/f829ab55-926d-43cd-b027-16c754445b09.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Training machine learning models usually starts out being organized and ends up in absolute chaos.</p>
<p>We’ve all been there: dozens of experiments scattered across random notebooks, and model files saved as <code>model_v2_final_FINAL.pkl</code> because no one is quite sure which version actually worked.</p>
<p>Once you move from a solo project to a team, or try to push something to production, that "organized chaos" quickly becomes a serious bottleneck.</p>
<p>Solving this mess requires more than just better naming conventions: it requires a way to standardize how we track and hand off our work. This is the specific gap MLflow was built to fill.</p>
<p>Originally released by the team at Databricks in 2018, it has become a standard open-source platform for managing the entire machine learning lifecycle. It acts as a central hub where your experiments, code, and models live together, rather than being tucked away in forgotten folders.</p>
<p>In this tutorial, we'll cover the core philosophy behind MLflow and how its modular architecture solves the 'dependency hell' of machine learning. We'll break down the four primary pillars of Tracking, Projects, Models, and the Model Registry, and walk through a practical implementation of each so you can move your projects from local notebooks to a production-ready lifecycle.</p>
<h3 id="heading-table-of-contents">Table of Contents:</h3>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites:</a></p>
</li>
<li><p><a href="#heading-mlflow-architecture-the-big-picture">MLflow Architecture: The Big Picture</a></p>
</li>
<li><p><a href="#heading-understanding-mlflow-tracking">Understanding MLflow Tracking</a></p>
<ul>
<li><p><a href="#heading-a-tracking-example">A Tracking Example</a></p>
</li>
<li><p><a href="#heading-where-does-the-data-actually-go">Where Does the Data Actually Go?</a></p>
</li>
<li><p><a href="#heading-why-bother-with-this-setup">Why Bother with This Setup?</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-understanding-mlflow-projects">Understanding MLflow Projects</a></p>
<ul>
<li><p><a href="#heading-the-mlproject-file">The MLproject File</a></p>
</li>
<li><p><a href="#heading-why-this-actually-matters">Why this Actually Matters</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-understanding-the-mlflow-model-registry">Understanding the MLflow Model Registry</a></p>
</li>
<li><p><a href="#heading-moving-a-model-through-the-pipeline">Moving a Model through the Pipeline</a></p>
<ul>
<li><a href="#heading-why-does-this-matter">Why Does This Matter?</a></li>
</ul>
</li>
<li><p><a href="#heading-how-the-components-fit-together">How the Components Fit Together</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites:</h3>
<p>To get the most out of this tutorial, you should have:</p>
<ul>
<li><p><strong>Basic Python proficiency:</strong> Comfort with context managers (<code>with</code> statements) and decorators.</p>
</li>
<li><p><strong>Machine Learning fundamentals:</strong> A general understanding of training/testing splits and model evaluation metrics (like accuracy or loss).</p>
</li>
<li><p><strong>Local Environment:</strong> Python 3.8+ installed. Familiarity with <code>pip</code> or <code>conda</code> for installing packages is helpful.</p>
</li>
</ul>
<h2 id="heading-mlflow-architecture-the-big-picture">MLflow Architecture: The Big Picture</h2>
<p>To understand why MLflow is so effective, you have to look at how it's actually put together. MLflow isn't one giant or rigid tool. It’s a modular system designed around four loosely coupled components that are its core pillars.</p>
<p>This is a big deal because it means you don’t have to commit to the entire ecosystem at once. If you only need to track experiments and don't care about the other features, you can just use that part and ignore the rest.</p>
<p>To make this a bit more concrete, here is how those pieces map to things you probably already use:</p>
<ul>
<li><p><strong>MLflow Tracking:</strong> Logs experiments, metrics, and parameters. (Think: <strong>Git commits for ML runs</strong>)</p>
</li>
<li><p><strong>MLflow Projects:</strong> Packages code for reproducibility. (Think: <strong>A Docker image for ML code</strong>)</p>
</li>
<li><p><strong>MLflow Models:</strong> A standard format for multiple frameworks. (Think: <strong>A universal adapter</strong>)</p>
</li>
<li><p><strong>Model Registry:</strong> Handles model versioning and governance. (Think: <strong>A CI/CD pipeline for models</strong>)</p>
</li>
</ul>
<p>Architecturally, you can think of MLflow in two layers: the Client and the Server.</p>
<p>The Client is where you spend most of your time. It’s your training script or your Jupyter notebook where you log metrics or register a model.</p>
<p>The Server is the brain in the background that handles storage. It consists of a Tracking Server, a Backend Store (usually a database like PostgreSQL), and an Artifact Store, which is where big files like model weights live (typically object storage such as S3 or GCS).</p>
<p>This separation is why MLflow is so flexible. You can start with everything running locally on your laptop using just your file system. When you're ready to scale up to a larger team, you can swap that out for a centralized server and cloud storage with almost no changes to your actual code. It grows with your project instead of forcing you to start over once things get serious.</p>
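<p>As a rough illustration of that swap, here is a minimal sketch in which the only thing that changes between laptop and team setups is one setting. The helper names <code>resolve_tracking_uri</code> and <code>configure_tracking</code> are made up for this example. <code>MLFLOW_TRACKING_URI</code> and <code>mlflow.set_tracking_uri()</code> are real MLflow features, and MLflow also reads that environment variable on its own, so the helper mainly makes the switch visible.</p>
<pre><code class="language-python">import os


def resolve_tracking_uri(default="./mlruns"):
    """Return the tracking location: a shared server if configured, else local files."""
    return os.environ.get("MLFLOW_TRACKING_URI", default)


def configure_tracking():
    # Imported lazily so resolve_tracking_uri() works without MLflow installed
    import mlflow

    mlflow.set_tracking_uri(resolve_tracking_uri())


print(resolve_tracking_uri())
</code></pre>
<p>With nothing set, runs land in the local <code>./mlruns</code> directory. Exporting <code>MLFLOW_TRACKING_URI</code> with your team's server address (for example a hypothetical <code>http://tracking.example.com:5000</code>) should send the same training script's runs to the shared server instead.</p>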
<p>Now, let's look at each of these four pillars of MLflow so you understand how they work.</p>
<h2 id="heading-understanding-mlflow-tracking">Understanding MLflow Tracking</h2>
<p>For most teams, the <strong>Tracking</strong> component is the front door to MLflow. Its job is simple: it acts as a digital lab notebook that records everything happening during a training run.</p>
<p>Instead of you frantically trying to remember what your learning rate was or where you saved that accuracy plot, MLflow just sits in the background and logs it for you.</p>
<p>The core unit here is the <strong>run</strong>. Think of a run as a single execution of your training code. During that run, the architecture captures four specific types of information:</p>
<ul>
<li><p><strong>Parameters:</strong> Your inputs, like batch size or the number of trees in a forest.</p>
</li>
<li><p><strong>Metrics:</strong> Your outputs, like accuracy or loss, which can be tracked over time.</p>
</li>
<li><p><strong>Artifacts:</strong> The "heavy" stuff, such as model weights, confusion matrices, or images.</p>
</li>
<li><p><strong>Tags and Metadata:</strong> Context like which developer ran the code and which Git commit was used.</p>
</li>
</ul>
<h3 id="heading-a-tracking-example">A Tracking Example</h3>
<p>Seeing this in practice is the best way to understand how the architecture actually works. You don't need to rebuild your entire pipeline – you just wrap your training logic in a context manager.</p>
<p>Here is what a basic integration looks like in Python:</p>
<pre><code class="language-python">import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Example data (assumption: any train/test split works here; Iris keeps it small)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# This block opens the run and keeps things organized
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)

    # Train the model
    model = RandomForestClassifier(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)

    # Log metrics
    accuracy = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", accuracy)

    # Log the model as an artifact
    mlflow.sklearn.log_model(model, "random_forest_model")
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/0c63f9c4-3f16-4591-be58-51a0acca5f80.png" alt="A comparison table in the MLflow UI showing three training runs side-by-side, highlighting differences in parameters and metrics." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The <code>mlflow.start_run()</code> context manager creates a new run and automatically closes it when the block exits. Everything logged inside that block is associated with that run: parameters and metrics are written to the Backend Store, while the model files go to the Artifact Store.</p>
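<p>The example above logs a single accuracy value, but metrics can also be tracked over time, and runs can carry tags. The sketch below shows one way that might look. <code>log_training_curve</code> and <code>RecordingTracker</code> are hypothetical names invented for this illustration. The only MLflow calls assumed are <code>log_metric(key, value, step=...)</code> and <code>set_tag(key, value)</code>, which the <code>mlflow</code> module provides.</p>
<pre><code class="language-python">def log_training_curve(tracker, losses, tags=None):
    """Log one loss value per epoch, plus optional tags, to the active run.

    `tracker` is anything exposing log_metric(key, value, step=...) and
    set_tag(key, value); the mlflow module itself satisfies this.
    """
    for key, value in (tags or {}).items():
        tracker.set_tag(key, value)
    for epoch, loss in enumerate(losses):
        # step lets the UI draw the metric as a curve instead of one number
        tracker.log_metric("loss", loss, step=epoch)
    return len(losses)


class RecordingTracker:
    """Hypothetical stand-in so the sketch runs without MLflow installed."""

    def __init__(self):
        self.records = []

    def set_tag(self, key, value):
        self.records.append(("tag", key, value))

    def log_metric(self, key, value, step=None):
        self.records.append(("metric", key, value, step))


tracker = RecordingTracker()
log_training_curve(tracker, [0.9, 0.5, 0.3], tags={"developer": "alice"})
print(tracker.records)
</code></pre>
<p>In a real training script you would pass the <code>mlflow</code> module itself, for example <code>log_training_curve(mlflow, losses)</code> inside a <code>with mlflow.start_run():</code> block.</p>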
<h3 id="heading-where-does-the-data-actually-go">Where Does the Data Actually Go?</h3>
<p>When you’re just starting out on your laptop, MLflow keeps things simple by creating a local <code>./mlruns</code> directory. The real power shows up when you move to a team environment and point everyone to a centralized Tracking Server.</p>
<p>The system splits the data based on how "heavy" it is. Your structured data (parameters and metrics) is small and needs to be searchable, so it goes into a SQL database like PostgreSQL. Your unstructured data (the actual model files or large plots) is too bulky for a database. The architecture ships that off to an Artifact Store like Amazon S3 or Google Cloud Storage.</p>
<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/e8aa2e4e-09a8-4767-a1f3-b07810680615.png" alt="The MLflow Artifact Store view showing the directory structure for a logged model, including the MLmodel metadata and model.pkl file." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-why-bother-with-this-setup">Why Bother with This Setup?</h3>
<p>Relying on "vibes" and messy naming conventions is a recipe for disaster once your project grows. It might work for a day or two, but it falls apart the moment you need to compare twenty different versions of a model.</p>
<p>By separating the tracking into its own architectural pillar, MLflow gives you a queryable history. Instead of digging through old notebooks, you can just hop into the UI, filter for the best results, and see exactly which configuration got you there. It takes the guesswork out of the "science" part of data science.</p>
<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/cd83e4b7-38b7-4644-8166-e48ba00d581a.png" alt="An MLflow Parallel Coordinates plot visualizing the relationship between the number of estimators and model accuracy across multiple runs." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/6d1383f5-7ace-4b9d-a566-64a3807cdcd7.png" alt="An MLflow scatter plot illustrating the positive correlation between the n_estimators parameter and the resulting model accuracy." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">
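<p>The same queryable history is available from code, not just the UI. The following is a minimal sketch, assuming runs were logged with a <code>max_depth</code> parameter and an <code>accuracy</code> metric as in the earlier example. <code>best_runs_query</code> and <code>find_best_runs</code> are hypothetical helper names; <code>mlflow.search_runs()</code> is the real MLflow call they wrap.</p>
<pre><code class="language-python">def best_runs_query(metric="accuracy", max_depth=None):
    """Build the filter and ordering arguments for a run search."""
    # Parameters are stored as strings, so the value is quoted in the filter
    filter_string = "" if max_depth is None else f"params.max_depth = '{max_depth}'"
    return {"filter_string": filter_string, "order_by": [f"metrics.{metric} DESC"]}


def find_best_runs(top_n=5, **query_options):
    # Imported lazily so best_runs_query() works without MLflow installed
    import mlflow

    # search_runs returns a pandas DataFrame, one row per run
    return mlflow.search_runs(max_results=top_n, **best_runs_query(**query_options))


print(best_runs_query(max_depth=5))
</code></pre>
<p>Calling <code>find_best_runs(top_n=3, max_depth=5)</code> against a tracking store with logged runs should return the three most accurate runs that used that depth, searching the currently active experiment.</p>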

<h2 id="heading-understanding-mlflow-projects">Understanding MLflow Projects</h2>
<p>You can train the most accurate model in the world, but if your colleague can’t reproduce your results on their machine, that model isn't worth much.</p>
<p>This is where MLflow Projects come in. They solve the reproducibility headache by providing a standard way to package your code, your dependencies, and your entry points into one neat bundle.</p>
<p>Think of an MLflow Project as a directory (or a Git repo) with a special "instruction manual" at its root called an <code>MLproject</code> file. This file tells anyone (or any server) exactly what environment is needed and how to kick off the execution.</p>
<h3 id="heading-the-mlproject-file">The MLproject File</h3>
<p>Instead of sending someone a long README with installation steps, you just give them this file. Here is what a typical <code>MLproject</code> setup looks like for a training pipeline:</p>
<pre><code class="language-yaml">name: my_ml_project
conda_env: conda.yaml

entry_points:
  train:
    parameters:
      learning_rate: {type: float, default: 0.01}
      epochs: {type: int, default: 50}
      data_path: {type: str}
    command: "python train.py --lr {learning_rate} --epochs {epochs} --data {data_path}"
  
  evaluate:
    parameters:
      model_path: {type: str}
    command: "python evaluate.py --model {model_path}"
</code></pre>
<p>The <code>conda_env</code> line points to a <code>conda.yaml</code> file that lists the exact Python packages and versions your code needs. If you want even more isolation, MLflow supports Docker environments too.</p>
<p>The beauty of this setup is the simplicity. Anyone with MLflow installed can run your entire project with a single command:</p>
<pre><code class="language-bash">mlflow run . -P learning_rate=0.001 -P epochs=100 -P data_path=./data/train.csv
</code></pre>
<h3 id="heading-why-this-actually-matters">Why this Actually Matters</h3>
<p>MLflow Projects really shine in two specific scenarios. The first is onboarding. A new team member can clone your repo and be up and running in minutes, rather than spending their entire first day debugging library version conflicts.</p>
<p>The second is CI/CD. Because these projects are triggered programmatically, they fit perfectly into automated retraining pipelines. When reproducibility is non-negotiable, having a "single source of truth" for how to run your code makes life a lot easier for everyone involved.</p>
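<p>To make "triggered programmatically" concrete, here is a rough sketch of how a retraining job might launch the <code>train</code> entry point from Python instead of the command line. <code>build_train_params</code> and <code>launch_training</code> are hypothetical helpers written for this example; <code>mlflow.run()</code> is the real API, and the parameter names mirror the <code>MLproject</code> file shown earlier.</p>
<pre><code class="language-python">def build_train_params(data_path, learning_rate=0.01, epochs=50):
    """Mirror the parameters declared for the train entry point."""
    return {
        "learning_rate": float(learning_rate),
        "epochs": int(epochs),
        "data_path": str(data_path),
    }


def launch_training(project_uri=".", **overrides):
    # Imported lazily so build_train_params() works without MLflow installed
    import mlflow

    submitted = mlflow.run(
        uri=project_uri,
        entry_point="train",
        parameters=build_train_params(**overrides),
    )
    return submitted.run_id


print(build_train_params("./data/train.csv", learning_rate=0.001, epochs=100))
</code></pre>
<p>A scheduled pipeline step could then call something like <code>launch_training(data_path="./data/train.csv")</code> and record the returned run ID for later registration.</p>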
<h2 id="heading-understanding-the-mlflow-model-registry">Understanding the MLflow Model Registry</h2>
<p>Tracking experiments tells you which model is the "winner," but the Model Registry is where you actually manage that winner’s journey from your notebook to a live production environment.</p>
<p>Think of it as the governance layer. It handles versioning, stage management, and creates a clear audit trail so you never have to guess which model is currently running in the wild.</p>
<p>The Registry uses a few simple concepts to keep things organized:</p>
<ul>
<li><p><strong>Registered Model:</strong> This is the overall name for your project, like CustomerChurnPredictor.</p>
</li>
<li><p><strong>Model Version:</strong> Every time you push a new iteration, MLflow auto-increments the version (v1, v2, and so on).</p>
</li>
<li><p><strong>Stage:</strong> These are labels like <strong>Staging</strong>, <strong>Production</strong>, or <strong>Archived</strong>. They tell your team exactly where a model stands in its lifecycle.</p>
</li>
<li><p><strong>Annotations:</strong> These are just notes and tags. They’re great for documenting why a specific version was promoted or what its quirks are.</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/bcd77d8f-a37c-4b0f-a112-9e2ad36d8cc2.png" alt="The MLflow Model Registry interface showing Version 1 of the IrisClassifier model officially transitioned to the Production stage." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-moving-a-model-through-the-pipeline">Moving a Model through the Pipeline</h2>
<p>In a real-world workflow, you don't just "deploy" a file. You transition it through stages. Here's how that looks using the MLflow Client:</p>
<pre><code class="language-python">import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# First, we register the model from a run that went well
# (run_id is the ID of the tracking run that logged "random_forest_model")
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/random_forest_model",
    name="CustomerChurnPredictor"
)

# Then, we move Version 1 to Staging so the QA team can look at it
client.transition_model_version_stage(
    name="CustomerChurnPredictor",
    version=1,
    stage="Staging"
)

# Once everything checks out, we promote it to Production
client.transition_model_version_stage(
    name="CustomerChurnPredictor",
    version=1,
    stage="Production"
)
</code></pre>
<h3 id="heading-why-does-this-matter">Why Does This Matter?</h3>
<p>The Model Registry solves a problem that usually gets messy the moment a team grows: knowing exactly which version is live, who approved it, and what it was compared against. Without this, that information usually ends up buried in Slack threads or outdated spreadsheets.</p>
<p>It also makes rollbacks much less painful. If Version 3 starts acting up in production, you don't need to redeploy your entire stack. You can just transition Version 2 back to the "Production" stage in the registry. If your serving infrastructure is built to always load whichever version holds the "Production" stage, it picks up the stable version the next time it loads the model.</p>
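<p>On the serving side, "always pull Production" can be as small as building a registry URI. This is a minimal sketch under that assumption: <code>stage_model_uri</code> and <code>load_current_model</code> are hypothetical helper names, while the <code>models:/name/stage</code> URI scheme and <code>mlflow.pyfunc.load_model()</code> are real MLflow features.</p>
<pre><code class="language-python">def stage_model_uri(name, stage="Production"):
    """Build the registry URI that resolves to whichever version holds the stage."""
    return f"models:/{name}/{stage}"


def load_current_model(name, stage="Production"):
    # Imported lazily so stage_model_uri() works without MLflow installed
    import mlflow.pyfunc

    # The registry decides which version this is; the caller never hard-codes it
    return mlflow.pyfunc.load_model(stage_model_uri(name, stage))


print(stage_model_uri("CustomerChurnPredictor"))
</code></pre>
<p>Because the version number never appears in the serving code, a rollback in the registry needs no code change. Note that newer MLflow releases steer users toward model aliases rather than stages, so check the documentation for the version you run.</p>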
<h2 id="heading-how-the-components-fit-together">How the Components Fit Together</h2>
<p>To see how all of this actually works in the real world, it helps to walk through a typical workflow from start to finish. It's essentially a relay race where each component hands off the baton to the next one.</p>
<p>It starts with a data scientist running a handful of experiments. Every time they hit run, MLflow Tracking is in the background taking notes. It logs parameters and metrics to the Backend Store and saves model artifacts to the Artifact Store automatically. At this stage, everything is about exploration and finding that one winner.</p>
<p>Once that best run is identified, the model gets officially registered in the Model Registry. This is where the team takes over. They can hop into the UI to check the annotations, review the evaluation results, and move the model into Staging. After it passes a few more validation tests, it gets the green light and is promoted to Production.</p>
<p>When it is time to actually serve the model, the deployment system simply asks the Registry for the current Production version. This happens whether you are using Kubernetes, a cloud endpoint, or MLflow’s built-in server.</p>
<p>Because the MLproject file handled the dependencies and the MLflow Models format handled the framework details, the serving infrastructure does not have to care if the model was built with Scikit-learn or PyTorch. The hand-off is smooth because all the necessary info is already there.</p>
<p>This flow is what turns MLflow from a collection of useful utilities into a full MLOps platform. It connects the messy experimental phase of data science to the rigid world of production software.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>At the end of the day, MLflow architecture is built to stay out of your way. It doesn't force you to change how you write your code or which libraries you use. Instead, it just provides the structure needed to make your machine learning projects reproducible and easier to manage as a team.</p>
<p>Whether you're just trying to get away from naming files <code>model_final_v2.pkl</code> or you are building a complex CI/CD pipeline for your models, understanding these four pillars is the best place to start. The best way to learn is to just fire up a local tracking server and start logging. You will probably find that once you have that "source of truth" for your experiments, you will never want to go back to the old way of doing things.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an End-to-End ML Platform Locally: From Experiment Tracking to CI/CD ]]>
                </title>
                <description>
                    <![CDATA[ Machine learning projects don’t end at training a model in a Jupyter notebook. The hard part is the “last mile”: turning that notebook model into something you can run reliably, update safely, and tru ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-end-to-end-ml-platform-locally-from-experiment-tracking-to-cicd/</link>
                <guid isPermaLink="false">69b9bab4c22d3eeb8afd5284</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Platform Engineering  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ FastAPI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sandeep Bharadwaj Mannapur ]]>
                </dc:creator>
                <pubDate>Tue, 17 Mar 2026 20:33:56 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/8401d978-0bed-4534-af93-f6bfc1b77c89.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Machine learning projects don’t end at training a model in a Jupyter notebook. The hard part is the “last mile”: turning that notebook model into something you can run reliably, update safely, and trust over time.</p>
<p>Most ML systems fail in production for boring (and painful) reasons: the training code and the serving code drift apart, input data changes shape, a “small” preprocessing tweak breaks predictions, or the model silently degrades because real-world behavior shifts. None of these problems are solved by a better algorithm. They’re solved by engineering: repeatable pipelines, validation, versioning, monitoring, and automated checks.</p>
<p>In this hands-on handbook, you’ll build a complete mini ML platform on your local machine, an end-to-end project that takes a model from training to deployment with the core “last mile” infrastructure in place. We’ll use a fraud detection example (predicting fraudulent transactions), but the same workflow works for churn prediction or any binary classification problem. Everything runs locally (no cloud required), and every step is copy-paste runnable so you can follow along and verify outputs as you go.</p>
<p>By the end, you'll have a production-ready ML pipeline running on your machine, from training the model to serving predictions, with the infrastructure to test, monitor, and iterate with confidence. Let's dive in!</p>
<p>📦 <strong>Get the Complete Code</strong><br>All code from this handbook is available in a ready-to-run repository:<br><strong>Repository:</strong> <a href="https://github.com/sandeepmb/freecodecamp-local-ml-platform">https://github.com/sandeepmb/freecodecamp-local-ml-platform</a><br>Clone it and follow along, or use it as a reference implementation.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-project-overview-and-setup">Project Overview and Setup</a></p>
</li>
<li><p><a href="#heading-1-build-a-simple-model-and-api-the-naive-approach">Build a Simple Model and API (The Naive Approach)</a></p>
<ul>
<li><p><a href="#heading-11-train-a-quick-model">Train a Quick Model</a></p>
</li>
<li><p><a href="#heading-12-serve-predictions-with-fastapi">Serve Predictions with FastAPI</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-2-where-the-naive-approach-breaks">Where the Naive Approach Breaks</a></p>
<ul>
<li><p><a href="#heading-problem-1-no-experiment-tracking-reproducibility">Problem 1: No Experiment Tracking (Reproducibility)</a></p>
</li>
<li><p><a href="#heading-problem-2-model-versioning-and-deployment-chaos">Problem 2: Model Versioning and Deployment Chaos</a></p>
</li>
<li><p><a href="#heading-problem-3-no-data-validation-garbage-in-garbage-out">Problem 3: No Data Validation – Garbage In, Garbage Out</a></p>
</li>
<li><p><a href="#heading-problem-4-model-drift-performance-decay-over-time">Problem 4: Model Drift – Performance Decay Over Time</a></p>
</li>
<li><p><a href="#heading-problem-5-no-ci-cd-or-deployment-safety">Problem 5: No CI/CD or Deployment Safety</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-3-add-experiment-tracking-and-model-registry-with-mlflow">Add Experiment Tracking and Model Registry with MLflow</a></p>
<ul>
<li><p><a href="#heading-31-how-to-set-up-the-mlflow-tracking-server">How to Set Up the MLflow Tracking Server</a></p>
</li>
<li><p><a href="#heading-32-how-to-log-experiments-in-code">How to Log Experiments in Code</a></p>
</li>
<li><p><a href="#heading-33-how-to-use-the-model-registry">How to Use the Model Registry</a></p>
</li>
<li><p><a href="#heading-34-update-api-to-load-from-registry">Update API to Load from Registry</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-4-ensure-feature-consistency-with-feast">Ensure Feature Consistency with Feast</a></p>
<ul>
<li><p><a href="#heading-41-what-is-feast-and-why-use-it">What Is Feast and Why Use It?</a></p>
</li>
<li><p><a href="#heading-42-install-and-initialize-feast">Install and Initialize Feast</a></p>
</li>
<li><p><a href="#heading-43-define-feature-definitions">Define Feature Definitions</a></p>
</li>
<li><p><a href="#heading-44-materialize-features-to-the-online-store">Materialize Features to the Online Store</a></p>
</li>
<li><p><a href="#heading-45-retrieve-features-for-training-and-serving">Retrieve Features for Training and Serving</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-5-add-data-validation-with-great-expectations">Add Data Validation with Great Expectations</a></p>
<ul>
<li><p><a href="#heading-51-define-expectations">Define Expectations</a></p>
</li>
<li><p><a href="#heading-52-integrate-validation-into-fastapi">Integrate Validation into FastAPI</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-6-monitor-model-performance-and-data-drift">Monitor Model Performance and Data Drift</a></p>
<ul>
<li><p><a href="#heading-61-the-four-pillars-of-ml-observability">The Four Pillars of ML Observability</a></p>
</li>
<li><p><a href="#heading-62-build-a-drift-monitor-with-evidently">Build a Drift Monitor with Evidently</a></p>
</li>
<li><p><a href="#heading-63-production-monitoring-strategy">Production Monitoring Strategy</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-7-automate-testing-and-deployment-with-ci-cd">Automate Testing and Deployment with CI/CD</a></p>
<ul>
<li><p><a href="#heading-71-write-tests-for-data-and-model">Write Tests for Data and Model</a></p>
</li>
<li><p><a href="#heading-72-github-actions-workflow">GitHub Actions Workflow</a></p>
</li>
<li><p><a href="#heading-73-dockerize-the-application">Dockerize the Application</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-8-incident-response-playbook">Incident Response Playbook</a></p>
<ul>
<li><p><a href="#heading-scenario-false-positive-spike">Scenario: False Positive Spike</a></p>
</li>
<li><p><a href="#heading-scenario-gradual-performance-decay">Scenario: Gradual Performance Decay</a></p>
</li>
<li><p><a href="#heading-scenario-upstream-data-schema-change">Scenario: Upstream Data Schema Change</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-9-how-to-put-it-all-together">How to Put It All Together</a></p>
</li>
<li><p><a href="#heading-10-whats-next-scale-to-production">What’s Next: Scale to Production</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ol>
<h2 id="heading-project-overview-and-setup"><strong>Project Overview and Setup</strong></h2>
<p>Before we jump into coding, let's set the stage. Our use-case is <strong>credit card fraud detection</strong> – a binary classification problem where we predict whether a transaction is fraudulent (<code>is_fraud = 1</code>) or legitimate (<code>is_fraud = 0</code>). This is a common ML task and a good proxy for production ML challenges because fraud patterns can change over time (allowing us to discuss model drift), and bad input data (for example, malformed transaction info) can cause serious issues if not handled properly.</p>
<h3 id="heading-tech-stack"><strong>Tech Stack</strong></h3>
<p>We will use Python-based tools that are popular in MLOps but still beginner-friendly:</p>
<table>
<thead>
<tr>
<th><strong>Tool</strong></th>
<th><strong>Purpose</strong></th>
<th><strong>Why We Chose It</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>MLflow</strong></td>
<td>Experiment tracking and model registry</td>
<td>Open-source, widely adopted, great UI</td>
</tr>
<tr>
<td><strong>Feast</strong></td>
<td>Feature store for consistent feature serving</td>
<td>Production-grade, runs locally, same API for offline/online</td>
</tr>
<tr>
<td><strong>FastAPI</strong></td>
<td>High-performance web framework for serving predictions</td>
<td>Fast, automatic docs, modern Python</td>
</tr>
<tr>
<td><strong>Great Expectations</strong></td>
<td>Data validation framework</td>
<td>Declarative expectations, great reports</td>
</tr>
<tr>
<td><strong>Evidently</strong></td>
<td>Monitoring for data drift and model decay</td>
<td>Beautiful reports, easy to integrate</td>
</tr>
<tr>
<td><strong>Docker</strong></td>
<td>Containerization for environment consistency</td>
<td>Industry standard, works everywhere</td>
</tr>
<tr>
<td><strong>GitHub Actions</strong></td>
<td>CI/CD automation</td>
<td>Free for public repos, tight GitHub integration</td>
</tr>
</tbody></table>
<p>Let me explain each tool briefly:</p>
<p><strong>MLflow</strong> is an open-source platform designed to manage the ML lifecycle. It provides experiment tracking (logging parameters, metrics, and artifacts), a model registry (versioning models with aliases), and model serving capabilities. We'll use it to ensure our experiments are reproducible and our models are versioned.</p>
<p><strong>Feast</strong> (short for Feature Store) is an open-source tool that helps manage and serve features consistently between training and inference. This prevents a common problem called "training-serving skew", where the features used in production differ slightly from those used in training, causing silent accuracy degradation.</p>
<p><strong>FastAPI</strong> is a modern, fast web framework for building APIs with Python. It's known for being easy to use, efficient, and producing automatic interactive documentation. We'll use it to serve our model predictions.</p>
<p><strong>Great Expectations</strong> is an open-source tool for data quality testing. It allows us to define "expectations" on data (like "amount should be positive" or "hour should be between 0 and 23") and test incoming data against them.</p>
<p><strong>Evidently</strong> is an open-source library for monitoring data and model performance over time. It can detect data drift (when input distributions change) and model decay (when accuracy drops).</p>
<p><strong>Docker</strong> ensures the same environment and dependencies in development and deployment, avoiding the classic "works on my machine" problem.</p>
<p><strong>GitHub Actions</strong> provides CI/CD automation. An efficient CI/CD pipeline helps integrate and deploy changes faster and with fewer errors.</p>
<p>💡 <strong>Mental Model</strong>: Think of this as building a "safety net" around your ML model. Each tool we add catches a different failure mode, like defensive driving for machine learning.</p>
<h3 id="heading-prerequisites"><strong>Prerequisites</strong></h3>
<p>You'll need:</p>
<ul>
<li><p><strong>Python 3.9+</strong> installed on your machine</p>
</li>
<li><p><strong>Docker Desktop</strong> installed and running</p>
</li>
<li><p><strong>GitHub account</strong> (if you want to try the CI/CD pipeline)</p>
</li>
<li><p><strong>Basic familiarity with Python</strong> and ML concepts (what training and prediction mean)</p>
</li>
</ul>
<p>You don't need MLOps or Kubernetes experience. Everything will be done locally with just Python and Docker – <strong>no cloud and no Kubernetes needed</strong>.</p>
<h3 id="heading-project-structure"><strong>Project Structure</strong></h3>
<p>Let's set up a basic project structure on your local machine. Open your terminal and run:</p>
<pre><code class="language-bash"># Create project directory and subfolders
mkdir ml-platform-tutorial &amp;&amp; cd ml-platform-tutorial
mkdir -p data models src tests feature_repo

# Set up a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
</code></pre>
<p>Your project structure should look like this:</p>
<pre><code class="language-plaintext">ml-platform-tutorial/
├── data/              # Training and test datasets
├── models/            # Saved model files
├── src/               # Source code
├── tests/             # Test files
├── feature_repo/      # Feast feature repository
├── venv/              # Virtual environment
└── requirements.txt   # Dependencies
</code></pre>
<p>Next, create a <code>requirements.txt</code> with all the necessary libraries:</p>
<pre><code class="language-plaintext"># requirements.txt

# Core ML libraries
pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0

# Experiment tracking and model registry
mlflow==2.10.0

# Feature store
feast==0.36.0

# API framework
fastapi==0.109.0
uvicorn==0.27.0
httpx==0.26.0

# Data validation
great-expectations==0.18.8

# Monitoring
evidently==0.7.20

# Testing
pytest==8.0.0
pytest-cov==4.1.0

# Utilities
pyarrow==15.0.0
pydantic==2.6.0
</code></pre>
<p>📌 <strong>Version Note:</strong> Exact versions are pinned to ensure reproducibility. Newer versions may work, but all examples were tested with the versions listed here.</p>
<p>Install the dependencies:</p>
<pre><code class="language-bash">pip install -r requirements.txt
</code></pre>
<p>This might take a few minutes as it installs all the packages. Once complete, we're ready to start building our project step by step.</p>
<p><strong>Checkpoint:</strong> You should have a project folder with <code>data/</code>, <code>models/</code>, <code>src/</code>, <code>tests/</code>, and <code>feature_repo/</code> directories, and an activated virtual environment with all dependencies installed. Verify by running <code>python -c "import mlflow; import feast; import fastapi; print('All imports successful!')"</code>.</p>
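<p>If you also want to confirm that the installed versions match the pins, a small standard-library check like the one below can help. This is only a sketch: <code>check_pins</code> is a hypothetical helper, and the dictionary lists just a few of the packages from <code>requirements.txt</code>.</p>
<pre><code class="language-python">from importlib import metadata


def check_pins(pins):
    """Compare installed package versions against the pinned ones.

    Returns a dict of package name to (expected, installed) for every mismatch;
    installed is None when the package is missing.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            problems[name] = (expected, installed)
    return problems


# A few of the pins from requirements.txt
pins = {"mlflow": "2.10.0", "feast": "0.36.0", "fastapi": "0.109.0"}
print(check_pins(pins) or "All pinned versions match")
</code></pre>
<p>An empty result means every listed package is installed at its pinned version. Anything else shows the expected and installed versions side by side.</p>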
<p><strong>Figure 1: The Complete ML Platform We'll Build</strong></p>
<p><em>Don't worry if this looks complex. We'll build each component step by step, starting with the simplest piece and connecting them together.</em></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771392341567/4bfdd727-32fb-4f30-a63e-c94f61a9f2db.png" alt="Architecture diagram of a local end-to-end machine learning platform for fraud detection. Transaction data flows through model training, experiment tracking and model registry in MLflow, feature management in Feast, data validation with Great Expectations, prediction serving through FastAPI, monitoring with Evidently, and automated testing and deployment with Docker and GitHub Actions." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-1-build-a-simple-model-and-api-the-naive-approach"><strong>1. Build a Simple Model and API (The Naive Approach)</strong></h2>
<p>To illustrate why we need all these tools, let's start by building a <strong>naive ML system without any MLOps infrastructure</strong>. We'll train a simple model and deploy it quickly, then observe what problems arise. This "naive approach" is how most ML projects start – and understanding its limitations will motivate the solutions we implement later.</p>
<h3 id="heading-11-train-a-quick-model"><strong>1.1 Train a Quick Model</strong></h3>
<p>First, we need some data. For simplicity, we'll generate a synthetic dataset for fraud detection so that we don't rely on any external data files. The dataset will have features like:</p>
<ul>
<li><p><code>amount</code>: Transaction amount in dollars</p>
</li>
<li><p><code>hour</code>: Hour of the day (0-23) when the transaction occurred</p>
</li>
<li><p><code>day_of_week</code>: Day of the week (0=Monday, 6=Sunday)</p>
</li>
<li><p><code>merchant_category</code>: Type of merchant (grocery, restaurant, retail, online, travel)</p>
</li>
<li><p><code>is_fraud</code>: Label indicating if the transaction is fraudulent (1) or legitimate (0)</p>
</li>
</ul>
<p>We'll simulate a dataset in which only about 2% of transactions are fraudulent, an imbalance typical of real fraud data. This imbalance matters because it affects how we evaluate our model.</p>
<p>Create <code>src/generate_data.py</code>:</p>
<pre><code class="language-python"># src/generate_data.py
"""
Generate synthetic fraud detection dataset.

This script creates realistic-looking transaction data where fraudulent
transactions have different patterns than legitimate ones:
- Fraud tends to have higher amounts
- Fraud tends to occur late at night
- Fraud is more common for online and travel merchants
"""
import pandas as pd
import numpy as np

def generate_transactions(n_samples=10000, fraud_ratio=0.02, seed=42):
    """
    Generate synthetic fraud detection dataset.
    
    Args:
        n_samples: Total number of transactions to generate
        fraud_ratio: Proportion of fraudulent transactions (default 2%)
        seed: Random seed for reproducibility
    
    Returns:
        DataFrame with transaction features and fraud labels
    
    Fraud transactions have different patterns:
    - Higher amounts (median ~$245 vs ~$33 for legit)
    - Late night hours (0-5, 23)
    - More likely to be online or travel merchants
    """
    np.random.seed(seed)
    n_fraud = int(n_samples * fraud_ratio)
    n_legit = n_samples - n_fraud

    # Legitimate transactions: normal shopping patterns
    # - Amounts follow a log-normal distribution (most small, some large)
    # - Hours are uniformly distributed throughout the day
    # - Merchant categories weighted toward everyday shopping
    legit = pd.DataFrame({
        "amount": np.random.lognormal(mean=3.5, sigma=1.2, size=n_legit),  # median ~$33
        "hour": np.random.randint(0, 24, size=n_legit),
        "day_of_week": np.random.randint(0, 7, size=n_legit),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_legit,
            p=[0.30, 0.25, 0.25, 0.15, 0.05]  # Weighted toward everyday shopping
        ),
        "is_fraud": 0
    })
    
    # Fraudulent transactions: suspicious patterns
    # - Higher amounts (fraudsters go big)
    # - Late night hours (less scrutiny)
    # - More online and travel (easier to exploit)
    fraud = pd.DataFrame({
        "amount": np.random.lognormal(mean=5.5, sigma=1.5, size=n_fraud),  # median ~$245
        "hour": np.random.choice([0, 1, 2, 3, 4, 5, 23], size=n_fraud),  # Late night
        "day_of_week": np.random.randint(0, 7, size=n_fraud),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_fraud,
            p=[0.05, 0.05, 0.10, 0.60, 0.20]  # Weighted toward online/travel
        ),
        "is_fraud": 1
    })
    
    # Combine and shuffle
    df = pd.concat([legit, fraud], ignore_index=True)
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    
    return df

if __name__ == "__main__":
    # Generate dataset
    print("Generating synthetic fraud detection dataset...")
    df = generate_transactions(n_samples=10000, fraud_ratio=0.02)
    
    # Split into train (80%) and test (20%)
    train_df = df.sample(frac=0.8, random_state=42)
    test_df = df.drop(train_df.index)
    
    # Save to CSV files
    train_df.to_csv("data/train.csv", index=False)
    test_df.to_csv("data/test.csv", index=False)
    
    # Print summary statistics
    print(f"\nDataset generated successfully!")
    print(f"Training set: {len(train_df):,} transactions")
    print(f"Test set: {len(test_df):,} transactions")
    print(f"Overall fraud ratio: {df['is_fraud'].mean():.2%}")
    print(f"\nLegitimate transactions - Median amount: ${df[df['is_fraud']==0]['amount'].median():.2f}")
    print(f"Fraudulent transactions - Median amount: ${df[df['is_fraud']==1]['amount'].median():.2f}")
    print(f"\nMerchant category distribution (fraud):")
    print(df[df['is_fraud']==1]['merchant_category'].value_counts(normalize=True))
</code></pre>
<p>Run the data generation script:</p>
<pre><code class="language-python">python src/generate_data.py
</code></pre>
<p>You should see output like:</p>
<pre><code class="language-python">Generating synthetic fraud detection dataset...

Dataset generated successfully!
Training set: 8,000 transactions
Test set: 2,000 transactions
Overall fraud ratio: 2.00%

Legitimate transactions - Median amount: $33.45
Fraudulent transactions - Median amount: $245.67

Merchant category distribution (fraud):
online        0.60
travel        0.20
retail        0.10
restaurant    0.05
grocery       0.05
</code></pre>
<p>Now you have <code>data/train.csv</code> and <code>data/test.csv</code> with 8,000 training and 2,000 test transactions.</p>
<p><strong>Why This Matters:</strong> The synthetic data has realistic patterns — fraud is rare (2%), high-value, late-night, and concentrated in certain merchant categories. These patterns give our model something to learn.</p>
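<p>One caveat about the amounts: they come from log-normal distributions, which are right-skewed, so the mean sits well above the median. The short check below is a sketch that applies the standard log-normal formulas (median <code>exp(mu)</code>, mean <code>exp(mu + sigma^2 / 2)</code>) to the parameters passed to <code>np.random.lognormal</code> in <code>generate_transactions</code>. The helper names are made up for this example and are not part of the project.</p>
<pre><code class="language-python">import math

def lognormal_median(mu):
    """Median of a log-normal distribution whose underlying normal has mean mu."""
    return math.exp(mu)

def lognormal_mean(mu, sigma):
    """Mean of a log-normal distribution: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)

# Parameters copied from the np.random.lognormal calls in generate_transactions
for label, mu, sigma in [("legitimate", 3.5, 1.2), ("fraud", 5.5, 1.5)]:
    print(f"{label}: median ~ ${lognormal_median(mu):.0f}, "
          f"mean ~ ${lognormal_mean(mu, sigma):.0f}")
</code></pre>
<p>The medians come out at about $33 and $245, while the theoretical means are about $68 and $754. A typical transaction is therefore much smaller than the average one, which is worth keeping in mind when reading summary statistics for this dataset.</p>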
<p>Now, let's train a quick model. We'll use a simple <strong>Random Forest classifier</strong> from scikit-learn to predict <code>is_fraud</code>. In this naive version, we won't do much feature engineering – just label encode the categorical <code>merchant_category</code> and feed everything to the model.</p>
<p>Create <code>src/train_naive.py</code>:</p>
<pre><code class="language-python"># src/train_naive.py
"""
Train a fraud detection model - NAIVE VERSION.

This script demonstrates the "quick and dirty" approach to ML:
- No experiment tracking
- No model versioning
- Just train and save to a pickle file

We'll improve on this in later sections.
"""
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score,
    confusion_matrix,
    classification_report
)

def main():
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    print(f"Training samples: {len(train_df):,}")
    print(f"Test samples: {len(test_df):,}")
    print(f"Training fraud ratio: {train_df['is_fraud'].mean():.2%}")
    
    # Encode the categorical feature
    # We need to save the encoder to use the same mapping at inference time
    print("\nEncoding categorical features...")
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
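    # Build the mapping from plain str/int values so it prints cleanly on any NumPy version
    mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
    print(f"Merchant category mapping: {mapping}")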
    
    # Prepare features and labels
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    # Train a Random Forest classifier
    print("\nTraining Random Forest model...")
    model = RandomForestClassifier(
        n_estimators=100,      # Number of trees
        max_depth=10,          # Maximum depth of each tree
        random_state=42,       # For reproducibility
        n_jobs=-1              # Use all CPU cores
    )
    model.fit(X_train, y_train)
    print("Training complete!")
    
    # Evaluate on test data
    print("\n" + "="*50)
    print("MODEL EVALUATION")
    print("="*50)
    
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"F1-score:  {f1_score(y_test, y_pred):.4f}")
    
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(f"  True Negatives:  {cm[0][0]:,} (correctly identified legitimate)")
    print(f"  False Positives: {cm[0][1]:,} (legitimate flagged as fraud)")
    print(f"  False Negatives: {cm[1][0]:,} (fraud missed - DANGEROUS!)")
    print(f"  True Positives:  {cm[1][1]:,} (correctly caught fraud)")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
    
    # Feature importance
    print("\nFeature Importance:")
    for name, importance in sorted(
        zip(feature_cols, model.feature_importances_),
        key=lambda x: x[1],
        reverse=True
    ):
        print(f"  {name}: {importance:.4f}")
    
    # Save the model and encoder together
    print("\nSaving model to models/model.pkl...")
    with open("models/model.pkl", "wb") as f:
        pickle.dump((model, encoder), f)
    
    print("\nModel trained and saved successfully!")
    print("\nWARNING: This naive approach has several problems:")
    print("  - No record of hyperparameters or metrics")
    print("  - No model versioning")
    print("  - No way to reproduce this exact model")
    print("  - We'll fix these issues in the following sections!")

if __name__ == "__main__":
    main()
</code></pre>
<p>Run the training script:</p>
<pre><code class="language-python">python src/train_naive.py
</code></pre>
<p>You should see output similar to:</p>
<pre><code class="language-python">Loading data...
Training samples: 8,000
Test samples: 2,000
Training fraud ratio: 2.00%

Encoding categorical features...
Merchant category mapping: {'grocery': 0, 'online': 1, 'restaurant': 2, 'retail': 3, 'travel': 4}

Training Random Forest model...
Training complete!

==================================================
MODEL EVALUATION
==================================================

Accuracy:  0.9880
Precision: 0.7273
Recall:    0.6154
F1-score:  0.6667

Confusion Matrix:
  True Negatives:  1,952 (correctly identified legitimate)
  False Positives: 9 (legitimate flagged as fraud)
  False Negatives: 15 (fraud missed - DANGEROUS!)
  True Positives:  24 (correctly caught fraud)

Feature Importance:
  amount: 0.5423
  hour: 0.2156
  merchant_encoded: 0.1345
  day_of_week: 0.1076
</code></pre>
<p><strong>Important observation:</strong> You'll see ~98% accuracy but a lower F1-score (around 0.5-0.7). <strong>With only 2% fraud, accuracy is extremely misleading!</strong> A model that always predicts "not fraud" would achieve 98% accuracy while catching zero fraud. This is why we focus on F1-score, precision, and recall for imbalanced classification problems.</p>
<p>💡 If you're new to imbalanced classification, remember: high accuracy can be meaningless when the positive class is rare.</p>
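<p>To make the point concrete, the sketch below computes the four metrics from raw confusion-matrix counts and applies them to a hypothetical baseline that labels every transaction as legitimate. The counts (2,000 test rows, 40 of them fraudulent) are illustrative assumptions, and <code>classification_metrics</code> is a helper written for this example, not a scikit-learn function.</p>
<pre><code class="language-python">def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 by convention.
    """
    total = tp + fp + fn + tn
    f1_denominator = 2 * tp + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
    }

# Assumed test set: 2,000 transactions, 40 of them fraudulent (2%).
# A "model" that always answers "legitimate" never raises a true or false alarm.
always_legit = classification_metrics(tp=0, fp=0, fn=40, tn=1960)
for name, value in always_legit.items():
    print(f"{name}: {value:.4f}")
</code></pre>
<p>The baseline reaches 0.98 accuracy while precision, recall, and F1 are all zero, which is why accuracy alone says very little here.</p>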
<p>The script outputs a file <code>models/model.pkl</code> containing both the trained model and the label encoder (we need both for inference).</p>
<p><strong>Checkpoint:</strong> You should now have:</p>
<ul>
<li><p><code>data/train.csv</code> (~8,000 rows)</p>
</li>
<li><p><code>data/test.csv</code> (~2,000 rows)</p>
</li>
<li><p><code>models/model.pkl</code> (trained model + encoder)</p>
</li>
</ul>
<p>The model should show ~98% accuracy but F1 around 0.5-0.7. Verify the files exist: <code>ls -la data/ models/</code></p>
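<p>If you prefer to check from Python (for example on Windows, where <code>ls</code> may not be available), a small sketch like the following does the same job. The function name and the <code>EXPECTED_ARTIFACTS</code> list are assumptions based on the file names used so far.</p>
<pre><code class="language-python">from pathlib import Path

# Files produced by generate_data.py and train_naive.py
EXPECTED_ARTIFACTS = ["data/train.csv", "data/test.csv", "models/model.pkl"]

def missing_artifacts(root="."):
    """Return the expected files that do not exist under the project root."""
    return [rel for rel in EXPECTED_ARTIFACTS if not (Path(root) / rel).is_file()]

missing = missing_artifacts()
print("All artifacts present" if not missing else f"Missing: {missing}")
</code></pre>
<p>Run it from the project root; an empty result means all three files are in place.</p>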
<h3 id="heading-12-serve-predictions-with-fastapi"><strong>1.2 Serve Predictions with FastAPI</strong></h3>
<p>Now that we have a model, let's deploy it as an API so that clients can get predictions. We'll use <strong>FastAPI</strong> because it's straightforward, very fast, and produces automatic interactive documentation.</p>
<p>FastAPI is known for:</p>
<ul>
<li><p><strong>Easy to use</strong>: Pythonic syntax with type hints</p>
</li>
<li><p><strong>High performance</strong>: One of the fastest Python frameworks</p>
</li>
<li><p><strong>Automatic documentation</strong>: Swagger UI out of the box</p>
</li>
<li><p><strong>Data validation</strong>: Using Pydantic models</p>
</li>
</ul>
<p>Create <code>src/serve_naive.py</code>:</p>
<pre><code class="language-python"># src/serve_naive.py
"""
Serve fraud detection model as a REST API - NAIVE VERSION.

This is a simple API that:
1. Loads the trained model at startup
2. Accepts transaction data via POST request
3. Returns fraud prediction

We'll improve this with validation, monitoring, and better
model loading in later sections.
"""
import pickle
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Load the trained model and encoder at startup
# This is loaded once when the server starts, not on every request
print("Loading model...")
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)
print("Model loaded successfully!")

# Create the FastAPI application
app = FastAPI(
    title="Fraud Detection API",
    description="""
    Predict whether a credit card transaction is fraudulent.
    
    This API accepts transaction details and returns:
    - Whether the transaction is predicted to be fraud
    - The probability of fraud (0.0 to 1.0)
    
    **Note:** This is the naive version without validation or monitoring.
    """,
    version="1.0.0"
)

# Define the input schema using Pydantic
# This provides automatic validation and documentation
class Transaction(BaseModel):
    """Schema for a transaction to be evaluated for fraud."""
    amount: float = Field(
        ..., 
        description="Transaction amount in dollars",
        example=150.00
    )
    hour: int = Field(
        ..., 
        description="Hour of the day (0-23)",
        example=14
    )
    day_of_week: int = Field(
        ..., 
        description="Day of week (0=Monday, 6=Sunday)",
        example=3
    )
    merchant_category: str = Field(
        ..., 
        description="Type of merchant",
        example="online"
    )

class PredictionResponse(BaseModel):
    """Schema for the prediction response."""
    is_fraud: bool = Field(description="Whether the transaction is predicted as fraud")
    fraud_probability: float = Field(description="Probability of fraud (0.0 to 1.0)")
    
@app.post("/predict", response_model=PredictionResponse)
def predict(transaction: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Takes transaction details and returns a fraud prediction
    along with the probability score.
    """
    # Convert the request to a dictionary
    data = transaction.dict()
    
    # Encode the merchant category using the same encoder from training
    # This ensures consistency between training and serving
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        # Handle unknown merchant categories
        # In production, we'd want better handling here
        data["merchant_encoded"] = 0
    
    # Prepare features in the same order as training
    X = [[
        data["amount"],
        data["hour"],
        data["day_of_week"],
        data["merchant_encoded"]
    ]]
    
    # Get prediction and probability
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0][1]  # Probability of class 1 (fraud)
    
    return PredictionResponse(
        is_fraud=bool(prediction),
        fraud_probability=round(float(probability), 4)
    )

@app.get("/health")
def health_check():
    """
    Health check endpoint.
    
    Returns the status of the API. Useful for:
    - Load balancer health checks
    - Kubernetes liveness probes
    - Monitoring systems
    """
    return {
        "status": "healthy",
        "model_loaded": model is not None
    }

@app.get("/")
def root():
    """Root endpoint with API information."""
    return {
        "message": "Fraud Detection API",
        "version": "1.0.0",
        "docs": "/docs",
        "health": "/health"
    }
</code></pre>
<p>A few important things to note about this code:</p>
<ol>
<li><p><strong>Pydantic Models</strong>: We use <code>BaseModel</code> to define the expected input JSON schema. FastAPI automatically validates incoming requests against this schema.</p>
</li>
<li><p><strong>Type Hints</strong>: The type hints (<code>float</code>, <code>int</code>, <code>str</code>) provide both documentation and runtime type validation. They don't check value ranges, so an <code>hour</code> of 25 still passes.</p>
</li>
<li><p><strong>Feature Encoding</strong>: On each request, we encode the merchant category using the same <code>LabelEncoder</code> we saved from training. This ensures consistency between training and serving.</p>
</li>
<li><p><strong>Health Endpoint</strong>: The <code>/health</code> endpoint is standard practice for production APIs - it allows load balancers and monitoring systems to check if the service is running.</p>
</li>
</ol>
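<p>The feature-encoding point is easy to underestimate. The sketch below mimics what <code>LabelEncoder</code> does (sorted unique labels mapped to consecutive integers) in plain Python, to show why the mapping fitted at training time has to be reused at serving time. The function names are hypothetical, and this is not scikit-learn's implementation.</p>
<pre><code class="language-python">def fit_label_mapping(values):
    """Assign codes 0..n-1 to the sorted unique labels, as LabelEncoder.fit does."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

def encode(mapping, label):
    """Look up a label; unseen labels raise ValueError, like LabelEncoder.transform."""
    if label not in mapping:
        raise ValueError(f"unseen label: {label!r}")
    return mapping[label]

# Mapping fitted on the training categories
training_mapping = fit_label_mapping(
    ["grocery", "restaurant", "retail", "online", "travel"]
)
# Refitting on whatever categories happen to arrive at serving time
# produces different codes for the same label.
refit_mapping = fit_label_mapping(["online", "travel"])

print(encode(training_mapping, "online"))  # code the model was trained on
print(encode(refit_mapping, "online"))     # code after refitting on other data
</code></pre>
<p>Here <code>"online"</code> is encoded as 1 by the training mapping but as 0 after refitting, so a model fed the refitted code would silently treat the transaction as a different merchant category.</p>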
<p>To run this API, use Uvicorn (an ASGI server):</p>
<pre><code class="language-python">uvicorn src.serve_naive:app --reload --host 0.0.0.0 --port 8000
</code></pre>
<p>The <code>--reload</code> flag enables auto-reload during development (the server restarts when you change code).</p>
<p>You should see:</p>
<pre><code class="language-python">Loading model...
Model loaded successfully!
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process
</code></pre>
<p>Now open your browser and go to <code>http://localhost:8000/docs</code>. You'll see the <strong>Swagger UI</strong> – an auto-generated interactive documentation where you can test the API directly from your browser!</p>
<p>Test the API using curl in another terminal:</p>
<pre><code class="language-python"># Test with a legitimate-looking transaction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}'
</code></pre>
<p>Expected response:</p>
<pre><code class="language-python">{"is_fraud": false, "fraud_probability": 0.02}
</code></pre>
<pre><code class="language-python"># Test with a suspicious transaction (high amount, late night, online)
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 500.0, "hour": 3, "day_of_week": 1, "merchant_category": "online"}'
</code></pre>
<p>Expected response:</p>
<pre><code class="language-python">{"is_fraud": true, "fraud_probability": 0.78}
</code></pre>
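<p>The same requests can be sent from Python. This is a minimal sketch using only the standard library; it assumes the API is running at <code>http://localhost:8000</code> as above, and the two function names are made up for this example.</p>
<pre><code class="language-python">import json
import urllib.request

def build_predict_request(payload, base_url="http://localhost:8000"):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    return urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(payload, base_url="http://localhost:8000", timeout=5):
    """Send a transaction to the running API and return the parsed JSON response."""
    request = build_predict_request(payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Usage, with the API from src/serve_naive.py running:
# print(predict({"amount": 500.0, "hour": 3, "day_of_week": 1,
#                "merchant_category": "online"}))
</code></pre>
<p>Separating request construction from sending keeps the payload logic easy to test without a live server.</p>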
<p><strong>We have a working model served as an API!</strong> In a real scenario, we could now integrate this API with a payment processing frontend, mobile app, or any system that needs fraud predictions.</p>
<p>But before we celebrate, let's examine this naive approach for potential pitfalls...</p>
<p><strong>Checkpoint:</strong> Your API should be running at <code>http://localhost:8000</code>. The Swagger UI at <code>/docs</code> should list all three endpoints (<code>/predict</code>, <code>/health</code>, and <code>/</code>). Test with curl or the Swagger UI to verify predictions are returned.</p>
<h2 id="heading-2-where-the-naive-approach-breaks"><strong>2. Where the Naive Approach Breaks</strong></h2>
<p>Our quick-and-dirty ML pipeline works on the surface: it can train a model and serve predictions. However, <strong>hidden problems will emerge</strong> if we try to maintain or scale this system in production.</p>
<p>This section is critical: understanding these issues will motivate the solutions we implement in the following sections. Let's go through the problems one by one.</p>
<h3 id="heading-problem-1-no-experiment-tracking-reproducibility"><strong>Problem 1: No Experiment Tracking (Reproducibility)</strong></h3>
<p>Try this thought experiment: Run <code>train_naive.py</code> again with different hyperparameters (change <code>n_estimators</code> to 200, or <code>max_depth</code> to 15). Would you be able to <strong>exactly reproduce the previous model's results</strong> if someone asked?</p>
<p>Probably not. Currently, we have <strong>no record</strong> of:</p>
<ul>
<li><p>Which hyperparameters we used</p>
</li>
<li><p>What metrics we achieved</p>
</li>
<li><p>What version of the data we trained on</p>
</li>
<li><p>What library versions were installed</p>
</li>
<li><p>When the training happened</p>
</li>
<li><p>Who ran the training</p>
</li>
</ul>
<p>Three months from now, if your manager asks "How was this model trained? Can you reproduce the results?" – you'd be in trouble. You might have the code, but you don't know which version of the code, which parameters, or which data produced the model that's currently in production.</p>
<p><strong>Experiment tracking</strong> is the practice of logging all these details (code versions, parameters, metrics, data versions, artifacts) so experiments can be compared and replicated. Our naive approach lacks this entirely, making our results hard to trust or build upon.</p>
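<p>To get a feel for what such a record contains, here is a hand-rolled sketch that captures the items from the list above in a JSON file. It is only an illustration built on the standard library, not how MLflow stores runs; the function names, the <code>runs.jsonl</code> file, and the choice of packages to record are all assumptions.</p>
<pre><code class="language-python">import hashlib
import json
import os
import platform
from datetime import datetime, timezone
from importlib import metadata

def file_sha256(path):
    """Fingerprint a data file so the exact training data can be identified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def build_run_record(params, metrics, data_path):
    """Collect who, when, what data, which libraries, and which results."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": os.environ.get("USER") or os.environ.get("USERNAME") or "unknown",
        "python_version": platform.python_version(),
        "library_versions": {
            name: installed_version(name) for name in ("scikit-learn", "pandas")
        },
        "params": params,
        "metrics": metrics,
        "data_sha256": file_sha256(data_path),
    }

def append_run(record, log_path="runs.jsonl"):
    """Append one run as a JSON line so earlier runs are never overwritten."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
</code></pre>
<p>Even this small record answers most of the questions above, but it has no UI, no artifact storage, and no way to compare runs, which is the gap a dedicated tracking tool fills.</p>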
<h3 id="heading-problem-2-model-versioning-and-deployment-chaos"><strong>Problem 2: Model Versioning and Deployment Chaos</strong></h3>
<p>We trained one model and saved it as <code>model.pkl</code>. Now consider this scenario:</p>
<ol>
<li><p>You train a new model with different hyperparameters</p>
</li>
<li><p>You overwrite <code>model.pkl</code> with the new model</p>
</li>
<li><p>You deploy it to production</p>
</li>
<li><p>Users start complaining about more false positives</p>
</li>
<li><p>You want to roll back to the previous model</p>
</li>
<li><p><strong>Problem:</strong> The previous model was overwritten and is gone forever</p>
</li>
</ol>
<p>There's no systematic versioning. Questions you cannot answer:</p>
<ul>
<li><p>Which model version is currently in production?</p>
</li>
<li><p>What were the metrics for model v1 vs v2?</p>
</li>
<li><p>When was each model trained and by whom?</p>
</li>
<li><p>Can we instantly roll back if the new model performs worse?</p>
</li>
<li><p>What changed between versions?</p>
</li>
</ul>
<p>Without version control for models, you're flying blind. Imagine deploying code without Git – that's what we're doing with our model.</p>
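<p>As a rough illustration of what versioning buys you, the sketch below keeps every model file in its own numbered folder and uses a small pointer file to mark the production version. The directory layout and function names are hypothetical; this is a toy file-based registry, not how a real model registry works internally.</p>
<pre><code class="language-python">import json
import shutil
from pathlib import Path

def register_model(model_path, registry_dir, metrics):
    """Copy a model file into the next numbered version folder and return the version."""
    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    version = len([p for p in registry.glob("v*") if p.is_dir()]) + 1
    target = registry / f"v{version}"
    target.mkdir()
    shutil.copy2(model_path, target / "model.pkl")
    (target / "metadata.json").write_text(
        json.dumps({"version": version, "metrics": metrics})
    )
    return version

def set_production(registry_dir, version):
    """Point production at a registered version; a rollback is the same call."""
    if not (Path(registry_dir) / f"v{version}" / "model.pkl").is_file():
        raise FileNotFoundError(f"version {version} is not registered")
    pointer = Path(registry_dir) / "production.json"
    pointer.write_text(json.dumps({"version": version}))

def production_model_path(registry_dir):
    """Resolve the model file that the production pointer currently refers to."""
    pointer = json.loads((Path(registry_dir) / "production.json").read_text())
    return Path(registry_dir) / f"v{pointer['version']}" / "model.pkl"
</code></pre>
<p>Because nothing is ever overwritten, rolling back after a bad release is a single <code>set_production(registry_dir, 1)</code> call rather than a retraining exercise.</p>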
<h3 id="heading-problem-3-no-data-validation-garbage-in-garbage-out"><strong>Problem 3: No Data Validation – Garbage In, Garbage Out</strong></h3>
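<p>Right now, our API will accept <strong>any input of the right type</strong> and try to make a prediction. Pydantic rejects a string where a number is expected, but nothing checks whether the values make sense. Let's see what happens with bad data.</p>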
<p>Create a test script <code>src/test_bad_data.py</code>:</p>
<pre><code class="language-python"># src/test_bad_data.py
"""Test what happens when we send garbage data to the API."""
import requests

BASE_URL = "http://localhost:8000"

print("Testing API with various bad inputs...\n")

# Test 1: Negative amount
print("Test 1: Negative amount")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -500.0,        # Negative amount - impossible!
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 2: Invalid hour
print("Test 2: Hour = 25 (should be 0-23)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 25,              # Invalid hour!
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 3: Invalid day of week
print("Test 3: day_of_week = 10 (should be 0-6)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 10,       # Invalid day!
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 4: Unknown merchant category
print("Test 4: Unknown merchant category")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "unknown_category"  # Not in training data!
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 5: All bad at once
print("Test 5: Everything wrong")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -1000.0,
    "hour": 99,
    "day_of_week": 15,
    "merchant_category": "totally_fake"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

print("Observation: The API happily accepts ALL garbage and returns predictions!")
print("This is dangerous - bad data leads to bad predictions with no warning.")
</code></pre>
<p>Run it (make sure your API is still running):</p>
<pre><code class="language-python">python src/test_bad_data.py
</code></pre>
<p>You'll see something like:</p>
<pre><code class="language-python">Testing API with various bad inputs...

Test 1: Negative amount
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.15}

Test 2: Hour = 25 (should be 0-23)
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.08}

...

Observation: The API happily accepts ALL garbage and returns predictions!
</code></pre>
<p><strong>The API accepts garbage and returns predictions with no warning!</strong> In production, this could mean:</p>
<ul>
<li><p>Incorrect predictions based on impossible data</p>
</li>
<li><p>Fraud going undetected because of malformed input</p>
</li>
<li><p>Legitimate transactions blocked based on corrupted data</p>
</li>
<li><p>No way to debug why predictions are wrong</p>
</li>
</ul>
<p>As the saying goes: <strong>"Garbage in, garbage out."</strong> But even worse – we don't even know garbage went in!</p>
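<p>For a sense of what even basic checks would catch, here is a hand-written sketch of range validation for a single transaction. It is not the validation approach this tutorial builds later; the function name and the rules (positive amount, hour 0-23, day 0-6, known category) are assumptions drawn from the feature descriptions in section 1.1.</p>
<pre><code class="language-python">VALID_CATEGORIES = {"grocery", "restaurant", "retail", "online", "travel"}

def validate_transaction(tx):
    """Return a list of problems; an empty list means the input looks sane."""
    problems = []
    amount = tx["amount"]
    # Zero, negative, and NaN amounts all fail this check.
    if amount == 0 or amount != abs(amount):
        problems.append(f"amount must be positive, got {amount}")
    if tx["hour"] not in range(24):
        problems.append(f"hour must be 0-23, got {tx['hour']}")
    if tx["day_of_week"] not in range(7):
        problems.append(f"day_of_week must be 0-6, got {tx['day_of_week']}")
    if tx["merchant_category"] not in VALID_CATEGORIES:
        problems.append(f"unknown merchant_category: {tx['merchant_category']!r}")
    return problems

# The "everything wrong" payload from Test 5
bad = {"amount": -1000.0, "hour": 99, "day_of_week": 15,
       "merchant_category": "totally_fake"}
for problem in validate_transaction(bad):
    print(problem)
</code></pre>
<p>All four fields of the Test 5 payload are flagged, whereas the naive API returned a prediction for it without complaint.</p>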
<h3 id="heading-problem-4-model-drift-performance-decay-over-time"><strong>Problem 4: Model Drift – Performance Decay Over Time</strong></h3>
<p>Here's a scenario that happens in every production ML system:</p>
<ol>
<li><p><strong>January</strong>: You train your model on historical fraud data. It achieves 98% accuracy and 0.67 F1-score. Everyone's happy.</p>
</li>
<li><p><strong>February</strong>: The model is deployed and working well. Fraud is being caught.</p>
</li>
<li><p><strong>March</strong>: Fraudsters adapt. They start using different patterns – smaller amounts, different merchant categories, different times of day.</p>
</li>
<li><p><strong>April</strong>: Your model's accuracy has dropped from 98% to 85%. F1-score dropped from 0.67 to 0.35. Fraud is slipping through.</p>
</li>
<li><p><strong>May</strong>: A major fraud incident occurs. Investigation reveals the model has been underperforming for 2 months.</p>
</li>
</ol>
<p><strong>The problem:</strong> Nobody noticed for 2 months because there was no monitoring.</p>
<p>This phenomenon is called <strong>data drift</strong> (when input data distributions change) or <strong>concept drift</strong> (when the relationship between inputs and outputs changes). Both are inevitable in real-world systems.</p>
<p>Without monitoring:</p>
<ul>
<li><p>You don't know when performance degrades</p>
</li>
<li><p>You don't know why performance degrades</p>
</li>
<li><p>You can't take corrective action until users complain</p>
</li>
<li><p>By then, significant damage may have occurred</p>
</li>
</ul>
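<p>One common way to put a number on data drift is the Population Stability Index (PSI), which compares how a feature is distributed across bins in a reference sample versus a current sample. The sketch below is a simple equal-width-bin version in plain Python, not the monitoring tool used later in this tutorial; the bin count, the smoothing constant, and the sample data are all illustrative assumptions.</p>
<pre><code class="language-python">import math

def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI = sum((cur - ref) * ln(cur / ref)) over bins built on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all reference values are equal

    def proportions(values):
        counts = [0] * bins
        for x in values:
            # Values outside the reference range are clamped into the edge bins.
            index = min(max(int((x - lo) / width), 0), bins - 1)
            counts[index] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [(c + eps) / (len(values) + eps * bins) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

# Assumed data: transaction hours spread evenly at training time,
# then concentrated in late-night hours in production.
training_hours = list(range(24)) * 50
production_hours = [0, 1, 2, 3, 4, 5, 23] * 100

print(f"PSI, same data:    {population_stability_index(training_hours, training_hours):.2f}")
print(f"PSI, shifted data: {population_stability_index(training_hours, production_hours):.2f}")
</code></pre>
<p>Identical samples give a PSI of zero, and the score grows as the distributions diverge. A frequently quoted rule of thumb treats values above roughly 0.2 as a shift worth investigating, though the right threshold depends on the feature and the binning.</p>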
<h3 id="heading-problem-5-no-cicd-or-deployment-safety"><strong>Problem 5: No CI/CD or Deployment Safety</strong></h3>
<p>Our "deployment process" was literally:</p>
<ol>
<li><p>SSH into the server (or run locally)</p>
</li>
<li><p>Run <code>python src/train_naive.py</code></p>
</li>
<li><p>Copy <code>model.pkl</code> to the right place</p>
</li>
<li><p>Restart the API</p>
</li>
<li><p>Hope for the best</p>
</li>
</ol>
<p>This process has:</p>
<ul>
<li><p><strong>No automated testing</strong>: A typo could break everything</p>
</li>
<li><p><strong>No staging environment</strong>: We test directly in production</p>
</li>
<li><p><strong>No gradual rollout</strong>: 100% of traffic hits the new model immediately</p>
</li>
<li><p><strong>No rollback capability</strong>: If something breaks, we have to manually fix it</p>
</li>
<li><p><strong>No audit trail</strong>: Who deployed what and when?</p>
</li>
</ul>
<p>This is how production incidents happen. A rushed deployment at 5 PM on Friday breaks the fraud detection system, and nobody notices until Monday when fraud losses have spiked.</p>
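<p>Even one automated check before deployment would catch the most obvious breakages. The sketch below shows the idea: call the prediction function with a known-good transaction and refuse to deploy if it crashes or returns a malformed response. The <code>smoke_check</code> name and the way a prediction function is passed in are assumptions made for this example.</p>
<pre><code class="language-python">def smoke_check(predict_fn):
    """Return a list of failure messages; an empty list means the check passed."""
    sample = {"amount": 50.0, "hour": 14, "day_of_week": 3,
              "merchant_category": "grocery"}
    try:
        result = predict_fn(sample)
    except Exception as exc:  # any crash should block the deployment
        return [f"prediction raised {type(exc).__name__}: {exc}"]

    if not isinstance(result, dict):
        return [f"expected a dict response, got {type(result).__name__}"]
    if set(result) != {"is_fraud", "fraud_probability"}:
        return [f"unexpected response keys: {sorted(result)}"]

    failures = []
    if not isinstance(result["is_fraud"], bool):
        failures.append("is_fraud is not a boolean")
    probability = result["fraud_probability"]
    # Clamping to [0, 1] leaves a valid probability unchanged.
    if not isinstance(probability, float) or probability != min(max(probability, 0.0), 1.0):
        failures.append(f"fraud_probability is not a float in [0, 1]: {probability!r}")
    return failures

# Example: a stand-in prediction function with a bug
def broken_predict(transaction):
    return {"is_fraud": False, "fraud_probability": 1.5}

print(smoke_check(broken_predict))
</code></pre>
<p>A CI job can run a check like this on every commit and fail the build when the returned list is non-empty, so a broken model never reaches production unnoticed.</p>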
<p><strong>Figure 2:</strong> Problems with the Naive Approach</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771392425864/75c51059-5ab3-4e08-b3ad-7f5e9c3e7445.png" alt="Diagram showing the weaknesses of a naive machine learning setup: manual training and deployment, no experiment tracking, no model versioning, inconsistent features between training and serving, no data validation, no drift or performance monitoring, and no CI/CD safeguards such as automated tests, rollback, or audit trail." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-summary-what-we-need-to-fix"><strong>Summary: What We Need to Fix</strong></h3>
<p>Our simple ML service is missing critical infrastructure. Here's the mapping of problems to solutions:</p>
<table>
<thead>
<tr>
<th><strong>Problem</strong></th>
<th><strong>Impact</strong></th>
<th><strong>Solution</strong></th>
<th><strong>Section</strong></th>
</tr>
</thead>
<tbody><tr>
<td>No experiment tracking</td>
<td>Can't reproduce or compare models</td>
<td>MLflow Tracking</td>
<td>3</td>
</tr>
<tr>
<td>No model versioning</td>
<td>Can't roll back or audit</td>
<td>MLflow Registry</td>
<td>3</td>
</tr>
<tr>
<td>No feature consistency</td>
<td>Training-serving skew</td>
<td>Feast Feature Store</td>
<td>4</td>
</tr>
<tr>
<td>No data validation</td>
<td>Garbage predictions</td>
<td>Great Expectations</td>
<td>5</td>
</tr>
<tr>
<td>No monitoring</td>
<td>Drift goes unnoticed</td>
<td>Evidently</td>
<td>6</td>
</tr>
<tr>
<td>No CI/CD</td>
<td>Risky deployments</td>
<td>GitHub Actions + Docker</td>
<td>7</td>
</tr>
</tbody></table>
<p><strong>The good news:</strong> We can fix each of these by incrementally adding components to our pipeline. Each tool addresses a specific problem, and together they form a robust ML platform.</p>
<p>Let's start fixing these issues, one by one.</p>
<h2 id="heading-3-add-experiment-tracking-and-model-registry-with-mlflow"><strong>3. Add Experiment Tracking and Model Registry with MLflow</strong></h2>
<p><strong>What breaks without this:</strong> You can't reproduce yesterday's results, can't compare experiments, and can't roll back when a new model fails in production.</p>
<p>Our first fix addresses <strong>Problems 1 and 2</strong>: experiment reproducibility and model versioning.</p>
<p><strong>MLflow</strong> is an open-source platform designed to manage the ML lifecycle. We'll use two of its key components:</p>
<ol>
<li><p><strong>MLflow Tracking</strong>: Log experiments (parameters, metrics, artifacts) so you can compare runs and reproduce results</p>
</li>
<li><p><strong>MLflow Model Registry</strong>: Version your models with aliases (champion, challenger) and manage the deployment lifecycle</p>
</li>
</ol>
<p><strong>Why This Matters:</strong> Without tracking, ML is guesswork. With MLflow, every run is logged with parameters, metrics, and artifacts. You can compare runs side-by-side, understand what actually improved your model, and reproduce any past experiment. The Model Registry adds governance – you know exactly which model is in production and can roll back in seconds.</p>
<h3 id="heading-31-how-to-set-up-the-mlflow-tracking-server"><strong>3.1 How to Set Up the MLflow Tracking Server</strong></h3>
<p>MLflow can log experiments to a local directory by default, but to use the full UI and model registry, it's best to run the MLflow tracking server.</p>
<p>Open a <strong>new terminal</strong> (keep it separate from your API terminal) and run:</p>
<pre><code class="language-bash"># Create a directory for MLflow data
mkdir -p mlruns

# Start the MLflow server
mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns
</code></pre>
<p>Let's break down these parameters:</p>
<ul>
<li><p><code>--host 0.0.0.0</code>: Listen on all network interfaces (use <code>127.0.0.1</code> instead if only your own machine needs access)</p>
</li>
<li><p><code>--port 5000</code>: Run on port 5000</p>
</li>
<li><p><code>--backend-store-uri sqlite:///mlflow.db</code>: Store experiment metadata in a SQLite database (for production, you'd use PostgreSQL or MySQL)</p>
</li>
<li><p><code>--default-artifact-root ./mlruns</code>: Store model artifacts (files) in the <code>mlruns</code> directory</p>
</li>
</ul>
<p>You should see output similar to this (the exact server name and version depend on your MLflow release):</p>
<pre><code class="language-text">[INFO] Starting gunicorn 21.2.0
[INFO] Listening at: http://0.0.0.0:5000
</code></pre>
<p>Now open your browser and navigate to <code>http://localhost:5000</code>. You'll see the <strong>MLflow UI</strong> – it should be empty initially since we haven't logged any experiments yet.</p>
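<p>If you'd rather confirm from a script that the server is up, for example before a training job starts, here is a minimal sketch. It assumes the tracking server exposes a <code>/health</code> endpoint (recent MLflow releases do) and uses only the standard library. The file name and the <code>tracking_server_is_up</code> helper are illustrative and not part of the original project.</p>
<pre><code class="language-python"># check_mlflow_server.py (hypothetical helper, not part of the original project)
from urllib.error import URLError
from urllib.request import urlopen


def tracking_server_is_up(base_url="http://localhost:5000", timeout=2.0):
    """Return True if the server answers its /health endpoint with HTTP 200."""
    try:
        with urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status
        return False


if __name__ == "__main__":
    if tracking_server_is_up():
        print("MLflow tracking server is reachable")
    else:
        print("MLflow tracking server is NOT reachable. Is `mlflow server` running?")
</code></pre>
<p>Calling this at the top of a training script lets it fail fast with a clear message instead of hanging on the first logging call.</p>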
<h3 id="heading-32-how-to-log-experiments-in-code"><strong>3.2</strong> How to Log Experiments in Code</h3>
<p>Now let's modify our training script to log everything to MLflow. Create <code>src/train_mlflow.py</code>:</p>
<pre><code class="language-python"># src/train_mlflow.py
"""
Train fraud detection model with MLflow experiment tracking.

This script demonstrates proper ML experiment tracking:
- Log all hyperparameters
- Log all metrics (train and test)
- Log the trained model as an artifact
- Register the model in the Model Registry

Compare this to train_naive.py to see the difference!
"""
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    roc_auc_score
)
import pickle
from datetime import datetime

# Configure MLflow to use our tracking server
mlflow.set_tracking_uri("http://localhost:5000")

# Create or get the experiment
# All runs will be grouped under this experiment name
mlflow.set_experiment("fraud-detection")

def load_and_preprocess_data():
    """Load and preprocess the training and test data."""
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    # Encode categorical feature
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    # Prepare features
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    return X_train, y_train, X_test, y_test, encoder

def train_and_log_model(
    n_estimators: int = 100,
    max_depth: int = 10,
    min_samples_split: int = 2,
    min_samples_leaf: int = 1
):
    """
    Train a model and log everything to MLflow.
    
    Args:
        n_estimators: Number of trees in the forest
        max_depth: Maximum depth of each tree
        min_samples_split: Minimum samples required to split a node
        min_samples_leaf: Minimum samples required at a leaf node
    """
    X_train, y_train, X_test, y_test, encoder = load_and_preprocess_data()
    
    # Start an MLflow run - everything logged will be associated with this run
    with mlflow.start_run():
        # Add a descriptive run name
        run_name = f"rf_est{n_estimators}_depth{max_depth}_{datetime.now().strftime('%H%M%S')}"
        mlflow.set_tag("mlflow.runName", run_name)
        
        # Log all hyperparameters
        # These are the "knobs" we can tune
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_samples_split", min_samples_split)
        mlflow.log_param("min_samples_leaf", min_samples_leaf)
        mlflow.log_param("model_type", "RandomForestClassifier")
        
        # Log data information
        mlflow.log_param("train_samples", len(X_train))
        mlflow.log_param("test_samples", len(X_test))
        mlflow.log_param("fraud_ratio", float(y_train.mean()))
        mlflow.log_param("n_features", X_train.shape[1])
        
        # Train the model
        print(f"\nTraining model: n_estimators={n_estimators}, max_depth={max_depth}")
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X_train, y_train)
        
        # Evaluate and log metrics for BOTH train and test sets
        # This helps detect overfitting
        for dataset_name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
            y_pred = model.predict(X)
            y_prob = model.predict_proba(X)[:, 1]
            
            # Calculate all metrics
            accuracy = accuracy_score(y, y_pred)
            precision = precision_score(y, y_pred, zero_division=0)
            recall = recall_score(y, y_pred, zero_division=0)
            f1 = f1_score(y, y_pred, zero_division=0)
            roc_auc = roc_auc_score(y, y_prob)
            
            # Log metrics with dataset prefix
            mlflow.log_metric(f"{dataset_name}_accuracy", accuracy)
            mlflow.log_metric(f"{dataset_name}_precision", precision)
            mlflow.log_metric(f"{dataset_name}_recall", recall)
            mlflow.log_metric(f"{dataset_name}_f1", f1)
            mlflow.log_metric(f"{dataset_name}_roc_auc", roc_auc)
            
            print(f"  {dataset_name.upper()} - Accuracy: {accuracy:.4f}, F1: {f1:.4f}, ROC-AUC: {roc_auc:.4f}")
        
        # Log feature importance
        for feature, importance in zip(
            ["amount", "hour", "day_of_week", "merchant_encoded"],
            model.feature_importances_
        ):
            mlflow.log_metric(f"importance_{feature}", importance)
        
        # Log the model to MLflow AND register it in the Model Registry
        # This creates a new version of the model automatically
        print("\nRegistering model in MLflow Model Registry...")
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name="fraud-detection-model",
            input_example=X_train.iloc[:5]  # Example input for documentation
        )
        
        # Save and log the encoder as a separate artifact
        # We need this for inference
        with open("encoder.pkl", "wb") as f:
            pickle.dump(encoder, f)
        mlflow.log_artifact("encoder.pkl")
        
        # Get the run ID for reference
        run_id = mlflow.active_run().info.run_id
        print(f"\nMLflow Run ID: {run_id}")
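        # Use the real experiment ID instead of assuming it is 1
        experiment_id = mlflow.active_run().info.experiment_id
        print(f"View this run: http://localhost:5000/#/experiments/{experiment_id}/runs/{run_id}")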
        
        return model, encoder

def run_experiment_sweep():
    """
    Run multiple experiments with different hyperparameters.
    
    This demonstrates how MLflow helps compare different configurations.
    """
    print("="*60)
    print("RUNNING HYPERPARAMETER EXPERIMENT SWEEP")
    print("="*60)
    
    # Define different configurations to try
    experiments = [
        {"n_estimators": 50, "max_depth": 5},
        {"n_estimators": 100, "max_depth": 10},
        {"n_estimators": 100, "max_depth": 15},
        {"n_estimators": 200, "max_depth": 10},
        {"n_estimators": 200, "max_depth": 20},
    ]
    
    for i, params in enumerate(experiments, 1):
        print(f"\n--- Experiment {i}/{len(experiments)} ---")
        train_and_log_model(**params)
    
    print("\n" + "="*60)
    print("EXPERIMENT SWEEP COMPLETE!")
    print("="*60)
    print("\nView all experiments at: http://localhost:5000")
    print("Compare runs to find the best hyperparameters!")

if __name__ == "__main__":
    run_experiment_sweep()
</code></pre>
<p>This script:</p>
<ol>
<li><p><strong>Connects to MLflow</strong>: <code>mlflow.set_tracking_uri("http://localhost:5000")</code></p>
</li>
<li><p><strong>Creates an experiment</strong>: <code>mlflow.set_experiment("fraud-detection")</code></p>
</li>
<li><p><strong>Logs parameters</strong>: All hyperparameters and data info</p>
</li>
<li><p><strong>Logs metrics</strong>: Accuracy, precision, recall, F1, ROC-AUC for both train and test sets</p>
</li>
<li><p><strong>Logs the model</strong>: Saves the trained model as an artifact</p>
</li>
<li><p><strong>Registers the model</strong>: Adds it to the Model Registry with automatic versioning</p>
</li>
</ol>
<p>Run the experiment sweep:</p>
<pre><code class="language-bash">python src/train_mlflow.py
</code></pre>
<p>You'll see output similar to the following for each experiment (abbreviated here; your metrics will differ):</p>
<pre><code class="language-text">============================================================
RUNNING HYPERPARAMETER EXPERIMENT SWEEP
============================================================

--- Experiment 1/5 ---
Loading data...
Training model: n_estimators=50, max_depth=5
  TRAIN - Accuracy: 0.9821, F1: 0.6545, ROC-AUC: 0.9234
  TEST - Accuracy: 0.9795, F1: 0.5714, ROC-AUC: 0.8956

Registering model in MLflow Model Registry...
MLflow Run ID: abc123...

--- Experiment 5/5 ---
Training model: n_estimators=200, max_depth=20
  TRAIN - Accuracy: 0.9856, F1: 0.7123, ROC-AUC: 0.9567
  TEST - Accuracy: 0.9810, F1: 0.6667, ROC-AUC: 0.9234

============================================================
EXPERIMENT SWEEP COMPLETE!
============================================================
</code></pre>
<p>All 5 runs are now logged to MLflow with full metrics comparison available in the UI.</p>
<p>Now refresh the MLflow UI at <code>http://localhost:5000</code>. You'll see:</p>
<ol>
<li><p><strong>Experiments tab</strong>: Shows the "fraud-detection" experiment with 5 runs</p>
</li>
<li><p><strong>Each run</strong>: Shows parameters, metrics, and artifacts</p>
</li>
<li><p><strong>Compare</strong>: You can select multiple runs and compare them side-by-side</p>
</li>
<li><p><strong>Models tab</strong>: Shows "fraud-detection-model" with 5 versions</p>
</li>
</ol>
<p><strong>MLflow Tracking UI: Compare runs, metrics, and models at a glance</strong></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771396202929/c5a7d547-31b6-4783-acea-f4e9433d81ef.png" alt="Screenshot of the MLflow Tracking UI for comparing runs, metrics, and models" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">
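<p>The same comparison can be done in code. The sketch below ranks runs by test F1 and reports the train/test gap next to each one, which is a quick way to spot overfitting. It assumes the metric names logged by <code>train_mlflow.py</code> (<code>train_f1</code>, <code>test_f1</code>) and the column naming that <code>mlflow.search_runs()</code> uses in its DataFrame (<code>metrics.&lt;name&gt;</code>). The helper names <code>rank_runs</code> and <code>fetch_runs</code> are illustrative.</p>
<pre><code class="language-python"># compare_runs.py (hypothetical helper, not part of the original project)


def rank_runs(runs, metric="f1"):
    """Sort runs by test score (best first) and attach the train/test gap.

    runs: list of dicts with the keys "run_id", "metrics.train_f1" and
          "metrics.test_f1" (or the equivalent keys for another metric).
    """
    ranked = []
    for run in runs:
        test_score = run[f"metrics.test_{metric}"]
        train_score = run[f"metrics.train_{metric}"]
        ranked.append({
            "run_id": run["run_id"],
            "test": test_score,
            # A large positive gap suggests the model memorised the training set
            "gap": round(train_score - test_score, 4),
        })
    return sorted(ranked, key=lambda row: row["test"], reverse=True)


def fetch_runs(experiment_name="fraud-detection",
               tracking_uri="http://localhost:5000"):
    """Fetch all runs of an experiment as a list of dicts."""
    import mlflow  # imported lazily so rank_runs() works without MLflow installed

    mlflow.set_tracking_uri(tracking_uri)
    runs_df = mlflow.search_runs(experiment_names=[experiment_name])
    return runs_df.to_dict("records")


# Usage (requires the tracking server from section 3.1 to be running):
#   for row in rank_runs(fetch_runs()):
#       print(row)
</code></pre>
<p>Keeping the ranking logic separate from the MLflow call makes it easy to unit test with hand-written rows.</p>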

<h3 id="heading-33-how-to-use-the-model-registry"><strong>3.3</strong> How to Use the Model Registry</h3>
<p>The <strong>Model Registry</strong> provides a central hub for managing model versions and their lifecycle.</p>
<p>In the MLflow UI:</p>
<ol>
<li><p>Click the <strong>"Models"</strong> tab in the top navigation</p>
</li>
<li><p>Click <strong>"fraud-detection-model"</strong></p>
</li>
<li><p>You'll see all 5 versions listed with their metrics</p>
</li>
</ol>
<p><strong>Model Aliases:</strong> MLflow now uses <strong>aliases</strong> instead of stages. If you've seen older tutorials using "Staging" and "Production" stages, aliases are the newer, more flexible approach.</p>
<ul>
<li><p><strong>@champion</strong>: The production model serving live traffic</p>
</li>
<li><p><strong>@challenger</strong>: Candidate model being tested</p>
</li>
<li><p>You can also create custom aliases such as @baseline or @canary. (MLflow reserves the name <code>latest</code>, so it can't be used as an alias.)</p>
</li>
</ul>
<p><strong>Assign an alias:</strong></p>
<ol>
<li><p>Open MLflow UI → Models → fraud-detection-model</p>
</li>
<li><p>Click on the version you want to promote</p>
</li>
<li><p>Click <strong>"Add Alias"</strong></p>
</li>
<li><p>Enter <code>champion</code> and save</p>
</li>
</ol>
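<p>The UI steps above can also be scripted, which is useful once promotion becomes part of a pipeline. This is a minimal sketch that assumes an MLflow version with alias support, where <code>MlflowClient</code> provides <code>set_registered_model_alias</code> and <code>get_model_version_by_alias</code>. The <code>promote</code> helper is hypothetical. It returns the version that previously held the alias so you know where to roll back to.</p>
<pre><code class="language-python"># promote_model.py (hypothetical helper, not part of the original project)


def promote(client, model_name, version, alias="champion"):
    """Point `alias` at `version` and return the version that held it before.

    client: an mlflow.MlflowClient (or any object with the same two methods).
    Returns None if the alias was not assigned yet.
    """
    try:
        previous = client.get_model_version_by_alias(model_name, alias).version
    except Exception:
        # Assumption: the lookup raises when the alias does not exist yet
        previous = None
    client.set_registered_model_alias(model_name, alias, str(version))
    return previous


# Usage (requires the tracking server from section 3.1 to be running):
#   from mlflow import MlflowClient
#   client = MlflowClient(tracking_uri="http://localhost:5000")
#   previous = promote(client, "fraud-detection-model", version=3)
#   # Roll back later by promoting the previous version again:
#   promote(client, "fraud-detection-model", version=previous)
</code></pre>
<p>Passing the client in as an argument keeps the function testable without a running server.</p>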
<p>Now you've assigned the <code>@champion</code> alias to your best model. Your API will load whichever version has this alias, making rollbacks as simple as moving the alias to a different version.</p>
<p><strong>Figure 3: MLflow Model Lifecycle — From Training to Production</strong></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771396081377/da67d89f-b82d-4189-8150-ecc142ed198a.png" alt="Diagram showing the MLflow model lifecycle for a fraud detection system: a model is trained with experiment parameters, logged to MLflow tracking with metrics and artifacts, registered in the model registry as multiple versions, assigned aliases such as champion and challenger, and served in production by loading the model through the champion alias. The diagram also shows rollback by moving the alias to an earlier version and restarting the API." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-34-update-api-to-load-from-registry"><strong>3.4 Update API to Load from Registry</strong></h3>
<p>Now let's update our API to load the champion model from the MLflow Registry instead of a pickle file. Create <code>src/serve_mlflow.py</code>:</p>
<pre><code class="language-python"># src/serve_mlflow.py
"""
Serve fraud detection model from MLflow Model Registry.

This version loads the @champion model from MLflow, which means:
- Serves whichever version holds the @champion alias when the API starts
- Can roll back by changing the @champion alias
- No manual file copying needed
"""
import mlflow
import mlflow.sklearn
import pickle
import os
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")

print("Loading model from MLflow Model Registry...")

# Load the champion model from the registry
# This automatically gets whichever version has the @champion alias
try:
    model = mlflow.sklearn.load_model("models:/fraud-detection-model@champion")
    print("Successfully loaded champion model from MLflow!")
except Exception as e:
    print(f"Error loading from MLflow: {e}")
    print("Make sure you've assigned the @champion alias to a model in the MLflow UI")
    raise

# Load the encoder (saved as an artifact)
# In a real system, you might also version this in MLflow
with open("encoder.pkl", "rb") as f:
    encoder = pickle.load(f)
print("Encoder loaded successfully!")

app = FastAPI(
    title="Fraud Detection API (MLflow)",
    description="""
    Fraud detection API that loads models from MLflow Model Registry.
    
    This version always serves the model with the @champion alias.
    To update the model:
    1. Train a new model with train_mlflow.py
    2. Compare metrics in MLflow UI
    3. Assign the @champion alias to the best version
    4. Restart this API
    
    To roll back: Move the @champion alias to a previous version in MLflow UI.
    """,
    version="2.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount in dollars", example=150.00)
    hour: int = Field(..., description="Hour of the day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Monday, 6=Sunday)", example=3)
    merchant_category: str = Field(..., description="Type of merchant", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    model_source: str = "MLflow @champion"

@app.post("/predict", response_model=PredictionResponse)
def predict(tx: Transaction):
    """Predict whether a transaction is fraudulent using the champion model."""
    data = tx.dict()
    
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        data["merchant_encoded"] = 0
    
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        model_source="MLflow @champion"
    )

@app.get("/health")
def health():
    return {"status": "healthy", "model_source": "MLflow Registry"}

@app.get("/model-info")
def model_info():
    """Get information about the currently loaded model."""
    return {
        "registry": "MLflow",
        "model_name": "fraud-detection-model",
        "alias": "champion",
        "tracking_uri": "http://localhost:5000"
    }
</code></pre>
<p>Stop your old API (Ctrl+C) and start this new one:</p>
<pre><code class="language-bash">uvicorn src.serve_mlflow:app --reload --host 0.0.0.0 --port 8000
</code></pre>
<p>Now deploying a new model is a <strong>controlled, auditable process</strong>:</p>
<ol>
<li><p><strong>Train new model</strong> → Automatically registered as new version</p>
</li>
<li><p><strong>Compare metrics</strong> → Use MLflow UI to compare with the current @champion</p>
</li>
<li><p><strong>Set as champion</strong> → Assign @champion alias in MLflow UI</p>
</li>
<li><p><strong>Restart API</strong> → Loads the new @champion model</p>
</li>
<li><p><strong>Roll back if needed</strong> → Move @champion alias to previous version</p>
</li>
</ol>
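<p>One gap in <code>serve_mlflow.py</code>: the <code>/model-info</code> endpoint returns fixed strings, so it can't tell you which version is actually behind the alias. A possible improvement is to resolve the alias through the registry. The sketch below assumes <code>MlflowClient.get_model_version_by_alias</code> is available in your MLflow version. The <code>describe_alias</code> helper is hypothetical.</p>
<pre><code class="language-python"># model_info.py (hypothetical helper, not part of the original project)


def describe_alias(client, model_name="fraud-detection-model", alias="champion"):
    """Look up which registered version an alias currently points to.

    client: an mlflow.MlflowClient (or any object with the same method).
    """
    model_version = client.get_model_version_by_alias(model_name, alias)
    return {
        "model_name": model_name,
        "alias": alias,
        "version": model_version.version,
        "run_id": model_version.run_id,  # links back to the training run
    }


# Usage inside serve_mlflow.py (assumption: called once at startup):
#   from mlflow import MlflowClient
#   LOADED_MODEL = describe_alias(MlflowClient(tracking_uri="http://localhost:5000"))
#   ...and return LOADED_MODEL from the /model-info endpoint.
</code></pre>
<p>Resolve the alias once at startup and cache the result. The alias can move while the API is running, and the endpoint should describe the model that is loaded, not the one the alias points to right now.</p>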
<p><strong>Checkpoint:</strong></p>
<ul>
<li><p>MLflow UI (<code>http://localhost:5000</code>) should show the "fraud-detection" experiment with 5 runs</p>
</li>
<li><p>The "Models" tab should show "fraud-detection-model" with 5 versions</p>
</li>
<li><p>One version should have @champion alias</p>
</li>
<li><p>The API should load and serve @champion model</p>
</li>
</ul>
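<p>To verify the last checkpoint item from a script, you can send one request to the API. This sketch uses only the standard library and the example values from the <code>Transaction</code> schema. It assumes the API is running on <code>http://localhost:8000</code> as started above. The helper names are illustrative.</p>
<pre><code class="language-python"># smoke_test_api.py (hypothetical helper, not part of the original project)
import json
from urllib.request import Request, urlopen


def build_predict_request(transaction, base_url="http://localhost:8000"):
    """Build the POST request for the /predict endpoint."""
    body = json.dumps(transaction).encode("utf-8")
    return Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(transaction, base_url="http://localhost:8000", timeout=5.0):
    """Send one transaction to the API and return the parsed JSON response."""
    request = build_predict_request(transaction, base_url)
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


# Example values taken from the Transaction schema in serve_mlflow.py
SAMPLE_TRANSACTION = {
    "amount": 150.0,
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "online",
}

# Usage (requires the API to be running):
#   print(predict(SAMPLE_TRANSACTION))
</code></pre>
<p>The response should contain the <code>is_fraud</code>, <code>fraud_probability</code>, and <code>model_source</code> fields defined in <code>PredictionResponse</code>.</p>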
<h2 id="heading-4-ensure-feature-consistency-with-feast"><strong>4. Ensure Feature Consistency with Feast</strong></h2>
<p>⚠️ <strong>First time hearing about feature stores?</strong> Don't worry.<br>You don't need to master every Feast detail on the first read.<br>Focus on <em>why</em> feature consistency matters — you can revisit the implementation later.<br><strong>Key takeaway:</strong> Training and serving must compute features the same way, or your model silently fails.</p>
<p><strong>What breaks without this:</strong> Your model sees different feature values in production than it saw during training. Accuracy drops silently. This is called "training-serving skew" and it's one of the most common causes of ML system failures.</p>
<p>One subtle but critical issue in ML systems is <strong>training-serving skew</strong> – when data transformations at training time differ from inference time. Even small discrepancies can severely degrade performance.</p>
<p><strong>Why This Matters:</strong> Imagine you're computing "average transaction amount per merchant category" as a feature. During training, you compute it using pandas in a notebook. During serving, you compute it using SQL in a different system. Small differences in how these computations handle edge cases (nulls, rounding, time windows) cause the model to see different features in production than it was trained on.</p>
<p>The result? <strong>Silent failures</strong> where accuracy drops but nothing errors out. Your model is making predictions based on features it's never seen before, and you have no idea.</p>
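<p>Here is a toy illustration of how this happens. Both functions claim to compute "average transaction amount", but they disagree on missing values. The training-side version skips them (as pandas' <code>mean()</code> does by default), while the serving-side version treats them as zero (as a query that wraps the column in <code>COALESCE(amount, 0)</code> would). The data and function names are made up for the example.</p>
<pre><code class="language-python"># skew_demo.py (illustrative only)


def avg_amount_training(amounts):
    """Training-side logic: ignore missing values."""
    present = [a for a in amounts if a is not None]
    return sum(present) / len(present)


def avg_amount_serving(amounts):
    """Serving-side logic: count missing values as 0."""
    return sum(a or 0.0 for a in amounts) / len(amounts)


# Hypothetical transaction amounts for one merchant category
amounts = [120.0, None, 80.0, None]

training_value = avg_amount_training(amounts)  # (120 + 80) / 2
serving_value = avg_amount_serving(amounts)    # (120 + 0 + 80 + 0) / 4

print(f"training sees avg_amount={training_value}")  # 100.0
print(f"serving sees  avg_amount={serving_value}")   # 50.0
</code></pre>
<p>Nothing crashes and both values look plausible, but the model was trained on 100.0 and is served 50.0 for the same merchant.</p>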
<p>In our naive implementation, we did handle one simple case: we saved the <code>LabelEncoder</code> to ensure <code>merchant_category</code> is encoded the same way in training and serving. But imagine if we had more complex feature engineering:</p>
<ul>
<li><p>Rolling averages over time windows</p>
</li>
<li><p>User-level aggregations</p>
</li>
<li><p>Cross-feature interactions</p>
</li>
<li><p>Real-time features from streaming data</p>
</li>
</ul>
<p>Maintaining consistency manually becomes impossible.</p>
<h3 id="heading-41-what-is-feast-and-why-use-it"><strong>4.1 What is Feast and Why Use It?</strong></h3>
<p>In production ML platforms, teams use a <strong>feature store</strong> to guarantee feature consistency between training and serving. <strong>Feast</strong> is one popular open-source option.</p>
<p>In this tutorial, we use Feast not because you <em>must</em>, but because it makes the training-serving contract explicit and teachable. The principles apply whether you use Feast, Tecton, Featureform, or a custom solution.</p>
<p>Feast provides:</p>
<table>
<thead>
<tr>
<th><strong>Capability</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Single source of truth</strong></td>
<td>Define features once, use everywhere</td>
</tr>
<tr>
<td><strong>Offline/online consistency</strong></td>
<td>Same features for training and serving</td>
</tr>
<tr>
<td><strong>Point-in-time correctness</strong></td>
<td>Prevents data leakage in training</td>
</tr>
<tr>
<td><strong>Low-latency serving</strong></td>
<td>Millisecond feature retrieval</td>
</tr>
<tr>
<td><strong>Feature versioning</strong></td>
<td>Track changes to feature definitions</td>
</tr>
</tbody></table>
<p><strong>How Feast works:</strong></p>
<ol>
<li><p><strong>Define features</strong> in Python code (feature definitions)</p>
</li>
<li><p><strong>Materialize features</strong> from your data sources to the online store</p>
</li>
<li><p><strong>Retrieve features</strong> using the same API for both training (offline) and serving (online)</p>
</li>
</ol>
<p>This ensures that training and serving use <strong>exactly the same feature computation logic</strong>.</p>
<h3 id="heading-42-install-and-initialize-feast"><strong>4.2 Install and Initialize Feast</strong></h3>
<p>We already installed Feast via requirements.txt. Now let's initialize a feature repository.</p>
<pre><code class="language-bash"># Navigate to the feature_repo directory
cd feature_repo

# Initialize Feast (this creates template files)
feast init . --minimal

# Go back to project root
cd ..
</code></pre>
<p>This creates the basic Feast structure:</p>
<pre><code class="language-text">feature_repo/
├── feature_store.yaml    # Feast configuration
└── __init__.py
</code></pre>
<h3 id="heading-43-define-feature-definitions"><strong>4.3 Define Feature Definitions</strong></h3>
<p>First, replace the contents of the generated <code>feature_repo/feature_store.yaml</code> with the following configuration:</p>
<pre><code class="language-yaml"># feature_repo/feature_store.yaml
project: fraud_detection
registry: ../data/registry.db
provider: local
online_store:
  type: sqlite
  path: ../data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 3
</code></pre>
<p>This configuration:</p>
<ul>
<li><p>Names our project "fraud_detection"</p>
</li>
<li><p>Uses SQLite for the online store (for production, you'd use Redis or DynamoDB)</p>
</li>
<li><p>Uses local files for the offline store (for production, you'd use BigQuery or Snowflake)</p>
</li>
</ul>
<p>Now create the feature definitions:</p>
<pre><code class="language-python"># feature_repo/features.py
"""
Feast feature definitions for fraud detection.

This file defines:
- Entities: The keys we use to look up features (merchant_category)
- Data Sources: Where the raw feature data comes from (Parquet file)
- Feature Views: The features themselves and their schemas

The key insight: These definitions are the SINGLE SOURCE OF TRUTH.
Both training and serving use these exact definitions.
"""
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# =============================================================================
# ENTITIES
# =============================================================================
# An entity is the "key" we use to look up features.
# For merchant-level features, the entity is merchant_category.

merchant = Entity(
    name="merchant_category",
    description="Merchant category for the transaction (for example, 'online', 'grocery')",
    value_type=ValueType.STRING,
)

# =============================================================================
# DATA SOURCES
# =============================================================================
# Data sources tell Feast where to find the raw feature data.
# For local development, we use a Parquet file.
# For production, this could be BigQuery, Snowflake, S3, etc.

merchant_stats_source = FileSource(
    name="merchant_stats_source",
    path="../data/merchant_features.parquet",  # We'll create this file
    timestamp_field="event_timestamp",       # Required for point-in-time joins
)

# =============================================================================
# FEATURE VIEWS
# =============================================================================
# A Feature View defines a group of related features.
# It specifies:
# - Which entity the features are for
# - The schema (names and types of features)
# - Where the data comes from
# - How long features are valid (TTL)

merchant_stats_fv = FeatureView(
    name="merchant_stats",
    description="Aggregated statistics per merchant category",
    entities=[merchant],
    ttl=timedelta(days=7),  # Features are valid for 7 days
    schema=[
        Field(name="avg_amount", dtype=Float32, description="Average transaction amount"),
        Field(name="transaction_count", dtype=Int64, description="Number of transactions"),
        Field(name="fraud_rate", dtype=Float32, description="Historical fraud rate"),
    ],
    source=merchant_stats_source,
    online=True,  # Enable online serving (low-latency retrieval)
)
</code></pre>
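<p>Two settings in this file, <code>timestamp_field</code> and <code>ttl</code>, control which feature row a lookup can see. The sketch below shows the idea in plain Python. For a given event time, it takes the newest feature row that is not from the future and not older than the TTL. This is a simplified model of a point-in-time join, not Feast's actual implementation, and the sample history is invented.</p>
<pre><code class="language-python"># pit_lookup_sketch.py (illustrative only; this is not Feast's implementation)
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, event_time, ttl):
    """Return the newest feature value visible at event_time, or None.

    feature_rows: list of (timestamp, value) tuples sorted by timestamp.
    A row is visible when its timestamp is at or before event_time
    and no older than event_time minus ttl.
    """
    timestamps = [ts for ts, _ in feature_rows]
    hi = bisect_right(timestamps, event_time)       # drop rows from the future
    lo = bisect_left(timestamps, event_time - ttl)  # drop rows older than the TTL
    window = feature_rows[lo:hi]
    return window[-1][1] if window else None


# Hypothetical fraud_rate history for one merchant category
history = [
    (datetime(2024, 1, 1), 0.010),
    (datetime(2024, 1, 5), 0.020),
    (datetime(2024, 1, 20), 0.030),
]
ttl = timedelta(days=7)

print(point_in_time_lookup(history, datetime(2024, 1, 6), ttl))   # 0.02 (Jan 5 row)
print(point_in_time_lookup(history, datetime(2024, 1, 15), ttl))  # None (Jan 5 row is too old)
print(point_in_time_lookup(history, datetime(2024, 1, 21), ttl))  # 0.03 (Jan 20 row)
</code></pre>
<p>The second lookup is the important one. A transaction on January 15 must not see the January 20 value (that would leak future data), and the January 5 value is older than the 7-day TTL, so no feature is returned.</p>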
<h3 id="heading-44-materialize-features-to-online-store"><strong>4.4 Materialize Features to Online Store</strong></h3>
<p>Now we need to:</p>
<ol>
<li><p>Compute the features from our training data</p>
</li>
<li><p>Save them in a format Feast can read</p>
</li>
<li><p>Apply the Feast definitions</p>
</li>
<li><p>Materialize features to the online store</p>
</li>
</ol>
<p>Create <code>src/prepare_feast_features.py</code>:</p>
<pre><code class="language-python"># src/prepare_feast_features.py
"""
Prepare feature data for Feast.

This script:
1. Computes aggregated merchant features from training data
2. Saves them in Parquet format (Feast's offline store format)
3. Applies Feast feature definitions
4. Materializes features to the online store for low-latency serving

Run this whenever your training data changes or you want to refresh features.
"""
import pandas as pd
import numpy as np
from datetime import datetime
import subprocess
import os

def compute_merchant_features(df: pd.DataFrame) -&gt; pd.DataFrame:
    """
    Compute aggregated features by merchant category.
    
    THIS IS THE SINGLE SOURCE OF TRUTH FOR FEATURE COMPUTATION.
    
    Both training and serving will use features computed by this exact logic.
    Any change here automatically applies everywhere.
    
    Args:
        df: Transaction DataFrame with columns: amount, merchant_category, is_fraud
        
    Returns:
        DataFrame with computed features per merchant category
    """
    print("Computing merchant-level features...")
    
    # Group by merchant category and compute aggregates
    stats = df.groupby('merchant_category').agg({
        'amount': ['mean', 'count'],
        'is_fraud': 'mean'
    }).reset_index()
    
    # Flatten column names
    stats.columns = ['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']
    
    # Add timestamp for Feast (required for point-in-time correct joins)
    stats['event_timestamp'] = datetime.now()
    
    # Convert types to match Feast schema
    stats['avg_amount'] = stats['avg_amount'].astype('float32')
    stats['transaction_count'] = stats['transaction_count'].astype('int64')
    stats['fraud_rate'] = stats['fraud_rate'].astype('float32')
    
    return stats

def main():
    print("="*60)
    print("FEAST FEATURE PREPARATION")
    print("="*60)
    
    # Load training data
    print("\n1. Loading training data...")
    train_df = pd.read_csv('data/train.csv')
    print(f"   Loaded {len(train_df):,} transactions")
    
    # Compute merchant features
    print("\n2. Computing merchant features...")
    merchant_features = compute_merchant_features(train_df)
    
    print("\n   Computed features:")
    print(merchant_features.to_string(index=False))
    
    # Save as Parquet (required format for Feast file source)
    print("\n3. Saving features to Parquet...")
    os.makedirs('data', exist_ok=True)
    output_path = 'data/merchant_features.parquet'
    merchant_features.to_parquet(output_path, index=False)
    print(f"   Saved to {output_path}")
    
    # Apply Feast feature definitions
    print("\n4. Applying Feast feature definitions...")
    try:
        result = subprocess.run(
            ['feast', 'apply'],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Feature definitions applied successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error applying Feast: {e.stderr}")
        raise
    
    # Materialize features to online store
    print("\n5. Materializing features to online store...")
    try:
        result = subprocess.run(
            ['feast', 'materialize-incremental', datetime.now().isoformat()],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Features materialized successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error materializing: {e.stderr}")
        raise
    
    print("\n" + "="*60)
    print("FEAST FEATURE PREPARATION COMPLETE!")
    print("="*60)
    print("\nYou can now:")
    print("  - Retrieve features for training: get_training_features()")
    print("  - Retrieve features for serving: get_online_features()")
    print("  - View feature stats: feast feature-views list")

if __name__ == "__main__":
    main()
</code></pre>
<p>Run the feature preparation:</p>
<pre><code class="language-bash">python src/prepare_feast_features.py
</code></pre>
<p>You should see output along these lines (condensed here; the exact layout and numbers depend on your data):</p>
<pre><code class="language-text">============================================================
FEAST FEATURE PREPARATION
============================================================

1. Loading training data... 8,000 transactions
2. Computing merchant features...
   grocery: avg=$31.24, fraud_rate=0.85%
   online: avg=$98.45, fraud_rate=4.87%
   restaurant: avg=$28.12, fraud_rate=0.50%
   retail: avg=$45.67, fraud_rate=1.02%
   travel: avg=$156.23, fraud_rate=4.18%
3. Saving to data/merchant_features.parquet ✓
4. Applying Feast definitions... ✓
5. Materializing to online store... ✓

FEAST FEATURE PREPARATION COMPLETE!
</code></pre>
<h3 id="heading-45-retrieve-features-for-training-and-serving"><strong>4.5 Retrieve Features for Training and Serving</strong></h3>
<p>Now let's create utilities to retrieve features consistently for both training and serving:</p>
<pre><code class="language-python"># src/feast_features.py
"""
Feast feature retrieval for training and serving.

This module provides functions to retrieve features from Feast:
- get_training_features(): For offline training (historical features)
- get_online_features(): For real-time serving (low-latency)

IMPORTANT: Both functions use the SAME feature definitions,
ensuring consistency between training and serving.
"""
import pandas as pd
from feast import FeatureStore
from datetime import datetime

# Initialize Feast store (points to our feature_repo)
store = FeatureStore(repo_path="feature_repo")

def get_training_features(df: pd.DataFrame) -&gt; pd.DataFrame:
    """
    Get features for training using Feast's offline store.
    
    Uses point-in-time correct joins to prevent data leakage.
    This means features are looked up as of the time each transaction occurred,
    not as of "now" - preventing you from accidentally using future data.
    
    Args:
        df: DataFrame with at least 'merchant_category' column
        
    Returns:
        DataFrame with original columns plus Feast features
    """
    print("Retrieving training features from Feast offline store...")
    
    # Prepare entity dataframe with timestamps
    # Each row needs: entity key(s) + event_timestamp
    entity_df = df[['merchant_category']].copy()
    entity_df['event_timestamp'] = datetime.now()  # See note below
    entity_df = entity_df.drop_duplicates()
    
    # ⚠️ Simplification: For clarity, we use the current timestamp here.
    # In real systems, this would be the actual event time of each transaction.
    
    # Retrieve historical features
    # Feast handles the point-in-time join automatically
    training_data = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
    ).to_df()
    
    # Merge features back with original dataframe
    result = df.merge(
        training_data[['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']],
        on='merchant_category',
        how='left'
    )
    
    print(f"Retrieved features for {len(entity_df)} unique merchants")
    return result

def get_online_features(merchant_category: str) -&gt; dict:
    """
    Get features for real-time serving using Feast's online store.
    
    This is optimized for low-latency retrieval (milliseconds).
    Use this in your prediction API for real-time inference.
    
    Args:
        merchant_category: The merchant category to look up
        
    Returns:
        Dictionary with feature names and values
    """
    # Retrieve from online store (low-latency)
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": merchant_category}],
    ).to_dict()
    
    # Format the response
    return {
        'merchant_avg_amount': feature_vector['avg_amount'][0],
        'merchant_tx_count': feature_vector['transaction_count'][0],
        'merchant_fraud_rate': feature_vector['fraud_rate'][0],
    }

def get_online_features_batch(merchant_categories: list) -&gt; pd.DataFrame:
    """
    Get features for multiple merchants at once (batch serving).
    
    More efficient than calling get_online_features() in a loop.
    
    Args:
        merchant_categories: List of merchant categories to look up
        
    Returns:
        DataFrame with features for each merchant
    """
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": mc} for mc in merchant_categories],
    ).to_df()
    
    return feature_vector

if __name__ == "__main__":
    # Test the feature retrieval functions
    print("="*60)
    print("TESTING FEAST FEATURE RETRIEVAL")
    print("="*60)
    
    # Test offline retrieval (for training)
    print("\n1. Testing OFFLINE feature retrieval (for training)...")
    train_df = pd.read_csv('data/train.csv').head(10)
    enriched = get_training_features(train_df)
    print("\n   Sample enriched training data:")
    print(enriched[['amount', 'merchant_category', 'avg_amount', 'fraud_rate']].head())
    
    # Test online retrieval (for serving)
    print("\n2. Testing ONLINE feature retrieval (for serving)...")
    for category in ['online', 'grocery', 'travel', 'restaurant', 'retail']:
        features = get_online_features(category)
        print(f"   {category}: avg_amount=${features['merchant_avg_amount']:.2f}, "
              f"fraud_rate={features['merchant_fraud_rate']:.2%}")
    
    # Test batch retrieval
    print("\n3. Testing BATCH online retrieval...")
    batch_features = get_online_features_batch(['online', 'grocery', 'travel'])
    print(batch_features)
    
    print("\n" + "="*60)
    print("FEAST FEATURE RETRIEVAL TEST COMPLETE!")
    print("="*60)
</code></pre>
<p>Test the feature retrieval:</p>
<pre><code class="language-python">python src/feast_features.py
</code></pre>
<p>You should see:</p>
<pre><code class="language-python">============================================================
TESTING FEAST FEATURE RETRIEVAL
============================================================

1. Testing OFFLINE feature retrieval (for training)...
Retrieving training features from Feast offline store...
Retrieved features for 5 unique merchants

   Sample enriched training data:
   amount merchant_category  avg_amount  fraud_rate
    45.23           grocery       31.24      0.0085
   123.45            online       98.45      0.0487
    ...

2. Testing ONLINE feature retrieval (for serving)...
   online: avg_amount=$98.45, fraud_rate=4.87%
   grocery: avg_amount=$31.24, fraud_rate=0.85%
   travel: avg_amount=$156.23, fraud_rate=4.18%
   restaurant: avg_amount=$28.12, fraud_rate=0.50%
   retail: avg_amount=$45.67, fraud_rate=1.02%

3. Testing BATCH online retrieval...
  merchant_category  avg_amount  transaction_count  fraud_rate
               online       98.45               1234      0.0487
              grocery       31.24               2345      0.0085
               travel      156.23                478      0.0418
</code></pre>
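<p>To make the "point-in-time" idea from the docstring concrete, here is a minimal standard-library sketch of the lookup rule: for each event, take the latest feature value recorded at or before the event time. The <code>history</code> data and the <code>feature_as_of</code> helper are hypothetical illustrations, not part of Feast, and the sketch ignores details such as feature TTLs.</p>
<pre><code class="language-python">from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one merchant category:
# (time the value became available, fraud_rate), sorted by time.
history = [
    (datetime(2024, 1, 1), 0.010),
    (datetime(2024, 2, 1), 0.030),
    (datetime(2024, 3, 1), 0.050),
]

def feature_as_of(history, event_time):
    """Return the latest value recorded at or before event_time, or None."""
    timestamps = [ts for ts, _ in history]
    # bisect_right counts how many rows have a timestamp at or before event_time
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None  # no feature value existed yet
    return history[idx - 1][1]

# A mid-February transaction sees the February value, not the later March one.
print(feature_as_of(history, datetime(2024, 2, 15)))   # 0.03
# A transaction older than every feature row gets no value rather than a future one.
print(feature_as_of(history, datetime(2023, 12, 31)))  # None
</code></pre>
<p>This is why the entity dataframe needs an <code>event_timestamp</code> column: the timestamp decides which version of each feature a training row is allowed to see.</p>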
<h3 id="heading-why-feast-over-custom-code"><strong>Why Feast Over Custom Code?</strong></h3>
<table>
<thead>
<tr>
<th><strong>Aspect</strong></th>
<th><strong>Custom Code</strong></th>
<th><strong>Feast</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Consistency</strong></td>
<td>Manual effort to keep in sync</td>
<td>Automatic - same definitions everywhere</td>
</tr>
<tr>
<td><strong>Point-in-time correctness</strong></td>
<td>Must implement yourself</td>
<td>Built-in</td>
</tr>
<tr>
<td><strong>Online serving</strong></td>
<td>Must build your own cache</td>
<td>Built-in online store</td>
</tr>
<tr>
<td><strong>Feature versioning</strong></td>
<td>Not supported</td>
<td>Built-in</td>
</tr>
<tr>
<td><strong>Scalability</strong></td>
<td>Limited</td>
<td>Production-ready (BigQuery, Redis, etc.)</td>
</tr>
<tr>
<td><strong>Team collaboration</strong></td>
<td>Difficult</td>
<td>Feature registry with documentation</td>
</tr>
<tr>
<td><strong>Monitoring</strong></td>
<td>Manual</td>
<td>Built-in feature statistics</td>
</tr>
</tbody></table>
<p>💡 <strong>Mental Model</strong>: Treat feature definitions like database schemas.<br>You wouldn't compute a column one way in your application and a different way in your reports. Features deserve the same discipline — define once, use everywhere.</p>
<p><strong>Checkpoint:</strong> After running <code>prepare_feast_features.py</code>, you should have:</p>
<ul>
<li><p><code>data/merchant_features.parquet</code> (computed features)</p>
</li>
<li><p><code>data/registry.db</code> (Feast registry)</p>
</li>
<li><p><code>data/online_store.db</code> (SQLite online store)</p>
</li>
</ul>
<p>Running <code>python src/feast_features.py</code> should successfully retrieve features for all merchant categories.</p>
<h2 id="heading-5-add-data-validation-with-great-expectations"><strong>5. Add Data Validation with Great Expectations</strong></h2>
<p><strong>What breaks without this:</strong> Your API accepts garbage input (negative amounts, invalid hours) and returns meaningless predictions. Worse, you have no idea it happened.</p>
<p>Recall that our API currently trusts input blindly. We saw how garbage data produces a prediction with no warning. <strong>Great Expectations</strong> is an open-source tool for data quality testing – defining rules (expectations) and testing data against them.</p>
<p><strong>Why This Matters:</strong> Data validation acts as a gatekeeper. Bad data is rejected <strong>before</strong> it can harm predictions. As the saying goes, "Garbage in, garbage out" – feeding unreliable data yields unreliable results. With validation, we transform this to "Garbage in, <strong>error out</strong>" – much better for debugging and reliability.</p>
<h3 id="heading-51-define-expectations"><strong>5.1 Define Expectations</strong></h3>
<p>What are reasonable expectations for our transaction data? Based on domain knowledge:</p>
<table>
<thead>
<tr>
<th><strong>Field</strong></th>
<th><strong>Expectation</strong></th>
<th><strong>Reason</strong></th>
</tr>
</thead>
<tbody><tr>
<td><code>amount</code></td>
<td>Positive (&gt; 0)</td>
<td>Negative transactions don't make sense</td>
</tr>
<tr>
<td><code>amount</code></td>
<td>At most $50,000</td>
<td>Extremely large amounts are outliers/errors</td>
</tr>
<tr>
<td><code>hour</code></td>
<td>0-23 inclusive</td>
<td>Valid hours in a day</td>
</tr>
<tr>
<td><code>day_of_week</code></td>
<td>0-6 inclusive</td>
<td>Valid days (Mon=0, Sun=6)</td>
</tr>
<tr>
<td><code>merchant_category</code></td>
<td>One of known categories</td>
<td>Must match training data</td>
</tr>
<tr>
<td>All fields</td>
<td>Not null</td>
<td>Required for prediction</td>
</tr>
</tbody></table>
<p>Create <code>src/data_validation.py</code>:</p>
<pre><code class="language-python"># src/data_validation.py
"""
Data validation for fraud detection.

This module provides functions to validate input data BEFORE making predictions.
Invalid data is rejected with clear error messages.

The key insight: It's better to reject bad input than to make garbage predictions.
"""
import pandas as pd
from typing import Dict, List, Any, Optional

# Define the valid merchant categories (must match training data!)
VALID_CATEGORIES = ["grocery", "restaurant", "retail", "online", "travel"]

def validate_transaction(data: Dict[str, Any]) -&gt; Dict[str, Any]:
    """
    Validate a single transaction for fraud prediction.
    
    Checks all business rules and data quality requirements.
    Returns a dictionary with 'valid' (bool) and 'errors' (list).
    
    Args:
        data: Dictionary with transaction fields
        
    Returns:
        {"valid": bool, "errors": list of error messages}
        
    Example:
        &gt;&gt;&gt; validate_transaction({"amount": -100, "hour": 25, ...})
        {"valid": False, "errors": ["amount must be positive", "hour must be between 0 and 23 (got 25)"]}
    """
    errors = []
    
    # ==========================================================================
    # Amount Validation
    # ==========================================================================
    amount = data.get("amount")
    if amount is None:
        errors.append("amount is required")
    elif not isinstance(amount, (int, float)):
        errors.append(f"amount must be a number (got {type(amount).__name__})")
    elif amount &lt;= 0:
        errors.append("amount must be positive")
    elif amount &gt; 50000:
        errors.append(f"amount exceeds maximum allowed value of $50,000 (got ${amount:,.2f})")
    
    # ==========================================================================
    # Hour Validation
    # ==========================================================================
    hour = data.get("hour")
    if hour is None:
        errors.append("hour is required")
    elif not isinstance(hour, int):
        errors.append(f"hour must be an integer (got {type(hour).__name__})")
    elif not (0 &lt;= hour &lt;= 23):
        errors.append(f"hour must be between 0 and 23 (got {hour})")
    
    # ==========================================================================
    # Day of Week Validation
    # ==========================================================================
    day = data.get("day_of_week")
    if day is None:
        errors.append("day_of_week is required")
    elif not isinstance(day, int):
        errors.append(f"day_of_week must be an integer (got {type(day).__name__})")
    elif not (0 &lt;= day &lt;= 6):
        errors.append(f"day_of_week must be between 0 (Monday) and 6 (Sunday) (got {day})")
    
    # ==========================================================================
    # Merchant Category Validation
    # ==========================================================================
    category = data.get("merchant_category")
    if category is None:
        errors.append("merchant_category is required")
    elif not isinstance(category, str):
        errors.append(f"merchant_category must be a string (got {type(category).__name__})")
    elif category not in VALID_CATEGORIES:
        errors.append(
            f"merchant_category must be one of {VALID_CATEGORIES} (got '{category}')"
        )
    
    return {
        "valid": len(errors) == 0,
        "errors": errors
    }

def validate_batch(df: pd.DataFrame) -&gt; Dict[str, Any]:
    """
    Validate a batch of transactions using Great Expectations.
    
    This is useful for validating training data or batch prediction requests.
    Uses Great Expectations for more sophisticated validation.
    
    Args:
        df: DataFrame with transaction data
        
    Returns:
        Dictionary with validation results
    """
    import great_expectations as gx
    
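    # Convert to a Great Expectations dataset.
    # Note: gx.from_pandas() belongs to the legacy Dataset API, which recent
    # Great Expectations releases no longer provide. If this call raises an
    # AttributeError, pin an older version that still includes it.
    ge_df = gx.from_pandas(df)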
    
    results = []
    
    # Amount expectations
    r = ge_df.expect_column_values_to_be_between(
        'amount', min_value=0.01, max_value=50000, mostly=0.99
    )
    results.append(('amount_range', r.success, r.result))
    
    # Hour expectations
    r = ge_df.expect_column_values_to_be_between(
        'hour', min_value=0, max_value=23
    )
    results.append(('hour_range', r.success, r.result))
    
    # Day of week expectations
    r = ge_df.expect_column_values_to_be_between(
        'day_of_week', min_value=0, max_value=6
    )
    results.append(('day_range', r.success, r.result))
    
    # Merchant category expectations
    r = ge_df.expect_column_values_to_be_in_set(
        'merchant_category', VALID_CATEGORIES
    )
    results.append(('category_valid', r.success, r.result))
    
    # No nulls in critical fields
    for col in ['amount', 'hour', 'day_of_week', 'merchant_category']:
        r = ge_df.expect_column_values_to_not_be_null(col)
        results.append((f'{col}_not_null', r.success, r.result))
    
    # Summarize results
    passed = sum(1 for _, success, _ in results if success)
    total = len(results)
    
    return {
        'success': passed == total,
        'passed': passed,
        'total': total,
        'pass_rate': passed / total,
        'details': {name: {'passed': success, 'result': result} 
                   for name, success, result in results}
    }

if __name__ == "__main__":
    print("="*60)
    print("TESTING DATA VALIDATION")
    print("="*60)
    
    # Test single transaction validation
    print("\n1. Single Transaction Validation")
    print("-"*40)
    
    test_cases = [
        {
            "name": "Valid transaction",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Negative amount",
            "data": {"amount": -100.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Invalid hour",
            "data": {"amount": 50.0, "hour": 25, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Unknown merchant",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "unknown"}
        },
        {
            "name": "Everything wrong",
            "data": {"amount": -999, "hour": 99, "day_of_week": 15, "merchant_category": "fake"}
        },
    ]
    
    for tc in test_cases:
        result = validate_transaction(tc["data"])
        status = "PASS" if result["valid"] else "FAIL"
        print(f"\n{tc['name']}: {status}")
        if result["errors"]:
            for error in result["errors"]:
                print(f"  - {error}")
    
    # Test batch validation
    print("\n\n2. Batch Validation with Great Expectations")
    print("-"*40)
    
    train_df = pd.read_csv('data/train.csv')
    results = validate_batch(train_df)
    
    print(f"\nTraining data validation: {results['passed']}/{results['total']} checks passed")
    print(f"Pass rate: {results['pass_rate']:.1%}")
    
    if not results['success']:
        print("\nFailed checks:")
        for name, detail in results['details'].items():
            if not detail['passed']:
                print(f"  - {name}")
</code></pre>
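<p>The <code>mostly=0.99</code> argument in <code>validate_batch</code> deserves a quick illustration. As a rough sketch of the idea (not Great Expectations' actual implementation), an expectation with <code>mostly</code> passes when at least that fraction of the non-null values satisfy the rule. The helper names below are hypothetical:</p>
<pre><code class="language-python">def fraction_valid(values, is_valid):
    """Fraction of non-null values for which is_valid(value) is True."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 1.0  # assumption for this sketch: an empty column passes
    return sum(1 for v in non_null if is_valid(v)) / len(non_null)

def expectation_passes(values, is_valid, mostly=1.0):
    """Pass when at least `mostly` of the non-null values are valid."""
    return fraction_valid(values, is_valid) &gt;= mostly

def valid_hour(hour):
    """Hours must be integers from 0 to 23 inclusive."""
    return hour in range(24)

hours = [0, 13, 23, 25, None]  # one bad hour out of four non-null values

print(fraction_valid(hours, valid_hour))                   # 0.75
print(expectation_passes(hours, valid_hour))               # False
print(expectation_passes(hours, valid_hour, mostly=0.75))  # True
</code></pre>
<p>A tolerance like this suits training data, where a handful of outliers is expected. For a single API request there is no "mostly": the transaction is either valid or rejected.</p>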
<h3 id="heading-when-to-use-which-validation-approach"><strong>When to Use Which Validation Approach</strong></h3>
<table>
<thead>
<tr>
<th><strong>Approach</strong></th>
<th><strong>Use Case</strong></th>
<th><strong>Latency</strong></th>
<th><strong>When to Use</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Custom Python</strong> (<code>validate_transaction</code>)</td>
<td>Real-time API requests</td>
<td>&lt;1ms</td>
<td>Every prediction request</td>
</tr>
<tr>
<td><strong>Great Expectations</strong></td>
<td>Batch data quality</td>
<td>Seconds</td>
<td>Training data, periodic audits, CI/CD</td>
</tr>
</tbody></table>
<p>We use <strong>both</strong> in this tutorial because they serve different purposes:</p>
<ul>
<li><p>Custom validation is your <strong>runtime gatekeeper</strong> — fast enough for every request</p>
</li>
<li><p>Great Expectations is your <strong>batch auditor</strong> — thorough checks on datasets</p>
</li>
</ul>
<h3 id="heading-52-integrate-validation-into-fastapi"><strong>5.2 Integrate Validation into FastAPI</strong></h3>
<p>Now let's update our API to reject invalid input with clear error messages:</p>
<pre><code class="language-python"># src/serve_validated.py
"""
Serve fraud detection model with input validation.

This version adds data validation BEFORE making predictions:
- Invalid inputs are rejected with HTTP 400 and clear error messages
- Valid inputs are processed and predictions returned

This is much safer than the naive version which accepted garbage.
"""
import pickle
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.data_validation import validate_transaction

# Load model
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)

app = FastAPI(
    title="Fraud Detection API (Validated)",
    description="""
    Fraud detection API with input validation.
    
    All inputs are validated before prediction:
    - amount: Must be positive and at most $50,000
    - hour: Must be 0-23
    - day_of_week: Must be 0-6
    - merchant_category: Must be one of: grocery, restaurant, retail, online, travel
    
    Invalid inputs return HTTP 400 with detailed error messages.
    """,
    version="3.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount (must be positive)", example=150.00)
    hour: int = Field(..., description="Hour of day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Mon, 6=Sun)", example=3)
    merchant_category: str = Field(..., description="Merchant type", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    validation_passed: bool = True

class ValidationErrorResponse(BaseModel):
    detail: dict

@app.post("/predict", response_model=PredictionResponse, responses={400: {"model": ValidationErrorResponse}})
def predict(tx: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Input is validated before prediction. Invalid inputs return HTTP 400.
    """
    data = tx.dict()
    
    # VALIDATE INPUT BEFORE MAKING PREDICTION
    validation = validate_transaction(data)
    
    if not validation["valid"]:
        raise HTTPException(
            status_code=400,
            detail={
                "message": "Validation failed",
                "errors": validation["errors"],
                "input": data
            }
        )
    
    # Input is valid - make prediction
    data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        validation_passed=True
    )

@app.get("/health")
def health():
    return {"status": "healthy", "validation": "enabled"}
</code></pre>
<p>Start the validated API:</p>
<pre><code class="language-python">uvicorn src.serve_validated:app --reload --host 0.0.0.0 --port 8000
</code></pre>
<p>Now test with bad data:</p>
<pre><code class="language-python">curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": -500, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}'
</code></pre>
<p>Response (HTTP 400):</p>
<pre><code class="language-python">{
  "detail": {
    "message": "Validation failed",
    "errors": [
      "amount must be positive",
      "hour must be between 0 and 23 (got 25)",
      "day_of_week must be between 0 (Monday) and 6 (Sunday) (got 10)",
      "merchant_category must be one of ['grocery', 'restaurant', 'retail', 'online', 'travel'] (got 'fake')"
    ],
    "input": {"amount": -500.0, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}
  }
}
</code></pre>
<p><strong>This is a huge improvement!</strong> Instead of silently accepting garbage and returning meaningless predictions, we now:</p>
<ul>
<li><p>Reject invalid input immediately</p>
</li>
<li><p>Provide clear, actionable error messages</p>
</li>
<li><p>Return the original input for debugging</p>
</li>
<li><p>Use proper HTTP status codes (400 for client error)</p>
</li>
</ul>
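<p>Because the error body is structured, a client can pull out the individual messages instead of parsing free text. A minimal sketch, assuming the response shape shown above; the <code>validation_errors</code> helper is a hypothetical name:</p>
<pre><code class="language-python">import json

def validation_errors(response_body):
    """Extract the list of validation messages from a 400 response body."""
    payload = json.loads(response_body)
    detail = payload.get("detail", {})
    # FastAPI's own request-parsing errors (HTTP 422) put a list under "detail".
    # This sketch only handles the dict shape produced by our /predict endpoint.
    if not isinstance(detail, dict):
        return []
    return list(detail.get("errors", []))

# Shortened copy of the response body from the curl example
sample_body = json.dumps({
    "detail": {
        "message": "Validation failed",
        "errors": [
            "amount must be positive",
            "hour must be between 0 and 23 (got 25)",
        ],
    }
})

for message in validation_errors(sample_body):
    print(f"- {message}")
</code></pre>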
<p><strong>Checkpoint:</strong> Your validated API should:</p>
<ul>
<li><p>Accept valid transactions and return predictions</p>
</li>
<li><p>Reject invalid transactions with HTTP 400 and detailed error messages</p>
</li>
<li><p>Show validation errors for each invalid field</p>
</li>
</ul>
<h2 id="heading-6-monitor-model-performance-and-data-drift"><strong>6. Monitor Model Performance and Data Drift</strong></h2>
<p><strong>What breaks without this:</strong> Your model's accuracy drops from 98% to 70% over two months. Nobody notices until customers complain. By then, significant damage has occurred.</p>
<p>Even with a great model and clean input data, <strong>time can be an enemy</strong>. Model performance can decline as real-world data evolves – this is known as <strong>model drift</strong> or <strong>model decay</strong>.</p>
<p><strong>Why This Matters:</strong> In traditional software, you monitor CPU, memory, error rates, and response times. In ML, you must <strong>also</strong> monitor:</p>
<ul>
<li><p>Data quality (are inputs within expected ranges?)</p>
</li>
<li><p>Model performance (is accuracy holding up?)</p>
</li>
<li><p>Data drift (has input distribution changed?)</p>
</li>
<li><p>Prediction drift (has the distribution of predictions changed?)</p>
</li>
</ul>
<p>Without monitoring, your model could fail silently for weeks before anyone notices: fraud slips through, legitimate customers get blocked, and revenue is lost.</p>
<h3 id="heading-61-the-four-pillars-of-ml-observability"><strong>6.1 The Four Pillars of ML Observability</strong></h3>
<table>
<thead>
<tr>
<th><strong>Pillar</strong></th>
<th><strong>What to Monitor</strong></th>
<th><strong>Why It Matters</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Data Quality</strong></td>
<td>Are inputs valid? Nulls? Outliers?</td>
<td>Bad data causes bad predictions</td>
</tr>
<tr>
<td><strong>Model Performance</strong></td>
<td>Accuracy, precision, recall, F1</td>
<td>Is the model still working?</td>
</tr>
<tr>
<td><strong>Data Drift</strong></td>
<td>Has input distribution changed from training?</td>
<td>Model may not generalize to new data</td>
</tr>
<tr>
<td><strong>Prediction Drift</strong></td>
<td>Has prediction distribution changed?</td>
<td>May indicate data or concept drift</td>
</tr>
</tbody></table>
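<p>Prediction drift is the easiest pillar to check because it needs no ground-truth labels. One simple approach, sketched here under the assumption that you log each prediction as 0 or 1, is to compare the share of transactions flagged as fraud in a reference window against a recent window. The function names and the 5-percentage-point threshold are illustrative choices, not recommendations:</p>
<pre><code class="language-python">def positive_rate(predictions):
    """Share of predictions flagged as fraud (truthy values)."""
    return sum(1 for p in predictions if p) / len(predictions)

def prediction_drift(reference_preds, current_preds, max_abs_change=0.05):
    """Compare flagged-fraud rates between two windows of predictions."""
    ref_rate = positive_rate(reference_preds)
    cur_rate = positive_rate(current_preds)
    return {
        "reference_rate": ref_rate,
        "current_rate": cur_rate,
        # Alert when the rate moved by more than the allowed absolute change
        "alert": abs(cur_rate - ref_rate) &gt; max_abs_change,
    }

reference = [1] * 2 + [0] * 98   # 2% flagged during validation
current = [1] * 10 + [0] * 90    # 10% flagged this week

result = prediction_drift(reference, current)
print(result["alert"])  # True
</code></pre>
<p>A jump like this does not say whether fraud really increased or the inputs changed, but it tells you to look before customers do.</p>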
<h3 id="heading-62-build-a-drift-monitor-with-evidently"><strong>6.2 Build a Drift Monitor with Evidently</strong></h3>
<p><strong>Evidently</strong> is an open-source library specifically designed for ML monitoring. It can detect drift, generate reports, and integrate with monitoring systems.</p>
<p>Create <code>src/monitoring.py</code>:</p>
<pre><code class="language-python"># src/monitoring.py
"""
Model monitoring with Evidently.

This module provides tools to:
1. Detect data drift between training and production data
2. Generate detailed HTML reports
3. Track drift over time
4. Alert when drift exceeds thresholds

In production, you would run drift checks periodically (hourly, daily)
and alert when significant drift is detected.
"""
import pandas as pd
import numpy as np
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric
)
from datetime import datetime
from typing import List, Dict, Any, Optional

class DriftMonitor:
    """
    Monitor for detecting data drift between reference (training) and current data.
    
    Implementation Note: We use two tools here:
    1. Scipy's KS-test: a lightweight statistical test that check_drift() uses
       for the actual drift decision and alerting
    2. Evidently: a full-featured library that generate_report() uses to build
       a visual HTML report
    
    Keeping the KS-test separate is defensive coding: if Evidently fails to
    generate a report, we still get drift detection.
    
    Usage:
        monitor = DriftMonitor(training_data)
        result = monitor.check_drift(production_data)
        if result['drift_detected']:
            alert("Drift detected!")
    """
    
    def __init__(self, reference_data: pd.DataFrame, feature_columns: Optional[List[str]] = None):
        """
        Initialize the drift monitor with reference (training) data.
        
        Args:
            reference_data: The training data to compare against
            feature_columns: Columns to monitor (default: all numeric columns)
        """
        self.reference = reference_data
        self.feature_columns = feature_columns or reference_data.select_dtypes(
            include=[np.number]
        ).columns.tolist()
        self.history: List[Dict[str, Any]] = []
        
        print(f"Drift monitor initialized with {len(self.reference):,} reference samples")
        print(f"Monitoring columns: {self.feature_columns}")
    
    def check_drift(self, current_data: pd.DataFrame, threshold: float = 0.1) -&gt; Dict[str, Any]:
        """
        Check for drift between reference and current data.
        
        Args:
            current_data: Current/production data to check
            threshold: Drift share threshold for alerting (default 10%)
            
        Returns:
            Dictionary with drift results
        """
        from scipy import stats
        
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        # Simple statistical drift detection using KS test
        drifted_columns = []
        for col in self.feature_columns:
            statistic, p_value = stats.ks_2samp(
                ref_subset[col].dropna(),
                cur_subset[col].dropna()
            )
            if p_value &lt; 0.05:  # 5% significance level
                drifted_columns.append(col)
        
        n_features = len(self.feature_columns)
        n_drifted = len(drifted_columns)
        drift_share = n_drifted / n_features if n_features &gt; 0 else 0
        
        result = {
            'timestamp': datetime.now().isoformat(),
            'drift_detected': n_drifted &gt; 0,
            'drift_share': drift_share,
            'drifted_columns': drifted_columns,
            'n_features': n_features,
            'n_drifted': n_drifted,
            'current_samples': len(current_data),
            'threshold': threshold,
            'alert': drift_share &gt; threshold
        }
        
        self.history.append(result)
        
        return result
    
    def generate_report(self, current_data: pd.DataFrame, output_path: str = "drift_report.html"):
        """
        Generate a detailed HTML drift report using Evidently.
        
        Opens in browser for visual inspection of drift patterns.
        """
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        try:
            report = Report(metrics=[DataDriftPreset()])
            report.run(reference_data=ref_subset, current_data=cur_subset)
            
            # Save HTML report
            report.save_html(output_path)
            
            print(f"Drift report saved to {output_path}")
            print("Open this file in a browser to view detailed visualizations.")
        except Exception as e:
            print(f"Could not generate Evidently report: {e}")
            print("Using simplified drift detection instead.")
    
    def get_alerts(self, threshold: float = 0.1) -&gt; List[Dict[str, Any]]:
        """
        Get all alerts from history where drift exceeded threshold.
        """
        return [
            {
                'timestamp': r['timestamp'],
                'severity': 'HIGH' if r['drift_share'] &gt; 0.3 else 'MEDIUM',
                'drift_share': r['drift_share'],
                'message': f"Drift detected: {r['drift_share']:.1%} of features drifted",
                'drifted_columns': r['drifted_columns']
            }
            for r in self.history
            if r['drift_share'] &gt; threshold
        ]
    
    def summary(self) -&gt; Dict[str, Any]:
        """Get summary statistics from monitoring history."""
        if not self.history:
            return {"message": "No drift checks performed yet"}
        
        drift_shares = [r['drift_share'] for r in self.history]
        alerts = [r for r in self.history if r['alert']]
        
        return {
            'total_checks': len(self.history),
            'total_alerts': len(alerts),
            'avg_drift_share': np.mean(drift_shares),
            'max_drift_share': np.max(drift_shares),
            'first_check': self.history[0]['timestamp'],
            'last_check': self.history[-1]['timestamp']
        }


def simulate_drift_scenarios():
    """
    Demonstrate drift detection with different scenarios.
    
    This simulates what happens when production data differs from training data.
    """
    from src.generate_data import generate_transactions
    
    print("="*70)
    print("DRIFT DETECTION SIMULATION")
    print("="*70)
    
    # Load reference (training) data
    print("\n1. Loading reference data (training set)...")
    reference = pd.read_csv('data/train.csv')
    feature_cols = ['amount', 'hour', 'day_of_week']
    
    # Initialize drift monitor
    monitor = DriftMonitor(reference, feature_cols)
    
    # Scenario 1: Similar data (should show minimal drift)
    print("\n" + "-"*70)
    print("SCENARIO 1: Test data (similar distribution)")
    print("-"*70)
    test_data = pd.read_csv('data/test.csv')
    result = monitor.check_drift(test_data)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 2: Fraud spike (10% fraud instead of 2%)
    print("\n" + "-"*70)
    print("SCENARIO 2: Fraud spike (10% fraud rate instead of 2%)")
    print("-"*70)
    fraud_spike = generate_transactions(n_samples=2000, fraud_ratio=0.10, seed=101)
    result = monitor.check_drift(fraud_spike)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 3: Amount inflation (everything costs more)
    print("\n" + "-"*70)
    print("SCENARIO 3: Amount inflation (2x multiplier)")
    print("-"*70)
    inflated = test_data.copy()
    inflated['amount'] = inflated['amount'] * 2
    result = monitor.check_drift(inflated)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 4: Time shift (more late-night transactions)
    print("\n" + "-"*70)
    print("SCENARIO 4: Time shift (mostly late-night transactions)")
    print("-"*70)
    night_shift = test_data.copy()
    night_shift['hour'] = np.random.default_rng(42).choice([0, 1, 2, 3, 22, 23], size=len(night_shift))  # seeded so reruns match
    result = monitor.check_drift(night_shift)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Generate a detailed report for the time-shift scenario
    print("\n" + "-"*70)
    print("GENERATING DETAILED REPORT")
    print("-"*70)
    monitor.generate_report(night_shift, "drift_report.html")
    
    # Print summary
    print("\n" + "-"*70)
    print("MONITORING SUMMARY")
    print("-"*70)
    summary = monitor.summary()
    print(f"  Total checks: {summary['total_checks']}")
    print(f"  Total alerts: {summary['total_alerts']}")
    print(f"  Average drift share: {summary['avg_drift_share']:.1%}")
    print(f"  Maximum drift share: {summary['max_drift_share']:.1%}")
    
    # Print alerts
    alerts = monitor.get_alerts()
    if alerts:
        print(f"\n  Alerts ({len(alerts)}):")
        for alert in alerts:
            print(f"    [{alert['severity']}] {alert['message']}")
    
    print("\n" + "="*70)
    print("DRIFT DETECTION SIMULATION COMPLETE")
    print("="*70)
    print("\nOpen drift_report.html in your browser to see detailed visualizations!")


if __name__ == "__main__":
    simulate_drift_scenarios()
</code></pre>
<p>Run the drift simulation:</p>
<pre><code class="language-bash">python src/monitoring.py
</code></pre>
<p>The output reports, for each scenario, whether drift was detected, the drift share, the drifted columns, and whether an alert fired. Then open <code>drift_report.html</code> in your browser to see visualizations of the drift patterns.</p>
<h3 id="heading-63-production-monitoring-strategy"><strong>6.3 Production Monitoring Strategy</strong></h3>
<p>In a production environment, you would:</p>
<ol>
<li><p><strong>Log all predictions</strong> to a database or data warehouse</p>
</li>
<li><p><strong>Run drift checks periodically</strong> (hourly for high-traffic systems, daily for lower traffic)</p>
</li>
<li><p><strong>Set up alerts</strong> when drift exceeds thresholds (integrate with PagerDuty, Slack, etc.)</p>
</li>
<li><p><strong>Trigger retraining</strong> if drift is severe or sustained</p>
</li>
<li><p><strong>Create dashboards</strong> to track drift over time (Grafana, Datadog, etc.)</p>
</li>
</ol>
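<p>As a rough illustration of step 4, the sketch below turns a history of drift shares into a retraining decision. The function name <code>should_retrain</code> and its defaults (<code>severe=0.5</code>, <code>sustained=0.3</code>, <code>window=3</code>) are assumptions made for this example, not part of the project code, so treat the numbers as placeholders to tune against your own alert history.</p>
<pre><code class="language-python">from typing import Sequence


def should_retrain(drift_shares: Sequence[float],
                   severe: float = 0.5,
                   sustained: float = 0.3,
                   window: int = 3):
    """Return (decision, reason) from a chronological list of drift shares.

    Hypothetical policy: retrain when the latest check is severe, or when
    the last `window` checks all sit at or above the `sustained` level.
    """
    if not drift_shares:
        return False, "no drift checks yet"
    if drift_shares[-1] &gt;= severe:
        return True, "severe drift in latest check"
    recent = list(drift_shares[-window:])
    if len(recent) == window and min(recent) &gt;= sustained:
        return True, f"drift sustained over last {window} checks"
    return False, "drift within tolerance"
</code></pre>
<p>With the monitor from the previous section, one way to call it would be <code>should_retrain([r['drift_share'] for r in monitor.history])</code>. A single noisy check does not trigger retraining under this policy, but three moderately drifted checks in a row do.</p>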
<p><strong>Checkpoint:</strong> Running <code>python src/monitoring.py</code> should:</p>
<ul>
<li><p>Show minimal drift for similar data (test set)</p>
</li>
<li><p>Show significant drift for modified data (fraud spike, inflation, time shift)</p>
</li>
<li><p>Generate an HTML report that you can view in your browser</p>
</li>
</ul>
<h2 id="heading-7-automate-testing-and-deployment-with-cicd"><strong>7. Automate Testing and Deployment with CI/CD</strong></h2>
<p><strong>What breaks without this:</strong> A typo in your code breaks the API. You deploy on Friday at 5 PM. Nobody notices until Monday. Fraud losses spike over the weekend.</p>
<p><strong>CI/CD</strong> (Continuous Integration/Continuous Deployment) ensures reliable, repeatable releases. As JFrog notes: <em>"A strong CI/CD pipeline enables ML teams to build robust, bug-free models more quickly and efficiently."</em></p>
<p><strong>Why This Matters:</strong> In ML, changes aren't just code – they're also data and models. CI/CD ensures that when you change training logic, data preprocessing, or hyperparameters, tests verify the change doesn't break anything before it reaches production. It's the difference between deploying with confidence and deploying with crossed fingers.</p>
<h3 id="heading-71-write-tests-for-data-and-model"><strong>7.1 Write Tests for Data and Model</strong></h3>
<p>Create <code>tests/test_data_and_model.py</code>:</p>
<pre><code class="language-python"># tests/test_data_and_model.py
"""
Tests for data quality and model performance.

These tests run in CI/CD to ensure:
1. Data meets quality requirements
2. Model meets performance thresholds
3. No regressions are introduced

Run with: pytest tests/test_data_and_model.py -v
"""
import pandas as pd
import pickle
import pytest
from sklearn.metrics import f1_score, precision_score, recall_score

class TestDataQuality:
    """Tests for training data quality."""
    
    @pytest.fixture
    def train_data(self):
        return pd.read_csv("data/train.csv")
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_train_data_has_expected_columns(self, train_data):
        """Training data must have all required columns."""
        required_columns = {"amount", "hour", "day_of_week", "merchant_category", "is_fraud"}
        actual_columns = set(train_data.columns)
        missing = required_columns - actual_columns
        assert not missing, f"Missing columns: {missing}"
    
    def test_train_data_not_empty(self, train_data):
        """Training data must have rows."""
        assert len(train_data) &gt; 0, "Training data is empty"
        assert len(train_data) &gt;= 1000, f"Training data too small: {len(train_data)} rows"
    
    def test_no_negative_amounts(self, train_data):
        """Transaction amounts must be non-negative."""
        negative_count = (train_data["amount"] &lt; 0).sum()
        assert negative_count == 0, f"Found {negative_count} negative amounts"
    
    def test_amounts_reasonable(self, train_data):
        """Transaction amounts should be within reasonable bounds."""
        max_amount = train_data["amount"].max()
        assert max_amount &lt;= 100000, f"Max amount {max_amount} exceeds reasonable limit"
    
    def test_hours_valid(self, train_data):
        """Hours must be 0-23."""
        invalid = train_data[(train_data["hour"] &lt; 0) | (train_data["hour"] &gt; 23)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid hours"
    
    def test_days_valid(self, train_data):
        """Days of week must be 0-6."""
        invalid = train_data[(train_data["day_of_week"] &lt; 0) | (train_data["day_of_week"] &gt; 6)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid days"
    
    def test_merchant_categories_valid(self, train_data):
        """Merchant categories must be from known set."""
        valid_categories = {"grocery", "restaurant", "retail", "online", "travel"}
        actual_categories = set(train_data["merchant_category"].unique())
        invalid = actual_categories - valid_categories
        assert not invalid, f"Invalid merchant categories: {invalid}"
    
    def test_fraud_ratio_reasonable(self, train_data):
        """Fraud ratio should be realistic (between 0.1% and 50%)."""
        fraud_ratio = train_data["is_fraud"].mean()
        assert 0.001 &lt;= fraud_ratio &lt;= 0.5, f"Fraud ratio {fraud_ratio:.2%} is unrealistic"
    
    def test_no_nulls_in_critical_columns(self, train_data):
        """Critical columns must not have null values."""
        critical = ["amount", "hour", "day_of_week", "merchant_category", "is_fraud"]
        for col in critical:
            null_count = train_data[col].isnull().sum()
            assert null_count == 0, f"Column {col} has {null_count} null values"


class TestModelPerformance:
    """Tests for model performance thresholds."""
    
    @pytest.fixture
    def model_and_encoder(self):
        with open("models/model.pkl", "rb") as f:
            return pickle.load(f)
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_model_loads_successfully(self, model_and_encoder):
        """Model file must load without errors."""
        model, encoder = model_and_encoder
        assert model is not None, "Model is None"
        assert encoder is not None, "Encoder is None"
    
    def test_model_can_predict(self, model_and_encoder, test_data):
        """Model must be able to make predictions."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        predictions = model.predict(X)
        assert len(predictions) == len(X), "Prediction count mismatch"
    
    def test_accuracy_threshold(self, model_and_encoder, test_data):
        """Model accuracy must be at least 90%."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        accuracy = model.score(X, y)
        assert accuracy &gt;= 0.90, f"Accuracy {accuracy:.2%} below 90% threshold"
    
    def test_f1_threshold(self, model_and_encoder, test_data):
        """Model F1-score must be at least 0.3 (sanity check for imbalanced data)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        f1 = f1_score(y, y_pred)
        assert f1 &gt;= 0.3, f"F1-score {f1:.2f} below 0.3 threshold"
    
    def test_precision_not_zero(self, model_and_encoder, test_data):
        """Model precision must be greater than 0 (at least one fraud flag is correct)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        precision = precision_score(y, y_pred, zero_division=0)
        assert precision &gt; 0, "Model has zero precision (no fraud prediction was correct)"
    
    def test_recall_not_zero(self, model_and_encoder, test_data):
        """Model recall must be greater than 0 (catches at least some fraud)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        recall = recall_score(y, y_pred, zero_division=0)
        assert recall &gt; 0, "Model has zero recall (misses all fraud)"
</code></pre>
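<p>The F1, precision, and recall tests sit alongside the accuracy threshold for a reason: on imbalanced data, accuracy alone can pass while the model is useless. The small worked example below uses a made-up "model" that never flags fraud on a 2% fraud dataset. The helper <code>accuracy_and_recall</code> is written only for this illustration and uses plain Python rather than scikit-learn.</p>
<pre><code class="language-python">def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and recall for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == 1 and p == 1 for t, p in pairs)
    actual_pos = sum(t == 1 for t in y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return correct / len(pairs), recall


y_true = [1] * 20 + [0] * 980   # 1,000 transactions, 2% fraud
y_pred = [0] * 1000             # a "model" that never flags fraud

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.1%}, recall={recall:.1%}")
# prints: accuracy=98.0%, recall=0.0%
</code></pre>
<p>This do-nothing model would clear the 90% accuracy test while missing every fraud case, which is exactly what <code>test_recall_not_zero</code> and <code>test_f1_threshold</code> are there to catch.</p>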
<p>Create <code>tests/test_api.py</code>:</p>
<pre><code class="language-python"># tests/test_api.py
"""
Tests for the FastAPI prediction service.

These tests ensure the API:
1. Returns correct responses for valid inputs
2. Rejects invalid inputs with proper error messages
3. Health check works

Run with: pytest tests/test_api.py -v
Note: Requires the API to be running on localhost:8000
"""
import pytest
import httpx

BASE_URL = "http://localhost:8000"

class TestPredictionEndpoint:
    """Tests for the /predict endpoint."""
    
    def test_valid_prediction_returns_200(self):
        """Valid input should return HTTP 200 with prediction."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        assert "is_fraud" in data
        assert "fraud_probability" in data
        assert isinstance(data["is_fraud"], bool)
        assert 0 &lt;= data["fraud_probability"] &lt;= 1
    
    def test_high_risk_transaction(self):
        """Late-night, high-value transaction should still return a valid probability."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 500.0,
            "hour": 3,  # Late night
            "day_of_week": 1,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        # Only the valid range is asserted: the exact probability depends on
        # the trained model, so a fixed "high risk" cutoff would be flaky.
        assert 0 &lt;= data["fraud_probability"] &lt;= 1
    
    def test_negative_amount_rejected(self):
        """Negative amount should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": -100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
        assert "errors" in response.json()["detail"]
    
    def test_invalid_hour_rejected(self):
        """Invalid hour should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 25,  # Invalid
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_invalid_merchant_rejected(self):
        """Unknown merchant category should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "unknown_category"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_missing_field_rejected(self):
        """Missing required field should be rejected."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14
            # Missing day_of_week and merchant_category
        }, timeout=10)
        
        assert response.status_code == 422  # Pydantic validation error


class TestHealthEndpoint:
    """Tests for the /health endpoint."""
    
    def test_health_returns_200(self):
        """Health endpoint should return 200."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        assert response.status_code == 200
    
    def test_health_returns_healthy_status(self):
        """Health endpoint should indicate healthy status."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        data = response.json()
        assert data["status"] == "healthy"
</code></pre>
<p>Run tests locally:</p>
<pre><code class="language-bash"># Run data and model tests (API not needed)
pytest tests/test_data_and_model.py -v

# Run API tests (requires API to be running)
pytest tests/test_api.py -v
</code></pre>
<h3 id="heading-72-github-actions-workflow"><strong>7.2 GitHub Actions Workflow</strong></h3>
<p>⚠️ <strong>Note for Production Teams</strong><br>In real ML teams, you typically don't retrain full models inside CI — it's slow and resource-intensive.<br>Here we do it to keep everything local, reproducible, and self-contained for learning.<br>Production pipelines usually separate training (scheduled jobs) from testing (CI/CD).</p>
<p>Create <code>.github/workflows/ci.yml</code>:</p>
<pre><code class="language-yaml"># .github/workflows/ci.yml
name: ML Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: 'pip'
      
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      
      - name: Generate training data
        run: python src/generate_data.py
      
      - name: Train model
        run: python src/train_naive.py
      
      - name: Run data and model tests
        run: pytest tests/test_data_and_model.py -v --tb=short
      
      - name: Build Docker image
        run: docker build -t fraud-detection-api .
      
      - name: Run container for API tests
        run: |
          docker run -d -p 8000:8000 --name test-api fraud-detection-api
          sleep 10  # Wait for API to start
          curl -f http://localhost:8000/health || exit 1
      
      - name: Run API tests
        run: pytest tests/test_api.py -v --tb=short
      
      - name: Cleanup
        if: always()
        run: docker stop test-api || true
</code></pre>
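<p>The workflow above waits a fixed <code>sleep 10</code> before checking the API, which can be too short on a slow runner and wasteful on a fast one. A minimal alternative, sketched here with only the standard library, is to poll <code>/health</code> until it answers. The function name <code>wait_for_health</code>, its defaults, and the <code>opener</code> parameter are assumptions for this example and do not exist in the project.</p>
<pre><code class="language-python">import time
import urllib.request


def wait_for_health(url, max_attempts=30, interval_s=2.0,
                    opener=urllib.request.urlopen):
    """Poll `url` until it answers HTTP 200; give up after `max_attempts`.

    `opener` exists only so the function can be exercised without a
    real server; by default it is urllib's urlopen.
    """
    for attempt in range(max_attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # URLError, refused connections, and timeouts all derive from
            # OSError: the API is not ready yet, so fall through and retry.
            pass
        time.sleep(interval_s)
    return False
</code></pre>
<p>In the workflow, a small script (for instance a hypothetical <code>scripts/wait_for_api.py</code>) could call <code>wait_for_health("http://localhost:8000/health")</code> and exit with a non-zero status when it returns <code>False</code>, replacing the fixed sleep.</p>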
<h3 id="heading-73-dockerize-the-application"><strong>7.3 Dockerize the Application</strong></h3>
<p>Create <code>Dockerfile</code>:</p>
<pre><code class="language-dockerfile"># Dockerfile
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update &amp;&amp; apt-get install -y \
    curl \
    &amp;&amp; rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ src/
COPY models/ models/
COPY data/ data/

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the API
CMD ["uvicorn", "src.serve_validated:app", "--host", "0.0.0.0", "--port", "8000"]
</code></pre>
<p>Create <code>.dockerignore</code>:</p>
<pre><code class="language-python"># .dockerignore
venv/
__pycache__/
*.pyc
.git/
.github/
mlruns/
*.db
*.html
.pytest_cache/
</code></pre>
<p>Build and run locally:</p>
<pre><code class="language-bash"># Build the Docker image
docker build -t fraud-detection-api .

# Run the container
docker run -p 8000:8000 fraud-detection-api

# Test it
curl http://localhost:8000/health
</code></pre>
<p><strong>Checkpoint:</strong></p>
<ul>
<li><p>All tests pass: <code>pytest tests/test_data_and_model.py -v</code></p>
</li>
<li><p>Docker image builds successfully</p>
</li>
<li><p>Container runs and responds to health checks</p>
</li>
</ul>
<h2 id="heading-8-incident-response-playbook"><strong>8. Incident Response Playbook</strong></h2>
<p>When things go wrong in production (and they will), you need a plan. This section provides playbooks for common ML incidents.</p>
<h3 id="heading-scenario-false-positive-spike"><strong>Scenario: False Positive Spike</strong></h3>
<p><strong>Symptoms:</strong> Your fraud model suddenly flags 40% of legitimate transactions as fraud, blocking customers and overwhelming your manual review team.</p>
<p><strong>Severity:</strong> HIGH - Direct customer impact</p>
<p><strong>Phase 1: Mitigation (0-5 minutes)</strong></p>
<ol>
<li><p><strong>Acknowledge the incident</strong> - Notify stakeholders that you're aware and responding</p>
</li>
<li><p><strong>Roll back to the previous model</strong> - In the MLflow UI, move the @champion alias to the previous model version</p>
</li>
<li><p><strong>Restart the API</strong> - <code>docker restart fraud-api</code> or redeploy</p>
</li>
<li><p><strong>Verify</strong> - Check that false positive rate has returned to normal</p>
</li>
<li><p><strong>Communicate</strong> - "Issue detected and mitigated. Investigating root cause."</p>
</li>
</ol>
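<p>During an incident, clicking through a UI is slow and error-prone, so the rollback in step 2 can also be scripted. The sketch below relies on two <code>MlflowClient</code> methods available in MLflow versions that support aliases: <code>get_model_version_by_alias</code> and <code>set_registered_model_alias</code>. The helper name <code>rollback_champion</code> is hypothetical, and it assumes the previous champion is simply the version registered just before the current one, which may not hold in your registry.</p>
<pre><code class="language-python">def rollback_champion(client, model_name, alias="champion"):
    """Point `alias` at the version registered just before the current one.

    `client` is expected to be an mlflow.MlflowClient. Assumes the previous
    champion is version N-1; adjust if your promotion history differs.
    """
    current = int(client.get_model_version_by_alias(model_name, alias).version)
    if current == 1:
        raise RuntimeError(f"{model_name} has no earlier version to roll back to")
    previous = current - 1
    client.set_registered_model_alias(model_name, alias, str(previous))
    return previous
</code></pre>
<p>A call might look like <code>rollback_champion(MlflowClient(), "fraud-detection-model")</code>, where the model name is a placeholder for whatever name you registered. The API still has to reload the model afterwards, which is why step 3 restarts it.</p>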
<p><strong>Phase 2: Diagnosis (5-60 minutes)</strong></p>
<ol>
<li><p><strong>Check drift report</strong> - Run <code>python src/monitoring.py</code> with recent production data</p>
</li>
<li><p><strong>Check data validation logs</strong> - Did upstream data format change?</p>
</li>
<li><p><strong>Check recent deployments</strong> - Was there a new model or code deployed recently?</p>
</li>
<li><p><strong>Compare metrics</strong> - What's different between the rolled-back and problematic model?</p>
</li>
</ol>
<p><strong>Example root causes:</strong></p>
<ul>
<li><p>Upstream system sent amounts in cents instead of dollars</p>
</li>
<li><p>New merchant category appeared that wasn't in training data</p>
</li>
<li><p>Holiday shopping patterns differed significantly from training data</p>
</li>
</ul>
<p><strong>Phase 3: Remediation (1-24 hours)</strong></p>
<ol>
<li><p><strong>Fix the root cause</strong> - Add validation for the edge case, or update training data</p>
</li>
<li><p><strong>Retrain if needed</strong> - Include new patterns in training data</p>
</li>
<li><p><strong>Add test case</strong> - Prevent this from happening again</p>
</li>
<li><p><strong>Document</strong> - Add to runbook for future reference</p>
</li>
</ol>
<h3 id="heading-scenario-gradual-performance-decay"><strong>Scenario: Gradual Performance Decay</strong></h3>
<p><strong>Symptoms:</strong> Monitoring shows fraud recall dropping 2% per week over a month. No sudden failures, just slow degradation.</p>
<p><strong>Severity:</strong> MEDIUM - Gradual impact, time to respond</p>
<p><strong>Response:</strong></p>
<ol>
<li><p><strong>Investigate drift report</strong> - Look for gradual distribution changes</p>
<pre><code class="language-bash">python src/monitoring.py
</code></pre>
</li>
<li><p><strong>Collect recent labeled data</strong> - Get confirmed fraud cases from the past month</p>
</li>
<li><p><strong>Analyze patterns</strong> - What's different about recent fraud?</p>
<ul>
<li><p>New attack vectors?</p>
</li>
<li><p>Different time patterns?</p>
</li>
<li><p>New merchant categories?</p>
</li>
</ul>
</li>
<li><p><strong>Retrain on combined data</strong> - Include both old and new patterns</p>
<pre><code class="language-bash">python src/train_mlflow.py
</code></pre>
</li>
<li><p><strong>Deploy via canary</strong> - Route 10% of traffic to the new model first</p>
<ul>
<li><p>Monitor metrics for 1-2 days</p>
</li>
<li><p>If metrics improve, increase to 50%, then 100%</p>
</li>
<li><p>If metrics worsen, roll back</p>
</li>
</ul>
</li>
<li><p><strong>Set up recurring retraining</strong> - Schedule weekly or monthly retraining</p>
</li>
</ol>
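<p>The canary step in this playbook needs some rule for deciding which requests reach the new model. One common approach, sketched below, is to hash a stable request identifier into 100 buckets and send the lowest buckets to the canary. The function <code>route_model</code> and the <code>transaction_id</code> argument are assumptions for this example: the request schema used in this handbook has no ID field, so you would need to add one or hash another stable key.</p>
<pre><code class="language-python">import hashlib


def route_model(transaction_id, canary_percent=10):
    """Deterministically assign a request to 'canary' or 'champion'.

    Hashing the ID (instead of drawing a random number) keeps a given
    transaction on the same model across retries.
    """
    digest = hashlib.sha256(str(transaction_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "canary" if bucket in range(canary_percent) else "champion"
</code></pre>
<p>Raising <code>canary_percent</code> from 10 to 50 to 100 follows the rollout described above, and setting it to 0 is an immediate rollback. Requests already in the canary group stay there as the percentage grows, because their bucket never changes.</p>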
<h3 id="heading-scenario-upstream-data-schema-change"><strong>Scenario: Upstream Data Schema Change</strong></h3>
<p><strong>Symptoms:</strong> API starts returning 500 errors. Logs show <code>KeyError: 'merchant_category'</code>.</p>
<p><strong>Severity:</strong> HIGH - Service is down</p>
<p><strong>Response:</strong></p>
<ol>
<li><p><strong>Check error logs</strong> - Identify the exact error</p>
<pre><code class="language-python">KeyError: 'merchant_category'
</code></pre>
</li>
<li><p><strong>Check upstream data</strong> - Did the field name change?</p>
<ul>
<li><p><code>merchant_category</code> -&gt; <code>category</code></p>
</li>
<li><p><code>amount</code> -&gt; <code>transaction_amount</code></p>
</li>
</ul>
</li>
<li><p><strong>Immediate fix</strong> - Add field name mapping</p>
<pre><code class="language-python"># Quick fix in API
if 'category' in data and 'merchant_category' not in data:
    data['merchant_category'] = data['category']
</code></pre>
</li>
<li><p><strong>Long-term fix</strong> - Add validation that catches schema changes</p>
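<pre><code class="language-python">from fastapi import HTTPException

required_fields = ['amount', 'hour', 'day_of_week', 'merchant_category']
missing = [f for f in required_fields if f not in data]
if missing:
    # Fail fast with a 400 that names the missing fields
    raise HTTPException(status_code=400, detail={"errors": [f"Missing fields: {missing}"]})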
</code></pre>
</li>
<li><p><strong>Add integration test</strong> - Test with upstream system in CI/CD</p>
</li>
</ol>
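<p>The quick fix and the long-term fix above can be folded into one small function that is easy to cover with a regression test. This is a sketch under assumptions: <code>FIELD_ALIASES</code>, <code>REQUIRED_FIELDS</code>, and <code>normalize_payload</code> are names invented here, and the alias table contains only the two renames listed in step 2.</p>
<pre><code class="language-python">FIELD_ALIASES = {
    "category": "merchant_category",
    "transaction_amount": "amount",
}
REQUIRED_FIELDS = ("amount", "hour", "day_of_week", "merchant_category")


def normalize_payload(data):
    """Map known upstream renames back to expected names.

    Returns (normalized_copy, missing_fields); the input is not mutated.
    """
    normalized = dict(data)
    for upstream_name, expected_name in FIELD_ALIASES.items():
        if upstream_name in normalized and expected_name not in normalized:
            normalized[expected_name] = normalized.pop(upstream_name)
    missing = [f for f in REQUIRED_FIELDS if f not in normalized]
    return normalized, missing


def test_renamed_fields_are_accepted():
    """Regression test for the schema-change incident."""
    payload = {"transaction_amount": 42.0, "hour": 9,
               "day_of_week": 2, "category": "online"}
    normalized, missing = normalize_payload(payload)
    assert missing == []
    assert normalized["amount"] == 42.0
    assert normalized["merchant_category"] == "online"
</code></pre>
<p>Keeping the alias table in one place means the next upstream rename is a one-line change plus a new test case, not another conditional scattered through the API.</p>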
<h2 id="heading-9-how-to-put-it-all-together"><strong>9. How to Put It All Together</strong></h2>
<p>Let's step back and appreciate what we've built. Our initial naive system has transformed into a <strong>local ML platform</strong> with production-grade components.</p>
<blockquote>
<p>💡 <strong>Mental Model</strong>: Each tool in this stack is a "catch net" for a specific failure mode:</p>
<ul>
<li><p>MLflow catches "which model is this?"</p>
</li>
<li><p>Feast catches "are features consistent?"</p>
</li>
<li><p>Great Expectations catches "is this data valid?"</p>
</li>
<li><p>Evidently catches "has the world changed?"</p>
</li>
<li><p>CI/CD catches "did we break something?"</p>
</li>
</ul>
<p>Together, they form defense-in-depth for ML systems.</p>
</blockquote>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>Tool</strong></th>
<th><strong>Problem Solved</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Experiment Tracking</strong></td>
<td>MLflow</td>
<td>Every run logged, reproducible</td>
</tr>
<tr>
<td><strong>Model Registry</strong></td>
<td>MLflow</td>
<td>Versioned models, rollback capability</td>
</tr>
<tr>
<td><strong>Feature Store</strong></td>
<td>Feast</td>
<td>Consistent features, no training-serving skew</td>
</tr>
<tr>
<td><strong>Data Validation</strong></td>
<td>Great Expectations</td>
<td>Bad data rejected with clear errors</td>
</tr>
<tr>
<td><strong>Monitoring</strong></td>
<td>Evidently</td>
<td>Drift detected before it causes problems</td>
</tr>
<tr>
<td><strong>Containerization</strong></td>
<td>Docker</td>
<td>Environment consistency everywhere</td>
</tr>
<tr>
<td><strong>CI/CD</strong></td>
<td>GitHub Actions</td>
<td>Automated testing and safe deployments</td>
</tr>
</tbody></table>
<h3 id="heading-the-complete-workflow"><strong>The Complete Workflow</strong></h3>
<p>Here's how all the pieces work together in practice:</p>
<ol>
<li><p><strong>Data arrives</strong> - New transaction data comes in from upstream systems</p>
</li>
<li><p><strong>Validation gate</strong> - Great Expectations rules check data quality. Bad data is rejected with clear error messages before it can cause harm.</p>
</li>
<li><p><strong>Feature computation</strong> - Feast computes features using the same definitions for both training and serving. No more training-serving skew.</p>
</li>
<li><p><strong>Training</strong> - When you retrain, MLflow logs all parameters, metrics, and artifacts. Every experiment is reproducible and comparable.</p>
</li>
<li><p><strong>Model registry</strong> - Trained models are automatically versioned. You can compare metrics, promote the best version by assigning it the @champion alias, and roll back if needed.</p>
</li>
<li><p><strong>Serving</strong> - FastAPI loads the @champion model from MLflow. Each request is validated, features are retrieved from Feast, and predictions are returned.</p>
</li>
<li><p><strong>Monitoring</strong> - Evidently checks for drift periodically. If input distributions change significantly, alerts are triggered.</p>
</li>
<li><p><strong>Retraining loop</strong> - When drift is detected, you retrain on new data, compare metrics, and promote if better. The cycle continues.</p>
</li>
<li><p><strong>CI/CD safety net</strong> - All code changes go through automated tests. Docker ensures environment consistency. Nothing reaches production without passing the pipeline.</p>
</li>
</ol>
<h2 id="heading-10-whats-next-scale-to-production"><strong>10. What's Next: Scale to Production</strong></h2>
<p>This project runs locally, but the principles and tools extend directly to production deployments. Here's how each component scales:</p>
<h3 id="heading-scaling-feast-for-production"><strong>Scaling Feast for Production</strong></h3>
<p>We used Feast with local stores: SQLite for the online store and Parquet files for the offline store. For production:</p>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>Local</strong></th>
<th><strong>Production</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Online Store</td>
<td>SQLite</td>
<td>Redis, DynamoDB, or PostgreSQL</td>
</tr>
<tr>
<td>Offline Store</td>
<td>Parquet files</td>
<td>BigQuery, Snowflake, or Redshift</td>
</tr>
<tr>
<td>Feature Server</td>
<td>Embedded</td>
<td>Dedicated Feast serving cluster</td>
</tr>
</tbody></table>
<p>Benefits at scale:</p>
<ul>
<li><p>Sub-10ms feature retrieval</p>
</li>
<li><p>Horizontal scaling for high throughput</p>
</li>
<li><p>Feature monitoring and statistics</p>
</li>
<li><p>Point-in-time joins at petabyte scale</p>
</li>
</ul>
<h3 id="heading-scaling-mlflow-for-production"><strong>Scaling MLflow for Production</strong></h3>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>Local</strong></th>
<th><strong>Production</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Backend Store</td>
<td>SQLite</td>
<td>PostgreSQL or MySQL</td>
</tr>
<tr>
<td>Artifact Store</td>
<td>Local filesystem</td>
<td>S3, GCS, or Azure Blob</td>
</tr>
<tr>
<td>Tracking Server</td>
<td>Single instance</td>
<td>Load-balanced cluster</td>
</tr>
</tbody></table>
<h3 id="heading-kubernetes-deployment"><strong>Kubernetes Deployment</strong></h3>
<p>When you outgrow Docker Compose:</p>
<ul>
<li><p><strong>KServe or Seldon</strong> for serverless model serving with auto-scaling</p>
</li>
<li><p><strong>Horizontal Pod Autoscaler</strong> to scale based on CPU/memory/custom metrics</p>
</li>
<li><p><strong>Canary deployments</strong> to safely roll out new models (route 10% traffic first)</p>
</li>
<li><p><strong>GPU scheduling</strong> for inference-heavy models</p>
</li>
</ul>
<h3 id="heading-advanced-monitoring"><strong>Advanced Monitoring</strong></h3>
<p>Expand observability with:</p>
<ul>
<li><p><strong>Prometheus + Grafana</strong> for real-time dashboards</p>
</li>
<li><p><strong>OpenTelemetry</strong> for distributed tracing</p>
</li>
<li><p><strong>PagerDuty/Slack integration</strong> for alerts</p>
</li>
<li><p><strong>Labeled data collection</strong> for continuous model evaluation</p>
</li>
</ul>
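<p>For the Slack integration, incoming webhooks accept a JSON body with a <code>text</code> field, so forwarding drift alerts takes very little code. The sketch below assumes alerts shaped like those printed in the drift simulation (dictionaries with <code>severity</code> and <code>message</code> keys). The function names are made up for this example, and the webhook URL is something you would create in your own Slack workspace.</p>
<pre><code class="language-python">import json
import urllib.request


def build_slack_payload(alerts):
    """Format drift alerts as a Slack incoming-webhook payload."""
    lines = [f"[{a['severity']}] {a['message']}" for a in alerts]
    return {"text": "Drift alerts:\n" + "\n".join(lines)}


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
</code></pre>
<p>A periodic job could then call <code>post_to_slack(url, build_slack_payload(monitor.get_alerts()))</code> whenever the alert list is non-empty. Keeping payload construction separate from the network call makes the formatting testable without a real webhook.</p>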
<h3 id="heading-ab-testing-and-multi-armed-bandits"><strong>A/B Testing and Multi-Armed Bandits</strong></h3>
<p>With versioned models in the registry, you can:</p>
<ul>
<li><p>Serve <strong>multiple models</strong> concurrently (champion vs challengers)</p>
</li>
<li><p><strong>Route traffic</strong> dynamically based on context</p>
</li>
<li><p><strong>Collect metrics</strong> for each model variant</p>
</li>
<li><p><strong>Automatically promote</strong> the best performer</p>
</li>
</ul>
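<p>To make the multi-armed bandit idea concrete, here is a minimal epsilon-greedy router: most requests go to the model with the best average reward so far, and a small fraction explore the alternatives. Everything in it is an assumption for illustration, including the class name <code>EpsilonGreedyRouter</code> and the notion of "reward", which in a fraud setting might be 1 when a prediction later matches the confirmed label and 0 otherwise.</p>
<pre><code class="language-python">import random


class EpsilonGreedyRouter:
    """Pick a model per request: mostly exploit the best, sometimes explore."""

    def __init__(self, model_names, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {name: 0 for name in model_names}
        self.reward_sum = {name: 0.0 for name in model_names}

    def mean_reward(self, name):
        pulls = self.pulls[name]
        return self.reward_sum[name] / pulls if pulls else 0.0

    def choose(self):
        names = list(self.pulls)
        if self.rng.random() &lt; self.epsilon:
            return self.rng.choice(names)         # explore
        return max(names, key=self.mean_reward)   # exploit

    def record(self, name, reward):
        """Store the observed reward for the model that served a request."""
        self.pulls[name] += 1
        self.reward_sum[name] += reward
</code></pre>
<p>In use, the API would call <code>choose()</code> to select a model for each request and <code>record()</code> once the true outcome is known. Because fraud labels often arrive days later, the reward updates lag behind the routing decisions, which is one reason to keep <code>epsilon</code> above zero.</p>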
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Congratulations on building a production-ready ML system on your local machine!</p>
<p>What we assembled here is a microcosm of real-world ML platforms:</p>
<ul>
<li><p>We started with just a model saved to a pickle file</p>
</li>
<li><p>We ended up with <strong>MLOps best practices</strong>: experiment tracking, model versioning, feature stores, data validation, monitoring, containerization, and CI/CD</p>
</li>
</ul>
<p><strong>The tools we used are production-grade:</strong></p>
<ul>
<li><p><strong>MLflow</strong> powers ML platforms at companies like Microsoft, Facebook, and Databricks</p>
</li>
<li><p><strong>Feast</strong> is used by companies like Gojek, Shopify, and Robinhood</p>
</li>
<li><p><strong>FastAPI</strong> is one of the fastest Python web frameworks</p>
</li>
<li><p><strong>Great Expectations</strong> is used at companies like GitHub and Shopify</p>
</li>
<li><p><strong>Evidently</strong> is used for monitoring ML in production at scale</p>
</li>
</ul>
<p><strong>The principles apply at any scale:</strong></p>
<ul>
<li><p>Always track experiments</p>
</li>
<li><p>Always version models</p>
</li>
<li><p>Always validate data</p>
</li>
<li><p>Always monitor for drift</p>
</li>
<li><p>Always containerize for consistency</p>
</li>
<li><p>Always automate testing</p>
</li>
</ul>
<h3 id="heading-next-steps-you-can-try"><strong>Next Steps You Can Try</strong></h3>
<ol>
<li><p><strong>Deploy to the cloud</strong> - Push your Docker container to AWS ECS, Google Cloud Run, or Azure Container Instances</p>
</li>
<li><p><strong>Add model explainability</strong> - Use SHAP or LIME to explain individual predictions</p>
</li>
<li><p><strong>Implement A/B testing</strong> - Serve multiple models and compare performance</p>
</li>
<li><p><strong>Add feature importance monitoring</strong> - Track how feature importance changes over time</p>
</li>
<li><p><strong>Set up real-time alerting</strong> - Connect Evidently to Slack or PagerDuty</p>
</li>
<li><p><strong>Implement continuous training</strong> - Automatically retrain when drift is detected</p>
</li>
<li><p><strong>Add bias and fairness monitoring</strong> - Ensure your model treats all groups fairly</p>
</li>
</ol>
<p>Remember that productionizing ML is an <strong>iterative process</strong>. There's always another layer of robustness to add, another edge case to handle, another metric to track. But with the foundation you've built here, you're well on your way to taking models from promising notebook experiments to deployed, monitored, and maintainable production applications.</p>
<p>Happy building, and may your models be accurate and your pipelines resilient!</p>
<h2 id="heading-get-the-complete-code">Get the Complete Code</h2>
<p>The entire project from this handbook is available as a public GitHub repository:</p>
<p><strong>🔗</strong> <a href="https://github.com/sandeepmb/freecodecamp-local-ml-platform"><strong>github.com/sandeepmb/freecodecamp-local-ml-platform</strong></a></p>
<p>The repository includes:</p>
<ul>
<li><p>All source code (<code>src/</code> directory)</p>
</li>
<li><p>Test files (<code>tests/</code> directory)</p>
</li>
<li><p>Feast feature definitions (<code>feature_repo/</code>)</p>
</li>
<li><p>Docker and CI/CD configuration</p>
</li>
<li><p>Ready-to-run scripts</p>
</li>
</ul>
<p><strong>Quick Start:</strong></p>
<pre><code class="language-bash">git clone https://github.com/sandeepmb/freecodecamp-local-ml-platform.git
cd freecodecamp-local-ml-platform
python -m venv venv &amp;&amp; source venv/bin/activate
pip install -r requirements.txt
python src/generate_data.py
python src/train_naive.py
</code></pre>
<hr>
<h2 id="heading-references"><strong>References</strong></h2>
<ul>
<li><p><a href="https://mlflow.org/docs/latest/">MLflow Documentation</a> - Experiment tracking and model registry</p>
</li>
<li><p><a href="https://docs.feast.dev/">Feast Documentation</a> - Feature store</p>
</li>
<li><p><a href="https://docs.feast.dev/getting-started/quickstart">Feast Quickstart</a> - Getting started with Feast</p>
</li>
<li><p><a href="https://fastapi.tiangolo.com/">FastAPI Documentation</a> - Modern Python web framework</p>
</li>
<li><p><a href="https://greatexpectations.io/">Great Expectations</a> - Data validation</p>
</li>
<li><p><a href="https://docs.evidentlyai.com/">Evidently AI Documentation</a> - ML monitoring</p>
</li>
<li><p><a href="https://jfrog.com/learn/mlops/cicd-for-machine-learning/">CI/CD for Machine Learning (JFrog)</a> - CI/CD best practices</p>
</li>
<li><p><a href="https://www.qwak.com/post/training-serving-skew-in-machine-learning">Training-Serving Skew Explained</a> - Understanding skew</p>
</li>
<li><p><a href="https://docs.docker.com/">Docker Documentation</a> - Containerization</p>
</li>
<li><p><a href="https://docs.github.com/en/actions">GitHub Actions Documentation</a> - CI/CD automation</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Containerize Your MLOps Pipeline from Training to Serving ]]>
                </title>
                <description>
                    <![CDATA[ Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deplo ]]>
                </description>
                <link>https://www.freecodecamp.org/news/containerize-mlops-pipeline-from-training-to-serving/</link>
                <guid isPermaLink="false">69b33f5993256dfc5313bee2</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ production ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ NVIDIA ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Thu, 12 Mar 2026 22:34:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/156eaca3-8884-4f57-9010-9766278dbf5a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deploy it.</p>
<p>The model depended on a specific version of scikit-learn that conflicted with the production Python environment. The feature engineering pipeline required a NumPy build compiled against OpenBLAS, but the deployment servers ran MKL. A preprocessing step used a system library that existed on the data scientist's MacBook but not on the Ubuntu deployment target.</p>
<p>Three weeks of debugging later, we had the model running in production. Three weeks. For a model that was technically finished.</p>
<p>That experience is what pushed me to containerize our entire MLOps pipeline end to end. Not because Docker is trendy in ML circles, but because the alternative (hand-tuning environments, writing installation scripts that break on the next OS update, praying that what worked in training works in production) was costing us more time than the actual model development.</p>
<p>In this tutorial, you'll learn how to structure training and serving containers with multi-stage builds, how to set up experiment tracking with MLflow, how to version your training data with DVC, how to configure GPU passthrough for training, and how to tie it all together into a single Compose file with profiles. This is based on a year of running containerized ML pipelines across three teams.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>Docker Engine 24+ or Docker Desktop 4.20+ with Compose v2.22.0+</p>
</li>
<li><p>For GPU training, you'll need the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">NVIDIA Container Toolkit</a> installed on the host and a compatible GPU driver. Run <code>nvidia-smi</code> to verify your GPU is visible, and <code>docker compose version</code> to check your Compose version.</p>
</li>
<li><p>Familiarity with Python, basic Docker concepts, and ML workflows (training, evaluation, serving) is assumed.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-mlops-lifecycle-where-containers-fit">The MLOps Lifecycle: Where Containers Fit</a></p>
</li>
<li><p><a href="#heading-how-to-build-the-training-container">How to Build the Training Container</a></p>
<ul>
<li><p><a href="#heading-separate-training-from-serving-requirements">Separate Training from Serving Requirements</a></p>
</li>
<li><p><a href="#heading-cuda-and-driver-compatibility">CUDA and Driver Compatibility</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-set-up-experiment-tracking-with-mlflow">How to Set Up Experiment Tracking with MLflow</a></p>
</li>
<li><p><a href="#heading-how-to-version-training-data-with-dvc">How to Version Training Data with DVC</a></p>
</li>
<li><p><a href="#heading-how-to-build-the-serving-container">How to Build the Serving Container</a></p>
<ul>
<li><a href="#heading-decouple-models-from-containers">Decouple Models from Containers</a></li>
</ul>
</li>
<li><p><a href="#heading-how-to-configure-gpu-passthrough-for-training">How to Configure GPU Passthrough for Training</a></p>
</li>
<li><p><a href="#heading-how-to-tie-it-all-together-with-compose-profiles">How to Tie It All Together with Compose Profiles</a></p>
</li>
<li><p><a href="#heading-reproducibility-the-whole-point">Reproducibility: The Whole Point</a></p>
</li>
<li><p><a href="#heading-where-this-breaks-down">Where This Breaks Down</a></p>
</li>
</ul>
<h2 id="heading-the-mlops-lifecycle-where-containers-fit">The MLOps Lifecycle: Where Containers Fit</h2>
<p>If you've built a machine learning model, you know the process has a lot of stages. But if you're coming from a software engineering background (or you're a data scientist who mostly works in notebooks), it helps to see the full picture of what an MLOps pipeline looks like and where Docker fits into each stage.</p>
<p>An MLOps pipeline is a chain of interdependent stages:</p>
<ol>
<li><p><strong>Data ingestion and validation.</strong> Raw data comes in from databases, APIs, or file systems. You clean it, validate it, and store it in a format your model can use.</p>
</li>
<li><p><strong>Feature engineering.</strong> You transform raw data into features the model can learn from. This might be as simple as normalizing numbers or as complex as generating embeddings.</p>
</li>
<li><p><strong>Experiment tracking.</strong> You log every training run's configuration (hyperparameters, data version, code version) and results (accuracy, loss, evaluation metrics) so you can compare experiments and reproduce the best ones.</p>
</li>
<li><p><strong>Model training.</strong> The model learns from your features. This is the compute-heavy part that often needs GPUs.</p>
</li>
<li><p><strong>Evaluation.</strong> You measure the trained model against test data to see if it's good enough to deploy.</p>
</li>
<li><p><strong>Packaging and serving.</strong> You wrap the trained model in an API so other systems can send it data and get predictions back.</p>
</li>
<li><p><strong>Monitoring.</strong> You watch the model in production to catch problems like data drift (when the real-world data starts looking different from the training data) or performance degradation.</p>
</li>
</ol>
<p>Each stage has different computational needs. Training might require GPUs and terabytes of memory. Serving needs low latency and horizontal scaling. Feature engineering might need distributed processing tools like Spark or Dask.</p>
<p>The thing that changed our approach: you don't containerize the entire pipeline as one monolithic image. You containerize each stage independently, with shared interfaces between them.</p>
<p>Think of it like microservices applied to ML infrastructure. Each container does one thing, does it well, and communicates with the others through well-defined interfaces: model artifacts stored in a registry, metrics logged to MLflow, data versioned in object storage.</p>
<p>This gives you the flexibility to:</p>
<ul>
<li><p>Scale training on expensive GPU instances while running serving on cheaper CPU nodes</p>
</li>
<li><p>Update your feature engineering code without rebuilding your training environment</p>
</li>
<li><p>Version each stage independently in your container registry</p>
</li>
<li><p>Let data scientists and ML engineers work on training while platform engineers optimize serving</p>
</li>
</ul>
<h2 id="heading-how-to-build-the-training-container">How to Build the Training Container</h2>
<p>The training container is where most teams start, and where most teams make their first mistake.</p>
<p>The temptation is to create one massive image with every possible library, every CUDA version, every data processing tool. I've seen training images exceed 15GB. They take twenty minutes to build, ten minutes to push, and break whenever someone adds a new dependency.</p>
<p>Here's the pattern that works: use multi-stage builds to separate the build environment from the runtime environment, and use cache mounts to avoid re-downloading packages on every build.</p>
<p>If you're new to these concepts: a <strong>multi-stage build</strong> lets you use one Docker image to build your software and a different, smaller image to run it. You copy only the final artifacts from the build stage to the runtime stage, leaving behind compilers, build tools, and other things you don't need in production.</p>
<p>A <strong>cache mount</strong> tells Docker to keep a directory (like pip's download cache) between builds, so it doesn't re-download packages that haven't changed.</p>
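<p>Here's the training Dockerfile. To keep the example short, it uses a single stage (named <code>base</code>) and focuses on layer ordering and the cache mount:</p>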
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

# System dependencies (rarely change)
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Dependencies (change occasionally)
COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

# Training code (changes frequently)
COPY src/ /app/src/
COPY configs/ /app/configs/

WORKDIR /app
ENTRYPOINT ["python", "-m", "src.train"]
</code></pre>
<p>Notice the layer ordering. Docker builds images in layers, and it caches each layer. If a layer hasn't changed, Docker reuses the cached version instead of rebuilding it. But here's the catch: if one layer changes, Docker rebuilds that layer and every layer after it.</p>
<p>That's why we put things in order of how often they change:</p>
<ol>
<li><p><strong>System packages at the top</strong> (they almost never change). Installing <code>python3.11</code> and <code>git</code> takes time, but you only do it once.</p>
</li>
<li><p><strong>Python dependencies in the middle</strong> (they change when you add or update a library). This layer rebuilds when <code>requirements-train.txt</code> changes.</p>
</li>
<li><p><strong>Your actual code at the bottom</strong> (changes on every commit). This is the layer that rebuilds most often.</p>
</li>
</ol>
<p>With this ordering, a code change only rebuilds the final layer, not the entire image. If you put <code>COPY src/</code> before <code>pip install</code>, every code change would trigger a full reinstall of all Python packages. That's the mistake I see most often in ML Dockerfiles.</p>
<p>The <code>--mount=type=cache,target=/root/.cache/pip</code> line on the <code>pip install</code> command tells Docker to persist pip's download cache between builds. When you do update requirements, pip checks the cache first and only downloads packages that are new or changed. On a project with hundreds of ML dependencies (PyTorch alone pulls in dozens of sub-packages), this saves five to ten minutes per build.</p>
<h3 id="heading-separate-training-from-serving-requirements">Separate Training from Serving Requirements</h3>
<p>Your training environment needs libraries that your serving environment does not. Training needs experiment tracking tools like MLflow, data processing libraries like pandas and polars, visualization libraries for debugging, and hyperparameter tuning frameworks. Serving needs a lightweight inference runtime, an API framework like FastAPI, health check endpoints, and minimal overhead.</p>
<p>It's a good idea to maintain separate requirements files:</p>
<pre><code class="language-plaintext"># requirements-train.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
pandas==2.2.3
polars==1.20.0
dvc[s3]==3.59.1
optuna==4.2.0
matplotlib==3.10.0

# requirements-serve.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
fastapi==0.115.0
uvicorn[standard]==0.34.0
pydantic==2.10.0
</code></pre>
<p>The overlap is smaller than you'd think. <code>torch</code> and <code>scikit-learn</code> appear in both because the model needs them for inference, and <code>mlflow</code> appears in both because the serving container loads the model from the MLflow registry. Everything else in the training file is baggage that slows down serving deployments and increases the attack surface.</p>
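<p>If you want to check the overlap in your own project, a few lines of Python are enough. This is a minimal sketch that assumes simple pinned lines like the ones above (no <code>-r</code> includes or URLs). The helper name <code>package_names</code> is hypothetical, and the two strings are shortened copies of the files shown earlier:</p>
<pre><code class="language-python">import re


def package_names(requirements_text):
    """Return the set of package names found in requirements-file text."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # Keep only the name: stops before extras like [s3] and pins like ==1.0
        match = re.match(r"[A-Za-z0-9_.-]+", line)
        if match:
            names.add(match.group(0).lower())
    return names


TRAIN = "torch==2.5.1\nscikit-learn==1.6.1\nmlflow==2.19.0\npandas==2.2.3\ndvc[s3]==3.59.1"
SERVE = "torch==2.5.1\nscikit-learn==1.6.1\nmlflow==2.19.0\nfastapi==0.115.0"

shared = package_names(TRAIN).intersection(package_names(SERVE))
print(sorted(shared))  # ['mlflow', 'scikit-learn', 'torch']
</code></pre>
<p>Running a check like this in CI is one way to notice when a training-only library sneaks into the serving file.</p>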
<h3 id="heading-cuda-and-driver-compatibility">CUDA and Driver Compatibility</h3>
<p>One thing that will bite you if you ignore it: the CUDA runtime version inside your container must be compatible with the GPU driver version on the host. The rule of thumb is that the host driver must support a CUDA version equal to or newer than the CUDA version in the container. For example, CUDA 12.6 requires driver version 560.28+ on Linux.</p>
<p>Make sure you check your host driver version before choosing your base image:</p>
<pre><code class="language-bash"># On the host machine
nvidia-smi
# Look for "Driver Version: 560.35.03" and "CUDA Version: 12.6"

# The CUDA version shown by nvidia-smi is the maximum CUDA version
# your driver supports, not the version installed
</code></pre>
<p>If your host driver is 535.x, don't use a <code>cuda:12.6</code> base image. Use <code>cuda:12.2</code> or upgrade the driver. A driver that is too old typically fails with <code>CUDA driver version is insufficient for CUDA runtime version</code>, and mismatches can also surface as more cryptic errors like <code>CUDA error: no kernel image is available for execution on the device</code> that are painful to debug.</p>
<p>Pin your base images to specific tags (not <code>latest</code>) and document the minimum driver version in your README. When you deploy to new hardware, the driver version check should be part of your provisioning checklist.</p>
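<p>The driver check is easy to automate. Here's a rough sketch that asks <code>nvidia-smi</code> for the driver version and compares it with a documented minimum. The function names are hypothetical, and the minimum you pass in should come from your own README rather than from this example:</p>
<pre><code class="language-python">import subprocess


def parse_version(text: str):
    """Turn a version such as '560.35.03' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_meets_minimum(driver_version: str, minimum: str):
    """True if driver_version is equal to or newer than minimum."""
    installed = parse_version(driver_version)
    required = parse_version(minimum)
    # installed passes when it is the newer (or equal) of the two versions
    return installed == max(installed, required)


def host_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU on the host."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return output.splitlines()[0].strip()
</code></pre>
<p>In a provisioning script you would call something like <code>driver_meets_minimum(host_driver_version(), "560.28")</code> and abort if it returns <code>False</code>.</p>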
<h2 id="heading-how-to-set-up-experiment-tracking-with-mlflow">How to Set Up Experiment Tracking with MLflow</h2>
<p>If you've ever trained a model and thought "wait, which hyperparameters gave me that good result last week?", you need experiment tracking. Without it, ML development turns into a mess of Jupyter notebooks, screenshots of metrics, and spreadsheets that nobody keeps up to date.</p>
<p><a href="https://mlflow.org/">MLflow</a> is the most widely adopted open-source tool for this. It logs three things for every training run: <strong>parameters</strong> (learning rate, batch size, number of epochs), <strong>metrics</strong> (accuracy, loss, F1 score), and <strong>artifacts</strong> (the trained model file, plots, evaluation reports). It stores all of this in a database and gives you a web UI to compare runs side by side.</p>
<p>Running MLflow as a containerized service means the tracking server is persistent and shared across your team, not running on one person's laptop:</p>
<pre><code class="language-yaml">services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: &gt;
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

volumes:
  mlflow-artifacts:
  postgres-data:
</code></pre>
<p>Let me break down what's happening here.</p>
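<p>The <code>mlflow</code> service runs the MLflow tracking server. It stores experiment metadata (parameters, metrics) in a Postgres database and saves artifacts (model files, plots) to a Docker volume. One caveat: MLflow needs a Postgres driver (such as <code>psycopg2-binary</code>) to talk to this database. If the image you use doesn't bundle one, extend it with a one-line <code>pip install</code>.</p>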
<p>The <code>depends_on</code> with <code>condition: service_healthy</code> tells Compose to wait until Postgres is actually ready to accept connections before starting MLflow. Without this, MLflow would crash on startup because the database isn't ready yet.</p>
<p>The <code>db</code> service runs Postgres with a health check that uses <code>pg_isready</code>, a built-in Postgres utility that checks if the database is accepting connections. The <code>start_period</code> gives Postgres 10 seconds to initialize before health checks start counting failures.</p>
<p>Your training code connects to MLflow by setting one environment variable:</p>
<pre><code class="language-python">import os
import mlflow

# This tells MLflow where to log experiments
# When running inside Docker Compose, "mlflow" resolves to the mlflow container
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow:5000"

# Example: logging a training run
with mlflow.start_run(run_name="fraud-detector-v2"):
    # Log hyperparameters
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("epochs", 50)

    # ... train your model here ...

    # Log metrics
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.log_metric("precision", 0.93)
    mlflow.log_metric("recall", 0.89)

    # Log the trained model as an artifact
    mlflow.sklearn.log_model(model, "model")
    # Or for PyTorch: mlflow.pytorch.log_model(model, "model")
</code></pre>
<p>After the run completes, open <code>http://localhost:5000</code> in your browser. You'll see a table of all your runs with their parameters and metrics. Click any run to see details, compare it with other runs, or download the model artifact. No more "I think experiment 7 was the good one" conversations.</p>
<p>A note on the password in the YAML: for local development this is fine. For staging and production, use Docker secrets or inject the credentials from your CI environment. Don't commit real database passwords to your repo.</p>
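<p>As a sketch of what that can look like in Python, the helper below reads the password from a file first (Docker mounts secrets as files under <code>/run/secrets</code>) and falls back to an environment variable. The names <code>read_secret</code>, <code>backend_store_uri</code>, and <code>MLFLOW_DB_PASSWORD</code> are assumptions for this example, not MLflow settings:</p>
<pre><code class="language-python">import os
from pathlib import Path
from urllib.parse import quote


def read_secret(name: str, default_dir: str = "/run/secrets"):
    """Read a secret from the NAME_FILE path, then default_dir, then env NAME."""
    file_path = os.environ.get(f"{name}_FILE")
    candidates = [Path(file_path)] if file_path else []
    candidates.append(Path(default_dir) / name.lower())
    for path in candidates:
        if path.is_file():
            return path.read_text().strip()
    return os.environ[name]  # raises KeyError if nothing is configured


def backend_store_uri(user="mlflow", host="db", database="mlflow"):
    """Build a Postgres URI in the form used for --backend-store-uri."""
    # Percent-encode the password so characters like "@" don't break the URI
    password = quote(read_secret("MLFLOW_DB_PASSWORD"), safe="")
    return f"postgresql://{user}:{password}@{host}/{database}"
</code></pre>
<p>A small wrapper script can then start <code>mlflow server</code> with the generated URI, so the password never appears in the Compose file.</p>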
<h2 id="heading-how-to-version-training-data-with-dvc">How to Version Training Data with DVC</h2>
<p>Models are reproducible only if you can also reproduce the data they were trained on. This is a problem Git can't solve on its own, because training datasets are often gigabytes or terabytes in size and Git isn't designed for large binary files.</p>
<p><a href="https://dvc.org/">DVC (Data Version Control)</a> fills this gap. It works like Git, but for data. Here's the concept: instead of storing your 10GB training dataset in Git, DVC stores a small text file (a <code>.dvc</code> file) that acts as a pointer to the actual data. The real data lives in cloud storage (S3, Google Cloud Storage, Azure Blob). When you check out a specific Git commit, DVC knows which version of the data goes with that commit and can pull it from remote storage.</p>
<p>The workflow on your local machine looks like this:</p>
<pre><code class="language-bash"># Initialize DVC in your project (one time)
dvc init

# Configure a default remote (one time; the bucket path is a placeholder)
dvc remote add -d storage s3://your-bucket/dvc-store

# Add your training data to DVC tracking
dvc add data/training_data.parquet
# This creates data/training_data.parquet.dvc (small pointer file)
# and adds training_data.parquet to data/.gitignore

# Push the actual data to remote storage
dvc push

# Commit the pointer file and DVC config to Git
git add data/training_data.parquet.dvc data/.gitignore .dvc/config
git commit -m "Add training data v1"
</code></pre>
<p>Now your Git repo contains the pointer file, and the real data lives in S3. When someone else (or a container) needs the data, they run <code>dvc pull</code> and DVC downloads it from remote storage.</p>
<p>The training Dockerfile includes DVC, and the entrypoint pulls the correct data version before training begins:</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

COPY src/ /app/src/
COPY configs/ /app/configs/

# DVC tracking files (these are small text files in Git)
COPY data/*.dvc /app/data/
COPY .dvc/ /app/.dvc/

WORKDIR /app
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
</code></pre>
<p>The entrypoint script pulls the data and then starts training:</p>
<pre><code class="language-bash">#!/bin/bash
set -e

echo "Pulling training data from remote storage..."
dvc pull data/

echo "Starting training run..."
python -m src.train "$@"
</code></pre>
<p>For DVC to pull from S3, the container needs AWS credentials. You can pass them as environment variables in your Compose file or mount them from the host:</p>
<pre><code class="language-yaml">training:
  build: { context: ., dockerfile: Dockerfile.train }
  environment:
    - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
    - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    - AWS_DEFAULT_REGION=us-east-1
</code></pre>
<p>Combined with MLflow's experiment logging, you get a complete provenance chain: this model was trained on this version of the data (tracked by DVC), with these parameters (logged in MLflow), producing these metrics.</p>
<p>You can reproduce any past experiment by checking out the Git commit and running the training container.</p>
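<p>To make that link explicit, you can record the data hash next to each run. The sketch below pulls the hash out of a <code>.dvc</code> pointer file using only the standard library. It assumes the pointer file contains an <code>md5:</code> entry under <code>outs</code>, which is how recent DVC versions write it, and the function name <code>dvc_data_hash</code> is hypothetical:</p>
<pre><code class="language-python">import re
from pathlib import Path


def dvc_data_hash(pointer_file):
    """Return the md5 recorded in a .dvc pointer file, or None if absent."""
    text = Path(pointer_file).read_text()
    # Matches lines such as "- md5: a304af..." but not "hash: md5"
    match = re.search(r"^\s*-?\s*md5:\s*(\S+)", text, re.MULTILINE)
    return match.group(1) if match else None
</code></pre>
<p>Inside the training run, passing the result to <code>mlflow.set_tag("data_md5", ...)</code> stores it with the parameters and metrics, so every run in the MLflow UI shows which data version produced it.</p>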
<h2 id="heading-how-to-build-the-serving-container">How to Build the Serving Container</h2>
<p>"Serving" means wrapping your trained model in an API so other systems can send it data and get predictions back. For example, a fraud detection model might expose a <code>/predict</code> endpoint that accepts transaction data and returns a fraud probability.</p>
<p>The serving container has different priorities than the training container. Training optimizes for flexibility and raw compute. Serving optimizes for speed, small size, and reliability:</p>
<pre><code class="language-dockerfile">FROM python:3.11-slim AS serving

WORKDIR /app

# Install curl for healthcheck
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

COPY requirements-serve.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-serve.txt

COPY src/serving/ /app/src/serving/

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.serving.app:app", "--host", "0.0.0.0"]
</code></pre>
<p>A few things to understand if you're new to this:</p>
<p><code>uvicorn</code> is a lightweight Python web server that runs <a href="https://fastapi.tiangolo.com/">FastAPI</a> applications. FastAPI is a framework for building APIs in Python. Together, they let you turn your model into a web service that responds to HTTP requests.</p>
<p><code>HEALTHCHECK</code> tells Docker to periodically check if your container is actually working, not just running. Every 30 seconds, Docker runs the <code>curl</code> command against the <code>/health</code> endpoint. If it fails three times in a row, Docker marks the container as unhealthy. This matters because your model server might be running but not ready (maybe the model file is still downloading), and you don't want to send traffic to a server that can't respond.</p>
<p><code>start-period</code> of 60 seconds is important for ML serving containers. Model loading can take time, especially for large models (loading a 2GB model from a registry takes a while). Without <code>start-period</code>, the health check would start failing immediately, count those failures toward the retry limit, and the orchestrator might kill the container before the model finishes loading. The start period gives the container grace time to initialize.</p>
<p>Notice we're using <code>python:3.11-slim</code> here, not the NVIDIA CUDA image. Most trained models can run inference on CPU. If you need GPU inference (for example, running a large language model or doing real-time video processing), use the CUDA base image instead, but be aware that it makes the serving container much larger.</p>
<p>If you want to skip the <code>curl</code> dependency, use Python's built-in <code>urllib</code> for the health check:</p>
<pre><code class="language-dockerfile">HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
</code></pre>
<h3 id="heading-decouple-models-from-containers">Decouple Models from Containers</h3>
<p>This is one of the most important patterns in this article, and the one beginners most often get wrong.</p>
<p>The temptation is to copy your trained model file (the <code>.pkl</code>, <code>.pt</code>, or <code>.onnx</code> file that contains the learned weights) directly into the Docker image during the build. Don't do this. When you embed model files in your Docker image, every model update requires a new image build and push. For a 2GB model, that means rebuilding the container, uploading 2GB to a registry, and redeploying, even though only the model changed and the code is identical.</p>
<p>Instead, have your serving container download the model from a model registry (like MLflow) or cloud storage (like S3) at startup. The container image stays small and generic. Model updates become a configuration change (pointing to a new model version) rather than a deployment.</p>
<p>Here's a full serving app using FastAPI with the modern lifespan pattern. If you've used Flask, FastAPI is similar but faster and with built-in request validation:</p>
<pre><code class="language-python">import os
from contextlib import asynccontextmanager

import mlflow
import pandas as pd
from fastapi import FastAPI
from fastapi.responses import JSONResponse

# MODEL_URI points to a specific model version in MLflow's registry
# Format: "models:/&lt;model-name&gt;/&lt;stage&gt;" where stage is Staging or Production
MODEL_URI = os.environ.get("MODEL_URI", "models:/fraud-detector/production")
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # This runs once when the server starts up
    global model
    print(f"Loading model from {MODEL_URI}...")
    model = mlflow.pyfunc.load_model(MODEL_URI)
    print("Model loaded successfully.")
    yield
    # This runs when the server shuts down
    print("Shutting down model server.")


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health():
    """Used by Docker HEALTHCHECK to verify the server is ready."""
    if model is None:
        # FastAPI ignores Flask-style (body, status) tuples, so set the
        # status code explicitly on the response
        return JSONResponse(status_code=503, content={"status": "loading"})
    return {"status": "healthy"}


@app.post("/predict")
async def predict(features: dict):
    """Accept features as JSON, return model prediction."""
    # Convert the input dict into a DataFrame (what most sklearn/mlflow models expect)
    df = pd.DataFrame([features])
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}
</code></pre>
<p>When a client sends a POST request to <code>/predict</code> with JSON like <code>{"amount": 500, "merchant_category": "electronics", "hour": 23}</code>, the model returns a prediction. The <code>/health</code> endpoint returns 503 while the model is loading and 200 once it's ready, which is exactly what the Docker <code>HEALTHCHECK</code> checks for.</p>
<p>Promoting a new model version means updating the <code>MODEL_URI</code> environment variable and restarting the container. The MLflow model registry supports stage transitions (Staging, Production, Archived), so you can promote a model in the MLflow UI and then point your serving container at the new version.</p>
<p>For zero-downtime model updates, implement a reload endpoint that swaps models without restarting:</p>
<pre><code class="language-python">@app.post("/admin/reload")
async def reload_model():
    global model
    model = mlflow.pyfunc.load_model(MODEL_URI)
    return {"status": "reloaded"}
</code></pre>
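<p>One caveat with a reload endpoint: if the load fails partway, you don't want to lose the model that is currently serving. Here's a minimal sketch of a load-then-swap pattern. The <code>ModelHolder</code> class and the <code>loader</code> callable are hypothetical names used for illustration (in the app above, the loader would wrap <code>mlflow.pyfunc.load_model</code>), and neither is part of FastAPI or MLflow:</p>
<pre><code class="language-python">import threading


class ModelHolder:
    """Keeps the current model and replaces it only after a new one loads.

    Hypothetical helper for illustration; `loader` is any callable that
    takes a model URI and returns a loaded model.
    """

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._model = None

    def reload(self, uri):
        # Load first. If this raises, self._model is left untouched,
        # so the old model keeps serving requests.
        new_model = self._loader(uri)
        with self._lock:
            self._model = new_model

    def get(self):
        with self._lock:
            return self._model


# Usage with a stand-in loader (a real app might pass mlflow.pyfunc.load_model)
def fake_loader(uri):
    if uri.endswith("/broken"):
        raise RuntimeError("could not load " + uri)
    return "model loaded from " + uri


holder = ModelHolder(fake_loader)
holder.reload("models:/fraud-detector/production")
try:
    holder.reload("models:/fraud-detector/broken")
except RuntimeError:
    pass  # the failed reload did not replace the working model
print(holder.get())
</code></pre>
<p>Because the new model is loaded into a local variable before the swap, a failed reload leaves the previous model in place.</p>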
<h2 id="heading-how-to-configure-gpu-passthrough-for-training">How to Configure GPU Passthrough for Training</h2>
<p>By default, Docker containers can't see the GPU hardware on the host machine. "GPU passthrough" means giving a container access to the host's GPUs so that libraries like PyTorch and TensorFlow can use them for accelerated computation.</p>
<p>This requires two things on the host (the machine running Docker, not inside the container):</p>
<ol>
<li><p><strong>NVIDIA GPU drivers</strong> installed and working. Verify with <code>nvidia-smi</code>. If that command shows your GPUs, you're good.</p>
</li>
<li><p><strong>NVIDIA Container Toolkit</strong> installed. This is the bridge between Docker and the GPU drivers. Install it from the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">NVIDIA docs</a> and verify with <code>docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi</code>. If you see your GPU listed, the toolkit is working.</p>
</li>
</ol>
<p>Once the host is set up, GPU access in Docker Compose looks like this:</p>
<pre><code class="language-yaml">services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
</code></pre>
<p>The <code>deploy.resources.reservations.devices</code> block is saying: "this container needs all available NVIDIA GPUs." Inside the container, PyTorch and TensorFlow will see the GPUs. TensorFlow places operations on a GPU by default, while PyTorch code still has to move models and tensors to the device (for example with <code>.to("cuda")</code>). You can verify by adding <code>print(torch.cuda.is_available())</code> to your training script, which should print <code>True</code>.</p>
<p>If you're running Compose v2.30.0+, you can use the shorter <code>gpus</code> syntax:</p>
<pre><code class="language-yaml">services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    gpus: all
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
</code></pre>
<p>For multi-GPU training with frameworks like PyTorch's DistributedDataParallel, you can assign specific GPUs using <code>device_ids</code>. This matters when running multiple training jobs at the same time:</p>
<pre><code class="language-yaml">services:
  training-job-1:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1

  training-job-2:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["2", "3"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
</code></pre>
<p>Note that <code>CUDA_VISIBLE_DEVICES</code> inside the container is relative to the devices assigned by Docker, not the host GPU indices. Both containers see their GPUs as device 0 and 1, even though they're using different physical GPUs.</p>
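<p>To make the index mapping concrete, here's a small sketch. The <code>container_view</code> function is hypothetical and only models the renumbering described above. It assumes the assigned devices are renumbered from 0 in the order listed, and it doesn't query Docker or the NVIDIA runtime:</p>
<pre><code class="language-python">def container_view(device_ids):
    """Map in-container GPU indices to the host GPU IDs assigned by Docker."""
    return {index: host_id for index, host_id in enumerate(device_ids)}


job_1 = container_view(["0", "1"])
job_2 = container_view(["2", "3"])

# Both jobs address their GPUs as 0 and 1 inside the container
print(job_1)  # {0: '0', 1: '1'}
print(job_2)  # {0: '2', 1: '3'}
</code></pre>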
<h2 id="heading-how-to-tie-it-all-together-with-compose-profiles">How to Tie It All Together with Compose Profiles</h2>
<p>If you're new to Compose profiles: by default, <code>docker compose up</code> starts every service defined in your <code>docker-compose.yml</code>. But you don't always want everything running. Your MLflow server and serving API should run all the time, but the training container should only launch when you're actually training a model (and it needs a GPU, which you might not have on your laptop).</p>
<p>Profiles solve this. When you add <code>profiles: ["train"]</code> to a service, that service is excluded from <code>docker compose up</code> by default. It only starts when you explicitly activate the profile, for example with <code>docker compose --profile train up</code>. This means one file defines your entire ML infrastructure, but you control what runs and when.</p>
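<p>If the selection rule is easier to read as code, here's a simplified model of it. The <code>services_to_start</code> function and the in-memory <code>services</code> dict are hypothetical stand-ins, not Compose's implementation:</p>
<pre><code class="language-python">def services_to_start(services, active_profiles=()):
    """Return the names of services that `docker compose up` would start.

    Simplified model of the profile rule: a service with no profiles
    always starts; a service with profiles starts only when at least one
    of them is active.
    """
    active = set(active_profiles)
    started = []
    for name, config in services.items():
        profiles = set(config.get("profiles", []))
        if not profiles or profiles.intersection(active):
            started.append(name)
    return started


# Hypothetical in-memory stand-in for the services in docker-compose.yml
services = {
    "db": {},
    "mlflow": {},
    "serving": {},
    "training": {"profiles": ["train"]},
}

print(services_to_start(services))             # ['db', 'mlflow', 'serving']
print(services_to_start(services, ["train"]))  # all four services
</code></pre>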
<p>Here's the complete <code>docker-compose.yml</code> that ties every piece from this article together:</p>
<pre><code class="language-yaml">services:
  # --- Always-on infrastructure ---
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: &gt;
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  serving:
    build: { context: ., dockerfile: Dockerfile.serve }
    ports:
      - "8000:8000"
    environment:
      - MODEL_URI=models:/fraud-detector/production
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      mlflow: { condition: service_started }

  # --- Training (on-demand) ---
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    profiles: ["train"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
      - ./configs:/app/configs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      mlflow: { condition: service_started }

volumes:
  postgres-data:
  mlflow-artifacts:
</code></pre>
<p>The day-to-day workflow with this file:</p>
<pre><code class="language-bash"># Step 1: Start the infrastructure (MLflow + Postgres + serving API)
# The -d flag runs everything in the background
docker compose up -d

# Step 2: Open the MLflow UI to see past experiments
open http://localhost:5000    # macOS
# xdg-open http://localhost:5000  # Linux

# Step 3: Check that the serving API is healthy
curl http://localhost:8000/health
# Should return: {"status":"healthy"}

# Step 4: Run a training job (pulls data via DVC, logs to MLflow)
# This only starts the "training" service because of the profile flag
docker compose --profile train run training

# Step 5: Watch training progress in the MLflow UI at localhost:5000
# You'll see metrics updating in real time if your training code logs them

# Step 6: After training completes, promote the model in MLflow UI
# Click the model, go to "Register Model", set stage to "Production"

# Step 7: Restart the serving container to pick up the new model version
docker compose restart serving

# Step 8: Test the new model
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 500, "merchant_category": "electronics", "hour": 23}'
</code></pre>
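<p>Step 3 can fail if you run it too early, because the serving container needs time to load the model. Here's a minimal polling sketch, assuming a hypothetical <code>wait_until_healthy</code> helper. The <code>check</code> callable is injected, so the same loop works with any probe, such as an HTTP GET against <code>/health</code>:</p>
<pre><code class="language-python">import time


def wait_until_healthy(check, attempts=30, delay=1.0):
    """Call `check()` until it returns True or the attempts run out.

    `check` is any zero-argument callable, for example one that sends a
    GET request to http://localhost:8000/health and returns True on 200.
    """
    for attempt in range(attempts):
        if check():
            return True
        if attempt != attempts - 1:
            time.sleep(delay)
    return False


# Stand-in probe that reports healthy on the third call
calls = {"count": 0}


def fake_check():
    calls["count"] += 1
    return calls["count"] == 3


print(wait_until_healthy(fake_check, attempts=5, delay=0.01))  # True
</code></pre>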
<p>This single-file approach means a new team member can clone the repo, run <code>docker compose up -d</code>, and have the complete ML infrastructure running locally within minutes. The same containers deploy to staging and production with only environment variable changes (database credentials, model URIs, GPU allocation).</p>
<h2 id="heading-reproducibility-the-whole-point">Reproducibility: The Whole Point</h2>
<p>Everything in this article serves one goal: reproducibility. The ability to take any commit hash, build the same containers, pull the same data, and produce the same model.</p>
<p>Here are the practices that make this work:</p>
<h3 id="heading-pin-everything">Pin Everything</h3>
<p>Pin your base images to specific digests, not just tags. Pin your Python packages to exact versions with <code>pip freeze &gt; requirements.txt</code>. Use fixed random seeds in your training code and log them in MLflow.</p>
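<p>For the seeding part, here's a minimal sketch. The <code>seed_everything</code> helper is a hypothetical name. It covers Python's <code>random</code> module and NumPy (if installed), and your training framework will have its own seeding call to add:</p>
<pre><code class="language-python">import os
import random


def seed_everything(seed):
    """Seed the random number generators this process uses.

    Covers Python's `random` module and, if it is installed, NumPy.
    Frameworks such as PyTorch have their own seeding calls.
    """
    random.seed(seed)
    # Only affects Python processes started after this point (hash randomization)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
    except ImportError:
        return seed
    np.random.seed(seed)
    return seed


seed = seed_everything(42)
first = [random.random() for _ in range(3)]

seed_everything(42)
second = [random.random() for _ in range(3)]

print(first == second)  # True: same seed, same sequence
# Inside a training run, record it with mlflow.log_param("seed", seed)
</code></pre>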
<h3 id="heading-log-everything">Log Everything</h3>
<p>Every training run should log the exact library versions (<code>pip freeze</code>), the Git commit hash, the DVC data version, all hyperparameters, and all evaluation metrics to MLflow. You can automate this:</p>
<pre><code class="language-python">import subprocess
import mlflow

with mlflow.start_run():
    # Log environment info automatically
    pip_freeze = subprocess.check_output(["pip", "freeze"]).decode()
    mlflow.log_text(pip_freeze, "pip_freeze.txt")

    git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.log_param("git_commit", git_hash)

    # ... rest of training ...
</code></pre>
<h3 id="heading-version-everything">Version Everything</h3>
<p>Git for code, DVC for data, MLflow for experiments, Docker digests for environments. The combination creates a complete provenance chain. When a stakeholder asks why a model made a particular prediction, you can trace it back to the exact code, data, and hyperparameters that produced it. For regulated industries like finance and healthcare, that traceability is a compliance requirement, not a nice-to-have.</p>
<h2 id="heading-where-this-breaks-down">Where This Breaks Down</h2>
<p>This approach works well for small-to-medium teams running on single hosts or small clusters. Here's where you'll hit walls:</p>
<p><strong>Large datasets.</strong> Don't mount multi-terabyte datasets into containers. Use object storage (S3, GCS) and stream data during training. DVC handles the versioning, but the data itself should live outside Docker entirely.</p>
<p><strong>GPU driver mismatches.</strong> Your container's CUDA version must be compatible with the host driver. Test on identical hardware and driver versions to what you'll run in production. Document the minimum driver version in your README.</p>
<p><strong>Multi-node training.</strong> When you need to distribute training across multiple machines, you'll outgrow Compose. Kubernetes with Kubeflow or KServe is the standard path for distributed training and auto-scaled serving.</p>
<p><strong>Serving at scale.</strong> A single container running uvicorn handles moderate traffic. For high-throughput inference, you'll need a load balancer, multiple replicas, and potentially a dedicated serving framework like NVIDIA Triton Inference Server or TensorFlow Serving. Compose can run multiple replicas with <code>docker compose up --scale serving=3</code>, but it doesn't give you the routing, health-based load balancing, or rolling updates that a real orchestrator provides.</p>
<p><strong>Secrets in production.</strong> The Compose file above uses plaintext passwords for local development. In production, use Docker secrets, HashiCorp Vault, or your cloud provider's secret manager. Never commit credentials to your repo.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Containerizing your MLOps pipeline turns fragile, environment-dependent models into reproducible, deployable artifacts. Multi-stage builds keep images lean. MLflow gives you experiment tracking and model lineage. DVC links code to data. GPU passthrough preserves training performance. A single Compose file with profiles ties the whole workflow together.</p>
<p>That fraud detection model I mentioned at the start? We eventually containerized the entire pipeline around it. The next model we shipped went from "notebook finished" to "running in production" in two days instead of three weeks. Most of that time was spent on evaluation and review, not fighting environments.</p>
<p>Containerization doesn't make your models better. It gets the infrastructure out of the way so you can focus on the work that does.</p>
<p>Even with the caveats covered earlier, containerized MLOps eliminates the most common source of ML project delays: environment mismatch between development and production. The three weeks we spent debugging that fraud detection model deployment? That doesn't happen anymore.</p>
<p>If you found this useful, you can find me writing about MLOps, containerized workflows, and production AI systems on my blog.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn MLOps with MLflow and Databricks ]]>
                </title>
                <description>
                    <![CDATA[ As the industry standard for managing the machine learning life cycle, MLflow provides the necessary architecture to build systems that are both reproducible and scalable. We just posted a course on t ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-mlops-with-mlflow-and-databricks/</link>
                <guid isPermaLink="false">69a999073728a9dc358915c1</guid>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 05 Mar 2026 14:53:59 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5f68e7df6dfc523d0a894e7c/8f3332fe-2f88-451d-9c2c-7ac8e08f0286.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>As the industry standard for managing the machine learning life cycle, MLflow provides the necessary architecture to build systems that are both reproducible and scalable.</p>
<p>We just posted an end-to-end MLflow course on the <a href="http://freeCodeCamp.org">freeCodeCamp.org</a> YouTube channel. It will help you take machine learning models out of the research phase and into a real production environment.</p>
<p>The curriculum begins with the fundamentals of experiment tracking, explaining why moving beyond basic Jupyter notebooks is critical for professional workflows. You will learn how to properly manage model parameters, metrics, and decision history so that every model pushed to production is fully auditable and traceable.</p>
<p>This course also covers LLM ops. You will discover how to use the prompt registry to version templates, manage different model providers through the AI Gateway, and implement LLM-as-a-judge for automated prompt evaluation. By integrating these tools with Databricks and Hugging Face, you will gain the hands-on expertise needed to serve and monitor complex models in an enterprise setting.</p>
<p>Watch the <a href="https://youtu.be/tVskbekONlw">full course over at freeCodeCamp.org to start</a> building production-ready ML systems today (5-hour watch).</p>
<div class="embed-wrapper"><iframe width="560" height="315" src="https://www.youtube.com/embed/tVskbekONlw" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div> ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build End-to-End Machine Learning Lineage ]]>
                </title>
                <description>
                    <![CDATA[ Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance. While many services for tracking ML lineage exist, creating a comprehensive and manageabl... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-end-to-end-machine-learning-lineage/</link>
                <guid isPermaLink="false">68f0f6719ac2ae80d4c5be03</guid>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Thu, 16 Oct 2025 13:43:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760622158648/b990ff01-06f0-495d-8554-f832813609ab.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance.</p>
<p>While many services for tracking ML lineage exist, creating a comprehensive and manageable lineage often proves complicated.</p>
<p>In this article, I’ll walk you through integrating a comprehensive ML lineage solution for an ML application deployed on serverless AWS Lambda, covering the end-to-end pipeline stages:</p>
<ul>
<li><p>ETL pipeline</p>
</li>
<li><p>Data drift detection</p>
</li>
<li><p>Preprocessing</p>
</li>
<li><p>Model tuning</p>
</li>
<li><p>Risk and fairness evaluation.</p>
</li>
</ul>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-machine-learning-lineage">What is Machine Learning Lineage?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-well-build">What We’ll Build</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-the-system-architecture-ai-pricing-for-retailers">The System Architecture - AI Pricing for Retailers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-ml-lineage">The ML Lineage</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-workflow-in-action">Workflow in Action</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-initiating-a-dvc-project">Step 1: Initiating a DVC Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-the-ml-lineage">Step 2: The ML Lineage</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-stage-1-the-etl-pipeline">Stage 1: The ETL Pipeline</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-2-the-data-drift-check">Stage 2: The Data Drift Check</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-3-preprocessing">Stage 3: Preprocessing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-4-tuning-the-model">Stage 4: Tuning the Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-5-performing-inference">Stage 5: Performing Inference</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-6-assessing-model-risk-and-fairness">Stage 6: Assessing Model Risk and Fairness</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-test-in-local">Test in Local</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-deploying-the-dvc-project">Step 3: Deploying the DVC Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-configuring-scheduled-run-with-prefect">Step 4: Configuring Scheduled Run with Prefect</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-configuring-the-docker-image-registry">Configuring the Docker Image Registry</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configure-prefect-tasks-and-flows">Configure Prefect Tasks and Flows</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-test-in-local-1">Test in Local</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-deploying-the-application">Step 5: Deploying the Application</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-test-in-local-2">Test in Local</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h3 id="heading-prerequisites">Prerequisites:</h3>
<ul>
<li><p>Knowledge of key Machine Learning / Deep Learning concepts including the full lifecycle: data handling, model training, tuning, and validation.</p>
</li>
<li><p>Proficiency in Python, with experience using major ML libraries.</p>
</li>
<li><p>Basic understanding of DevOps principles.</p>
</li>
</ul>
<h3 id="heading-tools-well-use">Tools we’ll use:</h3>
<p>Here is a summary of the tools we’re going to use to track the ML lineage:</p>
<ul>
<li><p><strong>DVC</strong>: An open-source version control system for data. Used to track the ML lineage.</p>
</li>
<li><p><strong>AWS S3</strong>: A secure object storage service from AWS. Used as remote storage.</p>
</li>
<li><p><strong>Evidently AI</strong>: An open-source ML and LLM observability framework. Used to detect data drift.</p>
</li>
<li><p><strong>Prefect</strong>: A workflow orchestration engine. Used to manage the scheduled runs of the lineage.</p>
</li>
</ul>
<h2 id="heading-what-is-machine-learning-lineage">What is Machine Learning Lineage?</h2>
<p><strong>Machine learning (ML) lineage</strong> is a framework for tracking and understanding the complete lifecycle of a machine learning model.</p>
<p>It contains information at different levels such as:</p>
<ul>
<li><p><strong>Code:</strong> The scripts, libraries, and configurations for model training.</p>
</li>
<li><p><strong>Data:</strong> The original data, transformations, and features.</p>
</li>
<li><p><strong>Experiments:</strong> Training runs, hyperparameter tuning results.</p>
</li>
<li><p><strong>Models:</strong> The trained models and their versions.</p>
</li>
<li><p><strong>Predictions:</strong> The outputs of deployed models.</p>
</li>
</ul>
<p>ML lineage is essential for multiple reasons:</p>
<ul>
<li><p><strong>Reproducibility:</strong> Recreate the same model and prediction for validation.</p>
</li>
<li><p><strong>Root cause analysis:</strong> Trace back to the data, code, or configuration change when a model fails in production.</p>
</li>
<li><p><strong>Compliance:</strong> Some regulated industries require proof of model training to ensure fairness, transparency, and adherence to laws like GDPR and the EU AI Act.</p>
</li>
</ul>
<h2 id="heading-what-well-build">What We’ll Build</h2>
<p>In this project, I’ll integrate an ML lineage into <a target="_blank" href="https://levelup.gitconnected.com/building-a-dynamic-pricing-system-with-a-multi-layered-neural-network-c2a4c70bfcec">this price prediction system built on AWS Lambda architecture</a> using DVC, an open-source version control system for ML applications.</p>
<p>The below diagram illustrates the system architecture and the ML lineage we’ll integrate:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759825040233/5027e5dd-a2fc-4d35-b7a3-4d9184f5f179.png" alt="Figure A. A comprehensive ML lineage for an ML application on serverless Lambda (Created by Kuriko IWAI)" class="image--center mx-auto" width="25020" height="7926" loading="lazy"></p>
<p><strong>Figure A:</strong> A comprehensive ML lineage for an ML application on serverless Lambda (Created by <a target="_blank" href="https://kuriko-iwai.vercel.app/">Kuriko IWAI</a>)</p>
<h3 id="heading-the-system-architecture-ai-pricing-for-retailers">The System Architecture: AI Pricing for Retailers</h3>
<p>The system operates as a containerized, serverless microservice designed to provide optimal price recommendations to maximize retailer sales.</p>
<p>Its core intelligence comes from AI models trained on historical purchase data to predict the quantity of the product sold at various prices, allowing sellers to determine the best price.</p>
<p>For consistent deployment, the prediction logic and its dependencies are packaged into a Docker container image and stored in AWS ECR (Elastic Container Registry).</p>
<p>The prediction is then served by an AWS Lambda function, which retrieves and runs the container from ECR and exposes the result via AWS API Gateway for the Flask application to consume.</p>
<p>If you want to see how to build this from the ground up, you can follow along with my tutorial <a target="_blank" href="https://www.freecodecamp.org/news/how-to-build-a-machine-learning-system-on-serverless-architecture/">How to Build a Machine Learning System on Serverless Architecture</a>.</p>
<h3 id="heading-the-ml-lineage">The ML Lineage</h3>
<p>In the system, GitHub handles the code lineage, while DVC captures the lineage of:</p>
<ul>
<li><p><strong>Data</strong> (blue boxes): ETL and preprocessing.</p>
</li>
<li><p><strong>Experiments</strong> (light orange): Hyperparameter tuning and validation.</p>
</li>
<li><p><strong>Models</strong> and <strong>Prediction</strong> (dark orange): Final model artifacts and prediction results.</p>
</li>
</ul>
<p><strong>DVC</strong> tracks the lineage through separate stages, from data extraction to fairness testing (yellow rows in Figure A).</p>
<p>For each stage, DVC uses an <strong>MD5</strong> or <strong>SHA256 hash</strong> to track and push metadata like artifacts, metrics, and reports to its remote on <strong>AWS S3</strong>.</p>
<p>The pipeline incorporates <strong>Evidently AI</strong> to handle data drift tests, which are essential for identifying shifts in data distributions that could compromise the model's generalization capabilities in production.</p>
<p>Only models that successfully pass both the data drift and fairness tests can serve predictions via AWS API Gateway (red box in Figure A).</p>
<p>Lastly, this entire lineage process is triggered weekly by the open-source workflow scheduler, <strong>Prefect</strong>.</p>
<p>Prefect prompts DVC to check for updates in data and scripts, and executes the full lineage process if changes are detected.</p>
<h2 id="heading-workflow-in-action">Workflow in Action</h2>
<p>The building process involves five main steps:</p>
<ol>
<li><p>Initiate a DVC project</p>
</li>
<li><p>Define the lineage stages with the DVC configuration file <code>dvc.yaml</code> and the corresponding Python scripts</p>
</li>
<li><p>Deploy the DVC project</p>
</li>
<li><p>Configure scheduled run with Prefect</p>
</li>
<li><p>Deploy the application</p>
</li>
</ol>
<p>Let’s walk through each step together.</p>
<h2 id="heading-step-1-initiating-a-dvc-project">Step 1: Initiating a DVC Project</h2>
<p>The first step is to initiate a DVC project:</p>
<pre><code class="lang-bash">dvc init
</code></pre>
<p>This command automatically creates a <code>.dvc</code> directory at the root of the project folder:</p>
<pre><code class="lang-bash">.
└── .dvc/
    ├── cache/         <span class="hljs-comment"># [.gitignore] stores DVC caches (the cached data files themselves)</span>
    ├── tmp/           <span class="hljs-comment"># [.gitignore]</span>
    ├── .gitignore     <span class="hljs-comment"># ignores cache, tmp, and config.local</span>
    ├── config         <span class="hljs-comment"># DVC config for production</span>
    └── config.local   <span class="hljs-comment"># [.gitignore] DVC config for local use</span>
</code></pre>
<p>DVC keeps the Git repository fast and lightweight by storing large data files outside the repository.</p>
<p>The process involves caching the original data in the local <code>.dvc/cache</code> directory, creating a small <code>.dvc</code> metadata file which contains an MD5 hash and a link to the original data file path, pushing <em>only</em> the small metadata files to Git, and pushing the original data to the DVC remote.</p>
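<p>To see what that hash represents, here's a minimal sketch of content hashing with Python's standard library. The <code>file_md5</code> helper is hypothetical and isn't DVC's code. It only illustrates why identical content always produces the same identifier, while any change to the data produces a new one:</p>
<pre><code class="lang-python">import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


# Write a small stand-in data file and hash it before and after a change
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.csv")

    with open(data_path, "wb") as f:
        f.write(b"price,quantity\n10,3\n")
    hash_v1 = file_md5(data_path)

    with open(data_path, "ab") as f:
        f.write(b"12,2\n")
    hash_v2 = file_md5(data_path)

print(hash_v1 != hash_v2)  # True: the content changed, so the hash changed
</code></pre>
<p>Reading in chunks keeps memory use flat, which matters for the large data files this workflow is meant to handle.</p>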
<h2 id="heading-step-2-the-ml-lineage">Step 2: The ML Lineage</h2>
<p>Next, we’ll configure the ML lineage with the following stages:</p>
<ol>
<li><p><code>etl_pipeline</code>: Extract, clean, and impute the original data, then perform feature engineering.</p>
</li>
<li><p><code>data_drift_check</code>: Run data drift tests. If they fail, the system exits.</p>
</li>
<li><p><code>preprocess</code>: Create training, validation, and test datasets.</p>
</li>
<li><p><code>tune_primary_model</code>: Tune hyperparameters and train the model.</p>
</li>
<li><p><code>inference_primary_model</code>: Perform inference on the test dataset.</p>
</li>
<li><p><code>assess_model_risk</code>: Run risk and fairness tests.</p>
</li>
</ol>
<p>Each stage requires defining the DVC command and its corresponding Python script.</p>
<p>Let’s get started.</p>
<h3 id="heading-stage-1-the-etl-pipeline">Stage 1: The ETL Pipeline</h3>
<p>The first stage extracts, cleans, and imputes the original data, then performs feature engineering.</p>
<h4 id="heading-dvc-configuration"><strong>DVC Configuration</strong></h4>
<p>We’ll create the <code>dvc.yaml</code> file at the root of the project directory and add the <code>etl_pipeline</code> stage:</p>
<p><code>dvc.yaml</code></p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment"># the main command dvc will run in this stage</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">python</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>

    <span class="hljs-comment"># dependencies necessary to run the main command</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>

    <span class="hljs-comment"># output paths for dvc to track</span>
    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/original_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/processed_df.parquet</span>
</code></pre>
<p>The <code>dvc.yaml</code> file defines a sequence of steps (stages) using sections like:</p>
<ul>
<li><p><code>cmd</code>: The shell command to be executed for that stage</p>
</li>
<li><p><code>deps</code>: Dependencies needed to run the <code>cmd</code></p>
</li>
<li><p><code>params</code>: Default parameters for the <code>cmd</code>, defined in the <code>params.yaml</code> file</p>
</li>
<li><p><code>metrics</code>: The metrics files to track</p>
</li>
<li><p><code>reports</code>: The report files to track</p>
</li>
<li><p><code>plots</code>: The DVC plot files for visualization</p>
</li>
<li><p><code>outs</code>: The output files produced by the <code>cmd</code>, which DVC will track</p>
</li>
</ul>
<p>This configuration helps DVC ensure reproducibility by explicitly listing the dependencies, outputs, and command of each stage. It also lets DVC manage lineage by building a <strong>Directed Acyclic Graph (DAG)</strong> of the workflow that links each stage to the next.</p>
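<p>To make the DAG idea concrete, here is a minimal sketch of how stage ordering can be derived from <code>deps</code> and <code>outs</code>. It is not DVC’s actual implementation: the <code>stages</code> dictionary (including the paths listed for its second stage) and the <code>build_edges</code> helper are hypothetical and only illustrate the principle that a stage depends on whichever stage produces one of its inputs:</p>

```python
# Hypothetical, simplified stage definitions (not read from a real dvc.yaml).
stages = {
    "etl_pipeline": {
        "deps": ["src/data_handling/etl_pipeline.py"],
        "outs": ["data/original_df.parquet", "data/processed_df.parquet"],
    },
    "preprocess": {
        "deps": ["src/data_handling/preprocess.py", "data/processed_df.parquet"],
        "outs": ["data/x_train_df.parquet"],
    },
}


def build_edges(stages):
    """Return (upstream, downstream) pairs: an edge exists when one
    stage's output is listed among another stage's dependencies."""
    # Map every output path to the stage that produces it.
    producer = {}
    for name, spec in stages.items():
        for path in spec.get("outs", []):
            producer[path] = name

    # A stage depends on the producer of each of its dependencies.
    edges = set()
    for name, spec in stages.items():
        for path in spec.get("deps", []):
            upstream = producer.get(path)
            if upstream is not None and upstream != name:
                edges.add((upstream, name))
    return sorted(edges)


print(build_edges(stages))
```

<p>In practice you don’t need to compute this yourself: running <code>dvc dag</code> prints the graph DVC has built from your <code>dvc.yaml</code>.</p>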
<h4 id="heading-python-scripts"><strong>Python Scripts</strong></h4>
<p>Next, let’s add Python scripts, ensuring the data is stored using the file paths specified in the <code>outs</code> section of the <code>dvc.yaml</code> file:</p>
<p><code>src/data_handling/etl_pipeline.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

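<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">etl_pipeline</span>(<span class="hljs-params">stockcode: str = <span class="hljs-string">''</span>, impute_stockcode: bool = False</span>):</span>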
    <span class="hljs-comment"># extract the entire data</span>
    df = scripts.extract_original_dataframe()

    <span class="hljs-comment"># save the original data as a parquet file</span>
    ORIGINAL_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'original_df.parquet'</span>)
    df.to_parquet(ORIGINAL_DF_PATH, index=<span class="hljs-literal">False</span>) <span class="hljs-comment"># dvc tracked</span>

    <span class="hljs-comment"># transform</span>
    df = scripts.structure_missing_values(df=df)
    df = scripts.handle_feature_engineering(df=df)

    PROCESSED_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'processed_df.parquet'</span>)
    df.to_parquet(PROCESSED_DF_PATH, index=<span class="hljs-literal">False</span>) <span class="hljs-comment"># dvc tracked</span>
    <span class="hljs-keyword">return</span> df

<span class="hljs-comment"># for dvc execution</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:  
    parser = argparse.ArgumentParser(description=<span class="hljs-string">"run etl pipeline"</span>)
    parser.add_argument(<span class="hljs-string">'--stockcode'</span>, type=str, default=<span class="hljs-string">''</span>, help=<span class="hljs-string">"specific stockcode to process. empty runs full pipeline."</span>)
    parser.add_argument(<span class="hljs-string">'--impute'</span>, action=<span class="hljs-string">'store_true'</span>, help=<span class="hljs-string">"flag to create imputation values"</span>)
    args = parser.parse_args()

    etl_pipeline(stockcode=args.stockcode, impute_stockcode=args.impute)
</code></pre>
<h4 id="heading-outputs"><strong>Outputs</strong></h4>
<p>The original and processed DataFrames are saved as Parquet files and stored in the DVC cache:</p>
<ul>
<li><p><code>data/original_df.parquet</code></p>
</li>
<li><p><code>data/processed_df.parquet</code></p>
</li>
</ul>
<h3 id="heading-stage-2-the-data-drift-check">Stage 2: The Data Drift Check</h3>
<p>Before jumping into preprocessing, we’ll run data drift tests to make sure the data shows no notable drift. To do this, we’ll use <strong>Evidently AI</strong>, an open-source ML and LLM observability framework.</p>
<h4 id="heading-what-is-data-drift">What is Data Drift?</h4>
<p>Data drift refers to changes in the statistical properties of the data (such as its mean, variance, or distribution) relative to the data the model was trained on.</p>
<p>There are three main types of data drift:</p>
<ul>
<li><p><strong>Covariate Drift</strong> (Feature Drift): A change in the input feature distribution.</p>
</li>
<li><p><strong>Prior Probability Drift</strong> (Label Drift): A change in the target variable distribution.</p>
</li>
<li><p><strong>Concept Drift</strong>: A change in the relationship between the input data and the target variable.</p>
</li>
</ul>
<p>Data drift compromises the model's generalization capabilities over time, making its detection after deployment crucial.</p>
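<p>As a rough illustration of covariate drift, the sketch below compares a reference sample with a current sample of one numeric feature using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). This is a hand-rolled, standard-library version for intuition only, not a replacement for a drift-detection library. The sample values, the <code>ks_statistic</code> helper, and the 0.5 threshold are all made up for this example:</p>

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The gap can only change at observed values, so check each of them.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


# Made-up unit prices: the "shifted" sample has moved upward.
reference = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
same = [1.2, 1.4, 2.1, 2.6, 2.9, 3.4]
shifted = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5]

THRESHOLD = 0.5  # arbitrary cut-off for this toy example
for name, sample in [("same", same), ("shifted", shifted)]:
    stat = ks_statistic(reference, sample)
    print(name, round(stat, 3), "drift" if stat > THRESHOLD else "no drift")
```

<p>A statistic near 0 means the two distributions overlap almost completely, while a value of 1 means they don’t overlap at all.</p>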
<h4 id="heading-dvc-configuration-1">DVC Configuration</h4>
<p>We’ll add the <code>data_drift_check</code> stage right after the <code>etl_pipeline</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment"># the main command dvc will run in this stage</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/data_handling/report_data_drift.py
      data/processed/processed_df.csv 
      data/processed_df_${params.stockcode}.parquet
      reports/data_drift_report_${params.stockcode}.html
      metrics/data_drift_${params.stockcode}.json
      ${params.stockcode}
</span>
    <span class="hljs-comment"># default values for the parameters (defined in the params.yaml file)</span>
    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>

    <span class="hljs-comment"># dependencies necessary to run the main command</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/report_data_drift.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/</span>

    <span class="hljs-comment"># output file paths for dvc to track</span>
    <span class="hljs-attr">plots:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/data_drift_report_${params.stockcode}.html:</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/data_drift_${params.stockcode}.json:</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>
</code></pre>
<p>Then, add default values to the parameters passed to the DVC command:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">stockcode:</span> <span class="hljs-string">&lt;STOCKCODE</span> <span class="hljs-string">OF</span> <span class="hljs-string">CHOICE&gt;</span>
</code></pre>
<h4 id="heading-python-scripts-1">Python Scripts</h4>
<p>After <a target="_blank" href="https://docs.evidentlyai.com/quickstart_ml#1-1-set-up-evidently-cloud">generating an API token from the Evidently AI workspace</a>, we’ll add a Python script to detect data drift and store the results in the <code>metrics</code> variable:</p>
<p><code>src/data_handling/report_data_drift.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">from</span> evidently <span class="hljs-keyword">import</span> Dataset, DataDefinition, Report
<span class="hljs-keyword">from</span> evidently.presets <span class="hljs-keyword">import</span> DataDriftPreset
<span class="hljs-keyword">from</span> evidently.ui.workspace <span class="hljs-keyword">import</span> CloudWorkspace

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># initiate evidently cloud workspace</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    ws = CloudWorkspace(token=os.getenv(<span class="hljs-string">'EVENTLY_API_TOKEN'</span>), url=<span class="hljs-string">'https://app.evidently.cloud'</span>)

    <span class="hljs-comment"># retrieve evidently project</span>
    project = ws.get_project(<span class="hljs-string">'EVIDENTLY AI PROJECT ID'</span>)

    <span class="hljs-comment"># retrieve paths from the command line args</span>
    REFERENCE_DATA_PATH = sys.argv[<span class="hljs-number">1</span>]
    CURRENT_DATA_PATH = sys.argv[<span class="hljs-number">2</span>]
    REPORT_OUTPUT_PATH = sys.argv[<span class="hljs-number">3</span>]
    METRICS_OUTPUT_PATH = sys.argv[<span class="hljs-number">4</span>]
    STOCKCODE = sys.argv[<span class="hljs-number">5</span>]

    <span class="hljs-comment"># create folders if not exist</span>
    os.makedirs(os.path.dirname(REPORT_OUTPUT_PATH), exist_ok=<span class="hljs-literal">True</span>)
    os.makedirs(os.path.dirname(METRICS_OUTPUT_PATH), exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># extract datasets</span>
    reference_data_full = pd.read_csv(REFERENCE_DATA_PATH)
    reference_data_stockcode = reference_data_full[reference_data_full[<span class="hljs-string">'stockcode'</span>] == STOCKCODE]
    current_data_stockcode = pd.read_parquet(CURRENT_DATA_PATH)

    <span class="hljs-comment"># define data schema</span>
    nums, cats = scripts.categorize_num_cat_cols(df=reference_data_stockcode)
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> nums: current_data_stockcode[col] = pd.to_numeric(current_data_stockcode[col], errors=<span class="hljs-string">'coerce'</span>)

    schema = DataDefinition(numerical_columns=nums, categorical_columns=cats)

    <span class="hljs-comment"># define evidently datasets w/ the data schema</span>
    eval_data_1 = Dataset.from_pandas(reference_data_stockcode, data_definition=schema)
    eval_data_2 = Dataset.from_pandas(current_data_stockcode, data_definition=schema)

    <span class="hljs-comment"># execute drift detection</span>
    report = Report(metrics=[DataDriftPreset()])
    data_eval = report.run(reference_data=eval_data_1, current_data=eval_data_2)
    data_eval.save_html(REPORT_OUTPUT_PATH)

    <span class="hljs-comment"># create metrics for dvc tracking</span>
    report_dict = json.loads(data_eval.json())
    num_drifts = report_dict[<span class="hljs-string">'metrics'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'value'</span>][<span class="hljs-string">'count'</span>]
    shared_drifts = report_dict[<span class="hljs-string">'metrics'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'value'</span>][<span class="hljs-string">'share'</span>]
    metrics = dict(
        drift_detected=bool(num_drifts &gt; <span class="hljs-number">0.0</span>), num_drifts=num_drifts, shared_drifts=shared_drifts,
        num_cols=nums,
        cat_cols=cats,
        stockcode=STOCKCODE,
        timestamp=datetime.datetime.now().isoformat(),
    )

    <span class="hljs-comment"># write metrics file</span>
    <span class="hljs-keyword">with</span> open(METRICS_OUTPUT_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        json.dump(metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... drift metrics saved to <span class="hljs-subst">{METRICS_OUTPUT_PATH}</span>... '</span>)

    <span class="hljs-comment"># stop the system if data drift is found</span>
    <span class="hljs-keyword">if</span> num_drifts &gt; <span class="hljs-number">0.0</span>: sys.exit(<span class="hljs-string">'❌ FATAL: data drift detected. stopping pipeline'</span>)
</code></pre>
<p>If data drift is found, the script exits immediately via the final <code>sys.exit</code> call, so the stage fails and the later stages don’t run.</p>
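<p>The mechanism behind this is the process exit status: calling <code>sys.exit</code> with a string prints the message to stderr and exits with status 1, and a non-zero status marks the stage’s command as failed. The sketch below checks that behavior in isolation by running a tiny child Python process. The message text is just an example:</p>

```python
import subprocess
import sys

# Run a child interpreter that exits the same way the drift script does.
child_code = "import sys; sys.exit('data drift detected')"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
)

print("exit code:", result.returncode)   # non-zero signals failure
print("stderr:", result.stderr.strip())  # the message passed to sys.exit
```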
<h4 id="heading-outputs-1">Outputs</h4>
<p>The script generates two files that DVC will track:</p>
<ul>
<li><p><code>reports/data_drift_report.html</code>: The data drift report in an HTML file.</p>
</li>
<li><p><code>metrics/data_drift.json</code>: The data drift metrics in a JSON file, including drift results along with feature columns and a timestamp:</p>
</li>
</ul>
<p><code>metrics/data_drift.json</code>:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"drift_detected"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">"num_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-attr">"shared_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-attr">"num_cols"</span>: [
        <span class="hljs-string">"invoiceno"</span>,
        <span class="hljs-string">"invoicedate"</span>,
        <span class="hljs-string">"unitprice"</span>,
        <span class="hljs-string">"product_avg_quantity_last_month"</span>,
        <span class="hljs-string">"product_max_price_all_time"</span>,
        <span class="hljs-string">"unitprice_vs_max"</span>,
        <span class="hljs-string">"unitprice_to_avg"</span>,
        <span class="hljs-string">"unitprice_squared"</span>,
        <span class="hljs-string">"unitprice_log"</span>
    ],
    <span class="hljs-attr">"cat_cols"</span>: [
        <span class="hljs-string">"stockcode"</span>,
        <span class="hljs-string">"customerid"</span>,
        <span class="hljs-string">"country"</span>,
        <span class="hljs-string">"year"</span>,
        <span class="hljs-string">"year_month"</span>,
        <span class="hljs-string">"day_of_week"</span>,
        <span class="hljs-string">"is_registered"</span>
    ],
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:24:29.899495"</span>
}
</code></pre>
<p>The drift test results are also available on the Evidently workspace dashboard for further analysis:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/0*2C1ICzvVazAUH7fk.png" alt="Figure B. Screenshot of the Evidently workspace dashboard" width="600" height="400" loading="lazy"></p>
<p><strong>Figure B.</strong> Screenshot of the Evidently workspace dashboard</p>
<h3 id="heading-stage-3-preprocessing">Stage 3: Preprocessing</h3>
<p>If no data drift is detected, the lineage moves on to the preprocessing stage.</p>
<h4 id="heading-dvc-configuration-2">DVC Configuration</h4>
<p>We’ll add the <code>preprocess</code> stage right after the <code>data_drift_check</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/data_handling/preprocess.py --target_col ${params.target_col} --should_scale ${params.should_scale} --verbose ${params.verbose}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/preprocess.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils</span>

    <span class="hljs-comment"># params from params.yaml</span>
    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.target_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.should_scale</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.verbose</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-comment"># train, val, test datasets</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_train_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_val_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_test_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_train_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_val_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_test_df.parquet</span>

      <span class="hljs-comment"># preprocessed input datasets</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_train_processed.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_val_processed.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_test_processed.parquet</span>

      <span class="hljs-comment"># trained preprocessor and human readable feature names for shap analysis</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">preprocessors/column_transformer.pkl</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">preprocessors/feature_names.json</span>
</code></pre>
<p>Then, add default values for the parameters used in the <code>cmd</code>:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>
</code></pre>
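<p>One detail to keep in mind: values interpolated into <code>cmd</code> reach the script as plain text, so a boolean default arrives as a string. With <code>argparse</code>, <code>type=bool</code> turns any non-empty string (including <code>"False"</code>) into <code>True</code>. A minimal sketch of a safer converter follows. Note that <code>str_to_bool</code> is a hypothetical helper, not part of the original project:</p>

```python
import argparse


def str_to_bool(value):
    """Convert common textual booleans to bool; reject anything else."""
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


parser = argparse.ArgumentParser(description="boolean parsing demo")
parser.add_argument("--should_scale", type=str_to_bool, default=True)
parser.add_argument("--verbose", type=str_to_bool, default=False)

# Simulate the arguments a templated command might produce.
args = parser.parse_args(["--should_scale", "False", "--verbose", "true"])
print(args.should_scale, args.verbose)
```

<p>Accepting both capitalized and lowercase spellings keeps the script independent of how the value is rendered in the command line.</p>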
<h4 id="heading-python-scripts-2">Python Scripts</h4>
<p>Next, we’ll add a Python script to create training, validation, and test datasets and preprocess input data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess</span>(<span class="hljs-params">stockcode: str = <span class="hljs-string">''</span>, target_col: str = <span class="hljs-string">'quantity'</span>, should_scale: bool = True, verbose: bool = False</span>):</span>
    <span class="hljs-comment"># initiate metrics to track (dvc)</span>
    DATA_DRIFT_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'data_drift_<span class="hljs-subst">{stockcode}</span>.json'</span>)

    <span class="hljs-keyword">if</span> os.path.exists(DATA_DRIFT_METRICS_PATH):
        <span class="hljs-keyword">with</span> open(DATA_DRIFT_METRICS_PATH, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
            metrics = json.load(f)
    <span class="hljs-keyword">else</span>: metrics = dict()

    <span class="hljs-comment"># load processed df from dvc cache</span>
    PROCESSED_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'processed_df.parquet'</span>)
    df = pd.read_parquet(PROCESSED_DF_PATH)

    <span class="hljs-comment"># categorize num and cat columns</span>
    num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
    <span class="hljs-keyword">if</span> verbose: main_logger.info(<span class="hljs-string">f'num_cols: <span class="hljs-subst">{num_cols}</span> \ncat_cols: <span class="hljs-subst">{cat_cols}</span>'</span>)

    <span class="hljs-comment"># structure cat cols</span>
    <span class="hljs-keyword">if</span> cat_cols:
        <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cat_cols: df[col] = df[col].astype(<span class="hljs-string">'string'</span>)

    <span class="hljs-comment"># initiate preprocessor (either load from the dvc cache or create from scratch)</span>
    PREPROCESSOR_PATH = os.path.join(<span class="hljs-string">'preprocessors'</span>, <span class="hljs-string">'column_transformer.pkl'</span>)
    <span class="hljs-keyword">try</span>:
        preprocessor = joblib.load(PREPROCESSOR_PATH)
    <span class="hljs-keyword">except</span> Exception:
        preprocessor = scripts.create_preprocessor(num_cols=num_cols <span class="hljs-keyword">if</span> should_scale <span class="hljs-keyword">else</span> [], cat_cols=cat_cols)

    <span class="hljs-comment"># creates train, val, test datasets</span>
    y = df[target_col]
    X = df.copy().drop(target_col, axis=<span class="hljs-string">'columns'</span>)

    <span class="hljs-comment"># split</span>
    test_size, random_state = <span class="hljs-number">50000</span>, <span class="hljs-number">42</span>
    X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, shuffle=<span class="hljs-literal">False</span>)
    X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, random_state=random_state, shuffle=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># store train, val, test datasets (dvc track)</span>
    X_train.to_parquet(<span class="hljs-string">'data/x_train_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    X_val.to_parquet(<span class="hljs-string">'data/x_val_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    X_test.to_parquet(<span class="hljs-string">'data/x_test_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_train.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_train_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_val.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_val_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_test.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_test_df.parquet'</span>, index=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># preprocess</span>
    X_train = preprocessor.fit_transform(X_train)
    X_val = preprocessor.transform(X_val)
    X_test = preprocessor.transform(X_test)

    <span class="hljs-comment"># store preprocessed input data (dvc track)</span>
    pd.DataFrame(X_train).to_parquet(<span class="hljs-string">'data/x_train_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)
    pd.DataFrame(X_val).to_parquet(<span class="hljs-string">'data/x_val_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)
    pd.DataFrame(X_test).to_parquet(<span class="hljs-string">'data/x_test_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># save feature names (dvc track) for shap</span>
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">'preprocessors/feature_names.json'</span>, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        feature_names = preprocessor.get_feature_names_out()
        json.dump(feature_names.tolist(), f)

    <span class="hljs-keyword">return</span>  X_train, X_val, X_test, y_train, y_val, y_test, preprocessor


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    parser = argparse.ArgumentParser(description=<span class="hljs-string">'run data preprocessing'</span>)
    parser.add_argument(<span class="hljs-string">'--stockcode'</span>, type=str, default=<span class="hljs-string">''</span>, help=<span class="hljs-string">'specific stockcode'</span>)
    parser.add_argument(<span class="hljs-string">'--target_col'</span>, type=str, default=<span class="hljs-string">'quantity'</span>, help=<span class="hljs-string">'the target column name'</span>)
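    <span class="hljs-comment"># parse booleans explicitly: type=bool would treat any non-empty string (even 'False') as True</span>
    parser.add_argument(<span class="hljs-string">'--should_scale'</span>, type=<span class="hljs-keyword">lambda</span> s: str(s).lower() == <span class="hljs-string">'true'</span>, default=<span class="hljs-literal">True</span>, help=<span class="hljs-string">'flag to scale numerical features'</span>)
    parser.add_argument(<span class="hljs-string">'--verbose'</span>, type=<span class="hljs-keyword">lambda</span> s: str(s).lower() == <span class="hljs-string">'true'</span>, default=<span class="hljs-literal">False</span>, help=<span class="hljs-string">'flag for verbose logging'</span>)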
    args = parser.parse_args()

    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = preprocess(
        target_col=args.target_col,
        should_scale=args.should_scale,
        verbose=args.verbose,
        stockcode=args.stockcode,
    )
</code></pre>
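<p>Because both <code>train_test_split</code> calls use <code>shuffle=False</code> with an integer <code>test_size</code>, the split is positional: the last rows become the test set, the rows just before them the validation set, and everything earlier the training set. The pure-Python sketch below mimics that slicing on a toy list (10 rows with a hold-out size of 2 instead of 50,000). Here <code>ordered_split</code> is a hypothetical stand-in, not the scikit-learn function:</p>

```python
def ordered_split(rows, holdout_size):
    """Split without shuffling: the last `holdout_size` rows are held out."""
    cut = len(rows) - holdout_size
    return rows[:cut], rows[cut:]


rows = list(range(10))  # stand-in for records kept in their original order
holdout_size = 2        # the script above uses 50000

# First carve off the test set, then the validation set from what remains.
train_val, test = ordered_split(rows, holdout_size)
train, val = ordered_split(train_val, holdout_size)

print("train:", train)
print("val:  ", val)
print("test: ", test)
```

<p>If the records are sorted by time, keeping the order intact means the validation and test rows always come after the training rows, so no future data leaks into training.</p>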
<h4 id="heading-outputs-2">Outputs</h4>
<p>This stage generates the necessary datasets for both model training and inference:</p>
<p>Input features:</p>
<ul>
<li><p><code>data/x_train_df.parquet</code></p>
</li>
<li><p><code>data/x_val_df.parquet</code></p>
</li>
<li><p><code>data/x_test_df.parquet</code></p>
</li>
</ul>
<p>Preprocessed input features:</p>
<ul>
<li><p><code>data/x_train_processed.parquet</code></p>
</li>
<li><p><code>data/x_val_processed.parquet</code></p>
</li>
<li><p><code>data/x_test_processed.parquet</code></p>
</li>
</ul>
<p>Target variables:</p>
<ul>
<li><p><code>data/y_train_df.parquet</code></p>
</li>
<li><p><code>data/y_val_df.parquet</code></p>
</li>
<li><p><code>data/y_test_df.parquet</code></p>
</li>
</ul>
<p>The preprocessor and human-readable feature names are also stored in cache for inference and SHAP feature impact analysis later:</p>
<ul>
<li><p><code>preprocessors/column_transformer.pkl</code></p>
</li>
<li><p><code>preprocessors/feature_names.json</code></p>
</li>
</ul>
<p>Lastly, the <code>preprocess_status</code>, <code>x_train_processed_path</code>, and <code>preprocessor_path</code> keys are added to the data summary metrics file <code>data.json</code> created in Stage 2 to track the end-to-end process of Stages 2 and 3:</p>
<p><code>metrics/data.json</code>:</p>
<pre><code class="lang-python">{
    <span class="hljs-string">"drift_detected"</span>: false,
    <span class="hljs-string">"num_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-string">"shared_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-string">"num_cols"</span>: [
        <span class="hljs-string">"invoiceno"</span>,
        <span class="hljs-string">"invoicedate"</span>,
        <span class="hljs-string">"unitprice"</span>,
        <span class="hljs-string">"product_avg_quantity_last_month"</span>,
        <span class="hljs-string">"product_max_price_all_time"</span>,
        <span class="hljs-string">"unitprice_vs_max"</span>,
        <span class="hljs-string">"unitprice_to_avg"</span>,
        <span class="hljs-string">"unitprice_squared"</span>,
        <span class="hljs-string">"unitprice_log"</span>
    ],
    <span class="hljs-string">"cat_cols"</span>: [
        <span class="hljs-string">"stockcode"</span>,
        <span class="hljs-string">"customerid"</span>,
        <span class="hljs-string">"country"</span>,
        <span class="hljs-string">"year"</span>,
        <span class="hljs-string">"year_month"</span>,
        <span class="hljs-string">"day_of_week"</span>,
        <span class="hljs-string">"is_registered"</span>
    ],
    <span class="hljs-string">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:24:29.899495"</span>,

    <span class="hljs-comment"># updates</span>
    <span class="hljs-string">"preprocess_status"</span>: <span class="hljs-string">"completed"</span>,
    <span class="hljs-string">"x_train_processed_path"</span>: <span class="hljs-string">"data/x_train_processed_85123A.parquet"</span>,
    <span class="hljs-string">"preprocessor_path"</span>: <span class="hljs-string">"preprocessors/column_transformer.pkl"</span>
}
</code></pre>
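<p>The preprocessing script shown earlier loads the metrics file but doesn’t include the code that writes these extra keys. A minimal sketch of such an update, assuming a hypothetical <code>update_metrics</code> helper and a temporary file in place of the real metrics path, might look like this:</p>

```python
import json
import os
import tempfile


def update_metrics(path, **updates):
    """Merge `updates` into the JSON file at `path`, creating it if absent."""
    metrics = {}
    if os.path.exists(path):
        with open(path, "r") as f:
            metrics = json.load(f)
    metrics.update(updates)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=4)
    return metrics


# Demo on a temporary file rather than the real metrics path.
demo_path = os.path.join(tempfile.mkdtemp(), "data.json")
update_metrics(demo_path, drift_detected=False, num_drifts=0.0)
merged = update_metrics(
    demo_path,
    preprocess_status="completed",
    preprocessor_path="preprocessors/column_transformer.pkl",
)
print(sorted(merged))
```

<p>Reading the existing file before writing preserves the drift results from the earlier stage instead of overwriting them.</p>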
<p>Next, let’s move on to the model/experiment lineage.</p>
<h3 id="heading-stage-4-tuning-the-model">Stage 4: Tuning the Model</h3>
<p>Now that we’ve created the datasets, we’ll tune and train the primary model. It’s a multi-layered feedforward network built with <strong>PyTorch</strong>, and it uses the training and validation datasets created in the <code>preprocess</code> stage.</p>
<h4 id="heading-dvc-configuration-3">DVC Configuration</h4>
<p>First, we’ll add the <code>tune_primary_model</code> stage right after the <code>preprocess</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/main.py
      data/x_train_processed_${params.stockcode}.parquet
      data/x_val_processed_${params.stockcode}.parquet
      data/y_train_df_${params.stockcode}.parquet
      data/y_val_df_${params.stockcode}.parquet
      ${tuning.should_local_save}
      ${tuning.grid}
      ${tuning.n_trials}
      ${tuning.num_epochs}
      ${params.stockcode}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/main.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.n_trials</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.grid</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.should_local_save</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">models/production/dfn_best_${params.stockcode}.pth</span> <span class="hljs-comment"># dvc track</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_val_${params.stockcode}.json:</span> <span class="hljs-comment"># dvc track</span>
</code></pre>
<p>Then we’ll add default values to the parameters:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>

<span class="hljs-attr">tuning:</span>
  <span class="hljs-attr">n_trials:</span> <span class="hljs-number">100</span>
  <span class="hljs-attr">num_epochs:</span> <span class="hljs-number">3000</span>
  <span class="hljs-attr">should_local_save:</span> <span class="hljs-literal">False</span>
  <span class="hljs-attr">grid:</span> <span class="hljs-literal">False</span>
</code></pre>
<h4 id="heading-python-scripts-3">Python Scripts</h4>
<p>Next, we’ll add the Python scripts to tune the model using <strong>Bayesian optimization</strong> and then train the optimal model on the complete <code>X_train</code> and <code>y_train</code> datasets created in the <code>preprocess</code> stage.</p>
<p><code>src/model/torch_model/main.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn

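<span class="hljs-keyword">import</span> src.model.torch_model.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger  <span class="hljs-comment"># logger used in track_metrics_by_stockcode</span>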


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">tune_and_train</span>(<span class="hljs-params">
        X_train, X_val, y_train, y_val,
        stockcode: str = <span class="hljs-string">''</span>,
        should_local_save: bool = True,
        grid: bool = False,
        n_trials: int = <span class="hljs-number">50</span>,
        num_epochs: int = <span class="hljs-number">3000</span>
    </span>) -&gt; tuple[nn.Module, dict]:</span>

    <span class="hljs-comment"># perform bayesian optimization</span>
    best_dfn, best_optimizer, best_batch_size, best_checkpoint = scripts.bayesian_optimization(
        X_train, X_val, y_train, y_val, n_trials=n_trials, num_epochs=num_epochs
    )

    <span class="hljs-comment"># save the model artifact (dvc track)</span>
    DFN_FILE_PATH = os.path.join(<span class="hljs-string">'models'</span>, <span class="hljs-string">'production'</span>, <span class="hljs-string">f'dfn_best_<span class="hljs-subst">{stockcode}</span>.pth'</span> <span class="hljs-keyword">if</span> stockcode <span class="hljs-keyword">else</span> <span class="hljs-string">'dfn_best.pth'</span>)
    os.makedirs(os.path.dirname(DFN_FILE_PATH), exist_ok=<span class="hljs-literal">True</span>)
    torch.save(best_checkpoint, DFN_FILE_PATH)

    <span class="hljs-keyword">return</span> best_dfn, best_checkpoint



<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">track_metrics_by_stockcode</span>(<span class="hljs-params">X_val, y_val, best_model, checkpoint: dict, stockcode: str</span>):</span>
    MODEL_VAL_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'dfn_val_<span class="hljs-subst">{stockcode}</span>.json'</span>)
    os.makedirs(os.path.dirname(MODEL_VAL_METRICS_PATH), exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># validate the tuned model</span>
    _, mse, exp_mae, rmsle = scripts.perform_inference(model=best_model, X=X_val, y=y_val)
    model_version = <span class="hljs-string">f"dfn_<span class="hljs-subst">{stockcode}</span>_<span class="hljs-subst">{os.getpid()}</span>"</span>
    metrics = dict(
        stockcode=stockcode,
        mse_val=mse,
        mae_val=exp_mae,
        rmsle_val=rmsle,
        model_version=model_version,
        hparams=checkpoint[<span class="hljs-string">'hparams'</span>],
        optimizer=checkpoint[<span class="hljs-string">'optimizer_name'</span>],
        batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>],
        lr=checkpoint[<span class="hljs-string">'lr'</span>],
        timestamp=datetime.datetime.now().isoformat()
    )
    <span class="hljs-comment"># store the validation results (dvc track)</span>
    <span class="hljs-keyword">with</span> open(MODEL_VAL_METRICS_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        json.dump(metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... validation metrics saved to <span class="hljs-subst">{MODEL_VAL_METRICS_PATH}</span> ...'</span>)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># fetch command arg values</span>
    X_TRAIN_PATH = sys.argv[<span class="hljs-number">1</span>]
    X_VAL_PATH = sys.argv[<span class="hljs-number">2</span>]
    Y_TRAIN_PATH = sys.argv[<span class="hljs-number">3</span>]
    Y_VAL_PATH = sys.argv[<span class="hljs-number">4</span>]
    SHOULD_LOCAL_SAVE = sys.argv[<span class="hljs-number">5</span>] == <span class="hljs-string">'True'</span>
    GRID = sys.argv[<span class="hljs-number">6</span>] == <span class="hljs-string">'True'</span>
    N_TRIALS = int(sys.argv[<span class="hljs-number">7</span>])
    NUM_EPOCHS = int(sys.argv[<span class="hljs-number">8</span>])
    STOCKCODE = str(sys.argv[<span class="hljs-number">9</span>])

    <span class="hljs-comment"># extract training and validation datasets from dvc cache</span>
    X_train, X_val = pd.read_parquet(X_TRAIN_PATH), pd.read_parquet(X_VAL_PATH)
    y_train, y_val = pd.read_parquet(Y_TRAIN_PATH), pd.read_parquet(Y_VAL_PATH)

    <span class="hljs-comment"># tuning</span>
    best_model, checkpoint = tune_and_train(
        X_train, X_val, y_train, y_val,
        stockcode=STOCKCODE, should_local_save=SHOULD_LOCAL_SAVE, grid=GRID, n_trials=N_TRIALS, num_epochs=NUM_EPOCHS
    )

    <span class="hljs-comment"># metrics tracking</span>
    track_metrics_by_stockcode(X_val, y_val, best_model=best_model, checkpoint=checkpoint, stockcode=STOCKCODE)
</code></pre>
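<p>The <code>sys.argv[5] == 'True'</code> comparison only matches the exact string <code>True</code>. Depending on how the value is interpolated into the command, it may arrive in another form, such as lowercase <code>true</code>, and the flag would then silently evaluate to <code>False</code>. Here is a minimal sketch of a more tolerant parser. Note that <code>parse_bool</code> is a hypothetical helper, not part of the project code:</p>

```python
def parse_bool(raw: str) -> bool:
    """Interpret a command-line string as a boolean, ignoring case and whitespace."""
    value = raw.strip().lower()
    if value in {"true", "1", "yes"}:
        return True
    if value in {"false", "0", "no"}:
        return False
    # fail loudly instead of silently treating unknown input as False
    raise ValueError(f"cannot interpret {raw!r} as a boolean")


# hypothetical argument values, as they might arrive via sys.argv
for raw in ("True", "true", "False", "false"):
    print(raw, "->", parse_bool(raw))
```

<p>With a helper like this, the script could read <code>SHOULD_LOCAL_SAVE = parse_bool(sys.argv[5])</code>, and an unexpected value would raise an error instead of being treated as <code>False</code>.</p>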
<h4 id="heading-outputs-3">Outputs</h4>
<p>The stage generates two files:</p>
<ul>
<li><p><code>models/production/dfn_best_${params.stockcode}.pth</code>: Stores the model artifact and checkpoint details such as the optimal hyperparameter set.</p>
</li>
<li><p><code>metrics/dfn_val_${params.stockcode}.json</code>: Contains the tuning results, model version, timestamp, and validation results for MSE, MAE, and RMSLE:</p>
</li>
</ul>
<p><code>metrics/dfn_val_${params.stockcode}.json</code>:</p>
<pre><code class="lang-yaml">{
    <span class="hljs-attr">"stockcode":</span> <span class="hljs-string">"85123A"</span>,
    <span class="hljs-attr">"mse_val":</span> <span class="hljs-number">0.6137686967849731</span>,
    <span class="hljs-attr">"mae_val":</span> <span class="hljs-number">9.092489242553711</span>,
    <span class="hljs-attr">"rmsle_val":</span> <span class="hljs-number">0.6953299045562744</span>,
    <span class="hljs-attr">"model_version":</span> <span class="hljs-string">"dfn_85123A_35604"</span>,
    <span class="hljs-attr">"hparams":</span> {
        <span class="hljs-attr">"num_layers":</span> <span class="hljs-number">4</span>,
        <span class="hljs-attr">"batch_norm":</span> <span class="hljs-literal">false</span>,
        <span class="hljs-attr">"dropout_rate_layer_0":</span> <span class="hljs-number">0.13765888061300502</span>,
        <span class="hljs-attr">"n_units_layer_0":</span> <span class="hljs-number">184</span>,
        <span class="hljs-attr">"dropout_rate_layer_1":</span> <span class="hljs-number">0.5509872409359128</span>,
        <span class="hljs-attr">"n_units_layer_1":</span> <span class="hljs-number">122</span>,
        <span class="hljs-attr">"dropout_rate_layer_2":</span> <span class="hljs-number">0.2408753527744403</span>,
        <span class="hljs-attr">"n_units_layer_2":</span> <span class="hljs-number">35</span>,
        <span class="hljs-attr">"dropout_rate_layer_3":</span> <span class="hljs-number">0.03451842588822594</span>,
        <span class="hljs-attr">"n_units_layer_3":</span> <span class="hljs-number">224</span>,
        <span class="hljs-attr">"learning_rate":</span> <span class="hljs-number">0.026240673135104406</span>,
        <span class="hljs-attr">"optimizer":</span> <span class="hljs-string">"adamax"</span>,
        <span class="hljs-attr">"batch_size":</span> <span class="hljs-number">64</span>
    },
    <span class="hljs-attr">"optimizer":</span> <span class="hljs-string">"adamax"</span>,
    <span class="hljs-attr">"batch_size":</span> <span class="hljs-number">64</span>,
    <span class="hljs-attr">"lr":</span> <span class="hljs-number">0.026240673135104406</span>,
    <span class="hljs-attr">"timestamp":</span> <span class="hljs-string">"2025-10-07T00:31:08.700294"</span>
}
</code></pre>
<h3 id="heading-stage-5-performing-inference">Stage 5: Performing Inference</h3>
<p>After the model tuning phase is complete, we’ll configure the test inference for a final evaluation.</p>
<p>The final evaluation uses the MSE, MAE, and RMSLE metrics, as well as SHAP for feature impact and interpretability analysis.</p>
<p><strong>SHAP</strong> <strong>(SHapley Additive exPlanations)</strong> is a framework for quantifying how much each feature contributes to a model’s prediction by using the concept of Shapley values from game theory.</p>
<p>We can then use the SHAP values to guide future EDA and feature engineering.</p>
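<p>To make the idea of Shapley values concrete, here is a minimal sketch that computes them exactly for a tiny, made-up model by averaging each feature’s marginal contribution over every feature ordering. The <code>toy_model</code> function, the input, and the all-zero baseline are assumptions for illustration only. The SHAP library approximates these values far more efficiently than this brute-force approach:</p>

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start with every feature set to its baseline
        prev = predict(current)
        for i in order:
            current[i] = x[i]         # switch feature i to its actual value
            new = predict(current)
            totals[i] += new - prev   # marginal contribution of feature i in this ordering
            prev = new
    return [t / len(orderings) for t in totals]


def toy_model(features):
    # hypothetical model: two linear terms plus one interaction term
    price, qty, registered = features
    return 2.0 * price + 0.5 * qty + 1.0 * price * registered


x = [3.0, 4.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

print(phi)
# the contributions add up to the gap between the prediction and the baseline prediction
print(sum(phi), toy_model(x) - toy_model(baseline))
```

<p>The cost grows factorially with the number of features, which is why practical explainers rely on approximations.</p>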
<h4 id="heading-dvc-configuration-4">DVC Configuration</h4>
<p>First, we’ll add the <code>inference_primary_model</code> stage to the DVC configuration.</p>
<p>This stage includes a <code>plots</code> section so that DVC tracks and versions the visualization files generated from the SHAP values.</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">inference_primary_model:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/inference.py
      data/x_test_processed_${params.stockcode}.parquet
      data/y_test_df_${params.stockcode}.parquet
      models/production/dfn_best_${params.stockcode}.pth
      ${params.stockcode}
      ${tracking.sensitive_feature_col}
      ${tracking.privileged_group}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/inference.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">models/production/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.sensitive_feature_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.privileged_group</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_inf_${params.stockcode}.json:</span> <span class="hljs-comment"># dvc track</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>

    <span class="hljs-attr">plots:</span>
      <span class="hljs-comment"># shap summary / beeswarm plot for global interpretability</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_shap_summary_${params.stockcode}.json:</span>
          <span class="hljs-attr">template:</span> <span class="hljs-string">simple</span>
          <span class="hljs-attr">x:</span> <span class="hljs-string">shap_value</span>
          <span class="hljs-attr">y:</span> <span class="hljs-string">feature_name</span>
          <span class="hljs-attr">title:</span> <span class="hljs-string">SHAP</span> <span class="hljs-string">Beeswarm</span> <span class="hljs-string">Plot</span>

      <span class="hljs-comment"># shap mean absolute vals - feature importance bar plot</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_shap_mean_abs_${params.stockcode}.json:</span>
          <span class="hljs-attr">template:</span> <span class="hljs-string">bar</span>
          <span class="hljs-attr">x:</span> <span class="hljs-string">mean_abs_shap</span>
          <span class="hljs-attr">y:</span> <span class="hljs-string">feature_name</span>
          <span class="hljs-attr">title:</span> <span class="hljs-string">Mean</span> <span class="hljs-string">Absolute</span> <span class="hljs-string">SHAP</span> <span class="hljs-string">Importance</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/dfn_inference_results_${params.stockcode}.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_raw_shap_values_${params.stockcode}.parquet</span> <span class="hljs-comment"># save raw shap vals for detailed analysis later</span>
</code></pre>
<h4 id="heading-python-scripts-4">Python Scripts</h4>
<p>Next, we’ll add the script that loads the trained model and runs inference on the test set:</p>
<p><code>src/model/torch_model/inference.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> shap

<span class="hljs-keyword">import</span> src.model.torch_model.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># load test dataset</span>
    X_TEST_PATH = sys.argv[<span class="hljs-number">1</span>]
    Y_TEST_PATH = sys.argv[<span class="hljs-number">2</span>]
    X_test, y_test = pd.read_parquet(X_TEST_PATH), pd.read_parquet(Y_TEST_PATH)

    <span class="hljs-comment"># create X_test w/ column names for shap analysis and sensitive feature tracking</span>
    X_test_with_col_names = X_test.copy()
    FEATURE_NAMES_PATH = os.path.join(<span class="hljs-string">'preprocessors'</span>, <span class="hljs-string">'feature_names.json'</span>)
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">with</span> open(FEATURE_NAMES_PATH, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f: feature_names = json.load(f)
    <span class="hljs-keyword">except</span> FileNotFoundError: feature_names = X_test.columns.tolist()
    <span class="hljs-keyword">if</span> len(X_test_with_col_names.columns) == len(feature_names): X_test_with_col_names.columns = feature_names

    <span class="hljs-comment"># reconstruct the optimal model tuned in the previous stage</span>
    MODEL_PATH = sys.argv[<span class="hljs-number">3</span>]
    checkpoint = torch.load(MODEL_PATH)
    model = scripts.load_model(checkpoint=checkpoint)

    <span class="hljs-comment"># perform inference</span>
    y_pred, mse, exp_mae, rmsle = scripts.perform_inference(model=model, X=X_test, y=y_test, batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>])

    <span class="hljs-comment"># create result df w/ y_pred, y_true, and sensitive features</span>
    STOCKCODE = sys.argv[<span class="hljs-number">4</span>]
    SENSITIVE_FEATURE = sys.argv[<span class="hljs-number">5</span>]
    PRIVILEGED_GROUP = sys.argv[<span class="hljs-number">6</span>]
    inference_df = pd.DataFrame(y_pred.cpu().numpy().flatten(), columns=[<span class="hljs-string">'y_pred'</span>])
    inference_df[<span class="hljs-string">'y_true'</span>] = y_test
    inference_df[SENSITIVE_FEATURE] = X_test_with_col_names[<span class="hljs-string">f'cat__<span class="hljs-subst">{SENSITIVE_FEATURE}</span>_<span class="hljs-subst">{str(PRIVILEGED_GROUP)}</span>'</span>].astype(bool)
    inference_df.to_parquet(path=os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">f'dfn_inference_results_<span class="hljs-subst">{STOCKCODE}</span>.parquet'</span>))

    <span class="hljs-comment"># record inference metrics</span>
    MODEL_INF_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'dfn_inf_<span class="hljs-subst">{STOCKCODE}</span>.json'</span>)
    os.makedirs(os.path.dirname(MODEL_INF_METRICS_PATH), exist_ok=<span class="hljs-literal">True</span>)
    model_version = <span class="hljs-string">f"dfn_<span class="hljs-subst">{STOCKCODE}</span>_<span class="hljs-subst">{os.getpid()}</span>"</span>
    inf_metrics = dict(
        stockcode=STOCKCODE,
        mse_inf=mse,
        mae_inf=exp_mae,
        rmsle_inf=rmsle,
        model_version=model_version,
        hparams=checkpoint[<span class="hljs-string">'hparams'</span>],
        optimizer=checkpoint[<span class="hljs-string">'optimizer_name'</span>],
        batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>],
        lr=checkpoint[<span class="hljs-string">'lr'</span>],
        timestamp=datetime.datetime.now().isoformat()
    )
    <span class="hljs-keyword">with</span> open(MODEL_INF_METRICS_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f: <span class="hljs-comment"># dvc track</span>
        json.dump(inf_metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... inference metrics saved to <span class="hljs-subst">{MODEL_INF_METRICS_PATH}</span> ...'</span>)


    <span class="hljs-comment">## shap analysis</span>
    <span class="hljs-comment"># compute shap vals</span>
    model.eval()

    <span class="hljs-comment"># prepare background data on the same device as the model</span>
    device_type = next(model.parameters()).device
    X_test_tensor = torch.from_numpy(X_test.values.astype(np.float32)).to(device_type)

    <span class="hljs-comment"># take a small random sample (at most 100 rows) from x_test as background</span>
    background = X_test_tensor[np.random.choice(X_test_tensor.shape[<span class="hljs-number">0</span>], min(<span class="hljs-number">100</span>, X_test_tensor.shape[<span class="hljs-number">0</span>]), replace=<span class="hljs-literal">False</span>)].to(device_type)

    <span class="hljs-comment"># define deepexplainer</span>
    explainer = shap.DeepExplainer(model, background)

    <span class="hljs-comment"># compute shap vals</span>
    shap_values = explainer.shap_values(X_test_tensor) <span class="hljs-comment"># outputs = numpy array or tensor</span>

    <span class="hljs-comment"># convert shap array to pandas df</span>
    <span class="hljs-keyword">if</span> isinstance(shap_values, list): shap_values = shap_values[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">if</span> isinstance(shap_values, torch.Tensor): shap_values = shap_values.cpu().numpy()
    shap_values = shap_values.squeeze(axis=<span class="hljs-number">-1</span>) <span class="hljs-comment"># type: ignore</span>
    shap_df = pd.DataFrame(shap_values, columns=feature_names)

    <span class="hljs-comment"># shap raw data (dvc track)</span>
    RAW_SHAP_OUT_PATH = os.path.join(<span class="hljs-string">'reports'</span>, <span class="hljs-string">f'dfn_raw_shap_values_<span class="hljs-subst">{STOCKCODE}</span>.parquet'</span>)
    os.makedirs(os.path.dirname(RAW_SHAP_OUT_PATH), exist_ok=<span class="hljs-literal">True</span>)
    shap_df.to_parquet(RAW_SHAP_OUT_PATH, index=<span class="hljs-literal">False</span>)
    main_logger.info(<span class="hljs-string">f'... shap values saved to <span class="hljs-subst">{RAW_SHAP_OUT_PATH}</span> ...'</span>)

    <span class="hljs-comment"># bar plot of mean abs shap vals (dvc report)</span>
    mean_abs_shap = shap_df.abs().mean().sort_values(ascending=<span class="hljs-literal">False</span>)
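    <span class="hljs-comment"># take names from the sorted index so each name stays paired with its value</span>
    shap_mean_abs_df = pd.DataFrame({<span class="hljs-string">'feature_name'</span>: mean_abs_shap.index, <span class="hljs-string">'mean_abs_shap'</span>: mean_abs_shap.values})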
    MEAN_ABS_SHAP_PATH = os.path.join(<span class="hljs-string">'reports'</span>, <span class="hljs-string">f'dfn_shap_mean_abs_<span class="hljs-subst">{STOCKCODE}</span>.json'</span>)
    shap_mean_abs_df.to_json(MEAN_ABS_SHAP_PATH, orient=<span class="hljs-string">'records'</span>, indent=<span class="hljs-number">4</span>)
</code></pre>
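<p>The script above does not show how <code>reports/dfn_shap_summary_${params.stockcode}.json</code> is written. Here is a minimal sketch of one way to build the records that the <code>x: shap_value</code> and <code>y: feature_name</code> plot configuration reads, assuming the file holds a flat list with one record per sample and feature. Note that <code>to_long_records</code> is a hypothetical helper and the values are made up:</p>

```python
import json


def to_long_records(shap_rows, feature_names):
    """Flatten a (samples x features) matrix into one record per sample and feature."""
    records = []
    for row in shap_rows:
        if len(row) != len(feature_names):
            raise ValueError("row length does not match the number of features")
        for name, value in zip(feature_names, row):
            records.append({"feature_name": name, "shap_value": float(value)})
    return records


# made-up SHAP values for two samples and two features
rows = [[0.21, -0.10], [-0.05, 0.08]]
names = ["num__invoicedate", "num__unitprice"]
records = to_long_records(rows, names)

print(json.dumps(records, indent=4))
```

<p>In the inference script, the same idea could be applied to <code>shap_values.tolist()</code> and <code>feature_names</code> before dumping the records to the summary file.</p>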
<h4 id="heading-outputs-4">Outputs</h4>
<p>This stage generates five output files:</p>
<ul>
<li><p><code>data/dfn_inference_results_${params.stockcode}.parquet</code>: Stores the predictions, the true target values, and any columns with sensitive features such as gender, age, or income. We’ll use this file for the fairness test in the last stage.</p>
</li>
<li><p><code>metrics/dfn_inf_${params.stockcode}.json</code>: Stores evaluation metrics and tuning results:</p>
</li>
</ul>
<pre><code class="lang-json">{
    <span class="hljs-attr">"stockcode"</span>: <span class="hljs-string">"85123A"</span>,
    <span class="hljs-attr">"mse_inf"</span>: <span class="hljs-number">0.6841545701026917</span>,
    <span class="hljs-attr">"mae_inf"</span>: <span class="hljs-number">11.5866117477417</span>,
    <span class="hljs-attr">"rmsle_inf"</span>: <span class="hljs-number">0.7423332333564758</span>,
    <span class="hljs-attr">"model_version"</span>: <span class="hljs-string">"dfn_85123A_35834"</span>,
    <span class="hljs-attr">"hparams"</span>: {
        <span class="hljs-attr">"num_layers"</span>: <span class="hljs-number">4</span>,
        <span class="hljs-attr">"batch_norm"</span>: <span class="hljs-literal">false</span>,
        <span class="hljs-attr">"dropout_rate_layer_0"</span>: <span class="hljs-number">0.13765888061300502</span>,
        <span class="hljs-attr">"n_units_layer_0"</span>: <span class="hljs-number">184</span>,
        <span class="hljs-attr">"dropout_rate_layer_1"</span>: <span class="hljs-number">0.5509872409359128</span>,
        <span class="hljs-attr">"n_units_layer_1"</span>: <span class="hljs-number">122</span>,
        <span class="hljs-attr">"dropout_rate_layer_2"</span>: <span class="hljs-number">0.2408753527744403</span>,
        <span class="hljs-attr">"n_units_layer_2"</span>: <span class="hljs-number">35</span>,
        <span class="hljs-attr">"dropout_rate_layer_3"</span>: <span class="hljs-number">0.03451842588822594</span>,
        <span class="hljs-attr">"n_units_layer_3"</span>: <span class="hljs-number">224</span>,
        <span class="hljs-attr">"learning_rate"</span>: <span class="hljs-number">0.026240673135104406</span>,
        <span class="hljs-attr">"optimizer"</span>: <span class="hljs-string">"adamax"</span>,
        <span class="hljs-attr">"batch_size"</span>: <span class="hljs-number">64</span>
    },
    <span class="hljs-attr">"optimizer"</span>: <span class="hljs-string">"adamax"</span>,
    <span class="hljs-attr">"batch_size"</span>: <span class="hljs-number">64</span>,
    <span class="hljs-attr">"lr"</span>: <span class="hljs-number">0.026240673135104406</span>,
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:31:12.946405"</span>
}
</code></pre>
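<p>One quick way to read these numbers is to compare them with the validation metrics from the tuning stage. Here is a minimal sketch using the values shown in this article. Note that <code>relative_gap</code> is a hypothetical helper, and what counts as an acceptable gap depends on the project:</p>

```python
def relative_gap(val_metric: float, inf_metric: float) -> float:
    """Relative change from the validation metric to the test-inference metric."""
    if val_metric == 0:
        raise ValueError("validation metric must be non-zero")
    return (inf_metric - val_metric) / val_metric


# values copied from the validation and inference metrics files shown in this article
val = {"mse": 0.6137686967849731, "rmsle": 0.6953299045562744}
inf = {"mse": 0.6841545701026917, "rmsle": 0.7423332333564758}

for name in val:
    print(f"{name}: {relative_gap(val[name], inf[name]):+.1%}")
```

<p>A positive gap means the error on the test set is higher than on the validation set, which is expected to some degree because the hyperparameters were selected against the validation data.</p>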
<ul>
<li><code>reports/dfn_shap_mean_abs_${params.stockcode}.json</code>: Stores the mean absolute SHAP value of each feature:</li>
</ul>
<pre><code class="lang-json">[
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__invoicedate"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.219255722</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__unitprice"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.1069829418</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__product_avg_quantity_last_month"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.1021453096</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__product_max_price_all_time"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.0855356899</span>
    },
...
]
</code></pre>
<ul>
<li><p><code>reports/dfn_shap_summary_${params.stockcode}.json</code>: Contains the data points needed to draw the beeswarm plot.</p>
</li>
<li><p><code>reports/dfn_raw_shap_values_${params.stockcode}.parquet</code>: Stores the raw SHAP values.</p>
</li>
</ul>
<h3 id="heading-stage-6-assessing-model-risk-and-fairness">Stage 6: Assessing Model Risk and Fairness</h3>
<p>The last stage assesses the risk and fairness of the final inference results.</p>
<h4 id="heading-the-fairness-testing">Fairness Testing</h4>
<p>Fairness testing in ML is the process of systematically evaluating a model’s predictions to ensure they are not unfairly biased toward specific groups defined by sensitive attributes like race and gender.</p>
<p>In this project, we’ll use the registration status column, <code>is_registered</code>, as the sensitive feature and check that the <strong>Mean Outcome Difference (MOD)</strong> stays within the specified threshold of <code>0.1</code>.</p>
<p>The MOD is calculated as the absolute difference between the mean prediction values of the privileged (registered) and unprivileged (unregistered) groups.</p>
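<p>As a minimal sketch of this calculation, the snippet below computes the MOD for a handful of made-up predictions and checks it against the <code>0.1</code> threshold. Note that <code>mean_outcome_difference</code> is a hypothetical helper written for illustration. It returns the signed difference, and the absolute value is taken only for the threshold check:</p>

```python
from statistics import mean


def mean_outcome_difference(predictions, is_privileged):
    """Signed MOD: mean prediction of the unprivileged group minus that of the privileged group."""
    privileged = [p for p, flag in zip(predictions, is_privileged) if flag]
    unprivileged = [p for p, flag in zip(predictions, is_privileged) if not flag]
    if not privileged or not unprivileged:
        raise ValueError("both groups need at least one prediction")
    return mean(unprivileged) - mean(privileged)


# made-up predictions for illustration only
y_pred = [2.0, 2.5, 3.0, 2.4, 2.6]
is_registered = [True, True, False, False, False]

mod = mean_outcome_difference(y_pred, is_registered)
is_mod_acceptable = abs(mod) <= 0.1  # threshold from the text

print(mod, is_mod_acceptable)
```

<p>Keeping the signed value makes it possible to see which group receives the higher average prediction, while the absolute value decides whether the result passes.</p>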
<h4 id="heading-dvc-configuration-5">DVC Configuration</h4>
<p>First, we’ll add the <code>assess_model_risk</code> stage right after the <code>inference_primary_model</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">inference_primary_model:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">assess_model_risk:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/assess_risk_and_fairness.py
      data/dfn_inference_results_${params.stockcode}.parquet
      metrics/dfn_risk_fairness_${params.stockcode}.json
      ${tracking.sensitive_feature_col}
      ${params.stockcode}
      ${tracking.privileged_group}
      ${tracking.mod_threshold}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/assess_risk_and_fairness.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/dfn_inference_results_${params.stockcode}.parquet</span> <span class="hljs-comment"># ensure the result df as dependency</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.sensitive_feature_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.privileged_group</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.mod_threshold</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_risk_fairness_${params.stockcode}.json:</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>
</code></pre>
<p>Then we’ll add default values to the parameters:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>

<span class="hljs-attr">tuning:</span>
  <span class="hljs-attr">n_trials:</span> <span class="hljs-number">100</span>
  <span class="hljs-attr">num_epochs:</span> <span class="hljs-number">3000</span>
  <span class="hljs-attr">should_local_save:</span> <span class="hljs-literal">False</span>
  <span class="hljs-attr">grid:</span> <span class="hljs-literal">False</span>

<span class="hljs-comment"># adding default values to the tracking metrics</span>
<span class="hljs-attr">tracking:</span>
  <span class="hljs-attr">sensitive_feature_col:</span> <span class="hljs-string">"is_registered"</span>
  <span class="hljs-attr">privileged_group:</span> <span class="hljs-number">1</span> <span class="hljs-comment"># member</span>
  <span class="hljs-attr">mod_threshold:</span> <span class="hljs-number">0.1</span>
</code></pre>
<h4 id="heading-python-script">Python Script</h4>
<p>The corresponding Python script contains the <code>calculate_fairness_metrics</code> function which performs the risk and fairness assessment:</p>
<p><code>src/model/torch_model/assess_risk_and_fairness.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_absolute_error, mean_squared_error, root_mean_squared_log_error

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">calculate_fairness_metrics</span>(<span class="hljs-params">
        df: pd.DataFrame,
        sensitive_feature_col: str,
        label_col: str = <span class="hljs-string">'y_true'</span>,
        prediction_col: str = <span class="hljs-string">'y_pred'</span>,
        privileged_group: int = <span class="hljs-number">1</span>,
        mod_threshold: float = <span class="hljs-number">0.1</span>,
    </span>) -&gt; dict:</span>

    metrics = dict()
    unprivileged_group = <span class="hljs-number">0</span> <span class="hljs-keyword">if</span> privileged_group == <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-number">1</span>

    <span class="hljs-comment">## 1. risk assessment - predictive performance metrics by group</span>
    <span class="hljs-keyword">for</span> group, name <span class="hljs-keyword">in</span> zip([unprivileged_group, privileged_group], [<span class="hljs-string">'unprivileged'</span>, <span class="hljs-string">'privileged'</span>]):
        subset = df[df[sensitive_feature_col] == group]
        <span class="hljs-keyword">if</span> len(subset) == <span class="hljs-number">0</span>: <span class="hljs-keyword">continue</span>

        y_true = subset[label_col].values
        y_pred = subset[prediction_col].values

        metrics[<span class="hljs-string">f'mse_<span class="hljs-subst">{name}</span>'</span>] = float(mean_squared_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>
        metrics[<span class="hljs-string">f'mae_<span class="hljs-subst">{name}</span>'</span>] = float(mean_absolute_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>
        metrics[<span class="hljs-string">f'rmsle_<span class="hljs-subst">{name}</span>'</span>] = float(root_mean_squared_log_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>

        <span class="hljs-comment"># mean prediction (outcome disparity component)</span>
        metrics[<span class="hljs-string">f'mean_prediction_<span class="hljs-subst">{name}</span>'</span>] = float(y_pred.mean()) <span class="hljs-comment"># type: ignore</span>

    <span class="hljs-comment">## 2. bias assessment - fairness metrics</span>
    <span class="hljs-comment"># mean absolute error (MAE) difference between groups</span>
    mae_diff = metrics.get(<span class="hljs-string">'mae_unprivileged'</span>, <span class="hljs-number">0</span>) - metrics.get(<span class="hljs-string">'mae_privileged'</span>, <span class="hljs-number">0</span>)
    metrics[<span class="hljs-string">'mae_diff'</span>] = float(mae_diff)

    <span class="hljs-comment"># mean outcome difference</span>
    mod = metrics.get(<span class="hljs-string">'mean_prediction_unprivileged'</span>, <span class="hljs-number">0</span>) - metrics.get(<span class="hljs-string">'mean_prediction_privileged'</span>, <span class="hljs-number">0</span>)
    metrics[<span class="hljs-string">'mean_outcome_difference'</span>] = float(mod)
    metrics[<span class="hljs-string">'is_mod_acceptable'</span>] = <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> abs(mod) &lt;= mod_threshold <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>

    <span class="hljs-keyword">return</span> metrics


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    parser = argparse.ArgumentParser(description=<span class="hljs-string">'assess bias and fairness metrics on model inference results.'</span>)
    parser.add_argument(<span class="hljs-string">'inference_file_path'</span>, type=str, help=<span class="hljs-string">'parquet file path to the inference results w/ y_true, y_pred, and sensitive feature cols.'</span>)
    parser.add_argument(<span class="hljs-string">'metrics_output_path'</span>, type=str, help=<span class="hljs-string">'json file path to save the metrics output.'</span>)
    parser.add_argument(<span class="hljs-string">'sensitive_feature_col'</span>, type=str, help=<span class="hljs-string">'column name of sensitive features'</span>)
    parser.add_argument(<span class="hljs-string">'stockcode'</span>, type=str)
    parser.add_argument(<span class="hljs-string">'privileged_group'</span>, type=int, nargs=<span class="hljs-string">'?'</span>, default=<span class="hljs-number">1</span>)
    parser.add_argument(<span class="hljs-string">'mod_threshold'</span>, type=float, nargs=<span class="hljs-string">'?'</span>, default=<span class="hljs-number">.1</span>)
    args = parser.parse_args()

    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># load inf df</span>
        df_inference = pd.read_parquet(args.inference_file_path)
        LABEL_COL = <span class="hljs-string">'y_true'</span>
        PREDICTION_COL = <span class="hljs-string">'y_pred'</span>
        SENSITIVE_COL = args.sensitive_feature_col

        <span class="hljs-comment"># compute fairness metrics</span>
        metrics = calculate_fairness_metrics(
            df=df_inference,
            sensitive_feature_col=SENSITIVE_COL,
            label_col=LABEL_COL,
            prediction_col=PREDICTION_COL,
            privileged_group=args.privileged_group,
            mod_threshold=args.mod_threshold,
        )

        <span class="hljs-comment"># add items to metrics</span>
        metrics[<span class="hljs-string">'model_version'</span>] = <span class="hljs-string">f'dfn_<span class="hljs-subst">{args.stockcode}</span>_<span class="hljs-subst">{os.getpid()}</span>'</span>
        metrics[<span class="hljs-string">'sensitive_feature'</span>] = args.sensitive_feature_col
        metrics[<span class="hljs-string">'privileged_group'</span>] = args.privileged_group
        metrics[<span class="hljs-string">'mod_threshold'</span>] = args.mod_threshold
        metrics[<span class="hljs-string">'stockcode'</span>] = args.stockcode
        metrics[<span class="hljs-string">'timestamp'</span>] = datetime.datetime.now().isoformat()

        <span class="hljs-comment"># load metrics (dvc track)</span>
        <span class="hljs-keyword">with</span> open(args.metrics_output_path, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
            json_metrics = { k: (v <span class="hljs-keyword">if</span> pd.notna(v) <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>) <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> metrics.items() }
            json.dump(json_metrics, f, indent=<span class="hljs-number">4</span>)

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        main_logger.error(<span class="hljs-string">f'... an error occurred during risk and fairness assessment: <span class="hljs-subst">{e}</span> ...'</span>)
        exit(<span class="hljs-number">1</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    main()
</code></pre>
<h4 id="heading-outputs-5">Outputs</h4>
<p>The final stage generates a metrics file which contains test results and model version:</p>
<p><code>metrics/dfn_risk_fairness.json</code>:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"mse_unprivileged"</span>: <span class="hljs-number">3.5370739412593575</span>,
    <span class="hljs-attr">"mae_unprivileged"</span>: <span class="hljs-number">1.48263614013523</span>,
    <span class="hljs-attr">"rmsle_unprivileged"</span>: <span class="hljs-number">0.6080000224747837</span>,
    <span class="hljs-attr">"mean_prediction_unprivileged"</span>: <span class="hljs-number">1.8507767915725708</span>,
    <span class="hljs-attr">"mae_diff"</span>: <span class="hljs-number">1.48263614013523</span>,
    <span class="hljs-attr">"mean_outcome_difference"</span>: <span class="hljs-number">1.8507767915725708</span>,
    <span class="hljs-attr">"is_mod_acceptable"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-attr">"model_version"</span>: <span class="hljs-string">"dfn_85123A_35971"</span>,
    <span class="hljs-attr">"sensitive_feature"</span>: <span class="hljs-string">"is_registered"</span>,
    <span class="hljs-attr">"privileged_group"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-attr">"mod_threshold"</span>: <span class="hljs-number">0.1</span>,
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:31:15.998590"</span>
}
</code></pre>
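<p>To make the mean outcome difference (MOD) check concrete, here is a minimal, standalone sketch with made-up numbers. It uses plain Python lists rather than the pipeline's DataFrame. The function name <code>mean_outcome_difference</code> and the sample predictions are illustrative assumptions, not part of the project code:</p>

```python
from statistics import fmean


def mean_outcome_difference(unprivileged_preds, privileged_preds):
    """Mean prediction of the unprivileged group minus that of the privileged group."""
    return fmean(unprivileged_preds) - fmean(privileged_preds)


# made-up predictions, for illustration only
unprivileged = [1.0, 2.0, 3.0]
privileged = [2.0, 2.5, 3.0]
mod_threshold = 0.1

mod = mean_outcome_difference(unprivileged, privileged)
# same rule as in the script above: acceptable only if |MOD| is within the threshold
is_mod_acceptable = 1 if abs(mod) <= mod_threshold else 0

print(f'mod={mod:.2f}, is_mod_acceptable={is_mod_acceptable}')
```

<p>With these sample values, the gap between the two group means is 0.5, which exceeds the 0.1 threshold, so the flag is set to 0.</p>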
<p>That’s all for the lineage configuration. Now, we’ll test it locally.</p>
<h3 id="heading-test-in-local">Test in Local</h3>
<p>We’ll run the entire ML lineage with this command:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> repro -f
</code></pre>
<p><code>-f</code> forces DVC to rerun all the stages, whether or not anything has changed.</p>
<p>The command will automatically create the <code>dvc.lock</code> file at the root of the project directory:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">schema:</span> <span class="hljs-string">'2.0'</span>
<span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline_full:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">python</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>
    <span class="hljs-attr">deps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">src/_utils/</span>
      <span class="hljs-attr">hash:</span> <span class="hljs-string">md5</span>
      <span class="hljs-attr">md5:</span> <span class="hljs-string">ae41392532188d290395495f6827ed00.dir</span>
      <span class="hljs-attr">size:</span> <span class="hljs-number">15870</span>
      <span class="hljs-attr">nfiles:</span> <span class="hljs-number">10</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-attr">hash:</span> <span class="hljs-string">md5</span>
      <span class="hljs-attr">md5:</span> <span class="hljs-string">a8a61a4b270581a7c387d51e416f4e86.dir</span>
      <span class="hljs-attr">size:</span> <span class="hljs-number">95715</span>
<span class="hljs-string">...</span>
</code></pre>
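<p>The <code>md5</code> entries above are content hashes of each dependency. As a rough illustration of the idea (not DVC's actual implementation, which also handles directories, outputs, and parameters), a stage can be treated as stale when a dependency's current hash no longer matches the recorded one. The helper names <code>file_md5</code> and <code>needs_rerun</code> are hypothetical:</p>

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rerun(path, recorded_md5):
    """A stage is stale when a dependency's current hash differs from the recorded one."""
    return file_md5(path) != recorded_md5


# demo with a temporary file standing in for a tracked dependency
with tempfile.TemporaryDirectory() as tmp:
    dep = os.path.join(tmp, 'dep.txt')
    with open(dep, 'w') as f:
        f.write('v1')
    recorded = file_md5(dep)  # what a lock file would store
    unchanged = needs_rerun(dep, recorded)

    with open(dep, 'w') as f:
        f.write('v2')  # edit the dependency
    changed = needs_rerun(dep, recorded)

print(unchanged, changed)
```

<p>This is also why the <code>-f</code> flag matters for a full test run: without it, stages whose recorded hashes still match are skipped.</p>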
<p>The <code>dvc.lock</code> file must be committed to Git so that DVC loads the latest versions of the tracked files:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$git</span> add dvc.lock .dvc dvc.yaml params.yaml
<span class="hljs-variable">$git</span> commit -m<span class="hljs-string">'updated dvc config'</span>
<span class="hljs-variable">$git</span> push
</code></pre>
<h2 id="heading-step-3-deploying-the-dvc-project">Step 3: Deploying the DVC Project</h2>
<p>Next, we’ll deploy the DVC project to ensure the AWS Lambda function can access the cached files in production.</p>
<p>We’ll start by configuring the DVC remote where the cached files are stored.</p>
<p>DVC offers <a target="_blank" href="https://dvc.org/doc/user-guide/data-management/remote-storage#supported-storage-types">various storage types</a> like AWS S3 and Google Cloud. We’ll use AWS S3 for this project, but your choice depends on the project ecosystem, your familiarity with the tool, and any resource constraints.</p>
<p>First, we’ll create a new S3 bucket in the selected AWS region:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$aws</span> s3 mb s3://&lt;PROJECT NAME&gt;/&lt;BUCKET NAME&gt;  --region &lt;AWS REGION&gt;
</code></pre>
<p>Make sure the IAM role has the following permissions: <code>s3:ListBucket</code>, <code>s3:GetObject</code>, <code>s3:PutObject</code>, and <code>s3:DeleteObject</code>.</p>
<p>Then, add the URI of the S3 bucket to the DVC remote:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> remote add -d &lt;DVC REMOTE NAME&gt; s3://&lt;PROJECT NAME&gt;/&lt;BUCKET NAME&gt;
</code></pre>
<p>Next, push the cache files to the DVC remote:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> push
</code></pre>
<p>Now, all cache files are stored in the S3 bucket:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/0*yl9N4P8LNI7d_G_z.png" alt="Figure C. Screenshot of the DVC remote in AWS S3 bucket" width="600" height="400" loading="lazy"></p>
<p><strong>Figure C.</strong> Screenshot of the DVC remote in AWS S3 bucket</p>
<p>As shown in <strong>Figure A,</strong> this deployment step is necessary for the AWS Lambda function to access the DVC cache in production.</p>
<h2 id="heading-step-4-configuring-scheduled-run-with-prefect"><strong>Step 4: Configuring Scheduled Run with Prefect</strong></h2>
<p>The next step is to configure the scheduled run of the entire lineage with Prefect.</p>
<p>Prefect is an open-source workflow orchestration tool for building, scheduling, and monitoring pipelines. It uses a concept called a work pool to effectively decouple the orchestration logic from the execution infrastructure.</p>
<p>The work pool then serves as a standardized base configuration: it runs flows from a Docker container image, which guarantees a consistent execution environment for all flows.</p>
<h3 id="heading-configuring-the-docker-image-registry">Configuring the Docker Image Registry</h3>
<p>The first step is to configure the Docker image registry for the Prefect work pool:</p>
<ul>
<li><p>For local deployment: <strong>A container registry on Docker Hub.</strong></p>
</li>
<li><p>For production deployment: <strong>AWS ECR</strong>.</p>
</li>
</ul>
<p>For local deployment, we’ll first authenticate the Docker client:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> login
</code></pre>
<p>Then grant the current user permission to run Docker commands without <code>sudo</code> (the <code>dscl</code> command below is specific to macOS):</p>
<pre><code class="lang-bash"><span class="hljs-variable">$sudo</span> dscl . -append /Groups/docker GroupMembership <span class="hljs-variable">$USER</span>
</code></pre>
<p>For production deployment, we’ll create a new ECR repository:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$aws</span> ecr create-repository --repository-name &lt;REPOSITORY NAME&gt; --region &lt;AWS REGION&gt;
</code></pre>
<p>(Make sure the IAM role has access to this new ECR URI.)</p>
<h3 id="heading-configure-prefect-tasks-and-flows">Configure Prefect Tasks and Flows</h3>
<p>Next, we’ll configure the Prefect <code>task</code> and <code>flow</code> in the project:</p>
<ul>
<li><p>The Prefect <code>task</code> executes the <code>dvc repro</code> and <code>dvc push</code> commands.</p>
</li>
<li><p>The Prefect <code>flow</code> executes the Prefect <code>task</code> weekly.</p>
</li>
</ul>
<p><code>src/prefect_flows.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> subprocess
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> timedelta, datetime
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> prefect <span class="hljs-keyword">import</span> flow, task
<span class="hljs-keyword">from</span> prefect.schedules <span class="hljs-keyword">import</span> Schedule
<span class="hljs-keyword">from</span> prefect_aws <span class="hljs-keyword">import</span> AwsCredentials

<span class="hljs-comment"># add project root to the python path before importing project modules - enabling prefect to find the script</span>
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), <span class="hljs-string">'..'</span>)))

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># define the prefect task</span>
<span class="hljs-meta">@task(retries=3, retry_delay_seconds=30)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_dvc_pipeline</span>():</span>
    <span class="hljs-comment"># execute the dvc pipeline </span>
    result = subprocess.run([<span class="hljs-string">"dvc"</span>, <span class="hljs-string">"repro"</span>], capture_output=<span class="hljs-literal">True</span>, text=<span class="hljs-literal">True</span>, check=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># push the updated data</span>
    subprocess.run([<span class="hljs-string">"dvc"</span>, <span class="hljs-string">"push"</span>], check=<span class="hljs-literal">True</span>)


<span class="hljs-comment"># define the prefect flow</span>
<span class="hljs-meta">@flow(name="Weekly Data Pipeline")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">weekly_data_flow</span>():</span>
    run_dvc_pipeline()

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># docker image registry (either docker hub or aws ecr)</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    ENV = os.getenv(<span class="hljs-string">'ENV'</span>, <span class="hljs-string">'production'</span>)
    DOCKER_HUB_REPO = os.getenv(<span class="hljs-string">'DOCKER_HUB_REPO'</span>)
    ECR_FOR_PREFECT_PATH = os.getenv(<span class="hljs-string">'S3_BUCKET_FOR_PREFECT_PATH'</span>)
    image_repo = <span class="hljs-string">f'<span class="hljs-subst">{DOCKER_HUB_REPO}</span>:ml-sales-pred-data-latest'</span> <span class="hljs-keyword">if</span> ENV == <span class="hljs-string">'local'</span> <span class="hljs-keyword">else</span> <span class="hljs-string">f'<span class="hljs-subst">{ECR_FOR_PREFECT_PATH}</span>:latest'</span>

    <span class="hljs-comment"># define weekly schedule</span>
    weekly_schedule = Schedule(
        interval=timedelta(weeks=<span class="hljs-number">1</span>),
        anchor_date=datetime(<span class="hljs-number">2025</span>, <span class="hljs-number">9</span>, <span class="hljs-number">29</span>, <span class="hljs-number">9</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>),
        active=<span class="hljs-literal">True</span>,
    )

    <span class="hljs-comment"># aws credentials to access ecr</span>
    AwsCredentials(
        aws_access_key_id=os.getenv(<span class="hljs-string">'AWS_ACCESS_KEY_ID'</span>),
        aws_secret_access_key=os.getenv(<span class="hljs-string">'AWS_SECRET_ACCESS_KEY'</span>),
        region_name=os.getenv(<span class="hljs-string">'AWS_REGION_NAME'</span>),
    ).save(<span class="hljs-string">'aws'</span>, overwrite=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># deploy the prefect flow</span>
    weekly_data_flow.deploy(
        name=<span class="hljs-string">'weekly-data-flow'</span>,
        schedule=weekly_schedule, <span class="hljs-comment"># schedule</span>
        work_pool_name=<span class="hljs-string">"wp-ml-sales-pred"</span>, <span class="hljs-comment"># work pool where the docker image (flow) runs</span>
        image=image_repo, <span class="hljs-comment"># create a docker image at docker hub (local) or ecr (production)</span>
        concurrency_limit=<span class="hljs-number">3</span>,
        push=<span class="hljs-literal">True</span> <span class="hljs-comment"># push the docker image to the image_repo</span>
    )
</code></pre>
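<p>An interval schedule with an anchor date fires at the anchor plus whole multiples of the interval. The following standalone sketch shows that arithmetic with the standard library only. The helper <code>next_run</code> is a hypothetical name and not part of the Prefect API, and time zones are ignored for simplicity:</p>

```python
from datetime import datetime, timedelta


def next_run(anchor, interval, now):
    """Return the first scheduled time strictly after `now`."""
    if now < anchor:
        return anchor
    elapsed_intervals = (now - anchor) // interval  # whole intervals elapsed
    return anchor + (elapsed_intervals + 1) * interval


anchor = datetime(2025, 9, 29, 9, 0, 0)  # same anchor as the deployment above
interval = timedelta(weeks=1)

upcoming = next_run(anchor, interval, now=datetime(2025, 10, 7, 0, 31))
print(upcoming.isoformat())
```

<p>Because the schedule is derived from the anchor, runs stay aligned to the same weekday and time regardless of when the deployment is created.</p>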
<h3 id="heading-test-in-local-1">Test in Local</h3>
<p>Next, we’ll test the workflow locally with the Prefect server:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$uv</span> run prefect server start

<span class="hljs-variable">$export</span> PREFECT_API_URL=<span class="hljs-string">"http://127.0.0.1:4200/api"</span>
</code></pre>
<p>Run the <code>prefect_flows.py</code> script:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$uv</span> run src/prefect_flows.py
</code></pre>
<p>Upon successful execution, the Prefect dashboard indicates that the workflow is scheduled to run:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*pUJppTJ4MloU2DVr.png" alt="Figure D. The screenshot of the Prefect dashboard" width="1260" height="586" loading="lazy"></p>
<p><strong>Figure D.</strong> A screenshot of the Prefect dashboard</p>
<h2 id="heading-step-5-deploying-the-application">Step 5: Deploying the Application</h2>
<p>The final step is to deploy the entire application as a containerized Lambda by configuring the <code>Dockerfile</code> and the Flask application scripts.</p>
<p>The specific process in this final deployment step depends on the infrastructure.</p>
<p>But the common point is that DVC eliminates the need to store the large Parquet or CSV files directly in the feature store or model store because it caches them as lightweight hashed files.</p>
<p>So, first, we’ll simplify the loading logic of the Flask application script by using the <code>dvc.api</code> module:</p>
<p><code>app.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment">### ... the rest components remain the same  ...</span>

<span class="hljs-keyword">import</span> dvc.api

DVC_REMOTE_NAME=&lt;REMOTE NAME IN .dvc/config file&gt;


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">configure_dvc_for_lambda</span>():</span>
    <span class="hljs-comment"># set dvc directories to /tmp</span>
    os.environ.update({
        <span class="hljs-string">'DVC_CACHE_DIR'</span>: <span class="hljs-string">'/tmp/dvc-cache'</span>,
        <span class="hljs-string">'DVC_DATA_DIR'</span>: <span class="hljs-string">'/tmp/dvc-data'</span>,
        <span class="hljs-string">'DVC_CONFIG_DIR'</span>: <span class="hljs-string">'/tmp/dvc-config'</span>,
        <span class="hljs-string">'DVC_GLOBAL_CONFIG_DIR'</span>: <span class="hljs-string">'/tmp/dvc-global-config'</span>,
        <span class="hljs-string">'DVC_SITE_CACHE_DIR'</span>: <span class="hljs-string">'/tmp/dvc-site-cache'</span>
    })
    <span class="hljs-keyword">for</span> dir_path <span class="hljs-keyword">in</span> [<span class="hljs-string">'/tmp/dvc-cache'</span>, <span class="hljs-string">'/tmp/dvc-data'</span>, <span class="hljs-string">'/tmp/dvc-config'</span>]:
        os.makedirs(dir_path, exist_ok=<span class="hljs-literal">True</span>)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_x_test</span>():</span>
    <span class="hljs-keyword">global</span> X_test
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.environ.get(<span class="hljs-string">'PYTEST_RUN'</span>, <span class="hljs-literal">False</span>):
        main_logger.info(<span class="hljs-string">"... loading x_test ..."</span>)

        <span class="hljs-comment"># config dvc directories</span>
        configure_dvc_for_lambda()
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">with</span> dvc.api.open(X_TEST_PATH, remote=DVC_REMOTE_NAME, mode=<span class="hljs-string">'rb'</span>) <span class="hljs-keyword">as</span> fd:
                X_test = pd.read_parquet(fd)
                main_logger.info(<span class="hljs-string">'✅ successfully loaded x_test via dvc api'</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f'❌ general loading error: <span class="hljs-subst">{e}</span>'</span>, exc_info=<span class="hljs-literal">True</span>)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_preprocessor</span>():</span>
    <span class="hljs-keyword">global</span> preprocessor
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.environ.get(<span class="hljs-string">'PYTEST_RUN'</span>, <span class="hljs-literal">False</span>):
        main_logger.info(<span class="hljs-string">"... loading preprocessor ..."</span>)
        configure_dvc_for_lambda()
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">with</span> dvc.api.open(PREPROCESSOR_PATH, remote=DVC_REMOTE_NAME, mode=<span class="hljs-string">'rb'</span>) <span class="hljs-keyword">as</span> fd:
                preprocessor = joblib.load(fd)
                main_logger.info(<span class="hljs-string">'✅ successfully loaded preprocessor via dvc api'</span>)

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f'❌ general loading error: <span class="hljs-subst">{e}</span>'</span>, exc_info=<span class="hljs-literal">True</span>)

<span class="hljs-comment">### ... the rest components remain the same  ...</span>
</code></pre>
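<p>One detail worth noting in the loaders above: <code>os.environ.get('PYTEST_RUN', False)</code> returns a string whenever the variable is set, so any non-empty value, including <code>"0"</code>, counts as true and skips loading. If you prefer explicit on/off values, a small helper along these lines could be used. The name <code>env_flag</code> and the accepted values are assumptions for illustration, not part of the project code:</p>

```python
import os


def env_flag(name, default=False):
    """Interpret an environment variable as a boolean flag."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {'1', 'true', 'yes', 'on'}


os.environ['PYTEST_RUN'] = '0'
raw_check = bool(os.environ.get('PYTEST_RUN', False))  # non-empty string is truthy
parsed_check = env_flag('PYTEST_RUN')

print(raw_check, parsed_check)
```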
<p>Then, update the Dockerfile to enable Docker to correctly reference the DVC components:</p>
<p><code>Dockerfile.lambda.production</code>:</p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># use the official AWS Lambda Python base image</span>
FROM public.ecr.aws/lambda/python:3.12

<span class="hljs-comment"># set environment variables (adding dvc related env variables)</span>
ENV JOBLIB_MULTIPROCESSING=<span class="hljs-number">0</span>
ENV DVC_HOME=<span class="hljs-string">"/tmp/.dvc"</span>
ENV DVC_CACHE_DIR=<span class="hljs-string">"/tmp/.dvc/cache"</span>
ENV DVC_REMOTE_NAME=<span class="hljs-string">"storage"</span>
ENV DVC_GLOBAL_SITE_CACHE_DIR=<span class="hljs-string">"/tmp/dvc_global"</span>

<span class="hljs-comment"># copy requirements file and install dependencies</span>
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir dvc dvc-s3

<span class="hljs-comment"># setup dvc</span>
RUN dvc init --no-scm
RUN dvc config core.no_scm true

<span class="hljs-comment"># copy the code to the lambda task root</span>
COPY . ${LAMBDA_TASK_ROOT}
CMD [ <span class="hljs-string">"app.handler"</span> ]
</code></pre>
<p>Lastly, ensure the large files are excluded from the Docker container image:</p>
<p><code>.dockerignore</code>:</p>
<pre><code class="lang-bash"><span class="hljs-comment">### ... the rest components remain the same  ...</span>

<span class="hljs-comment"># dvc cache contains large files</span>
.dvc/cache
.dvcignore

<span class="hljs-comment"># add all folders that DVC will track</span>
data/
preprocessors/
models/
reports/
metrics/
</code></pre>
<h3 id="heading-test-in-local-2">Test in Local</h3>
<p>Finally, we’ll build and test the Docker image:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> build -t my-app -f Dockerfile.lambda.local .
<span class="hljs-variable">$docker</span> run -p 5002:5002 -e ENV=<span class="hljs-built_in">local</span> my-app app.py
</code></pre>
<p>If everything is configured correctly, the Waitress server runs the Flask application.</p>
<p>After confirming the changes, push the code to Git:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$git</span> add .
<span class="hljs-variable">$git</span> commit -m<span class="hljs-string">'updated dockerfiles and flask app scripts'</span>
<span class="hljs-variable">$git</span> push
</code></pre>
<p>This <code>push</code> command triggers the CI/CD pipeline via GitHub Actions, which generates a Docker container image and pushes it to AWS ECR.</p>
<p>After the pipeline completes and we verify the result, we can manually run the deployment workflow in GitHub Actions.</p>
<p>And that’s it!</p>
<p>You can learn more here: <a target="_blank" href="https://medium.com/towards-artificial-intelligence/integrating-ci-cd-pipelines-to-machine-learning-applications-f5657c7fa164">Integrating the infrastructure CI/CD pipeline to an ML application</a></p>
<p>All code is available in <a target="_blank" href="https://github.com/krik8235/ml-sales-prediction">my GitHub repository</a>.</p>
<p>The mock app is available <a target="_blank" href="https://kuriko-iwai.vercel.app/online-commerce-intelligence-hub">here</a>.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Building robust ML applications requires comprehensive ML lineage to ensure reliability and traceability.</p>
<p>In this article, you learned how to build an ML lineage by integrating open-source services like DVC and Prefect.</p>
<p>In practice, initial planning matters. Specifically, defining how metrics are tracked, and at which stages, leads directly to a cleaner, more maintainable code structure that is easier to extend later.</p>
<p>Moving forward, we can consider adding more stages to the lineage and integrating advanced logic for data drift detection or fairness tests.</p>
<p>This will further ensure continued model performance and data integrity in the production environment.</p>
<p><strong>You can check out my</strong> <a target="_blank" href="https://kuriko-iwai.vercel.app/"><strong>Portfolio</strong></a> <strong>/</strong> <a target="_blank" href="https://github.com/krik8235"><strong>Github</strong></a><strong>.</strong></p>
<p><em>All images, unless otherwise noted, are by the author.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn MLOps by Creating a YouTube Sentiment Analyzer ]]>
                </title>
                <description>
                    <![CDATA[ If you’re serious about machine learning and want to break into real-world ML engineering, learning MLOps is one of the best things you can do. It’s what turns experiments into reliable systems. You can train a great model, but without the right pipe... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-mlops-by-creating-a-youtube-sentiment-analyzer/</link>
                <guid isPermaLink="false">684dd5eb387267af7a264b7a</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Sat, 14 Jun 2025 20:04:59 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1749822376754/96a5ebfc-e64d-4541-9fc7-b17bcc43db5a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’re serious about machine learning and want to break into real-world ML engineering, learning MLOps is one of the best things you can do. It’s what turns experiments into reliable systems. You can train a great model, but without the right pipeline to deploy, monitor, and update it, that model won’t be useful in a real application. MLOps is how companies ship machine learning at scale, and even if you're working solo or on smaller projects, knowing how to build proper ML pipelines will save you tons of time and headaches. Plus, it's one of the most in-demand skill sets in the industry right now.</p>
<p>We just released a new video course on the freeCodeCamp.org YouTube channel that teaches you how to build a full end-to-end MLOps pipeline by working on a real, practical project. You’ll create a system that analyzes the sentiment of YouTube comments in real time using a Chrome extension. This isn’t just another toy example. It’s a full production-ready pipeline that covers everything from data collection to deployment, and it uses real tools that are used in modern ML workflows: MLflow, DVC, Docker, AWS, Flask, and more. The course is taught by Bappy Ahmed, who walks through each step in a clear, practical way, so you actually understand what’s happening.</p>
<p>Here’s what the course covers:</p>
<ul>
<li><p><strong>Introduction &amp; Project Planning</strong> – Understand the problem and design the architecture of the full pipeline.</p>
</li>
<li><p><strong>Data Collection</strong> – Learn how to scrape YouTube comments and prepare the data you'll use to train the sentiment model.</p>
</li>
<li><p><strong>Data Preprocessing &amp; EDA</strong> – Clean the data, explore patterns, and prep it for training.</p>
</li>
<li><p><strong>Setup MLflow Server on AWS</strong> – Use MLflow to track experiments and manage your models.</p>
</li>
<li><p><strong>Building a Baseline Model</strong> – Start simple with a basic model to establish a performance benchmark.</p>
</li>
<li><p><strong>Improving the Model</strong> – Experiment with techniques like Bag of Words, TF-IDF, adjusting feature sizes, handling class imbalance, and hyperparameter tuning.</p>
</li>
<li><p><strong>Stacking Models</strong> – Use ensemble techniques to combine different models for better performance.</p>
</li>
<li><p><strong>Build a Full ML Pipeline Using DVC</strong> – Break your code into modular components (data ingestion, preprocessing, model building, etc.) and version everything using DVC.</p>
</li>
<li><p><strong>Model Evaluation and Registration with MLflow</strong> – Evaluate performance and keep track of the best models.</p>
</li>
<li><p><strong>Deploy with Flask and Docker</strong> – Wrap your model in a Flask API, containerize it with Docker, and prepare it for production.</p>
</li>
<li><p><strong>Create a Chrome Plugin</strong> – Build a browser extension that interacts with your deployed model in real time.</p>
</li>
<li><p><strong>CI/CD Deployment on AWS</strong> – Automate your deployment so updates happen smoothly and reliably.</p>
</li>
</ul>
<p>By the end, you’ll have a working, deployed MLOps project that shows you understand the full ML lifecycle. This course is perfect for anyone who already knows a bit of machine learning and wants to level up their engineering skills.</p>
<p>You can <a target="_blank" href="https://www.youtube.com/watch?v=gwNPV882tkc">watch the full course on the freeCodeCamp.org YouTube channel</a> (3-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/gwNPV882tkc" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Automate Compliance and Fraud Detection in Finance with MLOps ]]>
                </title>
                <description>
                    <![CDATA[ These days, businesses are under increasing pressure to comply with stringent regulations while also combating fraudulent activities. The high volume of data and the intricate requirements of real-time fraud detection and compliance reporting are fre... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/automate-compliance-and-fraud-detection-in-finance-with-mlops/</link>
                <guid isPermaLink="false">68222009a8daed5c1fbf1692</guid>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GCP ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #AIOps ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Mon, 12 May 2025 16:21:29 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747064311601/923284fd-8584-4ef3-8591-f717b9807148.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>These days, businesses are under increasing pressure to comply with stringent regulations while also combating fraudulent activities. The high volume of data and the intricate requirements of real-time fraud detection and compliance reporting are frequently a challenge for traditional systems to manage.</p>
<p>This is where MLOps (Machine Learning Operations) comes into play. It can help teams streamline these processes and elevate automation to the forefront of financial security and regulatory adherence.</p>
<p>In this article, we will investigate the potential of MLOps for automating compliance and fraud detection in the finance sector.</p>
<p>I’ll show you step by step how financial institutions can deploy a machine learning model for fraud detection and integrate it into their operations to ensure continuous monitoring and automated alerts for compliance. I’ll also demonstrate how to deploy this solution in a cloud-based environment using Google Colab, ensuring that it is both user-friendly and accessible, whether you are a beginner or more advanced.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-mlops">What is MLOps?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-youll-need">What You’ll Need</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-set-up-google-colab-and-prepare-the-data">Step 1: Set Up Google Colab and Prepare the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-data-preprocessing">Step 2: Data Preprocessing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-train-a-fraud-detection-model">Step 3: Train a Fraud Detection Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-retrain-the-model-with-new-data">Step 4: Retrain the Model with New Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-automated-alert-system">Step 5: Automated Alert System</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-visualize-model-performance">Step 6: Visualize Model Performance</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-takeaways">Key Takeaways</a></p>
</li>
</ul>
<h2 id="heading-what-is-mlops"><strong>What is MLOps?</strong></h2>
<p>Machine Learning Operations, or MLOps for short, is a methodology that integrates DevOps practices with Machine Learning (ML). It helps teams automate the whole machine learning model lifecycle, including development, training, deployment, monitoring, and maintenance.</p>
<p>MLOps has several main goals: continuous optimization, scalability, and the delivery of operational value over time.</p>
<p>The financial industry provides great use cases for MLOps processes and techniques, as these can help businesses manage complicated data pipelines, deploy models in real-time, and evaluate their performance – all while making sure they're compliant with regulations.</p>
<h3 id="heading-why-is-mlops-important-in-finance"><strong>Why is MLOps Important in Finance?</strong></h3>
<p>Financial institutions are subject to various rules including Anti-Money Laundering (AML), Know Your Customer (KYC), and Fraud Prevention Regulations – so they have to carefully manage private information. Ignoring these rules might result in severe fines and loss of reputation.</p>
<p>Detecting fraud in financial transactions also calls for advanced systems capable of real-time identification of suspicious activity.</p>
<p>MLOps can help to solve these issues in the following ways:</p>
<ul>
<li><p>MLOps lets financial institutions automatically monitor transactions for regulatory compliance, which helps them keep pace with changing legislation.</p>
</li>
<li><p>MLOps helps teams build and deploy machine learning models that can identify fraudulent transactions in real time.</p>
</li>
<li><p>MLOps automates repetitive workflows, so organizations can scale their fraud detection systems with as little human involvement as possible.</p>
</li>
</ul>
<h2 id="heading-what-youll-need"><strong>What You’ll Need:</strong></h2>
<p>To follow along with this tutorial, ensure that you have the following:</p>
<ol>
<li><p><strong>Python</strong> installed, along with basic ML libraries such as scikit-learn, Pandas, and NumPy.</p>
</li>
<li><p>A <strong>sample dataset</strong> of financial transactions, which we will use to train a fraud detection model (You can use this <a target="_blank" href="https://www.datacamp.com/datalab/datasets/dataset-r-credit-card-fraud">sample dataset</a> if you don’t have one on hand).</p>
</li>
<li><p><strong>Google Colab</strong> (for cloud-based execution), which is free to use and doesn't require installation.</p>
</li>
</ol>
<h2 id="heading-step-1-set-up-google-colab-and-prepare-the-data"><strong>Step 1: Set Up Google Colab and Prepare the Data</strong></h2>
<p>Google Colab is an ideal choice for beginners and advanced users alike, because it’s cloud-based and doesn’t require installation. To get started, follow these steps:</p>
<h3 id="heading-access-google-colab"><strong>Access Google Colab</strong>:</h3>
<p>Visit Google Colab and <a target="_blank" href="https://colab.research.google.com/">sign in</a> with your <strong>Google account</strong>.</p>
<h3 id="heading-create-a-new-notebook"><strong>Create a New Notebook</strong>:</h3>
<p>In the Colab interface, go to <strong>File</strong> and then select <strong>New Notebook</strong> to create a fresh notebook.</p>
<h3 id="heading-import-libraries-and-load-the-dataset"><strong>Import Libraries and Load the Dataset</strong></h3>
<p>Now, let’s import the necessary libraries and load our fraud detection dataset. We'll assume the dataset is available as a CSV file, and we'll upload it to Colab.</p>
<p><strong>Import libraries:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> classification_report, confusion_matrix
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
</code></pre>
<p><strong>Upload the Dataset</strong>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> google.colab <span class="hljs-keyword">import</span> files
uploaded = files.upload()

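<span class="hljs-comment"># Load the dataset into a pandas DataFrame</span>
<span class="hljs-comment"># (the file name must match the file you uploaded; 'data.csv' is a placeholder)</span>
data = pd.read_csv(<span class="hljs-string">'data.csv'</span>)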
print(data.head())
</code></pre>
<h2 id="heading-step-2-data-preprocessing"><strong>Step 2: Data Preprocessing</strong></h2>
<p>Data preprocessing is essential to prepare the dataset for model training. This involves handling missing values, encoding categorical variables, and normalizing numerical features.</p>
<h3 id="heading-why-is-preprocessing-important">Why is Preprocessing Important?</h3>
<p>Data preprocessing lets you take care of various data issues that could affect your results. During this process, you’ll:</p>
<ul>
<li><p><strong>Handle missing values</strong>: Financial datasets often have missing values. Filling in these missing values (for example, with the median) ensures that the model doesn’t encounter errors during training.</p>
</li>
<li><p><strong>Convert categorical data</strong>: Machine learning algorithms require numerical input, so categorical features (like transaction type or location) need to be converted into numeric format using one-hot encoding.</p>
</li>
<li><p><strong>Normalize data</strong>: Tree-based models like Random Forest are not sensitive to feature scaling, so this step is optional for the model used in this tutorial. Normalization still keeps the pipeline consistent, and it becomes critical if you later switch to a model that relies on gradient descent.</p>
</li>
</ul>
<p>Here’s an example:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Handle missing data by filling with the median value for each column</span>
data.fillna(data.median(), inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Convert categorical columns to numeric using one-hot encoding</span>
data = pd.get_dummies(data, drop_first=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Normalize numerical columns for scaling</span>
data[<span class="hljs-string">'normalized_amount'</span>] = (data[<span class="hljs-string">'Amount'</span>] - data[<span class="hljs-string">'Amount'</span>].mean()) / data[<span class="hljs-string">'Amount'</span>].std()

<span class="hljs-comment"># Separate features and target variable</span>
X = data.drop(columns=[<span class="hljs-string">'Class'</span>])
y = data[<span class="hljs-string">'Class'</span>]

<span class="hljs-comment"># Split data into training and testing sets (80% train, 20% test)</span>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)

print(<span class="hljs-string">"Data preprocessing completed."</span>)
</code></pre>
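<p>One caveat with the snippet above: it computes the mean and standard deviation of <code>Amount</code> on the full dataset before splitting, so the test rows influence the scaling. The sketch below shows one possible alternative, assuming the same <code>Amount</code> and <code>Class</code> column names. It uses a tiny made-up DataFrame instead of the real dataset, splits first (stratified, since fraud is usually rare), and then scales both splits with statistics taken from the training rows only.</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up dataset for illustration only; real data would come from the CSV
toy = pd.DataFrame({
    "Amount": [10.0, 25.0, 40.0, 5000.0, 15.0, 30.0, 7000.0, 20.0, 35.0, 12.0],
    "Class": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
})

X = toy.drop(columns=["Class"])
y = toy["Class"]

# stratify=y keeps the fraud ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()

# Compute scaling statistics on the training rows only
amount_mean = X_train["Amount"].mean()
amount_std = X_train["Amount"].std()

# Apply the same training statistics to both splits
X_train["normalized_amount"] = (X_train["Amount"] - amount_mean) / amount_std
X_test["normalized_amount"] = (X_test["Amount"] - amount_mean) / amount_std

print("Fraud rows in train:", y_train.sum(), "and in test:", y_test.sum())
```

<p>The same idea applies to the median used for filling missing values: computing it on the training split and reusing it for the test split keeps the evaluation honest.</p>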
<h2 id="heading-step-3-train-a-fraud-detection-model"><strong>Step 3: Train a Fraud Detection Model</strong></h2>
<p>We'll now train a <strong>RandomForestClassifier</strong> and evaluate its performance.</p>
<h3 id="heading-what-is-a-random-forest-classifier"><strong>What is a Random Forest Classifier?</strong></h3>
<p>A <strong>Random Forest</strong> is an ensemble learning method that builds a collection (forest) of decision trees, each trained on a bootstrap sample of the data, with a random subset of features considered at each split. It aggregates their predictions to improve accuracy and reduce overfitting.</p>
<p>This method is a popular choice for fraud detection because it can handle high-dimensional data. It’s also quite robust against overfitting.</p>
<p>Here’s how you can implement the Random Forest Classifier:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Initialize the Random Forest Classifier</span>
rf_model = RandomForestClassifier(n_estimators=<span class="hljs-number">150</span>, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># Train the model on the training data</span>
rf_model.fit(X_train, y_train)

<span class="hljs-comment"># Predict on the test data</span>
y_pred = rf_model.predict(X_test)

<span class="hljs-comment"># Evaluate model performance</span>
print(<span class="hljs-string">"Model Evaluation:\n"</span>, classification_report(y_test, y_pred))
print(<span class="hljs-string">"Confusion Matrix:\n"</span>, confusion_matrix(y_test, y_pred))

<span class="hljs-comment"># Plot confusion matrix for visual understanding</span>
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
cax = ax.matshow(cm, cmap=<span class="hljs-string">'Blues'</span>)
fig.colorbar(cax)
plt.title(<span class="hljs-string">"Confusion Matrix"</span>)
plt.xlabel(<span class="hljs-string">"Predicted"</span>)
plt.ylabel(<span class="hljs-string">"Actual"</span>)
plt.show()
</code></pre>
<p>How the model is evaluated:</p>
<ul>
<li><p><strong>Classification report</strong>: Shows metrics like precision, recall, and F1-score for the fraud and non-fraud classes.</p>
</li>
<li><p><strong>Confusion matrix</strong>: Helps visualize the performance of the model by showing the true positives, false positives, true negatives, and false negatives.</p>
</li>
</ul>
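<p>To make the link between the two outputs concrete, here is a minimal sketch that derives precision, recall, and F1 for the fraud class from the four confusion matrix counts. The counts are made up for illustration, and <code>metrics_from_confusion</code> is a hypothetical helper name, not part of scikit-learn. For binary labels, <code>confusion_matrix(y_test, y_pred).ravel()</code> returns the counts in the order tn, fp, fn, tp.</p>

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive precision, recall, and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 990 legitimate transactions passed, 5 false alarms,
# 2 missed frauds, 3 frauds caught
result = metrics_from_confusion(tn=990, fp=5, fn=2, tp=3)
print(result)
```

<p>With these made-up counts, accuracy is 99.3% even though only 3 of the 8 fraud alerts are correct (precision 0.375) and 2 of the 5 frauds are missed (recall 0.6). This is why accuracy alone is a poor guide on imbalanced fraud data.</p>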
<h2 id="heading-step-4-retrain-the-model-with-new-data"><strong>Step 4: Retrain the Model with New Data</strong></h2>
<p>Once you have trained your model, it’s important to retrain it periodically with new data to ensure that it continues to detect emerging fraud patterns.</p>
<h3 id="heading-what-is-retraining"><strong>What is Retraining?</strong></h3>
<p>Retraining means fitting the model again on a dataset that includes recent examples, so that it adapts to patterns it hasn’t seen before. In fraud detection this is crucial because fraud tactics evolve, and your model needs to stay up to date to recognize new patterns.</p>
<p>Here’s how you can do this:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Simulate loading new fraud data</span>
new_data = pd.read_csv(<span class="hljs-string">'new_fraud_data.csv'</span>)

<span class="hljs-comment"># Apply the same preprocessing steps as in Step 2 (fill missing values, encode, normalize)</span>
<span class="hljs-comment"># This assumes the new file uses the same column names as the original dataset ('Amount', 'Class')</span>
new_data.fillna(new_data.median(numeric_only=<span class="hljs-literal">True</span>), inplace=<span class="hljs-literal">True</span>)
new_data = pd.get_dummies(new_data, drop_first=<span class="hljs-literal">True</span>)
new_data[<span class="hljs-string">'normalized_amount'</span>] = (new_data[<span class="hljs-string">'Amount'</span>] - new_data[<span class="hljs-string">'Amount'</span>].mean()) / new_data[<span class="hljs-string">'Amount'</span>].std()

<span class="hljs-comment"># Separate features and target variable in the new data</span>
X_new = new_data.drop(columns=[<span class="hljs-string">'Class'</span>])
y_new = new_data[<span class="hljs-string">'Class'</span>]

<span class="hljs-comment"># Concatenate old and new data for retraining</span>
X_combined = pd.concat([X_train, X_new], axis=<span class="hljs-number">0</span>)
y_combined = pd.concat([y_train, y_new], axis=<span class="hljs-number">0</span>)

<span class="hljs-comment"># Retrain the model with the updated dataset</span>
rf_model.fit(X_combined, y_combined)

<span class="hljs-comment"># Re-evaluate the model</span>
y_pred_new = rf_model.predict(X_test)
print(<span class="hljs-string">"Updated Model Evaluation:\n"</span>, classification_report(y_test, y_pred_new))
</code></pre>
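<p>One practical detail when combining batches: <code>pd.get_dummies</code> creates columns from whatever categories appear in each file, so a new batch can end up with different columns than the training data. Concatenating mismatched frames silently introduces missing values. The sketch below shows one way to guard against that by reindexing the new features to the training columns. The column names <code>type_transfer</code> and <code>type_wire</code> are made up for illustration.</p>

```python
import pandas as pd

# Made-up batches: the new one lacks the "type_transfer" dummy column
# and carries an extra "type_wire" column the model was never trained on
X_train = pd.DataFrame({"Amount": [10.0, 250.0], "type_transfer": [0, 1]})
X_new = pd.DataFrame({"Amount": [99.0], "type_wire": [1]})

# Reindex to the training columns: missing columns are filled with 0,
# columns unknown to the model are dropped
X_new_aligned = X_new.reindex(columns=X_train.columns, fill_value=0)

X_combined = pd.concat([X_train, X_new_aligned], axis=0, ignore_index=True)
print(X_combined)
```

<p>Dropping unseen categories is a simplification. If a new category shows up often, that is usually a signal to rebuild the feature set and retrain from scratch rather than reindex.</p>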
<h2 id="heading-step-5-automated-alert-system"><strong>Step 5: Automated Alert System</strong></h2>
<p>To automate fraud alerts, we’ll send an email whenever a suspicious transaction is detected.</p>
<h3 id="heading-how-the-alert-system-works"><strong>How the Alert System Works</strong></h3>
<p>The email alert system uses <a target="_blank" href="https://www.freecodecamp.org/news/send-emails-in-python-using-mailtrap-smtp-and-the-email-api/"><strong>SMTP</strong> to send an email</a> whenever fraud is detected. When the model identifies a suspicious transaction, it triggers an automated alert to notify the compliance team for further investigation.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> smtplib
<span class="hljs-keyword">from</span> email.mime.text <span class="hljs-keyword">import</span> MIMEText
<span class="hljs-keyword">from</span> email.mime.multipart <span class="hljs-keyword">import</span> MIMEMultipart

<span class="hljs-comment"># Function to send an email alert</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">send_alert</span>(<span class="hljs-params">email_subject, email_body</span>):</span>
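    sender_email = <span class="hljs-string">"your_email@example.com"</span>
    receiver_email = <span class="hljs-string">"compliance_team@example.com"</span>
    <span class="hljs-comment"># Placeholder only: load real credentials from an environment variable or secret store</span>
    password = <span class="hljs-string">"your_password"</span>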

    msg = MIMEMultipart()
    msg[<span class="hljs-string">'From'</span>] = sender_email
    msg[<span class="hljs-string">'To'</span>] = receiver_email
    msg[<span class="hljs-string">'Subject'</span>] = email_subject

    msg.attach(MIMEText(email_body, <span class="hljs-string">'plain'</span>))

    <span class="hljs-comment"># Send email over SMTP with SSL; the context manager closes the connection even if an error occurs</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">with</span> smtplib.SMTP_SSL(<span class="hljs-string">'smtp.example.com'</span>, <span class="hljs-number">465</span>) <span class="hljs-keyword">as</span> server:
            server.login(sender_email, password)
            server.sendmail(sender_email, receiver_email, msg.as_string())
        print(<span class="hljs-string">"Fraud alert email sent successfully."</span>)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Failed to send email: <span class="hljs-subst">{str(e)}</span>"</span>)

<span class="hljs-comment"># Example: Check for fraud and trigger an alert</span>
suspicious_transaction_details = <span class="hljs-string">"Transaction ID: 12345, Amount: $5000, Suspicious Activity Detected."</span>
send_alert(<span class="hljs-string">"Fraud Detection Alert"</span>, <span class="hljs-string">f"A suspicious transaction has been detected: <span class="hljs-subst">{suspicious_transaction_details}</span>"</span>)
</code></pre>
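<p>The example above hard-codes the transaction details. To connect the alert to the model, you could build the message from the model’s fraud probabilities, for example those returned by <code>rf_model.predict_proba(X_test)[:, 1]</code>. The sketch below is one possible approach: <code>build_alert_body</code>, the 0.8 threshold, and the <code>id</code> and <code>amount</code> fields are all hypothetical and would need to match your own data.</p>

```python
def build_alert_body(transactions, fraud_probabilities, threshold=0.8):
    """Return an alert message for flagged transactions, or None if there are none."""
    flagged = [
        (tx, prob)
        for tx, prob in zip(transactions, fraud_probabilities)
        if prob >= threshold
    ]
    if not flagged:
        return None
    lines = [
        f"Transaction ID: {tx['id']}, Amount: ${tx['amount']:.2f}, "
        f"Fraud probability: {prob:.2f}"
        for tx, prob in flagged
    ]
    return "Suspicious transactions detected:\n" + "\n".join(lines)


# Made-up transactions and scores for illustration
transactions = [
    {"id": 12345, "amount": 5000.0},
    {"id": 12346, "amount": 42.5},
]
fraud_probabilities = [0.93, 0.04]

body = build_alert_body(transactions, fraud_probabilities)
if body is not None:
    # In the tutorial's flow, this text would be passed to send_alert(...)
    print(body)
```

<p>Batching all flagged transactions into one message keeps the compliance team from receiving a separate email for every suspicious row.</p>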
<h2 id="heading-step-6-visualize-model-performance"><strong>Step 6: Visualize Model Performance</strong></h2>
<p>Finally, we will visualize the performance of the model using an <strong>ROC curve</strong> (Receiver Operating Characteristic Curve), which helps evaluate the trade-off between the true positive rate and false positive rate.</p>
<p>Visualizing the performance of a machine learning model is an essential step in understanding how well the model is doing, especially when it comes to evaluating its ability to detect fraudulent transactions.</p>
<h3 id="heading-what-is-an-roc-curve"><strong>What is an ROC curve?</strong></h3>
<p>An ROC curve shows how well a model performs across all classification thresholds. It plots the True Positive Rate (TPR) versus the False Positive Rate (FPR). The area under the ROC curve (AUC) provides a summary measure of model performance.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> roc_curve, auc

<span class="hljs-comment"># Calculate ROC curve</span>
fpr, tpr, thresholds = roc_curve(y_test, rf_model.predict_proba(X_test)[:,<span class="hljs-number">1</span>])
roc_auc = auc(fpr, tpr)

<span class="hljs-comment"># Plot ROC curve</span>
plt.figure(figsize=(<span class="hljs-number">8</span>,<span class="hljs-number">6</span>))
plt.plot(fpr, tpr, color=<span class="hljs-string">'blue'</span>, label=<span class="hljs-string">f'ROC curve (area = <span class="hljs-subst">{roc_auc:<span class="hljs-number">.2</span>f}</span>)'</span>)
plt.plot([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>], [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>], color=<span class="hljs-string">'gray'</span>, linestyle=<span class="hljs-string">'--'</span>)
plt.xlim([<span class="hljs-number">0.0</span>, <span class="hljs-number">1.0</span>])
plt.ylim([<span class="hljs-number">0.0</span>, <span class="hljs-number">1.05</span>])
plt.xlabel(<span class="hljs-string">'False Positive Rate'</span>)
plt.ylabel(<span class="hljs-string">'True Positive Rate'</span>)
plt.title(<span class="hljs-string">'Receiver Operating Characteristic (ROC) Curve'</span>)
plt.legend(loc=<span class="hljs-string">'lower right'</span>)
plt.show()
</code></pre>
<p>The ROC curve gives us a comprehensive picture of how well our model distinguishes between the two classes across various thresholds. By evaluating this curve, we can decide how to tune the model’s decision threshold to balance catching fraud (true positives) against raising false alarms (false positives).</p>
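<p>As a rough illustration of threshold tuning, the sketch below picks the threshold that maximizes Youden’s J statistic (TPR minus FPR). This is one common heuristic, not the only option, and the labels and scores are made up. In the tutorial’s flow, the scores would be <code>rf_model.predict_proba(X_test)[:, 1]</code>.</p>

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and fraud scores for illustration
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.95])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J statistic: the threshold with the largest gap between TPR and FPR
j_scores = tpr - fpr
best_index = int(np.argmax(j_scores))
best_threshold = thresholds[best_index]

print(
    f"Threshold: {best_threshold:.2f}, "
    f"TPR: {tpr[best_index]:.2f}, FPR: {fpr[best_index]:.2f}"
)

# Apply the chosen threshold instead of the default 0.5 cut-off
y_pred_tuned = (y_score >= best_threshold).astype(int)
```

<p>Youden’s J treats false positives and false negatives as equally costly. In fraud detection a missed fraud is often more expensive than a false alarm, so you may prefer a lower threshold than this heuristic suggests.</p>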
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>By following this guide, you’ve learned how to leverage MLOps to automate fraud detection and ensure compliance in the financial industry using Google Colab. This cloud-based environment makes it easy to work with machine learning models without the hassle of local setups or configurations.</p>
<p>From automating data preprocessing to deploying models in production, MLOps offers an end-to-end solution that improves efficiency, scalability, and accuracy in detecting fraudulent activities.</p>
<p>By integrating real-time monitoring and continuous updates, financial institutions can stay ahead of fraud threats while ensuring regulatory compliance with minimal manual effort.</p>
<h2 id="heading-key-takeaways"><strong>Key Takeaways</strong></h2>
<ul>
<li><p>MLOps automates the whole machine learning model lifecycle by integrating machine learning with DevOps.</p>
</li>
<li><p>It simplifies regulatory compliance and fraud detection, letting banks spot fraudulent transactions automatically.</p>
</li>
<li><p>It keeps fraud detection systems current through continuous monitoring and model retraining on fresh data.</p>
</li>
<li><p>You can develop and test machine learning models on Google Colab, a free cloud-based platform that provides access to GPUs and TPUs. No local installation is required.</p>
</li>
<li><p>Automated workflows can detect suspicious behavior and send out alerts in real time.</p>
</li>
<li><p>Continuous integration/continuous delivery pipelines support ongoing improvement by automating the testing and deployment of new fraud detection models.</p>
</li>
<li><p>Financial organizations may save money using MLOps because cloud-based systems like Google Colab lower infrastructure expenses.</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Make IT Operations More Efficient with AIOps: Build Smarter, Faster Systems ]]>
                </title>
                <description>
                    <![CDATA[ In the rapidly evolving IT landscape, development teams have to operate at their best and manage complex systems while minimizing downtime. And having to do many routine tasks manually can really slow down operations and reduce efficiency. These days... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/make-it-operations-more-efficient-with-aiops/</link>
                <guid isPermaLink="false">681e7192df44ab8496bca883</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #AIOps ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ IT ]]>
                    </category>
                
                    <category>
                        <![CDATA[ IT Operations ]]>
                    </category>
                
                    <category>
                        <![CDATA[ infrastructure ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Fri, 09 May 2025 21:20:18 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746825359981/5587ade8-875d-4623-b3f5-708109b34672.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In the rapidly evolving IT landscape, development teams have to operate at their best and manage complex systems while minimizing downtime. And having to do many routine tasks manually can really slow down operations and reduce efficiency.</p>
<p>These days, we can use artificial intelligence to manage and enhance IT operations. This is where AIOps comes into play.</p>
<p>AIOps is changing IT operations by letting teams build better, faster systems that can find and resolve problems on their own. It also helps them make the best use of resources and scale with fewer problems.</p>
<p>In this tutorial, you’ll learn about the key components of AIOps, how they interact with other IT systems, and how you can apply AIOps to improve the efficiency of your environment.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-aiops">What is AIOps?</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-the-significance-of-aiops-for-it-operations">The Significance of AIOps for IT Operations</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-aiops-can-help-address-these-challenges-by">AIOps can help address these challenges by</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-getting-started-with-aiops">Getting Started with AIOps</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-1-choose-an-aiops-tool">1. Choose an AIOps Tool</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-2-implement-aiops-in-your-it-environment">2. Implement AIOps in Your IT Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-3-leverage-machine-learning-for-anomaly-detection">3. Leverage Machine Learning for Anomaly Detection</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-4-automate-root-cause-analysis">4. Automate Root Cause Analysis</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-5-set-up-automated-responses-using-webhooks">5. Set Up Automated Responses Using Webhooks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-6-automate-system-cleanup-with-ansible-sample-playbook">6. Automate system cleanup with Ansible (sample playbook)</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-use-case-aiops-in-cloud-infrastructure-and-incident-management">Real-World Use Case: AIOps in Cloud Infrastructure and Incident Management</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-challenges">Challenges:</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-aiops-implementation">AIOps implementation:</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-setting-up-monitoring-with-prometheus">Step 1: Setting Up Monitoring with Prometheus</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-collecting-system-data-cpu-usage">Step 2: Collecting System Data (CPU Usage)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-anomaly-detection-with-machine-learning">Step 3: Anomaly Detection with Machine Learning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-automating-incident-response-with-aws-lambda">Step 4: Automating Incident Response with AWS Lambda</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-proactive-resource-scaling-with-predictive-analytics">Step 5: Proactive Resource Scaling with Predictive Analytics</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-what-is-aiops"><strong>What is AIOps?</strong></h2>
<p>AIOps is <strong>artificial intelligence for IT operations</strong>. It means using artificial intelligence and machine learning to enhance and streamline IT tasks.</p>
<p>AIOps systems use machine learning methods to examine the vast volumes of data generated by IT systems, such as logs and metrics. The main objective of AIOps is to enable companies to identify and resolve IT issues more quickly and effectively.</p>
<p>Key components of AIOps include:</p>
<ol>
<li><p><strong>Anomaly detection</strong>: the process of spotting unusual patterns in a system's operation that might indicate a problem.</p>
</li>
<li><p><strong>Event correlation</strong>: the process of examining data from several sources to determine how events relate to one another and to help explain why issues arise.</p>
</li>
<li><p><strong>Automated response:</strong> acting to resolve issues without human assistance.</p>
</li>
</ol>
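<p>The first component, anomaly detection, can be illustrated with a very simple statistical rule. The sketch below flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. Real AIOps tools use more sophisticated models. The function name <code>find_anomalies</code>, the threshold of 2.0, and the CPU samples are all made up for illustration.</p>

```python
from statistics import mean, stdev


def find_anomalies(values, z_threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > z_threshold
    ]


# Made-up CPU usage samples (percent); the spike at index 6 is the outlier
cpu_usage = [21, 23, 22, 24, 20, 22, 95, 23, 21, 22]
print(find_anomalies(cpu_usage))
```

<p>A global z-score works for a short, stable series like this one. For metrics with daily cycles or trends, a rolling window or a seasonal baseline is usually a better fit.</p>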
<h3 id="heading-the-significance-of-aiops-for-it-operations"><strong>The Significance of AIOps for IT Operations</strong></h3>
<p>The rise of hybrid and multi-cloud platforms, microservices architectures, and rapidly scaling systems is complicating IT operations. Conventional IT management tools often can’t keep up with the size and speed of the systems that we need to monitor and maintain.</p>
<p>Here are some issues that often come up in standard IT operations:</p>
<ol>
<li><p><strong>Manual troubleshooting</strong>: IT teams sometimes have to comb through logs and reports by hand to identify the root of issues.</p>
</li>
<li><p><strong>Long resolution times</strong>: The longer it takes to resolve a problem after discovery, the more downtime there is and the more dissatisfied users become.</p>
</li>
<li><p><strong>Scalability</strong>: Monitoring all system components becomes harder as systems grow, because manual approaches require more and more labor.</p>
</li>
</ol>
<h3 id="heading-aiops-can-help-address-these-challenges-by">AIOps can help address these challenges by</h3>
<ul>
<li><p><strong>Improving incident resolution times</strong>: By correlating events and providing actionable insights, AIOps can resolve problems in real-time.</p>
</li>
<li><p><strong>Scaling effortlessly</strong>: AIOps can handle large volumes of data and events without additional resources, making it ideal for scaling operations.</p>
</li>
<li><p><strong>Automating incident detection and response</strong>: AI models can detect issues and automatically resolve them, reducing manual intervention.</p>
</li>
</ul>
<p>You can better understand AIOps by looking at its main components:</p>
<h4 id="heading-1-machine-learning-for-predictive-analytics">1. Machine Learning for Predictive Analytics</h4>
<p>AIOps tools use machine learning to examine historical data and forecast future events. Predictive analytics, for example, can warn teams when a system's performance is likely to decline, letting them address the issue before it worsens.</p>
<h4 id="heading-2-automating-and-self-healing">2. Automating and Self-Healing</h4>
<p>AIOps lets your team automate routine tasks, such as restarting services or reallocating resources, reducing the need for human intervention. This lowers operating costs and shortens the time it takes to resolve problems.</p>
<h4 id="heading-3-event-correlation-and-root-cause-analysis">3. Event Correlation and Root Cause Analysis</h4>
<p>Event correlation is the technique of linking events from several related systems to identify the root cause of the problem. For instance, AIOps will examine server, network, and application logs to determine what’s wrong – whether it’s a network problem or a web application failure – and correct it.</p>
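<p>As a rough illustration of event correlation, the sketch below groups events from different sources that fall into the same short time window. The event tuples, the <code>correlate_events</code> helper, and the 60-second window are assumptions made for this example, not part of any specific AIOps product.</p>

```python
from collections import defaultdict

def correlate_events(events, window_seconds=60):
    """Group events into fixed time windows and keep windows that involve several sources."""
    buckets = defaultdict(list)
    for timestamp, source, message in events:
        # Events whose timestamps fall in the same window share a bucket key
        buckets[timestamp // window_seconds].append((source, message))

    correlated = {}
    for bucket, items in buckets.items():
        sources = {source for source, _ in items}
        # A window touched by more than one source hints at a shared root cause
        if len(sources) > 1:
            correlated[bucket * window_seconds] = items
    return correlated

# Hypothetical events: (unix timestamp, source, message)
events = [
    (1000, "network", "packet loss on eth0"),
    (1010, "application", "HTTP 502 from checkout service"),
    (1025, "server", "connection pool exhausted"),
    (5000, "application", "slow query warning"),
]

for window_start, items in correlate_events(events).items():
    print(window_start, items)
```

<p>Real platforms use far richer signals (topology, service dependencies, learned patterns), but time proximity across sources is a common starting point.</p>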
<h2 id="heading-getting-started-with-aiops">Getting Started with AIOps</h2>
<p>Enhancing your team’s IT operations with AIOps means adding AI-driven tools and procedures to your existing setup. These are the most important steps to start with:</p>
<h3 id="heading-1-choose-an-aiops-tool"><strong>1. Choose an AIOps Tool</strong></h3>
<p>There are several AIOps platforms available, each with its own set of features. Some popular AIOps tools include:</p>
<ul>
<li><p><strong>Moogsoft</strong>: An AIOps platform that uses machine learning for event correlation, anomaly detection, and incident management.</p>
</li>
<li><p><strong>BigPanda</strong>: Focuses on automating incident management and root cause analysis.</p>
</li>
<li><p><strong>Splunk IT Service Intelligence</strong>: Offers advanced analytics for monitoring and managing IT infrastructure.</p>
</li>
</ul>
<p>When selecting an AIOps tool, consider the following:</p>
<ul>
<li><p><strong>Integration with existing tools</strong>: Ensure the platform integrates with your current monitoring, logging, and alerting systems.</p>
</li>
<li><p><strong>Scalability</strong>: The platform should be able to handle large volumes of data and scale with your organization.</p>
</li>
<li><p><strong>Ease of use</strong>: Look for a user-friendly interface and automation capabilities to minimize manual intervention.</p>
</li>
</ul>
<h3 id="heading-2-implement-aiops-in-your-it-environment"><strong>2. Implement AIOps in Your IT Environment</strong></h3>
<p>These are the steps you’ll need to take to integrate AIOps into your IT operations:</p>
<ul>
<li><p><strong>Data aggregation</strong>: Collect data from various sources, including servers, network devices, cloud infrastructure, and applications, and consolidate it all onto one platform.</p>
</li>
<li><p><strong>Determine thresholds and KPIs</strong>: Identify the key performance indicators that matter most to your company, such as error rates, system uptime, and response times.</p>
</li>
<li><p><strong>Establish alerts and automation</strong>: Configure automatic responses for when thresholds are crossed, for instance restarting services or allocating more resources.</p>
</li>
</ul>
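<p>The threshold step above can be sketched in a few lines of Python. The KPI names, the threshold values, and the <code>find_breaches</code> helper are all hypothetical and only show the shape of such a check.</p>

```python
# Hypothetical KPI thresholds; tune these to your own service-level targets
THRESHOLDS = {
    "error_rate": 0.05,        # fraction of failed requests
    "response_time_ms": 500,   # average response time in milliseconds
    "cpu_usage": 0.80,         # fraction of CPU capacity in use
}

def find_breaches(metrics, thresholds=THRESHOLDS):
    """Return the KPIs whose current value exceeds the configured threshold."""
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    }

current_metrics = {"error_rate": 0.02, "response_time_ms": 730, "cpu_usage": 0.91}
breaches = find_breaches(current_metrics)

for name, value in breaches.items():
    # In a real setup, this is where an alert or remediation job would be triggered
    print(f"{name} breached its threshold: {value}")
```

<p>Keeping thresholds in one dictionary makes them easy to review and adjust as you learn what normal looks like for your systems.</p>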
<h3 id="heading-3-leverage-machine-learning-for-anomaly-detection"><strong>3. Leverage Machine Learning for Anomaly Detection</strong></h3>
<p>Machine learning models play a central role in anomaly detection. They learn from historical data and flag patterns that deviate from normal behavior, which lets IT teams catch issues early, before they escalate.</p>
<p><strong>Example</strong>: A machine learning model may detect a spike in CPU usage that is unusual for a particular time of day, triggering an alert or automatic remediation process, such as scaling the application to add more resources.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> IsolationForest
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-comment"># Example dataset (e.g., CPU usage or network traffic over time)</span>
data = np.array([<span class="hljs-number">50</span>, <span class="hljs-number">51</span>, <span class="hljs-number">52</span>, <span class="hljs-number">53</span>, <span class="hljs-number">200</span>, <span class="hljs-number">55</span>, <span class="hljs-number">56</span>, <span class="hljs-number">57</span>, <span class="hljs-number">58</span>, <span class="hljs-number">60</span>]).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)

<span class="hljs-comment"># Initialize Isolation Forest model for anomaly detection</span>
model = IsolationForest(contamination=<span class="hljs-number">0.1</span>)  <span class="hljs-comment"># 10% outliers</span>
model.fit(data)

<span class="hljs-comment"># Predict anomalies: -1 indicates anomaly, 1 indicates normal</span>
predictions = model.predict(data)

<span class="hljs-comment"># Plotting the results</span>
plt.plot(data, label=<span class="hljs-string">"System Metric"</span>)
plt.scatter(np.arange(len(data)), data, c=predictions, cmap=<span class="hljs-string">"coolwarm"</span>, label=<span class="hljs-string">"Anomalies"</span>)
plt.title(<span class="hljs-string">"Anomaly Detection in System Metric"</span>)
plt.legend()
plt.show()
</code></pre>
<h3 id="heading-4-automate-root-cause-analysis"><strong>4. Automate Root Cause Analysis</strong></h3>
<p>AIOps platforms can automatically correlate data from various sources to identify the root cause of incidents. For instance, if an application is experiencing high response times, AIOps can check the server logs, network status, and database performance to determine if the issue is due to a server failure, database bottleneck, or network congestion.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">import</span> splunklib.client <span class="hljs-keyword">as</span> client
<span class="hljs-keyword">import</span> splunklib.results <span class="hljs-keyword">as</span> results

<span class="hljs-comment"># Connect and log in to the Splunk server (replace with actual credentials)</span>
service = client.connect(
    host=<span class="hljs-string">'localhost'</span>,
    port=<span class="hljs-number">8089</span>,
    username=<span class="hljs-string">'admin'</span>,
    password=<span class="hljs-string">'password'</span>
)

<span class="hljs-comment"># Perform a search query to find events related to system issues</span>
search_query = <span class="hljs-string">'search index=main "error" OR "fail" | stats count by sourcetype'</span>

<span class="hljs-comment"># Run the search</span>
job = service.jobs.create(search_query)

<span class="hljs-comment"># Wait for the search job to complete</span>
<span class="hljs-keyword">while</span> <span class="hljs-keyword">not</span> job.is_done():
    print(<span class="hljs-string">"Waiting for results..."</span>)
    time.sleep(<span class="hljs-number">2</span>)

<span class="hljs-comment"># Retrieve the results as JSON and process them</span>
<span class="hljs-keyword">for</span> result <span class="hljs-keyword">in</span> results.JSONResultsReader(job.results(output_mode=<span class="hljs-string">'json'</span>)):
    print(result)
</code></pre>
<h3 id="heading-5-set-up-automated-responses-using-webhooks"><strong>5. Set Up Automated Responses Using Webhooks</strong></h3>
<p>In AIOps, automated incident response is triggered through Webhooks or other messaging systems. For example, when an anomaly is detected, a Webhook can notify a team or initiate a resolution process.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests

<span class="hljs-comment"># Simulate an anomaly detection system that triggers when an anomaly is found</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">send_alert_to_webhook</span>(<span class="hljs-params">anomaly_detected</span>):</span>
    webhook_url = <span class="hljs-string">'https://your-webhook-url.com'</span>
    payload = {
        <span class="hljs-string">"text"</span>: <span class="hljs-string">"Alert: Anomaly detected! Please review the system metrics immediately."</span>
    }

    <span class="hljs-keyword">if</span> anomaly_detected:
        response = requests.post(webhook_url, json=payload)
        print(<span class="hljs-string">"Alert sent to webhook"</span>)
        <span class="hljs-keyword">return</span> response.status_code
    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># Simulate anomaly detection</span>
anomaly_detected = <span class="hljs-literal">True</span>  <span class="hljs-comment"># Set to True when an anomaly is found</span>

<span class="hljs-comment"># Trigger automated response (alert)</span>
status_code = send_alert_to_webhook(anomaly_detected)

<span class="hljs-keyword">if</span> status_code == <span class="hljs-number">200</span>:
    print(<span class="hljs-string">"Webhook triggered successfully"</span>)
<span class="hljs-keyword">else</span>:
    print(<span class="hljs-string">"Failed to trigger webhook"</span>)
</code></pre>
<h3 id="heading-6-automate-system-cleanup-with-ansible-sample-playbook"><strong>6. Automate system cleanup with Ansible (sample playbook)</strong></h3>
<p>Automatic remediation is a major component of AIOps: it resolves issues without human intervention. Here is a sample Ansible playbook that restarts a service when a system metric exceeds a particular threshold.</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Automated</span> <span class="hljs-string">Remediation</span> <span class="hljs-string">for</span> <span class="hljs-string">High</span> <span class="hljs-string">CPU</span> <span class="hljs-string">Usage</span>
  <span class="hljs-attr">hosts:</span> <span class="hljs-string">all</span>
  <span class="hljs-attr">become:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">tasks:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Check</span> <span class="hljs-string">CPU</span> <span class="hljs-string">Usage</span>
      <span class="hljs-attr">shell:</span> <span class="hljs-string">"vmstat 1 2 | tail -1 | awk '{print 100 - $15}'"</span>
      <span class="hljs-attr">register:</span> <span class="hljs-string">cpu_load</span>
      <span class="hljs-attr">changed_when:</span> <span class="hljs-literal">false</span>

    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Restart</span> <span class="hljs-string">service</span> <span class="hljs-string">if</span> <span class="hljs-string">CPU</span> <span class="hljs-string">load</span> <span class="hljs-string">is</span> <span class="hljs-string">high</span>
      <span class="hljs-attr">service:</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">"your-service-name"</span>
        <span class="hljs-attr">state:</span> <span class="hljs-string">restarted</span>
      <span class="hljs-attr">when:</span> <span class="hljs-string">cpu_load.stdout</span> <span class="hljs-string">|</span> <span class="hljs-string">float</span> <span class="hljs-string">&gt;</span> <span class="hljs-number">80.0</span>
</code></pre>
<h2 id="heading-real-world-use-case-aiops-in-cloud-infrastructure-and-incident-management"><strong>Real-World Use Case: AIOps in Cloud Infrastructure and Incident Management</strong></h2>
<p>Imagine a large-scale e-commerce company that operates in the cloud, hosting its infrastructure on AWS. The company’s platform is supported by hundreds of virtual machines (VMs), microservices, databases, and web servers.</p>
<p>As the company grows, so do the complexities of its IT operations, especially in managing system health, uptime, and performance. The company has a traditional monitoring setup in place using basic cloud-native tools. But as the platform scales, the sheer volume of data (logs, metrics, alerts) overwhelms the IT team, leading to delays in identifying the root cause of issues and resolving them in real time.</p>
<h3 id="heading-challenges"><strong>Challenges:</strong></h3>
<ul>
<li><p><strong>Incident overload</strong>: With hundreds of alerts coming in daily, the team struggled to prioritize critical incidents, which led to slower resolution times.</p>
</li>
<li><p><strong>Manual processes</strong>: Identifying the root cause of issues required manual sifting through logs, which was time-consuming and error-prone.</p>
</li>
<li><p><strong>Scalability issues</strong>: As the company scaled its infrastructure, manual intervention became increasingly inefficient, and the system could not dynamically respond to issues without human input.</p>
</li>
</ul>
<h3 id="heading-aiops-implementation"><strong>AIOps implementation</strong>:</h3>
<p>The company decided to implement an AIOps platform to streamline incident management, automate responses, and predict issues before they occurred.</p>
<h3 id="heading-step-1-setting-up-monitoring-with-prometheus"><strong>Step 1: Setting Up Monitoring with Prometheus</strong></h3>
<p>First, we need to monitor system performance to collect metrics such as CPU usage and memory consumption. We’ll use Prometheus, an open-source monitoring tool, to collect this data.</p>
<h4 id="heading-install-prometheus">Install Prometheus:</h4>
<p>First, download and install Prometheus:</p>
<pre><code class="lang-bash">wget https://github.com/prometheus/prometheus/releases/download/v2.27.1/prometheus-2.27.1.linux-amd64.tar.gz
tar -xvzf prometheus-2.27.1.linux-amd64.tar.gz
<span class="hljs-built_in">cd</span> prometheus-2.27.1.linux-amd64/
./prometheus
</code></pre>
<p>Then install Node Exporter (to collect system metrics):</p>
<pre><code class="lang-bash">wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
tar -xvzf node_exporter-1.1.2.linux-amd64.tar.gz
<span class="hljs-built_in">cd</span> node_exporter-1.1.2.linux-amd64/
./node_exporter
</code></pre>
<p>Next, configure Prometheus to scrape metrics from Node Exporter:</p>
<pre><code class="lang-yaml"><span class="hljs-comment">##Edit prometheus.yml to scrape metrics from the Node Exporter:</span>
<span class="hljs-attr">scrape_configs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">job_name:</span> <span class="hljs-string">'node'</span>
    <span class="hljs-attr">static_configs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">targets:</span> [<span class="hljs-string">'localhost:9100'</span>]
</code></pre>
<p>And start Prometheus:</p>
<pre><code class="lang-bash">./prometheus --config.file=prometheus.yml
</code></pre>
<p>You can now access Prometheus via <a target="_blank" href="http://localhost:9090">http://localhost:9090</a> to verify that it's collecting metrics.</p>
<h3 id="heading-step-2-collecting-system-data-cpu-usage"><strong>Step 2: Collecting System Data (CPU Usage)</strong></h3>
<p>Now that we have Prometheus collecting metrics, we need to extract CPU usage data (which will be the focus of our anomaly detection) from Prometheus.</p>
<h4 id="heading-querying-prometheus-api-for-cpu-usage">Querying Prometheus API for CPU Usage</h4>
<p>We’ll use Python to query Prometheus and retrieve CPU usage data (for example, using the node_cpu_seconds_total metric). We’ll fetch the data for the last 30 minutes.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta

<span class="hljs-comment"># Define the Prometheus URL and the query</span>
prom_url = <span class="hljs-string">"http://localhost:9090/api/v1/query_range"</span>
query = <span class="hljs-string">'rate(node_cpu_seconds_total{mode="user"}[1m])'</span>

<span class="hljs-comment"># Define the start and end times</span>
end_time = datetime.now()
start_time = end_time - timedelta(minutes=<span class="hljs-number">30</span>)

<span class="hljs-comment"># Make the request to Prometheus API</span>
response = requests.get(prom_url, params={
    <span class="hljs-string">'query'</span>: query,
    <span class="hljs-string">'start'</span>: start_time.timestamp(),
    <span class="hljs-string">'end'</span>: end_time.timestamp(),
    <span class="hljs-string">'step'</span>: <span class="hljs-number">60</span>
})

data = response.json()[<span class="hljs-string">'data'</span>][<span class="hljs-string">'result'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'values'</span>]
timestamps = [item[<span class="hljs-number">0</span>] <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> data]
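<span class="hljs-comment"># Prometheus returns sample values as strings, so convert them to floats</span>
cpu_usage = [float(item[<span class="hljs-number">1</span>]) <span class="hljs-keyword">for</span> item <span class="hljs-keyword">in</span> data]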

<span class="hljs-comment"># Create a DataFrame for easier processing</span>
df = pd.DataFrame({
    <span class="hljs-string">'timestamp'</span>: pd.to_datetime(timestamps, unit=<span class="hljs-string">'s'</span>),
    <span class="hljs-string">'cpu_usage'</span>: cpu_usage
})

print(df.head())
</code></pre>
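<p>The query above returns one time series per CPU core, and the snippet reads only the first one. As a minimal sketch of how the whole response could be used, the helper below averages all series for each timestamp. The <code>average_cpu_by_timestamp</code> function and the sample payload are assumptions for illustration. The payload only mimics the shape of a <code>query_range</code> response.</p>

```python
def average_cpu_by_timestamp(payload):
    """Average the per-series values of a query_range response for each timestamp."""
    totals = {}
    counts = {}
    for series in payload["data"]["result"]:
        for timestamp, value in series["values"]:
            # Prometheus encodes sample values as strings, so convert them to floats
            totals[timestamp] = totals.get(timestamp, 0.0) + float(value)
            counts[timestamp] = counts.get(timestamp, 0) + 1
    return {ts: totals[ts] / counts[ts] for ts in sorted(totals)}

# Hypothetical response with two CPU cores and two samples each
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"cpu": "0", "mode": "user"},
             "values": [[1700000000, "0.25"], [1700000060, "0.75"]]},
            {"metric": {"cpu": "1", "mode": "user"},
             "values": [[1700000000, "0.75"], [1700000060, "0.25"]]},
        ],
    },
}

print(average_cpu_by_timestamp(sample_payload))
```

<p>With a live server, you would pass <code>response.json()</code> to the helper instead of the sample payload.</p>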
<h3 id="heading-step-3-anomaly-detection-with-machine-learning"><strong>Step 3: Anomaly Detection with Machine Learning</strong></h3>
<p>To detect anomalies in CPU usage, we’ll use Isolation Forest, a machine learning algorithm from Scikit-learn.</p>
<h4 id="heading-train-an-anomaly-detection-model">Train an Anomaly Detection Model:</h4>
<p>First, install Scikit-learn and Matplotlib:</p>
<pre><code class="lang-bash">pip install scikit-learn matplotlib
</code></pre>
<p>Then you’ll need to train the model using the CPU usage data we collected:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> IsolationForest
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-comment"># Prepare the data for anomaly detection (CPU usage data)</span>
cpu_usage_data = df[<span class="hljs-string">'cpu_usage'</span>].values.reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)

<span class="hljs-comment"># Train the Isolation Forest model (anomaly detection)</span>
model = IsolationForest(contamination=<span class="hljs-number">0.05</span>)  <span class="hljs-comment"># 5% expected anomalies</span>
model.fit(cpu_usage_data)

<span class="hljs-comment"># Predict anomalies (1 = normal, -1 = anomaly)</span>
predictions = model.predict(cpu_usage_data)

<span class="hljs-comment"># Add predictions to the DataFrame</span>
df[<span class="hljs-string">'anomaly'</span>] = predictions

<span class="hljs-comment"># Visualize the anomalies</span>
plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">6</span>))
plt.plot(df[<span class="hljs-string">'timestamp'</span>], df[<span class="hljs-string">'cpu_usage'</span>], label=<span class="hljs-string">'CPU Usage'</span>)
plt.scatter(df[<span class="hljs-string">'timestamp'</span>][df[<span class="hljs-string">'anomaly'</span>] == <span class="hljs-number">-1</span>], df[<span class="hljs-string">'cpu_usage'</span>][df[<span class="hljs-string">'anomaly'</span>] == <span class="hljs-number">-1</span>], color=<span class="hljs-string">'red'</span>, label=<span class="hljs-string">'Anomaly'</span>)
plt.title(<span class="hljs-string">"CPU Usage with Anomalies"</span>)
plt.xlabel(<span class="hljs-string">"Time"</span>)
plt.ylabel(<span class="hljs-string">"CPU Usage (fraction of one core)"</span>)
plt.legend()
plt.show()
</code></pre>
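<p>To hand a detection over to an automated response, the newest anomalous reading can be packaged as a small JSON payload. The sketch below is one possible way to do that. The <code>build_alert_event</code> helper is hypothetical, and plain lists stand in for the <code>cpu_usage</code> and <code>anomaly</code> columns of the DataFrame.</p>

```python
import json

def build_alert_event(cpu_usage, predictions):
    """Return a JSON payload for the most recent anomalous reading, or None."""
    # Walk backwards so the newest anomaly is reported first
    for value, label in zip(reversed(cpu_usage), reversed(predictions)):
        if label == -1:  # Isolation Forest marks anomalies with -1
            return json.dumps({"cpu_usage": float(value)})
    return None

# Stand-ins for df['cpu_usage'].tolist() and df['anomaly'].tolist()
cpu_usage = [0.21, 0.23, 0.95, 0.22]
predictions = [1, 1, -1, 1]

print(build_alert_event(cpu_usage, predictions))
```

<p>The payload uses a <code>cpu_usage</code> key so that whatever consumes it, such as a webhook or a serverless function, can read the value directly.</p>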
<h3 id="heading-step-4-automating-incident-response-with-aws-lambda"><strong>Step 4: Automating Incident Response with AWS Lambda</strong></h3>
<p>When an anomaly is detected (for example, high CPU usage), AIOps can automatically trigger a response, such as scaling up resources.</p>
<h4 id="heading-aws-lambda-for-automated-scaling">AWS Lambda for Automated Scaling</h4>
<p>Here’s an example of how to use AWS Lambda to scale up EC2 instances when CPU usage exceeds a threshold.</p>
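<p>First, create an AWS Lambda function that moves an EC2 instance to a larger instance type when CPU usage exceeds 80%. Keep in mind that EC2 only allows you to change the instance type while the instance is stopped.</p>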
<pre><code class="lang-python"><span class="hljs-keyword">import</span> boto3

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">lambda_handler</span>(<span class="hljs-params">event, context</span>):</span>
    ec2 = boto3.client(<span class="hljs-string">'ec2'</span>)
    instance_id = <span class="hljs-string">'i-1234567890'</span>  <span class="hljs-comment"># Replace with your EC2 instance ID</span>
    message = <span class="hljs-string">f'Instance <span class="hljs-subst">{instance_id}</span> left unchanged.'</span>

    <span class="hljs-comment"># If CPU usage exceeds threshold, scale up EC2 instance</span>
    <span class="hljs-keyword">if</span> event[<span class="hljs-string">'cpu_usage'</span>] &gt; <span class="hljs-number">0.8</span>:  <span class="hljs-comment"># 80% CPU usage</span>
        <span class="hljs-comment"># The instance type can only be changed while the instance is stopped</span>
        ec2.stop_instances(InstanceIds=[instance_id])
        ec2.get_waiter(<span class="hljs-string">'instance_stopped'</span>).wait(InstanceIds=[instance_id])
        ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={<span class="hljs-string">'Value'</span>: <span class="hljs-string">'t2.large'</span>})
        ec2.start_instances(InstanceIds=[instance_id])
        message = <span class="hljs-string">f'Instance <span class="hljs-subst">{instance_id}</span> scaled up due to high CPU usage.'</span>

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'statusCode'</span>: <span class="hljs-number">200</span>,
        <span class="hljs-string">'body'</span>: message
    }
</code></pre>
<p>Then you’ll need to trigger the Lambda function. Set up AWS CloudWatch Alarms to monitor the output from the anomaly detection and trigger the Lambda function when CPU usage exceeds the threshold.</p>
<h3 id="heading-step-5-proactive-resource-scaling-with-predictive-analytics"><strong>Step 5: Proactive Resource Scaling with Predictive Analytics</strong></h3>
<p>Finally, using predictive analytics, AIOps can forecast future resource usage and proactively scale resources before problems arise.</p>
<h4 id="heading-predictive-scaling">Predictive Scaling:</h4>
<p>We’ll use a linear regression model to predict future CPU usage and trigger scaling events proactively.</p>
<p>Start by training a predictive model:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LinearRegression
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Historical data (CPU usage trends)</span>
data = pd.DataFrame({
    <span class="hljs-string">'timestamp'</span>: pd.date_range(start=<span class="hljs-string">"2023-01-01"</span>, periods=<span class="hljs-number">100</span>, freq=<span class="hljs-string">'H'</span>),
    <span class="hljs-string">'cpu_usage'</span>: np.random.normal(<span class="hljs-number">50</span>, <span class="hljs-number">10</span>, <span class="hljs-number">100</span>)  <span class="hljs-comment"># Simulated data</span>
})

X = np.array(range(len(data))).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)  <span class="hljs-comment"># Time steps</span>
y = data[<span class="hljs-string">'cpu_usage'</span>]

model = LinearRegression()
model.fit(X, y)

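<span class="hljs-comment"># Predict CPU usage 10 hours after the last observation (index len(data) - 1)</span>
future_prediction = model.predict([[len(data) - <span class="hljs-number">1</span> + <span class="hljs-number">10</span>]])
print(<span class="hljs-string">"Predicted CPU usage:"</span>, future_prediction[<span class="hljs-number">0</span>])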
</code></pre>
<p>If the predicted CPU usage exceeds a threshold, AIOps can trigger auto-scaling using AWS Lambda or Kubernetes.</p>
<h4 id="heading-results">Results:</h4>
<ul>
<li><p><strong>Reduced incident resolution time</strong>: The average time to resolve incidents dropped from hours to minutes because AIOps helped the team identify issues faster.</p>
</li>
<li><p><strong>Reduced false positives</strong>: By using anomaly detection, the system significantly reduced the number of false alerts.</p>
</li>
<li><p><strong>Increased automation</strong>: With automated responses in place, the system dynamically adjusted resources in real time, reducing the need for manual intervention.</p>
</li>
<li><p><strong>Proactive issue management</strong>: Predictive analytics enabled the team to address potential problems before they became critical, preventing performance degradation.</p>
</li>
</ul>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>AIOps transforms IT operations, enabling companies to build more efficient, responsive, and reliable systems. By automating routine tasks, identifying issues before they worsen, and providing real-time data, AIOps is changing the role of IT teams.</p>
<p>AIOps can be an effective way to improve system performance, reduce downtime, and streamline your IT procedures. You can start small and gradually add more functionality. As you do, you’ll see how AIOps makes your IT environment more efficient and frees your team to focus on new ideas.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
