Machine Learning - freeCodeCamp.org

Product Experimentation with Synthetic Control: Causal Inference for Global LLM Rollouts in Python

Rudrendu Paul — Tue, 12 May 2026 04:55:04 +0000

Every product experimentation team doing causal inference on LLM-based features eventually hits the same wall: when the provider ships a new model version, there's no holdout.

Your infrastructure team upgrades every workspace from Claude 4.5 to Claude 4.6 overnight. All 50 production workspaces get the new model at the same time. A week later, task completion climbs across the board. The head of product calls it a win.

But you know something's off. No holdout group ran 4.5 through the upgrade week. The naïve before/after picks up whatever else changed that week alongside the model: a new onboarding flow, a seasonal uptick, a high-profile customer onboarding.

This is the Global Rollout Problem. It appears whenever a team ships a model upgrade to the entire user base simultaneously. For product teams running generative AI features, it's one of the most common measurement traps in the stack. Staged rollouts buy you a control group; global rollouts eliminate it.

In 2026, global model upgrades are the norm: every API provider pushes new versions, and every team using Claude, GPT, or Gemini has experienced the sudden jump from one version to the next with no opt-out.

Synthetic control is the tool that data scientists use when the control group is missing. You build a weighted combination of untreated units (other workspaces or regions that weren't upgraded at the same time) whose pre-upgrade behavior matches that of the treated unit. Compare the treated unit to its synthetic twin after the upgrade, and the gap is the causal estimate, conditional on three identification assumptions that we'll name explicitly.

In this tutorial, you'll build a synthetic control from scratch in Python using scipy.optimize, apply it to a 50,000-user synthetic SaaS dataset, and validate with a placebo permutation test, leave-one-out donor sensitivity, and a cluster bootstrap 95% confidence interval.

Companion code: every code block runs end-to-end in the companion notebook at github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/04_synthetic_control. The notebook (synthetic_control_demo.ipynb) has all outputs pre-executed, so you can read along on GitHub before running anything locally.

Why Global Rollouts Break Naïve Measurement
What Synthetic Control Actually Does
Prerequisites
Setting Up the Working Example
Step 1: Fit Donor Weights with SLSQP
Step 2: Plot Treated vs Synthetic Control Trajectories
Step 3: In-Space Placebo Permutation Test
Step 4: Leave-One-Out Donor Sensitivity
Step 5: Cluster Bootstrap 95% Confidence Intervals
When Synthetic Control Fails
What to Do Next

Why Global Rollouts Break Naïve Measurement

The math of an A/B test is elegant because of one assumption: treatment assignment is independent of everything else. Flip a coin: half your workspaces get Claude 4.6, and half stay on 4.5. The coin flip breaks every possible confound. The global rollout world has no coin.

Three mechanisms make the naive before/after misleading.

Co-occurring product changes: Shipping a model upgrade rarely happens in isolation. The same week, the onboarding team ships a redesigned tutorial, the pricing team runs a promotion, or customer success reaches out to enterprise accounts about the new capabilities. Your before/after picks up the sum.
Seasonal and market drift: Weekly usage patterns, monthly billing cycles, and quarterly procurement cycles all move outcome metrics. A 3 pp lift in week 20 can look like the model upgrade when, in fact, users simply returned from spring break.
Peer-company dynamics: A competitor releases a buggy update, and your users migrate over for a week. Your task completion rate spikes because the new users had easier queries, with zero contribution from the model itself.

All three produce the same symptom: a raw before/after that folds the upgrade's causal effect together with the causal effect of every other week-20 event.
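To make the folding concrete, here is a minimal sketch with made-up numbers (the baseline, the shock sizes, and the helper name `naive_gap` are illustrative assumptions, not values from the tutorial's dataset). It shows that a before/after difference returns the sum of every week-20 change, and that offsetting shocks can make it land on the true effect by accident.

```python
import numpy as np

weeks = np.arange(30)
post = weeks >= 20                      # upgrade ships at week 20

def naive_gap(true_effect, other_shocks):
    """Before/after difference when other week-20 events also move the metric."""
    # Hypothetical flat baseline of 0.59; every shock switches on at week 20.
    outcome = 0.59 + post * (true_effect + sum(other_shocks))
    return outcome[post].mean() - outcome[~post].mean()

# Model effect +5 pp, plus an onboarding redesign (+2 pp) and a promotion (+1 pp)
gap_contaminated = naive_gap(0.05, [0.02, 0.01])
# Same model effect, but the two other shocks happen to cancel out
gap_lucky = naive_gap(0.05, [0.02, -0.02])

print(f"Contaminated naive gap: {gap_contaminated:+.4f}")  # +0.0800
print(f"Lucky naive gap:        {gap_lucky:+.4f}")         # +0.0500
```

Both runs look equally plausible from the outside, which is why the before/after number alone can't be trusted.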

In this tutorial's dataset, the naïve gap is +0.0494 (the treated post-period mean of 0.6421 minus the pre-period mean of 0.5927), nearly equal to the ground-truth +0.05. That coincidence is the scariest failure mode: the naïve number sometimes lands correctly by accident, and without a counterfactual, you can't tell luck from truth.

What Synthetic Control Actually Does

Figure 1 (above): Schematic of the synthetic control construction. The gray curves are donor workspaces that remain on the old model. The dashed navy curve is the weighted combination of donors that best tracks the treated unit (red) during the pre-treatment window marked by the blue bracket below the x-axis.

After the treatment date (week 20, dotted vertical line), the weights stay frozen, and the dashed curve projects forward as the counterfactual, while the treated unit moves upward. The gap between the two curves in the post-treatment window is the causal-effect estimate.

The key design choice the figure illustrates is that weights are fit once, using only pre-treatment data, and never refit using post-treatment data.

Synthetic control finds a weighted combination of untreated units whose outcome trajectory closely matches the treated unit's in the pre-treatment period. Once the weights are fixed, you project the synthetic unit's trajectory forward into the post-treatment period and read off the gap between the two lines.

In your AI product context: if wave-2 workspaces didn't get the model upgrade at the same time as wave-1 workspaces, each wave-2 workspace is a candidate donor. The optimizer finds the combination of wave-2 workspaces whose weighted pre-upgrade trajectory best matches wave 1's. After week 20 (when wave 1 was upgraded), the gap between wave 1 and its synthetic twin is the causal-effect estimate, provided that the following three identification assumptions hold.

These identification assumptions work together.

First, pre-period fit (the convex-hull condition): the treated unit's pre-treatment trajectory must lie inside the convex hull of the donor trajectories. The non-negativity and sum-to-1 constraints confine the synthetic unit to that hull, so a treated unit outside it can't be matched.
Second, no interference for donors (SUTVA for the donor pool): the treatment on the treated unit must not affect the donors. Shared API rate-limit pools or users migrating between workspaces both break this.
Third, stable donor composition: the donors must not experience structural breaks unrelated to the treatment during the post-period.

Violate any one, and the gap is biased even when the pre-period fit looks perfect. The failure modes section walks through each.
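A quick sanity check for the first assumption can be sketched as follows. It tests only a necessary condition: in every pre-period week, the treated value should sit between the lowest and highest donor value. Passing it does not prove the treated trajectory lies inside the convex hull. The function name and the toy numbers are assumptions for illustration.

```python
import numpy as np

def weeks_outside_donor_range(treated_pre, donors_pre):
    """Return pre-period week indices where the treated value falls outside
    the [min, max] envelope of the donors (rows = weeks, columns = donors)."""
    lo = donors_pre.min(axis=1)
    hi = donors_pre.max(axis=1)
    return np.where((treated_pre < lo) | (treated_pre > hi))[0]

# Toy pre-period panel: 4 weeks, 3 donors
donors_pre = np.array([[0.50, 0.60, 0.70],
                       [0.52, 0.61, 0.69],
                       [0.51, 0.62, 0.71],
                       [0.53, 0.60, 0.70]])

inside = np.array([0.58, 0.60, 0.61, 0.59])    # always within the envelope
outside = np.array([0.58, 0.75, 0.61, 0.45])   # weeks 1 and 3 escape it

print(weeks_outside_donor_range(inside, donors_pre))   # []
print(weeks_outside_donor_range(outside, donors_pre))  # [1 3]
```

Any flagged week is one that no convex combination of donors can reproduce, so the pre-period fit will be off there by construction.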

One geometric note: with T₀ pre-treatment periods and J donors, pre-period overfitting becomes serious when J approaches T₀. This tutorial runs with T₀ = 20 and J = 25, which sits in the danger zone. The LOO sensitivity step later is the right diagnostic for whether the fit reflects genuine comparability or overfitting.

Prerequisites

You'll need Python 3.11 or newer, comfort with pandas and numpy, and familiarity with basic constrained optimization.

Install the packages for this tutorial:

pip install numpy pandas scipy matplotlib

Here's what's happening: four packages cover the full pipeline. Pandas loads the user-level log, NumPy handles panel arithmetic, SciPy provides the SLSQP solver to enforce the convex-combination constraint on the donor weights, and matplotlib renders the trajectory plot and the placebo distribution.

Clone the companion repo to get the synthetic dataset:

git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv

Here's what's happening: the clone pulls the companion repo, and generate_data.py produces the shared synthetic dataset used across the series. Seed 42 keeps the dataset reproducible, and 50,000 users give a clean signal for the estimator in this tutorial. The output CSV lands at data/synthetic_llm_logs.csv.

Setting Up the Working Example

The synthetic dataset simulates a SaaS product with 50,000 users spread across 50 workspaces. Workspaces 0 through 24 are in wave 1, which received the model upgrade at week 20. Workspaces 25 through 49 are in wave 2, which stayed on the old model through week 29.

The ground-truth causal effect baked into the data generator is a +5 percentage-point increase in task completion for wave-1 users in the post-treatment period. You know the truth, so you can check what the synthetic control recovers.

Load the data and aggregate to a workspace-by-week panel:

import numpy as np
import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")

PRE = 20         # weeks 0-19 are pre-treatment
WINDOW = 30      # analysis window weeks 0-29

df_window = df[df.signup_week < WINDOW].copy()

panel = (
    df_window.groupby(["workspace_id", "signup_week"])
    ["task_completed"].mean().reset_index()
)
panel.columns = ["workspace_id", "week", "task_completed"]

pivot = panel.pivot(
    index="week", columns="workspace_id", values="task_completed"
)
pivot = pivot.interpolate(method="linear", axis=0).ffill().bfill()

ws_wave = df.groupby("workspace_id").wave.first()
wave1_ws = sorted(ws_wave[ws_wave == 1].index.tolist())
wave2_ws = sorted(ws_wave[ws_wave == 2].index.tolist())

treated_series = pivot[wave1_ws].mean(axis=1).values
donor_matrix = pivot[wave2_ws].values

print(f"Treated series shape: {treated_series.shape}")
print(f"Donor matrix shape:   {donor_matrix.shape}")
print(f"Users per workspace-week: ~{len(df_window) / (50 * WINDOW):.1f}")
print(f"Pre-period treated mean  (weeks 0-19):  {treated_series[:PRE].mean():.4f}")
print(f"Post-period treated mean (weeks 20-29): {treated_series[PRE:].mean():.4f}")

Expected output:

Treated series shape: (30,)
Donor matrix shape:   (30, 25)
Users per workspace-week: ~19.2
Pre-period treated mean  (weeks 0-19):  0.5927
Post-period treated mean (weeks 20-29): 0.6421

Here's what's happening: you restrict to the 30-week window, aggregate user rows to a workspace-by-week panel, and reshape so rows are weeks and columns are workspaces. Interpolation fills any missing cells (each cell averages about 19 users). The treated series is the mean across all 25 wave-1 workspaces, pooling roughly 480 users per week to smooth cell-level noise.

The donor matrix keeps each wave-2 workspace as a separate column: 25 time series, each covering weeks 0 through 29. The pre-period treated mean of 0.5927 and the post-period mean of 0.6421 yield a raw before/after gap of +4.94 pp, which coincidentally sits near the ground-truth +5 pp but is still contaminated by everything else that moved in weeks 20 through 29.
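If the aggregate-then-interpolate step feels opaque, this toy log (seven made-up rows, not the tutorial's CSV) shows what it does to a workspace-week with no users. The column names mirror the ones used above; the values are assumptions for illustration.

```python
import pandas as pd

# Toy user log: workspace 1 has no signups in week 1
log = pd.DataFrame({
    "workspace_id":   [0, 0, 0, 0, 1, 1, 1],
    "signup_week":    [0, 0, 1, 2, 0, 2, 2],
    "task_completed": [1, 0, 1, 1, 0, 1, 0],
})

# User rows -> workspace-by-week mean completion rate
panel = (
    log.groupby(["workspace_id", "signup_week"])["task_completed"]
    .mean()
    .reset_index()
)
pivot = panel.pivot(index="signup_week", columns="workspace_id",
                    values="task_completed")
print(pivot)    # workspace 1, week 1 is NaN

# Linear interpolation down each column, then edge fills as a safety net
filled = pivot.interpolate(method="linear", axis=0).ffill().bfill()
print(filled)   # the gap becomes 0.25, halfway between 0.0 and 0.5
```

Interpolated cells are invented data points, so it is worth counting how many cells were filled before trusting the panel.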

Figure 2: The diagnostic on the real 50,000-user dataset. Top panel: wave 1's trajectory in red and the fitted synthetic control in navy dashed, with pre-period RMSE of 3.74 pp and a post-treatment gap averaging +8.29 pp. Bottom panel: the placebo distribution built by re-fitting the synthetic control with each of the 25 donor workspaces standing in as the placebo treated unit. The observed gap lies outside the full placebo range, which drives the pseudo p-value in Step 3.

Where Figure 1 schematically showed the method, this figure shows that it produces a pre-period fit tight enough to make the post-period gap interpretable and a placebo distribution that discriminates the observed effect from noise.

Step 1: Fit Donor Weights with SLSQP

The synthetic control weight vector w is the solution to a constrained optimization problem: minimize the pre-period mean squared error between the treated series and the weighted combination of donor series, subject to each weight being in [0, 1] and all weights summing to 1. The non-negativity and sum-to-1 constraints together define a convex combination, which is what prevents extrapolation beyond the support of the donor pool.

from scipy.optimize import minimize

n_donors = len(wave2_ws)
Y_pre = treated_series[:PRE]
D_pre = donor_matrix[:PRE, :]

def objective(w):
    return np.mean((Y_pre - D_pre @ w) ** 2)

w0 = np.ones(n_donors) / n_donors
bounds = [(0, 1)] * n_donors
constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1}]

result = minimize(
    objective, w0, method="SLSQP", bounds=bounds,
    constraints=constraints,
    options={"ftol": 1e-12, "maxiter": 5000},
)
w_opt = result.x

pre_mse = float(np.mean((Y_pre - D_pre @ w_opt) ** 2))
pre_rmse = float(np.sqrt(pre_mse))
nz = int((w_opt > 0.001).sum())

print(f"Optimization converged: {result.success}")
print(f"Non-zero donor weights (|w| > 0.001): {nz}")
print(f"Pre-period MSE:  {pre_mse:.6f}")
print(f"Pre-period RMSE: {pre_rmse:.4f}  "
      f"({pre_rmse * 100:.2f} percentage points)")

synth_full = donor_matrix @ w_opt
gap = float((treated_series[PRE:] - synth_full[PRE:]).mean())
print(f"\nObserved post-period gap: {gap:+.4f}  (ground truth = +0.0500)")

nz_pairs = sorted(
    [(ws, w_opt[i]) for i, ws in enumerate(wave2_ws) if w_opt[i] > 0.001],
    key=lambda x: -x[1]
)
print("\nTop 5 donor weights:")
for ws_id, weight in nz_pairs[:5]:
    print(f"  workspace {ws_id}: w = {weight:.4f}")

Expected output:

Optimization converged: True
Non-zero donor weights (|w| > 0.001): 12
Pre-period MSE:  0.001400
Pre-period RMSE: 0.0374  (3.74 percentage points)

Observed post-period gap: +0.0829  (ground truth = +0.0500)

Top 5 donor weights:
  workspace 35: w = 0.2016
  workspace 40: w = 0.1900
  workspace 25: w = 0.1638
  workspace 32: w = 0.0872
  workspace 36: w = 0.0784

Here's what's happening: the objective function computes the mean squared error between the treated pre-period series and the dot product of the donor matrix with the weight vector.

SLSQP handles the non-negativity bounds and the sum-to-1 equality constraint simultaneously. The w > 0.001 threshold classifies 12 donors as non-zero. SLSQP doesn't guarantee exact zeros for donors that sit at the lower bound, so the threshold is a display convention. Pre-period RMSE of 3.74 pp measures how closely the weighted donors tracked the treated unit before the upgrade. The observed post-period gap of +0.0829 is the headline estimate. It overshoots the ground-truth +5 pp, and Step 5 quantifies the uncertainty with a confidence interval.

The weights are fixed at the end of the pre-period and never re-estimated using post-treatment data. Any divergence after week 20 reflects movement the optimizer had no opportunity to fit.
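One way to build trust in the optimizer before pointing it at real data is a recovery check on a toy problem where the answer is known. The sketch below is an assumption-laden illustration, not part of the companion notebook: it builds three random donor series, constructs a "treated" series from known weights, and asks SLSQP to find them again.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T0, J = 20, 3                                  # toy sizes, not the tutorial's 20 x 25
donors_pre = rng.uniform(0.3, 0.9, size=(T0, J))
true_w = np.array([0.5, 0.3, 0.2])             # known answer for the check
treated_pre = donors_pre @ true_w              # noise-free convex combination

def objective(w):
    # Same pre-period MSE objective as in Step 1
    return np.mean((treated_pre - donors_pre @ w) ** 2)

res = minimize(
    objective, np.full(J, 1 / J), method="SLSQP",
    bounds=[(0, 1)] * J,
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
    options={"ftol": 1e-12, "maxiter": 5000},
)
w_hat = res.x

print("recovered weights:", np.round(w_hat, 3))
print("weights sum to:   ", round(float(w_hat.sum()), 6))
```

With many more pre-period weeks than donors and no noise, the recovered weights should land close to 0.5, 0.3, and 0.2. The tutorial's real setting (25 donors, 20 weeks, noisy cells) is much less forgiving, which is why the later diagnostics matter.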

Step 2: Plot Treated vs Synthetic Control Trajectories

The primary visual diagnostic for synthetic control is the trajectory overlay: plot both series together, mark the treatment date, and confirm that the synthetic control tracks the treated unit in the pre-period and that a gap opens in the post-period.

A tight pre-period fit is the visible signal that the identification condition holds. A ragged fit means the treated unit is outside the convex hull of the donors, and the whole exercise is suspect.

import matplotlib.pyplot as plt

weeks = np.arange(WINDOW)

fig, ax = plt.subplots(figsize=(9, 4.5))
ax.plot(weeks, treated_series, marker="o", linewidth=1.8,
        color="#C44E52", label="Wave 1 (treated)")
ax.plot(weeks, synth_full, marker="s", linestyle="--",
        linewidth=1.8, color="#4C72B0", label="Synthetic control")
ax.axvline(PRE, color="#555555", linestyle=":", linewidth=1.4,
           label="Model upgrade (week 20)")
ax.set_xlabel("Signup week")
ax.set_ylabel("Mean task completion rate")
ax.set_title("Treated unit vs synthetic control")
ax.legend(frameon=False)
plt.tight_layout()
plt.show()

post_gap = treated_series[PRE:] - synth_full[PRE:]
print("Post-period weekly gaps (treated minus synthetic):")
for wk, g in zip(range(PRE, WINDOW), post_gap):
    print(f"  week {wk}: {g:+.4f}")
print(f"\nMean gap: {post_gap.mean():+.4f}")

Expected output:

Post-period weekly gaps (treated minus synthetic):
  week 20: +0.0398
  week 21: +0.1663
  week 22: +0.1019
  week 23: +0.1535
  week 24: +0.1071
  week 25: +0.1047
  week 26: +0.0424
  week 27: +0.0326
  week 28: +0.0327
  week 29: +0.0479

Mean gap: +0.0829

Here's what's happening: the two lines track each other in the pre-period, confirming the fit assumption. After week 20, the treated series moves above the synthetic control, and the weekly gaps are all positive with a mean of +8.29 pp.

The spread across weeks (from +3.26 pp to +16.63 pp) is how much week-to-week noise the estimator absorbs. A single bad week could swing the mean by a percentage point, which is why the placebo and LOO steps that follow matter more than any single point estimate.
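The sensitivity to a single week can be checked directly from the ten weekly gaps printed above. This small sketch drops each week in turn and recomputes the mean of the remaining nine.

```python
import numpy as np

# Weekly gaps for weeks 20-29, copied from the expected output above
weekly_gaps = np.array([0.0398, 0.1663, 0.1019, 0.1535, 0.1071,
                        0.1047, 0.0424, 0.0326, 0.0327, 0.0479])

mean_gap = weekly_gaps.mean()
# Mean of the other nine weeks when each week is left out once
drop_one = (weekly_gaps.sum() - weekly_gaps) / (len(weekly_gaps) - 1)

print(f"Mean gap: {mean_gap:+.4f}")                       # +0.0829
print(f"Drop-one-week range: [{drop_one.min():+.4f}, "
      f"{drop_one.max():+.4f}]")                          # [+0.0736, +0.0885]
```

Removing week 21 alone pulls the mean down by roughly 0.9 pp, which matches the "about a percentage point" swing described above.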

Step 3: In-Space Placebo Permutation Test

You can't run a standard t-test on a single treated unit. The synthetic control has one treated observation (wave 1) and 25 donor observations, a setup where no conventional p-value applies.

The standard validation is the in-space placebo permutation test. Treat each donor in turn as if it were the "treated" unit, re-fit the synthetic control using the remaining 24 donors as its placebo pool, record the placebo post-period gap, and compare the observed gap to the distribution of placebos.

placebo_gaps = []

for j in range(n_donors):
    placebo_treated = donor_matrix[:, j]
    placebo_pool = np.delete(donor_matrix, j, axis=1)
    n_p = placebo_pool.shape[1]

    def obj_p(w):
        return np.mean((placebo_treated[:PRE] - placebo_pool[:PRE] @ w) ** 2)

    res_p = minimize(
        obj_p, np.ones(n_p) / n_p, method="SLSQP",
        bounds=[(0, 1)] * n_p,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
        options={"ftol": 1e-12, "maxiter": 5000},
    )
    synth_p = placebo_pool @ res_p.x
    placebo_gaps.append((placebo_treated[PRE:] - synth_p[PRE:]).mean())

placebo_gaps = np.array(placebo_gaps)
observed_gap = gap

rank = int((np.abs(placebo_gaps) >= abs(observed_gap)).sum())
pseudo_p = (rank + 1) / (len(placebo_gaps) + 1)

print(f"Observed gap:      {observed_gap:+.4f}")
print(f"Placebo mean gap:  {placebo_gaps.mean():+.4f}")
print(f"Placebo std gap:   {placebo_gaps.std():.4f}")
print(f"Placebo gap range: [{placebo_gaps.min():+.4f}, "
      f"{placebo_gaps.max():+.4f}]")
print(f"|placebo| >= |observed|: {rank} of {len(placebo_gaps)}")
print(f"Pseudo p-value: {pseudo_p:.4f}")

Expected output:

Observed gap:      +0.0829
Placebo mean gap:  -0.0008
Placebo std gap:   0.0380
Placebo gap range: [-0.0748, +0.0707]
|placebo| >= |observed|: 0 of 25
Pseudo p-value: 0.0385

Here's what's happening: the loop iterates over all 25 wave-2 workspaces. For each one, you remove it from the donor pool, treat it as a placebo-treated unit, and re-run the SLSQP optimization. After 25 placebo runs, you count how many placebo gaps meet or exceed the observed gap in absolute value and apply the conservative (count + 1) / (N + 1) correction.

None of the 25 placebos produced a gap as extreme as the observed +0.0829, yielding a pseudo-p-value of 0.0385. That rejects the null of no effect at the 5% level. The placebo distribution centers near zero (mean -0.0008, std 3.80 pp), which is the noise floor to compare the observed gap against.

The correct statistical statement is: none of the 25 placebo gaps from untreated donors is as extreme as the observed gap, so the pseudo p-value falls below 5%. The permutation test's resolution depends on the donor pool size: with 25 donors, the smallest possible pseudo-p is 1/26 = 0.0385, so you can't get a smaller p-value with this donor count. A wider placebo distribution or a smaller observed gap would rank the observation inside the placebo bulk and push the pseudo p above any useful threshold.
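The floor on the pseudo p-value is simple arithmetic, sketched here with two hypothetical helper names (`pseudo_p` and `min_donors_for`). It uses the same (count + 1) / (N + 1) correction as the code above.

```python
def pseudo_p(n_at_least_as_extreme, n_placebos):
    """Permutation pseudo p-value with the (count + 1) / (N + 1) correction."""
    return (n_at_least_as_extreme + 1) / (n_placebos + 1)

def min_donors_for(alpha):
    """Smallest donor count whose best-case pseudo-p is strictly below alpha."""
    n = 1
    while pseudo_p(0, n) >= alpha:
        n += 1
    return n

print(f"{pseudo_p(0, 25):.4f}")   # 0.0385: best case with 25 donors
print(f"{pseudo_p(1, 25):.4f}")   # 0.0769: one extreme placebo already loses 5%
print(min_donors_for(0.05))       # 20
print(min_donors_for(0.01))       # 100
```

With 25 donors, a single placebo as extreme as the observed gap is enough to push the result past the 5% threshold, and reaching the 1% level would need at least 100 donors.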

Step 4: Leave-One-Out Donor Sensitivity

A tight point estimate can still be fragile if it hangs on a single donor. The leave-one-out (LOO) sensitivity check drops each non-zero-weight donor in turn, refits the synthetic control on the remaining donors, and records the new gap.

Abadie (2021) recommends this as the first-line robustness check. If removing any single donor swings the gap by a large amount, you don't have a synthetic control – you have a single-donor comparison dressed up with extra weight.

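def fit_and_gap(treated, donors, pre=PRE):
    """Fit convex donor weights on the first `pre` weeks; return the mean post-period gap."""
    n = donors.shape[1]

    def obj(w):
        # Pre-period mean squared error between treated and weighted donors
        return np.mean((treated[:pre] - donors[:pre] @ w) ** 2)

    res = minimize(
        obj, np.ones(n) / n, method="SLSQP",
        bounds=[(0, 1)] * n,                                         # each weight in [0, 1]
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],  # weights sum to 1
        options={"ftol": 1e-12, "maxiter": 5000},
    )
    synth = donors @ res.x   # weights stay frozen for the post-period projection
    return float((treated[pre:] - synth[pre:]).mean())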


nz_idx = np.where(w_opt > 0.001)[0]
loo_rows = []
for j in nz_idx:
    kept = np.delete(donor_matrix, j, axis=1)
    gap_new = fit_and_gap(treated_series, kept)
    loo_rows.append({
        "dropped_workspace": int(wave2_ws[j]),
        "dropped_weight": float(w_opt[j]),
        "new_gap": gap_new,
    })
loo_df = pd.DataFrame(loo_rows).sort_values("dropped_weight", ascending=False)
print(loo_df.round(4).to_string(index=False))
print(f"\nLOO gap range: [{loo_df.new_gap.min():+.4f}, "
      f"{loo_df.new_gap.max():+.4f}]")
print(f"Original gap:  {gap:+.4f}")

Expected output:

 dropped_workspace  dropped_weight  new_gap
                35          0.2016   0.0945
                40          0.1900   0.0756
                25          0.1638   0.0932
                32          0.0872   0.0868
                36          0.0784   0.0739
                31          0.0718   0.0858
                29          0.0648   0.0782
                26          0.0439   0.0786
                27          0.0364   0.0867
                46          0.0350   0.0794
                39          0.0192   0.0848
                42          0.0078   0.0839

LOO gap range: [+0.0739, +0.0945]
Original gap:  +0.0829

Here's what's happening: the loop drops one non-zero-weight donor at a time and refits. All 12 LOO estimates stay positive, with the range [+7.39 pp, +9.45 pp] straddling the original +8.29 pp by about a percentage point in either direction.

No single donor drives the result. Even dropping workspace 35 (the largest weight at 0.2016) only shifts the gap to +9.45 pp because the optimizer redistributes weight across remaining donors.

That redistribution is the point of convex-combination weighting: many near-equivalent donor mixtures produce similar counterfactuals.

Step 5: Cluster Bootstrap 95% Confidence Intervals

Point estimates are only half the story. A stakeholder asking "how sure are you" wants an interval. The classical non-parametric bootstrap doesn't apply cleanly to synthetic control on a single treated unit, because resampling the one treated time series with replacement destroys the time-ordering that the estimator depends on.

A valid substitute is the user-level cluster bootstrap: resample users with replacement, rebuild the workspace-by-week panel from the resampled user log, re-fit the donor weights on the pre-period, and record the post-period gap.

Repeat 500 times. The 2.5th and 97.5th percentiles of the resulting distribution are the 95% CI.

def build_panel(df_inner):
    dfw = df_inner[df_inner.signup_week < WINDOW].copy()
    panel = (dfw.groupby(["workspace_id", "signup_week"])
             ["task_completed"].mean().reset_index())
    panel.columns = ["workspace_id", "week", "task_completed"]
    piv = panel.pivot(index="week", columns="workspace_id",
                      values="task_completed")
    piv = piv.interpolate(method="linear", axis=0).ffill().bfill()
    ws_wave_b = df_inner.groupby("workspace_id").wave.first()
    w1 = sorted(ws_wave_b[ws_wave_b == 1].index.tolist())
    w2 = sorted(ws_wave_b[ws_wave_b == 2].index.tolist())
    return piv[w1].mean(axis=1).values, piv[w2].values


rng = np.random.default_rng(7)
n = len(df)
n_reps = 500
gaps_boot = np.empty(n_reps)
for i in range(n_reps):
    sample = df.iloc[rng.integers(0, n, size=n)]
    t_b, d_b = build_panel(sample)
    gaps_boot[i] = fit_and_gap(t_b, d_b)

lo = float(np.percentile(gaps_boot, 2.5))
hi = float(np.percentile(gaps_boot, 97.5))
print(f"Post-period gap 95% CI: [{lo:+.4f}, {hi:+.4f}]")
print(f"Observed point estimate: {gap:+.4f}")
print(f"Ground truth +0.0500 inside CI: "
      f"{'YES' if lo <= 0.05 <= hi else 'NO'}")
print(f"Zero inside CI: {'YES' if lo <= 0 <= hi else 'NO'}")

Expected output:

Post-period gap 95% CI: [+0.0511, +0.1215]
Observed point estimate: +0.0829
Ground truth +0.0500 inside CI: NO
Zero inside CI: NO

Here's what's happening: you resample the user log 500 times, rebuild the panel from each resample, re-fit the weights on the pre-period, and take the 2.5th and 97.5th percentiles of the 500 resulting gaps. The 95% CI is [+5.11 pp, +12.15 pp]. It excludes zero with room to spare, so the effect is statistically meaningful.

The lower bound sits just above the +5 pp ground truth: a finite-sample upward bias typical of synthetic control on small donor panels, where each donor workspace (about 19 users per week) carries more noise than the 25-workspace treated average.
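The noise asymmetry between a single donor cell and the pooled treated average can be approximated with the binomial standard error. This is a rough sketch: it assumes a completion rate of about 0.6 and independent users, and it ignores workspace-level heterogeneity.

```python
import math

def proportion_se(p, n):
    """Standard error of a sample proportion under independent draws."""
    return math.sqrt(p * (1 - p) / n)

p = 0.6                                # assumed completion rate, close to the observed means
se_donor = proportion_se(p, 19)        # one donor workspace-week (about 19 users)
se_treated = proportion_se(p, 480)     # pooled treated average (about 480 users)

print(f"Donor cell SE:   {se_donor:.4f}")     # 0.1124
print(f"Treated cell SE: {se_treated:.4f}")   # 0.0224
print(f"Ratio:           {se_donor / se_treated:.1f}x")
```

A single donor cell is roughly five times noisier than the treated average, so the optimizer is matching a smooth target with jittery building blocks.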

Placebo, LOO, and bootstrap together confirm a real positive effect. The point-estimate bias is the tradeoff for using single-workspace donors.

For a stakeholder report, cite the interval alongside the point estimate and note the bias direction so the team reads the number with the right calibration.

When Synthetic Control Fails

Synthetic control is a precise tool with narrow failure modes. The four most common map directly to the three identification assumptions.

1. Donor Pool Contamination (Violates No Interference)

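If the upgrade shipped to wave 1 spills over to wave 2 (shared API rate-limit pools, shared prompt caches, users migrating between workspaces), the donors are contaminated. When the spillover lifts donor outcomes, the gap understates the true effect; when it hurts them (a congested shared rate-limit pool, for example), the gap overstates it.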

The defense is institutional: audit what changed for donor units around the treatment date, explicitly including model-level channels like shared routing, shared caching, and shared monitoring.

2. Fundamentally Different Units (Violates Pre-period Fit)

The convex-hull condition states that the treated unit must lie within the donors' support. If the treated unit is structurally different (for example, enterprise customers where every donor is an SMB), no weighting scheme yields a credible counterfactual, regardless of how tight the pre-period fit appears.

Check the weights: if the optimizer assigns 80 percent to a single donor, that donor is doing the entire job, and you should ask whether it's truly comparable.
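A compact way to run that check is to look at the largest weight together with an "effective number of donors", computed here as the inverse of the sum of squared weights. That summary is an added diagnostic, not something the tutorial's notebook reports. The weights are the 12 non-zero values from the Step 4 output.

```python
import numpy as np

# Non-zero donor weights copied from the LOO table in Step 4
weights = np.array([0.2016, 0.1900, 0.1638, 0.0872, 0.0784, 0.0718,
                    0.0648, 0.0439, 0.0364, 0.0350, 0.0192, 0.0078])

shares = weights / weights.sum()               # renormalize after rounding
max_share = shares.max()
effective_donors = 1.0 / np.sum(shares ** 2)   # equals 12 if all weights were equal

print(f"Largest single weight: {max_share:.2f}")
print(f"Effective number of donors: {effective_donors:.1f}")
```

Here the largest weight is about 20% and the effective donor count is between 7 and 8, which is far from the single-donor situation described above.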

3. Post-Treatment Shocks to Donors (Violate Stable Donor Composition)

The synthetic control projects donor behavior forward from pre-period weights. If a key donor experiences a major shock after treatment (a customer churn, an outage, a competitor release), its post-treatment trajectory is no longer a clean counterfactual. Inspect the time series of high-weight donors for unusual post-treatment patterns.
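That inspection can be partly automated. The sketch below flags donors whose post-period mean moved far from their pre-period mean, measured against pre-period week-to-week variation. The function name, the threshold of 3, and the toy panel are all assumptions, and the crude z-score ignores autocorrelation, so treat a flag as a prompt to look at the plot rather than as a test.

```python
import numpy as np

def donor_shift_z(donor_matrix, pre):
    """Crude z-score of each donor's post-period mean shift.
    Rows are weeks, columns are donors; `pre` is the number of pre-period weeks."""
    pre_block, post_block = donor_matrix[:pre], donor_matrix[pre:]
    pre_sd = pre_block.std(axis=0, ddof=1)
    se = pre_sd / np.sqrt(post_block.shape[0])
    return (post_block.mean(axis=0) - pre_block.mean(axis=0)) / se

# Toy panel: three donors with a small alternating wiggle, 30 weeks
weeks = np.arange(30)
wiggle = 0.01 * (-1.0) ** weeks
donors = np.column_stack([0.55 + wiggle, 0.60 + wiggle, 0.65 + wiggle])
donors[20:, 2] -= 0.15          # hypothetical outage hits donor 2 after week 20

z = donor_shift_z(donors, pre=20)
flagged = np.where(np.abs(z) > 3.0)[0]
print("Flagged donor columns:", flagged)
```

In practice you would run this only on the high-weight donors and then refit without any donor that has a documented shock.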

4. Overfitting Risk When J Approaches T₀ (Degrades Pre-period Fit in Practice)

The optimizer can fit the pre-period solely to noise when J ≥ T₀, creating the illusion of comparability. This tutorial runs at T₀/J = 20/25 = 0.8, in the danger zone. The LOO sensitivity check is the practical defense: if the gap holds up across donor drops, the fit reflects genuine comparability.
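The overfitting risk is easiest to see in a deliberately extreme sketch. It uses unconstrained least squares on pure noise, which is a worse case than the convex-constrained fit used in this tutorial. With more donors than pre-period weeks, the pre-period fit is essentially perfect even though the donors carry no information about the treated series.

```python
import numpy as np

rng = np.random.default_rng(0)
T0, T1, J = 20, 10, 25                         # pre weeks, post weeks, donors

donors = rng.standard_normal((T0 + T1, J))     # pure noise donors
treated = rng.standard_normal(T0 + T1)         # unrelated noise "treated" unit

# Unconstrained least squares on the pre-period (no bounds, no sum-to-1)
w, *_ = np.linalg.lstsq(donors[:T0], treated[:T0], rcond=None)

pre_rmse = np.sqrt(np.mean((treated[:T0] - donors[:T0] @ w) ** 2))
post_rmse = np.sqrt(np.mean((treated[T0:] - donors[T0:] @ w) ** 2))

print(f"Pre-period RMSE:  {pre_rmse:.2e}")     # numerically zero
print(f"Post-period RMSE: {post_rmse:.2f}")    # large: the fit does not generalize
```

The non-negativity and sum-to-1 constraints shrink this problem considerably but don't remove it, which is why a tight pre-period fit at J = 25 and T₀ = 20 still needs the LOO check.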

These failure modes stay invisible in your point estimate. They surface as a synthetic control that looks well-fit on paper and produces a gap that doesn't hold up when treatment rolls out to the next wave. Placebo test, LOO sensitivity, and bootstrap together are your defense.

What to Do Next

Synthetic control is the right tool when your feature ships globally and there's a pool of untreated units resembling the treated unit.

If treated and donor units operate at different scales, augmented synthetic control adds a bias-correction term from a linear outcome model. If you have many treated units with staggered adoption, generalized synthetic control (the gsynth R package) extends the framework.

For production Python work, pysyncon implements the full Abadie-Diamond-Hainmueller estimator with predictor-weighting via a V-matrix outer loop and adds in-time placebo tests (assigning the treatment to a pre-period date and checking for a spurious gap) that this tutorial doesn't cover. The from-scratch implementation here shows the mechanics; pysyncon is what you ship to a reviewer.
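The in-time placebo idea can also be sketched with the same from-scratch machinery. The example below does not use pysyncon's API. It defines a standalone helper that mirrors this tutorial's fit_and_gap, builds a noise-free toy panel (an assumption for illustration), and compares the gap at the real treatment date with the gap at a fake date inside the pre-period.

```python
import numpy as np
from scipy.optimize import minimize

def fit_and_gap(treated, donors, pre):
    """Fit convex donor weights on rows [0, pre); return the mean gap on rows [pre, end)."""
    n = donors.shape[1]
    res = minimize(
        lambda w: np.mean((treated[:pre] - donors[:pre] @ w) ** 2),
        np.full(n, 1 / n), method="SLSQP", bounds=[(0, 1)] * n,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
        options={"ftol": 1e-12, "maxiter": 5000},
    )
    return float((treated[pre:] - donors[pre:] @ res.x).mean())

# Toy panel: 30 weeks, 3 donors, true effect of +0.05 starting at week 20
t = np.arange(30)
donors = np.column_stack([0.55 + 0.010 * t,
                          0.75 - 0.005 * t,
                          0.60 + 0.05 * np.sin(t)])
treated = donors @ np.array([0.5, 0.3, 0.2])
treated[20:] += 0.05

REAL_DATE, FAKE_DATE = 20, 12

# Real analysis: fit on weeks 0-19, measure the gap on weeks 20-29
real_gap = fit_and_gap(treated, donors, pre=REAL_DATE)

# In-time placebo: use only pre-treatment data and pretend treatment hit at week 12
placebo_gap = fit_and_gap(treated[:REAL_DATE], donors[:REAL_DATE], pre=FAKE_DATE)

print(f"Gap at real date (week 20): {real_gap:+.4f}")
print(f"Gap at fake date (week 12): {placebo_gap:+.4f}")
```

A placebo gap near zero is what you hope to see. A sizable gap at the fake date would suggest the method finds "effects" where none exist, for example because the treated unit was already drifting away from its donors.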

The companion notebook for this tutorial lives at github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/04_synthetic_control. Clone the repo, generate the synthetic dataset, and run synthetic_control_demo.ipynb (or synthetic_control_demo.py) to reproduce every code block, every number, and every figure from this tutorial.

When a model upgrade ships to every user at once, the naive before/after is usually the wrong number. Synthetic control builds "users like yours who didn't get the upgrade" from the data you already have, locks in the weights before the treatment week, and gives you a placebo distribution plus a bootstrap interval you can defend when a stakeholder asks how confident you are.

Product Experimentation with Regression Discontinuity: How an LLM Confidence Threshold Creates a Natural Experiment in Python

Rudrendu Paul — Fri, 08 May 2026 15:33:41 +0000

Causal inference for LLM-based features starts with one question teams ask before they ship anything: did the change actually move the metric, or did the metric just move?

Let's say that your team built a routing layer that splits incoming queries between two models: queries with a confidence score below 0.85 go to a premium model, and those above 0.85 go to a cheaper distilled model. The premium model costs 5x as much as the cheaper one.

Your boss wants the answer that ends the debate: Is the premium model worth it for the queries it sees?

You can't run a clean A/B test, because routing is deterministic: a query at confidence 0.84 always gets premium, a query at 0.86 always gets cheap, and you can't randomize the assignment.
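The routing rule itself fits in a few lines. This is a minimal sketch: the function name is hypothetical, and sending a score of exactly 0.85 to the cheap model is an assumption, since the description above only covers "below" and "above".

```python
THRESHOLD = 0.85

def route(confidence: float) -> str:
    """Deterministic routing: low-confidence queries go to the premium model."""
    # Assumption: a score exactly at the threshold is routed to the cheap model.
    return "premium" if confidence < THRESHOLD else "cheap"

for score in (0.84, 0.849, 0.851, 0.86):
    print(score, "->", route(score))
```

There is no randomness anywhere in this function, which is exactly why an A/B comparison is off the table and why the cutoff becomes the source of identification.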

You also can't trust a naïve comparison of premium-routed users against cheap-routed users. Premium handles the harder queries by design (that's the reason you built the gate), so the two groups differ in query difficulty before either model touches them.

The threshold itself is your free experiment. Right at 0.85, the assignment flips, but the queries on either side of that boundary are essentially identical. A query at confidence 0.849 isn't meaningfully different from a query at 0.851. Any differences in outcomes between the two narrow groups stem solely from the routing decision. That's what regression discontinuity design (RDD) reads.

In this tutorial, you'll use Python to estimate the causal effect of premium routing on task completion using sharp RDD with local linear regression. You'll sweep bandwidths to test estimate stability, run a manipulation diagnostic, check robustness with a quadratic specification, and bootstrap 95% confidence intervals around every point estimate.

The LLM telemetry is a 50,000-user synthetic dataset with the ground-truth premium-routing effect baked in at +6 percentage points, so you can verify that RDD recovers it.

Companion code: every code block runs end-to-end in the companion notebook.

Why Threshold Routing is a Natural Experiment
What Regression Discontinuity Actually Does
Prerequisites
Setting Up the Working Example
Step 1: A Sharp RDD with Local Linear Regression
Step 2: Try Different Bandwidths
Step 3: Checking for Manipulation at the Threshold
Step 4: Quadratic Specification as a Robustness Check
Step 5: Bootstrap Confidence Intervals
When Regression Discontinuity Fails
What to Do Next

Why Threshold Routing is a Natural Experiment

The product reason this routing rule exists is to help your team spend the premium model budget where it earns its keep. Low-confidence queries are the harder ones, which is where a stronger model has the most upside. High-confidence queries already look easy enough for the cheap model to handle.

You'll see this routing direction across confidence-score gates for Q&A assistants, query-complexity gates in multi-model gateways like OpenRouter, safety-score gates in content moderation, and latency-budget gates that re-route when the cheap model would exceed a p99 latency budget.

The mechanism is the same in every case: a continuous score, a threshold, and a deterministic routing rule.
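As a minimal sketch of that mechanism (the function name and the "premium"/"cheap" labels are illustrative assumptions, not taken from any particular gateway), a confidence gate reduces to a single comparison:

```python
CONFIDENCE_CUTOFF = 0.85  # threshold used throughout this tutorial


def route_query(query_confidence: float, cutoff: float = CONFIDENCE_CUTOFF) -> str:
    """Deterministic gate: low-confidence queries go to the premium model."""
    if not 0.0 <= query_confidence <= 1.0:
        raise ValueError("query_confidence must lie in [0, 1]")
    # Strictly below the cutoff -> premium; at or above -> cheap.
    return "premium" if query_confidence < cutoff else "cheap"


for score in (0.30, 0.849, 0.85, 0.851):
    print(f"{score:.3f} -> {route_query(score)}")
```

The only input is the score, which is why the rule is deterministic: two queries with the same confidence can never end up on different models.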

What makes this setup useful for causal inference is that users don't pick which model they get. A query lands, the system computes confidence, and the routing layer decides. Right at the threshold, the user's experience flips from premium to cheap based on a difference too small to be meaningful.

Again, a query at 0.849 confidence isn't shipping a different problem to the model than a query at 0.851. Anything that differs in outcomes between those two groups is the routing decision speaking. The underlying query is the same.

That local randomness is the experiment RDD reads from. You don't need a randomized control group, a propensity score, or an instrument. You need a sharp threshold that nobody can game.

What Regression Discontinuity Actually Does

The jump at the threshold is the causal effect, which is the number a product team can act on. RDD reads it by fitting two separate regression lines to the outcome: one for users just below the threshold and one for users just above. The vertical difference between those two fitted lines at the cutoff is the local average treatment effect at that point.
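In symbols, and using notation introduced here for illustration only (Y is task completion, X is query confidence, c = 0.85 is the cutoff, D marks the premium side, h is the bandwidth), the quantity being estimated and the regression that estimates it can be written as:

```latex
% Estimand: the gap between the two one-sided limits at the cutoff.
% Premium routing sits BELOW the cutoff, so the left limit comes first.
\tau \;=\; \lim_{x \uparrow c} \mathbb{E}[\,Y \mid X = x\,]
      \;-\; \lim_{x \downarrow c} \mathbb{E}[\,Y \mid X = x\,],
\qquad c = 0.85

% Local linear estimator, fit only on observations with |X_i - c| < h:
Y_i \;=\; \alpha + \tau D_i + \beta\,(X_i - c) + \gamma\,D_i\,(X_i - c) + \varepsilon_i,
\qquad D_i = \mathbf{1}[\,X_i < c\,]
```

Because X is centered at c, the coefficient on D is the vertical distance between the two fitted lines exactly at the cutoff.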

Graphically, picture task completion on the y-axis and query confidence on the x-axis. Completion generally trends with confidence (easier queries complete more often). At exactly 0.85, though, users below the cutoff get premium routing, and users above get cheap.

If premium routing helps, task completion sits higher just below 0.85 and drops just above it. Read from left to right with confidence rising, the plot shows a downward step at 0.85, because you're moving from the premium-treated zone into the cheap-treated zone.

Figure 1. Conceptual schematic. Two outcome trajectories, one for premium-routed queries (confidence below 0.85) and one for cheap-routed queries (confidence above 0.85), meet at the threshold but don't match. The vertical gap between their endpoints at 0.85 is the local causal effect of premium routing.

That gap is identified under two named assumptions:

No manipulation of the running variable: Users (or your system) can't precisely nudge a query's confidence score across the cutoff. If anyone can game their score to land just below 0.85 and grab premium routing, the cutoff is no longer drawn at random, and RDD breaks.
Continuity of potential outcomes at the cutoff: Every other factor that affects task completion (query type, user expertise, workspace tenure, time of day) varies smoothly across 0.85. Only the routing assignment changes discontinuously at exactly the threshold. If a second product rule fires at 0.85 (a different logging level, a separate UI treatment, a retry policy), RDD will attribute that rule's effect to the routing decision.

These are the two assumptions you check before you trust the estimate. Step 3 below tests the first one. The second is a structural property of your system that you have to know cold.
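The continuity assumption can't be proven from data, but one common partial check is to run the same local linear RDD on a covariate that was fixed before routing happened. If that covariate jumps at 0.85, something other than routing changes at the threshold. The sketch below uses toy data and a hypothetical account_age_days column; the companion dataset may not contain such a field.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 20_000
cutoff, bw = 0.85, 0.10

# Toy data (assumption): a pre-routing covariate that varies smoothly with confidence.
toy = pd.DataFrame({"query_confidence": rng.beta(5, 2, size=n)})
toy["account_age_days"] = (
    200 + 100 * toy.query_confidence + rng.normal(0, 50, size=n)
)

near = toy[(toy.query_confidence > cutoff - bw)
           & (toy.query_confidence < cutoff + bw)].copy()
near["below_cutoff"] = (near.query_confidence < cutoff).astype(int)
near["rc"] = near.query_confidence - cutoff

# Same specification as the outcome RDD, but with the covariate on the left-hand side.
placebo = smf.ols(
    "account_age_days ~ below_cutoff + rc + below_cutoff:rc",
    data=near,
).fit(cov_type="HC3")

jump = placebo.params["below_cutoff"]
se = placebo.bse["below_cutoff"]
print(f"Covariate jump at cutoff: {jump:+.2f} (SE {se:.2f})")
```

A jump that is small relative to its standard error is what you hope to see. It doesn't prove continuity, but a large jump is a clear warning.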

Two practical choices shape every RDD: the bandwidth (how close to the cutoff to restrict the analysis) and the functional form (linear, quadratic, or local polynomial).

Narrow bandwidths cut potential bias by staying close to the local-randomization zone, but they shrink the sample. Linear specifications are stable, though they assume the underlying relationship can be approximated by a straight line on each side.

You'll try both linear and quadratic specifications at multiple bandwidths to see whether the answer holds.

The article uses sharp RDD throughout, since assignment is a deterministic function of confidence (below 0.85 always premium, above 0.85 always cheap). When the threshold is probabilistic and compliance is partial, the design is a fuzzy RDD, which requires an instrumental variables framework that you can implement using the rdrobust Python package.
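To make the fuzzy case concrete, here is a minimal sketch on toy data. It assumes the gate is followed 90% of the time and overridden 10% of the time (an invented compliance rate), and it estimates the effect as a ratio of two jumps: the jump in completion divided by the jump in the probability of premium routing. This hand-rolled Wald ratio gives a point estimate only; a dedicated package is the better route for standard errors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
cutoff, bw = 0.85, 0.10

confidence = rng.beta(5, 2, size=n)
below = (confidence < cutoff).astype(float)

# Assumption: 90% compliance with the gate on each side of the cutoff.
p_premium = np.where(below == 1, 0.9, 0.1)
premium = rng.binomial(1, p_premium)
completed = rng.binomial(1, 0.50 + 0.06 * premium)  # true effect: +0.06


def jump_at_cutoff(y, x, below, cutoff, bw):
    """Local linear jump in y at the cutoff (coefficient on the below-cutoff side)."""
    keep = np.abs(x - cutoff) < bw
    rc = x[keep] - cutoff
    d = below[keep]
    design = np.column_stack([np.ones(rc.size), d, rc, d * rc])
    coef, *_ = np.linalg.lstsq(design, y[keep].astype(float), rcond=None)
    return coef[1]


first_stage = jump_at_cutoff(premium, confidence, below, cutoff, bw)     # jump in P(premium)
reduced_form = jump_at_cutoff(completed, confidence, below, cutoff, bw)  # jump in completion
fuzzy_late = reduced_form / first_stage

print(f"First stage (take-up jump):  {first_stage:+.3f}")
print(f"Reduced form (outcome jump): {reduced_form:+.3f}")
print(f"Fuzzy RDD estimate:          {fuzzy_late:+.3f}")
```

In a sharp design the first stage equals 1, so the ratio collapses to the plain outcome jump used in the rest of this tutorial.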

Prerequisites

You need Python 3.11 or newer, comfort with pandas and statsmodels, and rough familiarity with linear regression and interaction terms.

Install the packages used in this tutorial:

pip install numpy pandas statsmodels matplotlib scipy

Here's what's happening: four standard scientific Python libraries plus matplotlib for the diagnostic visualization. Nothing exotic.

Clone the companion repo and generate the synthetic dataset:

git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv

Here's what's happening: the data generator draws 50,000 users with a query_confidence score from a Beta(5,2) distribution, applies the routing rule (routed_to_premium = query_confidence < 0.85), and bakes a +6-percentage-point premium routing effect into task_completed. Same seed, same dataset, every time.
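If you want to see the shape of that process without cloning the repo, the sketch below is a stripped-down stand-in. It is not the companion generate_data.py: the 50% baseline completion rate and the four-column layout are assumptions made for illustration, so its numbers won't match the expected outputs later in this tutorial.

```python
import numpy as np
import pandas as pd


def make_toy_logs(n_users: int = 50_000, seed: int = 42) -> pd.DataFrame:
    """Toy stand-in for the companion generator (assumed baseline of 0.50)."""
    rng = np.random.default_rng(seed)
    confidence = rng.beta(5, 2, size=n_users)
    # Routing rule from the tutorial: below 0.85 goes to premium.
    routed_to_premium = (confidence < 0.85).astype(int)
    # Ground truth: premium routing adds 6 percentage points to completion.
    p_complete = 0.50 + 0.06 * routed_to_premium
    return pd.DataFrame({
        "user_id": np.arange(n_users),
        "query_confidence": confidence,
        "routed_to_premium": routed_to_premium,
        "task_completed": rng.binomial(1, p_complete),
    })


toy = make_toy_logs()
print(toy.routed_to_premium.mean().round(3))           # share routed to premium
print(toy.groupby("routed_to_premium").task_completed.mean().round(3))
```

Fixing the seed is what makes the output reproducible: calling make_toy_logs twice with the same seed returns identical frames.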

Setting Up the Working Example

The dataset simulates a SaaS product that routes queries between a premium and a cheap model based on confidence score. The threshold is 0.85, and the ground-truth causal effect of premium routing is +6 percentage points on task completion. You know the truth, so you can check whether RDD recovers it.

Load the data and look at the routing breakdown:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data/synthetic_llm_logs.csv")
print(f"Loaded {len(df):,} rows, {df.shape[1]} columns")

print("\nRouting breakdown:")
counts = df.routed_to_premium.value_counts().to_dict()
print(f"  Premium-routed (confidence < 0.85):  {counts.get(1, 0):,}")
print(f"  Cheap-routed   (confidence >= 0.85): {counts.get(0, 0):,}")

print("\nQuery confidence distribution:")
print(df.query_confidence.describe().round(3))

Expected output:

Loaded 50,000 rows, 16 columns

Routing breakdown:
  Premium-routed (confidence < 0.85):  38,874
  Cheap-routed   (confidence >= 0.85): 11,126

Query confidence distribution:
count    50000.000
mean         0.715
std          0.159
min          0.078
25%          0.611
50%          0.736
75%          0.838
max          0.998

Here's what's happening: about 78% of queries land below the 0.85 cutoff and get premium routing. The Beta(5,2) distribution is skewed toward the upper end, with a median of 0.736, and most of its mass still sits below 0.85. The remaining 22% are queries that the model already feels confident about, and they go to the cheap model.

Before any regression, look at the naïve comparison every product team is tempted to run:

naive = (
    df[df.routed_to_premium == 1].task_completed.mean()
    - df[df.routed_to_premium == 0].task_completed.mean()
)
print(f"Naive premium-vs-cheap effect: {naive:+.4f}  (ground truth = +0.06)")

Expected output:

Naive premium-vs-cheap effect: +0.0632  (ground truth = +0.06)

Here's what's happening: the naïve estimate sits at +0.0632, which is suspiciously close to the truth. That's an artifact of this specific synthetic dataset, where query_confidence determines routing but doesn't affect the outcome except through routing, so nothing confounds the premium-versus-cheap comparison.

In production, you almost never get this lucky. User expertise, prompt phrasing, time of day, and a dozen unobserved query traits all correlate with confidence and with completion.

A naïve comparison in a real system can be off by 50% or more in either direction. RDD gives you identification that doesn't depend on the absence of hidden confounders.
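To see how the naïve comparison breaks, here is a toy simulation in which easier queries complete more often on their own. The 0.30 intercept and 0.40 slope are invented for illustration; only the +0.06 routing effect mirrors the tutorial's ground truth.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
cutoff, bw = 0.85, 0.10

confidence = rng.beta(5, 2, size=n)
premium = (confidence < cutoff).astype(float)

# Assumption: completion rises with confidence, independent of routing.
p_complete = 0.30 + 0.40 * confidence + 0.06 * premium
completed = rng.binomial(1, p_complete).astype(float)

# Naive comparison: all premium-routed vs. all cheap-routed queries.
naive = completed[premium == 1].mean() - completed[premium == 0].mean()

# Local linear RDD inside the bandwidth, fit by ordinary least squares.
keep = np.abs(confidence - cutoff) < bw
rc = confidence[keep] - cutoff
d = premium[keep]
design = np.column_stack([np.ones(rc.size), d, rc, d * rc])
coef, *_ = np.linalg.lstsq(design, completed[keep], rcond=None)
rdd = coef[1]

print(f"Naive estimate: {naive:+.4f}")
print(f"RDD estimate:   {rdd:+.4f}   (ground truth = +0.06)")
```

In this setup the premium group handles harder queries, so the naïve difference is pulled down enough to flip sign, while the RDD estimate stays near the true effect.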

Step 1: A Sharp RDD with Local Linear Regression

The basic sharp RDD estimator is a local linear regression. Restrict to users whose confidence sits within a bandwidth of the cutoff, fit separate linear slopes on each side, and read off the jump at 0.85.

cutoff = 0.85
bw = 0.10

near = df[(df.query_confidence > cutoff - bw)
          & (df.query_confidence < cutoff + bw)].copy()
near["below_cutoff"] = (near.query_confidence < cutoff).astype(int)
near["rc"] = near.query_confidence - cutoff

rdd_model = smf.ols(
    "task_completed ~ below_cutoff + rc + below_cutoff:rc",
    data=near,
).fit(cov_type="HC3")

effect = rdd_model.params["below_cutoff"]
print(f"RDD effect at cutoff (LATE): {effect:+.4f}")
print(f"Std error (HC3):             {rdd_model.bse['below_cutoff']:.4f}")
print(f"p-value:                     {rdd_model.pvalues['below_cutoff']:.4f}")
print(f"N users in (0.75, 0.95):     {len(near):,}")

Expected output:

RDD effect at cutoff (LATE): +0.0548
Std error (HC3):             0.0131
p-value:                     0.0000
N users in (0.75, 0.95):     21,689

Here's what's happening: the model fits separate intercepts and slopes on each side of 0.85 (below_cutoff is the side indicator, rc is confidence centered at the cutoff). The coefficient on below_cutoff reads off the vertical jump at the threshold, which is the local average treatment effect (LATE) for queries with confidence near 0.85. You get +0.0548, within sampling noise of the +0.06 ground truth.

Three notes on the specification. First, task_completed is binary, so this is a linear probability model. For RDD with a binary outcome at the cutoff, the linear probability model is standard practice because local linearity is the identifying assumption either way. Logit at the cutoff is an alternative if you need bounded predictions globally.

Second, the standard errors use cov_type="HC3" to relax the homoskedasticity assumption, which is almost always wrong for binary outcomes.

Third, the dataset has one query per user with no within-user clustering, so cluster-robust standard errors aren't needed here. In a setting with multiple queries per user, you'd cluster on user_id.
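For the multiple-queries-per-user case, the change is confined to the fit call. The sketch below builds toy data with five queries per user and a per-user skill term (both invented for illustration), then fits the same specification with HC3 and with standard errors clustered on user_id.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_users, queries_per_user = 4_000, 5
cutoff, bw = 0.85, 0.10

# Toy data (assumption): each user has a persistent skill level shared by all their queries.
user_id = np.repeat(np.arange(n_users), queries_per_user)
user_skill = np.repeat(rng.normal(0, 0.10, size=n_users), queries_per_user)
confidence = rng.beta(5, 2, size=user_id.size)
premium = (confidence < cutoff).astype(int)
p = np.clip(0.50 + user_skill + 0.06 * premium, 0.01, 0.99)

toy = pd.DataFrame({
    "user_id": user_id,
    "query_confidence": confidence,
    "task_completed": rng.binomial(1, p),
})

near = toy[(toy.query_confidence > cutoff - bw)
           & (toy.query_confidence < cutoff + bw)].copy()
near["below_cutoff"] = (near.query_confidence < cutoff).astype(int)
near["rc"] = near.query_confidence - cutoff

formula = "task_completed ~ below_cutoff + rc + below_cutoff:rc"
hc3 = smf.ols(formula, data=near).fit(cov_type="HC3")
clustered = smf.ols(formula, data=near).fit(
    cov_type="cluster",
    cov_kwds={"groups": near["user_id"].to_numpy()},  # one cluster per user
)

print(f"Effect:        {clustered.params['below_cutoff']:+.4f}")
print(f"SE (HC3):      {hc3.bse['below_cutoff']:.4f}")
print(f"SE (cluster):  {clustered.bse['below_cutoff']:.4f}")
```

The point estimate is identical under both covariance choices. Only the standard error changes, because clustering stops treating repeated queries from one user as independent observations.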

The next diagnostic to look at is the confidence distribution near the cutoff. Figure 2 shows what 50,000 queries look like in the bandwidth window:

Figure 2. Real distribution from the 50,000-user synthetic dataset. Unlike the schematic in Figure 1, this shows the actual query density by confidence score, with the routing threshold annotated. The bottom panel counts how many queries land in each 2-percentage-point bin near the cutoff (2,461 / 2,481 / 2,335 / 2,229 / 2,048 across the 0.80–0.90 range). The roughly uniform spread is the visual signal that no manipulation is concentrating users on one side of the threshold.

Step 2: Try Different Bandwidths

Bandwidth choice matters. Too narrow and you have too few observations, so the confidence interval blows up. Too wide and you're extrapolating into regions where the linear specification is no longer a reasonable local approximation.

The honest move is to try multiple bandwidths and report whether the estimate holds.

results = []
for bw in [0.05, 0.10, 0.15, 0.20]:
    sub = df[(df.query_confidence > cutoff - bw)
             & (df.query_confidence < cutoff + bw)].copy()
    sub["below_cutoff"] = (sub.query_confidence < cutoff).astype(int)
    sub["rc"] = sub.query_confidence - cutoff

    m = smf.ols(
        "task_completed ~ below_cutoff + rc + below_cutoff:rc",
        data=sub,
    ).fit(cov_type="HC3")

    results.append({
        "bandwidth": bw,
        "n": len(sub),
        "effect": m.params["below_cutoff"],
        "se": m.bse["below_cutoff"],
        "p": m.pvalues["below_cutoff"],
    })

print(pd.DataFrame(results).round(4).to_string(index=False))

Expected output:

 bandwidth      n  effect     se       p
      0.05  11554  0.0635  0.0183  0.0005
      0.10  21689  0.0548  0.0131  0.0000
      0.15  29137  0.0618  0.0112  0.0000
      0.20  34074  0.0614  0.0107  0.0000

Here's what's happening: four bandwidths from ±0.05 to ±0.20 around the cutoff, refitting the same RDD specification at each. The estimates range from +0.0548 to +0.0635, all in the same neighborhood as the +0.06 ground truth, with standard errors that shrink as the bandwidth widens and grow as it narrows. Every p-value is well below 0.05. Whether the estimates are "stable" depends on the confidence intervals around them, which Step 5 produces with the bootstrap.

Step 3: Checking for Manipulation at the Threshold

RDD is valid only if users can't precisely manipulate the running variable around the cutoff. If your users (or your system) can nudge confidence scores just below 0.85 to force premium routing, you get a density spike at the cutoff, and the RDD estimate is contaminated.

The standard diagnostic is the McCrary density test, which checks whether the distribution of the running variable has a sharp jump at the cutoff. The simple version: bin the data tightly around 0.85 and check whether the counts on the two sides are similar.

print("User counts in 2-percentage-point bins around 0.85:")
for lo in [0.80, 0.82, 0.84, 0.86, 0.88]:
    hi = lo + 0.02
    cnt = ((df.query_confidence >= lo) & (df.query_confidence < hi)).sum()
    print(f"  [{lo:.2f}, {hi:.2f}):  n = {cnt:,}")

Expected output:

User counts in 2-percentage-point bins around 0.85:
  [0.80, 0.82):  n = 2,461
  [0.82, 0.84):  n = 2,481
  [0.84, 0.86):  n = 2,335
  [0.86, 0.88):  n = 2,229
  [0.88, 0.90):  n = 2,048

Here's what's happening: counts trend gently downward across the window because the Beta(5,2) density peaks at 0.80 and tapers as confidence approaches 1.0. There's no spike or dip at the 0.84–0.86 bin that straddles the cutoff. The 433-user spread between the largest and smallest of the five bins is consistent with smooth tapering of the underlying density.

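That's the pattern you want when manipulation is absent. For a more rigorous test, the rddensity Python package implements a formal density-discontinuity test based on local polynomial density estimation (Cattaneo, Jansson, and Ma), a successor to the original McCrary procedure, with bias-corrected inference.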

What manipulation looks like when it's real: a spike in users at confidences just barely below 0.85 (they're being nudged into premium routing) and a dip just above. If you see that pattern, the RDD estimate overstates the causal effect because the users right below 0.85 differ in motivation from those right above. They cared enough to manipulate the score, and they'd have shown different outcomes even under random routing.
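A crude numeric version of that eyeball check is a binomial test on the counts in a narrow window on each side of the cutoff. It is a rough screen and not the formal density test: it assumes the density is close to flat inside the window, which only holds approximately. The sketch uses simulated scores as a stand-in for df.query_confidence, plus an artificially manipulated copy in which half of the queries just above the cutoff are nudged below it.

```python
import numpy as np
from scipy.stats import binomtest


def density_balance(scores, cutoff=0.85, h=0.01):
    """Counts just below / just above the cutoff and a two-sided binomial p-value."""
    n_below = int(((scores >= cutoff - h) & (scores < cutoff)).sum())
    n_above = int(((scores >= cutoff) & (scores < cutoff + h)).sum())
    pvalue = binomtest(n_below, n_below + n_above, p=0.5).pvalue
    return n_below, n_above, pvalue


rng = np.random.default_rng(11)
clean = rng.beta(5, 2, size=50_000)  # stand-in for df.query_confidence

# Simulated manipulation: push half of the scores in [0.85, 0.86) just below the cutoff.
nudged = clean.copy()
just_above = np.flatnonzero((nudged >= 0.85) & (nudged < 0.86))
nudged[just_above[: just_above.size // 2]] -= 0.01

clean_result = density_balance(clean)
nudged_result = density_balance(nudged)

for label, (n_below, n_above, pvalue) in [("clean", clean_result),
                                          ("manipulated", nudged_result)]:
    print(f"{label:>12}: below = {n_below:,}  above = {n_above:,}  p = {pvalue:.4g}")
```

The manipulated copy produces the spike-and-dip signature described above, and the p-value collapses toward zero.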

Step 4: Quadratic Specification as a Robustness Check

If the true relationship between confidence and task completion isn't exactly linear, a local linear RDD can mistake the curvature for a jump. The standard robustness check allows quadratic terms on both sides of the cutoff and tests whether the estimate holds.

near = df[(df.query_confidence > cutoff - 0.10)
         & (df.query_confidence < cutoff + 0.10)].copy()
near["below_cutoff"] = (near.query_confidence < cutoff).astype(int)
near["rc"] = near.query_confidence - cutoff
near["rc2"] = near.rc ** 2

rdd_quad = smf.ols(
    "task_completed ~ below_cutoff + rc + below_cutoff:rc"
    " + rc2 + below_cutoff:rc2",
    data=near,
).fit(cov_type="HC3")

print(f"Linear RDD    (bw=0.10):  effect = +0.0548, p < 0.0001")
print(f"Quadratic RDD (bw=0.10):  effect = "
      f"{rdd_quad.params['below_cutoff']:+.4f}, "
      f"p = {rdd_quad.pvalues['below_cutoff']:.4f}")

Expected output:

Linear RDD    (bw=0.10):  effect = +0.0548, p < 0.0001
Quadratic RDD (bw=0.10):  effect = +0.0569, p = 0.0036

Here's what's happening: the quadratic specification adds squared terms and interactions with the cutoff indicator, allowing the relationship to curve differently on each side. The below_cutoff coefficient still captures the jump at the threshold, now under a more flexible specification.

The two estimates differ by about 0.002, both sit close to the +0.06 ground truth, and both are significant at p < 0.01. The answer doesn't change when you let the model bend.

When linear and quadratic specifications disagree noticeably, you have a real signal. With small samples (a few thousand at narrow bandwidths), the quadratic version can lose power because the two extra parameters (a squared term on each side of the cutoff) need data to be identified.

The standard move is to widen the bandwidth and re-run both specifications. If they still disagree at wider bandwidths, the linear approximation is wrong, and you should report both numbers.

Step 5: Bootstrap Confidence Intervals

Every point estimate in this article is a single number from a finite sample. The bootstrap quantifies how much that number would move under resampling, which is what a confidence interval describes.

def bootstrap_ci(df, cutoff, bw, quadratic=False, n_reps=500, seed=7):
    rng = np.random.default_rng(seed)
    near = df[(df.query_confidence > cutoff - bw)
              & (df.query_confidence < cutoff + bw)].copy()
    near["below_cutoff"] = (near.query_confidence < cutoff).astype(int)
    near["rc"] = near.query_confidence - cutoff
    if quadratic:
        near["rc2"] = near.rc ** 2
        formula = ("task_completed ~ below_cutoff + rc + below_cutoff:rc"
                   " + rc2 + below_cutoff:rc2")
    else:
        formula = "task_completed ~ below_cutoff + rc + below_cutoff:rc"

    n = len(near)
    estimates = np.empty(n_reps)
    for i in range(n_reps):
        sample = near.iloc[rng.integers(0, n, size=n)]
        m = smf.ols(formula, data=sample).fit()
        estimates[i] = m.params["below_cutoff"]
    return (np.percentile(estimates, 2.5), np.percentile(estimates, 97.5))


print("Linear RDD (bw=0.10):")
lo, hi = bootstrap_ci(df, cutoff, bw=0.10)
print(f"  effect = +0.0548   95% CI: [{lo:+.4f}, {hi:+.4f}]")

print("\nBandwidth sensitivity:")
for bw, eff in [(0.05, 0.0635), (0.10, 0.0548), (0.15, 0.0618), (0.20, 0.0614)]:
    lo, hi = bootstrap_ci(df, cutoff, bw=bw)
    print(f"  bw = {bw:.2f}   effect = {eff:+.4f}   "
          f"95% CI: [{lo:+.4f}, {hi:+.4f}]")

print("\nQuadratic RDD (bw=0.10):")
lo, hi = bootstrap_ci(df, cutoff, bw=0.10, quadratic=True)
print(f"  effect = +0.0569   95% CI: [{lo:+.4f}, {hi:+.4f}]")

Expected output:

Linear RDD (bw=0.10):
  effect = +0.0548   95% CI: [+0.0278, +0.0817]

Bandwidth sensitivity:
  bw = 0.05   effect = +0.0635   95% CI: [+0.0244, +0.0986]
  bw = 0.10   effect = +0.0548   95% CI: [+0.0278, +0.0817]
  bw = 0.15   effect = +0.0618   95% CI: [+0.0381, +0.0823]
  bw = 0.20   effect = +0.0614   95% CI: [+0.0420, +0.0808]

Quadratic RDD (bw=0.10):
  effect = +0.0569   95% CI: [+0.0205, +0.0959]

Here's what's happening: the bootstrap resamples the bandwidth-restricted data with replacement 500 times, refits the RDD on each replicate, and collects the below_cutoff coefficient. The 2.5th and 97.5th percentiles of those 500 estimates form the 95% interval. Every interval covers the +0.06 ground truth, every interval excludes zero, and the bandwidth sweep produces overlapping intervals.

That's quantitative stability, verified by resampling across the full bandwidth range. Intervals widen as the bandwidth shrinks and narrow as it grows. The quadratic interval is wider than the linear one because the two extra curvature parameters absorb degrees of freedom.
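A quick cross-check on the bootstrap is the normal-approximation (Wald) interval, effect ± 1.96 × SE, built from the HC3 standard errors already printed in Step 2. The sketch below only redoes that arithmetic on the numbers copied from the Step 2 table. It doesn't refit anything.

```python
# Point estimates and HC3 standard errors copied from the Step 2 output table.
step2 = {
    0.05: (0.0635, 0.0183),
    0.10: (0.0548, 0.0131),
    0.15: (0.0618, 0.0112),
    0.20: (0.0614, 0.0107),
}
Z_95 = 1.96  # normal critical value for a two-sided 95% interval

wald = {}
for bw, (effect, se) in step2.items():
    wald[bw] = (effect - Z_95 * se, effect + Z_95 * se)
    lo, hi = wald[bw]
    print(f"bw = {bw:.2f}   Wald 95% CI: [{lo:+.4f}, {hi:+.4f}]")
```

At bw = 0.10 this gives roughly [+0.029, +0.080], close to the bootstrap interval of [+0.0278, +0.0817]. When the two methods agree this closely, the sampling distribution is behaving the way the asymptotic formula expects.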

One thing the intervals do NOT do on this dataset: exclude the naïve +0.0632 estimate. That's because the data generator doesn't bake in confounding by query confidence. The only difference between the premium and cheap groups in expectation is the +6pp routing effect itself, so the naïve comparison is close to the truth.

Real systems are messier. In a production setting where unobserved query traits affect both the routing assignment and task completion, the naïve estimate would diverge from the RDD estimate, and the bootstrap intervals would tell you which one to trust.

When Regression Discontinuity Fails

RDD looks clean, but several specific failure modes can destroy the identification. Each one maps to a violation of one of the two named assumptions.

Users manipulate the running variable (violates assumption 1). The whole setup depends on users (or any upstream service) being unable to precisely control which side of the cutoff they land on. Any system that reveals the cutoff and gives users a way to influence their score (a retry mechanism, a prompt engineering workaround, a confidence-inflating trick) breaks RDD.

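Run the density check in Step 3 every time. If you find manipulation, a fuzzy RDD won't fix it, because fuzzy designs handle imperfect compliance with the rule, not sorting around the cutoff. Either re-estimate after dropping the observations closest to the threshold (a donut-hole RDD) and see whether the result survives, or abandon the approach.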

Other policies fire at the same cutoff (violates assumption 2). If your product has additional rules that activate at 0.85 (a separate UI treatment, a different logging level, a different retry policy), RDD can't separate the routing effect from those other policy effects. Audit the full rule book for anything that shares the threshold.

The threshold has noise or overrides (violates assumption 1, in the structural sense). Maybe routing isn't strictly deterministic at 0.85 – it may have random jitter, or a second rule may override the main rule in some cases.

If assignment to the premium model isn't a deterministic function of query_confidence, you have a fuzzy RDD, which requires an instrumental variables framework. The rdrobust package handles both sharp and fuzzy designs.

Curvature masquerading as a jump (breaks the linear approximation that supports identification at the cutoff). Local linear RDD assumes a straight line is a reasonable approximation on each side of the cutoff. When the underlying outcome-confidence relationship is strongly curved, the linear specification can mistake the bend for a jump.

Step 4's quadratic robustness check is the standard diagnostic. If linear and quadratic disagree, widen the bandwidth and re-run both.

Extrapolation bias (a continuity issue, reframed). RDD estimates are strictly local to the cutoff. The +0.06 effect at 0.85 tells you nothing about what premium routing would do for queries with confidence 0.30 or 0.99.

If you want a global average effect, you need a different technique: propensity methods, regression with confounder adjustment, or an actual experiment.

What to Do Next

RDD is the right tool when your AI feature is gated by a continuous score and a sharp threshold.

If your feature is gated by a user-controlled toggle, propensity score methods are a better fit. If it's gated by a staged rollout across workspaces, difference-in-differences handles it. If it's gated by rules you can't observe directly but that have a random component, instrumental variables is the right choice.

For production RDD analyses, use the rdrobust Python package. It gives you data-driven bandwidth selection and robust bias-corrected confidence intervals (Calonico, Cattaneo, and Titiunik 2014), plus a built-in plotting utility. The companion rddensity package implements a formal density-discontinuity test, the rigorous counterpart of the informal bin count in Step 3 and a successor to McCrary's original test.

The from-scratch version in this tutorial shows the mechanics. The rd-packages stack is what you ship to a reviewer.

One thing the LATE doesn't do: tell you the effect for users far from the cutoff. If a +0.06 LATE at 0.85 is enough to keep premium routing in the pipeline, you're done. If you need to know what premium would do for the easy queries you're currently sending to cheap (or the hardest queries near the floor), the next step is a small randomized rollout in those zones, scored against the RDD estimate as a calibration check. Don't generalize the LATE without evidence.

The companion notebook for this tutorial lives here on GitHub. Clone the repo, generate the synthetic dataset, and run rdd_demo.ipynb to reproduce every code block from this tutorial.

Threshold routing is one of the most common patterns in production LLM systems, and every confidence-gated routing decision in your stack is a potential RDD. Run the analysis.

AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)

Mohammed Fahd Abrah — Wed, 06 May 2026 18:13:01 +0000

We use AI tools all the time, whether it’s asking questions, generating images, or getting help with everyday tasks. But most of these tools didn’t appear out of nowhere. They were developed based on research papers where the original ideas were developed and tested.

Now, not everyone enjoys reading research papers or has the time to comb through and digest all that (sometimes very dense) info. So I decided to do the hard work for you and share the key insights in a series of AI paper reviews.

The goal isn’t to turn this into a heavy academic discussion, but to explain the main ideas in a clear and practical way. You'll learn what problem the paper was trying to solve, what approach it introduced, and why it mattered.

In each article, you’ll get a simple breakdown of the paper, how it works, and what you should take away from it. By the end, you should understand the key idea without needing to go through the full research paper yourself.

Paper Overview

The first paper I'll be reviewing is "Improving Language Understanding by Generative Pre-Training", by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.

Here's the actual paper if you want to read it yourself: Read the paper.

And here's a little infographic of what we'll cover here:

Executive Summary
Goals of the Paper
Methodology
Transformer vs. BERT vs. GPT
Model Architecture
Key Techniques
Key Findings
Conclusions
Limitations
Related Work & Context
Final Insight
Resources

Prerequisites

To get the most out of this breakdown, it helps to be familiar with a few basic ideas:

A general understanding of natural language processing (NLP) and how machines work with text
A high-level idea of what a Transformer model is (you don’t need deep details, just the concept)
The difference between supervised and unsupervised learning
Basic machine learning concepts like training data and models

If you’re not fully comfortable with all of these, that’s okay, you can still follow along. The goal here is to keep things clear and intuitive.

Executive Summary

Before models like GPT became what we know today, there was a key limitation: AI systems were good at specific tasks, but struggled with general understanding.

In this paper, the authors introduce a simple but powerful idea. Instead of training a model separately for each task, they first train it on a large amount of unlabeled text to learn the structure of language. Then, they adapt it to specific tasks using smaller labeled datasets.

According to the authors, this two-step approach (pre-training followed by fine-tuning) allows a single model to handle many different tasks with minimal changes.

In practice, this marked a major shift: rather than building a new model for every problem, we can train one general model that learns language itself and then reuse it across tasks.

Goals of the Paper

To understand the motivation behind this work, it helps to look at the main limitations in NLP at the time.

Most models depended heavily on large labeled datasets, which weren’t always available. Many tasks simply didn’t have enough labeled data to train effective systems. On top of that, existing models were usually designed for a single task, making them hard to reuse or adapt.

Because of this, the authors aimed to reduce the reliance on labeled data and move toward a more general approach. Their goal was to build a language model that could learn from large amounts of raw text and then be applied across different tasks.

According to the paper, they also wanted to enable transfer learning: the ability to take knowledge learned from one task and apply it to others. They also wanted to improve performance without needing to redesign a new model each time.

Methodology

To understand how the authors approached this problem, let’s look at the core idea behind their method.

Pre-Training

At the heart of the paper is a simple but powerful approach built in two stages. The first stage is pre-training, where the model learns directly from raw text.

According to the authors, the model is trained on a large corpus of unlabeled text using a language modeling objective: predicting the next token in a sequence given the tokens that precede it. Factoring the probability of a whole sequence into these next-token predictions keeps an otherwise intractable high-dimensional distribution manageable. Through this process, the model gradually learns important aspects of language, such as grammar, context, structure, and general patterns.
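Written out, the pre-training objective from the paper maximizes the log-likelihood of each token given a window of the k tokens before it, where U is the unlabeled corpus and Θ denotes the model parameters:

```latex
% Pre-training objective (standard left-to-right language modeling)
L_1(\mathcal{U}) \;=\; \sum_{i} \log P\!\left(u_i \mid u_{i-k}, \ldots, u_{i-1};\, \Theta\right)
```

Every token in the corpus contributes one term to the sum, which is why no labels are needed: the text supplies its own prediction targets.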

The paper highlights that datasets like BooksCorpus are used in this stage because they contain long, continuous text. This is important, since it helps the model understand relationships across sentences rather than just short fragments.

Fine-Tuning (Adapting to Tasks)

Once the model has learned general language patterns, the next step is fine-tuning, where it is adapted to specific tasks using labeled data.

According to the authors, this includes tasks like question answering, text classification, natural language inference, and semantic similarity. Instead of building a new model for each task, the same pre-trained model is reused with only small adjustments.

In practice, this is what makes the approach powerful: the model already understands language at a general level, so it can quickly adapt to different tasks without needing to be redesigned from scratch.

Transformer vs. BERT vs. GPT

Before diving into GPT-1, it helps to understand how modern language models are structured. Most of them are based on the Transformer architecture, but they use it in different ways: encoder-only models (like BERT), decoder-only models (like GPT), or full encoder–decoder models.

The original encoder–decoder Transformer was mainly used for tasks like machine translation. Encoder-only models are typically used for understanding tasks such as text classification and sentiment analysis, while decoder-only models are designed for generation tasks like text creation, powering systems such as ChatGPT, Gemini, and Claude.

Illustration comparing Transformer, GPT, and BERT architectures, adapted from Comparing Large Language Models: GPT vs. BERT vs. T5 showing encoder-decoder, decoder-only, and encoder-only designs

Transformer vs BERT vs GPT: Key Differences

Aspect	Transformer (Original)	BERT	GPT
Paper	Attention Is All You Need (2017)	BERT (2018)	GPT (2018–2019)
Architecture Type	Encoder + Decoder	Encoder-only	Decoder-only
Primary Goal	Sequence-to-sequence tasks (for example, translation)	Language understanding	Language generation
Training Objective	Predict next token (seq2seq setup)	Masked language modeling (fill in blanks)	Predict next token (autoregressive)
Directionality	Bidirectional (encoder) + left-to-right (decoder)	Fully bidirectional	Left-to-right only
Context Understanding	Strong (via attention)	Very strong (full bidirectional context)	Strong (but only past context)
Input/Output Style	Input → Output sequence	Input → Representation	Input → Generated text
Fine-tuning	Required for each task	Required for each task	Optional (GPT-2+ supports zero-shot)
Typical Tasks	Translation, summarization	Classification, QA, NLI	Text generation, QA, chat
Strength	Flexible architecture foundation	Deep understanding of text	General-purpose generation
Limitation	Not directly usable without adaptation	Cannot generate text naturally	Limited bidirectional context
Key Innovation	Self-attention mechanism	Deep bidirectional encoding	Scaled generative pre-training
Evolution Role	Foundation of all modern LLMs	Specialized understanding models	Path to general-purpose AI

Model Architecture

To support this pre-training and fine-tuning approach, the GPT-1 model is built on a Transformer (decoder) architecture.

According to the authors, this choice is important for a few reasons. Unlike older models such as LSTMs, Transformers handle long-range dependencies more effectively, meaning they can better understand relationships between words that are far apart in a sentence.

They also rely on self-attention, a mechanism that allows the model to focus on the most relevant parts of the text when processing each word. This helps the model capture context more accurately.
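The core computation is small enough to sketch in NumPy. This is a single-head simplification with random weights, meant only to show the mechanics. It omits the multiple heads, layer normalization, and feed-forward layers of the actual model. The causal mask is what makes it a decoder: each position can attend only to itself and earlier positions.

```python
import numpy as np


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a left-to-right mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask future positions: token i may only attend to tokens 0..i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (subtracting the row max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # toy sizes, chosen for illustration
x = rng.normal(size=(seq_len, d_model))      # one embedding vector per token
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(np.round(weights, 2))                  # lower-triangular attention weights
```

Each row of the weight matrix sums to 1 and has zeros to the right of the diagonal, so the first token attends only to itself and the last token attends to the whole sequence.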

Another key advantage is that Transformers make transfer learning more effective, since the same learned representations can be reused across different tasks with minimal changes.

The paper highlights that, in these transfer learning scenarios, Transformers outperform LSTM-based models.

Figure 1 from “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018), showing the Transformer architecture and task-specific input transformations.

Key Techniques

Along with the main approach, the authors introduce a few practical techniques that make the model more flexible across tasks.

According to the paper, different tasks are handled by converting them into text-based formats, so they can all be processed in a similar way. This makes it easier to use the same model across multiple problems without redesigning it each time.
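As a rough illustration of what "converting tasks into text-based formats" can look like, the sketch below builds input sequences for three task types by joining the pieces with special tokens. The token strings (`<s>`, `$`, `<e>`) and the helper names are placeholders chosen for this example, and the code works on plain strings rather than on the tokenized sequences a real model would use.

```python
# Placeholder special tokens (assumed names for illustration only).
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def entailment_input(premise, hypothesis):
    """One sequence: premise and hypothesis separated by a delimiter."""
    return [f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"]

def similarity_inputs(text_a, text_b):
    """Two sequences, one per ordering, because similarity has no natural order."""
    return [
        f"{START} {text_a} {DELIM} {text_b} {EXTRACT}",
        f"{START} {text_b} {DELIM} {text_a} {EXTRACT}",
    ]

def multiple_choice_inputs(context, question, answers):
    """One sequence per candidate answer; the model scores each one."""
    return [
        f"{START} {context} {question} {DELIM} {answer} {EXTRACT}"
        for answer in answers
    ]

print(entailment_input("A man is sleeping.", "A person is resting."))
print(multiple_choice_inputs("The sky was dark.", "What time was it?", ["Night", "Noon"]))
```

The point of the design is that every task ends up as an ordinary token sequence, so the same pre-trained model can process all of them.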

Another important point is that the model requires only minimal architectural changes when switching between tasks. Most of the knowledge learned during pre-training is reused as-is.

The authors also include an auxiliary language modeling objective during fine-tuning, which helps the model retain its general understanding of language while adapting to specific tasks.
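Written as a formula, the combined fine-tuning objective can be sketched as below. Here the symbols are introduced for this summary: the first term is the supervised task loss on the labeled dataset, the second is the language modeling loss on the same data, and the weight controls how much the auxiliary term counts.

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

In this notation, $\mathcal{C}$ is the labeled fine-tuning dataset, $L_2$ is the supervised objective, $L_1$ is the language modeling objective, and $\lambda$ is the weighting hyperparameter.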

Key Findings

After training and evaluation, the results weren't just strong – they were surprisingly competitive.

According to the authors, the model outperformed state-of-the-art systems in 9 out of 12 tasks. It also showed clear absolute improvements, including +8.9 percentage points in commonsense reasoning and +5.7 percentage points in question answering.

Another important observation is that the model performed well across datasets of different sizes, although performance was weaker on some smaller datasets.

This suggests that the pre-training step helped it generalize better, even when labeled data was limited.

In practice, what makes these results significant is that a single model was able to compete with specialized systems that were specifically designed for each individual task.

Figure 2 from “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018), illustrating performance gains from layer transfer and zero-shot learning behavior.

Conclusions

To wrap things up, this paper introduced a major shift in how AI systems are built.

According to the authors, instead of training a new model from scratch for every task, we can first teach a model the structure of language through pre-training, and then adapt it to specific tasks through fine-tuning. This simple idea turns out to be highly effective.

The key takeaway is that language models can develop a general understanding of text, especially when combined with Transformer architectures and large-scale data. This makes transfer learning practical across many different tasks.

In my view, this is what makes the paper so impactful. It doesn’t just improve performance on a few benchmarks. It changes the overall approach to building AI systems.

This idea later became the foundation for models like GPT-2, GPT-3, and ChatGPT, and continues to shape modern large language models today.

Limitations

Like any approach, this method comes with its own limitations.

According to the paper, one of the main challenges is the need for large amounts of unlabeled data during the pre-training stage, which may not always be easy to get. The model’s performance also depends heavily on how well the fine-tuning step is done.

The authors also note that multi-task learning was not fully explored in this work, leaving some open questions about how well the model can handle multiple tasks at the same time.

In practice, another limitation is that performance can be weaker when working with very small datasets, especially if the fine-tuning process is not carefully handled.

To better understand where this paper fits, it helps to look at the ideas it builds on.

According to the authors, earlier approaches such as word embeddings (like Word2Vec and GloVe), LSTM-based language models, and semi-supervised learning had already made progress in understanding language. But these methods were often limited to learning representations at the word level or required more task-specific design.

What this paper does differently is move beyond that. Instead of focusing only on individual words, it learns broader language representations that capture context and meaning across entire sequences. This shift is what enables the model to generalize better across different tasks.

Final Insight

If there’s one idea to take away from this paper, it’s this: you don’t need to teach an AI system every task separately.

According to the authors, once a model learns the structure of language, it can adapt to a wide range of tasks with minimal changes. That shift – from task-specific models to general language understanding – is what makes this work so important.

In my view, this is the moment where things really changed. What started here with GPT-1 became the foundation for the systems we use today, including ChatGPT and other modern language models.

Data Science Insights: Why the Mean Lies When Handling Messy Retail Data

Rakshath Naik — Tue, 05 May 2026 16:59:17 +0000

In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on.

Let's take the case of a retail shop. If we're looking at the average order value to understand customer spending, we'd load the data, run the code, and get a result of $20 per order.

Done.

Except something looks odd.

When we take a closer look, we see that most customers are buying items worth $8 - $15. So where's $20 coming from?

Here, the problem isn't the data. It's the average. This is a classic trap: the mean behaves well on tidy textbook examples, but real-world data rarely behaves that nicely.

Some customers buy in bulk (very large orders), some return orders (negative quantities), and a few anomalies distort the entire picture.

In this article, we'll use the Online Retail Dataset to answer a simple but tricky question: What does “average” really mean in the real world?

Prerequisites
The Dataset
Mean: The Sensitive Giant
Median: The Robust Middle
Beyond Averages: Understanding Spread with Quartiles
Applying IQR to Our Dataset
Final Comparison and Insights
Conclusion
Connect with me

Prerequisites

To follow along here, you'll need:

Basic Python knowledge: Understanding of variables and functions.

The Pandas library: Familiarity with loading data and basic DataFrame operations.

A development environment: Access to a tool like Jupyter Notebook, VS Code, or Google Colab.

A Dataset: For this analysis, I used the Online Retail Dataset, which is available for download from the UCI Machine Learning Repository.

The Dataset

We'll work with the Online Retail Dataset, a real-world transactional dataset containing purchase records from a UK-based online retail store.

Source: UCI Machine Learning Repository
Collected by: UK-based online retail company (2010–2011)
Size: 541,909 transactions
Features: 8 attributes (InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country)
Ownership: Public dataset hosted by UCI
License: Open for research and educational use

Mean: The Sensitive Giant

In statistics and data analysis, the terms "average" and "arithmetic mean" are often used interchangeably. Here, we want the mean total price per transaction, where the total price is the quantity multiplied by the unit price. For the Online Retail Dataset, that mean is:

$$\text{Average Order Value} = \frac{\text{Sum of all TotalPrice values}}{\text{Number of transactions}}$$

In our dataset, the mean is calculated by summing all transaction values (including bulk purchases and returns) and dividing by the total number of transactions. This means every value, irrespective of unusually high or any negative values, directly influences the final average.

import pandas as pd

# Load the dataset (reading .xlsx files requires the openpyxl package)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
df = pd.read_excel(url, engine='openpyxl')

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate the Mean (Average Order Value)
mean_value = df['TotalPrice'].mean()
print(f"Average Order Value (Mean): {mean_value:.2f}")

The results are as follows:

Average Order Value (Mean): 20.40

At first glance, the result may look reasonable: every transaction contributes equally. But that's where the problem lies. A few extremely high or extremely low transactions can shift the mean away from the range where most customers actually sit.

Take a look at the graph for the mean below.

The graph shows the mean Total Price for the Online Retail Dataset. We get a mean of 20.40. (Image by Author)

The graph shows a right-skewed distribution in which the calculated mean of 20.40 is misleading. The tallest bar shows that the majority of transactions lie in the $8 - $15 range, but the red line is dragged to the right by the long tail of high-value bulk orders from a few customers.

In this scenario, the average price is well above what a typical customer actually spends because it's highly sensitive to outliers – and in reality, the bulk of the data lives in the lower price range.

In simple words, the mean is being pulled by some extreme values to the right, especially by some lying in the range of 200–300, which is noticeable in the graph.

Median: The Robust Middle

When the mean is distorted by extreme values, we need a metric that remains unaffected by such outliers. This is where the median comes into play.

Median is defined as the middle value after sorting the data.

In our dataset, we sort all the transactions and pick the middle one.

The formula for calculating the median is:

$$\text{Median} = \begin{cases} X_{\left[ \frac{n+1}{2} \right]} & \text{if } n \text{ is odd} \\ \frac{X_{\left[ \frac{n}{2} \right]} + X_{\left[ \frac{n}{2} + 1 \right]}}{2} & \text{if } n \text{ is even} \end{cases}$$

Unlike the mean, the median doesn't depend on extreme values, and it cares only about the position of the data, not the magnitude.
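A tiny example makes the contrast visible before we run it on the real data. The order values below are made up for illustration and are not taken from the Online Retail Dataset.

```python
from statistics import mean, median

# Hypothetical everyday orders, then the same list with one bulk purchase added.
typical_orders = [8, 10, 11, 12, 15]
with_bulk_order = typical_orders + [300]

for label, orders in [("typical", typical_orders), ("with bulk", with_bulk_order)]:
    print(f"{label:>9}: mean={mean(orders):.2f}  median={median(orders):.2f}")
```

A single bulk order moves the mean from 11.20 to 59.33, while the median only moves from 11 to 11.5.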

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate only the Median
median_value = df['TotalPrice'].median()
print(f"Typical Order Value (Median): {median_value:.2f}")

The results are as follows:

Typical Order Value (Median): 11.10

Now you'll notice that the result lies in the $8 - $15 range, where most of the transactions lie.

The graph shows the median Total Price for the Online Retail Dataset. The median of 11.10 sits in the range where most customer transactions lie. (Image by Author)

In the previous graph, the mean was pulled to the right by large orders, but the median just asks what the middle customer spends. So even if someone spends $300 or some transactions are negative, the median stays stable.

In the above figure the median graph accurately highlights the range where most of the customers lie.

Beyond Averages: Understanding Spread with Quartiles

So far, we've studied the median, but knowing the center is not enough.

To truly understand customer spending, we need to understand how the data is spread, and this is where quartiles come into play.

Quartiles divide the dataset into the following parts:

Q1 (25th percentile): 25% of transactions are below this.
Q2 (50th percentile): Median
Q3 (75th percentile): 75% of transactions are below this.

The distance between Q1 and Q3 is called the Interquartile Range (IQR):

$$IQR = Q_3 - Q_1$$

The IQR: Detecting Outliers

The IQR measures the spread of the middle 50%.

If the IQR is small, then the data is concentrated. If it's large, the data is spread out. The IQR also helps us identify outliers mathematically.

Outlier Rule:

Lower Bound = Q1 - 1.5 * IQR
Upper Bound = Q3 + 1.5 * IQR

A Simple Example to Understand IQR

Consider the following transaction values:

$$\left[ 5, 8, 10, 12, 15, 18, 20 \right]$$

Step 1: Find the Median (Q2):

The middle value is:

$$Q_2 = 12$$

Step 2: Find Q1 (Lower Quartile):

The lower half is [5, 8, 10]. The median of the lower half is:

$$Q_1 = 8$$

Step 3: Find Q3 (Upper Quartile):

The upper half is [15, 18, 20]. The median of the upper half is:

$$Q_3 = 18$$

Step 4: Calculate IQR:

$$IQR = Q_3 - Q_1 = 18 - 8 = 10$$

Step 5: Find Outlier Bounds:

$$\begin{aligned} \text{Lower Bound} &= Q_1 - 1.5 \times IQR = 8 - 15 = -7 \\ \text{Upper Bound} &= Q_3 + 1.5 \times IQR = 18 + 15 = 33 \end{aligned}$$

Any value below -7 or above 33 is an outlier (but in this demo problem, no outliers exist).
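The steps above can be checked in a few lines of Python. This is a sketch using the "median of each half" rule from the worked example; the helper name `quartiles_median_of_halves` is made up for this illustration. It also computes the quartiles with linear interpolation, the rule pandas `quantile` uses by default, here through the standard library's `statistics.quantiles` with `method="inclusive"` as a stand-in.

```python
from statistics import median, quantiles

values = sorted([5, 8, 10, 12, 15, 18, 20])

def quartiles_median_of_halves(sorted_values):
    """Q1 and Q3 as medians of the lower and upper halves (middle value excluded when n is odd)."""
    n = len(sorted_values)
    half = n // 2
    lower = sorted_values[:half]
    upper = sorted_values[half + 1:] if n % 2 else sorted_values[half:]
    return median(lower), median(sorted_values), median(upper)

q1, q2, q3 = quartiles_median_of_halves(values)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(f"Median of halves: Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"Bounds: {lower_bound} to {upper_bound}")

# Linear interpolation between neighboring values.
q1_lin, q2_lin, q3_lin = quantiles(values, n=4, method="inclusive")
print(f"Linear interpolation: Q1={q1_lin}, Q2={q2_lin}, Q3={q3_lin}")
```

On this small sample the two rules disagree: the hand method gives Q1 = 8 and Q3 = 18, while linear interpolation gives Q1 = 9.0 and Q3 = 16.5. The gap shrinks on large datasets, but it explains why hand-computed quartiles may not match library output exactly.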

Applying IQR to Our Dataset

In our retail dataset, instead of neat values, we have bulk values and even negative returns.

# 1. Calculate IQR and Bounds
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

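lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# 2. Count the transactions that fall outside the bounds
outliers = df[(df['TotalPrice'] < lower_bound) | (df['TotalPrice'] > upper_bound)]

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
print(f"Number of Outliers: {len(outliers)}")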

When we calculate IQR for our dataset, we get:

Lower Bound: -18.75
Upper Bound: 42.45
Number of Outliers: 33180

The graph demonstrates outliers, which are any values falling outside the range of -18.75 to 42.45. (Image by Author)

As the graph shows, the values outside the range -18.75 to 42.45 are considered outliers. These values will be removed.

Revisiting the Mean After Removing Outliers

Using the IQR method, we can now remove the extreme transactions that fall outside the typical spending range and recompute the mean.

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Original Mean
mean_value = df['TotalPrice'].mean()
print(f"Original Mean: {mean_value:.2f}")

# IQR Calculation
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")

# Remove Outliers
df_no_outliers = df[(df['TotalPrice'] >= lower_bound) & (df['TotalPrice'] <= upper_bound)]

# New Mean after removing outliers
new_mean = df_no_outliers['TotalPrice'].mean()
print(f"Mean after removing outliers: {new_mean:.2f}")

After recomputing, we get:

Original Mean: 20.40
Lower Bound: -18.75
Upper Bound: 42.45
Mean after removing outliers: 11.63

Removing outliers significantly shifts the mean toward the region where most transactions occur. We now have a much better mean of 11.63 as opposed to the right-stretched mean of 20.40 we got with outliers.

Final Comparison and Insights

Looking at the results from all the graphs, we get a complete understanding of the dataset. The original mean was 20.40, which is significantly higher than the value of most transactions. The mean was pulled upward by a small number of high-value transactions, so it was distorted by outliers.

The median, on the other hand, was 11.10, which lies within the range where most transactions are concentrated. This shows that the median is a much better representation of what a typical customer spends, as it's not affected by extreme values.

After removing the outliers using the IQR, the mean dropped to 11.63, bringing it very close to the median. This confirms that the earlier mean was not inherently wrong, but was simply influenced by extreme values in the data. Once those values were handled, the mean became a much more reliable measure of central tendency.

Conclusion

The results show that the mean can be misleading when data contains outliers. In our dataset, the original mean of 20.40 overstated customer spending, while the median (11.10) gave a more realistic picture. After removing outliers, the mean shifted to 11.63, aligning closely with the median.

This highlights a key lesson: The mean isn't wrong, but it must be used with an understanding of the data.

Choosing the right measure of average depends on the dataset, and in messy real-world scenarios, the median or a cleaned mean often tells the true story.

Connect with me

If you want to dive deeper, you can visit: Mean vs Median vs Mode: Understanding Central Tendency in Data Analysis.

Product Experimentation with Propensity Scores: Causal Inference for LLM-Based Features in Python

Rudrendu Paul — Thu, 30 Apr 2026 23:01:26 +0000

Every product experimentation team running causal inference on LLM-based features eventually hits the same wall: when users click "Try our AI assistant," the volunteers aren't a random sample.

Your product shipped a new agent mode last quarter. Users have to tap the "Try agent mode" toggle to enable it. The dashboard numbers look stunning: agent-mode users complete 21 percentage points more tasks than non-users. The CPO calls it the best feature launch of the year.

But you know something's off. Heavy-engagement users opt into new features constantly, while light users ignore toggles entirely. That 21-point gap measures the agent's effect combined with the pre-existing gap between power users and the rest of your base.

This is the Opt-In Trap. It shows up in every generative AI product that ships features behind a user-controlled toggle: "Try our AI assistant," "Enable smart replies," "Turn on code suggestions." Users who click to opt in differ systematically from those who scroll past. Any naïve comparison between the two groups collapses the feature's causal effect into whatever made those users opt in in the first place.

Running an AI feature behind a toggle is a product experiment. The hypothesis: the feature improves outcomes for users who adopt it.

Unlike an A/B test, where the coin flip creates two otherwise-identical populations, the toggle creates two populations that differ before they even make a choice. That pre-existing difference is the measurement problem, and a t-test on dashboard numbers can't fix it.

Propensity score methods are statistical tools that data scientists use to separate adoption bias from the feature's actual effect. They reweight (or rematch) your comparison so that opted-in and non-opted-in groups look comparable on observable characteristics, approximating what a randomized experiment would have given you.

This tutorial walks through the full pipeline (propensity estimation, inverse-probability weighting, nearest-neighbor matching, balance diagnostics, and bootstrap confidence intervals) on a 50,000-user synthetic SaaS dataset where the ground-truth causal effect is known. You'll estimate it, quantify uncertainty, and see where the approach silently breaks.

Companion code: every code block runs end-to-end in the companion notebook at github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/02_propensity_opt_in. The notebook (psm_demo.ipynb) has all outputs pre-executed, so you can read along on GitHub before running anything locally.

Why Opt-in Features Break Naïve Comparisons
What Propensity Scores Actually Do
Prerequisites
Setting Up the Working Example
Step 1: Estimate the Propensity Score
Step 2: Inverse-Probability Weighting
Step 3: Nearest-Neighbor Matching
Step 4: Check Covariate Balance
Step 5: Bootstrap Confidence Intervals
When Propensity Score Methods Fail
What to Do Next

Why Opt-in Features Break Naïve Comparisons

The math of an A/B test is elegant because of one assumption: treatment is assigned independent of everything else. Flip a coin: half your users get agent mode, and the coin flip breaks every possible confound by construction. The opt-in world has no coin.

Three mechanisms make opt-in comparisons misleading.

1. Selection on engagement

Power users click everything. If your heavy-engagement cohort opts into agent mode at 65 percent and your light-engagement cohort opts in at 12 percent, you've stacked the opt-in group with users who were going to complete more tasks anyway.

That compositional imbalance accounts for most of the observed lift on its own, before the agent does any work.

2. Selection on intent

Users who opt into a new feature often have a specific use case in mind. A developer who clicks "Try code suggestions" already has code to write. That user would have shown higher task completion even with the control UI.

3. Selection on risk tolerance

Early adopters tolerate rough edges. A user who clicks "Try beta" and sees slow latency sticks around, but a risk-averse user bounces.

Your opt-in group is enriched for people willing to put up with bad experiences, which affects every downstream metric you might measure.

All three produce the same symptom: a raw comparison of opted-in users against everyone else that can overstate the feature's causal effect by 2x or more, depending on how concentrated opt-in is among your heaviest users.

On the synthetic dataset in this tutorial, the naïve comparison inflates a true +8pp effect to +21pp, a 2.6x overshoot. Propensity score methods exist to correct this.
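To see how selection on engagement alone can manufacture a large gap, here is a toy calculation. It reuses the opt-in rates quoted above (65, 35, and 12 percent) and the +8pp true effect, but the equal tier sizes and the baseline completion rates are hypothetical numbers chosen for illustration. They are not the values used by the tutorial's data generator.

```python
# tier: (share of users, opt-in rate, baseline completion rate without the feature)
# Shares and baseline rates are hypothetical; opt-in rates follow the article.
tiers = {
    "heavy":  (1 / 3, 0.65, 0.70),
    "medium": (1 / 3, 0.35, 0.50),
    "light":  (1 / 3, 0.12, 0.30),
}
TRUE_EFFECT = 0.08  # the feature adds 8 points for anyone who uses it

# Mass of users in each group, then the completion rate inside each group.
treated_mass = sum(share * p for share, p, _ in tiers.values())
control_mass = sum(share * (1 - p) for share, p, _ in tiers.values())

treated_mean = sum(
    share * p * (base + TRUE_EFFECT) for share, p, base in tiers.values()
) / treated_mass
control_mean = sum(
    share * (1 - p) * base for share, p, base in tiers.values()
) / control_mass

naive_gap = treated_mean - control_mean
print(f"True effect: {TRUE_EFFECT:+.3f}")
print(f"Naive gap:   {naive_gap:+.3f}")
print(f"Selection bias: {naive_gap - TRUE_EFFECT:+.3f}")
```

With these made-up inputs, the naive gap comes out near +0.231 even though the feature adds only +0.08 for every user. The rest is the difference in who opts in.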

What Propensity Scores Actually Do

Figure 1: Schematic propensity score distributions for two hypothetical groups. The opted-in group (red) skews toward higher propensities, while the non-opted-in group (blue) skews lower.

In the above figure, the bracketed strip below the x-axis splits the score range into three zones: a control-heavy region at low propensities where few treated users exist, a region of common support in the middle where both groups are well represented, and a treatment-heavy region at high propensities where few controls exist. Propensity score methods operate within the common-support region by reweighting or rematching so that the two groups appear balanced on observables. The extremes are either trimmed out or handled with caution.

The propensity score is the probability that a user opts in given their observable characteristics. Estimate this probability well, and you can use it to reweight your sample so that opted-in and non-opted-in users look similar on observables, just as they would have if opt-in had been randomized.

Two practical strategies use the propensity score:

Inverse-probability weighting (IPW) assigns each user a weight equal to the inverse of their probability of receiving the treatment they actually received. Opted-in users get weighted by 1/P(opt-in). Non-opted-in users get weighted by 1/P(no opt-in). After weighting, the two groups are balanced on observables, and the weighted difference in outcomes approximates the average treatment effect.
Matching pairs each opted-in user with one or more non-opted-in users who have similar propensity scores. The average outcome difference between matched pairs estimates the average treatment effect on the treated (ATT): what opt-in users actually gained by opting in.

Both methods rest on three identification assumptions working together.

First, unconfoundedness: every variable that drives opt-in and affects the outcome is observed and included in your propensity model.
Second, overlap (also called positivity): every user has some nonzero probability of opting in and some nonzero probability of staying out.
Third, no interference: one user's opt-in decision does not affect another user's outcome (the stable-unit-treatment-value assumption, or SUTVA).

Violate any one of these and the estimate is biased even when the other two hold. The failure modes at the end of this tutorial walk through each one.
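For readers who prefer notation, the three assumptions can be sketched in potential-outcomes form. The symbols are introduced here for illustration and are not used elsewhere in this tutorial: $T_i$ is user $i$'s opt-in indicator, $X_i$ their observed covariates, $Y_i(1)$ and $Y_i(0)$ their outcomes with and without the feature, and $e(X_i)$ the propensity score.

$$\begin{aligned} \big(Y_i(0), Y_i(1)\big) \perp T_i \mid X_i & \quad \text{(unconfoundedness)} \\ 0 < e(X_i) < 1, \quad e(X_i) = P(T_i = 1 \mid X_i) & \quad \text{(overlap)} \\ Y_i = Y_i(T_i) & \quad \text{(no interference)} \end{aligned}$$

The last line says that a user's observed outcome depends only on their own opt-in status, not on anyone else's.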

Prerequisites

You'll need Python 3.11 or newer, comfort with pandas and scikit-learn, and rough familiarity with logistic regression.

Install the packages for this tutorial:

pip install numpy pandas scikit-learn matplotlib

Here's what's happening: four packages cover the full pipeline. Pandas loads the data, NumPy handles weights and array arithmetic, scikit-learn fits the propensity model and runs nearest-neighbor matching, and matplotlib renders the overlap diagnostic.

Clone the companion repo to get the synthetic dataset:

git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv

Here's what's happening: the clone pulls the companion repo, and generate_data.py produces the shared synthetic dataset used across the series. Seed 42 keeps the dataset reproducible, and 50,000 users give clean signal for every estimator in this tutorial. The output CSV lands at data/synthetic_llm_logs.csv.

Setting Up the Working Example

The synthetic dataset simulates a SaaS product where users can opt into an agent mode that uses a more expensive model. With fifty thousand users, opt-in rates differ sharply by engagement tier: heavy users opt in at 65 percent, medium users at 35 percent, and light users at 12 percent.

The ground-truth causal effect baked into the data generator is +8 percentage points on task completion for users who opted in. The naive comparison inflates this to around +21 percentage points because selection bias stacks the opted-in group with your most engaged users.

Knowing the ground truth is what lets you verify that your propensity score method recovers it.

Load the data and see the selection problem:

import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")

print(df.groupby("engagement_tier").opt_in_agent_mode.mean().round(3))

naive_effect = (
    df[df.opt_in_agent_mode == 1].task_completed.mean()
    - df[df.opt_in_agent_mode == 0].task_completed.mean()
)
print(f"\nNaive opt-in effect: {naive_effect:+.4f}")

Expected output:

engagement_tier
heavy     0.647
light     0.120
medium    0.353
Name: opt_in_agent_mode, dtype: float64

Naive opt-in effect: +0.2106

Here's what's happening: you load 50,000 rows, group by engagement tier, and print the opt-in rate inside each group. Heavy users opt in far more than light users, which is the selection-on-engagement pattern baked into the data. The naïve effect lands at +0.2106 (21 percentage points), about 2.6 times the ground truth of +0.08. That gap is exactly what propensity score methods have to remove.

Step 1: Estimate the Propensity Score

The propensity score is the output of a model that predicts opt-in from observable characteristics. Logistic regression is a good starting point because it's interpretable and fast, but watch the balance diagnostics in Step 4: if any weighted SMD stays above 0.1, the logistic model is probably missing an interaction or a nonlinear term, and a more flexible model such as gradient boosting is the next move.

For this dataset, the relevant observables are engagement tier and query confidence. In a real product, you'd include every variable you think drives opt-in: device type, tenure, plan tier, and historical usage patterns.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X = pd.get_dummies(
    df[["engagement_tier", "query_confidence"]],
    drop_first=True
).astype(float)
y_treat = df.opt_in_agent_mode

ps_model = LogisticRegression(max_iter=1000).fit(X, y_treat)
df["propensity"] = ps_model.predict_proba(X)[:, 1]

# Basic sanity checks
print(df.groupby("engagement_tier").propensity.mean().round(3))
print(
    f"\nPropensity range (treated):  "
    f"{df[df.opt_in_agent_mode == 1].propensity.min():.3f} - "
    f"{df[df.opt_in_agent_mode == 1].propensity.max():.3f}"
)
print(
    f"Propensity range (control):  "
    f"{df[df.opt_in_agent_mode == 0].propensity.min():.3f} - "
    f"{df[df.opt_in_agent_mode == 0].propensity.max():.3f}"
)
print(f"Propensity model AUC: {roc_auc_score(y_treat, df.propensity):.3f}")

Expected output:

engagement_tier
heavy     0.646
light     0.120
medium    0.353
Name: propensity, dtype: float64

Propensity range (treated):  0.114 - 0.675
Propensity range (control):  0.114 - 0.673
Propensity model AUC: 0.744

Here's what's happening: you encode the engagement tier as dummy variables, keep query confidence continuous, and fit a logistic regression model. The predicted probability from the model is each user's propensity score.

Scikit-learn's LogisticRegression applies L2 regularization by default (C=1.0), which shrinks the coefficients toward zero and so pulls the fitted propensities slightly toward the overall opt-in rate. For production use, you can set penalty=None if you want an unregularized fit.

Mean propensity inside each engagement tier recovers the true opt-in rate for that tier almost exactly, so the model is calibrated. The AUC of 0.744 confirms the model discriminates between opt-ins and non-opt-ins well above chance (0.5).

And the propensity ranges overlap between treated and control groups (both span roughly 0.11 to 0.67), which is the visual overlap condition.

Figure 2: Two views of the same positivity check on the 50,000-user synthetic dataset used in this tutorial.

In the figure above, the top panel plots smooth kernel density curves of the fitted propensity scores for each group. The three peaks align with the three engagement tiers (light at p ≈ 0.12, medium at p ≈ 0.35, heavy at p ≈ 0.65), as expected, because the opt-in rate is tier-driven. The bottom panel translates that same distribution into raw counts per tier: every tier contains thousands of both opted-in and non-opted-in users, which is exactly what positivity requires.

Where Figure 1 schematically illustrated the idea, this figure shows that it holds for the data, so the weighting and matching that follow will have real counterfactuals to work with.
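If you want a numeric version of the overlap check, one simple convention is to take the range of propensity scores covered by both groups and flag users who fall outside it. The sketch below assumes that min/max rule, which is only one of several ways to define common support. The helper name `common_support` and the toy scores are hypothetical.

```python
import numpy as np

def common_support(propensity, treated):
    """Return (low, high, inside_mask) for the score range covered by both groups."""
    propensity = np.asarray(propensity, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    # The overlap region starts at the larger of the two minimums
    # and ends at the smaller of the two maximums.
    low = max(propensity[treated].min(), propensity[~treated].min())
    high = min(propensity[treated].max(), propensity[~treated].max())
    inside = (propensity >= low) & (propensity <= high)
    return low, high, inside

# Hypothetical scores: four controls followed by four treated users.
ps = np.array([0.05, 0.12, 0.35, 0.60, 0.12, 0.35, 0.66, 0.90])
t = np.array([0, 0, 0, 0, 1, 1, 1, 1])

low, high, inside = common_support(ps, t)
print(f"Common support: {low:.2f} to {high:.2f}")
print(f"Users outside common support: {(~inside).sum()} of {len(ps)}")
```

On the tutorial's dataset, the treated and control ranges printed in Step 1 are nearly identical, so a check like this would flag almost no one. It earns its keep on data where one group reaches propensities the other never does.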

Step 2: Inverse-Probability Weighting

IPW assigns each user a weight inversely proportional to their propensity. An opted-in user with a 0.12 propensity is rare (a light user who still opted in despite low engagement) and carries information about 1 / 0.12 ≈ 8 similar users in the population. A control user with a 0.12 propensity is the expected case for light users who stayed out, so they're common and get a weight of 1 / (1 - 0.12) ≈ 1.14.

import numpy as np

# ATE weights: 1/P(treat) for treated, 1/P(no treat) for control
df["ipw"] = np.where(
    df.opt_in_agent_mode == 1,
    1 / df.propensity,
    1 / (1 - df.propensity)
)

t = df[df.opt_in_agent_mode == 1]
c = df[df.opt_in_agent_mode == 0]
ate_ipw = (
    (t.task_completed * t.ipw).sum() / t.ipw.sum()
    - (c.task_completed * c.ipw).sum() / c.ipw.sum()
)
print(f"IPW average treatment effect (ATE): {ate_ipw:+.4f}")

# ATT: what opt-in users actually gained
df["ipw_att"] = np.where(
    df.opt_in_agent_mode == 1,
    1,
    df.propensity / (1 - df.propensity)
)
t = df[df.opt_in_agent_mode == 1]   # re-slice now that ipw_att is in df
c = df[df.opt_in_agent_mode == 0]
treated_mean = t.task_completed.mean()
control_w_mean = (c.task_completed * c.ipw_att).sum() / c.ipw_att.sum()
att_ipw = treated_mean - control_w_mean
print(f"IPW average treatment effect on treated (ATT): {att_ipw:+.4f}")

Expected output:

IPW average treatment effect (ATE): +0.0851
IPW average treatment effect on treated (ATT): +0.0770

Here's what's happening: first, you compute ATE weights for every user and take the weighted difference in task completion between opted-in and non-opted-in groups. Then you compute ATT weights, which reweight only the control group to match the treated group's covariate distribution, and compute the average treatment effect on the treated.

ATE answers the population question: what's the effect on a random user who might or might not have opted in anyway? ATT answers the user question: what did opt-in users actually gain? On this dataset, ATE lands at +0.0851 and ATT at +0.0770, both close to the ground-truth +0.08 and a massive improvement over the naïve +0.2106.

The distinction matters in practice. Deciding whether to roll the feature out to users who haven't opted in calls for ATE. Reporting on the value opt-in users captured calls for ATT.

Step 3: Nearest-Neighbor Matching

Matching takes a different approach: pair each opted-in user with the non-opted-in user whose propensity score is closest, then take the average outcome difference across matched pairs. The result estimates ATT.

from sklearn.neighbors import NearestNeighbors

treated_ps = df[df.opt_in_agent_mode == 1][["propensity"]].values
control_ps = df[df.opt_in_agent_mode == 0][["propensity"]].values

nn = NearestNeighbors(n_neighbors=1).fit(control_ps)
_, idx = nn.kneighbors(treated_ps)

treated_outcomes = df[df.opt_in_agent_mode == 1].task_completed.values
matched_control_outcomes = (
    df[df.opt_in_agent_mode == 0].task_completed.values[idx.flatten()]
)

att_match = (treated_outcomes - matched_control_outcomes).mean()
print(f"1-NN matching ATT: {att_match:+.4f}")

Expected output:

1-NN matching ATT: +0.0752

Here's what's happening: you extract propensity scores for each group, fit a nearest-neighbor index on the control group, and find the single closest control user for every treated user.

The NearestNeighbors index allows the same control user to be selected as the match for multiple treated users, so this is a matching-with-replacement case.

You pull the outcomes for each treated user and their matched control, take the difference per pair, and average across pairs. The result estimates what opt-in users gained compared to very similar users who did not opt in.

The +0.0752 result lands close to the ground truth of +0.08 and slightly below the IPW ATT of +0.0770. A gap that small is within the noise you should expect from 1-NN matching, because a single nearest neighbor is a high-variance estimator.

Two variants are worth knowing. Matching with replacement (what you just ran) allows a single control user to serve as a match for multiple treated users, reducing bias when good matches are scarce but inflating variance.

Matching without replacement assigns each control user to at most one treated user, which keeps variance lower but forces poor-quality pairings when the treated group dwarfs the available controls.

For most production analyses, k-nearest-neighbor matching with k = 3-5 and replacement is a sensible default.
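As a rough sketch of that default, the snippet below averages the outcomes of the k closest controls for each treated user, with replacement. It runs on a small simulated dataset rather than the tutorial's df, and it swaps scikit-learn's NearestNeighbors for a brute-force NumPy distance matrix so the block is self-contained. The function name knn_match_att and the simulated propensities are illustrative assumptions, not part of the companion notebook.

```python
import numpy as np

def knn_match_att(ps_treated, y_treated, ps_control, y_control, k=3):
    """ATT via k-nearest-neighbor matching on the propensity score, with replacement."""
    # Pairwise |propensity gap| between each treated and each control user
    dist = np.abs(ps_treated[:, None] - ps_control[None, :])
    # Indices of the k closest controls per treated user (order within the k is irrelevant)
    nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
    # Average the k matched control outcomes, then difference against the treated outcome
    matched_mean = y_control[nearest].mean(axis=1)
    return float((y_treated - matched_mean).mean())

# Simulated data (hypothetical): a binary "heavy user" flag drives opt-in and the outcome
rng = np.random.default_rng(0)
n = 4000
heavy = rng.random(n) < 0.4
true_ps = np.where(heavy, 0.7, 0.2)
treated = rng.random(n) < true_ps
outcome = (rng.random(n) < 0.30 + 0.25 * heavy + 0.08 * treated).astype(float)
# Small jitter so propensities are not all tied (stands in for a continuous covariate)
ps = np.clip(true_ps + rng.normal(0, 0.02, n), 0.01, 0.99)

att_k1 = knn_match_att(ps[treated], outcome[treated], ps[~treated], outcome[~treated], k=1)
att_k3 = knn_match_att(ps[treated], outcome[treated], ps[~treated], outcome[~treated], k=3)
print(f"1-NN ATT: {att_k1:+.4f}   3-NN ATT: {att_k3:+.4f}")
```

Averaging over several neighbors trades a little bias (the third-closest control is a slightly worse match) for lower variance. The brute-force distance matrix is fine for a few thousand users; on a dataset the size of this tutorial's, an indexed search such as NearestNeighbors(n_neighbors=k) is the more practical choice.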

Step 4: Check Covariate Balance

Propensity score methods work only if they actually balance the covariates between groups. You need to verify that they did, because if the balance fails, your estimate is wrong.

The standard diagnostic is the standardized mean difference (SMD) for each covariate. SMD compares the treated group mean to the control group mean, divided by the pooled standard deviation.

Before weighting, SMDs tell you how imbalanced the raw groups are. After weighting, they should be small (|SMD| < 0.1 is the conventional cutoff).
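Written out, the quantity the helper below computes is the following (a sketch of this tutorial's version, which pools the two unweighted group variances; some references weight the variance terms as well):

$$\mathrm{SMD} = \frac{\bar{x}_{T} - \bar{x}_{C}}{\sqrt{\left(s_{T}^{2} + s_{C}^{2}\right)/2}}$$

Here x̄_T and x̄_C are the (optionally weighted) covariate means in the treated and control groups, and s²_T and s²_C are the corresponding group variances.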

def smd(treated_vals, control_vals, treated_w=None, control_w=None):
    """Standardized mean difference, optionally with weights."""
    if treated_w is None:
        treated_w = np.ones(len(treated_vals))
    if control_w is None:
        control_w = np.ones(len(control_vals))
    t_mean = np.average(treated_vals, weights=treated_w)
    c_mean = np.average(control_vals, weights=control_w)
    pooled_std = np.sqrt((treated_vals.var() + control_vals.var()) / 2)
    return (t_mean - c_mean) / pooled_std

engagement_heavy = (df.engagement_tier == "heavy").astype(float).values
qc = df.query_confidence.values
tr = (df.opt_in_agent_mode == 1).values

covariates = {
    "engagement_tier_heavy": engagement_heavy,
    "query_confidence": qc,
}

print(f"{'Covariate':<30} {'Raw SMD':>10} {'Weighted SMD':>15}")
for name, vals in covariates.items():
    smd_raw = smd(vals[tr], vals[~tr])
    smd_weighted = smd(
        vals[tr], vals[~tr],
        treated_w=df[tr].ipw.values,
        control_w=df[~tr].ipw.values,
    )
    print(f"{name:<30} {smd_raw:>+10.3f} {smd_weighted:>+15.3f}")

Expected output:

Covariate                         Raw SMD    Weighted SMD
engagement_tier_heavy              +0.742          +0.002
query_confidence                   -0.032          -0.003

Here's what's happening: the helper computes the standardized mean difference for any covariate, with optional IPW weights.

You then print raw and weighted SMDs for each covariate. The raw SMD on engagement_tier_heavy is +0.742 (heavy users opt in far more than everyone else), and the weighted SMD drops to +0.002, a clean pass. Query confidence was already close to balanced on the raw data, and weighting keeps it that way. If any weighted SMD came back above 0.1 in absolute value, your propensity model would be missing something; the fix is usually richer features or interaction terms in the logistic regression.

Visually, Figure 2 above confirmed what the SMDs now confirm numerically: the overlap condition holds, and balance is achievable.

Step 5: Bootstrap Confidence Intervals

Point estimates are only half the story. Any estimate you report to a product team needs an interval that tells them whether +0.08 is distinguishable from +0.03 or from +0.12. Analytic standard errors for IPW and matching are tricky because of the estimated propensity score, so the simplest and most honest move is the non-parametric bootstrap.

def estimate_all(sample):
    """Return (ATE_IPW, ATT_IPW, ATT_match) on a bootstrap sample."""
    s = sample.copy()
    X_s = pd.get_dummies(
        s[["engagement_tier", "query_confidence"]], drop_first=True
    ).astype(float)
    ps = LogisticRegression(max_iter=1000).fit(X_s, s.opt_in_agent_mode)
    s["p"] = ps.predict_proba(X_s)[:, 1]

    s["w_ate"] = np.where(
        s.opt_in_agent_mode == 1, 1 / s.p, 1 / (1 - s.p)
    )
    s["w_att"] = np.where(
        s.opt_in_agent_mode == 1, 1, s.p / (1 - s.p)
    )
    t, c = s[s.opt_in_agent_mode == 1], s[s.opt_in_agent_mode == 0]

    ate = (
        (t.task_completed * t.w_ate).sum() / t.w_ate.sum()
        - (c.task_completed * c.w_ate).sum() / c.w_ate.sum()
    )
    att = t.task_completed.mean() - (
        (c.task_completed * c.w_att).sum() / c.w_att.sum()
    )
    nn_b = NearestNeighbors(n_neighbors=1).fit(c[["p"]].values)
    _, idx_b = nn_b.kneighbors(t[["p"]].values)
    match = (
        t.task_completed.values
        - c.task_completed.values[idx_b.flatten()]
    ).mean()
    return ate, att, match

rng = np.random.default_rng(7)
n_reps = 500
results = np.zeros((n_reps, 3))
for i in range(n_reps):
    boot = df.iloc[rng.integers(0, len(df), size=len(df))]
    results[i] = estimate_all(boot)

for name, col in zip(["IPW ATE", "IPW ATT", "1-NN ATT"], range(3)):
    lo, hi = np.percentile(results[:, col], [2.5, 97.5])
    print(f"{name:<10} 95% CI: [{lo:+.4f}, {hi:+.4f}]")

Expected output:

IPW ATE    95% CI: [+0.0745, +0.0954]
IPW ATT    95% CI: [+0.0687, +0.0865]
1-NN ATT   95% CI: [+0.0659, +0.0940]

Here's what's happening: you resample the dataset with replacement 500 times, refit the propensity model and recompute each estimator on every resample, then take the 2.5th and 97.5th percentiles of the bootstrap distribution as the 95% confidence interval. All three intervals cover the ground-truth +0.08 and exclude the naive +0.21 by a wide margin.

The IPW ATT interval is the tightest because ATT reweights only the control group. The 1-NN matching interval is the widest because single-neighbor matching discards control users outside the matched set.

Running this once takes about 90 seconds on a laptop. For a stakeholder report, anchor the headline to the point estimate and cite the interval so the team sees the uncertainty alongside the number.

When Propensity Score Methods Fail

Propensity scores make opt-in comparisons rigorous when their assumptions hold. They produce biased estimates that look clean when those assumptions fail.

Four common failure modes map to the three identification assumptions from earlier.

1. Unmeasured Confounders (Violate Unconfoundedness)

If something drives both opt-in and your outcome but isn't in your propensity model, IPW and matching produce biased estimates. This is the most common failure in practice.

An example: users who opt into agent mode are also the users who follow your engineering blog and read release notes. If blog-reading behavior raises task completion independently of the feature, missing that signal attributes the effect to agent mode, inflating your estimate.

The only real defense is domain knowledge about what drives opt-in, richer feature engineering in your propensity model, and formal sensitivity tools (Rosenbaum bounds, E-values) that quantify how strong an unmeasured confounder would have to be to overturn the result.
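To make the E-value idea concrete, here is a minimal sketch using the published formula for a risk ratio, E = RR + sqrt(RR × (RR − 1)). The completion rates in the example are hypothetical placeholders, not numbers from this tutorial's dataset, and the function name e_value is an illustrative choice.

```python
import math

def e_value(risk_ratio):
    """E-value for a risk ratio (VanderWeele and Ding's formula)."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # For protective effects (RR < 1), work with the reciprocal
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))

# Hypothetical completion rates: 0.50 for comparable controls, 0.58 for opt-in users
rr = 0.58 / 0.50
print(f"risk ratio {rr:.2f} -> E-value {e_value(rr):.2f}")
```

The E-value reads as: an unmeasured confounder would need to be associated with both opt-in and task completion by at least this risk ratio to fully explain away the observed effect. A value close to 1 means a weak confounder would be enough.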

2. Positivity (Overlap) Failures (Violate Overlap)

If some users have near-zero probability of opting in (or near-one), you've got no comparable counterfactual for them.

IPW creates extreme weights (1 / 0.001 = 1,000) that let a single outlier dominate the estimate. Matching, in turn, is forced into poor-quality pairings.

Check propensity histograms and trim propensities outside [0.05, 0.95] before weighting if extreme values exist.
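A minimal sketch of that trimming step, on a handful of made-up propensities (the helper name trim_by_propensity and the example values are assumptions for illustration):

```python
import numpy as np

def trim_by_propensity(propensity, lower=0.05, upper=0.95):
    """Boolean mask keeping units whose propensity lies inside [lower, upper]."""
    propensity = np.asarray(propensity, dtype=float)
    return (propensity >= lower) & (propensity <= upper)

# Hypothetical propensities, including extreme values at both ends
ps = np.array([0.001, 0.04, 0.20, 0.50, 0.80, 0.97])
treated = np.array([1, 0, 0, 1, 1, 0])

# ATE weights: 1/p for treated users, 1/(1-p) for controls
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))
keep = trim_by_propensity(ps)

print("max weight before trimming:", w.max())
print("max weight after trimming: ", w[keep].max())
print("users kept:", int(keep.sum()), "of", len(ps))
```

Keep in mind that trimming changes the population your estimate describes: the result applies to users with reasonable overlap, not to everyone. Report how many users were dropped alongside the estimate.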

3. Misspecified Propensity Models (Degrade Unconfoundedness in Practice)

A linear logistic regression can't capture nonlinear relationships. If opt-in depends on the interaction between engagement tier and query confidence (power users with complex queries opt in, while light users pass), a main-effects model misses that and produces poor balance.

Use flexible models (for example, gradient boosting on the propensity score or regression adjustment on top of weighting) and always check the balance after weighting. Poor balance after weighting is the primary signal of misspecification.

4. Spillovers Between Users (Violate SUTVA)

Propensity score methods assume your users are independent. If one user opting into agent mode affects another user's task completion (for example, teammates adopting the feature together in shared workspaces), your estimated effect includes the spillover.

This violates the stable unit treatment value assumption (SUTVA), and handling it cleanly requires a different toolkit: either cluster randomization for features adopted at the workspace level or network-aware experimental designs for user-level spillovers.

These failure modes stay invisible in your regression coefficients. They surface as estimates that look good on paper but don't hold up when the feature rolls out to a broader audience.

Run balance diagnostics, check overlap plots, and document what you might have missed: those are your only real defenses.

What to Do Next

Propensity score methods are the right tool when your feature ships behind an opt-in toggle and you've got rich covariates to model selection with.

If opt-in follows a crisp rule (a threshold on query complexity, a paid-tier gate), regression discontinuity fits better. If you suspect unobserved confounders and have an external randomization source (randomized rollout noise, rate-limit-triggered routing), instrumental variables will do better.

To guard your estimate against propensity misspecification, doubly robust estimators combine propensity weighting with regression adjustment and stay consistent if at least one of the two component models is correctly specified.
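One common doubly robust estimator is augmented inverse propensity weighting (AIPW). The sketch below applies it to simulated data with a single binary covariate, using simple cell means as both the propensity model and the outcome model. The data-generating numbers and variable names are assumptions chosen to resemble this tutorial's setup (true effect +0.08), not the tutorial's actual dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Simulated data (hypothetical): heavy users opt in more and complete more tasks
heavy = (rng.random(n) < 0.4).astype(int)
treated = (rng.random(n) < np.where(heavy == 1, 0.7, 0.2)).astype(int)
y = (rng.random(n) < 0.30 + 0.25 * heavy + 0.08 * treated).astype(float)  # true effect: +0.08

def cell_mean(values, mask):
    """Mean of values over the rows selected by mask."""
    return values[mask].mean()

# Propensity model: opt-in rate within each covariate stratum
e_hat = np.where(heavy == 1,
                 cell_mean(treated.astype(float), heavy == 1),
                 cell_mean(treated.astype(float), heavy == 0))

# Outcome models: mean outcome within each (stratum, treatment) cell
mu1 = np.where(heavy == 1,
               cell_mean(y, (heavy == 1) & (treated == 1)),
               cell_mean(y, (heavy == 0) & (treated == 1)))
mu0 = np.where(heavy == 1,
               cell_mean(y, (heavy == 1) & (treated == 0)),
               cell_mean(y, (heavy == 0) & (treated == 0)))

# AIPW: regression prediction plus an inverse-propensity-weighted residual correction
aipw_scores = (
    mu1 - mu0
    + treated * (y - mu1) / e_hat
    - (1 - treated) * (y - mu0) / (1 - e_hat)
)
ate_aipw = aipw_scores.mean()
naive = y[treated == 1].mean() - y[treated == 0].mean()
print(f"naive difference: {naive:+.4f}")
print(f"AIPW ATE:         {ate_aipw:+.4f}")
```

With real covariates you would replace the cell means with fitted models (for example, a logistic regression for e_hat and outcome regressions for mu1 and mu0). The structure of the final formula stays the same.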

The companion notebook for this tutorial lives here. Clone the repo, generate the synthetic dataset, and run psm_demo.ipynb (or psm_demo.py) to reproduce every code block, every number, and every figure from this tutorial.

When an AI feature ships behind a toggle, the naïve opt-in comparison is usually the wrong number. Propensity score methods give you "users comparable to those who clicked this" as your counterfactual, and the bootstrap gives you an interval you can defend when a stakeholder asks how sure you are.

How to Deploy a Serverless Spam Classifier Using Scikit-Learn, AWS Lambda, & API Gateway

Rakshath Naik — Thu, 30 Apr 2026 05:06:15 +0000

In today's digital world, spam is no longer just an annoyance - it's a growing security threat. To combat this, developers often turn to machine learning to build intelligent filters that can distinguish legitimate emails from malicious ones.

While building a machine learning model in a notebook is relatively straightforward, the real challenge lies in the last mile: deploying that model into a scalable, production-ready system that users can actually interact with.

In this project, I built an end-to-end serverless spam classifier, combining Scikit-learn for model development with AWS Lambda, Amazon S3, and Amazon API Gateway for deployment. The result is a lightweight, scalable API that can classify messages in real time.

The system is designed to be modular and cost-efficient, allowing the model to be retrained and updated independently without affecting the live API. From detecting "free iPhone" scams to identifying phishing attempts, this project demonstrates how to bridge the gap between machine learning experimentation and real-world deployment.

Prerequisites
Building the Brain: The Model
Deploying the Model to AWS
How to Run The Project Locally
Our Project Architecture
Conclusion: The Power of Serverless AI
Acknowledgment / References

1. Prerequisites

Fundamental skills: Basic proficiency in Python and understanding of Machine Learning concepts like classification.
AWS account: Access to an AWS account with permissions for Lambda, S3, and API Gateway.
Environment: Python 3.11 installed, along with libraries like scikit-learn, pandas, and joblib.
AWS CLI: Configured on your local machine for file uploads.
HuggingFace account: You can directly download the model from my account.

2. Building the Brain: The Model

Photo by Steve A Johnson on Unsplash

At the heart of this project lies a supervised learning approach. Instead of simply specifying which words are considered spam, we'll provide the computer with a dataset and an algorithm, enabling it to learn and identify spam patterns on its own.

1. Vectorization: Turning Text into Math

Machine Learning models can't read text. They require numerical input. To solve this, we'll use the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)

Here's the mathematical formula:

$$w_{i,j} = tf_{i,j} \times \log \left( \frac{N}{df_i} \right)$$

TF-IDF term definitions:

wᵢ,ⱼ (Weight): The final importance score of a specific word in a document.
tfᵢ,ⱼ (Term Frequency): How often a word appears in a single email.
N (Total Documents): The total count of all emails in your dataset.
dfᵢ (Document Frequency): The number of different emails that contain this specific word.
log(N/dfᵢ) (IDF): A penalty that lowers the score of common words like "the" or "is" that appear everywhere.

The vectorizer cleans the data by removing common English stop words, converts all text to lowercase for consistency, and assigns more importance to rare and meaningful words while giving less importance to frequently used words.
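To see the formula at work, here is a small hand-rolled sketch that applies w = tf × log(N / df) to three toy messages. It is for intuition only: it skips stop-word removal, and scikit-learn's TfidfVectorizer by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from these. The function name tfidf_weights and the sample messages are made up for illustration.

```python
import math

def tfidf_weights(documents):
    """Per-document TF-IDF weights using w = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word
    df = {}
    for tokens in tokenized:
        for word in set(tokens):
            df[word] = df.get(word, 0) + 1
    weights = []
    for tokens in tokenized:
        doc_weights = {}
        # dict.fromkeys keeps first-seen order, so output is deterministic
        for word in dict.fromkeys(tokens):
            tf = tokens.count(word)
            doc_weights[word] = tf * math.log(n_docs / df[word])
        weights.append(doc_weights)
    return weights

emails = [
    "free prize claim your free prize",
    "meeting notes for your review",
    "claim your refund",
]
weights = tfidf_weights(emails)
for word, w in weights[0].items():
    print(f"{word:<6} {w:.3f}")
```

In the first message, "free" and "prize" score highest because they appear twice and occur in no other message, while "your" appears in every message and drops to zero.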

2. Training: The Logistic Regression Engine

We'll use Logistic Regression here, a classification algorithm that predicts the probability of an outcome.

In this stage, we feed our vectorized training data into the Logistic Regression algorithm. The goal is to establish a mathematical relationship between specific word weights and the Spam or Ham label.

During training, the model iteratively adjusts its internal parameters to minimize error, eventually learning that words like "winner" or "free" correlate highly with spam, while conversational language correlates with legitimate messages.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_features, Y_train)

In our case, it calculates the probability that an email is spam or ham.

The algorithm uses the Sigmoid function to map any real-valued number into a value between 0 and 1.

$$P(y=1|x) = \frac{1}{1 + e^{-(z)}}$$

where z = β₀ + β₁x₁ + … + βₙxₙ.
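As a quick illustration of that mapping, the sketch below computes z for one example and passes it through the sigmoid. The coefficients, intercept, and feature values are hypothetical numbers, not weights from the trained model.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1).

    Fine for moderate z; very large negative z would overflow math.exp.
    """
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefficients, intercept):
    """P(y=1 | x) for one example: sigmoid of the linear score z."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)

# Hypothetical weights for two TF-IDF features
coefficients = [2.0, -1.5]
intercept = -0.5
p = predict_proba([0.8, 0.1], coefficients, intercept)
print(f"P(y=1 | x) = {p:.3f}")
```

A score of z = 0 maps to exactly 0.5, positive scores push the probability toward 1, and negative scores push it toward 0.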

3. Evaluation: Testing the Intelligence

After training, we need to verify if the brain actually works on data it hasn't seen before.

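from sklearn.metrics import accuracy_score
X_test_features = feature_extraction.transform(X_test)  # reuse the fitted vectorizer
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)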

By comparing the model’s predictions against the actual labels in our test set, we calculate an Accuracy Score. This gives us the confidence that the model is ready for the real world (achieving ~94% accuracy in our tests).

4. Exporting the Logic (Serialization)

To move this brain from our local Python environment to the AWS Cloud, we'll use Joblib to save our work into binary files (.pkl).

joblib.dump(model, 'spam_model.pkl')
joblib.dump(feature_extraction, 'vectorizer.pkl')

We use the Pickle format because it allows us to freeze complex Python objects (mathematical weights and word mappings) into a portable binary format that can be instantly re-animated in the cloud.

We need the Vectorizer to translate new user text into the exact numerical coordinates the Model was trained to understand. Using one without the other is like having a key but no lock.

The trained Logistic Regression model and TF-IDF vectorizer are openly available to the community on Hugging Face (rakshath1/mail-spam-detector).

3. Deploying the Model to AWS

Training a model is science, while deploying it is engineering. To make this classifier accessible to the world, we'll use a serverless stack that scales automatically and incurs almost no maintenance costs.

1. Model Storage: Amazon S3

First, we'll upload our .pkl files to an S3 bucket. By decoupling the model from the code, we can update the AI's intelligence (simply by overwriting the file in S3) without redeploying the backend code. It makes the system highly maintainable.

2. The Production Backend: AWS Lambda

To make the AI accessible, we'll move from a local script to a Serverless Cloud Architecture. This ensures the model is always available without the cost of a 24/7 server.

The deployment environment is AWS Lambda (Python 3.11). The Lambda runtime is lightweight and doesn't include Scikit-Learn or Joblib, so we'll package them ourselves, store the ZIP in our S3 bucket, and attach it to the function as a Lambda layer.

Commands to run in your terminal (the last one uses the AWS CLI):


# 1. Create a workspace
mkdir ml_layer && cd ml_layer

# 2. Install scikit-learn and its dependencies into a folder
pip install \
    --platform manylinux2014_x86_64 \
    --target=python/lib/python3.11/site-packages \
    --implementation cp \
    --python-version 3.11 \
    --only-binary=:all: \
    scikit-learn joblib

# 3. Zip the folder
zip -r sklearn_lib.zip python

# 4. Upload to S3 (Using AWS CLI)
aws s3 cp sklearn_lib.zip s3://YOUR-BUCKET-NAME/

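We store the Scikit-Learn library as a ZIP in S3 because Lambda caps direct ZIP uploads at 50 MB, and Scikit-Learn with its dependencies can exceed that. Creating the layer from the S3 object keeps the heavy dependencies out of the function's own code package. Note that the unzipped function plus its layers must still fit within Lambda's 250 MB limit.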

The Lambda Function:


import json
import boto3
import os
import sys
from io import BytesIO

# Make sure the Lambda layer directory (containing sklearn/joblib) is on the import path
sys.path.append('/opt/python')

try:
    import joblib
except ImportError:
    # Fallback for specific Scikit-Learn distributions
    from sklearn.utils import _joblib as joblib

# Initialize S3 client
s3 = boto3.client('s3')

# Use placeholders for the article so readers can insert their own values
BUCKET_NAME = 'YOUR_S3_BUCKET_NAME' 
MODEL_KEY = 'spam_model.pkl'
VECTORIZER_KEY = 'vectorizer.pkl'

# Global variables for 'Warm Start' caching (improves performance by keeping model in RAM)
model = None
vectorizer = None

def load_model():
    """Downloads model files from S3 only if they aren't already in RAM"""
    global model, vectorizer
    if model is None or vectorizer is None:
        try:
            # 1. Load the Logistic Regression Model from S3
            m_obj = s3.get_object(Bucket=BUCKET_NAME, Key=MODEL_KEY)
            model = joblib.load(BytesIO(m_obj['Body'].read()))
            
            # 2. Load the TF-IDF Vectorizer directly from S3
            v_obj = s3.get_object(Bucket=BUCKET_NAME, Key=VECTORIZER_KEY)
            vectorizer = joblib.load(BytesIO(v_obj['Body'].read()))
        except Exception as e:
            raise Exception(f"Failed to load .pkl files from S3: {str(e)}")

def lambda_handler(event, context):
    try:
        # Ensure model and vectorizer are ready before processing
        load_model()
        
        # Handles both direct Lambda tests and API Gateway POST requests
        body = event.get('body', event)
        if isinstance(body, str):
            body = json.loads(body)
            
        text = body.get('text', '')
            
        if not text:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'No text provided.'})
            }

        # 1. Transform input text to numeric features using the trained Vectorizer
        data_vec = vectorizer.transform([text])
        
        # 2. Predict using the Logistic Regression Model 
        prediction = int(model.predict(data_vec)[0])
        
        # 3. Map numeric result to human-readable label
        result_label = "HAM" if prediction == 1 else "SPAM"
        
        # RESPONSE WITH CORS
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*' # needed for cross-domain web integration
            },
            'body': json.dumps({
                'status': 'success',
                'classification': result_label,
                'input_text': text
            })
        }
        
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error_message': f"Inference Error: {str(e)}"})
        }

Key features of the Lambda function:

Warm start caching: By defining the model and vectorizer variables outside the lambda_handler, we keep them in the container's memory between invocations. The first (cold) request still pays for the S3 download, but subsequent warm requests skip it, which significantly reduces their latency.
Dynamic dependency loading: Lambda unpacks layers under /opt, and the sys.path.append('/opt/python') line makes sure that directory is on the import path, so the heavy libraries live in the layer instead of the function's own upload package.
Bimodal input handling: The function is designed to handle both direct JSON testing from the AWS console and stringified payloads sent via API Gateway.
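The bimodal input handling can be pulled out into a small helper and exercised locally without deploying anything. This is a sketch: the helper name extract_text and the sample events are illustrative, and the extra guard for a missing body is an addition that the handler above does not have.

```python
import json

def extract_text(event):
    """Return the 'text' field from a direct test event or an API Gateway proxy event."""
    body = event.get("body", event)
    # API Gateway proxy integration delivers the body as a JSON string
    if isinstance(body, str):
        body = json.loads(body)
    # Guard against a missing or non-object body (for example, "body": None)
    if not isinstance(body, dict):
        return ""
    return body.get("text", "")

direct_event = {"text": "WIN a free iPhone now"}
gateway_event = {"body": json.dumps({"text": "WIN a free iPhone now"})}

print(extract_text(direct_event))
print(extract_text(gateway_event))
```

Both calls return the same string, which is what lets one handler serve console tests and real HTTP requests alike.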

3. The API Gateway - The Bridge to the Web

Photo by Growtika on Unsplash

Creating the REST API

Next, we'll create a REST API with a single POST method. Why POST? Because the user's text message travels in a JSON request body, and POST is the method designed to carry one.

First, navigate to the Amazon API Gateway console and select Create API -> REST API.
Give your API a name, such as EmailSpamPredictor-API, and set the Endpoint Type to Regional.
Then, in the left sidebar, click Resources, create a resource, and enter a resource name (for example, /predict, which is what I used).
Next, click Create method, select POST, and choose Lambda Function as the integration type.
Ensure Lambda Proxy integration is enabled (this allows the full request to pass through to your code).

The CORS Configuration (The Troubleshooting Hub)

This is where many developers encounter the dreaded Connection Error. Our API is hosted on AWS, so if your front-end lives on a different domain, the browser's Same-Origin Policy will block the request by default.

To fix this, we'll enable CORS:

Access-Control-Allow-Origin: Set to * (or specifically to your domain) to tell the browser that the API is allowed to talk to your front-end.
The OPTIONS method: When you enable CORS on the resource, API Gateway creates an OPTIONS method for it. This handles the preflight request, where the browser asks, “Am I allowed to send this request?” before sending the actual text.
Access-Control-Allow-Headers: In the screenshot, you'll notice headers like Content-Type and Authorization are allowed. This ensures that when our JavaScript fetch() call sets the content type to application/json, the API Gateway doesn't reject it.

Image illustrates the CORS configuration for our project. (Image by author)
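One detail worth sketching: with Lambda proxy integration, API Gateway passes the function's response through as is, so the CORS headers on the actual POST response have to come from the function itself. In the handler shown earlier, only the 200 response carries them, which means a browser would refuse to expose the 400 and 500 error bodies to the page. A small helper, hypothetical and not part of the original function, keeps the headers consistent across every status code:

```python
import json

CORS_HEADERS = {
    "Content-Type": "application/json",
    "Access-Control-Allow-Origin": "*",  # assumption: tighten to your own domain in production
}

def make_response(status_code, payload):
    """Build a Lambda proxy response that carries CORS headers for every status code."""
    return {
        "statusCode": status_code,
        "headers": dict(CORS_HEADERS),  # copy so callers can't mutate the shared constant
        "body": json.dumps(payload),
    }

ok = make_response(200, {"status": "success", "classification": "SPAM"})
bad = make_response(400, {"error": "No text provided."})
print(bad["headers"]["Access-Control-Allow-Origin"])
```

Each return statement in the handler could then call make_response(...) instead of building the dictionary by hand.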

Deployment Stages

Once the API is deployed to a production stage, AWS generates a permanent Invoke URL. This acts as the public gateway to our model and typically follows this structure: https://[api-id].execute-api.[region].amazonaws.com/prod/predict, where prod is the stage name and predict is the resource created earlier.

Connecting the Frontend (The JavaScript Layer)

With the API live, we can now write a simple JavaScript function to talk to our model. This script runs whenever a user clicks the Analyze button on your site.


async function checkSpam() {
    const message = document.getElementById("userInput").value;
    const apiUrl = "YOUR_API_GATEWAY_INVOKE_URL";

    try {
        const response = await fetch(apiUrl, {
            method: "POST",
            headers: {
                "Content-Type": "application/json"
            },
            body: JSON.stringify({ "text": message })
        });

        const data = await response.json();
        
        // Display result on the webpage
        const resultElement = document.getElementById("result");
        resultElement.innerText = `Prediction: ${data.classification}`;
        resultElement.style.color = data.classification === "SPAM" ? "red" : "green";

    } catch (error) {
        console.error("Error:", error);
        alert("Could not connect to the Spam Detector API.");
    }
}

4. How to Run The Project Locally

You can save the front-end as an HTML file. Once it's ready, don't just double-click the .html file: opening it straight from disk can trigger browser security restrictions. Instead, serve it with a simple local server.

Step 1: Open the terminal or Command Prompt.

Step 2: Navigate to your project folder.

cd [PATH_TO_YOUR_FOLDER]

Step 3: Start a local Python web server.

python -m http.server 8000

Step 4: Access the application.

Open your browser and navigate to:
http://localhost:8000/your-file-name.html


5. Our Project Architecture

The image illustrates the architecture of our project (Building a Serverless Spam Classifier). It shows the process that takes place from the client input to the final model output. (Image by Author)

Client Front-End Interaction: The process starts on the far left. A user interacts with the web interface (for example, a website or a desktop app). They input text like WIN free iPhone now and trigger a request.
The Entry Point: API Gateway: The request hits the Amazon API Gateway, which acts as the security guard and translator.
(a) CORS OPTIONS handles the pre-flight handshake to ensure the browser has permission to talk to the AWS cloud.
(b) Classification Request (POST) routes the actual message data to your backend logic.
The Engine: AWS Lambda (Python 3.11): The central “lightbulb” represents your Lambda function. This is where the code you wrote lives. It doesn’t run 24/7 – it only wakes up when a request arrives.
Storage & Retrieval: S3 Bucket: Since Lambda is lightweight, it doesn’t store your heavy Machine Learning files internally.
Dependency and Model Download: The layer built from sklearn_lib.zip (the engine) supplies the libraries, and the function pulls the .pkl files (the intelligence) from the S3 Bucket.
Required Dependency and Model: These assets are loaded into the Lambda’s temporary memory to prepare for the prediction.
The Inference Pipeline: Inside the Lambda, a three-step mathematical cycle occurs:
(a) Text Vectorizer: Translates the words into numbers.
(b) Logistic Regression: Calculates the probability of spam based on those numbers.
(c) Label: Assigns a final result (Spam or Ham).
The Result Delivery: The result is sent back through the API Gateway, including the necessary CORS Headers to ensure the browser accepts it. The front-end then updates to show the “Result: SPAM” with a visual indicator.

6. Conclusion: The Power of Serverless AI

By merging the mathematical simplicity of Logistic Regression with the industrial strength of AWS Serverless Architecture, we have transformed a static Python script into a globally accessible, scalable API.

This project demonstrates that you don’t need a massive budget or a 24/7 dedicated server to deploy high-quality Machine Learning.

Using the S3-to-Lambda workaround allowed us to bypass common storage hurdles, ensuring that our Brain (the model) and its Muscle (Scikit-Learn) could function seamlessly within the cloud’s ephemeral environment. It bridges the gap between experimentation and real-world applications, making AI systems practical, efficient, and accessible.

7. Acknowledgment / References

Pre-trained spam classification model: rakshath1/mail-spam-detector on Hugging Face
Scikit-learn Documentation
AWS Lambda Documentation
Amazon S3 Documentation
Amazon API Gateway Documentation


Product Experimentation for AI Rollouts: Why A/B Testing Breaks and How Difference-in-Differences in Python Fixes It

Rudrendu Paul — Wed, 22 Apr 2026 22:33:18 +0000

Your team shipped an LLM-based summaries feature to wave 1 workspaces at week 20 and now the post-launch doc is due. You need a causal effect number, a specific estimate you can defend to a statistician.

The problem is that wave 2 workspaces are still waiting, a product-wide onboarding redesign shipped the same Tuesday, and week 20 also coincided with a quarterly engagement bump. Any comparison between the two groups after week 20 mixes the feature's causal effect with the redesign, the seasonality, and whatever selection criteria determined which workspaces landed in wave 1 in the first place.

This is how most enterprise SaaS teams ship AI features in 2026: one workspace at a time, in waves, on a rollout calendar. Randomization doesn't happen, and because randomization doesn't happen, A/B testing can't give you a clean causal effect. The result is a number on a dashboard that everyone argues over.

Call this the Rollout Calendar Trap: you have real data, a real experiment structure, and a completely invalid comparison. For data scientists shipping AI features in waves, it's the primary source of bad causal claims downstream.

Product experimentation for generative AI features follows this exact pattern: the hypothesis is that the AI feature causes higher engagement, and the wave structure is supposed to test it.

The wave calendar replaced the coin flip, and that substitution breaks the math. A simple A/B comparison assumes randomized assignment that the rollout never produced, so the measurement tool fails even when the experiment design is sound.

Difference-in-differences is the causal inference method that fixes this. It subtracts the time trend by comparing how outcomes shift across time periods for each group, giving you a defensible causal estimate even without randomization.

In this tutorial you'll use it to measure the true causal effect of an AI feature rolled out across enterprise workspaces, with working Python code against a synthetic SaaS product dataset.

By the end you'll know how to run a DiD estimate, how to test its parallel-trends assumption, and what to do when that assumption fails.

Why A/B Testing Breaks for Staged Rollouts
What Difference-in-Differences Does
Prerequisites
Setting Up the Working Example
Step 1: A Simple 2x2 DiD
Step 2: Regression DiD with Fixed Effects
Step 3: Checking the Parallel-Trends Assumption
When Difference-in-Differences Fails
What to Do Next

Why A/B Testing Breaks for Staged Rollouts

Random assignment is the engine that makes A/B testing a valid causal method. When you flip a coin to decide which user gets the feature, the treatment and control groups end up, on average, with the same distribution of every confounder (any variable that affects both who gets treatment and what outcome you measure). Any difference in outcomes after assignment, beyond sampling noise, is the causal effect of the treatment. Full stop.

A staged rollout across enterprise workspaces breaks that engine in three ways:

1. The wave assignment isn't random.

Product teams choose wave 1 workspaces for various reasons: they have the most engaged admins, the largest seat counts, or the best relationship with customer success. Those reasons correlate directly with your outcome. Wave 1 workspaces were going to show higher engagement anyway, feature or no feature.

2. The calendar introduces a time trend.

Between week 20 (wave 1 launch) and week 30 (wave 2 launch), your product gets better, your onboarding improves, your sales team lands bigger customers. Any naïve "engagement after week 20 minus engagement before week 20" comparison picks up all of that along with the feature's effect.

3. Adoption inside treated workspaces is itself selective

Even inside a workspace that received the feature, not every user turns it on. Power users do, and less engaged users often wait months. Comparing users who used the feature against users who didn't introduces selection bias, where the groups differ systematically before you even measure the outcome, on top of the non-random workspace assignment.

A/B testing assumes none of these three problems exist. Staged rollouts guarantee all three. The naïve comparison gives you a number, and that number measures engagement theater.
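
A small simulation makes the first failure concrete. The sketch below is a hypothetical setup, not the tutorial's dataset: every workspace has a baseline engagement level, the feature has zero true effect, and wave 1 is chosen either by coin flip or by hand-picking the most engaged workspaces. All names and numbers are made up for illustration.

```python
import random
import statistics

random.seed(0)

# Hypothetical workspaces: baseline engagement drives the outcome; the feature does nothing.
n_workspaces = 1000
baseline = [random.gauss(0.60, 0.10) for _ in range(n_workspaces)]
outcome = [b + random.gauss(0, 0.02) for b in baseline]  # true treatment effect = 0

def naive_gap(treated_flags):
    """Mean outcome of treated workspaces minus mean outcome of untreated ones."""
    treated = [y for y, t in zip(outcome, treated_flags) if t]
    control = [y for y, t in zip(outcome, treated_flags) if not t]
    return statistics.mean(treated) - statistics.mean(control)

# Coin-flip assignment: the two groups are balanced on baseline engagement.
coin_flip = [random.random() < 0.5 for _ in range(n_workspaces)]

# Hand-picked assignment: the most engaged half goes into wave 1.
cutoff = statistics.median(baseline)
hand_picked = [b > cutoff for b in baseline]

random_gap = naive_gap(coin_flip)
picked_gap = naive_gap(hand_picked)
print(f"Naive gap, random assignment:      {random_gap:+.3f}")
print(f"Naive gap, hand-picked assignment: {picked_gap:+.3f}")
```

With a true effect of zero, the coin-flip gap should sit near zero, while the hand-picked gap reflects nothing but the selection rule.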

What Difference-in-Differences Does

Difference-in-differences compares the change in outcomes over time between a treated group and a control group. Subtracting one change from the other cancels any shared time trend (product improvements, seasonality, onboarding changes) because both groups experience it equally, leaving you with just the treatment effect.

Here's a concrete example. Imagine tracking quarterly revenue for coffee shops in two neighborhoods. One neighborhood gets a new competitor in Q3, the other doesn't.

Both neighborhoods experience the same underlying market trends: a local economic upturn and holiday seasonality. DiD isolates the competitor's impact by subtracting whatever revenue shift happened in both neighborhoods.

Your staged rollout sets up the exact same structure: wave 1 workspaces are the neighborhood with the new entrant, wave 2 is the comparison.

The math formalizes this as a 2x2 table, where rows are groups (treated, control), columns are time periods (pre, post), and each cell holds the mean outcome for that group in that period:

A = mean task completion for wave 1 users before week 20 (coffee shops: Q2 revenue, neighborhood with incoming competitor)
B = mean task completion for wave 1 users after week 20 (coffee shops: Q3 revenue, same neighborhood)
C = mean task completion for wave 2 users before week 20 (coffee shops: Q2 revenue, the untouched neighborhood)
D = mean task completion for wave 2 users after week 20 (coffee shops: Q3 revenue, same)

                         Pre     Post
Treated (wave 1):         A       B
Control (wave 2):         C       D

Naive post-period gap:   B - D     (contaminated by group differences)
Naive treated change:    B - A     (contaminated by time trend)
DiD:                 (B - A) - (D - C)   ← the causal effect

B - A is wave 1's change, but it includes both the treatment effect and whatever time trend moved everyone. D - C is wave 2's change over the same window, same time trend, no treatment. Subtracting one from the other leaves only the treatment effect.

The counterfactual is what wave 1 would have looked like without the treatment. DiD constructs it by saying: wave 1's counterfactual trajectory = wave 1's pre-period level, carried forward with wave 2's post-period trend. The gap between the actual wave 1 trajectory and that counterfactual is the DiD estimate.
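
To make the arithmetic tangible, here is a minimal sketch with made-up cell means (illustrative values only, not taken from the dataset). It builds the counterfactual exactly as described above and confirms that the gap to it equals the difference of differences.

```python
# Hypothetical cell means for the 2x2 table (illustrative values, not from the dataset)
A, B = 0.60, 0.68   # treated (wave 1): pre, post
C, D = 0.55, 0.58   # control (wave 2): pre, post

naive_post_gap = B - D          # mixes in the permanent gap between the groups
naive_treated_change = B - A    # mixes in the shared time trend

# Counterfactual: wave 1's pre-period level plus wave 2's pre-to-post change
counterfactual_B = A + (D - C)
did = B - counterfactual_B      # algebraically equal to (B - A) - (D - C)

print(f"Naive post-period gap: {naive_post_gap:+.3f}")
print(f"Naive treated change:  {naive_treated_change:+.3f}")
print(f"Counterfactual B:      {counterfactual_B:.3f}")
print(f"DiD estimate:          {did:+.3f}")
```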

Figure 1: Causal inference with difference-in-differences. Blue solid: Wave 1 actual trajectory. Orange dashed: Wave 2 (control, untreated during this window). Blue dotted: the counterfactual, where Wave 1 would have gone based on Wave 2's post-period trend. The green arrow is the DiD estimate: the gap between the actual Wave 1 trajectory and the counterfactual in the post-treatment period. A, B, C, D correspond to the four cells in the table above.

Before week 20, wave 1 and wave 2 track each other closely. That's the parallel-trends requirement at work. At week 20, wave 1 pulls ahead of both wave 2 and its own counterfactual (the dotted line). That post-treatment divergence is the DiD estimate.

The DiD estimate handles two types of bias at once. Permanent differences between treated and control groups (wave 1 workspaces were always more engaged) cancel out because DiD focuses on changes in outcomes across time periods. Time trends that affect both groups (product improvements, market seasonality) cancel out because both groups experience them.

DiD asks one thing in return: parallel pre-treatment trends. The treated and control groups have to be moving in the same direction at the same rate before treatment starts. When that holds, you can extrapolate the shared trend forward and attribute any post-treatment divergence to the treatment. If the trends were already diverging before treatment, DiD is biased, and no amount of clever regression fixes it.

Parallel trends is the assumption you'll test in step 3.

Companion Notebook

All the code in this tutorial, including the synthetic dataset, the DiD regression, the parallel-trends plot, and the placebo pre-trend test, lives in a single executable Jupyter notebook in the GitHub repo for this series on product experimentation and causal inference for GenAI and LLM applications.

You can clone it, run generate_data.py once, and every output in this article reproduces exactly: github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm

Prerequisites

You'll need Python 3.11 or newer and comfort with pandas and basic regression. You can follow along without prior causal inference experience, as the article defines confounders and selection bias inline when they first appear. You'll encounter clustered standard errors and fixed effects in step 2. The article explains what they do and why they matter, but it doesn't derive them from scratch.

Install the packages for this tutorial:

pip install numpy pandas statsmodels linearmodels matplotlib

Clone the companion repo to get the synthetic dataset:

git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv

Setting Up the Working Example

The dataset simulates a SaaS product with an AI summaries feature launched in two waves: wave 1 workspaces get it at week 20, wave 2 at week 30, with 50,000 users total, each with one row of telemetry.

The data generator bakes in a +5 percentage point causal effect on task completion for users in their workspace's post-treatment period. You know the truth upfront, so you can check whether your DiD estimator actually recovers it.

Load the data and inspect the structure:

import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")
print(df.shape)
print(df[["wave", "signup_week", "workspace_id", "task_completed"]].head())
print("\nWave sizes:", df.wave.value_counts().to_dict())
print("Treatment weeks per wave:",
      df.groupby("wave").treatment_week.first().to_dict())

Expected output:

(50000, 16)
   wave  signup_week  workspace_id  task_completed
0     2           10            36               0
1     2           51            44               1
2     2            2            28               1
3     1           15            20               1
4     1           29             0               1
Wave sizes: {2: 25063, 1: 24937}
Treatment weeks per wave: {1: 20, 2: 30}

Here's what's happening: you load 50,000 rows, one per user. Wave 1 has 24,937 users across 25 workspaces; wave 2 has 25,063 users across 25 different workspaces. The treatment_week column records when each user's workspace got the AI summaries feature (week 20 for wave 1, week 30 for wave 2). The task_completed column is your outcome: whether the AI successfully completed the user's task.

One important detail: signup_week in this dataset records which calendar week a user first joined the product, and we're using it as a time index to assign users to pre- or post-treatment cohorts.

A user who signed up in week 22 joined after the feature launched, so their experience is "post-treatment." A user who signed up in week 14 joined before the launch, so their experience is "pre-treatment."

This works here because each user has one row of telemetry tied to their initial product experience. In a panel dataset with multiple observations per user across time, you'd use an observation timestamp column tied to when each row was recorded.

To keep the analysis clean, restrict to users who signed up before the wave 2 launch (signup_week < 30). Wave 2 then works as a proper control group, since it hasn't been treated yet, while wave 1 has been treated for 10 weeks.

analysis = df[df.signup_week < 30].copy()
analysis["post"] = (analysis.signup_week >= 20).astype(int)
analysis["treated"] = (analysis.wave == 1).astype(int)

print(analysis.groupby(["treated", "post"])
              .agg(n=("user_id", "count"),
                   mean_completion=("task_completed", "mean"))
              .round(3))

Expected output:

                 n  mean_completion
treated post
0       0     9590            0.556
        1     4878            0.555
1       0     9633            0.592
        1     4738            0.643

Here's what's happening: you filter the data to the analysis window (weeks 0 to 29) and create two indicator variables. post is 1 for users in the post-week-20 period, 0 otherwise. treated is 1 for wave 1 users, 0 for wave 2. The groupby shows the four cells of the DiD 2x2 table: (treated=0, post=0), (treated=0, post=1), (treated=1, post=0), (treated=1, post=1). Those four means are everything you need for a first-pass DiD estimate.

Step 1: A Simple 2x2 DiD

Start with the cleanest version. Compute the four cell means by hand, then take the difference of differences:

cells = analysis.groupby(["treated", "post"]).task_completed.mean()

wave2_pre  = cells.loc[(0, 0)]   # control, pre
wave2_post = cells.loc[(0, 1)]   # control, post
wave1_pre  = cells.loc[(1, 0)]   # treated, pre
wave1_post = cells.loc[(1, 1)]   # treated, post

did_effect = (wave1_post - wave1_pre) - (wave2_post - wave2_pre)
print(f"Wave 1 change: {wave1_post - wave1_pre:+.4f}")
print(f"Wave 2 change: {wave2_post - wave2_pre:+.4f}")
print(f"DiD effect:    {did_effect:+.4f}  (ground truth = +0.05)")

Expected output:

Wave 1 change: +0.0515
Wave 2 change: -0.0013
DiD effect:    +0.0527  (ground truth = +0.05)

Here's what's happening: you pull the four cell means, compute wave 1's change in task completion from pre to post, compute wave 2's change over the same calendar window (wave 2 hasn't been treated yet), and take the difference. The DiD estimate is the piece of wave 1's change that can't be explained by whatever time trend also moved wave 2.

On this dataset the simple 2x2 estimate lands at +0.053, which is very close to the true +0.05. But you can't take this number to a product review. You have no standard errors, which means you can't say whether +0.053 is a real signal or within sampling noise. You have no covariate adjustment, so if wave 1 happened to have more heavy users in this cohort, some of that +0.053 could be engagement-tier composition. And you have no way to handle the workspace-level correlation in your data. Step 2 fixes all three.

Step 2: Regression DiD with Fixed Effects

The regression formulation of DiD produces the same point estimate as the 2x2 table when there are no covariates. But it also buys you three things:

Standard errors and p-values computed correctly
Covariate adjustment to reduce variance and sharpen your estimate
Cluster-robust errors that handle correlation within workspaces, which a staged rollout always has

The regression is: outcome ~ treated + post + treated:post + controls. The coefficient on the treated:post interaction is your DiD estimate.

import statsmodels.formula.api as smf

did_model = smf.ols(
    "task_completed ~ treated * post + C(engagement_tier)",
    data=analysis
).fit(
    cov_type="cluster",
    cov_kwds={"groups": analysis.workspace_id}
)

print(did_model.summary().tables[1])

Expected output:

================================================================================================
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                        0.8301      0.007    126.538      0.000       0.817       0.843
C(engagement_tier)[T.light]     -0.4027      0.006    -63.168      0.000      -0.415      -0.390
C(engagement_tier)[T.medium]    -0.1766      0.007    -25.931      0.000      -0.190      -0.163
treated                          0.0367      0.005      6.885      0.000       0.026       0.047
post                            -0.0056      0.008     -0.684      0.494      -0.022       0.011
treated:post                     0.0541      0.011      4.981      0.000       0.033       0.075
================================================================================================

Here's what's happening: you fit an ordinary least squares regression of task completion on the treated indicator, the post indicator, their interaction, and a categorical control for engagement tier.

The treated:post coefficient is the DiD estimate. Users in the same workspace share common shocks, making their outcomes correlated. Clustering the standard errors on workspace_id (the cov_type="cluster" argument) corrects for that.

On this dataset the treated:post coefficient comes out at +0.054 with a clustered p-value of <0.001. The ground truth is +0.050. At 0.4 percentage points from the true effect, with a standard error that accounts for workspace-level correlation, that's a number you can put in a product review.
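
The claim that the regression reproduces the 2x2 estimate when there are no covariates can be checked directly. The sketch below uses simulated data and plain NumPy least squares rather than statsmodels, purely to keep it dependency-light; the variable names and the data-generating numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Hypothetical user-level data: group and period indicators, binary outcome
treated = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
p = 0.55 + 0.04 * treated + 0.01 * post + 0.05 * treated * post
y = (rng.random(n) < p).astype(float)

def cell(t, s):
    """Mean outcome for one cell of the 2x2 table."""
    return y[(treated == t) & (post == s)].mean()

# DiD from the four cell means
did_cells = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))

# DiD from OLS: y ~ 1 + treated + post + treated:post
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_ols = beta[3]  # coefficient on the interaction term

print(f"Cell-mean DiD:  {did_cells:+.6f}")
print(f"Regression DiD: {did_ols:+.6f}")
```

The two numbers agree to floating-point precision because the four-parameter model fits each cell mean exactly. Adding a covariate such as engagement tier breaks that exact equality, which is why the step 2 estimate differs slightly from step 1.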

A few practical notes on this regression:

Controls should be time-invariant (engagement tier, signup cohort). Time-varying controls that are themselves affected by treatment will bias the estimate.
Only the interaction has a causal interpretation. The intercept and level terms describe baseline differences between groups, nothing more.
Clustered errors are mandatory. Skip clustering and your standard errors can be several times too small, test statistics are artificially inflated, and results look far more significant than they are.
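
To see why clustering matters, the following sketch compares a classical OLS standard error with a cluster-robust one on simulated data. It is a bare-bones illustration under stated assumptions: treatment is assigned per workspace, each workspace gets its own random shock, and the sandwich formula is written out by hand without the finite-sample corrections that libraries typically apply. None of the numbers come from the tutorial's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n_clusters, per_cluster = 50, 200

# Hypothetical data: treatment assigned per workspace, plus a shared workspace-level shock
cluster_id = np.repeat(np.arange(n_clusters), per_cluster)
treated = (cluster_id < n_clusters // 2).astype(float)
workspace_shock = rng.normal(0, 0.08, n_clusters)[cluster_id]
y = 0.55 + 0.05 * treated + workspace_shock + rng.normal(0, 0.3, cluster_id.size)

X = np.column_stack([np.ones_like(treated), treated])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical OLS variance: assumes every row is independent
n, k = X.shape
naive_cov = XtX_inv * (resid @ resid) / (n - k)

# Cluster-robust sandwich: sum the score vectors within each workspace before squaring
meat = np.zeros((k, k))
for g in range(n_clusters):
    rows = cluster_id == g
    score = X[rows].T @ resid[rows]
    meat += np.outer(score, score)
cluster_cov = XtX_inv @ meat @ XtX_inv

naive_se = np.sqrt(naive_cov[1, 1])
cluster_se = np.sqrt(cluster_cov[1, 1])
print(f"Naive SE:     {naive_se:.4f}")
print(f"Clustered SE: {cluster_se:.4f}")
print(f"Ratio:        {cluster_se / naive_se:.1f}x")
```

The size of the ratio depends on how strong the workspace-level shock is and how many users share each workspace, so treat the output as a demonstration of direction rather than a rule of thumb.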

Step 3: Checking the Parallel-Trends Assumption

DiD is only valid if wave 1 and wave 2 were moving in the same direction at the same rate before treatment started. You check this by plotting (or tabulating) weekly means for the two waves across the pre-treatment window.

import matplotlib.pyplot as plt
import numpy as np

df_plot = df[df.signup_week < 30].copy()
weekly = (df_plot.groupby(["signup_week", "wave"])
             .task_completed.mean()
             .reset_index()
             .pivot(index="signup_week", columns="wave", values="task_completed"))

# 3-week rolling average to smooth week-to-week sampling noise
smoothed = weekly.rolling(3, center=True, min_periods=2).mean()

TREATMENT_WEEK = 20
pre_idx = smoothed.index[smoothed.index < TREATMENT_WEEK]
post_idx = smoothed.index[smoothed.index >= TREATMENT_WEEK]

# DiD counterfactual: wave 1 pre-period mean + wave 2's post-period change
wave1_pre_mean = smoothed.loc[pre_idx, 1].mean()
wave2_pre_mean = smoothed.loc[pre_idx, 2].mean()
counterfactual = wave1_pre_mean + (smoothed.loc[post_idx, 2].values - wave2_pre_mean)

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.axvspan(-0.5, TREATMENT_WEEK, alpha=0.04, color="#94A3B8", zorder=0)
ax.axvspan(TREATMENT_WEEK, 29.5, alpha=0.06, color="#3B82F6", zorder=0)
ax.plot(smoothed.index, smoothed[2], "s--", color="#F59E0B", linewidth=2,
        markersize=4, label="Wave 2 — control (untreated during this window)", zorder=3)
ax.plot(smoothed.index, smoothed[1], "o-", color="#2563EB", linewidth=2.2,
        markersize=4, label="Wave 1 — treated (AI feature on at week 20)", zorder=4)
ax.plot(post_idx, counterfactual, ":", color="#2563EB", linewidth=2.2,
        label="Wave 1 counterfactual (projected without treatment)", zorder=4)
ax.axvline(TREATMENT_WEEK, color="#DC2626", linestyle="--", linewidth=1.8,
           label="AI feature launched (week 20)")

ax.text(9.5, 0.508, "Pre-treatment period\n(parallel trends required)",
        fontsize=9, ha="center", color="#64748B", style="italic")
ax.text(24, 0.508, "Post-treatment",
        fontsize=9, ha="center", color="#64748B", style="italic")
ax.set_xlabel("Week", fontsize=11)
ax.set_ylabel("Mean task completion rate", fontsize=11)
ax.set_title("Figure 2: Data-Driven Parallel-Trends Check\n(3-week rolling average, 50k users)",
             fontsize=12, fontweight="bold", pad=14)
ax.legend(loc="upper left", fontsize=9, framealpha=0.92)
ax.set_xlim(-0.5, 29.5)
ax.set_ylim(0.50, 0.72)
ax.grid(True, alpha=0.18, linestyle=":")
ax.tick_params(labelsize=10)
plt.tight_layout()
plt.savefig("parallel_trends.png", dpi=150, bbox_inches="tight")
print("Saved parallel_trends.png")

Expected output (Figure 2, data-driven verification):

Saved parallel_trends.png

Figure 2 is the data-driven parallel-trends check from your actual dataset, plotted as a 3-week rolling average to smooth week-to-week sampling noise. Both waves track each other closely before week 20, and small wiggles in the pre-period affect both groups at the same time, which is exactly what parallel trends looks like. After week 20, wave 1 separates cleanly above the dotted counterfactual line. The gap between the solid blue line and the dotted line in the post-treatment window is the DiD estimate playing out in your actual data.

Here's what's happening: you group by signup week and wave, compute the mean task completion rate per cell, pivot so each wave is a column, and plot the two time series together.

A vertical dashed line marks week 20 when wave 1 got treatment. In the pre-treatment window (weeks 0 to 19) the two series should track each other closely. After week 20, wave 1 should pull ahead of wave 2 by roughly the treatment effect.

To put a number on it, run a placebo regression on the pre-treatment period only. Regress the outcome on a linear time trend interacted with the treated indicator. If the interaction coefficient is near zero and insignificant, the two groups were moving in parallel before treatment:

pre_only = analysis[analysis.post == 0].copy()
pre_only["weeks_since_start"] = pre_only.signup_week - 10  # center

placebo_model = smf.ols(
    "task_completed ~ treated * weeks_since_start + C(engagement_tier)",
    data=pre_only
).fit(
    cov_type="cluster",
    cov_kwds={"groups": pre_only.workspace_id}
)

print("Pre-trend slope difference:",
      placebo_model.params["treated:weeks_since_start"])
print("p-value:",
      placebo_model.pvalues["treated:weeks_since_start"])

Expected output:

Pre-trend slope difference: -0.00095...
p-value: 0.4435...

Here's what's happening: you restrict to pre-treatment observations, fit a regression that lets wave 1 and wave 2 follow different linear trends in the pre-period, and read off the interaction coefficient.

A coefficient close to zero with p > 0.05 is consistent with the two waves moving in parallel before treatment. It can't prove parallel trends, but it finds no evidence of a differential trend. If that coefficient is large and statistically significant, the parallel-trends assumption is broken: your DiD estimate is absorbing whatever differential trend separated the groups before week 20.

If the placebo test fails, stop and rethink. Your options: restrict to a narrower pre-window where trends were parallel, find a better control group, or switch to synthetic control, which builds a weighted counterfactual from multiple untreated units.

On this synthetic dataset the placebo test passes: the pre-trend slope difference is -0.00095 with p = 0.44, so the data show no evidence against parallel trends, which supports the +0.054 estimate from step 2.

When Difference-in-Differences Fails

DiD is a precise accounting method, and every precise method has specific failure modes worth knowing before you trust its output. Here are four common ones:

1. Non-parallel Pre-trends

When the treated and control groups were already diverging before treatment started, DiD mistakes that pre-existing drift for a treatment effect.

The placebo test in step 3 is your guard. Run it every time. If it fails, you have three options:

Restrict the analysis to a shorter pre-window where trends were parallel and re-run the placebo
Find a better control group whose pre-trend matches the treated group
Switch to synthetic control, which builds a weighted counterfactual from multiple untreated units and picks the weights to match the treated group's pre-treatment trajectory
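
How large can the bias from diverging pre-trends get? The noise-free sketch below uses hypothetical weekly means with linear trends (all levels and slopes are invented) and runs the same 2x2 calculation twice: once with matching slopes, once with wave 1 already climbing faster.

```python
def did_from_weekly(treated_slope, control_slope, effect, launch_week=20, n_weeks=30):
    """Noise-free weekly means with linear trends; returns the 2x2 DiD estimate."""
    weeks = range(n_weeks)
    treated = [0.60 + treated_slope * w + (effect if w >= launch_week else 0.0)
               for w in weeks]
    control = [0.55 + control_slope * w for w in weeks]

    def mean(xs):
        return sum(xs) / len(xs)

    pre = slice(0, launch_week)
    post = slice(launch_week, n_weeks)
    treated_change = mean(treated[post]) - mean(treated[pre])
    control_change = mean(control[post]) - mean(control[pre])
    return treated_change - control_change

true_effect = 0.05
parallel = did_from_weekly(0.001, 0.001, true_effect)
diverging = did_from_weekly(0.003, 0.001, true_effect)

print(f"True effect:               {true_effect:+.3f}")
print(f"DiD with parallel trends:  {parallel:+.3f}")
print(f"DiD with diverging trends: {diverging:+.3f}")
```

In this setup the bias equals the slope gap multiplied by the distance between the midpoints of the pre and post windows (15 weeks here), so even a small weekly drift compounds into a sizable error.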

2. Staggered Adoption

A staged rollout with three or more waves demands a different approach than a clean 2x2. Wave 1 gets treated at week 20, wave 2 at week 30, wave 3 at week 40. Once wave 2 is treated, it's no longer a valid control for wave 1 comparisons that span weeks 30 and beyond. Earlier treated units start acting as controls for later ones, which contaminates the estimate.

That's the Goodman-Bacon decomposition problem, and the standard two-way fixed effects estimator from step 2 will silently absorb it. The Callaway-Sant'Anna estimator (see their 2021 paper) fixes this by averaging only the clean 2x2 comparisons and discarding the contaminated ones. The differences package in Python implements it.
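
The "clean comparison" idea can be illustrated without any library. The sketch below is a simplified, noise-free toy, not the Callaway-Sant'Anna estimator itself and not the API of the differences package: for each wave and each post-launch week, it compares the wave's change since its last pre-launch week against the same change in waves that are not yet treated. The rollout calendar, baselines, and effects are all hypothetical.

```python
launch_week = {"wave1": 20, "wave2": 30, "wave3": 40}   # hypothetical rollout calendar
true_effect = {"wave1": 0.05, "wave2": 0.03, "wave3": 0.04}
base = {"wave1": 0.60, "wave2": 0.56, "wave3": 0.52}
n_weeks = 40  # window ends before wave 3 launches, so wave 3 is never treated in it

def outcome(wave, week):
    """Noise-free weekly mean: baseline + shared trend + effect once treated."""
    treated_now = week >= launch_week[wave]
    return base[wave] + 0.001 * week + (true_effect[wave] if treated_now else 0.0)

def clean_att(wave, week):
    """Change since the last pre-launch week, minus the same change in not-yet-treated waves."""
    ref = launch_week[wave] - 1
    controls = [w for w in launch_week if launch_week[w] > week]
    if not controls:
        return None  # no clean comparison exists for this week
    control_change = sum(outcome(c, week) - outcome(c, ref) for c in controls) / len(controls)
    return (outcome(wave, week) - outcome(wave, ref)) - control_change

estimates = {}
for wave in ("wave1", "wave2"):
    atts = [clean_att(wave, t) for t in range(launch_week[wave], n_weeks)]
    atts = [a for a in atts if a is not None]
    estimates[wave] = sum(atts) / len(atts)
    print(f"{wave}: estimated {estimates[wave]:+.3f}, true {true_effect[wave]:+.3f}")
```

Because already-treated waves never enter the control set, each wave's own effect is recovered even though the effects differ across waves. A real analysis would also need standard errors and a principled way to aggregate across waves, which is what the dedicated estimator provides.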

3. Time-varying Confounders that Hit Only the Treated Group

If your marketing team runs a targeted campaign in wave 1 workspaces during week 22, you've got a treatment-specific shock DiD can't net out.

Parallel trends certifies the pre-treatment period, but the post-treatment window remains your responsibility to audit.

Check every product or marketing event inside the analysis window. If you find one, the only options are to redesign the study, restrict the analysis to the window before the shock, or model the shock explicitly as a second treatment variable.

4. Anticipation Effects

If wave 1 customers knew in week 18 that the feature was coming in week 20, some will have started behaving differently before treatment technically started: signing up more, pre-configuring settings, contacting support. That contaminates the "pre" period. The tell is a bump or dip in wave 1 in the weeks immediately before week 20 on the parallel-trends plot from step 3.

The fix is to push the pre-period cutoff back. Treat week 18 as the "treatment" start for purposes of the analysis, which removes the anticipation window from your pre-period baseline.
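
The diagnostic and the fix can be sketched on noise-free toy data. Everything here is hypothetical: wave 1 is given a +0.02 anticipation bump in weeks 18 and 19, and weeks 0 to 14 are assumed to be a clean reference window. The code flags pre-launch weeks whose treated-minus-control gap drifts from that reference, then shows how the pre-period baseline changes when the cutoff moves from week 20 to week 18.

```python
launch_week, n_weeks = 20, 30

# Hypothetical noise-free weekly means: wave 1 reacts two weeks before launch
control = [0.55 + 0.001 * w for w in range(n_weeks)]
treated = []
for w in range(n_weeks):
    level = 0.60 + 0.001 * w
    if w >= launch_week:
        level += 0.05          # treatment effect
    elif w >= launch_week - 2:
        level += 0.02          # anticipation bump in weeks 18 and 19
    treated.append(level)

gap = [t - c for t, c in zip(treated, control)]

def baseline_gap(cutoff):
    """Mean treated-minus-control gap over the weeks before `cutoff`."""
    return sum(gap[:cutoff]) / cutoff

# Compare late pre-launch weeks against an early reference window (weeks 0 to 14)
early = baseline_gap(15)
flagged = [w for w in range(15, launch_week) if abs(gap[w] - early) > 0.01]

print("Pre-launch weeks that deviate from the reference:", flagged)
print(f"Baseline gap, cutoff at week 20: {baseline_gap(20):.4f}")
print(f"Baseline gap, cutoff at week 18: {baseline_gap(18):.4f}")
```

With the cutoff at week 20 the baseline absorbs part of the anticipation response; moving it to week 18 restores the uncontaminated gap. On real data the 0.01 tolerance would need to reflect week-to-week sampling noise.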

Each of these failure modes has a diagnostic and a specific remedy. Naming them in your analysis builds credibility with skeptical reviewers. DiD is a careful accounting identity: it produces reliable estimates exactly as long as its inputs are clean.

What to Do Next

The regression DiD above is the right tool for a two-wave rollout. If your rollout has three or more waves, switch to the Callaway-Sant'Anna estimator. If your rollout crosses a treatment threshold you set deliberately (confidence scores, query complexity), look into regression discontinuity. If you want to compare a single treated unit against a constructed counterfactual, synthetic control is the right choice.

The companion notebook for this tutorial is in the GitHub repo linked in the Companion Notebook section above. Clone the repo, generate the synthetic dataset with generate_data.py, and open did_demo.ipynb to reproduce every code block with pre-saved outputs.

If you ship AI features in waves, your rollout calendar is already a DiD study. The only question is whether you run the analysis.

How to Create a GPU-Optimized Machine Image with HashiCorp Packer on GCP

Rasheedat Atinuke Jamiu — Wed, 22 Apr 2026 20:30:00 +0000

Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers and DCGM, apply OS-level GPU tuning, and fight dependency issues. It's the same ritual every time, and it burns expensive cloud credits and patience before any real work begins.

In this article, you'll build a reusable GPU-optimized machine image using Packer, pre-loaded with NVIDIA drivers, CUDA Toolkit, NVIDIA Container Toolkit, DCGM, and system-level GPU tuning like persistence mode.

Prerequisites
Project Setup
Step 1: Install Packer
Step 2: Set Up Project Directory
Step 3: Install Packer's Plugins
Step 4: Define Your Source
Step 5: Writing the Build Template
Step 6: Writing the GPU Provisioning Script
Step 7: Assembling and Running the Build
Step 8: Test the Image and Verify the GPU Stack
Conclusion
References

Prerequisites

HashiCorp Packer >= 1.9
Google Compute Packer plugin (installed via packer init)
Optionally, the AWS Packer plugin can be used for EC2 builds by adding an amazon-ebs source to node.pkr.hcl
GCP project with Compute Engine API enabled (or AWS account with EC2 access)
GCP authentication (gcloud auth application-default login) or AWS credentials
Access to an NVIDIA GPU instance type (for example, A100, H100, or L4 on GCP; p4d, p5, or g6 on AWS)

Project Setup

Step 1: Install Packer

To get started, install Packer. The steps below are for macOS with Homebrew; for Linux and Windows, follow the installation guides in the official documentation.

First, add the HashiCorp tap, a repository of all HashiCorp packages:

$ brew tap hashicorp/tap

Now, install Packer with hashicorp/tap/packer.

$ brew install hashicorp/tap/packer

Step 2: Set Up Project Directory

With Packer installed, create your project directory. To keep the code clean and the concerns separated, each part of the template gets its own file. Create these files in a packer_demo folder with the command below:

mkdir -p packer_demo/script && touch packer_demo/{build.pkr.hcl,source.pkr.hcl,variable.pkr.hcl,local.pkr.hcl,plugins.pkr.hcl,values.pkrvars.hcl} packer_demo/script/base.sh

Your file directory should look like this:

packer_demo
├── build.pkr.hcl          # Build pipeline: provisioner ordering
├── source.pkr.hcl         # GCP source definition (googlecompute)
├── variable.pkr.hcl       # Variable definitions with defaults
├── local.pkr.hcl          # Local values
├── plugins.pkr.hcl        # Packer plugin requirements
├── values.pkrvars.hcl     # Variable values (copy and customize)
└── script/
    └── base.sh            # GPU provisioning script

Step 3: Install Packer's Plugins

In your plugins.pkr.hcl file, define your plugins in the packer block. The packer {} block holds Packer settings, and its nested required_plugins block lists every plugin the template needs to build your image, along with a version constraint. If you're on Azure or AWS, check the Packer documentation for the latest version of the corresponding plugin.

packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = "~> 1"
    }
  }
}

Then, initialize your Packer plugin with the command below:

packer init .

Step 4: Define Your Source

With your plugin initialized, you can now define your source block. The source block configures a specific builder plugin, which is then invoked by a build block. Source blocks contain your project ID, the zone where your machine will be created, the source_image_family (think of this as your base image, such as Debian, Ubuntu, and so on), and your source_image_project_id.

In GCP, each public image family lives in an image project, such as "ubuntu-os-cloud" for Ubuntu. Set the machine type to a GPU machine type: you're building a base image for GPU machines, so the temporary build instance needs to be able to run your GPU installation commands.

source "googlecompute" "gpu-node" {
  project_id              = var.project_id
  zone                    = var.zone
  source_image_family     = var.image_family
  source_image_project_id = var.image_project_id
  ssh_username            = var.ssh_username
  machine_type            = var.machine_type

  image_name        = var.image_name
  image_description = var.image_description

  disk_size           = var.disk_size
  on_host_maintenance = "TERMINATE"

  tags = ["gpu-node"]
}

Setting on_host_maintenance = "TERMINATE" on Google Compute Engine makes a VM instance stop instead of live-migrating during infrastructure maintenance. This matters for GPUs and other specialized hardware that can't live-migrate.

You'll define all your variables in the variable.pkr.hcl file and set their values in values.pkrvars.hcl. Remember to always add your values.pkrvars.hcl file to .gitignore.

variable "image_name" {
  type        = string
  description = "The name of the resulting image"
}

variable "image_description" {
  type        = string
  description = "Description of the image"
}

variable "project_id" {
  type        = string
  description = "The GCP project ID where the image will be created"
}

variable "image_family" {
  type        = string
  description = "The image family to which the resulting image belongs"
}

variable "image_project_id" {
  type        = list(string)
  description = "The project ID(s) to search for the source image"
}

variable "zone" {
  type        = string
  description = "The GCP zone where the build instance will be created"
}

variable "ssh_username" {
  type        = string
  description = "The SSH username to use for connecting to the instance"
}

variable "machine_type" {
  type        = string
  description = "The machine type to use for the build instance"
}

variable "cuda_version" {
  type        = string
  description = "CUDA toolkit version"
  default     = "13.1"
}

variable "driver_version" {
  type        = string
  description = "NVIDIA driver version"
  default     = "590.48.01"
}

variable "disk_size" {
  type        = number
  description = "Boot disk size in GB"
  default     = 50
}

values.pkrvars.hcl

image_name        = "base-gpu-image-{{timestamp}}"
image_description = "Ubuntu 24.04 LTS with gpu drivers and health checks"
project_id        = "your gcp project id"
image_family      = "ubuntu-2404-lts-amd64"
image_project_id  = ["ubuntu-os-cloud"]
zone              = "us-central1-a"
ssh_username      = "packer"
machine_type      = "g2-standard-4"
disk_size         = 50
driver_version    = "590.48.01"
cuda_version      = "13.1"

Step 5: Writing the Build Template

Create build.pkr.hcl. The build block creates a temporary instance, runs provisioners, and produces an image.

Provisioners in this template are organized as follows:

The first provisioner runs system updates and upgrades.
The second provisioner reboots the instance (expect_disconnect = true).
The third provisioner waits for the instance to come back (pause_before), then runs script/base.sh. It sets max_retries to handle transient SSH timeouts and passes DRIVER_VERSION and CUDA_VERSION as environment variables.

Lastly, you have the post-processor to tell you the image ID and completion status:

build {
  sources = ["source.googlecompute.gpu-node"]

  provisioner "shell" {
    inline = [
      "set -e",
      "sudo apt update",
      "sudo apt -y dist-upgrade"
    ]
  }

  provisioner "shell" {
    expect_disconnect = true
    inline            = ["sudo reboot"]
  }

  # Base: NVIDIA drivers, CUDA, DCGM
  provisioner "shell" {
    pause_before = "60s"
    script       = "script/base.sh"
    max_retries  = 2
    environment_vars = [
      "DRIVER_VERSION=${var.driver_version}",
      "CUDA_VERSION=${var.cuda_version}"
    ]
  }

  post-processor "shell-local" {
    inline = [
      "echo '=== Image Build Complete ==='",
      "echo 'Image ID: ${build.ID}'",
      "date"
    ]
  }
}

Step 6: Writing the GPU Provisioning Script

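Now we'll go through the base script and break down some parts of it. The snippets call a log helper and read DISTRO and ARCH variables, all of which are assumed to be defined earlier in base.sh.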

Section 1: Pre-Installation (Kernel Headers)

Before installing NVIDIA drivers, the system needs kernel headers and build tools. The NVIDIA driver compiles a kernel module during installation via DKMS, so if the headers for your running kernel aren't present, the build will fail silently, and the driver won't load on boot.

log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

Section 2: Installing NVIDIA's Apt Repository

This snippet downloads and installs NVIDIA's official keyring package for your Linux distribution, which adds the trusted signing keys the system needs to verify CUDA packages.

log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq
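
The DISTRO and ARCH variables used in that URL are set near the top of base.sh. As a rough sketch of how the pieces fit together, the two helpers below (cuda_distro and cuda_keyring_url, both hypothetical names introduced for illustration) derive the distro slug from an os-release file and assemble the keyring URL from it.

```shell
# cuda_distro [OS_RELEASE_FILE]
# Prints a slug such as "ubuntu2404" from ID=ubuntu and VERSION_ID="24.04".
cuda_distro() {
  # Source the file in a subshell so ID/VERSION_ID do not leak into the caller.
  ( . "${1:-/etc/os-release}" && printf '%s%s\n' "${ID}" "${VERSION_ID}" | tr -d '.' )
}

# cuda_keyring_url DISTRO [ARCH]
# Builds the download URL using the same pattern as the snippet above.
cuda_keyring_url() {
  printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/%s/cuda-keyring_1.1-1_all.deb\n' \
    "$1" "${2:-x86_64}"
}
```

Printing the result of cuda_keyring_url "$(cuda_distro)" before calling wget makes a wrong distro slug easy to spot in the Packer log.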

Section 3: Pinning NVIDIA Drivers Version

Pinning the NVIDIA driver to a specific version ensures that the system always installs and keeps using exactly that driver version, even when newer drivers appear in the repository.

NVIDIA drivers are tightly coupled with CUDA toolkit versions, kernel versions, and container runtimes such as Docker or the NVIDIA Container Toolkit.

A mismatch, such as the system auto‑upgrading to a newer driver, can cause CUDA to stop working, break GPU acceleration, or make the machine image inconsistent across deployments.

log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

Section 4: Installing the Driver

The libnvidia-compute package installs only the compute‑related user‑space libraries (CUDA driver components), while the nvidia-dkms-open package installs the open‑source NVIDIA kernel module, built locally via DKMS.

Together, these two packages give you a fully functional CUDA driver environment without any GUI or graphics dependencies.

Here, we're using NVIDIA’s compute‑only driver stack with the open‑source kernel modules because it deliberately skips display-related components, which a headless GPU node doesn't need.

Because the kernel module is built through DKMS, this approach fits the way Linux distros manage kernel modules while staying lightweight and compute-focused.

log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

Section 5: CUDA Toolkit Installation

This part of the script installs the CUDA Toolkit for the specified version and then makes sure that CUDA’s executables and libraries are available system‑wide for every user and every shell session.

It adds CUDA binaries to PATH, so commands like nvcc, cuda-gdb, and compute-sanitizer work without specifying full paths. It also adds CUDA libraries to LD_LIBRARY_PATH, so applications can find CUDA’s shared libraries at runtime.

log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat <<'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig
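
One detail worth checking: files in /etc/profile.d are read by login shells, so a non-login context such as a systemd service may not inherit the PATH change (the ld.so.conf.d entry applies regardless). The sketch below uses a hypothetical helper, path_has_dir, to test whether a PATH-style string contains the CUDA bin directory as an exact entry.

```shell
# path_has_dir DIR [PATH_STRING]
# Succeeds when DIR is an exact entry of the colon-separated PATH_STRING
# (defaults to the current $PATH).
path_has_dir() {
  case ":${2:-$PATH}:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

For example, path_has_dir /usr/local/cuda/bin || echo "CUDA not on PATH in this shell" can be run from whichever context launches your workloads.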

Section 6: NVIDIA Container Toolkit

This block installs the NVIDIA Container Toolkit and configures it so that containers (Docker or containerd) can access the GPU safely and correctly. It’s a critical step for Kubernetes GPU nodes, Docker GPU workloads, and any system that needs GPU acceleration inside containers.

log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi
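
To confirm that the configure step took effect, you can look for the NVIDIA runtime in the runtime's config file. This is a minimal sketch under stated assumptions: has_nvidia_runtime is a hypothetical helper, and the file locations mentioned after the block are the common defaults, which may differ between toolkit versions.

```shell
# has_nvidia_runtime CONFIG_FILE
# Succeeds when the file references the NVIDIA container runtime binary.
# A missing or unreadable file counts as "not configured".
has_nvidia_runtime() {
  grep -q 'nvidia-container-runtime' "$1" 2>/dev/null
}
```

On many systems the files to check are /etc/containerd/config.toml and /etc/docker/daemon.json. A runtime that is already running typically needs a restart before it picks up the change, while a VM booted from the finished image starts with the new config in place.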

Section 7: Installing DCGM (Data Center GPU Manager)

This section covers the installation and validation of NVIDIA DCGM (Data Center GPU Manager), which is NVIDIA’s official management and telemetry framework for data center GPUs.

It offers health monitoring and diagnostics, telemetry (including temperature, clocks, power, and utilization), error reporting, and integration with Kubernetes, Prometheus, and monitoring agents. Your GPU monitoring stack relies on this.

The script extracts the installed version, checks that it meets the minimum required for NVIDIA driver 590+ (DCGM 4.3), and aborts the build if it doesn't. This prevents a mismatch between the GPU driver and DCGM, which would break monitoring and health checks. It also enables Fabric Manager when that service is installed, which is only needed on NVLink/NVSwitch multi‑GPU systems such as A100/H100 DGX servers.

log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager

DCGM_VER=$(dpkg -s datacenter-gpu-manager 2>/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] && [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi
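
The major/minor comparison above works, but it gets harder to read as more version fields come into play. As an alternative sketch (not what base.sh does), whole version strings can be compared with sort -V. The helpers strip_epoch and version_ge are hypothetical names, and sort -V is assumed to be available, as it is with GNU coreutils.

```shell
# strip_epoch VERSION
# Drops a Debian epoch prefix: "1:3.3.9" becomes "3.3.9".
strip_epoch() {
  printf '%s\n' "${1#*:}"
}

# version_ge A B
# Succeeds when A >= B under version-sort ordering.
version_ge() {
  # If B sorts first (or the two are equal), then A is at least B.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

With these, the check becomes version_ge "$(strip_epoch "$DCGM_VER")" 4.3 || error "...". On Debian and Ubuntu, dpkg --compare-versions "$DCGM_VER" ge 4.3 performs a similar comparison using Debian's own version rules.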

Section 8: Enabling Persistence Mode

The NVIDIA driver normally unloads itself when the GPU is idle. When a new workload starts, the driver must reload, reinitialize the GPU, and set up memory mappings. This adds a delay of a few hundred milliseconds to several seconds, depending on the GPU and system.

Enabling nvidia‑persistenced keeps the NVIDIA driver loaded in memory even when no GPU workloads are running.

log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

Section 9: System Tuning for GPU Compute Workloads

This block applies a set of system‑level performance and stability tunings that are standard for high‑performance GPU servers, Kubernetes GPU nodes, and ML/AI workloads.

Each line targets a specific bottleneck or instability pattern that appears in real GPU production environments.

Swap and memory behavior: Disabling swap and setting vm.swappiness=0 prevents the kernel from pushing GPU‑bound processes into swap. GPU workloads are extremely sensitive to latency, and swapping can cause CUDA context resets and GPU driver timeouts.
Hugepages for large memory allocations: Setting vm.nr_hugepages=2048 allocates a pool of hugepages, which reduces TLB pressure for large contiguous memory allocations. CUDA, NCCL, and deep‑learning frameworks frequently allocate large buffers, and hugepages reduce page‑table overhead, improving memory bandwidth and lowering latency for large tensor operations. This is especially useful on multi‑GPU servers.
CPU frequency governor: Installing cpupower and setting the CPU governor to performance keeps the CPU at maximum frequency instead of scaling down. GPU workloads often become CPU‑bound during data preprocessing, kernel launches, and NCCL communication, so keeping CPUs at full speed reduces jitter and improves throughput.
NUMA and topology tools: Installing numactl, libnuma-dev, and hwloc provides tools for pinning processes to NUMA nodes, understanding CPU–GPU affinity, and optimizing multi‑GPU placement.
Disabling irqbalance: Stopping and disabling irqbalance lets the NVIDIA driver manage interrupt affinity. On GPU servers, irqbalance can move GPU interrupts to suboptimal CPUs, causing higher latency and lower throughput.

log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system
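
It helps to know how much memory the hugepage setting reserves. The arithmetic below is a small sketch: hugepage_pool_mib is a hypothetical helper, and the 2048 KiB default page size is an assumption that holds for standard x86_64 kernels.

```shell
# hugepage_pool_mib PAGES [PAGE_SIZE_KIB]
# Prints the size of the reserved hugepage pool in MiB.
hugepage_pool_mib() {
  hp_pages="$1"
  hp_page_kib="${2:-2048}"   # assumed default: 2 MiB huge pages on x86_64
  echo $(( hp_pages * hp_page_kib / 1024 ))
}

hugepage_pool_mib 2048   # prints 4096, i.e. 4 GiB set aside
```

Memory in the pool isn't available to ordinary allocations, so size vm.nr_hugepages against the RAM of your target machine type. On a running instance, grep -i huge /proc/meminfo shows the actual page size and how many pages were reserved.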

Full base.sh script here:

#!/bin/bash
set -euo pipefail

log()   { echo "[BASE] $1"; }
error() { echo "[BASE][ERROR] $1" >&2; exit 1; }

###############################################################
###############################################################
[[ -z "${DRIVER_VERSION:-}" ]] && error "DRIVER_VERSION is not set."
[[ -z "${CUDA_VERSION:-}"   ]] && error "CUDA_VERSION is not set."

log "DRIVER_VERSION : ${DRIVER_VERSION}"
log "CUDA_VERSION   : ${CUDA_VERSION}"

DISTRO=$(. /etc/os-release && echo "${ID}${VERSION_ID}" | tr -d '.')
ARCH="x86_64"

export DEBIAN_FRONTEND=noninteractive

###############################################################
# 1. System update
###############################################################
log "Updating system packages..."
sudo apt-get update -qq
sudo apt-get upgrade -qq -y

###############################################################
# 2. Pre-installation — kernel headers
#    Source: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
###############################################################
log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

###############################################################
# 3. NVIDIA CUDA Network Repository
###############################################################
log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq

###############################################################
# 4. Pin driver version BEFORE installation (590+ requirement)
###############################################################
log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

###############################################################
# 5. Compute-only (headless) driver — Open Kernel Modules
#    Source: NVIDIA Driver Installation Guide — Compute-only System (Open Kernel Modules)
#
#    libnvidia-compute  = compute libraries only (no GL/Vulkan/display)
#    nvidia-dkms-open   = open-source kernel module built via DKMS
#
#    Open kernel modules are the NVIDIA-recommended choice for
#    Ampere, Hopper, and Blackwell data centre GPUs (A100, H100, etc.)
###############################################################
log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

###############################################################
# 6. CUDA Toolkit
###############################################################
log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat <<'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig

###############################################################
# 7. NVIDIA Container Toolkit
#    Required for GPU workloads in Docker / containerd / Kubernetes
###############################################################
log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi

###############################################################
# 8. DCGM — DataCenter GPU Manager
###############################################################
log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager
 
DCGM_VER=$(dpkg -s datacenter-gpu-manager 2>/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] && [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi

###############################################################
# 9. NVIDIA Persistence Daemon
#    Keeps the driver loaded between jobs — reduces cold-start
#    latency on the first CUDA call in each new workload
###############################################################
log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

###############################################################
# 10. System tuning for GPU compute workloads
###############################################################
log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system

###############################################################
# Done
###############################################################
log "============================================"
log "Base layer provisioning complete."
log "  OS      : ${DISTRO}"
log "  Driver  : ${DRIVER_VERSION} (open kernel modules, compute-only)"
log "  CUDA    : cuda-toolkit-${CUDA_VERSION}"
log "  DCGM    : ${DCGM_VER}"
log "============================================"

Step 7: Assembling and Running the Build

Validate the template first, then run the build. Validation catches syntax or variable errors early, so the build doesn’t start on a broken config.

packer validate -var-file=values.pkrvars.hcl .

If validation succeeds, you’ll see a short confirmation like "The configuration is valid." After that, start the build. Expect the process to create a temporary VM, run your provisioners, and produce an image:

packer build -var-file=values.pkrvars.hcl .

The build typically takes 15–20 minutes, depending on network speed and package installs. Watch the Packer log for three key checkpoints:

Instance creation — confirms the temporary VM was provisioned.
Provisioner output — shows each script step (updates, reboot, script/base.sh) and any errors.
Image creation — indicates the build finished and an image artifact was written.

If the build fails, copy the failing provisioner’s log lines, fix the script or variables, and re-run the build. For quicker troubleshooting, run the failing script directly on a matching test VM so you can iterate without waiting for a full build.
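
The validate-then-build sequence can be wrapped so that a build never starts on an invalid template and always leaves a debug log behind. This is a sketch under stated assumptions: build_image is a hypothetical function name, and packer-build.log is an arbitrary file name. PACKER_LOG and PACKER_LOG_PATH are Packer's environment variables for verbose logging.

```shell
# build_image VAR_FILE [TEMPLATE_DIR]
# Validates the template and only builds when validation passes.
build_image() {
  bi_var_file="$1"
  bi_template_dir="${2:-.}"

  # Stop here if the template or variables are invalid.
  packer validate -var-file="$bi_var_file" "$bi_template_dir" || return 1

  # Write Packer's detailed log to a file for later troubleshooting.
  PACKER_LOG=1 PACKER_LOG_PATH="packer-build.log" \
    packer build -var-file="$bi_var_file" "$bi_template_dir"
}
```

You would call it as build_image values.pkrvars.hcl from the project directory. When a provisioner fails, packer-build.log holds the full output to search through.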

googlecompute.gpu-node: output will be in this color.

==> googlecompute.gpu-node: Checking image does not exist...
==> googlecompute.gpu-node: Creating temporary RSA SSH key for instance...
==> googlecompute.gpu-node: no persistent disk to create
==> googlecompute.gpu-node: Using image: ubuntu-2404-noble-amd64-v20260225
==> googlecompute.gpu-node: Creating instance...
==> googlecompute.gpu-node: Loading zone: us-central1-a
==> googlecompute.gpu-node: Loading machine type: g2-standard-4
==> googlecompute.gpu-node: Requesting instance creation...
==> googlecompute.gpu-node: Waiting for creation operation to complete...
==> googlecompute.gpu-node: Instance has been created!
==> googlecompute.gpu-node: Waiting for the instance to become running...
==> googlecompute.gpu-node: IP: 34.58.58.214
==> googlecompute.gpu-node: Using SSH communicator to connect: 34.58.58.214
==> googlecompute.gpu-node: Waiting for SSH to become available...
systemd-logind.service
==> googlecompute.gpu-node:  systemctl restart unattended-upgrades.service
==> googlecompute.gpu-node:
==> googlecompute.gpu-node: No containers need to be restarted.
==> googlecompute.gpu-node:
==> googlecompute.gpu-node: User sessions running outdated binaries:
==> googlecompute.gpu-node:  packer @ session #1: sshd[1535]
==> googlecompute.gpu-node:  packer @ user manager service: systemd[1540]
==> googlecompute.gpu-node: Pausing 1m0s before the next provisioner...
==> googlecompute.gpu-node: Provisioning with shell script: script/base.sh
==> googlecompute.gpu-node: [BASE] DRIVER_VERSION : 590.48.01
==> googlecompute.gpu-node: [BASE] CUDA_VERSION   : 13.1
==> googlecompute.gpu-node: [BASE] Updating system packages...
==> googlecompute.gpu-node: [BASE] Installing kernel headers and build tools...
==> googlecompute.gpu-node: [BASE] Installing CUDA Toolkit 13.1...
==> googlecompute.gpu-node: [BASE] Installing DCGM...
==> googlecompute.gpu-node: [BASE] Enabling nvidia-persistenced...
==> googlecompute.gpu-node: [BASE] Applying system tuning...
==> googlecompute.gpu-node: vm.swappiness=0
==> googlecompute.gpu-node: vm.nr_hugepages=2048
==> googlecompute.gpu-node: Setting cpu: 0
==> googlecompute.gpu-node: Error setting new values. Common errors:
==> googlecompute.gpu-node: [BASE] ============================================
==> googlecompute.gpu-node: [BASE] Base layer provisioning complete.
==> googlecompute.gpu-node: [BASE]   OS      : ubuntu2404
==> googlecompute.gpu-node: [BASE]   Driver  : 590.48.01 (open kernel modules, compute-only)
==> googlecompute.gpu-node: [BASE]   CUDA    : cuda-toolkit-13.1
==> googlecompute.gpu-node: [BASE]   DCGM    : 1:3.3.9
==> googlecompute.gpu-node: [BASE] ============================================
==> googlecompute.gpu-node: Deleting instance...
==> googlecompute.gpu-node: Instance has been deleted!
==> googlecompute.gpu-node: Creating image...
==> googlecompute.gpu-node: Deleting disk...
==> googlecompute.gpu-node: Disk has been deleted!
==> googlecompute.gpu-node: Running post-processor:  (type shell-local)
==> googlecompute.gpu-node (shell-local): Running local shell script: 
==> googlecompute.gpu-node (shell-local): === Image Build Complete ===
==> googlecompute.gpu-node (shell-local): Image ID: packer-69b6c2ee-883a-3602-7bb5-059f1ba27c8b
==> googlecompute.gpu-node (shell-local): Sun Mar 15 15:50:09 WAT 2026
Build 'googlecompute.gpu-node' finished after 17 minutes 55 seconds.

==> Wait completed after 17 minutes 55 seconds

==> Builds finished. The artifacts of successful builds are:
--> googlecompute.gpu-node: A disk image was created in the 'my_project-00000' project: base-gpu-image-1773585134

Step 8: Test the Image and Verify the GPU Stack

Confirm the image exists in the GCP Console: Compute → Storage → Images and locate your newly created OS image.

Create a test VM from the image:

gcloud compute instances create my-gpu-vm \
  --machine-type=g2-standard-4 \
  --accelerator=count=1,type=nvidia-l4 \
  --image=base-gpu-image-1772718104 \
  --image-project=YOUR_PROJECT_ID \
  --boot-disk-size=50GB \
  --maintenance-policy=TERMINATE \
  --restart-on-failure \
  --zone=us-central1-a

Created [https://www.googleapis.com/compute/v1/projects/my-project-000/zones/us-central1-a/instances/my-gpu-vm].
NAME       ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP    EXTERNAL_IP      STATUS
my-gpu-vm  us-central1-a  g2-standard-4               10.128.15.227  104.154.184.217  RUNNING

Once the instance is RUNNING, connect to it and run nvidia-smi to verify that the NVIDIA driver and GPU are visible.

The nvidia-smi output confirms:

Driver 590.48.01 loaded
CUDA 13.1 available
Persistence Mode is On
The L4 GPU is detected with 23GB VRAM
Zero ECC errors
No running processes (clean idle state).

This is exactly what a healthy base image should look like. Notice Disp.A: Off? That confirms our compute-only driver choice is working — no display adapter is active.
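
The driver and persistence checks above can also be scripted, which is handy if you test every new image. The sketch below is an illustration: check_gpu_line is a hypothetical helper that inspects one line of nvidia-smi's CSV query output and compares it against an expected driver version.

```shell
# check_gpu_line EXPECTED_DRIVER CSV_LINE
# CSV_LINE is one row produced by:
#   nvidia-smi --query-gpu=driver_version,persistence_mode,name --format=csv,noheader
# Succeeds when the driver matches and persistence mode is enabled.
check_gpu_line() {
  cg_expected="$1"
  cg_driver=$(printf '%s\n' "$2" | cut -d, -f1 | tr -d ' ')
  cg_persist=$(printf '%s\n' "$2" | cut -d, -f2 | tr -d ' ')
  [ "$cg_driver" = "$cg_expected" ] && [ "$cg_persist" = "Enabled" ]
}
```

On the test VM, you would feed each line of the query output to check_gpu_line 590.48.01 and fail the smoke test on the first line that doesn't pass.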

Confirm the installed CUDA toolkit by running nvcc --version. You can see that version 13.1 was installed as specified.

Let's confirm DCGM installation by running dcgmi discovery -l. Successful output indicates DCGM is running and communicating with the driver.

Conclusion

You now have a production‑grade, GPU‑optimized base image that includes the NVIDIA compute‑only driver built with open kernel modules, DCGM for monitoring, and the CUDA Toolkit. You also applied OS‑level tuning tailored to GPU compute workloads, providing a consistent, reproducible environment with no manual setup.

From here, you can extend the build by adding an application‑layer script to install frameworks such as PyTorch, TensorFlow, or vLLM, or create an instance template that uses this image to scale your GPU infrastructure.

The full Packer project includes additional scripts for training and inference workloads that you can use to extend your image.

References

NVIDIA Driver Installation Guide (Ubuntu): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/
NVIDIA CUDA Toolkit Documentation: https://docs.nvidia.com/cuda/
NVIDIA Container Toolkit Installation Guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
NVIDIA DCGM Documentation: https://docs.nvidia.com/datacenter/dcgm/latest/index.html
NVIDIA Persistence Daemon: https://docs.nvidia.com/deploy/driver-persistence/index.html
HashiCorp Packer Documentation: https://developer.hashicorp.com/packer/docs
Packer Google Compute Builder: https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute

How to Use Context Hub (chub) to Build a Companion Relevance Engine

Nataraj Sundar — Fri, 17 Apr 2026 20:36:32 +0000

Large language models can write code quickly, but they still misremember APIs, miss version-specific details, and forget what they learned at the end of a session.

That is the problem Context Hub is trying to solve.

Context Hub (chub) gives coding agents curated, versioned documentation and skills that they can search and fetch through a CLI. It also gives them two learning loops: local annotations for agent memory and feedback for maintainers.

In this tutorial, you'll learn how the official chub workflow works, how Context Hub organizes docs and skills, how annotations and feedback create a memory loop, and how to build a companion relevance engine that improves retrieval without breaking the upstream content model.

This tutorial uses two public repositories side by side:

the official upstream project: andrewyng/context-hub
the companion implementation for this article: natarajsundar/context-hub-relevance-engine

I've also opened a corresponding upstream pull request from my fork to the main project. If you want to track that work from the article, use the upstream pull request list filtered by author: andrewyng/context-hub pull requests by natarajsundar.

What We'll Build

By the end of this tutorial, you'll have:

a clear mental model for how Context Hub works
a working local install of the official chub CLI
a repeatable workflow for search, fetch, annotations, and feedback
a companion repo that adds an additive reranking layer on top of a Context-Hub-style content tree
a small benchmark and local comparison UI you can run end to end
a clear bridge between the companion repo and the smaller upstream PR

Prerequisites

Before you start, make sure you have:

Node.js 18 or newer
npm
comfort with the terminal
basic familiarity with Markdown

Table of Contents

How to Understand Context Hub
How to Understand the Official Repo, the Companion Repo, and the Upstream PR
How to Install and Use the Official CLI
How to Understand Docs, Skills, and the Content Layout
How to Use Incremental Fetch and Layered Sources
How to Use Annotations and Feedback to Create a Memory Loop
How to See Where Relevance Still Misses
How the Companion Relevance Engine Improves Retrieval
How to Run the Companion Repo End to End
How to Read the Benchmark Honestly
How to Connect the Companion Repo to the Upstream PR
Conclusion
Sources

How to Understand Context Hub

Context Hub is easiest to understand as a workflow for turning fast-moving documentation into a reliable input for coding agents.

Instead of asking an agent to rely on whatever it remembers from training data, you give it a predictable contract:

search for the right entry
fetch the right doc or skill
write code against that curated content
save local lessons as annotations
send doc-quality feedback back to maintainers

That system boundary matters.

It makes the agent easier to audit, easier to improve, and easier to extend. It also keeps the interface small enough that you can reason about where the failures happen. If the agent still misses the answer, you can ask whether the problem happened during search, fetch, context selection, or generation.

How to Understand the Official Repo, the Companion Repo, and the Upstream PR

This tutorial is intentionally split across two codebases and one contribution path.

The official upstream project, andrewyng/context-hub, is the source of truth for the real CLI, the content model, and the documented workflows. That's the codebase you should use to learn how chub works today.

The companion repository, natarajsundar/context-hub-relevance-engine, is where the relevance ideas in this article are made concrete. It's a companion implementation, not a replacement product. Its job is to make retrieval tradeoffs visible, measurable, and easy to run locally.

The upstream PR is the bridge between those two worlds. The companion repo is where you can iterate faster on benchmarks, reranking, and the comparison UI. The upstream PR is where the smallest reviewable slices can be proposed back to the main project. You can track that thread here: upstream PR search filtered by author.

That three-part framing keeps the article honest:

use the upstream repo to understand the current system
use the companion repo to explore relevance improvements end to end
use the upstream PR to show how a larger idea can be broken into reviewable pieces

How to Install and Use the Official CLI

The official quick start is intentionally small.

npm install -g @aisuite/chub

Once the CLI is installed, you can search for what is available and fetch a specific entry:

chub search openai
chub get openai/chat --lang py

That's the happy path, but it helps to think through the request flow.

In practice, the most useful detail is that the CLI is designed for the agent to use, not just for the human to use by hand.

That's why the upstream CLI also ships a get-api-docs skill. For example, if you use Claude Code, you can copy the skill into your local project like this:

mkdir -p .claude/skills
cp $(npm root -g)/@aisuite/chub/skills/get-api-docs/SKILL.md \
  .claude/skills/get-api-docs.md

That step teaches the agent a retrieval habit:

Before you write code against a third-party SDK or API, use chub instead of guessing.

That behavioral rule is often as important as the docs themselves.

How to Understand Docs, Skills, and the Content Layout

Context Hub separates content into two categories:

docs, which answer “what should the agent know?”
skills, which answer “how should the agent behave?”

That distinction makes the content model easier to scale. Docs can be versioned and language-specific. Skills can stay short and operational.

The directory structure is also predictable. The content guide organizes entries by author, then by docs or skills, then by entry name.

A small example looks like this:

author/docs/payments/python/DOC.md
author/docs/payments/python/references/errors.md
author/skills/login-flows/SKILL.md

This is one of the reasons Context Hub is easy to work with.

The shape of the content is plain Markdown, the main entry file is predictable, and the build output is inspectable. You don't have to reverse engineer a hidden prompt layer to figure out what the agent is reading.

How to Use Incremental Fetch and Layered Sources

One of the best design choices in Context Hub is that it doesn't force you to inject every file into the model on every request.

Instead, the entry file gives you the overview, and the reference files hold the deeper material.

That lets you fetch content in progressively larger slices.

chub get stripe/webhooks --lang py
chub get stripe/webhooks --lang py --file references/raw-body.md
chub get stripe/webhooks --lang py --full

This is a token-budget feature as much as it is a documentation feature. A good agent should first load the overview, decide what part of the task matters, and only then fetch the specific supporting file.
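
The overview-first habit can be sketched as a small wrapper around the commands shown above. fetch_context is a hypothetical helper written for this illustration. It only uses the chub get flags that already appear in this section (--lang and --file).

```shell
# fetch_context ENTRY LANG [REFERENCE_FILE]
# Fetches the entry overview first, then one reference file if requested.
fetch_context() {
  fc_entry="$1"
  fc_lang="$2"
  fc_ref="${3:-}"

  # Step 1: the overview is always the cheapest slice.
  chub get "$fc_entry" --lang "$fc_lang"

  # Step 2: pull a single supporting file only when the task needs it.
  if [ -n "$fc_ref" ]; then
    chub get "$fc_entry" --lang "$fc_lang" --file "$fc_ref"
  fi
}
```

Calling fetch_context stripe/webhooks py references/raw-body.md loads the overview plus one reference file, while --full remains available for the cases where the whole entry is needed.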

Context Hub also supports layered sources. You can merge public content with your own local build output through ~/.chub/config.yaml.

A minimal configuration looks like this:

sources:
  - name: community
    url: https://cdn.aichub.org/v1
  - name: my-team
    path: /opt/team-docs/dist

That means you can keep public docs in one lane and team-specific runbooks in another lane while still giving the agent one search surface.

How to Use Annotations and Feedback to Create a Memory Loop

Context Hub has two different improvement channels.

Annotations are local. They help your agent remember what worked last time. Feedback is shared. It helps maintainers improve the docs for everyone.

That distinction matters because not every lesson belongs in the shared registry. Some lessons are environment-specific. Others point to content quality issues that should be fixed centrally.

Here is what local memory looks like in practice:

chub annotate stripe/webhooks \
  "Remember: Flask request.data must stay raw for Stripe signature verification."

And here's the feedback path:

chub feedback stripe/webhooks up

That loop is simple, but it's one of the most important ideas in the project. It turns a one-off debugging lesson into either persistent local memory or a signal that the shared docs need to improve.

How to See Where Relevance Still Misses

The upstream project already has a real ranking story. It uses BM25 and lexical rescue so that package-like identifiers, exact tokens, and fuzzy matches still have a chance to surface.

That is a strong baseline.

But developer queries are often much messier than package names.

People search for:

rrf
signin
pg vector
hnsw
raw body stripe

Those aren't “bad” queries. They're realistic shorthand.

And they expose an opportunity in the content model itself: many of the exact answers live in reference files such as references/rrf.md, references/raw-body.md, and references/hnsw.md.

So the question is not whether the current search works at all. It clearly does. The better question is this:

How can you improve retrieval without breaking the content contract that already makes Context Hub useful?

The answer in the companion repo is to keep the current model and add a reranking layer on top of it.

How the Companion Relevance Engine Improves Retrieval

The companion repository in this article is context-hub-relevance-engine.

It keeps the same broad ideas that make Context Hub attractive:

plain Markdown content
DOC.md and SKILL.md entry points
build artifacts you can inspect
local annotations and feedback
progressive fetch behavior

Then it adds one new build artifact: signals.json.

At build time, the engine extracts extra signals such as:

headings from the main file
titles and tokens from reference files
language and version metadata
source metadata and freshness
annotation overlap
feedback priors

The first pass stays cheap and transparent. The reranker only runs after the baseline has done its work.

That approach matters for two reasons.

First, it's additive. You don't have to redesign the content tree.

Second, it's measurable. You can define concrete failure modes, fix them one by one, and run the same benchmark every time you change the scorer.

How to Run the Companion Repo End to End

Open the repository on GitHub, clone it using GitHub’s normal clone flow, and then run the commands below from the project root.

cd context-hub-relevance-engine
npm install
npm run build
npm test

The repository has no third-party runtime dependencies, so npm install is mostly there to keep the workflow familiar. The main commands are all plain Node scripts.

How to Reproduce a Baseline Miss

Start with the query rrf.

node bin/chub-lab.mjs search rrf --mode baseline --lang python

Expected output:

No results.

Now run the improved mode.

node bin/chub-lab.mjs search rrf --mode improved --lang python

Expected top result:

langchain/retrievers [doc] score=320.24
  Composable retrieval patterns for hybrid search, parent documents, query expansion, and reranking.

That win happens because the improved mode looks beyond the top-level entry description. It also sees the reference file title rrf, the related terms from query expansion, and the broader token overlap in the extracted signals.
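
To make that concrete, here is a minimal sketch of how a reference-file title can rescue a query that the description alone misses. It is not the companion repo's scorer: the entry layout, the function names, and the boost value are assumptions chosen for illustration, and the numbers it produces are unrelated to the score=320.24 shown above.

```python
import re
from pathlib import PurePosixPath


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def baseline_score(query, entry):
    """Count query tokens that appear in the entry's top-level description."""
    return float(len(tokenize(query) & tokenize(entry["description"])))


def improved_score(query, entry, reference_boost=5.0):
    """Baseline score plus a boost for query tokens that match a reference file title."""
    titles = {PurePosixPath(path).stem.lower() for path in entry["references"]}
    return baseline_score(query, entry) + reference_boost * len(tokenize(query) & titles)


# Hypothetical entry: the description is quoted from the output above,
# the reference list and the dictionary shape are assumed.
entry = {
    "id": "langchain/retrievers",
    "description": "Composable retrieval patterns for hybrid search, parent "
                   "documents, query expansion, and reranking.",
    "references": ["references/rrf.md"],
}

print(baseline_score("rrf", entry))   # description never mentions "rrf"
print(improved_score("rrf", entry))   # the reference file title does
```

The design point is that the extra signal is additive: the baseline score is computed first and left untouched, and the reference-title boost is layered on top of it.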

How to Reproduce a Workflow-intent Win

Try a sign-in query.

node bin/chub-lab.mjs search signin --mode baseline
node bin/chub-lab.mjs search signin --mode improved

The baseline misses. The improved mode returns playwright-community/login-flows because the reranker treats signin, sign in, login, and authentication as related intent.
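
One way to picture that intent grouping is a small synonym table consulted before matching. The sketch below is an assumption about the general technique, not the repo's code: the table contents beyond the four terms named above, the function names, and the sample descriptions are all hypothetical.

```python
# Terms the text above says are treated as related intent.
INTENT_GROUPS = [
    {"signin", "sign in", "login", "authentication"},
]


def expand_query(query):
    """Return the normalized query plus every term that shares its intent group."""
    normalized = " ".join(query.lower().split())
    terms = {normalized}
    for group in INTENT_GROUPS:
        if normalized in group:
            terms |= group
    return terms


def matches(query, description):
    """True when any expanded term occurs in the description text."""
    text = description.lower()
    return any(term in text for term in expand_query(query))


# Hypothetical descriptions used only to exercise the functions.
login_doc = "Login flows and authentication helpers for Playwright tests"
stripe_doc = "Stripe webhook handling and signature verification"

print(matches("signin", login_doc))
print(matches("signin", stripe_doc))
```

A fixed table like this is easy to inspect and test, which fits the article's emphasis on keeping the first pass transparent.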

How to Test the Memory Loop

Write a local note:

node bin/chub-lab.mjs annotate stripe/webhooks \
  "Remember: Flask request.data must stay raw for Stripe signature verification."

Then fetch the doc:

node bin/chub-lab.mjs get stripe/webhooks --lang python

You will see the main doc content, the list of available reference files, and the appended annotation.

That's the behavior you want from an agent memory loop: learn once, reuse many times.

How to Run the Benchmark

Start from an empty store:

npm run reset-store
node bin/chub-lab.mjs evaluate

The included synthetic stress set reports the following summary with an empty store:

Mode	Top-1 Accuracy	MRR
baseline	0.333	0.333
improved	1.000	1.000
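
For readers who want to check how those two columns are computed, here is a minimal sketch of Top-1 accuracy and mean reciprocal rank (MRR). The three queries and their rankings are made-up inputs, not the contents of the repo's benchmark file, and the function names are assumptions.

```python
def top1_accuracy(rankings, expected):
    """Fraction of queries whose first result is the expected entry."""
    hits = sum(
        1 for query, ranked in rankings.items()
        if ranked and ranked[0] == expected[query]
    )
    return hits / len(rankings)


def mean_reciprocal_rank(rankings, expected):
    """Average of 1/rank of the expected entry; a miss contributes 0."""
    total = 0.0
    for query, ranked in rankings.items():
        if expected[query] in ranked:
            total += 1.0 / (ranked.index(expected[query]) + 1)
    return total / len(rankings)


# Hypothetical expectations and rankings for illustration only.
expected = {
    "rrf": "langchain/retrievers",
    "signin": "playwright-community/login-flows",
    "raw body stripe": "stripe/webhooks",
}
one_hit = {"rrf": [], "signin": [], "raw body stripe": ["stripe/webhooks"]}

print(round(top1_accuracy(one_hit, expected), 3))
print(round(mean_reciprocal_rank(one_hit, expected), 3))
```

When every hit lands at rank 1 and every other query returns nothing, the two metrics coincide, which is one pattern that would produce identical Top-1 and MRR values like those in the table.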

You can also seed the store and rerun the evaluation:

npm run seed-demo
node bin/chub-lab.mjs evaluate

That demonstrates how annotations and feedback can push relevant entries even higher when the query overlaps with the agent’s own history.

How to Launch the Local Comparison UI

npm run serve

Then open http://localhost:8787 in your browser.

The UI lets you compare baseline and improved retrieval, inspect stored annotations and feedback, rebuild the local artifacts, and rerun the benchmark from one place.

How to Read the Benchmark Honestly

The benchmark in this repo is intentionally small.

That is a feature, not a flaw.

The point is not to claim universal search quality. The point is to make a handful of realistic failure modes easy to reproduce:

acronym queries
shorthand workflow queries
reference-file topic queries
memory-aware reranking

That keeps the evaluation honest.

If a future scoring change breaks rrf, signin, or raw body stripe, you'll know immediately. And if you add a stronger dataset later, you can keep these tests as regression guards.

The benchmark files included in the repo are:

demo/benchmark.json
docs/benchmark-empty-store.json
docs/benchmark-seeded-store.json
docs/relevance-improvement-plan.md

How to Connect the Companion Repo to the Upstream PR

A good companion repo is broad enough to explore ideas quickly. A good upstream PR is narrow enough to review.

That's why the two shouldn't be identical.

The companion repository is where you can keep the full relevance story together:

the local comparison UI
the synthetic benchmark
the richer reranking signals
the debug and explain surfaces
the documentation that walks through tradeoffs end to end

The upstream PR should be smaller and more surgical. In practice, that usually means proposing the most reviewable slices first, such as:

reference-file signal extraction
explainable score output for debugging
a lightweight benchmark fixture format
one additive reranking hook behind a flag

That keeps the main repository maintainable while still letting the article and companion repo tell the full engineering story. The upstream thread for this work lives here: andrewyng/context-hub pull requests by natarajsundar.

Conclusion

What makes Context Hub interesting is not just that it stores documentation. It gives you a clear system boundary for improving coding agents.

You can inspect what the agent reads. You can decide when it should retrieve. You can layer public and private sources. You can persist local lessons. And you can improve ranking without tearing the whole model apart.

The companion relevance engine shows how to keep what already works, make one part of the system measurably better, and package the result in a way other developers can run, inspect, and extend. The upstream PR, in turn, shows how to turn a broad idea into smaller pieces that are realistic to review in the main project.

Diagram Attribution

All diagrams used in this article were created by the author specifically for this tutorial and its companion repository.

How to Build a Fashion App That Helps You Organize Your Wardrobe

Mokshita V P — Tue, 14 Apr 2026 16:26:39 +0000

I used to spend too long deciding what to wear, even when my closet was full.

That frustration made the problem feel very clear to me: it was not about having fewer clothes. It was about having better organization, better visibility, and better guidance when making outfit decisions.

So I built a fashion web app that helps users organize their wardrobe, get outfit suggestions, evaluate shopping decisions, and improve recommendations over time using feedback.

In this article, I’ll walk through what the app does, how I built it, the decisions I made along the way, and the challenges that shaped the final result.

Table of Contents
What the App Does
Why I Built It
Tech Stack
Product Walkthrough (What Users See)
How I Built It
Challenges I Faced
What I Learned
What I Want to Improve Next
Future Improvements
Conclusion

What the App Does

At a high level, the app combines six core capabilities:

Wardrobe management
Outfit recommendations
Shopping suggestions
Discard recommendations
Feedback and usage tracking
Secure multi-user accounts

Users can upload clothing items, explore suggested outfits, and mark recommendations as helpful or not helpful. They can also rate outfits and track whether items are worn, kept, or discarded.

That feedback becomes structured data for improving future recommendation quality.

Why I Built It

I wanted to create something that felt personal and actually useful. A lot of fashion apps look polished, but they do not always help with everyday decisions. My goal was to build something that could make wardrobe management easier and outfit selection less overwhelming. The app needed to do three things well:

store each user’s wardrobe data
personalize recommendations
learn from user feedback over time

That feedback loop mattered to me because it makes the app feel more alive instead of static.

Tech Stack

Here are the tools I used to build the app:

Frontend: React + Vite
Backend: FastAPI
Database: SQLite (local development)
Background jobs: Celery + Redis
Authentication: JWT (access + refresh token flow)
Deployment support: Docker and GitHub Codespaces

This ended up giving me a pretty modular setup, which helped a lot as the feature count grew: fast frontend iteration, clean API boundaries, and room to evolve recommendations separately from the UI.

Product Walkthrough (What Users See)

1. Onboarding and Account Setup

To start using the app, a user needs to register, verify their email, and complete some profile basics.

Each account is isolated, so wardrobe history and recommendations stay user-specific.

The onboarding screen above shows account creation, email verification, and profile fields for body shape, height, weight, and style preferences.

2. Wardrobe Upload

Users can upload clothing images.

Image analysis labels each item and makes it searchable for recommendations. The wardrobe upload form shows image analysis results with category, dominant color, secondary color, and pattern details listed.

3. Outfit Recommendations

Users can request recommendations, then rate outputs.

Above you can see the outfit recommendation dashboard that shows ranked outfit cards with feedback and rating actions. Recommendations are ranked by a weighted scoring model.

4. Shopping and Discard Assistants

The app evaluates new items against existing wardrobe data and flags low-value wardrobe items that may be worth removing.

You can see the recommendation scores, written reasons (not just a binary decision), and styling guidance for each item above. It also includes a "how to style it" note in case the user still wants to keep the item.

How I Built It

1. Frontend Setup (React + Vite)

I used React + Vite because I wanted fast iteration and a clean component structure.

The frontend is split into feature areas like onboarding, wardrobe management, outfits, shopping, and discarded-item suggestions. I also keep API calls in a service layer so the UI components stay focused on rendering and interaction.

The snippet below is a simplified example of the API service pattern used in the app. It is not meant to be copy-pasted as-is, but it shows the same structure the frontend uses when talking to the backend.

Example API client pattern:

export async function getOutfitRecommendations(userId, params = {}) {
  const query = new URLSearchParams(params).toString();
  const url = `/users/${userId}/outfits/recommend${query ? `?${query}` : ""}`;

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${localStorage.getItem("access_token")}`,
    },
  });

  if (!response.ok) {
    throw new Error("Failed to fetch outfit recommendations");
  }

  return response.json();
}

Here's what's happening in that snippet:

URLSearchParams builds optional query strings like occasion, season, or limit.
The request path is user-scoped, which keeps each user’s recommendations isolated.
The Authorization header sends the access token so the backend can verify the session.
The response is checked before parsing so the UI can surface a useful error if the request fails.

This pattern kept the frontend simple and reusable as the number of API calls grew.

2. Backend Architecture with FastAPI

The backend is organized around clear route groups:

auth routes for register, login, refresh, logout, and sessions
user analysis routes
wardrobe CRUD routes
recommendation routes for outfits, shopping, and discard analysis
feedback routes for ratings and helpfulness signals

One of the most important design choices was enforcing ownership checks on user-scoped resources. That prevented one user from accessing another user’s wardrobe or feedback data.

The backend snippet below is another simplified example from the app’s route layer. It shows the request validation and orchestration logic, while the actual scoring work stays in the recommendation service.

@app.get("/users/{user_id}/outfits/recommend")
def recommend_outfits(user_id: int, occasion: str | None = None, season: str | None = None, limit: int = 10):
    user = get_user_or_404(user_id)
    wardrobe_items = get_user_wardrobe(user_id)

    if len(wardrobe_items) < 2:
        raise HTTPException(status_code=400, detail="Not enough wardrobe items")

    recommendations = outfit_generator.generate_outfit_recommendations(
        wardrobe_items=wardrobe_items,
        body_shape=user.body_shape,
        undertone=user.undertone,
        occasion=occasion,
        season=season,
        top_k=limit,
    )

    return {"user_id": user_id, "recommendations": recommendations}

Here's how to read that code:

get_user_or_404 loads the profile data needed for personalization.
get_user_wardrobe fetches only the current user’s items.
The minimum wardrobe check prevents the recommendation logic from running on incomplete data.
generate_outfit_recommendations handles the scoring logic separately, which keeps the route handler small and easier to test.
The response returns the results in a shape the frontend can consume directly.

That separation helped keep the API layer readable while the recommendation logic stayed isolated in its own service.

3. Recommendation Logic

I intentionally started with deterministic rules before introducing heavy ML. That made behavior easier to debug and explain.

The outfit recommender scores combinations using weighted signals:

$$\text{outfit score} = 0.4 \cdot \text{color harmony} + 0.4 \cdot \text{body-shape fit} + 0.2 \cdot \text{undertone fit}$$

The snippet below is a simplified example from the recommendation engine. It shows how the app combines multiple signals into a single score:

def score_outfit(combo, user_context):
    color_score = color_harmony.score(combo)
    shape_score = body_shape_rules.score(combo, user_context.body_shape)
    undertone_score = undertone_rules.score(combo, user_context.undertone)

    total = 0.4 * color_score + 0.4 * shape_score + 0.2 * undertone_score
    return round(total, 3)

The logic behind this approach is straightforward:

color harmony helps the outfit feel visually coherent
body-shape scoring helps the outfit feel flattering
undertone scoring helps the colors work better with the user’s profile

I used a similar structure for discard recommendations and shopping suggestions, but with different factors and thresholds.
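
As a worked example of the weighted formula, the sketch below plugs plain numbers into the same 0.4 / 0.4 / 0.2 weights and ranks a few candidates. The individual signal values and outfit names are invented for illustration, and the helper names are assumptions rather than functions from the app.

```python
WEIGHTS = {"color": 0.4, "shape": 0.4, "undertone": 0.2}


def weighted_outfit_score(color, shape, undertone):
    """Combine three signals in [0, 1] using the weights from the formula above."""
    total = (
        WEIGHTS["color"] * color
        + WEIGHTS["shape"] * shape
        + WEIGHTS["undertone"] * undertone
    )
    return round(total, 3)


def rank_outfits(candidates):
    """Sort (name, color, shape, undertone) tuples by score, best first."""
    scored = [(name, weighted_outfit_score(c, s, u)) for name, c, s, u in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented signal values for three hypothetical outfits.
candidates = [
    ("navy blazer + white tee", 0.9, 0.5, 0.7),
    ("olive jacket + black jeans", 0.6, 0.8, 0.4),
    ("red dress + tan boots", 0.3, 0.9, 0.9),
]

for name, score in rank_outfits(candidates):
    print(name, score)
```

Because color harmony and body-shape fit each carry twice the weight of undertone fit, an outfit with a weak undertone match can still rank well if the other two signals are strong.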

4. Authentication and Secure Multi-user Design

Security was one of the most important parts of this build.

I implemented:

short-lived access tokens
refresh tokens with JTI tracking
token rotation on refresh
session revocation (single session and all sessions)
email verification and password reset flows

The snippet below is a simplified example of the refresh-token lifecycle used in the app. It shows the important control points rather than every helper function:

def refresh_access_token(refresh_token: str):
    payload = decode_jwt(refresh_token)
    jti = payload["jti"]

    token_record = db.get_refresh_token(jti)
    if not token_record or token_record.revoked:
        raise AuthError("Invalid refresh token")

    new_refresh, new_jti = issue_refresh_token(payload["sub"])
    token_record.revoked = True
    token_record.replaced_by_jti = new_jti

    new_access = issue_access_token(payload["sub"])
    return {"access_token": new_access, "refresh_token": new_refresh}

What this code is doing:

It decodes the refresh token and looks up its JTI in the database.
It rejects reused or revoked sessions, which helps prevent replay attacks.
It rotates the refresh token instead of reusing it.
It issues a fresh access token so the session stays valid without forcing the user to log in again.

This design made multi-device sessions safer and gave me server-side control over logout behavior.
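
The rotation rule can be exercised without a database or a JWT library. The sketch below is a simplified in-memory stand-in, not the app's code: the class name, the record fields, and the use of PermissionError are assumptions, and a bare JTI string stands in for a signed refresh token.

```python
import uuid


class RefreshTokenStore:
    """Tracks refresh-token JTIs so the server can revoke and rotate them."""

    def __init__(self):
        self._records = {}

    def issue(self, user_id):
        """Create a new JTI record for the user and return the JTI."""
        jti = uuid.uuid4().hex
        self._records[jti] = {"user_id": user_id, "revoked": False, "replaced_by": None}
        return jti

    def rotate(self, jti):
        """Exchange a valid JTI for a new one; reject unknown or revoked JTIs."""
        record = self._records.get(jti)
        if record is None or record["revoked"]:
            raise PermissionError("Invalid refresh token")
        new_jti = self.issue(record["user_id"])
        record["revoked"] = True
        record["replaced_by"] = new_jti
        return new_jti

    def revoke_all(self, user_id):
        """Log the user out everywhere by revoking every JTI they hold."""
        for record in self._records.values():
            if record["user_id"] == user_id:
                record["revoked"] = True


store = RefreshTokenStore()
first = store.issue(user_id=1)
second = store.rotate(first)  # first is now revoked and points at second
```

Presenting `first` again after the rotation raises an error, which is the replay protection described above, and `revoke_all` is the mechanism behind a "log out of all sessions" action.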

5. Background Jobs for Long-running Operations

Image analysis can be expensive, especially when the app needs to classify clothing, analyze colors, and estimate body-shape-related signals. To keep the request path responsive, I added Celery + Redis support for background tasks.

That gave the app two modes:

synchronous processing for simpler local development
queued processing for heavier or slower jobs

That tradeoff mattered because it let me keep the developer experience simple without blocking the app during more expensive work.

6. Data Model and Feedback Capture

A recommendation system only improves if it captures the right signals.

So I added dedicated feedback tables for:

outfit ratings (1-5 + optional comments)
recommendation helpful/unhelpful feedback
item usage actions (worn/kept/discarded)

Here is the shape of one of those models:

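from datetime import datetime

from sqlalchemy import Boolean, Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

# Declarative base; a real project typically defines this once and imports it.
Base = declarative_base()


class RecommendationFeedback(Base):
    __tablename__ = "recommendation_feedback"

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.id"), nullable=False)
    recommendation_type = Column(String(50), nullable=False)
    recommendation_id = Column(Integer, nullable=False)
    helpful = Column(Boolean, nullable=False)
    created_at = Column(DateTime, default=datetime.utcnow)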

How to read this model:

user_id ties feedback to the person who gave it.
recommendation_type tells me whether the feedback belongs to outfits, shopping, or discard suggestions.
recommendation_id identifies the exact recommendation.
helpful stores the user’s direct response.
created_at makes it possible to analyze feedback trends over time.

This part of the system gives the app a real learning foundation, even though the feedback-to-model-update loop is still a future improvement.

Challenges I Faced

This was the section that taught me the most.

1. Image-heavy endpoints were slower than I wanted

The analyze and wardrobe upload flows were doing a lot of work at once: image validation, classification, color extraction, storage, and database writes.

At first, that made the request flow feel heavier than it should have.

What I changed:

I bounded concurrent image jobs so the app wouldn't try to do too much at once.
I separated slower jobs into background processing where possible.
I used load-test results to confirm which endpoints were actually expensive.

The practical effect was that heavy image requests stopped competing with each other so aggressively. Instead of letting many expensive tasks pile up inside the same request cycle, I limited the active work and pushed slower operations into the queue when needed.

Why this fixed it:

Bounding concurrency prevented the system from overloading CPU-bound tasks.
Moving expensive work into async jobs kept the main request/response cycle more responsive.
Load testing gave me evidence instead of guesswork, so I could tune the system based on real performance behavior.

In other words, I didn't just “optimize” the endpoint in theory. I changed the execution model so expensive analysis could not block every other request behind it.
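
The article does not show how the concurrency bound was implemented, so the following is only a sketch of one common way to do it, using asyncio.Semaphore. The function names, the limit of two, and the sleep that stands in for image analysis are all assumptions.

```python
import asyncio


async def analyze_image(name, limiter, stats):
    """Run one simulated analysis while holding a slot from the limiter."""
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the expensive analysis step
        stats["active"] -= 1
    return f"{name}: analyzed"


async def analyze_batch(names, max_concurrent=2):
    """Analyze every image, never running more than max_concurrent at once."""
    limiter = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(analyze_image(name, limiter, stats) for name in names)
    )
    return results, stats["peak"]


results, peak = asyncio.run(
    analyze_batch(["shirt.jpg", "jeans.jpg", "coat.jpg", "scarf.jpg", "boots.jpg"])
)
print(len(results), "images analyzed, peak concurrency", peak)
```

A semaphore like this only caps how many jobs are in flight. Truly CPU-bound work would still need to run in a worker process or a task queue, which is the role Celery plays in the setup described above.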

2. JWT sessions needed real server-side control

A basic JWT setup is easy to get working, but it becomes less useful if you cannot revoke sessions or manage multiple devices cleanly.

What I changed:

I stored refresh tokens in the database.
I tracked token JTI values.
I rotated refresh tokens when users refreshed their session.
I added endpoints for logging out a single session or all sessions.

The important shift here was moving from “token exists, therefore session is valid” to “token exists, matches the database record, and has not been revoked or replaced.” That gave the server the authority to invalidate old sessions immediately.

Why this fixed it:

Server-side token tracking made revocation possible.
Rotation reduced the chance of token reuse.
Session management became visible to the user, which made the app feel more trustworthy.

This is what made logout-all and multi-device management work in a real way instead of just being cosmetic UI actions.

3. User data isolation had to be explicit

Because this is a multi-user app, I had to be careful that one account could never accidentally see another account’s wardrobe data.

What I changed:

I added ownership checks to user-scoped routes.
I kept all wardrobe and feedback queries filtered by user_id.
I used encrypted image storage instead of exposing raw paths.

In practice, this meant every route had to ask the same question: “Does this user own the resource they are trying to access?” If the answer was no, the request stopped immediately.

Why this fixed it:

Ownership checks made data access rules explicit.
User-filtered queries prevented accidental cross-account reads.
Encrypted storage improved privacy and reduced the risk of exposing image data directly.

That combination is what kept wardrobe data, feedback history, and images separated correctly across accounts.
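
The ownership question can be reduced to a small guard that every user-scoped handler calls. This is a framework-free sketch under assumed names (Forbidden, require_owner, and the dictionary shape of a wardrobe item are hypothetical), not the app's actual route code.

```python
class Forbidden(Exception):
    """Raised when a user requests a resource they do not own."""


def require_owner(current_user_id, resource):
    """Return the resource only if it belongs to the requesting user."""
    if resource["user_id"] != current_user_id:
        raise Forbidden("You do not have access to this resource")
    return resource


def list_wardrobe(current_user_id, items):
    """Filter by user_id so cross-account rows never leave the data layer."""
    return [item for item in items if item["user_id"] == current_user_id]


# Hypothetical rows for two different accounts.
items = [
    {"id": 1, "user_id": 10, "name": "denim jacket"},
    {"id": 2, "user_id": 20, "name": "wool coat"},
]

print(list_wardrobe(10, items))
print(require_owner(20, items[1])["name"])
```

Keeping the check in one function means a new route cannot quietly skip it: the handler either calls the guard or visibly does not.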

4. The multi-service setup had to be reproducible

The app includes the frontend, backend, Redis, Celery worker, and Celery Beat, so the challenge here was making the setup feel reproducible instead of fragile.

What I changed:

I defined the stack in Docker Compose.
I documented the required environment variables.
I kept the dev stack aligned with how the app runs in practice.

This removed a lot of setup ambiguity. Instead of asking someone to manually figure out how the frontend, backend, Redis, and workers fit together, I made the stack describe itself.

Why this fixed it:

Docker let contributors start the project with fewer manual steps.
Clear environment configuration reduced setup mistakes.
Matching the stack to the architecture made the app easier to understand and test.

That was important because the app depends on several moving parts, and the simplest way to make the project approachable was to make startup behavior predictable.

What I Learned

This project taught me a few important lessons:

Small features become much more valuable when they work together.
Feedback data is one of the strongest signals for improving recommendations.
Clean data modeling matters a lot when multiple users are involved.
Docker and clear setup instructions make a project much easier for other people to try.

I also learned that a project does not need to be huge to be useful. A focused app that solves one problem well can still feel meaningful.

What I Want to Improve Next

My roadmap from here:

Integrate feedback directly into ranking updates
Add visual analytics for recommendation quality trends
Improve mobile UX parity
Deploy with persistent cloud storage and production database defaults
Provide a public demo mode for easier evaluation

Future Improvements

There are still a few things I would like to add later:

a more advanced recommendation engine
visual analytics for user feedback
better mobile support
live deployment with persistent cloud storage
a public demo mode for easier testing

Conclusion

This project began as a personal frustration and turned into a full web application with authentication, wardrobe storage, recommendation logic, and feedback infrastructure.

The most rewarding part was seeing how practical software decisions, not just flashy UI, can help people make everyday choices faster.

If you want to explore or run the project, check out the repo. You can try the flows and share feedback. I would especially love input on recommendation quality, UX clarity, and what features would make this genuinely useful in daily life.

How the Mixture of Experts Architecture Works in AI Models

Manish Shivanandhan — Tue, 07 Apr 2026 17:18:05 +0000

Artificial intelligence (AI) has seen remarkable advancements over the years, with AI models growing in size and complexity.

Among the innovative approaches gaining traction today is the Mixture of Experts (MoE) architecture. This method optimizes AI model performance by distributing processing tasks across specialized subnetworks known as “experts.”

In this article, we’ll explore how this architecture works, the role of sparsity, routing strategies, and its real-world application in the Mixtral model. We’ll also discuss the challenges these systems face and the solutions developed to address them.

We'll Cover:

Understanding the Mixture of Experts (MoE) Approach
The Role of Sparsity in AI Models
The Art of Routing in MoE Architectures
Load Balancing Challenges and Solutions
Real-World Application: The Mixtral Model
Conclusion

Understanding the Mixture of Experts (MoE) Approach

The Mixture of Experts (MoE) is a machine learning technique that divides an AI model into smaller, specialized networks, each focusing on specific tasks.

This is akin to assembling a team where each member possesses unique skills suited for particular challenges.

The idea isn't new. It dates back to a groundbreaking 1991 paper that highlighted the benefits of having separate networks specialize in different training cases.

Fast forward to today, and MoE is experiencing a resurgence, particularly among large language models, which utilize this approach to enhance efficiency and effectiveness.

At its core, this system comprises several components: an input layer, multiple expert networks, a gating network, and an output layer.

The gating network serves as a coordinator, determining which expert networks should be activated for a given task.

By doing so, MoE significantly reduces the need to engage the entire network for every operation. This improves performance and reduces computational overhead.

The Role of Sparsity in AI Models

An essential concept within MoE architecture is sparsity, which refers to activating only a subset of experts for each processing task.

Instead of engaging all network resources, sparsity ensures that only the relevant experts and their parameters are used. This targeted selection significantly reduces computation needs, especially when dealing with complex, high-dimensional data such as natural language processing tasks.

Sparse models excel because they allow for specialized processing. For example, different parts of a sentence may require distinct types of analysis: one expert might be adept at understanding idioms, while another could specialise in parsing complex grammar structures.

By activating only the necessary experts, MoE models can provide more precise and efficient analysis of the input data.

The Art of Routing in MoE Architectures

Routing is another critical component of the Mixture of Experts model.

The gating network plays a crucial role here, as it determines which experts to activate for each input. A successful routing strategy ensures that the network is capable of selecting the most suitable experts, optimizing performance and maintaining balance across the network.

Typically, the routing process involves predicting which expert will provide the best output for a given input. This prediction is made based on the strength of the connection between the expert and the data.

One popular strategy is the “top-k” routing method, where the k most suitable experts are chosen for a task. In practice, a variant known as “top-2” routing is often used, activating the best two experts, which balances effectiveness and computational cost.
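
Top-k routing is easier to follow with numbers. The sketch below is a toy version in plain Python, assuming the gate has already produced one score per expert. Renormalizing the selected scores with a softmax is one common choice rather than the only one, and the experts here are simple stand-in functions, not neural networks.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def top_k_route(gate_scores, k=2):
    """Pick the k highest-scoring experts and weight them against each other."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(x, gate_scores, experts, k=2):
    """Run only the chosen experts and blend their outputs by routing weight."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(gate_scores, k))


# Four stand-in experts; a real expert would be a feed-forward network.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
gate_scores = [0.1, 2.0, -1.0, 1.0]

print(top_k_route(gate_scores))
print(moe_layer(3.0, gate_scores, experts))
```

With these scores only experts 1 and 3 run, so half of the experts are skipped entirely for this input. That skipped work is the compute saving the routing step is meant to deliver.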

Load Balancing Challenges and Solutions

While MoE models have clear advantages, they also introduce specific challenges, particularly regarding load balancing.

The potential issue is that the gating network might consistently select only a few experts, leading to an uneven distribution of tasks. This imbalance can result in some experts being over-utilised and, consequently, over-trained, while others remain underutilised.

To address this challenge, researchers have developed “noisy top-k” gating, a technique introducing Gaussian noise to the selection process. This introduces an element of controlled randomness, promoting a more balanced activation of experts.

By distributing the workload more evenly across experts, this approach mitigates the risk of inefficiencies and ensures that the entire network remains effective.
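
The effect of the noise can be seen in a small simulation. This sketch makes two simplifying assumptions: the gate scores are fixed, and the noise has a fixed standard deviation rather than one the model learns. The function names and the example scores are hypothetical.

```python
import random


def noisy_top_k(gate_scores, k=2, noise_std=1.0, rng=random):
    """Add Gaussian noise to each gate score, then keep the k highest."""
    noisy = [score + rng.gauss(0.0, noise_std) for score in gate_scores]
    ranked = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)
    return ranked[:k]


def selection_counts(gate_scores, trials, k=2, noise_std=1.0, seed=0):
    """Count how often each expert is selected across repeated routing decisions."""
    rng = random.Random(seed)
    counts = [0] * len(gate_scores)
    for _ in range(trials):
        for index in noisy_top_k(gate_scores, k, noise_std, rng):
            counts[index] += 1
    return counts


gate_scores = [2.0, 1.5, 1.0, 0.5]

# Without noise the same two experts win every time.
print(selection_counts(gate_scores, trials=1000, noise_std=0.0))
# With noise the lower-scored experts also receive some of the traffic.
print(selection_counts(gate_scores, trials=1000, noise_std=1.0))
```

The noise does not make routing uniform. Higher-scoring experts are still chosen more often, but experts that would otherwise never be selected now receive some inputs.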

What Actually Happens During an MoE Inference

To make the Mixture of Experts architecture more concrete, it helps to walk through what happens during a single request.

Consider a prompt like:

“Explain why startups fail due to poor cash flow management.”

In a traditional dense model, every layer and every parameter contribute to generating the response. In an MoE model, the process is more selective.

As the input is processed, each layer passes the token representations to the gating network. This component evaluates all available experts and assigns them scores based on how relevant they are to the input. Instead of activating the full network, the model selects only the top-k experts (commonly two).

For this example, the gating network might select:

One expert specialized in financial reasoning
Another expert better at structuring causal explanations

Only these selected experts process the input, producing intermediate outputs that are then combined and passed to the next layer. The rest of the experts remain inactive for that token.

This selection and combination process repeats across layers, meaning that at any given point, only a small fraction of the model’s total parameters are being used.

The result is a system that behaves like a large, highly capable model, but executes more like a smaller one in terms of compute. This is the practical advantage of MoE: it doesn’t just improve model capacity, it ensures that capacity is used selectively and efficiently for each request.

Real-World Application: The Mixtral Model

A compelling example of the Mixture of Experts architecture in action is the Mixtral model. This open-source large language model exemplifies how MoE can enhance efficiency in processing tasks.

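Each layer of the Mixtral 8x7B model contains eight experts. Despite the name, the experts don't each hold seven billion parameters: only the feed-forward blocks are replicated, so the model totals roughly 47 billion parameters and uses roughly 13 billion of them per token. As the model processes each token of input data, the gating network selects the two most suitable experts. These experts handle the task, and their outputs are combined before moving to the next model layer.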

This approach allows Mixtral to deliver high performance despite its seemingly modest size for a large language model. By efficiently utilising resources and ensuring specialised processing, Mixtral stands as a testament to the potential of MoE architectures in advancing AI technology.

Conclusion

The Mixture of Experts architecture represents a significant step forward in developing efficient AI systems. With its focus on specialised processing and resource optimisation, MoE offers numerous benefits, particularly for large-scale language models.

Key concepts like sparsity and effective routing ensure that these models can handle complex tasks with precision, while innovations like noisy top-k gating address the common challenges of load balancing.

Despite its complexity and the need for careful tuning, the MoE approach remains promising in elevating AI model performance. As AI continues to advance, architectures like MoE could play a crucial role in powering the next generation of intelligent systems, offering improved efficiency and specialised processing capabilities.

Hope you enjoyed this article. Sign up for my free newsletter to get more articles delivered to your inbox. You can also connect with me on LinkedIn.

How to Use MLflow to Manage Your Machine Learning Lifecycle

Temitope Oyedele — Mon, 23 Mar 2026 18:52:44 +0000

Training machine learning models usually starts out being organized and ends up in absolute chaos.

We’ve all been there: dozens of experiments scattered across random notebooks, and model files saved as model_v2_final_FINAL.pkl because no one is quite sure which version actually worked.

Once you move from a solo project to a team, or try to push something to production, that "organized chaos" quickly becomes a serious bottleneck.

Solving this mess requires more than just better naming conventions: it requires a way to standardize how we track and hand off our work. This is the specific gap MLflow was built to fill.

Originally released by the team at Databricks in 2018, it has become a standard open-source platform for managing the entire machine learning lifecycle. It acts as a central hub where your experiments, code, and models live together, rather than being tucked away in forgotten folders.

In this tutorial, we'll cover the core philosophy behind MLflow and how its modular architecture solves the 'dependency hell' of machine learning. We'll break down the four primary pillars of Tracking, Projects, Models, and the Model Registry, and walk through a practical implementation of each so you can move your projects from local notebooks to a production-ready lifecycle.

Prerequisites:

To get the most out of this tutorial, you should have:

Basic Python proficiency: Comfort with context managers (with statements) and decorators.
Machine Learning fundamentals: A general understanding of training/testing splits and model evaluation metrics (like accuracy or loss).
Local Environment: Python 3.8+ installed. Familiarity with pip or conda for installing packages is helpful.

MLflow Architecture: The Big Picture

To understand why MLflow is so effective, you have to look at how it's actually put together. MLflow isn't one giant or rigid tool. It’s a modular system designed around four loosely coupled components that are its core pillars.

This is a big deal because it means you don’t have to commit to the entire ecosystem at once. If you only need to track experiments and don't care about the other features, you can just use that part and ignore the rest.

To make this a bit more concrete, here is how those pieces map to things you probably already use:

MLflow Tracking: Logs experiments, metrics, and parameters. (Think: Git commits for ML runs)
MLflow Projects: Packages code for reproducibility. (Think: A Docker image for ML code)
MLflow Models: A standard format for multiple frameworks. (Think: A universal adapter)
Model Registry: Handles versioning and governing models. (Think: A CI/CD pipeline for models)

Architecturally, you can think of MLflow in two layers: the Client and the Server.

The Client is where you spend most of your time. It’s your training script or your Jupyter notebook where you log metrics or register a model.

The Server is the brain in the background that handles the storage. It consists of a Tracking Server, a Backend Store (usually a database like PostgreSQL), and an Artifact Store (such as S3 or GCS), where big files like model weights live.

This separation is why MLflow is so flexible. You can start with everything running locally on your laptop using just your file system. When you're ready to scale up to a larger team, you can swap that out for a centralized server and cloud storage with almost no changes to your actual code. It grows with your project instead of forcing you to start over once things get serious.

Now, let's look at each of these four pillars of MLflow so you understand how they work.

Understanding MLflow Tracking

For most teams, the Tracking component is the front door to MLflow. Its job is simple: it acts as a digital lab notebook that records everything happening during a training run.

Instead of you frantically trying to remember what your learning rate was or where you saved that accuracy plot, MLflow keeps everything your training code logs in one place.

The core unit here is the run. Think of a run as a single execution of your training code. During that run, the architecture captures four specific types of information:

Parameters: Your inputs, like batch size or the number of trees in a forest.
Metrics: Your outputs, like accuracy or loss, which can be tracked over time.
Artifacts: The "heavy" stuff, such as model weights, confusion matrices, or images.
Tags and Metadata: Context like which developer ran the code and which Git commit was used.

A Tracking Example

Seeing this in practice is the best way to understand how the architecture actually works. You don't need to rebuild your entire pipeline – you just wrap your training logic in a context manager.

Here is what a basic integration looks like in Python:

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Example data so the snippet runs on its own; swap in your own dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# This block opens the run and keeps things organized
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)

    # Train the model
    model = RandomForestClassifier(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)

    # Log metrics
    accuracy = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", accuracy)

    # Log the model as an artifact
    mlflow.sklearn.log_model(model, "random_forest_model")

The mlflow.start_run() context manager creates a new run and automatically closes it when the block exits. Everything logged inside that block is associated with that run: the parameters and metrics are stored in the Backend Store, and the model itself is saved to the Artifact Store.

Where Does the Data Actually Go?

When you’re just starting out on your laptop, MLflow keeps things simple by creating a local ./mlruns directory. The real power shows up when you move to a team environment and point everyone to a centralized Tracking Server.

The system splits the data based on how "heavy" it is. Your structured data (parameters and metrics) is small and needs to be searchable, so it goes into a SQL database like PostgreSQL. Your unstructured data (the actual model files or large plots) is too bulky for a database. The architecture ships that off to an Artifact Store like Amazon S3 or Google Cloud Storage.

Why Bother with This Setup?

Relying on "vibes" and messy naming conventions is a recipe for disaster once your project grows. It might work for a day or two, but it falls apart the moment you need to compare twenty different versions of a model.

By separating the tracking into its own architectural pillar, MLflow gives you a queryable history. Instead of digging through old notebooks, you can just hop into the UI, filter for the best results, and see exactly which configuration got you there. It takes the guesswork out of the "science" part of data science.
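The split between small, searchable values and bulky files can be pictured with a toy version built from the standard library. This is only a sketch of the idea, not how MLflow is implemented: the table layout, the log_run helper, and the three runs are invented for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path

# Toy stand-ins: a SQL table for small, searchable values
# and a folder on disk for bulky files.
artifact_root = Path(tempfile.mkdtemp())
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE runs (run_id TEXT, n_estimators INTEGER, "
    "max_depth INTEGER, accuracy REAL)"
)

def log_run(run_id, n_estimators, max_depth, accuracy, model_bytes):
    """Hypothetical helper that records one run in both stores."""
    # Structured data goes into a database row (the Backend Store role).
    db.execute("INSERT INTO runs VALUES (?, ?, ?, ?)",
               (run_id, n_estimators, max_depth, accuracy))
    # Unstructured data goes into a file (the Artifact Store role).
    run_dir = artifact_root / run_id
    run_dir.mkdir()
    (run_dir / "model.bin").write_bytes(model_bytes)

# Made-up runs, purely for illustration.
log_run("run_a", 100, 5, 0.91, b"weights-a")
log_run("run_b", 200, 5, 0.94, b"weights-b")
log_run("run_c", 200, 10, 0.93, b"weights-c")

# "Filter for the best results": runs above a threshold, best first.
best = db.execute(
    "SELECT run_id, n_estimators, max_depth, accuracy FROM runs "
    "WHERE accuracy > 0.92 ORDER BY accuracy DESC"
).fetchall()
print(best)
```

Because the metrics live in a queryable table, finding the winning configuration is a single query, and the matching model file is looked up by run ID. In MLflow itself, the UI and the mlflow.search_runs function play this role.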

Understanding MLflow Projects

You can train the most accurate model in the world, but if your colleague can’t reproduce your results on their machine, that model isn't worth much.

This is where MLflow Projects come in. They solve the reproducibility headache by providing a standard way to package your code, your dependencies, and your entry points into one neat bundle.

Think of an MLflow Project as a directory (or a Git repo) with a special "instruction manual" at its root called an MLproject file. This file tells anyone (or any server) exactly what environment is needed and how to kick off the execution.

The MLproject File

Instead of sending someone a long README with installation steps, you just give them this file. Here is what a typical MLproject setup looks like for a training pipeline:

name: my_ml_project
conda_env: conda.yaml

entry_points:
  train:
    parameters:
      learning_rate: {type: float, default: 0.01}
      epochs: {type: int, default: 50}
      data_path: {type: str}
    command: "python train.py --lr {learning_rate} --epochs {epochs} --data {data_path}"
  
  evaluate:
    parameters:
      model_path: {type: str}
    command: "python evaluate.py --model {model_path}"

The conda_env line points to a conda.yaml file that lists the exact Python packages and versions your code needs. If you want even more isolation, MLflow supports Docker environments too.

The beauty of this setup is the simplicity. Anyone with MLflow installed can run your entire project with a single command:

mlflow run . -P learning_rate=0.001 -P epochs=100 -P data_path=./data/train.csv
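On the receiving end, train.py has to accept the flags that the entry point's command template fills in. The original doesn't show that script, so the following is a hypothetical sketch of its argument handling using argparse. The flag names mirror the command above, and everything else is an assumption.

```python
import argparse

def parse_args(argv=None):
    """Parse the flags that the MLproject `train` command passes to the script."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    parser.add_argument("--lr", type=float, default=0.01, help="learning rate")
    parser.add_argument("--epochs", type=int, default=50, help="number of training epochs")
    parser.add_argument("--data", type=str, required=True, help="path to the training data")
    return parser.parse_args(argv)

# Simulate the command line that results from the -P values shown above.
args = parse_args(["--lr", "0.001", "--epochs", "100", "--data", "./data/train.csv"])
print(f"Training for {args.epochs} epochs at lr={args.lr} on {args.data}")
```

Keeping the defaults in the script aligned with the defaults in the MLproject file means the code behaves the same whether it is launched through mlflow run or directly with python.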

Why this Actually Matters

MLflow Projects really shine in two specific scenarios. The first is onboarding. A new team member can clone your repo and be up and running in minutes, rather than spending their entire first day debugging library version conflicts.

The second is CI/CD. Because these projects are triggered programmatically, they fit perfectly into automated retraining pipelines. When reproducibility is non-negotiable, having a "single source of truth" for how to run your code makes life a lot easier for everyone involved.

Understanding the MLflow Model Registry

Tracking experiments tells you which model is the "winner," but the Model Registry is where you actually manage that winner’s journey from your notebook to a live production environment.

Think of it as the governance layer. It handles versioning and stage management, and it creates a clear audit trail so you never have to guess which model is currently running in the wild.

The Registry uses a few simple concepts to keep things organized:

Registered Model: This is the overall name for your project, like CustomerChurnPredictor.
Model Version: Every time you push a new iteration, MLflow auto-increments the version (v1, v2, and so on).
Stage: These are labels like Staging, Production, or Archived. They tell your team exactly where a model stands in its lifecycle.
Annotations: These are just notes and tags. They’re great for documenting why a specific version was promoted or what its quirks are.

Moving a Model through the Pipeline

In a real-world workflow, you don't just "deploy" a file. You transition it through stages. Here's how that looks using the MLflow Client:

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

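# run_id is the ID of the tracking run that logged the model, for example
# run.info.run_id captured via `with mlflow.start_run() as run:`

# First, we register the model from a run that went well
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/random_forest_model",
    name="CustomerChurnPredictor"
)

# Then, we move the new version (Version 1 on first registration) to Staging
# so the QA team can look at it.
# Note: newer MLflow releases deprecate stages in favor of aliases.
client.transition_model_version_stage(
    name="CustomerChurnPredictor",
    version=result.version,
    stage="Staging"
)

# Once everything checks out, we promote it to Production
client.transition_model_version_stage(
    name="CustomerChurnPredictor",
    version=result.version,
    stage="Production"
)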

Why Does This Matter?

The Model Registry solves a problem that usually gets messy the moment a team grows: knowing exactly which version is live, who approved it, and what it was compared against. Without this, that information usually ends up buried in Slack threads or outdated spreadsheets.

It also makes rollbacks much less painful. If Version 3 starts acting up in production, you don't need to redeploy your entire stack. You can just transition Version 2 back to the "Production" stage in the registry. If your serving infrastructure is set up to always load whichever version holds the "Production" stage, it will pick up the stable version the next time it loads the model.
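On the serving side, this works because the model is requested by stage rather than by a fixed version number. A minimal sketch, assuming the stage-based registry shown above: stage_uri and load_live_model are hypothetical helper names, while the models:/ URI scheme and mlflow.pyfunc.load_model come from MLflow.

```python
def stage_uri(name, stage="Production"):
    """Build a registry URI that points at a stage instead of a fixed version."""
    return f"models:/{name}/{stage}"

def load_live_model(name):
    """Load whichever version currently holds the Production stage."""
    # Imported lazily so stage_uri() can be tried without MLflow installed.
    import mlflow.pyfunc
    return mlflow.pyfunc.load_model(stage_uri(name))

print(stage_uri("CustomerChurnPredictor"))
```

The URI is resolved when load_model is called, so a service that is already running needs to reload the model (or restart) to pick up a rollback.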

How the Components Fit Together

To see how all of this actually works in the real world, it helps to walk through a typical workflow from start to finish. It's essentially a relay race where each component hands off the baton to the next one.

It starts with a data scientist running a handful of experiments. Every time they hit run, MLflow Tracking takes notes in the background: the logging calls in the training script record parameters and metrics in the Backend Store and save model artifacts to the Artifact Store. At this stage, everything is about exploration and finding that one winner.

Once that best run is identified, the model gets officially registered in the Model Registry. This is where the team takes over. They can hop into the UI to check the annotations, review the evaluation results, and move the model into Staging. After it passes a few more validation tests, it gets the green light and is promoted to Production.

When it is time to actually serve the model, the deployment system simply asks the Registry for the current Production version. This happens whether you are using Kubernetes, a cloud endpoint, or MLflow’s built-in server.

Because the MLproject file handled the dependencies and the MLflow Models format handled the framework details, the serving infrastructure does not have to care if the model was built with Scikit-learn or PyTorch. The hand-off is smooth because all the necessary info is already there.

This flow is what turns MLflow from a collection of useful utilities into a full MLOps platform. It connects the messy experimental phase of data science to the rigid world of production software.

Wrapping Up

At the end of the day, MLflow's architecture is built to stay out of your way. It doesn't force you to change how you write your code or which libraries you use. Instead, it just provides the structure needed to make your machine learning projects reproducible and easier to manage as a team.

Whether you're just trying to get away from naming files model_final_v2.pkl or you are building a complex CI/CD pipeline for your models, understanding these four pillars is the best place to start. The best way to learn is to just fire up a local tracking server and start logging. You will probably find that once you have that "source of truth" for your experiments, you will never want to go back to the old way of doing things.

How to Build an End-to-End ML Platform Locally: From Experiment Tracking to CI/CD

Sandeep Bharadwaj Mannapur — Tue, 17 Mar 2026 20:33:56 +0000

Machine learning projects don’t end at training a model in a Jupyter notebook. The hard part is the “last mile”: turning that notebook model into something you can run reliably, update safely, and trust over time.

Most ML systems fail in production for boring (and painful) reasons: the training code and the serving code drift apart, input data changes shape, a “small” preprocessing tweak breaks predictions, or the model silently degrades because real-world behavior shifts. None of these problems are solved by a better algorithm. They’re solved by engineering: repeatable pipelines, validation, versioning, monitoring, and automated checks.

In this hands-on handbook, you’ll build a complete mini ML platform on your local machine, an end-to-end project that takes a model from training to deployment with the core “last mile” infrastructure in place. We’ll use a fraud detection example (predicting fraudulent transactions), but the same workflow works for churn prediction or any binary classification problem. Everything runs locally (no cloud required), and every step is copy-paste runnable so you can follow along and verify outputs as you go.

By the end, you'll have a production-ready ML pipeline running on your machine – from training the model to serving predictions, with the infrastructure to test, monitor, and iterate with confidence. Let's dive in!

📦 Get the Complete Code
All code from this handbook is available in a ready-to-run repository:
Repository: https://github.com/sandeepmb/freecodecamp-local-ml-platform
Clone it and follow along, or use it as a reference implementation.

Project Overview and Setup
Build a Simple Model and API (The Naive Approach)
- Train a Quick Model
- Serve Predictions with FastAPI
Where the Naive Approach Breaks
Add Experiment Tracking and Model Registry with MLflow
Ensure Feature Consistency with Feast
Add Data Validation with Great Expectations
- Define Expectations
- Integrate Validation into FastAPI
Monitor Model Performance and Data Drift
Automate Testing and Deployment with CI/CD
Incident Response Playbook
How to Put It All Together
What’s Next: Scale to Production
Conclusion
References

Project Overview and Setup

Before we jump into coding, let's set the stage. Our use case is credit card fraud detection – a binary classification problem where we predict whether a transaction is fraudulent (is_fraud = 1) or legitimate (is_fraud = 0). This is a common ML task and a good proxy for production ML challenges: fraud patterns change over time (which lets us discuss model drift), and bad input data (for example, malformed transaction info) can cause serious issues if it isn't handled properly.

Tech Stack

We will use Python-based tools that are popular in MLOps but still beginner-friendly:

Tool	Purpose	Why We Chose It
MLflow	Experiment tracking and model registry	Open-source, widely adopted, great UI
Feast	Feature store for consistent feature serving	Production-grade, runs locally, same API for offline/online
FastAPI	High-performance web framework for serving predictions	Fast, automatic docs, modern Python
Great Expectations	Data validation framework	Declarative expectations, great reports
Evidently	Monitoring for data drift and model decay	Beautiful reports, easy to integrate
Docker	Containerization for environment consistency	Industry standard, works everywhere
GitHub Actions	CI/CD automation	Free for public repos, tight GitHub integration

Let me explain each tool briefly:

MLflow is an open-source platform designed to manage the ML lifecycle. It provides experiment tracking (logging parameters, metrics, and artifacts), a model registry (versioning models with aliases), and model serving capabilities. We'll use it to ensure our experiments are reproducible and our models are versioned.

Feast (Feature Store) is an open-source feature store that helps manage and serve features consistently between training and inference. This prevents a common problem called "training-serving skew" where the features used in production differ slightly from those used in training, causing silent accuracy degradation.

FastAPI is a modern, fast web framework for building APIs with Python. It's known for being easy to use, efficient, and producing automatic interactive documentation. We'll use it to serve our model predictions.

Great Expectations is an open-source tool for data quality testing. It allows us to define "expectations" on data (like "amount should be positive" or "hour should be between 0 and 23") and test incoming data against them.

Evidently is an open-source library for monitoring data and model performance over time. It can detect data drift (when input distributions change) and model decay (when accuracy drops).

Docker ensures the same environment and dependencies in development and deployment, avoiding the classic "works on my machine" problem.

GitHub Actions provides CI/CD automation. We'll use it to run checks automatically whenever the code changes, so changes are integrated and deployed faster and with fewer errors.

💡 Mental Model: Think of this as building a "safety net" around your ML model. Each tool we add catches a different failure mode, like defensive driving for machine learning.
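To make the idea of an "expectation" from the Great Expectations description concrete, here is what those two example rules look like as plain Python checks. This sketch deliberately does not use the Great Expectations API. The check_transaction helper and the sample records are hypothetical and only show the kind of rule such a tool lets you declare.

```python
def check_transaction(tx):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    # Expectation 1: amount should be positive.
    if not tx.get("amount", 0) > 0:
        problems.append("amount should be positive")
    # Expectation 2: hour should be between 0 and 23.
    if not 0 <= tx.get("hour", -1) <= 23:
        problems.append("hour should be between 0 and 23")
    return problems

# Made-up records for illustration.
good = {"amount": 42.5, "hour": 14}
bad = {"amount": -10.0, "hour": 27}

print(check_transaction(good))
print(check_transaction(bad))
```

A validation framework adds what hand-written checks like these lack: a declarative way to define many rules, reuse them across datasets, and generate reports when data fails.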

Prerequisites

You'll need:

Python 3.9+ installed on your machine
Docker Desktop installed and running
GitHub account (if you want to try the CI/CD pipeline)
Basic familiarity with Python and ML concepts (what training and prediction mean)

You don't need MLOps or Kubernetes experience. Everything will be done locally with just Python and Docker – no cloud and no Kubernetes needed.

Project Structure

Let's set up a basic project structure on your local machine. Open your terminal and run:

# Create project directory and subfolders
mkdir ml-platform-tutorial && cd ml-platform-tutorial
mkdir -p data models src tests feature_repo

# Set up a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

Your project structure should look like this:

ml-platform-tutorial/
├── data/              # Training and test datasets
├── models/            # Saved model files
├── src/               # Source code
├── tests/             # Test files
├── feature_repo/      # Feast feature repository
├── venv/              # Virtual environment
└── requirements.txt   # Dependencies

Next, create a requirements.txt with all the necessary libraries:

# requirements.txt

# Core ML libraries
pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0

# Experiment tracking and model registry
mlflow==2.10.0

# Feature store
feast==0.36.0

# API framework
fastapi==0.109.0
uvicorn==0.27.0
httpx==0.26.0

# Data validation
great-expectations==0.18.8

# Monitoring
evidently==0.7.20

# Testing
pytest==8.0.0
pytest-cov==4.1.0

# Utilities
pyarrow==15.0.0
pydantic==2.6.0

📌 Version Note: Exact versions are pinned to ensure reproducibility. Newer versions may work, but all examples were tested with the versions listed here.

Install the dependencies:

pip install -r requirements.txt

This might take a few minutes as it installs all the packages. Once complete, we're ready to start building our project step by step.

Checkpoint: You should have a project folder with data/, models/, src/, tests/, and feature_repo/ directories, and an activated virtual environment with all dependencies installed. Verify by running python -c "import mlflow; import feast; import fastapi; print('All imports successful!')".
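If that one-liner fails, it only tells you about the first missing package. A slightly more informative check, sketched below with the standard library, lists everything that is missing at once. The REQUIRED list uses import names, which are assumed from the pinned requirements and can differ from the pip names (for example, scikit-learn is imported as sklearn).

```python
from importlib.util import find_spec

# Import names for the packages pinned in requirements.txt.
REQUIRED = [
    "pandas", "numpy", "sklearn", "mlflow", "feast", "fastapi",
    "uvicorn", "httpx", "great_expectations", "evidently",
    "pytest", "pytest_cov", "pyarrow", "pydantic",
]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print(f"Missing: {', '.join(missing)}")
else:
    print("All imports available!")
```

find_spec only looks the modules up without importing them, so the check stays fast even for heavy packages.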

Figure 1: The Complete ML Platform We'll Build

Don't worry if this looks complex. We'll build each component step by step, starting with the simplest piece and connecting them together.

1. Build a Simple Model and API (The Naive Approach)

To illustrate why we need all these tools, let's start by building a naive ML system without any MLOps infrastructure. We'll train a simple model and deploy it quickly, then observe what problems arise. This "naive approach" is how most ML projects start – and understanding its limitations will motivate the solutions we implement later.

1.1 Train a Quick Model

First, we need some data. For simplicity, we'll generate a synthetic dataset for fraud detection so that we don't rely on any external data files. The dataset will have features like:

amount: Transaction amount in dollars
hour: Hour of the day (0-23) when the transaction occurred
day_of_week: Day of the week (0=Monday, 6=Sunday)
merchant_category: Type of merchant (grocery, restaurant, retail, online, travel)
is_fraud: Label indicating if the transaction is fraudulent (1) or legitimate (0)

We will simulate that only ~2% of transactions are fraud, which is an imbalance typical in real fraud data. This imbalance is important because it affects how we evaluate our model.

Create src/generate_data.py:

# src/generate_data.py
"""
Generate synthetic fraud detection dataset.

This script creates realistic-looking transaction data where fraudulent
transactions have different patterns than legitimate ones:
- Fraud tends to have higher amounts
- Fraud tends to occur late at night
- Fraud is more common for online and travel merchants
"""
import pandas as pd
import numpy as np

def generate_transactions(n_samples=10000, fraud_ratio=0.02, seed=42):
    """
    Generate synthetic fraud detection dataset.
    
    Args:
        n_samples: Total number of transactions to generate
        fraud_ratio: Proportion of fraudulent transactions (default 2%)
        seed: Random seed for reproducibility
    
    Returns:
        DataFrame with transaction features and fraud labels
    
    Fraud transactions have different patterns:
    - Higher amounts (median ~$245 vs ~$33 for legit)
    - Late night hours (0-5, 23)
    - More likely to be online or travel merchants
    """
    np.random.seed(seed)
    n_fraud = int(n_samples * fraud_ratio)
    n_legit = n_samples - n_fraud

    # Legitimate transactions: normal shopping patterns
    # - Amounts follow a log-normal distribution (most small, some large)
    # - Hours are uniformly distributed throughout the day
    # - Merchant categories weighted toward everyday shopping
    legit = pd.DataFrame({
        "amount": np.random.lognormal(mean=3.5, sigma=1.2, size=n_legit),  # ~$33 median
        "hour": np.random.randint(0, 24, size=n_legit),
        "day_of_week": np.random.randint(0, 7, size=n_legit),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_legit,
            p=[0.30, 0.25, 0.25, 0.15, 0.05]  # Weighted toward everyday shopping
        ),
        "is_fraud": 0
    })
    
    # Fraudulent transactions: suspicious patterns
    # - Higher amounts (fraudsters go big)
    # - Late night hours (less scrutiny)
    # - More online and travel (easier to exploit)
    fraud = pd.DataFrame({
        "amount": np.random.lognormal(mean=5.5, sigma=1.5, size=n_fraud),  # ~$245 median
        "hour": np.random.choice([0, 1, 2, 3, 4, 5, 23], size=n_fraud),  # Late night
        "day_of_week": np.random.randint(0, 7, size=n_fraud),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_fraud,
            p=[0.05, 0.05, 0.10, 0.60, 0.20]  # Weighted toward online/travel
        ),
        "is_fraud": 1
    })
    
    # Combine and shuffle
    df = pd.concat([legit, fraud], ignore_index=True)
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    
    return df

if __name__ == "__main__":
    # Generate dataset
    print("Generating synthetic fraud detection dataset...")
    df = generate_transactions(n_samples=10000, fraud_ratio=0.02)
    
    # Split into train (80%) and test (20%)
    train_df = df.sample(frac=0.8, random_state=42)
    test_df = df.drop(train_df.index)
    
    # Save to CSV files
    train_df.to_csv("data/train.csv", index=False)
    test_df.to_csv("data/test.csv", index=False)
    
    # Print summary statistics
    print(f"\nDataset generated successfully!")
    print(f"Training set: {len(train_df):,} transactions")
    print(f"Test set: {len(test_df):,} transactions")
    print(f"Overall fraud ratio: {df['is_fraud'].mean():.2%}")
    print(f"\nLegitimate transactions - Average amount: ${df[df['is_fraud']==0]['amount'].mean():.2f}")
    print(f"Fraudulent transactions - Average amount: ${df[df['is_fraud']==1]['amount'].mean():.2f}")
    print(f"\nMerchant category distribution (fraud):")
    print(df[df['is_fraud']==1]['merchant_category'].value_counts(normalize=True))

Run the data generation script:

python src/generate_data.py

You should see output like:

Generating synthetic fraud detection dataset...

Dataset generated successfully!
Training set: 8,000 transactions
Test set: 2,000 transactions
Overall fraud ratio: 2.00%

Legitimate transactions - Average amount: $33.45
Fraudulent transactions - Average amount: $245.67

Merchant category distribution (fraud):
online        0.60
travel        0.20
retail        0.10
restaurant    0.05
grocery       0.05

Now you have data/train.csv and data/test.csv with ~8000 training and ~2000 testing transactions.

Why This Matters: The synthetic data has realistic patterns — fraud is rare (2%), high-value, late-night, and concentrated in certain merchant categories. These patterns give our model something to learn.
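One detail about the amounts is worth checking by hand. They are drawn from log-normal distributions, where the median is exp(mu) and the mean is exp(mu + sigma^2 / 2). The sketch below evaluates both formulas for the parameters used in generate_transactions. It shows that the ~$33 and ~$245 figures correspond to the medians, while the theoretical means are noticeably higher because of the long right tail. The two helper functions are written just for this check.

```python
import math

def lognormal_median(mu, sigma):
    """Median of a log-normal distribution with log-scale parameters mu and sigma."""
    return math.exp(mu)

def lognormal_mean(mu, sigma):
    """Mean of a log-normal distribution with log-scale parameters mu and sigma."""
    return math.exp(mu + sigma ** 2 / 2)

# (label, mu, sigma) as passed to np.random.lognormal in generate_transactions.
for label, mu, sigma in [("legitimate", 3.5, 1.2), ("fraud", 5.5, 1.5)]:
    median = lognormal_median(mu, sigma)
    mean = lognormal_mean(mu, sigma)
    print(f"{label}: median ~ ${median:.0f}, mean ~ ${mean:.0f}")
```

Sample averages from a finite dataset fluctuate around these theoretical means, and with only a couple of hundred fraud rows the fraud average can move quite a bit from run to run.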

Now, let's train a quick model. We'll use a simple Random Forest classifier from scikit-learn to predict is_fraud. In this naive version, we won't do much feature engineering – just label encode the categorical merchant_category and feed everything to the model.

Create src/train_naive.py:

# src/train_naive.py
"""
Train a fraud detection model - NAIVE VERSION.

This script demonstrates the "quick and dirty" approach to ML:
- No experiment tracking
- No model versioning
- Just train and save to a pickle file

We'll improve on this in later sections.
"""
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score,
    confusion_matrix,
    classification_report
)

def main():
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    print(f"Training samples: {len(train_df):,}")
    print(f"Test samples: {len(test_df):,}")
    print(f"Training fraud ratio: {train_df['is_fraud'].mean():.2%}")
    
    # Encode the categorical feature
    # We need to save the encoder to use the same mapping at inference time
    print("\nEncoding categorical features...")
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    print(f"Merchant category mapping: {dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))}")
    
    # Prepare features and labels
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    # Train a Random Forest classifier
    print("\nTraining Random Forest model...")
    model = RandomForestClassifier(
        n_estimators=100,      # Number of trees
        max_depth=10,          # Maximum depth of each tree
        random_state=42,       # For reproducibility
        n_jobs=-1              # Use all CPU cores
    )
    model.fit(X_train, y_train)
    print("Training complete!")
    
    # Evaluate on test data
    print("\n" + "="*50)
    print("MODEL EVALUATION")
    print("="*50)
    
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"F1-score:  {f1_score(y_test, y_pred):.4f}")
    
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(f"  True Negatives:  {cm[0][0]:,} (correctly identified legitimate)")
    print(f"  False Positives: {cm[0][1]:,} (legitimate flagged as fraud)")
    print(f"  False Negatives: {cm[1][0]:,} (fraud missed - DANGEROUS!)")
    print(f"  True Positives:  {cm[1][1]:,} (correctly caught fraud)")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
    
    # Feature importance
    print("\nFeature Importance:")
    for name, importance in sorted(
        zip(feature_cols, model.feature_importances_),
        key=lambda x: x[1],
        reverse=True
    ):
        print(f"  {name}: {importance:.4f}")
    
    # Save the model and encoder together
    print("\nSaving model to models/model.pkl...")
    with open("models/model.pkl", "wb") as f:
        pickle.dump((model, encoder), f)
    
    print("\nModel trained and saved successfully!")
    print("\nWARNING: This naive approach has several problems:")
    print("  - No record of hyperparameters or metrics")
    print("  - No model versioning")
    print("  - No way to reproduce this exact model")
    print("  - We'll fix these issues in the following sections!")

if __name__ == "__main__":
    main()

Run the training script:

python src/train_naive.py

You should see output similar to:

Loading data...
Training samples: 8,000
Test samples: 2,000
Training fraud ratio: 2.00%

Encoding categorical features...
Merchant category mapping: {'grocery': 0, 'online': 1, 'restaurant': 2, 'retail': 3, 'travel': 4}

Training Random Forest model...
Training complete!

==================================================
MODEL EVALUATION
==================================================

Accuracy:  0.9820
Precision: 0.7273
Recall:    0.6154
F1-score:  0.6667

Confusion Matrix:
  True Negatives:  1,956 (correctly identified legitimate)
  False Positives: 4 (legitimate flagged as fraud)
  False Negatives: 32 (fraud missed - DANGEROUS!)
  True Positives:  8 (correctly caught fraud)

Feature Importance:
  amount: 0.5423
  hour: 0.2156
  merchant_encoded: 0.1345
  day_of_week: 0.1076

Important observation: You'll see ~98% accuracy but a lower F1-score (around 0.5-0.7). With only 2% fraud, accuracy is extremely misleading! A model that always predicts "not fraud" would achieve 98% accuracy while catching zero fraud. This is why we focus on F1-score, precision, and recall for imbalanced classification problems.

💡 If you're new to imbalanced classification, remember: high accuracy can be meaningless when the positive class is rare.
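To make the accuracy trap concrete, here is a minimal sketch in plain Python. It assumes a test set of 2,000 labels with 2% fraud (matching the split above) and scores a baseline that always predicts "not fraud". The helper functions are illustrative and are not part of the tutorial's scripts.

```python
# Illustrative only: a baseline that never predicts fraud.
# Assumes 2,000 test labels with 2% fraud, matching the split above.

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred):
    """Fraction of actual fraud cases that were caught."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0

y_true = [1] * 40 + [0] * 1960      # 40 fraud cases out of 2,000
y_pred = [0] * len(y_true)          # always predict "not fraud"

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)

print(f"Accuracy: {baseline_accuracy:.2%}")  # 98.00%
print(f"Recall:   {baseline_recall:.2%}")    # 0.00%
```

The baseline matches the headline accuracy of the trained model while catching none of the 40 fraud cases, which is why recall and F1 are the numbers to watch.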

The script outputs a file models/model.pkl containing both the trained model and the label encoder (we need both for inference).

Checkpoint: You should now have:

data/train.csv (~8,000 rows)
data/test.csv (~2,000 rows)
models/model.pkl (trained model + encoder)

The model should show ~98% accuracy but F1 around 0.5-0.7. Verify the files exist: ls -la data/ models/

1.2 Serve Predictions with FastAPI

Now that we have a model, let's deploy it as an API so that clients can get predictions. We'll use FastAPI because it's straightforward, very fast, and produces automatic interactive documentation.

FastAPI is known for:

Easy to use: Pythonic syntax with type hints
High performance: One of the fastest Python frameworks
Automatic documentation: Swagger UI out of the box
Data validation: Using Pydantic models

Create src/serve_naive.py:

# src/serve_naive.py
"""
Serve fraud detection model as a REST API - NAIVE VERSION.

This is a simple API that:
1. Loads the trained model at startup
2. Accepts transaction data via POST request
3. Returns fraud prediction

We'll improve this with validation, monitoring, and better
model loading in later sections.
"""
import pickle
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Load the trained model and encoder at startup
# This is loaded once when the server starts, not on every request
print("Loading model...")
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)
print("Model loaded successfully!")

# Create the FastAPI application
app = FastAPI(
    title="Fraud Detection API",
    description="""
    Predict whether a credit card transaction is fraudulent.
    
    This API accepts transaction details and returns:
    - Whether the transaction is predicted to be fraud
    - The probability of fraud (0.0 to 1.0)
    
    **Note:** This is the naive version without validation or monitoring.
    """,
    version="1.0.0"
)

# Define the input schema using Pydantic
# This provides automatic validation and documentation
class Transaction(BaseModel):
    """Schema for a transaction to be evaluated for fraud."""
    amount: float = Field(
        ..., 
        description="Transaction amount in dollars",
        example=150.00
    )
    hour: int = Field(
        ..., 
        description="Hour of the day (0-23)",
        example=14
    )
    day_of_week: int = Field(
        ..., 
        description="Day of week (0=Monday, 6=Sunday)",
        example=3
    )
    merchant_category: str = Field(
        ..., 
        description="Type of merchant",
        example="online"
    )

class PredictionResponse(BaseModel):
    """Schema for the prediction response."""
    is_fraud: bool = Field(description="Whether the transaction is predicted as fraud")
    fraud_probability: float = Field(description="Probability of fraud (0.0 to 1.0)")
    
@app.post("/predict", response_model=PredictionResponse)
def predict(transaction: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Takes transaction details and returns a fraud prediction
    along with the probability score.
    """
    # Convert the request to a dictionary
    data = transaction.dict()
    
    # Encode the merchant category using the same encoder from training
    # This ensures consistency between training and serving
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        # Unknown merchant category: silently fall back to code 0, which
        # the encoder assigned to a real category ('grocery').
        # In production, we'd reject or flag these requests instead.
        data["merchant_encoded"] = 0
    
    # Prepare features in the same order as training
    X = [[
        data["amount"],
        data["hour"],
        data["day_of_week"],
        data["merchant_encoded"]
    ]]
    
    # Get prediction and probability
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0][1]  # Probability of class 1 (fraud)
    
    return PredictionResponse(
        is_fraud=bool(prediction),
        fraud_probability=round(float(probability), 4)
    )

@app.get("/health")
def health_check():
    """
    Health check endpoint.
    
    Returns the status of the API. Useful for:
    - Load balancer health checks
    - Kubernetes liveness probes
    - Monitoring systems
    """
    return {
        "status": "healthy",
        "model_loaded": model is not None
    }

@app.get("/")
def root():
    """Root endpoint with API information."""
    return {
        "message": "Fraud Detection API",
        "version": "1.0.0",
        "docs": "/docs",
        "health": "/health"
    }

A few important things to note about this code:

Pydantic Models: We use BaseModel to define the expected input JSON schema. FastAPI automatically validates incoming requests against this schema.
Type Hints: The type hints (float, int, str) provide both documentation and runtime validation.
Feature Encoding: On each request, we encode the merchant category using the same LabelEncoder we saved from training. This ensures consistency between training and serving.
Health Endpoint: The /health endpoint is standard practice for production APIs - it allows load balancers and monitoring systems to check if the service is running.

To run this API, use Uvicorn (an ASGI server):

uvicorn src.serve_naive:app --reload --host 0.0.0.0 --port 8000

The --reload flag enables auto-reload during development (the server restarts when you change code).

You should see:

Loading model...
Model loaded successfully!
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process

Now open your browser and go to http://localhost:8000/docs. You'll see the Swagger UI – an auto-generated interactive documentation where you can test the API directly from your browser!

Test the API using curl in another terminal:

# Test with a legitimate-looking transaction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}'

Expected response:

{"is_fraud": false, "fraud_probability": 0.02}

# Test with a suspicious transaction (high amount, late night, online)
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 500.0, "hour": 3, "day_of_week": 1, "merchant_category": "online"}'

Expected response:

{"is_fraud": true, "fraud_probability": 0.78}

We have a working model served as an API! In a real scenario, we could now integrate this API with a payment processing frontend, mobile app, or any system that needs fraud predictions.

But before we celebrate, let's examine this naive approach for potential pitfalls...

Checkpoint: Your API should be running at http://localhost:8000. The Swagger UI at /docs should list all three endpoints (/predict, /health, and /). Test with curl or the Swagger UI to verify predictions are returned.

2. Where the Naive Approach Breaks

Our quick-and-dirty ML pipeline works on the surface: it can train a model and serve predictions. However, hidden problems will emerge if we try to maintain or scale this system in production.

This section is critical: understanding these issues will motivate the solutions we implement in the following sections. Let's go through the problems one by one.

Problem 1: No Experiment Tracking (Reproducibility)

Try this thought experiment: Run train_naive.py again with different hyperparameters (change n_estimators to 200, or max_depth to 15). Would you be able to exactly reproduce the previous model's results if someone asked?

Probably not. Currently, we have no record of:

Which hyperparameters we used
What metrics we achieved
What version of the data we trained on
What library versions were installed
When the training happened
Who ran the training

Three months from now, if your manager asks "How was this model trained? Can you reproduce the results?" – you'd be in trouble. You might have the code, but you don't know which version of the code, which parameters, or which data produced the model that's currently in production.

Experiment tracking is the practice of logging all these details (code versions, parameters, metrics, data versions, artifacts) so experiments can be compared and replicated. Our naive approach lacks this entirely, making our results hard to trust or build upon.
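As a rough illustration of what "logging these details" means before any tool is involved, the sketch below appends one JSON line per training run. It uses only the standard library. The function names and the runs.jsonl file are hypothetical, and it leaves out the code version (for example a Git commit hash), which a fuller record would include. MLflow, introduced in Section 3, replaces this kind of hand-rolled log.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

# Hypothetical helpers for illustration; not part of the tutorial's scripts.

def file_sha256(path):
    """Hash a data file so the exact training data can be identified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_run_record(params, metrics, data_path):
    """Collect the details the naive script never writes down."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "params": params,
        "metrics": metrics,
        "data_sha256": file_sha256(data_path),
    }

def append_run_record(record, log_path="runs.jsonl"):
    """Append one JSON line per training run (the file name is an assumption)."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

A training script could call build_run_record with its hyperparameters, its test metrics, and "data/train.csv", then pass the result to append_run_record.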

Problem 2: Model Versioning and Deployment Chaos

We trained one model and saved it as model.pkl. Now consider this scenario:

You train a new model with different hyperparameters
You overwrite model.pkl with the new model
You deploy it to production
Users start complaining about more false positives
You want to roll back to the previous model
Problem: The previous model was overwritten and is gone forever

There's no systematic versioning. Questions you cannot answer:

Which model version is currently in production?
What were the metrics for model v1 vs v2?
When was each model trained and by whom?
Can we instantly roll back if the new model performs worse?
What changed between versions?

Without version control for models, you're flying blind. Imagine deploying code without Git – that's what we're doing with our model.
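The overwrite problem can be shown with a small sketch. Instead of always writing models/model.pkl, the hypothetical helper below writes model_v1.pkl, model_v2.pkl, and so on, so that an earlier version is still on disk when a rollback is needed. The naming scheme is an assumption made for illustration. It gives no metadata, aliases, or audit trail, which is what the Model Registry in Section 3 adds.

```python
import pickle
from pathlib import Path

# Hypothetical file-naming scheme for illustration only.

def next_version(model_dir):
    """Return 1 + the highest existing model_vN.pkl number (1 if none exist)."""
    versions = [
        int(p.stem.split("_v")[1])
        for p in Path(model_dir).glob("model_v*.pkl")
        if p.stem.split("_v")[1].isdigit()
    ]
    return max(versions, default=0) + 1

def save_versioned(obj, model_dir="models"):
    """Pickle obj to a new model_vN.pkl file without overwriting old versions."""
    Path(model_dir).mkdir(parents=True, exist_ok=True)
    path = Path(model_dir) / f"model_v{next_version(model_dir)}.pkl"
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path
```

Calling save_versioned((model, encoder)) twice leaves both files in place, so the first model is not lost when the second is trained.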

Problem 3: No Data Validation – Garbage In, Garbage Out

Right now, our API checks only that each field has the right type. Any float, int, or string passes, however implausible, and the model still returns a prediction. Let's see what happens with bad data.

Create a test script src/test_bad_data.py:

# src/test_bad_data.py
"""Test what happens when we send garbage data to the API."""
import requests

BASE_URL = "http://localhost:8000"

print("Testing API with various bad inputs...\n")

# Test 1: Negative amount
print("Test 1: Negative amount")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -500.0,        # Negative amount - impossible!
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 2: Invalid hour
print("Test 2: Hour = 25 (should be 0-23)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 25,              # Invalid hour!
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 3: Invalid day of week
print("Test 3: day_of_week = 10 (should be 0-6)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 10,       # Invalid day!
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 4: Unknown merchant category
print("Test 4: Unknown merchant category")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "unknown_category"  # Not in training data!
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 5: All bad at once
print("Test 5: Everything wrong")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -1000.0,
    "hour": 99,
    "day_of_week": 15,
    "merchant_category": "totally_fake"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

print("Observation: The API happily accepts ALL garbage and returns predictions!")
print("This is dangerous - bad data leads to bad predictions with no warning.")

Run it (make sure your API is still running):

python src/test_bad_data.py

You'll see something like:

Testing API with various bad inputs...

Test 1: Negative amount
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.15}

Test 2: Hour = 25 (should be 0-23)
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.08}

...

Observation: The API happily accepts ALL garbage and returns predictions!

The API accepts garbage and returns predictions with no warning! In production, this could mean:

Incorrect predictions based on impossible data
Fraud going undetected because of malformed input
Legitimate transactions blocked based on corrupted data
No way to debug why predictions are wrong

As the saying goes: "Garbage in, garbage out." But even worse – we don't even know garbage went in!
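For contrast, here is a minimal sketch of the range checks the API is missing, written in plain Python. The rules are assumptions: amount must be positive, hour and day_of_week must fall in the ranges documented in the schema, and the allowed merchant categories are the five printed in the encoder mapping earlier. Section 5 handles validation with Great Expectations, so this only shows the idea.

```python
# Hypothetical rule set; the allowed categories mirror the encoder mapping
# printed by train_naive.py.
KNOWN_CATEGORIES = {"grocery", "online", "restaurant", "retail", "travel"}

def validate_transaction(tx):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if tx["amount"] <= 0:
        problems.append("amount must be positive")
    if not 0 <= tx["hour"] <= 23:
        problems.append("hour must be between 0 and 23")
    if not 0 <= tx["day_of_week"] <= 6:
        problems.append("day_of_week must be between 0 and 6")
    if tx["merchant_category"] not in KNOWN_CATEGORIES:
        problems.append("merchant_category is not a known category")
    return problems

# The "everything wrong" payload from Test 5 fails all four checks.
bad = {
    "amount": -1000.0,
    "hour": 99,
    "day_of_week": 15,
    "merchant_category": "totally_fake",
}
bad_problems = validate_transaction(bad)
print(bad_problems)
```

Inside the /predict handler, a non-empty list could be turned into an error response (for example by raising fastapi.HTTPException with status code 422) instead of a prediction.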

Problem 4: Model Drift – Performance Decay Over Time

Here's a scenario that happens in every production ML system:

January: You train your model on historical fraud data. It achieves 98% accuracy and 0.67 F1-score. Everyone's happy.
February: The model is deployed and working well. Fraud is being caught.
March: Fraudsters adapt. They start using different patterns – smaller amounts, different merchant categories, different times of day.
April: Your model's accuracy has dropped from 98% to 85%. F1-score dropped from 0.67 to 0.35. Fraud is slipping through.
May: A major fraud incident occurs. Investigation reveals the model has been underperforming for 2 months.

The problem: Nobody noticed for 2 months because there was no monitoring.

This phenomenon is called data drift (when input data distributions change) or concept drift (when the relationship between inputs and outputs changes). Both are inevitable in real-world systems.

Without monitoring:

You don't know when performance degrades
You don't know why performance degrades
You can't take corrective action until users complain
By then, significant damage may have occurred
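One simple way to notice data drift is to compare the distribution of a feature in recent traffic against the training data. The sketch below computes the Population Stability Index (PSI) for a single numeric feature in plain Python. The binning scheme and the eps floor are assumptions chosen for brevity. Section 6 uses Evidently for monitoring, so treat this as an illustration of the idea and not as the tutorial's method.

```python
import math

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width if all values are equal

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm below is always defined
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

# Toy data: "amount" values at training time vs. a shifted production window
train_amounts = [float(v) for v in range(100)]
prod_amounts = [v + 50.0 for v in train_amounts]

psi_same = psi(train_amounts, train_amounts)
psi_shifted = psi(train_amounts, prod_amounts)
print(f"PSI (same data): {psi_same:.3f}")
print(f"PSI (shifted):   {psi_shifted:.3f}")
```

A commonly cited rule of thumb treats PSI values above roughly 0.25 as a significant shift, but the right alert threshold depends on the feature and the cost of false alarms.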

Problem 5: No CI/CD or Deployment Safety

Our "deployment process" was literally:

SSH into the server (or run locally)
Run python src/train_naive.py
Copy model.pkl to the right place
Restart the API
Hope for the best

There's:

No automated testing: A typo could break everything
No staging environment: We test directly in production
No gradual rollout: 100% of traffic hits the new model immediately
No rollback capability: If something breaks, we have to manually fix it
No audit trail: Who deployed what and when?

This is how production incidents happen. A rushed deployment at 5 PM on Friday breaks the fraud detection system, and nobody notices until Monday when fraud losses have spiked.
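An automated check before deployment does not need to be elaborate. As a sketch, the hypothetical gate below compares a candidate model's test metrics with those of the model currently serving traffic and refuses promotion if recall is too low or F1 regresses. The thresholds and metric names are assumptions. Section 7 builds the real pipeline with GitHub Actions and Docker.

```python
# Hypothetical promotion gate: thresholds and metric names are assumptions.

def should_promote(candidate, current, min_recall=0.5, max_f1_drop=0.0):
    """Decide whether a candidate model may replace the current one.

    Both arguments are dicts with "f1" and "recall" measured on the same
    test set. Returns (ok, reasons); reasons lists every failed check.
    """
    reasons = []
    if candidate["recall"] < min_recall:
        reasons.append(
            f"recall {candidate['recall']:.2f} is below the floor {min_recall:.2f}"
        )
    if candidate["f1"] < current["f1"] - max_f1_drop:
        reasons.append(
            f"f1 {candidate['f1']:.2f} is worse than current {current['f1']:.2f}"
        )
    return (not reasons, reasons)

ok, reasons = should_promote(
    candidate={"f1": 0.35, "recall": 0.30},
    current={"f1": 0.67, "recall": 0.62},
)
print(ok, reasons)
```

A CI job could run a check like this after training and exit with a non-zero status when ok is False, which blocks the deployment step.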

Figure 2: Problems with the Naive Approach

Summary: What We Need to Fix

Our simple ML service is missing critical infrastructure. Here's the mapping of problems to solutions:

Problem	Impact	Solution	Section
No experiment tracking	Can't reproduce or compare models	MLflow Tracking	3
No model versioning	Can't roll back or audit	MLflow Registry	3
No feature consistency	Training-serving skew	Feast Feature Store	4
No data validation	Garbage predictions	Great Expectations	5
No monitoring	Drift goes unnoticed	Evidently	6
No CI/CD	Risky deployments	GitHub Actions + Docker	7

The good news: We can fix each of these by incrementally adding components to our pipeline. Each tool addresses a specific problem, and together they form a robust ML platform.

Let's start fixing these issues, one by one.

3. Add Experiment Tracking and Model Registry with MLflow

What breaks without this: You can't reproduce yesterday's results, can't compare experiments, and can't roll back when a new model fails in production.

Our first fix addresses Problems 1 and 2: experiment reproducibility and model versioning.

MLflow is an open-source platform designed to manage the ML lifecycle. We'll use two of its key components:

MLflow Tracking: Log experiments (parameters, metrics, artifacts) so you can compare runs and reproduce results
MLflow Model Registry: Version your models with aliases (champion, challenger) and manage the deployment lifecycle

Why This Matters: Without tracking, ML is guesswork. With MLflow, every run is logged with parameters, metrics, and artifacts. You can compare runs side-by-side, understand what actually improved your model, and reproduce any past experiment. The Model Registry adds governance – you know exactly which model is in production and can roll back in seconds.

3.1 How to Set Up the MLflow Tracking Server

MLflow can log experiments to a local directory by default, but to use the full UI and model registry, it's best to run the MLflow tracking server.

Open a new terminal (keep it separate from your API terminal) and run:

# Create a directory for MLflow data
mkdir -p mlruns

# Start the MLflow server
mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns

Let's break down these parameters:

--host 0.0.0.0: Listen on all network interfaces
--port 5000: Run on port 5000
--backend-store-uri sqlite:///mlflow.db: Store experiment metadata in a SQLite database (for production, you'd use PostgreSQL or MySQL)
--default-artifact-root ./mlruns: Store model artifacts (files) in the mlruns directory

You should see:

[INFO] Starting gunicorn 21.2.0
[INFO] Listening at: http://0.0.0.0:5000

Now open your browser and navigate to http://localhost:5000. You'll see the MLflow UI – it should be empty initially since we haven't logged any experiments yet.

3.2 How to Log Experiments in Code

Now let's modify our training script to log everything to MLflow. Create src/train_mlflow.py:

# src/train_mlflow.py
"""
Train fraud detection model with MLflow experiment tracking.

This script demonstrates proper ML experiment tracking:
- Log all hyperparameters
- Log all metrics (train and test)
- Log the trained model as an artifact
- Register the model in the Model Registry

Compare this to train_naive.py to see the difference!
"""
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    roc_auc_score
)
import pickle
from datetime import datetime

# Configure MLflow to use our tracking server
mlflow.set_tracking_uri("http://localhost:5000")

# Create or get the experiment
# All runs will be grouped under this experiment name
mlflow.set_experiment("fraud-detection")

def load_and_preprocess_data():
    """Load and preprocess the training and test data."""
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    # Encode categorical feature
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    # Prepare features
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    return X_train, y_train, X_test, y_test, encoder

def train_and_log_model(
    n_estimators: int = 100,
    max_depth: int = 10,
    min_samples_split: int = 2,
    min_samples_leaf: int = 1
):
    """
    Train a model and log everything to MLflow.
    
    Args:
        n_estimators: Number of trees in the forest
        max_depth: Maximum depth of each tree
        min_samples_split: Minimum samples required to split a node
        min_samples_leaf: Minimum samples required at a leaf node
    """
    X_train, y_train, X_test, y_test, encoder = load_and_preprocess_data()
    
    # Start an MLflow run - everything logged will be associated with this run
    with mlflow.start_run():
        # Add a descriptive run name
        run_name = f"rf_est{n_estimators}_depth{max_depth}_{datetime.now().strftime('%H%M%S')}"
        mlflow.set_tag("mlflow.runName", run_name)
        
        # Log all hyperparameters
        # These are the "knobs" we can tune
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_samples_split", min_samples_split)
        mlflow.log_param("min_samples_leaf", min_samples_leaf)
        mlflow.log_param("model_type", "RandomForestClassifier")
        
        # Log data information
        mlflow.log_param("train_samples", len(X_train))
        mlflow.log_param("test_samples", len(X_test))
        mlflow.log_param("fraud_ratio", float(y_train.mean()))
        mlflow.log_param("n_features", X_train.shape[1])
        
        # Train the model
        print(f"\nTraining model: n_estimators={n_estimators}, max_depth={max_depth}")
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X_train, y_train)
        
        # Evaluate and log metrics for BOTH train and test sets
        # This helps detect overfitting
        for dataset_name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
            y_pred = model.predict(X)
            y_prob = model.predict_proba(X)[:, 1]
            
            # Calculate all metrics
            accuracy = accuracy_score(y, y_pred)
            precision = precision_score(y, y_pred, zero_division=0)
            recall = recall_score(y, y_pred, zero_division=0)
            f1 = f1_score(y, y_pred, zero_division=0)
            roc_auc = roc_auc_score(y, y_prob)
            
            # Log metrics with dataset prefix
            mlflow.log_metric(f"{dataset_name}_accuracy", accuracy)
            mlflow.log_metric(f"{dataset_name}_precision", precision)
            mlflow.log_metric(f"{dataset_name}_recall", recall)
            mlflow.log_metric(f"{dataset_name}_f1", f1)
            mlflow.log_metric(f"{dataset_name}_roc_auc", roc_auc)
            
            print(f"  {dataset_name.upper()} - Accuracy: {accuracy:.4f}, F1: {f1:.4f}, ROC-AUC: {roc_auc:.4f}")
        
        # Log feature importance
        for feature, importance in zip(
            ["amount", "hour", "day_of_week", "merchant_encoded"],
            model.feature_importances_
        ):
            mlflow.log_metric(f"importance_{feature}", importance)
        
        # Log the model to MLflow AND register it in the Model Registry
        # This creates a new version of the model automatically
        print("\nRegistering model in MLflow Model Registry...")
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name="fraud-detection-model",
            input_example=X_train.iloc[:5]  # Example input for documentation
        )
        
        # Save and log the encoder as a separate artifact
        # We need this for inference
        with open("encoder.pkl", "wb") as f:
            pickle.dump(encoder, f)
        mlflow.log_artifact("encoder.pkl")
        
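        # Get the run and experiment IDs for reference
        # (the experiment ID is not guaranteed to be 1, so don't hard-code it)
        run = mlflow.active_run()
        run_id = run.info.run_id
        print(f"\nMLflow Run ID: {run_id}")
        print(f"View this run: http://localhost:5000/#/experiments/{run.info.experiment_id}/runs/{run_id}")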
        
        return model, encoder

def run_experiment_sweep():
    """
    Run multiple experiments with different hyperparameters.
    
    This demonstrates how MLflow helps compare different configurations.
    """
    print("="*60)
    print("RUNNING HYPERPARAMETER EXPERIMENT SWEEP")
    print("="*60)
    
    # Define different configurations to try
    experiments = [
        {"n_estimators": 50, "max_depth": 5},
        {"n_estimators": 100, "max_depth": 10},
        {"n_estimators": 100, "max_depth": 15},
        {"n_estimators": 200, "max_depth": 10},
        {"n_estimators": 200, "max_depth": 20},
    ]
    
    for i, params in enumerate(experiments, 1):
        print(f"\n--- Experiment {i}/{len(experiments)} ---")
        train_and_log_model(**params)
    
    print("\n" + "="*60)
    print("EXPERIMENT SWEEP COMPLETE!")
    print("="*60)
    print("\nView all experiments at: http://localhost:5000")
    print("Compare runs to find the best hyperparameters!")

if __name__ == "__main__":
    run_experiment_sweep()

This script:

Connects to MLflow: mlflow.set_tracking_uri("http://localhost:5000")
Creates an experiment: mlflow.set_experiment("fraud-detection")
Logs parameters: All hyperparameters and data info
Logs metrics: Accuracy, precision, recall, F1, ROC-AUC for both train and test sets
Logs the model: Saves the trained model as an artifact
Registers the model: Adds it to the Model Registry with automatic versioning

Run the experiment sweep:

python src/train_mlflow.py

You'll see output for each experiment:

============================================================
RUNNING HYPERPARAMETER EXPERIMENT SWEEP
============================================================

--- Experiment 1/5 ---
Loading data...
Training model: n_estimators=50, max_depth=5
  TRAIN - Accuracy: 0.9821, F1: 0.6545, ROC-AUC: 0.9234
  TEST - Accuracy: 0.9795, F1: 0.5714, ROC-AUC: 0.8956

Registering model in MLflow Model Registry...
MLflow Run ID: abc123...

--- Experiment 5/5 ---
Training model: n_estimators=200, max_depth=20
  TRAIN - Accuracy: 0.9856, F1: 0.7123, ROC-AUC: 0.9567
  TEST - Accuracy: 0.9810, F1: 0.6667, ROC-AUC: 0.9234

============================================================
EXPERIMENT SWEEP COMPLETE!
============================================================

All 5 runs are now logged to MLflow with full metrics comparison available in the UI.

Now refresh the MLflow UI at http://localhost:5000. You'll see:

Experiments tab: Shows the "fraud-detection" experiment with 5 runs
Each run: Shows parameters, metrics, and artifacts
Compare: You can select multiple runs and compare them side-by-side
Models tab: Shows "fraud-detection-model" with 5 versions

MLflow Tracking UI: Compare runs, metrics, and models at a glance

3.3 How to Use the Model Registry

The Model Registry provides a central hub for managing model versions and promoting them toward production.

In the MLflow UI:

Click the "Models" tab in the top navigation
Click "fraud-detection-model"
You'll see all 5 versions listed with their metrics

Model Aliases: Newer MLflow versions use aliases instead of the older fixed stages. If you've seen older tutorials using "Staging" and "Production" stages, aliases are the newer, more flexible approach.

@champion: The production model serving live traffic
@challenger: Candidate model being tested
You can also create custom aliases, such as @baseline.

Assign an alias:

Open MLflow UI → Models → fraud-detection-model
Click on the version you want to promote
Click "Add Alias"
Enter champion and save

Now you've assigned the @champion alias to your best model. Your API loads whichever version holds this alias when it starts, so rolling back means moving the alias to a different version and restarting the API.
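The alias can also be assigned from code, not only from the UI. The sketch below wraps two MlflowClient methods, set_registered_model_alias and get_model_version_by_alias, which are available in MLflow versions that support aliases. The helper names make_client and set_champion are hypothetical. The import is done lazily inside make_client so that the second helper can be exercised with any client-like object.

```python
def make_client(tracking_uri="http://localhost:5000"):
    """Create an MLflow client for the tracking server started earlier."""
    # Imported lazily so set_champion below has no hard dependency on mlflow
    from mlflow.tracking import MlflowClient
    return MlflowClient(tracking_uri=tracking_uri)

def set_champion(client, version, model_name="fraud-detection-model"):
    """Point the 'champion' alias at a version; return the version now aliased."""
    client.set_registered_model_alias(model_name, "champion", str(version))
    return client.get_model_version_by_alias(model_name, "champion").version
```

With the tracking server running, set_champion(make_client(), 3) would promote version 3. A rollback is the same call with an earlier version number.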

Figure 3: MLflow Model Lifecycle — From Training to Production

3.4 Update API to Load from Registry

Now let's update our API to load the champion model from the MLflow Registry instead of a pickle file. Create src/serve_mlflow.py:

# src/serve_mlflow.py
"""
Serve fraud detection model from MLflow Model Registry.

This version loads the @champion model from MLflow, which means:
- Serves whichever version holds the @champion alias at startup
- Can roll back by moving the @champion alias and restarting
- No manual file copying needed
"""
import mlflow
import mlflow.sklearn
import pickle
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")

print("Loading model from MLflow Model Registry...")

# Load the champion model from the registry
# This automatically gets whichever version has the @champion alias
try:
    model = mlflow.sklearn.load_model("models:/fraud-detection-model@champion")
    print("Successfully loaded champion model from MLflow!")
except Exception as e:
    print(f"Error loading from MLflow: {e}")
    print("Make sure you've assigned the @champion alias to a model in the MLflow UI")
    raise

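# Load the encoder from the local encoder.pkl written by train_mlflow.py
# (the same file is also logged to each MLflow run as an artifact).
# In a real system, you'd load the copy logged with the champion's run.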
with open("encoder.pkl", "rb") as f:
    encoder = pickle.load(f)
print("Encoder loaded successfully!")

app = FastAPI(
    title="Fraud Detection API (MLflow)",
    description="""
    Fraud detection API that loads models from MLflow Model Registry.
    
    This version always serves the model with the @champion alias.
    To update the model:
    1. Train a new model with train_mlflow.py
    2. Compare metrics in MLflow UI
    3. Assign the @champion alias to the best model
    4. Restart this API
    
    To roll back: Move the @champion alias to a previous version in MLflow UI, then restart this API.
    """,
    version="2.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount in dollars", example=150.00)
    hour: int = Field(..., description="Hour of the day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Monday, 6=Sunday)", example=3)
    merchant_category: str = Field(..., description="Type of merchant", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    model_source: str = "MLflow Registry (@champion)"

@app.post("/predict", response_model=PredictionResponse)
def predict(tx: Transaction):
    """Predict whether a transaction is fraudulent using the champion model."""
    data = tx.dict()
    
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        data["merchant_encoded"] = 0
    
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        model_source="MLflow Registry (@champion)"
    )

@app.get("/health")
def health():
    return {"status": "healthy", "model_source": "MLflow Registry"}

@app.get("/model-info")
def model_info():
    """Get information about the currently loaded model."""
    return {
        "registry": "MLflow",
        "model_name": "fraud-detection-model",
        "alias": "champion",
        "tracking_uri": "http://localhost:5000"
    }

Stop your old API (Ctrl+C) and start this new one:

uvicorn src.serve_mlflow:app --reload --host 0.0.0.0 --port 8000

Now deploying a new model is a controlled, auditable process:

Train new model → Automatically registered as new version
Compare metrics → Use MLflow UI to compare with the current @champion
Set as champion → Assign @champion alias in MLflow UI
Restart API → Loads the new @champion model
Roll back if needed → Move @champion alias to previous version and restart the API

Checkpoint:

MLflow UI (http://localhost:5000) should show the "fraud-detection" experiment with 5 runs
The "Models" tab should show "fraud-detection-model" with 5 versions
One version should have the @champion alias
The API should load and serve the @champion model

4. Ensure Feature Consistency with Feast

⚠️ First time hearing about feature stores? Don't worry.
You don't need to master every Feast detail on the first read.
Focus on why feature consistency matters — you can revisit the implementation later.
Key takeaway: Training and serving must compute features the same way, or your model silently fails.

What breaks without this: Your model sees different feature values in production than it saw during training. Accuracy drops silently. This is called "training-serving skew" and it's one of the most common causes of ML system failures.

Skew arises whenever the data transformations applied at training time differ from those applied at inference time. It is subtle because nothing crashes, and even small discrepancies can severely degrade performance.

Why This Matters: Imagine you're computing "average transaction amount per merchant category" as a feature. During training, you compute it using pandas in a notebook. During serving, you compute it using SQL in a different system. Small differences in how these computations handle edge cases (nulls, rounding, time windows) cause the model to see different features in production than it was trained on.

The result? Silent failures where accuracy drops but nothing errors out. Your model is scoring feature values that were computed differently from the ones it was trained on, and nothing tells you.
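
To make the null-handling case concrete, here is a minimal sketch with made-up numbers. It assumes a training pipeline that ignores missing amounts (which is what pandas' `mean` does by default) and a serving pipeline that coalesces missing amounts to 0 before averaging. Both function names are hypothetical.

```python
from statistics import mean

# Hypothetical amounts for one merchant category; None marks a missing value.
amounts = [20.0, None, 40.0, 60.0]


def avg_skip_nulls(values):
    """Training-style average: missing values are ignored."""
    present = [v for v in values if v is not None]
    return mean(present)


def avg_nulls_as_zero(values):
    """Serving-style average: missing values are replaced with 0 first."""
    return mean(0.0 if v is None else v for v in values)


train_feature = avg_skip_nulls(amounts)     # (20 + 40 + 60) / 3
serve_feature = avg_nulls_as_zero(amounts)  # (20 + 0 + 40 + 60) / 4
print(train_feature, serve_feature)         # 40.0 30.0
```

Same raw data, same feature name, and a 25% difference in the value the model receives. Neither pipeline raises an error.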

In our naive implementation, we did handle one simple case: we saved the LabelEncoder to ensure merchant_category is encoded the same way in training and serving. But imagine if we had more complex feature engineering:

Rolling averages over time windows
User-level aggregations
Cross-feature interactions
Real-time features from streaming data

Keeping all of this consistent by hand quickly becomes error-prone.

4.1 What is Feast and Why Use It?

In production ML platforms, teams use a feature store to guarantee feature consistency between training and serving. Feast is one popular open-source option.

In this tutorial, we use Feast not because you must, but because it makes the training-serving contract explicit and teachable. The principles apply whether you use Feast, Tecton, Featureform, or a custom solution.

Feast provides:

Capability	Description
Single source of truth	Define features once, use everywhere
Offline/online consistency	Same features for training and serving
Point-in-time correctness	Prevents data leakage in training
Low-latency serving	Millisecond feature retrieval
Feature versioning	Track changes to feature definitions

How Feast works:

Define features in Python code (feature definitions)
Materialize features from your data sources to the online store
Retrieve features using the same API for both training (offline) and serving (online)

Because both paths read the same stored values through the same feature definitions, training and serving see identical features.

4.2 Install and Initialize Feast

We already installed Feast via requirements.txt. Now let's initialize a feature repository.

# Navigate to the feature_repo directory
cd feature_repo

# Initialize Feast (this creates template files)
feast init . --minimal

# Go back to project root
cd ..

This creates the basic Feast structure:

feature_repo/
├── feature_store.yaml    # Feast configuration
└── __init__.py

4.3 Define Feature Definitions

First, let's create the Feast configuration file:

# feature_repo/feature_store.yaml
project: fraud_detection
registry: ../data/registry.db
provider: local
online_store:
  type: sqlite
  path: ../data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 3

This configuration:

Names our project "fraud_detection"
Uses SQLite for the online store (for production, you'd use Redis or DynamoDB)
Uses local files for the offline store (for production, you'd use BigQuery or Snowflake)

Now create the feature definitions:

# feature_repo/features.py
"""
Feast feature definitions for fraud detection.

This file defines:
- Entities: The keys we use to look up features (merchant_category)
- Data Sources: Where the raw feature data comes from (Parquet file)
- Feature Views: The features themselves and their schemas

The key insight: These definitions are the SINGLE SOURCE OF TRUTH.
Both training and serving use these exact definitions.
"""
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# =============================================================================
# ENTITIES
# =============================================================================
# An entity is the "key" we use to look up features.
# For merchant-level features, the entity is merchant_category.

merchant = Entity(
    name="merchant_category",
    description="Merchant category for the transaction (for example, 'online', 'grocery')",
    value_type=ValueType.STRING,
)

# =============================================================================
# DATA SOURCES
# =============================================================================
# Data sources tell Feast where to find the raw feature data.
# For local development, we use a Parquet file.
# For production, this could be BigQuery, Snowflake, S3, etc.

merchant_stats_source = FileSource(
    name="merchant_stats_source",
    path="../data/merchant_features.parquet",  # We'll create this file
    timestamp_field="event_timestamp",       # Required for point-in-time joins
)

# =============================================================================
# FEATURE VIEWS
# =============================================================================
# A Feature View defines a group of related features.
# It specifies:
# - Which entity the features are for
# - The schema (names and types of features)
# - Where the data comes from
# - How long features are valid (TTL)

merchant_stats_fv = FeatureView(
    name="merchant_stats",
    description="Aggregated statistics per merchant category",
    entities=[merchant],
    ttl=timedelta(days=7),  # Features are valid for 7 days
    schema=[
        Field(name="avg_amount", dtype=Float32, description="Average transaction amount"),
        Field(name="transaction_count", dtype=Int64, description="Number of transactions"),
        Field(name="fraud_rate", dtype=Float32, description="Historical fraud rate"),
    ],
    source=merchant_stats_source,
    online=True,  # Enable online serving (low-latency retrieval)
)
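
The `timestamp_field` and `ttl` settings work together during a point-in-time join. The sketch below is a simplified model of that lookup in plain Python, not Feast's implementation: for a given event time, take the most recent feature row recorded at or before that time, and ignore rows older than the TTL. The function name `lookup_as_of` and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

TTL = timedelta(days=7)  # mirrors the ttl on merchant_stats above

# Hypothetical feature rows for one entity: (event_timestamp, avg_amount)
rows = [
    (datetime(2024, 1, 1), 30.0),
    (datetime(2024, 1, 10), 32.0),
]


def lookup_as_of(rows, event_time: datetime, ttl: timedelta) -> Optional[float]:
    """Return the latest value recorded at or before event_time and no older than ttl."""
    candidates = [
        (ts, value)
        for ts, value in rows
        if ts <= event_time and event_time - ts <= ttl
    ]
    if not candidates:
        return None  # no valid row: the training example gets a null feature
    return max(candidates, key=lambda pair: pair[0])[1]


print(lookup_as_of(rows, datetime(2024, 1, 5), TTL))   # 30.0: the Jan 10 row is still in the future
print(lookup_as_of(rows, datetime(2024, 1, 12), TTL))  # 32.0: the newest valid row wins
print(lookup_as_of(rows, datetime(2024, 1, 25), TTL))  # None: every row is older than 7 days
```

The first call is the one that prevents leakage: a transaction on January 5 never sees statistics that were only computed on January 10.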

4.4 Materialize Features to Online Store

Now we need to:

Compute the features from our training data
Save them in a format Feast can read
Apply the Feast definitions
Materialize features to the online store

Create src/prepare_feast_features.py:

# src/prepare_feast_features.py
"""
Prepare feature data for Feast.

This script:
1. Computes aggregated merchant features from training data
2. Saves them in Parquet format (Feast's offline store format)
3. Applies Feast feature definitions
4. Materializes features to the online store for low-latency serving

Run this whenever your training data changes or you want to refresh features.
"""
import pandas as pd
import numpy as np
from datetime import datetime
import subprocess
import os

def compute_merchant_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute aggregated features by merchant category.
    
    THIS IS THE SINGLE SOURCE OF TRUTH FOR FEATURE COMPUTATION.
    
    Both training and serving will use features computed by this exact logic.
    Any change here automatically applies everywhere.
    
    Args:
        df: Transaction DataFrame with columns: amount, merchant_category, is_fraud
        
    Returns:
        DataFrame with computed features per merchant category
    """
    print("Computing merchant-level features...")
    
    # Group by merchant category and compute aggregates
    stats = df.groupby('merchant_category').agg({
        'amount': ['mean', 'count'],
        'is_fraud': 'mean'
    }).reset_index()
    
    # Flatten column names
    stats.columns = ['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']
    
    # Add timestamp for Feast (required for point-in-time correct joins)
    stats['event_timestamp'] = datetime.now()
    
    # Convert types to match Feast schema
    stats['avg_amount'] = stats['avg_amount'].astype('float32')
    stats['transaction_count'] = stats['transaction_count'].astype('int64')
    stats['fraud_rate'] = stats['fraud_rate'].astype('float32')
    
    return stats

def main():
    print("="*60)
    print("FEAST FEATURE PREPARATION")
    print("="*60)
    
    # Load training data
    print("\n1. Loading training data...")
    train_df = pd.read_csv('data/train.csv')
    print(f"   Loaded {len(train_df):,} transactions")
    
    # Compute merchant features
    print("\n2. Computing merchant features...")
    merchant_features = compute_merchant_features(train_df)
    
    print("\n   Computed features:")
    print(merchant_features.to_string(index=False))
    
    # Save as Parquet (required format for Feast file source)
    print("\n3. Saving features to Parquet...")
    os.makedirs('data', exist_ok=True)
    output_path = 'data/merchant_features.parquet'
    merchant_features.to_parquet(output_path, index=False)
    print(f"   Saved to {output_path}")
    
    # Apply Feast feature definitions
    print("\n4. Applying Feast feature definitions...")
    try:
        result = subprocess.run(
            ['feast', 'apply'],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Feature definitions applied successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error applying Feast: {e.stderr}")
        raise
    
    # Materialize features to online store
    print("\n5. Materializing features to online store...")
    try:
        result = subprocess.run(
            ['feast', 'materialize-incremental', datetime.now().isoformat()],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Features materialized successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error materializing: {e.stderr}")
        raise
    
    print("\n" + "="*60)
    print("FEAST FEATURE PREPARATION COMPLETE!")
    print("="*60)
    print("\nYou can now:")
    print("  - Retrieve features for training: get_training_features()")
    print("  - Retrieve features for serving: get_online_features()")
    print("  - View feature stats: feast feature-views list")

if __name__ == "__main__":
    main()

Run the feature preparation:

python src/prepare_feast_features.py

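You should see output along these lines (abridged here; the script also prints the full feature table and any messages from the Feast CLI):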

============================================================
FEAST FEATURE PREPARATION
============================================================

1. Loading training data... 8,000 transactions
2. Computing merchant features...
   grocery: avg=$31.24, fraud_rate=0.85%
   online: avg=$98.45, fraud_rate=4.87%
   restaurant: avg=$28.12, fraud_rate=0.50%
   retail: avg=$45.67, fraud_rate=1.02%
   travel: avg=$156.23, fraud_rate=4.18%
3. Saving to data/merchant_features.parquet ✓
4. Applying Feast definitions... ✓
5. Materializing to online store... ✓

FEAST FEATURE PREPARATION COMPLETE!

4.5 Retrieve Features for Training and Serving

Now let's create utilities to retrieve features consistently for both training and serving:

# src/feast_features.py
"""
Feast feature retrieval for training and serving.

This module provides functions to retrieve features from Feast:
- get_training_features(): For offline training (historical features)
- get_online_features(): For real-time serving (low-latency)

IMPORTANT: Both functions use the SAME feature definitions,
ensuring consistency between training and serving.
"""
import pandas as pd
from feast import FeatureStore
from datetime import datetime

# Initialize Feast store (points to our feature_repo)
store = FeatureStore(repo_path="feature_repo")

def get_training_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Get features for training using Feast's offline store.
    
    Uses point-in-time correct joins to prevent data leakage.
    This means features are looked up as of the time each transaction occurred,
    not as of "now" - preventing you from accidentally using future data.
    
    Args:
        df: DataFrame with at least 'merchant_category' column
        
    Returns:
        DataFrame with original columns plus Feast features
    """
    print("Retrieving training features from Feast offline store...")
    
    # Prepare entity dataframe with timestamps
    # Each row needs: entity key(s) + event_timestamp
    entity_df = df[['merchant_category']].copy()
    entity_df['event_timestamp'] = datetime.now()  # See note below
    entity_df = entity_df.drop_duplicates()
    
    # ⚠️ Simplification: For clarity, we use the current timestamp here.
    # In real systems, this would be the actual event time of each transaction.
    
    # Retrieve historical features
    # Feast handles the point-in-time join automatically
    training_data = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
    ).to_df()
    
    # Merge features back with original dataframe
    result = df.merge(
        training_data[['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']],
        on='merchant_category',
        how='left'
    )
    
    print(f"Retrieved features for {len(entity_df)} unique merchants")
    return result

def get_online_features(merchant_category: str) -> dict:
    """
    Get features for real-time serving using Feast's online store.
    
    This is optimized for low-latency retrieval (milliseconds).
    Use this in your prediction API for real-time inference.
    
    Args:
        merchant_category: The merchant category to look up
        
    Returns:
        Dictionary with feature names and values
    """
    # Retrieve from online store (low-latency)
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": merchant_category}],
    ).to_dict()
    
    # Format the response
    return {
        'merchant_avg_amount': feature_vector['avg_amount'][0],
        'merchant_tx_count': feature_vector['transaction_count'][0],
        'merchant_fraud_rate': feature_vector['fraud_rate'][0],
    }

def get_online_features_batch(merchant_categories: list) -> pd.DataFrame:
    """
    Get features for multiple merchants at once (batch serving).
    
    More efficient than calling get_online_features() in a loop.
    
    Args:
        merchant_categories: List of merchant categories to look up
        
    Returns:
        DataFrame with features for each merchant
    """
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": mc} for mc in merchant_categories],
    ).to_df()
    
    return feature_vector

if __name__ == "__main__":
    # Test the feature retrieval functions
    print("="*60)
    print("TESTING FEAST FEATURE RETRIEVAL")
    print("="*60)
    
    # Test offline retrieval (for training)
    print("\n1. Testing OFFLINE feature retrieval (for training)...")
    train_df = pd.read_csv('data/train.csv').head(10)
    enriched = get_training_features(train_df)
    print("\n   Sample enriched training data:")
    print(enriched[['amount', 'merchant_category', 'avg_amount', 'fraud_rate']].head())
    
    # Test online retrieval (for serving)
    print("\n2. Testing ONLINE feature retrieval (for serving)...")
    for category in ['online', 'grocery', 'travel', 'restaurant', 'retail']:
        features = get_online_features(category)
        print(f"   {category}: avg_amount=${features['merchant_avg_amount']:.2f}, "
              f"fraud_rate={features['merchant_fraud_rate']:.2%}")
    
    # Test batch retrieval
    print("\n3. Testing BATCH online retrieval...")
    batch_features = get_online_features_batch(['online', 'grocery', 'travel'])
    print(batch_features)
    
    print("\n" + "="*60)
    print("FEAST FEATURE RETRIEVAL TEST COMPLETE!")
    print("="*60)

Test the feature retrieval:

python src/feast_features.py

You should see:

============================================================
TESTING FEAST FEATURE RETRIEVAL
============================================================

1. Testing OFFLINE feature retrieval (for training)...
Retrieving training features from Feast offline store...
Retrieved features for 5 unique merchants

   Sample enriched training data:
   amount merchant_category  avg_amount  fraud_rate
    45.23           grocery       31.24      0.0085
   123.45            online       98.45      0.0487
    ...

2. Testing ONLINE feature retrieval (for serving)...
   online: avg_amount=$98.45, fraud_rate=4.87%
   grocery: avg_amount=$31.24, fraud_rate=0.85%
   travel: avg_amount=$156.23, fraud_rate=4.18%
   restaurant: avg_amount=$28.12, fraud_rate=0.50%
   retail: avg_amount=$45.67, fraud_rate=1.02%

3. Testing BATCH online retrieval...
  merchant_category  avg_amount  transaction_count  fraud_rate
               online       98.45               1234      0.0487
              grocery       31.24               2345      0.0085
               travel      156.23                478      0.0418
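
Once both retrieval paths work, it's worth checking that they actually agree. The helper below is a minimal sketch of such a parity check. It assumes you have already pulled one row from each store into plain dictionaries keyed by the same feature names, and the name `find_skew` is hypothetical.

```python
import math


def find_skew(offline: dict, online: dict, rel_tol: float = 1e-6) -> list:
    """Compare two {feature_name: value} mappings and list every mismatch."""
    problems = []
    for name, expected in offline.items():
        if name not in online:
            problems.append(f"{name}: missing from online store")
        elif online[name] is None or not math.isclose(expected, online[name], rel_tol=rel_tol):
            problems.append(f"{name}: offline={expected} online={online[name]}")
    return problems


# Hypothetical values for the "online" merchant category
offline_row = {"avg_amount": 98.45, "fraud_rate": 0.0487}
online_row = {"avg_amount": 98.45, "fraud_rate": 0.0487}
print(find_skew(offline_row, online_row))  # [] when the two stores agree
```

A check like this could run after every materialization. A non-empty list means the online store is stale or the two paths have diverged.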

Why Feast Over Custom Code?

Aspect	Custom Code	Feast
Consistency	Manual effort to keep in sync	Automatic - same definitions everywhere
Point-in-time correctness	Must implement yourself	Built-in
Online serving	Must build your own cache	Built-in online store
Feature versioning	Not supported	Built-in
Scalability	Limited	Production-ready (BigQuery, Redis, etc.)
Team collaboration	Difficult	Feature registry with documentation
Monitoring	Manual	Built-in feature statistics

💡 Mental Model: Treat feature definitions like database schemas.
You wouldn't compute a column one way in your application and a different way in your reports. Features deserve the same discipline — define once, use everywhere.

Checkpoint: After running prepare_feast_features.py, you should have:

data/merchant_features.parquet (computed features)
data/registry.db (Feast registry)
data/online_store.db (SQLite online store)

Running python src/feast_features.py should successfully retrieve features for all merchant categories.

5. Add Data Validation with Great Expectations

What breaks without this: Your API accepts garbage input (negative amounts, invalid hours) and returns meaningless predictions. Worse, you have no idea it happened.

Recall that our API currently trusts input blindly. We saw how garbage data produces a prediction with no warning. Great Expectations is an open-source tool for data quality testing – defining rules (expectations) and testing data against them.

Why This Matters: Data validation acts as a gatekeeper. Bad data is rejected before it can harm predictions. As the saying goes, "Garbage in, garbage out" – feeding unreliable data yields unreliable results. With validation, we transform this to "Garbage in, error out" – much better for debugging and reliability.

5.1 Define Expectations

What are reasonable expectations for our transaction data? Based on domain knowledge:

Field	Expectation	Reason
`amount`	Positive (> 0)	Negative transactions don't make sense
`amount`	At most $50,000	Larger amounts are treated as outliers or errors
`hour`	0-23 inclusive	Valid hours in a day
`day_of_week`	0-6 inclusive	Valid days (Mon=0, Sun=6)
`merchant_category`	One of known categories	Must match training data
All fields	Not null	Required for prediction

Create src/data_validation.py:

# src/data_validation.py
"""
Data validation for fraud detection.

This module provides functions to validate input data BEFORE making predictions.
Invalid data is rejected with clear error messages.

The key insight: It's better to reject bad input than to make garbage predictions.
"""
import pandas as pd
from typing import Dict, List, Any, Optional

# Define the valid merchant categories (must match training data!)
VALID_CATEGORIES = ["grocery", "restaurant", "retail", "online", "travel"]

def validate_transaction(data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Validate a single transaction for fraud prediction.
    
    Checks all business rules and data quality requirements.
    Returns a dictionary with 'valid' (bool) and 'errors' (list).
    
    Args:
        data: Dictionary with transaction fields
        
    Returns:
        {"valid": bool, "errors": list of error messages}
        
    Example:
        >>> validate_transaction({"amount": -100, "hour": 25, ...})
        {"valid": False, "errors": ["amount must be positive", "hour must be 0-23"]}
    """
    errors = []
    
    # ==========================================================================
    # Amount Validation
    # ==========================================================================
    amount = data.get("amount")
    if amount is None:
        errors.append("amount is required")
    elif not isinstance(amount, (int, float)):
        errors.append(f"amount must be a number (got {type(amount).__name__})")
    elif amount <= 0:
        errors.append("amount must be positive")
    elif amount > 50000:
        errors.append(f"amount exceeds maximum allowed value of $50,000 (got ${amount:,.2f})")
    
    # ==========================================================================
    # Hour Validation
    # ==========================================================================
    hour = data.get("hour")
    if hour is None:
        errors.append("hour is required")
    elif not isinstance(hour, int):
        errors.append(f"hour must be an integer (got {type(hour).__name__})")
    elif not (0 <= hour <= 23):
        errors.append(f"hour must be between 0 and 23 (got {hour})")
    
    # ==========================================================================
    # Day of Week Validation
    # ==========================================================================
    day = data.get("day_of_week")
    if day is None:
        errors.append("day_of_week is required")
    elif not isinstance(day, int):
        errors.append(f"day_of_week must be an integer (got {type(day).__name__})")
    elif not (0 <= day <= 6):
        errors.append(f"day_of_week must be between 0 (Monday) and 6 (Sunday) (got {day})")
    
    # ==========================================================================
    # Merchant Category Validation
    # ==========================================================================
    category = data.get("merchant_category")
    if category is None:
        errors.append("merchant_category is required")
    elif not isinstance(category, str):
        errors.append(f"merchant_category must be a string (got {type(category).__name__})")
    elif category not in VALID_CATEGORIES:
        errors.append(
            f"merchant_category must be one of {VALID_CATEGORIES} (got '{category}')"
        )
    
    return {
        "valid": len(errors) == 0,
        "errors": errors
    }

def validate_batch(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Validate a batch of transactions using Great Expectations.
    
    This is useful for validating training data or batch prediction requests.
    Uses Great Expectations for more sophisticated validation.
    
    Args:
        df: DataFrame with transaction data
        
    Returns:
        Dictionary with validation results
    """
    import great_expectations as gx
    
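    # Convert to a Great Expectations dataset.
    # Note: gx.from_pandas belongs to the legacy (pre-1.0) API, so this assumes
    # the installed great_expectations version is older than 1.0.
    ge_df = gx.from_pandas(df)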
    
    results = []
    
    # Amount expectations
    r = ge_df.expect_column_values_to_be_between(
        'amount', min_value=0.01, max_value=50000, mostly=0.99
    )
    results.append(('amount_range', r.success, r.result))
    
    # Hour expectations
    r = ge_df.expect_column_values_to_be_between(
        'hour', min_value=0, max_value=23
    )
    results.append(('hour_range', r.success, r.result))
    
    # Day of week expectations
    r = ge_df.expect_column_values_to_be_between(
        'day_of_week', min_value=0, max_value=6
    )
    results.append(('day_range', r.success, r.result))
    
    # Merchant category expectations
    r = ge_df.expect_column_values_to_be_in_set(
        'merchant_category', VALID_CATEGORIES
    )
    results.append(('category_valid', r.success, r.result))
    
    # No nulls in critical fields
    for col in ['amount', 'hour', 'day_of_week', 'merchant_category']:
        r = ge_df.expect_column_values_to_not_be_null(col)
        results.append((f'{col}_not_null', r.success, r.result))
    
    # Summarize results
    passed = sum(1 for _, success, _ in results if success)
    total = len(results)
    
    return {
        'success': passed == total,
        'passed': passed,
        'total': total,
        'pass_rate': passed / total,
        'details': {name: {'passed': success, 'result': result} 
                   for name, success, result in results}
    }

if __name__ == "__main__":
    print("="*60)
    print("TESTING DATA VALIDATION")
    print("="*60)
    
    # Test single transaction validation
    print("\n1. Single Transaction Validation")
    print("-"*40)
    
    test_cases = [
        {
            "name": "Valid transaction",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Negative amount",
            "data": {"amount": -100.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Invalid hour",
            "data": {"amount": 50.0, "hour": 25, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Unknown merchant",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "unknown"}
        },
        {
            "name": "Everything wrong",
            "data": {"amount": -999, "hour": 99, "day_of_week": 15, "merchant_category": "fake"}
        },
    ]
    
    for tc in test_cases:
        result = validate_transaction(tc["data"])
        status = "PASS" if result["valid"] else "FAIL"
        print(f"\n{tc['name']}: {status}")
        if result["errors"]:
            for error in result["errors"]:
                print(f"  - {error}")
    
    # Test batch validation
    print("\n\n2. Batch Validation with Great Expectations")
    print("-"*40)
    
    train_df = pd.read_csv('data/train.csv')
    results = validate_batch(train_df)
    
    print(f"\nTraining data validation: {results['passed']}/{results['total']} checks passed")
    print(f"Pass rate: {results['pass_rate']:.1%}")
    
    if not results['success']:
        print("\nFailed checks:")
        for name, detail in results['details'].items():
            if not detail['passed']:
                print(f"  - {name}")
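
Two edge cases slip past plain `isinstance` and range checks when `validate_transaction` is called directly with a dictionary. First, `bool` is a subclass of `int` in Python, so `True` is accepted as an hour. Second, `NaN` fails both the `<= 0` and the `> 50000` comparisons, so it is accepted as an amount. The stricter helpers below are a sketch of one way to close both gaps. Their names are hypothetical and they are not part of the module above.

```python
import math


def is_strict_int(value) -> bool:
    """True for ints, but not for bools (bool is a subclass of int)."""
    return isinstance(value, int) and not isinstance(value, bool)


def is_finite_number(value) -> bool:
    """True for finite ints and floats; rejects bools, NaN, and infinities."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )


print(isinstance(True, int))           # True: why the plain check accepts booleans
print(is_strict_int(True))             # False
print(is_strict_int(14))               # True
print(is_finite_number(50.0))          # True
print(is_finite_number(float("nan")))  # False
```

Swapping these in for the `isinstance` calls in the amount, hour, and day checks would keep the rest of the function unchanged.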

When to Use Which Validation Approach

Approach	Use Case	Latency	When to Use
Custom Python (`validate_transaction`)	Real-time API requests	<1ms	Every prediction request
Great Expectations	Batch data quality	Seconds	Training data, periodic audits, CI/CD

We use both in this tutorial because they serve different purposes:

Custom validation is your runtime gatekeeper — fast enough for every request
Great Expectations is your batch auditor — thorough checks on datasets

5.2 Integrate Validation into FastAPI

Now let's update our API to reject invalid input with clear error messages:

# src/serve_validated.py
"""
Serve fraud detection model with input validation.

This version adds data validation BEFORE making predictions:
- Invalid inputs are rejected with HTTP 400 and clear error messages
- Valid inputs are processed and predictions returned

This is much safer than the naive version which accepted garbage.
"""
import pickle
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.data_validation import validate_transaction

# Load model
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)

app = FastAPI(
    title="Fraud Detection API (Validated)",
    description="""
    Fraud detection API with input validation.
    
    All inputs are validated before prediction:
    - amount: Must be positive and at most $50,000
    - hour: Must be 0-23
    - day_of_week: Must be 0-6
    - merchant_category: Must be one of: grocery, restaurant, retail, online, travel
    
    Invalid inputs return HTTP 400 with detailed error messages.
    """,
    version="3.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount (must be positive)", example=150.00)
    hour: int = Field(..., description="Hour of day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Mon, 6=Sun)", example=3)
    merchant_category: str = Field(..., description="Merchant type", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    validation_passed: bool = True

class ValidationErrorResponse(BaseModel):
    detail: dict

@app.post("/predict", response_model=PredictionResponse, responses={400: {"model": ValidationErrorResponse}})
def predict(tx: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Input is validated before prediction. Invalid inputs return HTTP 400.
    """
    data = tx.dict()
    
    # VALIDATE INPUT BEFORE MAKING PREDICTION
    validation = validate_transaction(data)
    
    if not validation["valid"]:
        raise HTTPException(
            status_code=400,
            detail={
                "message": "Validation failed",
                "errors": validation["errors"],
                "input": data
            }
        )
    
    # Input is valid - make prediction
    data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        validation_passed=True
    )

@app.get("/health")
def health():
    return {"status": "healthy", "validation": "enabled"}

Start the validated API:

uvicorn src.serve_validated:app --reload --host 0.0.0.0 --port 8000

Now test with bad data:

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": -500, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}'

Response (HTTP 400):

{
  "detail": {
    "message": "Validation failed",
    "errors": [
      "amount must be positive",
      "hour must be between 0 and 23 (got 25)",
      "day_of_week must be between 0 (Monday) and 6 (Sunday) (got 10)",
      "merchant_category must be one of ['grocery', 'restaurant', 'retail', 'online', 'travel'] (got 'fake')"
    ],
    "input": {"amount": -500.0, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}
  }
}

This is a huge improvement! Instead of silently accepting garbage and returning meaningless predictions, we now:

Reject invalid input immediately
Provide clear, actionable error messages
Return the original input for debugging
Use proper HTTP status codes (400 for client error)

Checkpoint: Your validated API should:

Accept valid transactions and return predictions
Reject invalid transactions with HTTP 400 and detailed error messages
Show validation errors for each invalid field

6. Monitor Model Performance and Data Drift

What breaks without this: Your model's accuracy drops from 98% to 70% over two months. Nobody notices until customers complain. By then, significant damage has occurred.

Even with a great model and clean input data, time can be an enemy. Model performance can decline as real-world data evolves – this is known as model drift or model decay.

Why This Matters: In traditional software, you monitor CPU, memory, error rates, and response times. In ML, you must also monitor:

Data quality (are inputs within expected ranges?)
Model performance (is accuracy holding up?)
Data drift (has input distribution changed?)
Prediction drift (has the distribution of predictions changed?)

Without monitoring, your model could be silently failing for weeks before anyone notices. By then, significant damage may have occurred – fraud slipping through, good customers blocked, revenue lost.

6.1 The Four Pillars of ML Observability

Pillar	What to Monitor	Why It Matters
Data Quality	Are inputs valid? Nulls? Outliers?	Bad data causes bad predictions
Model Performance	Accuracy, precision, recall, F1	Is the model still working?
Data Drift	Has input distribution changed from training?	Model may not generalize to new data
Prediction Drift	Has prediction distribution changed?	May indicate data or concept drift
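
Prediction drift can be tracked without labels by comparing the distribution of predicted fraud probabilities over time. One common summary is the Population Stability Index (PSI). The sketch below is an illustration under stated assumptions: the `psi` helper, the 10 equal-width bins, and the 0.1 / 0.25 rule-of-thumb cutoffs are not from this tutorial, and the code uses only the standard library.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of scores in [0, 1]."""

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so that 1.0 lands in the last bin and negatives in the first
            idx = min(max(int(v * bins), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


training_scores = [0.05] * 90 + [0.95] * 10    # mostly low fraud probability
production_scores = [0.05] * 50 + [0.95] * 50  # many more high scores
value = psi(training_scores, production_scores)
# Rule of thumb (an assumption): < 0.1 stable, 0.1-0.25 moderate, > 0.25 major shift
print(f"PSI = {value:.3f}")
```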

6.2 Build a Drift Monitor with Evidently

Evidently is an open-source library specifically designed for ML monitoring. It can detect drift, generate reports, and integrate with monitoring systems.

Create src/monitoring.py:

# src/monitoring.py
"""
Model monitoring with Evidently.

This module provides tools to:
1. Detect data drift between training and production data
2. Generate detailed HTML reports
3. Track drift over time
4. Alert when drift exceeds thresholds

In production, you would run drift checks periodically (hourly, daily)
and alert when significant drift is detected.
"""
import pandas as pd
import numpy as np
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric
)
from datetime import datetime
from typing import List, Dict, Any, Optional

class DriftMonitor:
    """
    Monitor for detecting data drift between reference (training) and current data.
    
    Implementation Note: We use two approaches here:
    1. SciPy's KS test: a lightweight statistical check that check_drift() runs on every call
    2. Evidently: a full-featured library that generate_report() uses for HTML reports
    
    Keeping the KS test separate is defensive coding: if Evidently fails to generate
    a report, we still get drift detection.
    
    Usage:
        monitor = DriftMonitor(training_data)
        result = monitor.check_drift(production_data)
        if result['drift_detected']:
            alert("Drift detected!")
    """
    
    def __init__(self, reference_data: pd.DataFrame, feature_columns: Optional[List[str]] = None):
        """
        Initialize the drift monitor with reference (training) data.
        
        Args:
            reference_data: The training data to compare against
            feature_columns: Columns to monitor (default: all numeric columns)
        """
        self.reference = reference_data
        self.feature_columns = feature_columns or reference_data.select_dtypes(
            include=[np.number]
        ).columns.tolist()
        self.history: List[Dict[str, Any]] = []
        
        print(f"Drift monitor initialized with {len(self.reference):,} reference samples")
        print(f"Monitoring columns: {self.feature_columns}")
    
    def check_drift(self, current_data: pd.DataFrame, threshold: float = 0.1) -> Dict[str, Any]:
        """
        Check for drift between reference and current data.
        
        Args:
            current_data: Current/production data to check
            threshold: Drift share threshold for alerting (default 10%)
            
        Returns:
            Dictionary with drift results
        """
        from scipy import stats
        
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        # Simple statistical drift detection using KS test
        drifted_columns = []
        for col in self.feature_columns:
            statistic, p_value = stats.ks_2samp(
                ref_subset[col].dropna(),
                cur_subset[col].dropna()
            )
            if p_value < 0.05:  # 5% significance level
                drifted_columns.append(col)
        
        n_features = len(self.feature_columns)
        n_drifted = len(drifted_columns)
        drift_share = n_drifted / n_features if n_features > 0 else 0
        
        result = {
            'timestamp': datetime.now().isoformat(),
            'drift_detected': n_drifted > 0,
            'drift_share': drift_share,
            'drifted_columns': drifted_columns,
            'n_features': n_features,
            'n_drifted': n_drifted,
            'current_samples': len(current_data),
            'threshold': threshold,
            'alert': drift_share > threshold
        }
        
        self.history.append(result)
        
        return result
    
    def generate_report(self, current_data: pd.DataFrame, output_path: str = "drift_report.html"):
        """
        Generate a detailed HTML drift report using Evidently.
        
        Open the saved file in a browser to inspect drift patterns visually.
        """
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        try:
            report = Report(metrics=[DataDriftPreset()])
            report.run(reference_data=ref_subset, current_data=cur_subset)
            
            # Save the HTML report with Evidently's built-in writer
            report.save_html(output_path)
            
            print(f"Drift report saved to {output_path}")
            print("Open this file in a browser to view detailed visualizations.")
        except Exception as e:
            print(f"Could not generate Evidently report: {e}")
            print("Fall back to check_drift() for KS-based drift detection.")
    
    def get_alerts(self, threshold: float = 0.1) -> List[Dict[str, Any]]:
        """
        Get all alerts from history where drift exceeded threshold.
        """
        return [
            {
                'timestamp': r['timestamp'],
                'severity': 'HIGH' if r['drift_share'] > 0.3 else 'MEDIUM',
                'drift_share': r['drift_share'],
                'message': f"Drift detected: {r['drift_share']:.1%} of features drifted",
                'drifted_columns': r['drifted_columns']
            }
            for r in self.history
            if r['drift_share'] > threshold
        ]
    
    def summary(self) -> Dict[str, Any]:
        """Get summary statistics from monitoring history."""
        if not self.history:
            return {"message": "No drift checks performed yet"}
        
        drift_shares = [r['drift_share'] for r in self.history]
        alerts = [r for r in self.history if r['alert']]
        
        return {
            'total_checks': len(self.history),
            'total_alerts': len(alerts),
            'avg_drift_share': np.mean(drift_shares),
            'max_drift_share': np.max(drift_shares),
            'first_check': self.history[0]['timestamp'],
            'last_check': self.history[-1]['timestamp']
        }


def simulate_drift_scenarios():
    """
    Demonstrate drift detection with different scenarios.
    
    This simulates what happens when production data differs from training data.
    """
    from src.generate_data import generate_transactions
    
    print("="*70)
    print("DRIFT DETECTION SIMULATION")
    print("="*70)
    
    # Load reference (training) data
    print("\n1. Loading reference data (training set)...")
    reference = pd.read_csv('data/train.csv')
    feature_cols = ['amount', 'hour', 'day_of_week']
    
    # Initialize drift monitor
    monitor = DriftMonitor(reference, feature_cols)
    
    # Scenario 1: Similar data (should show minimal drift)
    print("\n" + "-"*70)
    print("SCENARIO 1: Test data (similar distribution)")
    print("-"*70)
    test_data = pd.read_csv('data/test.csv')
    result = monitor.check_drift(test_data)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 2: Fraud spike (10% fraud instead of 2%)
    print("\n" + "-"*70)
    print("SCENARIO 2: Fraud spike (10% fraud rate instead of 2%)")
    print("-"*70)
    fraud_spike = generate_transactions(n_samples=2000, fraud_ratio=0.10, seed=101)
    result = monitor.check_drift(fraud_spike)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 3: Amount inflation (everything costs more)
    print("\n" + "-"*70)
    print("SCENARIO 3: Amount inflation (2x multiplier)")
    print("-"*70)
    inflated = test_data.copy()
    inflated['amount'] = inflated['amount'] * 2
    result = monitor.check_drift(inflated)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 4: Time shift (more late-night transactions)
    print("\n" + "-"*70)
    print("SCENARIO 4: Time shift (mostly late-night transactions)")
    print("-"*70)
    night_shift = test_data.copy()
    # Seeded generator so the simulated time shift is reproducible between runs
    night_shift['hour'] = np.random.default_rng(42).choice([0, 1, 2, 3, 22, 23], size=len(night_shift))
    result = monitor.check_drift(night_shift)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Generate detailed report for the most drifted scenario
    print("\n" + "-"*70)
    print("GENERATING DETAILED REPORT")
    print("-"*70)
    monitor.generate_report(night_shift, "drift_report.html")
    
    # Print summary
    print("\n" + "-"*70)
    print("MONITORING SUMMARY")
    print("-"*70)
    summary = monitor.summary()
    print(f"  Total checks: {summary['total_checks']}")
    print(f"  Total alerts: {summary['total_alerts']}")
    print(f"  Average drift share: {summary['avg_drift_share']:.1%}")
    print(f"  Maximum drift share: {summary['max_drift_share']:.1%}")
    
    # Print alerts
    alerts = monitor.get_alerts()
    if alerts:
        print(f"\n  Alerts ({len(alerts)}):")
        for alert in alerts:
            print(f"    [{alert['severity']}] {alert['message']}")
    
    print("\n" + "="*70)
    print("DRIFT DETECTION SIMULATION COMPLETE")
    print("="*70)
    print("\nOpen drift_report.html in your browser to see detailed visualizations!")


if __name__ == "__main__":
    simulate_drift_scenarios()

Run the drift simulation:

python src/monitoring.py

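You'll see output showing how drift detection behaves in each scenario. Then open drift_report.html in your browser to see visualizations of the drift patterns. If the script fails with ModuleNotFoundError: No module named 'src', run it as a module from the project root instead: python -m src.monitoring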
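
`check_drift` leans on `scipy.stats.ks_2samp`, which compares the empirical cumulative distributions of two samples. To make that statistic less of a black box, here is a pure-Python sketch of the D statistic only, meaning the largest gap between the two empirical CDFs. `ks_statistic` is a hypothetical helper, and it does not compute the p-value that SciPy returns.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest vertical gap between the empirical CDFs of two samples."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough
    for x in ref + cur:
        ref_cdf = bisect_right(ref, x) / len(ref)  # share of reference values <= x
        cur_cdf = bisect_right(cur, x) / len(cur)  # share of current values <= x
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


amounts = [20.0, 35.0, 50.0, 80.0]
print(ks_statistic(amounts, amounts))                   # identical samples
print(ks_statistic(amounts, [a * 2 for a in amounts]))  # the 2x inflation scenario
```

With large samples, even a small D can fall below the 0.05 p-value cutoff, so it can help to look at D alongside the p-value before alerting.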

6.3 Production Monitoring Strategy

In a production environment, you would:

Log all predictions to a database or data warehouse
Run drift checks periodically (hourly for high-traffic systems, daily for lower traffic)
Set up alerts when drift exceeds thresholds (integrate with PagerDuty, Slack, etc.)
Trigger retraining if drift is severe or sustained
Create dashboards to track drift over time (Grafana, Datadog, etc.)
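
The "severe or sustained" rule for retraining can be made concrete against the history that `DriftMonitor` keeps. The sketch below is one possible policy, not the tutorial's own: `should_retrain` is a hypothetical helper, the 0.3 "severe" cutoff reuses the HIGH severity level from `get_alerts`, and three consecutive alerts is an arbitrary definition of "sustained".

```python
from typing import Any, Dict, List


def should_retrain(
    history: List[Dict[str, Any]],
    severe_share: float = 0.3,
    sustained_checks: int = 3,
) -> bool:
    """Return True if the latest check is severe or the last N checks all alerted."""
    if not history:
        return False

    # Severe: the most recent check shows a large share of drifted features
    if history[-1]["drift_share"] > severe_share:
        return True

    # Sustained: every one of the last N checks raised an alert
    recent = history[-sustained_checks:]
    return len(recent) == sustained_checks and all(r["alert"] for r in recent)


# Entries shaped like the dictionaries that check_drift() appends to history
history = [
    {"drift_share": 0.2, "alert": True},
    {"drift_share": 0.2, "alert": True},
    {"drift_share": 0.2, "alert": True},
]
print(should_retrain(history))
```

Separating the alert threshold from the retraining rule keeps a single noisy check from kicking off an expensive training job.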

Checkpoint: Running python src/monitoring.py should:

Show minimal drift for similar data (test set)
Show significant drift for modified data (fraud spike, inflation, time shift)
Generate an HTML report that you can view in your browser

7. Automate Testing and Deployment with CI/CD

What breaks without this: A typo in your code breaks the API. You deploy on Friday at 5 PM. Nobody notices until Monday. Fraud losses spike over the weekend.

CI/CD (Continuous Integration/Continuous Deployment) ensures reliable, repeatable releases. As JFrog notes: "A strong CI/CD pipeline enables ML teams to build robust, bug-free models more quickly and efficiently."

Why This Matters: In ML, changes aren't just code – they're also data and models. CI/CD ensures that when you change training logic, data preprocessing, or hyperparameters, tests verify the change doesn't break anything before it reaches production. It's the difference between deploying with confidence and deploying with crossed fingers.

7.1 Write Tests for Data and Model

Create tests/test_data_and_model.py:

# tests/test_data_and_model.py
"""
Tests for data quality and model performance.

These tests run in CI/CD to ensure:
1. Data meets quality requirements
2. Model meets performance thresholds
3. No regressions are introduced

Run with: pytest tests/test_data_and_model.py -v
"""
import pandas as pd
import pickle
import pytest
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

class TestDataQuality:
    """Tests for training data quality."""
    
    @pytest.fixture
    def train_data(self):
        return pd.read_csv("data/train.csv")
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_train_data_has_expected_columns(self, train_data):
        """Training data must have all required columns."""
        required_columns = {"amount", "hour", "day_of_week", "merchant_category", "is_fraud"}
        actual_columns = set(train_data.columns)
        missing = required_columns - actual_columns
        assert not missing, f"Missing columns: {missing}"
    
    def test_train_data_not_empty(self, train_data):
        """Training data must have rows."""
        assert len(train_data) > 0, "Training data is empty"
        assert len(train_data) >= 1000, f"Training data too small: {len(train_data)} rows"
    
    def test_no_negative_amounts(self, train_data):
        """Transaction amounts must be non-negative."""
        negative_count = (train_data["amount"] < 0).sum()
        assert negative_count == 0, f"Found {negative_count} negative amounts"
    
    def test_amounts_reasonable(self, train_data):
        """Transaction amounts should be within reasonable bounds."""
        max_amount = train_data["amount"].max()
        assert max_amount <= 100000, f"Max amount {max_amount} exceeds reasonable limit"
    
    def test_hours_valid(self, train_data):
        """Hours must be 0-23."""
        invalid = train_data[(train_data["hour"] < 0) | (train_data["hour"] > 23)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid hours"
    
    def test_days_valid(self, train_data):
        """Days of week must be 0-6."""
        invalid = train_data[(train_data["day_of_week"] < 0) | (train_data["day_of_week"] > 6)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid days"
    
    def test_merchant_categories_valid(self, train_data):
        """Merchant categories must be from known set."""
        valid_categories = {"grocery", "restaurant", "retail", "online", "travel"}
        actual_categories = set(train_data["merchant_category"].unique())
        invalid = actual_categories - valid_categories
        assert not invalid, f"Invalid merchant categories: {invalid}"
    
    def test_fraud_ratio_reasonable(self, train_data):
        """Fraud ratio should be realistic (between 0.1% and 50%)."""
        fraud_ratio = train_data["is_fraud"].mean()
        assert 0.001 <= fraud_ratio <= 0.5, f"Fraud ratio {fraud_ratio:.2%} is unrealistic"
    
    def test_no_nulls_in_critical_columns(self, train_data):
        """Critical columns must not have null values."""
        critical = ["amount", "hour", "day_of_week", "merchant_category", "is_fraud"]
        for col in critical:
            null_count = train_data[col].isnull().sum()
            assert null_count == 0, f"Column {col} has {null_count} null values"


class TestModelPerformance:
    """Tests for model performance thresholds."""
    
    @pytest.fixture
    def model_and_encoder(self):
        with open("models/model.pkl", "rb") as f:
            return pickle.load(f)
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_model_loads_successfully(self, model_and_encoder):
        """Model file must load without errors."""
        model, encoder = model_and_encoder
        assert model is not None, "Model is None"
        assert encoder is not None, "Encoder is None"
    
    def test_model_can_predict(self, model_and_encoder, test_data):
        """Model must be able to make predictions."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        predictions = model.predict(X)
        assert len(predictions) == len(X), "Prediction count mismatch"
    
    def test_accuracy_threshold(self, model_and_encoder, test_data):
        """Model accuracy must be at least 90%."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        accuracy = model.score(X, y)
        assert accuracy >= 0.90, f"Accuracy {accuracy:.2%} below 90% threshold"
    
    def test_f1_threshold(self, model_and_encoder, test_data):
        """Model F1-score must be at least 0.3 (sanity check for imbalanced data)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        f1 = f1_score(y, y_pred)
        assert f1 >= 0.3, f"F1-score {f1:.2f} below 0.3 threshold"
    
    def test_precision_not_zero(self, model_and_encoder, test_data):
        """Model precision must be greater than 0 (at least some flagged transactions are real fraud)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        precision = precision_score(y, y_pred, zero_division=0)
        assert precision > 0, "Model has zero precision (no correct fraud predictions)"
    
    def test_recall_not_zero(self, model_and_encoder, test_data):
        """Model recall must be greater than 0 (catches at least some fraud)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        recall = recall_score(y, y_pred, zero_division=0)
        assert recall > 0, "Model has zero recall (misses all fraud)"

Create tests/test_api.py:

# tests/test_api.py
"""
Tests for the FastAPI prediction service.

These tests ensure the API:
1. Returns correct responses for valid inputs
2. Rejects invalid inputs with proper error messages
3. Health check works

Run with: pytest tests/test_api.py -v
Note: Requires the API to be running on localhost:8000
"""
import pytest
import httpx

BASE_URL = "http://localhost:8000"

class TestPredictionEndpoint:
    """Tests for the /predict endpoint."""
    
    def test_valid_prediction_returns_200(self):
        """Valid input should return HTTP 200 with prediction."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        assert "is_fraud" in data
        assert "fraud_probability" in data
        assert isinstance(data["is_fraud"], bool)
        assert 0 <= data["fraud_probability"] <= 1
    
    def test_high_risk_transaction(self):
        """High-risk transaction should still return a valid fraud probability."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 500.0,
            "hour": 3,  # Late night
            "day_of_week": 1,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        # Smoke test only: the exact probability depends on the trained model,
        # so we check the range rather than asserting that it is elevated
        assert 0.0 <= data["fraud_probability"] <= 1.0
    
    def test_negative_amount_rejected(self):
        """Negative amount should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": -100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
        assert "errors" in response.json()["detail"]
    
    def test_invalid_hour_rejected(self):
        """Invalid hour should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 25,  # Invalid
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_invalid_merchant_rejected(self):
        """Unknown merchant category should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "unknown_category"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_missing_field_rejected(self):
        """Missing required field should be rejected."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14
            # Missing day_of_week and merchant_category
        }, timeout=10)
        
        assert response.status_code == 422  # Pydantic validation error


class TestHealthEndpoint:
    """Tests for the /health endpoint."""
    
    def test_health_returns_200(self):
        """Health endpoint should return 200."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        assert response.status_code == 200
    
    def test_health_returns_healthy_status(self):
        """Health endpoint should indicate healthy status."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        data = response.json()
        assert data["status"] == "healthy"

Run tests locally:

# Run data and model tests (API not needed)
pytest tests/test_data_and_model.py -v

# Run API tests (requires API to be running)
pytest tests/test_api.py -v
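
The suite checks F1, precision, and recall in addition to accuracy because accuracy alone is misleading on imbalanced data. As a worked example under an assumed 2% fraud rate (the baseline mentioned in the drift scenarios), a model that never predicts fraud still clears the 90% accuracy threshold. The `metrics_from_counts` helper below is hypothetical and uses only the standard library.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 1,000 transactions, 20 of them fraud, and a model that always says "not fraud"
always_legit = metrics_from_counts(tp=0, fp=0, fn=20, tn=980)
print(always_legit)
```

The accuracy test alone would pass this model, while the recall and F1 tests would fail it, which is the behavior you want from the suite.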

7.2 GitHub Actions Workflow

⚠️ Note for Production Teams
In real ML teams, you typically don't retrain full models inside CI — it's slow and resource-intensive.
Here we do it to keep everything local, reproducible, and self-contained for learning.
Production pipelines usually separate training (scheduled jobs) from testing (CI/CD).

Create .github/workflows/ci.yml:

# .github/workflows/ci.yml
name: ML Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: 'pip'
      
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      
      - name: Generate training data
        run: python src/generate_data.py
      
      - name: Train model
        run: python src/train_naive.py
      
      - name: Run data and model tests
        run: pytest tests/test_data_and_model.py -v --tb=short
      
      - name: Build Docker image
        run: docker build -t fraud-detection-api .
      
      - name: Run container for API tests
        run: |
          docker run -d -p 8000:8000 --name test-api fraud-detection-api
          sleep 10  # Wait for API to start
          curl -f http://localhost:8000/health || exit 1
      
      - name: Run API tests
        run: pytest tests/test_api.py -v --tb=short
      
      - name: Cleanup
        if: always()
        run: docker stop test-api || true
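
The fixed `sleep 10` in the workflow is a simple way to wait for the container, but it can be too short on a slow runner and wastes time on a fast one. A readiness poll is a common alternative. The sketch below is an assumption, not part of the tutorial's pipeline: `wait_for_health` is a hypothetical helper built on the standard library's `urllib`, and it treats an HTTP 200 from the URL as healthy.

```python
import time
import urllib.request


def wait_for_health(url: str, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, reset, or timed out: the API is not up yet
            pass
        time.sleep(interval_s)
    return False
```

A CI step could call `wait_for_health("http://localhost:8000/health")` after starting the container and exit with a non-zero status if it returns False.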

7.3 Dockerize the Application

Create Dockerfile:

# Dockerfile
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ src/
COPY models/ models/
COPY data/ data/

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the API
CMD ["uvicorn", "src.serve_validated:app", "--host", "0.0.0.0", "--port", "8000"]

Create .dockerignore:

# .dockerignore
venv/
__pycache__/
*.pyc
.git/
.github/
mlruns/
*.db
*.html
.pytest_cache/

Build and run locally:

# Build the Docker image
docker build -t fraud-detection-api .

# Run the container
docker run -p 8000:8000 fraud-detection-api

# Test it
curl http://localhost:8000/health

Checkpoint:

All tests pass: pytest tests/test_data_and_model.py -v
Docker image builds successfully
Container runs and responds to health checks

8. Incident Response Playbook

When things go wrong in production (and they will), you need a plan. This section provides playbooks for common ML incidents.

Scenario: False Positive Spike

Symptoms: Your fraud model suddenly flags 40% of legitimate transactions as fraud, blocking customers and overwhelming your manual review team.

Severity: HIGH - Direct customer impact

Phase 1: Mitigation (0-5 minutes)

Acknowledge the incident - Notify stakeholders that you're aware and responding
Roll back to previous model - In MLflow UI, move the @champion alias to the previous model version
Restart the API - docker restart fraud-api or redeploy
Verify - Check that false positive rate has returned to normal
Communicate - "Issue detected and mitigated. Investigating root cause."
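
During an incident it is easier to run a script than to click through the MLflow UI. The rollback step could be sketched as below. This is an assumption-heavy illustration: `roll_back_champion` is a hypothetical helper, it assumes the version immediately before the current one is the last known-good model, and it expects a client object exposing MLflow's `get_model_version_by_alias` and `set_registered_model_alias` methods.

```python
from typing import Any


def roll_back_champion(client: Any, model_name: str, alias: str = "champion") -> int:
    """Point `alias` at the version just before the one it currently targets."""
    current = int(client.get_model_version_by_alias(model_name, alias).version)
    if current <= 1:
        raise ValueError(f"{model_name} has no earlier version to roll back to")

    previous = current - 1  # assumption: the prior version is the last known-good one
    client.set_registered_model_alias(model_name, alias, str(previous))
    return previous
```

With MLflow installed, `client` would be an `mlflow.tracking.MlflowClient()` instance and `model_name` whatever name the model was registered under. The API still needs a restart or reload afterward so that it picks up the model behind the moved alias.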

Phase 2: Diagnosis (5-60 minutes)

Check drift report - Run python src/monitoring.py with recent production data
Check data validation logs - Did upstream data format change?
Check recent deployments - Was there a new model or code deployed recently?
Compare metrics - What's different between the rolled-back and problematic model?

Example root causes:

Upstream system sent amounts in cents instead of dollars
New merchant category appeared that wasn't in training data
Holiday shopping patterns differed significantly from training data

Phase 3: Remediation (1-24 hours)

Fix the root cause - Add validation for the edge case, or update training data
Retrain if needed - Include new patterns in training data
Add test case - Prevent this from happening again
Document - Add to runbook for future reference

Scenario: Gradual Performance Decay

Symptoms: Monitoring shows fraud recall dropping 2% per week over a month. No sudden failures, just slow degradation.

Severity: MEDIUM - Gradual impact, time to respond

Response:

Investigate drift report - Look for gradual distribution changes
```
python src/monitoring.py
```
Collect recent labeled data - Get confirmed fraud cases from the past month
Analyze patterns - What's different about recent fraud?
- New attack vectors?
- Different time patterns?
- New merchant categories?
Retrain on combined data - Include both old and new patterns
```
python src/train_mlflow.py
```
Deploy via canary - Route 10% of traffic to the new model first
- Monitor metrics for 1-2 days
- If metrics improve, increase to 50%, then 100%
- If metrics worsen, roll back
Set up recurring retraining - Schedule weekly or monthly retraining
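
The canary step needs a way to decide which requests go to the new model. One common approach is to hash a stable request or customer ID into a bucket, so the same ID always reaches the same model. The sketch below assumes IDs shaped like "txn-123" and a hypothetical `route_to_canary` helper; neither comes from the tutorial's code.

```python
import hashlib


def route_to_canary(request_id: str, canary_percent: int = 10) -> bool:
    """Deterministically send roughly `canary_percent`% of requests to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99] for this ID
    return bucket < canary_percent


ids = [f"txn-{i}" for i in range(10_000)]
share = sum(route_to_canary(i) for i in ids) / len(ids)
print(f"Canary share: {share:.1%}")
```

Raising `canary_percent` from 10 to 50 to 100 keeps every ID that was already on the canary there, which makes before/after comparisons cleaner than random routing.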

Scenario: Upstream Data Schema Change

Symptoms: API starts returning 500 errors. Logs show KeyError: 'merchant_category'.

Severity: HIGH - Service is down

Response:

Check error logs - Identify the exact error
```
KeyError: 'merchant_category'
```
Check upstream data - Did the field name change?
- merchant_category -> category
- amount -> transaction_amount

Immediate fix - Add field name mapping

# Quick fix in API
if 'category' in data and 'merchant_category' not in data:
    data['merchant_category'] = data['category']

Long-term fix - Add validation that catches schema changes

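required_fields = ['amount', 'hour', 'day_of_week', 'merchant_category']
missing = [f for f in required_fields if f not in data]
if missing:
    # Inside the FastAPI handler, return a 400 instead of letting a KeyError become a 500
    raise HTTPException(status_code=400, detail=f"Missing fields: {missing}")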

Add integration test - Test with upstream system in CI/CD
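
The quick fix and the long-term check can be combined into one normalization step that runs before validation. The sketch below is illustrative only: `normalize_payload`, `FIELD_ALIASES`, and `REQUIRED_FIELDS` are hypothetical names, and the alias map contains just the two renames listed in this scenario.

```python
from typing import Any, Dict

# Assumed upstream renames, taken from the examples in this scenario
FIELD_ALIASES = {"category": "merchant_category", "transaction_amount": "amount"}
REQUIRED_FIELDS = ["amount", "hour", "day_of_week", "merchant_category"]


def normalize_payload(data: Dict[str, Any]) -> Dict[str, Any]:
    """Map known alternate field names to canonical ones, then check for gaps."""
    normalized = dict(data)  # copy so the caller's payload is left untouched
    for alias, canonical in FIELD_ALIASES.items():
        if alias in normalized and canonical not in normalized:
            normalized[canonical] = normalized.pop(alias)

    missing = [f for f in REQUIRED_FIELDS if f not in normalized]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    return normalized


renamed = {"transaction_amount": 42.0, "hour": 9, "day_of_week": 2, "category": "retail"}
print(normalize_payload(renamed))
```

Keeping the aliases in one table means a future upstream rename becomes a one-line change plus a test case.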

9. How to Put It All Together

Let's step back and appreciate what we've built. Our initial naive system has transformed into a local ML platform with production-grade components.

💡 Mental Model: Each tool in this stack is a "catch net" for a specific failure mode:

MLflow catches "which model is this?"

Feast catches "are features consistent?"

Great Expectations catches "is this data valid?"

Evidently catches "has the world changed?"

CI/CD catches "did we break something?"

Together, they form defense-in-depth for ML systems.

Component	Tool	Problem Solved
Experiment Tracking	MLflow	Every run logged, reproducible
Model Registry	MLflow	Versioned models, rollback capability
Feature Store	Feast	Consistent features, no training-serving skew
Data Validation	Great Expectations	Bad data rejected with clear errors
Monitoring	Evidently	Drift detected before it causes problems
Containerization	Docker	Environment consistency everywhere
CI/CD	GitHub Actions	Automated testing and safe deployments

The Complete Workflow

Here's how all the pieces work together in practice:

Data arrives - New transaction data comes in from upstream systems
Validation gate - Great Expectations rules check data quality. Bad data is rejected with clear error messages before it can cause harm.
Feature computation - Feast computes features using the same definitions for both training and serving. No more training-serving skew.
Training - When you retrain, MLflow logs all parameters, metrics, and artifacts. Every experiment is reproducible and comparable.
Model registry - Trained models are automatically versioned. You can compare metrics, promote the best one by moving the @champion alias, and roll back if needed.
Serving - FastAPI loads the @champion model from MLflow. Each request is validated, features are retrieved from Feast, and predictions are returned.
Monitoring - Evidently checks for drift periodically. If input distributions change significantly, alerts are triggered.
Retraining loop - When drift is detected, you retrain on new data, compare metrics, and promote if better. The cycle continues.
CI/CD safety net - All code changes go through automated tests. Docker ensures environment consistency. Nothing reaches production without passing the pipeline.

10. What's Next: Scale to Production

This project runs locally, but the principles and tools extend directly to production deployments. Here's how each component scales:

Scaling Feast for Production

We used Feast with a local SQLite online store and Parquet files as the offline store. For production:

Component	Local	Production
Online Store	SQLite	Redis, DynamoDB, or PostgreSQL
Offline Store	Parquet files	BigQuery, Snowflake, or Redshift
Feature Server	Embedded	Dedicated Feast serving cluster

Benefits at scale:

Sub-10ms feature retrieval
Horizontal scaling for high throughput
Feature monitoring and statistics
Point-in-time joins at petabyte scale

Scaling MLflow for Production

Component	Local	Production
Backend Store	SQLite	PostgreSQL or MySQL
Artifact Store	Local filesystem	S3, GCS, or Azure Blob
Tracking Server	Single instance	Load-balanced cluster

Kubernetes Deployment

When you outgrow Docker Compose:

KServe or Seldon for serverless model serving with auto-scaling
Horizontal Pod Autoscaler to scale based on CPU/memory/custom metrics
Canary deployments to safely roll out new models (route 10% traffic first)
GPU scheduling for inference-heavy models

Advanced Monitoring

Expand observability with:

Prometheus + Grafana for real-time dashboards
OpenTelemetry for distributed tracing
PagerDuty/Slack integration for alerts
Labeled data collection for continuous model evaluation

A/B Testing and Multi-Armed Bandits

Use the model registry to:

Serve multiple models concurrently (champion vs challengers)
Route traffic dynamically based on context
Collect metrics for each model variant
Automatically promote the best performer
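As a rough illustration of the routing idea above, here is a minimal epsilon-greedy router in plain Python. Epsilon-greedy is just one simple bandit strategy picked for the sketch. The class name, the variant labels, and the success signal are assumptions, and nothing here talks to MLflow.

```python
import random
from typing import Dict, List


class EpsilonGreedyRouter:
    """Route requests across model variants, favouring the best observed one."""

    def __init__(self, variants: List[str], epsilon: float = 0.1, seed: int = 0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.successes: Dict[str, int] = {v: 0 for v in variants}
        self.trials: Dict[str, int] = {v: 0 for v in variants}

    def success_rate(self, variant: str) -> float:
        trials = self.trials[variant]
        return self.successes[variant] / trials if trials else 0.0

    def choose(self) -> str:
        variants = list(self.trials)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(variants)          # explore: random variant
        return max(variants, key=self.success_rate)   # exploit: best so far

    def record(self, variant: str, success: bool) -> None:
        self.trials[variant] += 1
        self.successes[variant] += int(success)


router = EpsilonGreedyRouter(["champion", "challenger"], epsilon=0.1)
router.record("champion", True)
router.record("challenger", False)
print(router.choose(), router.success_rate("champion"))
```

In a real deployment the success signal would come from labeled outcomes or business metrics collected per variant, and you would usually want a minimum number of trials per variant before trusting the comparison.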

Conclusion

Congratulations on building a production-ready ML system on your local machine!

What we assembled here is a microcosm of real-world ML platforms:

We started with just a model saved to a pickle file
We ended up with MLOps best practices: experiment tracking, model versioning, feature stores, data validation, monitoring, containerization, and CI/CD

The tools we used are production-grade:

MLflow powers ML platforms at companies like Microsoft, Facebook, and Databricks
Feast is used by companies like Gojek, Shopify, and Robinhood
FastAPI is one of the fastest Python web frameworks
Great Expectations is used at companies like GitHub and Shopify
Evidently is used for monitoring ML in production at scale

The principles apply at any scale:

Always track experiments
Always version models
Always validate data
Always monitor for drift
Always containerize for consistency
Always automate testing

Next Steps You Can Try

Deploy to the cloud - Push your Docker container to AWS ECS, Google Cloud Run, or Azure Container Instances
Add model explainability - Use SHAP or LIME to explain individual predictions
Implement A/B testing - Serve multiple models and compare performance
Add feature importance monitoring - Track how feature importance changes over time
Set up real-time alerting - Connect Evidently to Slack or PagerDuty
Implement continuous training - Automatically retrain when drift is detected
Add bias and fairness monitoring - Ensure your model treats all groups fairly

Remember that productionizing ML is an iterative process. There's always another layer of robustness to add, another edge case to handle, another metric to track. But with the foundation you've built here, you're well on your way to taking models from promising notebook experiments to deployed, monitored, and maintainable production applications.

Happy building, and may your models be accurate and your pipelines resilient!

Get the Complete Code

The entire project from this handbook is available as a public GitHub repository:

🔗 github.com/sandeepmb/freecodecamp-local-ml-platform

The repository includes:

All source code (src/ directory)
Test files (tests/ directory)
Feast feature definitions (feature_repo/)
Docker and CI/CD configuration
Ready-to-run scripts

Quick Start:

git clone https://github.com/sandeepmb/freecodecamp-local-ml-platform.git
cd freecodecamp-local-ml-platform
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python src/generate_data.py
python src/train_naive.py

How to Ship a Production-Ready RAG App with FAISS (Guardrails, Evals, and Fallbacks)

Chidozie Managwu — Mon, 16 Mar 2026 17:43:51 +0000

Most LLM applications look great in a high-fidelity demo. Then they hit the hands of real users and start failing in very predictable yet damaging ways.

They answer questions they should not, they break when document retrieval is weak, they time out due to network latency, and nobody can tell exactly what happened because there are no logs and no tests.

In this tutorial, you’ll build a beginner-friendly Retrieval Augmented Generation (RAG) application designed to survive production realities. This isn’t just a script that calls an API. It’s a system featuring a FastAPI backend, a persisted FAISS vector store, and essential safety guardrails (including a retrieval gate and fallbacks).

Why RAG Alone Does Not Equal Production-Ready
The Architecture You Are Building
Project Setup and Structure
How to Build the RAG Layer with FAISS
How to Add the LLM Call with Structured Output
How to Add Guardrails: Retrieval Gate and Fallbacks
FastAPI App: Creating the /answer Endpoint
How to Add Beginner-Friendly Evals
What to Improve Next: Realistic Upgrades

Why RAG Alone Does Not Equal Production-Ready

Retrieval Augmented Generation (RAG) is often hailed as the hallucination killer. By grounding the model in retrieved text, we provide it with the facts it needs to be accurate. But simply connecting a vector database to an LLM isn’t enough for a production environment.

Production issues usually arise from the silent failures in the system surrounding the model:

Weak retrieval: If the app retrieves irrelevant chunks of text, the model tries to bridge the gap by inventing an answer anyway. Without a designated “I do not know” path, the model is essentially forced to hallucinate.
Lack of visibility: Without structured outputs and basic logging, you can’t tell if bad retrieval, a confusing prompt, or a model update caused a wrong answer.
Fragility: A simple API timeout or malformed provider response becomes a user-facing outage if you don’t implement fallbacks.
No regression testing: In traditional software, we have unit tests. In AI, we need evals. Without them, a small tweak to your prompt might fix one issue but break ten others without you realising it.

We’ll solve each of these issues systematically in this guide.

Prerequisites

This tutorial is beginner-friendly, but it assumes you have a few basics in place so you can focus on building a robust RAG system instead of getting stuck on setup issues.

Knowledge

You should be comfortable with:

Python fundamentals (functions, modules, virtual environments)
Basic HTTP + JSON (requests, response payloads)
APIs with FastAPI (what an endpoint is and how to run a server)
High-level LLM concepts (prompting, temperature, structured outputs)

Tools + Accounts

You’ll need:

Python 3.10+
A working OpenAI-compatible API key (OpenAI or any provider that supports the same request/response shape)
A local environment where you can run a FastAPI app (Mac/Linux/Windows)

What This Tutorial Covers (and What It Doesn’t)

We’ll build a production-minded baseline:

A FAISS-backed retriever with a persisted index + metadata
A retrieval gate to prevent “forced hallucination”
Structured JSON outputs so your backend is stable
Fallback behavior for timeouts and provider errors
A small eval harness to prevent regressions

We won’t implement advanced upgrades such as rerankers, semantic chunking, auth, or background jobs. Those appear only as a roadmap at the end.

The Architecture You Are Building

The flow of our application follows a disciplined path so every answer is grounded in evidence:

User query: The user submits a question via a FastAPI endpoint.
Retrieval: The system embeds the question and retrieves the top-k most similar document chunks.
The retrieval gate: We evaluate the similarity score. If the context is not relevant enough, we stop immediately and refuse the query.
Augmentation and generation: If the gate passes, we send a context-augmented prompt to the LLM.
Structured response: The model returns a JSON object containing the answer, sources used, and a confidence level.

Project Setup and Structure

To keep things organized and maintainable, we’ll use a modular structure. This allows you to swap out your LLM provider or your vector database without rewriting your entire core application.

Project Structure

.
├── app.py              # FastAPI entry point and API logic
├── rag.py              # FAISS index, persistence, and document retrieval
├── llm.py              # LLM API interface and JSON parsing
├── prompts.py          # Centralized prompt templates
├── data/               # Source .txt documents
├── index/              # Persisted FAISS index and metadata
└── evals/              # Evaluation dataset and runner script
    ├── eval_set.json
    └── run_evals.py

Install Dependencies

First, create a virtual environment to isolate your project:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install fastapi uvicorn faiss-cpu numpy pydantic requests python-dotenv

Configure the Environment

Create a .env file in the root directory. We are targeting OpenAI-compatible providers:

OPENAI_API_KEY=your_actual_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o-mini

Important note on compatibility: The code below assumes an OpenAI-style API. If you use a provider that is not compatible, you must change the URL, headers (for example X-API-Key), and the way you extract embeddings and final message content in embed_texts() and call_llm().

How to Build the RAG Layer with FAISS

In rag.py, we handle the “Retriever” part of RAG. This involves turning raw text into mathematical vectors that the computer can compare.

What is FAISS (and What Does It Do)?

FAISS (Facebook AI Similarity Search) is a fast library for vector similarity search. In a RAG system, each chunk of text becomes an embedding vector (a list of floats). FAISS stores those vectors in an index so you can quickly ask:

“Given this question embedding, which document chunks are closest to it?”

In this tutorial, we use IndexFlatIP (a flat index that scores by inner product) and normalise vectors with faiss.normalize_L2(...). With normalised vectors, the inner product equals cosine similarity, giving us a stable score we can use for a retrieval gate.
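To see why normalisation turns inner product into cosine similarity, here is a small pure-Python check. It's only a sketch of the arithmetic: the helper names are made up for this example, and faiss.normalize_L2 does the same scaling in place on float32 arrays.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a = [3.0, 4.0]
b = [4.0, 3.0]

ip_of_normalized = inner_product(l2_normalize(a), l2_normalize(b))
cos = cosine_similarity(a, b)
print(ip_of_normalized, cos)  # the two values agree up to floating-point error
```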

Chunking Strategy With Overlap

We’ll use chunking with overlap. If we split a document at exactly 1,000 characters, we might cut a sentence in half, losing its meaning. With an overlap of, for example, 200 characters, the end of one chunk and the beginning of the next share context, so a sentence cut at one boundary is usually intact in the neighbouring chunk.
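A quick way to see what the overlap does is to compute the chunk boundaries without any text at all. The helper below is a hypothetical illustration (chunk_spans is not part of rag.py). It mirrors the stepping logic of chunk_text, with the same default size and overlap.

```python
from typing import List, Tuple


def chunk_spans(length: int, size: int = 1000, overlap: int = 200) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for each chunk of a text of `length` chars."""
    step = max(1, size - overlap)  # each chunk starts `step` characters after the previous one
    return [(start, min(start + size, length)) for start in range(0, length, step)]


spans = chunk_spans(2500)
for start, end in spans:
    print(start, end)
```

For a 2,500-character document this gives chunks starting at 0, 800, 1600, and 2400. Each chunk shares 200 characters with the one before it, and the final chunk is a short tail that is already fully contained in the previous chunk. If those near-duplicate tails bother you, one option is to drop chunks below a minimum length.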

Implementation of `rag.py`

import os
import faiss
import numpy as np
import requests
import json
from typing import List, Dict
from dotenv import load_dotenv

load_dotenv()

INDEX_PATH = "index/faiss.index"
META_PATH = "index/meta.json"

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        chunk = text[i : i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

def embed_texts(texts: List[str]) -> np.ndarray:
    # Note: If your provider is not OpenAI-compatible, change this URL and headers
    url = f"{os.getenv('OPENAI_BASE_URL')}/embeddings"
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
    payload = {"input": texts, "model": "text-embedding-3-small"}

    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    # If your provider uses a different response format, change the line below
    vectors = np.array([item["embedding"] for item in resp.json()["data"]], dtype="float32")
    return vectors

def build_index() -> None:
    all_chunks: List[str] = []
    metadata: List[Dict] = []

    if not os.path.exists("data"):
        os.makedirs("data")
        return

    for file in os.listdir("data"):
        if not file.endswith(".txt"):
            continue

        with open(f"data/{file}", "r", encoding="utf-8") as f:
            text = f.read()

        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        for c in chunks:
            metadata.append({"source": file, "text": c})

    if not all_chunks:
        return

    embeddings = embed_texts(all_chunks)
    faiss.normalize_L2(embeddings)

    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    os.makedirs("index", exist_ok=True)
    faiss.write_index(index, INDEX_PATH)

    with open(META_PATH, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False)

def load_index():
    if not (os.path.exists(INDEX_PATH) and os.path.exists(META_PATH)):
        raise FileNotFoundError(
            "FAISS index not found. Add .txt files to data/ and run build_index()."
        )

    index = faiss.read_index(INDEX_PATH)
    with open(META_PATH, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    return index, metadata

def retrieve(query: str, k: int = 5) -> List[Dict]:
    index, metadata = load_index()

    q_emb = embed_texts([query])
    faiss.normalize_L2(q_emb)

    scores, ids = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx == -1:
            continue
        m = metadata[idx]
        results.append(
            {"score": float(score), "source": m["source"], "text": m["text"], "id": int(idx)}
        )
    return results

How to Add the LLM Call with Structured Output

A major failure point in AI apps is the “chatty” nature of LLMs. If your backend expects a list of sources but the LLM returns conversational filler, your code will crash.

We solve this with structured output: instruct the model to return a strict JSON object, then parse it safely.

Implementation of `llm.py`

import json
import requests
import os
from typing import Dict, Any

def call_llm(system_prompt: str, user_prompt: str) -> Dict[str, Any]:
    # Note: Change URL/Headers if using a non-OpenAI compatible provider
    url = f"{os.getenv('OPENAI_BASE_URL')}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    }

    payload = {
        "model": os.getenv("OPENAI_MODEL"),
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
        "temperature": 0,
    }

    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        content = resp.json()["choices"][0]["message"]["content"]

        parsed = json.loads(content)
        parsed.setdefault("answer", "")
        parsed.setdefault("refusal", False)
        parsed.setdefault("confidence", "medium")
        parsed.setdefault("sources", [])
        return parsed

    except (requests.Timeout, requests.ConnectionError):
        return {
            "answer": "The system is temporarily unavailable (network issue). Please try again.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "network_error",
        }
    except Exception:
        return {
            "answer": "A system error occurred while generating the answer.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "unknown_error",
        }
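The setdefault calls above fill in missing keys, but they don't check types. As a possible extension (an assumption, not part of the original llm.py), a small normaliser can coerce whatever the model returns into the expected shape. The function name and the choice to treat an empty answer as a refusal are illustrative.

```python
from typing import Any, Dict

ALLOWED_CONFIDENCE = {"low", "medium", "high"}


def normalize_llm_output(parsed: Any) -> Dict[str, Any]:
    """Coerce a parsed JSON value into the response shape the API promises."""
    if not isinstance(parsed, dict):
        # e.g. the model returned a JSON list or a bare string
        return {"answer": "", "refusal": True, "confidence": "low", "sources": []}

    answer = parsed.get("answer")
    answer = answer if isinstance(answer, str) else ""

    confidence = parsed.get("confidence")
    if not (isinstance(confidence, str) and confidence in ALLOWED_CONFIDENCE):
        confidence = "low"

    sources = parsed.get("sources")
    sources = [s for s in sources if isinstance(s, str)] if isinstance(sources, list) else []

    # Design choice for this sketch: an empty answer counts as a refusal.
    refusal = parsed.get("refusal") is True or answer == ""

    return {"answer": answer, "refusal": refusal, "confidence": confidence, "sources": sources}


print(normalize_llm_output({"answer": "hi", "confidence": "certain", "sources": ["a.txt", 3]}))
```

You could call a helper like this on the result of json.loads(content) before returning it, so the frontend never sees a confidence value or sources field it doesn't expect.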

How to Add Guardrails: Retrieval Gate and Fallbacks

Guardrails are interceptors. They sit between the user and the model to prevent predictable failures.

The Retrieval Gate: How It Works and How to Add It

In a standard RAG pipeline, the system always calls the LLM. If the user asks an irrelevant question, the retriever will still return the “closest” (but wrong) chunks.

The solution is the retrieval gate:

Retrieve top-k chunks and get the top similarity score
If the score is below a threshold (for example 0.30), refuse immediately
Only call the LLM when retrieval is strong enough to ground the answer

A threshold of 0.30 is a reasonable starting point when using normalised cosine similarity, but you should tune it using evals (next section).
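One simple way to tune the threshold is to collect the top similarity score for a handful of labeled questions and sweep a few candidate values. The sketch below assumes you have such pairs. The scores and labels here are made up for illustration, and the helper names are hypothetical.

```python
from typing import List, Tuple

# (top_score, should_answer) pairs. Illustrative values, not measured data.
Sample = Tuple[float, bool]


def gate_accuracy(samples: List[Sample], threshold: float) -> float:
    """Fraction of samples where the gate's decision matches the label."""
    correct = sum((score >= threshold) == should_answer for score, should_answer in samples)
    return correct / len(samples)


def best_threshold(samples: List[Sample], candidates: List[float]) -> float:
    """Pick the candidate with the highest accuracy (ties go to the earliest candidate)."""
    return max(candidates, key=lambda t: gate_accuracy(samples, t))


samples = [(0.62, True), (0.48, True), (0.35, True), (0.28, False), (0.22, False), (0.15, False)]
candidates = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45]

chosen = best_threshold(samples, candidates)
print(chosen, gate_accuracy(samples, chosen))
```

The gate here answers when score >= threshold, matching the top_score < 0.30 refusal check in app.py. With real data you may prefer to weight wrong answers more heavily than unnecessary refusals instead of using plain accuracy.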

Fallbacks and Why They Matter

Fallbacks ensure that if an API fails or times out, the user gets a helpful message instead of a crash. They also keep your API response shape consistent, which prevents frontend errors and makes logging meaningful.

In this tutorial, fallbacks are implemented inside call_llm() so your FastAPI layer stays simple.

FastAPI App: Creating the /answer Endpoint

The app.py file is the conductor. It ties retrieval, guardrails, prompting, and generation together.

Implementation of `app.py`

from fastapi import FastAPI
from pydantic import BaseModel
from rag import retrieve
from llm import call_llm
import prompts
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")

app = FastAPI(title="Production-Ready RAG")

class QueryRequest(BaseModel):
    question: str

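@app.post("/answer")
def get_answer(req: QueryRequest):
    # Plain `def` (not `async def`): retrieve() and call_llm() make blocking
    # HTTP calls, so FastAPI runs this handler in its threadpool.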
    start_time = time.time()
    question = (req.question or "").strip()

    if not question:
        return {
            "answer": "Please provide a non-empty question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
        }

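    # 1) Retrieval (the embedding call can fail or time out too, so it gets a fallback)
    try:
        results = retrieve(question, k=5)
    except Exception:
        logger.exception("retrieval failed")
        return {
            "answer": "The system is temporarily unavailable. Please try again.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "retrieval_error",
            "latency_sec": round(time.time() - start_time, 2),
        }
    top_score = results[0]["score"] if results else 0.0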

    logger.info("query=%r top_score=%.3f num_results=%d", question, top_score, len(results))

    # 2) Retrieval Gate (Guardrail)
    if top_score < 0.30:
        return {
            "answer": "I do not have documents to answer that question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
            "retrieval": {"top_score": top_score, "k": 5},
        }

    # 3) Augment
    context_text = "\n\n".join([f"Source {r['source']}: {r['text']}" for r in results])
    user_prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

    # 4) Generation with Fallback
    response = call_llm(prompts.SYSTEM_PROMPT, user_prompt)

    # 5) Attach debug metadata
    response["latency_sec"] = round(time.time() - start_time, 2)
    response["retrieval"] = {"top_score": top_score, "k": 5}
    return response

Centralized Prompt Template: prompts.py

A small but important habit: keep prompts centralised so they’re versionable and easy to evaluate.

Example `prompts.py`

SYSTEM_PROMPT = """You are a RAG assistant. Use ONLY the provided Context to answer.
If the context does not contain the answer, respond with refusal=true.

Return a valid JSON object with exactly these keys:
- answer: string
- refusal: boolean
- confidence: "low" | "medium" | "high"
- sources: array of strings (source filenames you used)

Do not include any extra keys. Do not include markdown. Do not include commentary."""

How to Add Beginner-Friendly Evals

In AI systems, outputs are probabilistic. This makes testing harder than traditional software. Evals (evaluations) are a set of “golden questions” and “expected behaviours” you run repeatedly to detect regressions.

Instead of “does it output exactly this string,” you test:

Should the app refuse when the retrieval is weak?
When it answers, does it include sources?
Is the behaviour stable across prompt tweaks and model changes?

Step 1: Create `evals/eval_set.json`

This should contain both positive and negative cases.

[
  {
    "id": "in_scope_01",
    "question": "What is a retrieval gate and why is it important?",
    "expect_refusal": false,
    "notes": "Should explain gating and relate it to hallucination prevention."
  },
  {
    "id": "out_of_scope_01",
    "question": "What is the capital of France?",
    "expect_refusal": true,
    "notes": "If the knowledge base only includes our docs, the app should refuse."
  },
  {
    "id": "edge_01",
    "question": "",
    "expect_refusal": true,
    "notes": "Empty input should not call the LLM."
  }
]

Step 2: Create `evals/run_evals.py`

This runner calls your API endpoint (end-to-end) and checks expected behaviours.

import json
import requests

API_URL = "http://127.0.0.1:8000/answer"

def run():
    with open("evals/eval_set.json", "r", encoding="utf-8") as f:
        cases = json.load(f)

    passed = 0
    failed = 0

    for case in cases:
        resp = requests.post(API_URL, json={"question": case["question"]}, timeout=60)
        resp.raise_for_status()
        out = resp.json()

        got_refusal = bool(out.get("refusal", False))
        expect_refusal = bool(case["expect_refusal"])

        ok = (got_refusal == expect_refusal)

        # Beginner-friendly: if it answers, sources should exist and be a list
        if not got_refusal:
            ok = ok and isinstance(out.get("sources"), list)

        if ok:
            passed += 1
            print(f"PASS {case['id']}")
        else:
            failed += 1
            print(f"FAIL {case['id']} expected_refusal={expect_refusal} got_refusal={got_refusal}")
            print("Output:", json.dumps(out, indent=2))

    print(f"\nDone. Passed={passed} Failed={failed}")
    if failed:
        raise SystemExit(1)

if __name__ == "__main__":
    run()

How to Use Evals in Practice

Run your server:

uvicorn app:app --reload

In another terminal, run evals:

python evals/run_evals.py

If an eval fails, you have a concrete signal that something changed in retrieval, gating, prompting, or provider behaviour.

What to Improve Next: Realistic Upgrades

Building a reliable RAG app is iterative. Here are realistic next steps:

Semantic chunking: Break text based on meaning instead of character count.
Reranking: Use a cross-encoder reranker to reorder the top-k chunks for higher precision.
Metadata filtering: Filter results by category, date, or department to reduce false positives.
Better citations: Store chunk IDs and show exactly which chunk(s) the answer came from.
Observability: Add request IDs, structured logs, and traces so “what happened?” is answerable.
Async + background indexing: Move index building to a background job and keep the API responsive.
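As a starting point for the observability item, here is a minimal sketch of a structured log line with a request ID, using only the standard library. The field names and helper functions are assumptions for illustration. The question length is logged instead of the question text as one way to keep user input out of logs.

```python
import json
import logging
import time
import uuid
from typing import Any, Dict

logger = logging.getLogger("rag_app")


def build_log_record(question: str, top_score: float, refused: bool, started_at: float) -> Dict[str, Any]:
    """Assemble one JSON-serialisable record per request."""
    return {
        "request_id": str(uuid.uuid4()),   # lets you correlate logs for a single request
        "question_chars": len(question),
        "top_score": round(top_score, 3),
        "refused": refused,
        "latency_sec": round(time.time() - started_at, 3),
    }


def log_request(record: Dict[str, Any]) -> None:
    # One JSON object per line is easy to ship to most log aggregators.
    logger.info(json.dumps(record, sort_keys=True))


record = build_log_record("What is a retrieval gate?", 0.4567, False, time.time())
log_request(record)
```

Returning the request_id in the API response as well makes it possible for a user report to point at the exact log line.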

Final Thoughts: Production-Ready Is a Set of Habits

Building an AI application that survives in the real world is about building a system that is predictable, measurable, and safe.

Retrieval quality is measurable: Use similarity scores to gate your LLM.
Refusal is a feature: It is better to say “I do not know” than to lie.
Fallbacks are mandatory: Design for the moment the API goes down.
Evals prevent regressions: Never deploy a change without running your tests.

About Me

I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead AI Titans Network, a community for developers learning how to ship AI products.

My work has been recognized with the Global Tech Hero award and featured on platforms like HackerNoon.

How to Containerize Your MLOps Pipeline from Training to Serving

Balajee Asish Brahmandam — Thu, 12 Mar 2026 22:34:01 +0000

Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deploy it.

The model depended on a specific version of scikit-learn that conflicted with the production Python environment. The feature engineering pipeline required a NumPy build compiled against OpenBLAS, but the deployment servers ran MKL. A preprocessing step used a system library that existed on the data scientist's MacBook but not on the Ubuntu deployment target.

Three weeks of debugging later, we had the model running in production. Three weeks. For a model that was technically finished.

That experience is what pushed me to containerize our entire MLOps pipeline end to end. Not because Docker is trendy in ML circles, but because the alternative (hand-tuning environments, writing installation scripts that break on the next OS update, praying that what worked in training works in production) was costing us more time than the actual model development.

In this tutorial, you'll learn how to structure training and serving containers with multi-stage builds, how to set up experiment tracking with MLflow, how to version your training data with DVC, how to configure GPU passthrough for training, and how to tie it all together into a single Compose file with profiles. This is based on a year of running containerized ML pipelines across three teams.

Prerequisites

Docker Engine 24+ or Docker Desktop 4.20+ with Compose v2.22.0+
For GPU training, you'll need the NVIDIA Container Toolkit installed on the host and a compatible GPU driver. Run nvidia-smi to verify your GPU is visible, and docker compose version to check your Compose version.
Familiarity with Python, basic Docker concepts, and ML workflows (training, evaluation, serving) is assumed.

The MLOps Lifecycle: Where Containers Fit

If you've built a machine learning model, you know the process has a lot of stages. But if you're coming from a software engineering background (or you're a data scientist who mostly works in notebooks), it helps to see the full picture of what an MLOps pipeline looks like and where Docker fits into each stage.

An MLOps pipeline is a chain of interdependent stages:

Data ingestion and validation. Raw data comes in from databases, APIs, or file systems. You clean it, validate it, and store it in a format your model can use.
Feature engineering. You transform raw data into features the model can learn from. This might be as simple as normalizing numbers or as complex as generating embeddings.
Experiment tracking. You log every training run's configuration (hyperparameters, data version, code version) and results (accuracy, loss, evaluation metrics) so you can compare experiments and reproduce the best ones.
Model training. The model learns from your features. This is the compute-heavy part that often needs GPUs.
Evaluation. You measure the trained model against test data to see if it's good enough to deploy.
Packaging and serving. You wrap the trained model in an API so other systems can send it data and get predictions back.
Monitoring. You watch the model in production to catch problems like data drift (when the real-world data starts looking different from the training data) or performance degradation.

Each stage has different computational needs. Training might require GPUs and terabytes of memory. Serving needs low latency and horizontal scaling. Feature engineering might need distributed processing tools like Spark or Dask.

The thing that changed our approach: you don't containerize the entire pipeline as one monolithic image. You containerize each stage independently, with shared interfaces between them.

Think of it like microservices applied to ML infrastructure. Each container does one thing, does it well, and communicates with the others through well-defined interfaces: model artifacts stored in a registry, metrics logged to MLflow, data versioned in object storage.

This gives you the flexibility to:

Scale training on expensive GPU instances while running serving on cheaper CPU nodes
Update your feature engineering code without rebuilding your training environment
Version each stage independently in your container registry
Let data scientists and ML engineers work on training while platform engineers optimize serving

How to Build the Training Container

The training container is where most teams start, and where most teams make their first mistake.

The temptation is to create one massive image with every possible library, every CUDA version, every data processing tool. I've seen training images exceed 15GB. They take twenty minutes to build, ten minutes to push, and break whenever someone adds a new dependency.

Here's the pattern that works: use multi-stage builds to separate the build environment from the runtime environment, and use cache mounts to avoid re-downloading packages on every build.

If you're new to these concepts: a multi-stage build lets you use one Docker image to build your software and a different, smaller image to run it. You copy only the final artifacts from the build stage to the runtime stage, leaving behind compilers, build tools, and other things you don't need in production.

A cache mount tells Docker to keep a directory (like pip's download cache) between builds, so it doesn't re-download packages that haven't changed.

Here's the training Dockerfile. It defines a single stage named base and uses a cache mount for pip:

# syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

# System dependencies (rarely change)
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl && \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Dependencies (change occasionally)
COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

# Training code (changes frequently)
COPY src/ /app/src/
COPY configs/ /app/configs/

WORKDIR /app
ENTRYPOINT ["python", "-m", "src.train"]

Notice the layer ordering. Docker builds images in layers, and it caches each layer. If a layer hasn't changed, Docker reuses the cached version instead of rebuilding it. But here's the catch: if one layer changes, Docker rebuilds that layer and every layer after it.

That's why we put things in order of how often they change:

System packages at the top (they almost never change). Installing python3.11 and git takes time, but you only do it once.
Python dependencies in the middle (they change when you add or update a library). This layer rebuilds when requirements-train.txt changes.
Your actual code at the bottom (changes on every commit). This is the layer that rebuilds most often.

With this ordering, a code change only rebuilds the COPY layers at the end, not the entire image. If you put COPY src/ before pip install, every code change would trigger a full reinstall of all Python packages. That's the mistake I see most often in ML Dockerfiles.
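The invalidation rule is easy to model in a few lines of Python. This is a deliberately simplified mental model, not how Docker computes cache keys (those depend on the instruction and file checksums). The function name and layer labels are made up for the example.

```python
from typing import List


def layers_to_rebuild(layers: List[str], changed: str) -> List[str]:
    """Return the changed layer and every layer after it."""
    return layers[layers.index(changed):]


good_order = ["apt-get install", "pip install", "COPY src/"]
bad_order = ["apt-get install", "COPY src/", "pip install"]

# A code change touches only the COPY src/ layer in both orderings.
print(layers_to_rebuild(good_order, "COPY src/"))
print(layers_to_rebuild(bad_order, "COPY src/"))
```

With the good ordering, only the code layer is rebuilt. With the bad ordering, the pip install layer comes after the changed layer, so it is rebuilt on every commit.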

The --mount=type=cache,target=/root/.cache/pip line on the pip install command tells Docker to persist pip's download cache between builds. When you do update requirements, pip checks the cache first and only downloads packages that are new or changed. On a project with hundreds of ML dependencies (PyTorch alone pulls in dozens of sub-packages), this saves five to ten minutes per build.

Separate Training from Serving Requirements

Your training environment needs libraries that your serving environment does not. Training needs experiment tracking tools like MLflow, data processing libraries like pandas and polars, visualization libraries for debugging, and hyperparameter tuning frameworks. Serving needs a lightweight inference runtime, an API framework like FastAPI, health check endpoints, and minimal overhead.

It's a good idea to maintain separate requirements files:

# requirements-train.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
pandas==2.2.3
polars==1.20.0
dvc[s3]==3.59.1
optuna==4.2.0
matplotlib==3.10.0

# requirements-serve.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
fastapi==0.115.0
uvicorn[standard]==0.34.0
pydantic==2.10.0

The overlap is smaller than you'd think. torch, scikit-learn, and mlflow appear in both: the model needs the first two for inference, and the serving app uses mlflow to load the model from the registry. Everything else in the training file is baggage that slows down serving deployments and increases the attack surface.
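
If you want to check the overlap yourself as the files grow, a few lines of Python will do it. This is a minimal sketch that inlines the two files from above as strings. The package_names helper is a hypothetical name, and its parsing only handles simple pinned lines like the ones shown here, not every requirements syntax.

```python
import re

TRAIN = """
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
pandas==2.2.3
polars==1.20.0
dvc[s3]==3.59.1
optuna==4.2.0
matplotlib==3.10.0
"""

SERVE = """
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
fastapi==0.115.0
uvicorn[standard]==0.34.0
pydantic==2.10.0
"""


def package_names(requirements_text):
    """Extract bare package names from simple 'name[extra]==version' lines."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Keep only the name: drop extras like [s3] and the version pin
        names.add(re.split(r"[\[=<>~!]", line, maxsplit=1)[0].lower())
    return names


train, serve = package_names(TRAIN), package_names(SERVE)
print("shared:    ", sorted(train & serve))
print("train only:", sorted(train - serve))
print("serve only:", sorted(serve - train))
```

In a real project you would read the two files from disk instead of inlining them. The "train only" list is what you keep out of the serving image.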

CUDA and Driver Compatibility

One thing that will bite you if you ignore it: the CUDA runtime version inside your container must be compatible with the GPU driver version on the host. The rule is that the host driver must be at least as new as the minimum driver version required by the container's CUDA version. For example, CUDA 12.6 requires driver version 560.28+ on Linux.

Make sure you check your host driver version before choosing your base image:

# On the host machine
nvidia-smi
# Look for "Driver Version: 560.35.03" and "CUDA Version: 12.6"

# The CUDA version shown by nvidia-smi is the maximum CUDA version
# your driver supports, not the version installed

If your host driver is 535.x, don't use a cuda:12.6 base image. Use cuda:12.2 or upgrade the driver. A driver that is too old typically surfaces as CUDA driver version is insufficient for CUDA runtime version, or as a container that refuses to start, and both are painful to debug if you don't know to look at the driver first.

Pin your base images to specific tags (not latest) and document the minimum driver version in your README. When you deploy to new hardware, the driver version check should be part of your provisioning checklist.
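
One way to turn that checklist item into something executable is a small pre-flight script. The sketch below is an assumption about how you might structure it, not part of the pipeline above: the function names are hypothetical, and MIN_DRIVER uses the 560.28 figure quoted earlier for CUDA 12.6. The nvidia-smi query is only defined here, not called, so the example runs on machines without a GPU.

```python
import subprocess

# Minimum host driver for the CUDA 12.6 base image, per the figure quoted above
MIN_DRIVER = "560.28"


def parse_version(text, width=3):
    """Turn '560.35.03' into (560, 35, 3) so versions compare numerically."""
    parts = [int(part) for part in text.strip().split(".")]
    parts += [0] * (width - len(parts))  # pad '560.28' to (560, 28, 0)
    return tuple(parts)


def driver_is_new_enough(host_driver, minimum_driver=MIN_DRIVER):
    """True if the host driver meets the minimum required by the image."""
    return parse_version(host_driver) >= parse_version(minimum_driver)


def host_driver_version():
    """Ask nvidia-smi for the driver version (requires a GPU host)."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    )
    return output.decode().splitlines()[0].strip()


# Hard-coded examples so the sketch runs anywhere
print(driver_is_new_enough("560.35.03"))   # True
print(driver_is_new_enough("535.104.05"))  # False
```

On a real host, call driver_is_new_enough(host_driver_version()) and fail provisioning if it returns False.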

How to Set Up Experiment Tracking with MLflow

If you've ever trained a model and thought "wait, which hyperparameters gave me that good result last week?", you need experiment tracking. Without it, ML development turns into a mess of Jupyter notebooks, screenshots of metrics, and spreadsheets that nobody keeps up to date.

MLflow is the most widely adopted open-source tool for this. It logs three things for every training run: parameters (learning rate, batch size, number of epochs), metrics (accuracy, loss, F1 score), and artifacts (the trained model file, plots, evaluation reports). It stores all of this in a database and gives you a web UI to compare runs side by side.

Running MLflow as a containerized service means the tracking server is persistent and shared across your team, not running on one person's laptop:

services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

volumes:
  mlflow-artifacts:
  postgres-data:

Let me break down what's happening here.

The mlflow service runs the MLflow tracking server. It stores experiment metadata (parameters, metrics) in a Postgres database and saves artifacts (model files, plots) to a Docker volume.

The depends_on with condition: service_healthy tells Compose to wait until Postgres is actually ready to accept connections before starting MLflow. Without this, MLflow would crash on startup because the database isn't ready yet.

The db service runs Postgres with a health check that uses pg_isready, a built-in Postgres utility that checks if the database is accepting connections. The start_period gives Postgres 10 seconds to initialize before health checks start counting failures.

Your training code connects to MLflow by setting one environment variable:

import os
import mlflow

# This tells MLflow where to log experiments
# When running inside Docker Compose, "mlflow" resolves to the mlflow container
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow:5000"

# Example: logging a training run
with mlflow.start_run(run_name="fraud-detector-v2"):
    # Log hyperparameters
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("epochs", 50)

    # ... train your model here ...

    # Log metrics
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.log_metric("precision", 0.93)
    mlflow.log_metric("recall", 0.89)

    # Log the trained model as an artifact
    mlflow.sklearn.log_model(model, "model")
    # Or for PyTorch: mlflow.pytorch.log_model(model, "model")

After the run completes, open http://localhost:5000 in your browser. You'll see a table of all your runs with their parameters and metrics. Click any run to see details, compare it with other runs, or download the model artifact. No more "I think experiment 7 was the good one" conversations.

A note on the password in the YAML: for local development this is fine. For staging and production, use Docker secrets or inject the credentials from your CI environment. Don't commit real database passwords to your repo.

How to Version Training Data with DVC

Models are reproducible only if you can also reproduce the data they were trained on. This is a problem Git can't solve on its own, because training datasets are often gigabytes or terabytes in size and Git isn't designed for large binary files.

DVC (Data Version Control) fills this gap. It works like Git, but for data. Here's the concept: instead of storing your 10GB training dataset in Git, DVC stores a small text file (a .dvc file) that acts as a pointer to the actual data. The real data lives in cloud storage (S3, Google Cloud Storage, Azure Blob). When you check out a specific Git commit, DVC knows which version of the data goes with that commit and can pull it from remote storage.

The workflow on your local machine looks like this:

# Initialize DVC in your project (one time)
dvc init

# Add your training data to DVC tracking
dvc add data/training_data.parquet
# This creates data/training_data.parquet.dvc (small pointer file)
# and adds training_data.parquet to .gitignore

# Push the actual data to remote storage
dvc push

# Commit the pointer file to Git
git add data/training_data.parquet.dvc .gitignore
git commit -m "Add training data v1"

Now your Git repo contains the pointer file, and the real data lives in S3. When someone else (or a container) needs the data, they run dvc pull and DVC downloads it from remote storage.
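
If the pointer idea feels abstract, this sketch shows the core of it in plain Python: hash the file, record the hash, size, and path, and commit only that small record. It's a conceptual illustration, not DVC's actual implementation. The make_pointer helper is hypothetical, and a real .dvc file is a YAML document that DVC writes for you.

```python
import hashlib
import os
import tempfile


def make_pointer(path, chunk_size=1024 * 1024):
    """Build a small record that identifies a file by its content hash."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large datasets never have to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


# Demo with a tiny stand-in file instead of a real 10GB dataset
with tempfile.TemporaryDirectory() as folder:
    data_path = os.path.join(folder, "training_data.parquet")
    with open(data_path, "wb") as handle:
        handle.write(b"hello")
    pointer = make_pointer(data_path)

print(pointer)
```

Because the hash changes whenever the data changes, the pointer committed alongside your code pins exactly one version of the dataset.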

The training Dockerfile includes DVC, and the entrypoint pulls the correct data version before training begins:

# syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl && \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

COPY src/ /app/src/
COPY configs/ /app/configs/

# DVC tracking files (these are small text files in Git)
COPY data/*.dvc /app/data/
COPY .dvc/ /app/.dvc/

WORKDIR /app
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]

The entrypoint script pulls the data and then starts training:

#!/bin/bash
set -e

echo "Pulling training data from remote storage..."
dvc pull data/

echo "Starting training run..."
python -m src.train "$@"

For DVC to pull from S3, the container needs AWS credentials. You can pass them as environment variables in your Compose file or mount them from the host:

training:
  build: { context: ., dockerfile: Dockerfile.train }
  environment:
    - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
    - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    - AWS_DEFAULT_REGION=us-east-1

Combined with MLflow's experiment logging, you get a complete provenance chain: this model was trained on this version of the data (tracked by DVC), with these parameters (logged in MLflow), producing these metrics.

You can reproduce any past experiment by checking out the Git commit and running the training container.

How to Build the Serving Container

"Serving" means wrapping your trained model in an API so other systems can send it data and get predictions back. For example, a fraud detection model might expose a /predict endpoint that accepts transaction data and returns a fraud probability.

The serving container has different priorities than the training container. Training optimizes for flexibility and raw compute. Serving optimizes for speed, small size, and reliability:

FROM python:3.11-slim AS serving

WORKDIR /app

# Install curl for healthcheck
RUN apt-get update && apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*

COPY requirements-serve.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-serve.txt

COPY src/serving/ /app/src/serving/

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.serving.app:app", "--host", "0.0.0.0"]

A few things to understand if you're new to this:

uvicorn is a lightweight Python web server that runs FastAPI applications. FastAPI is a framework for building APIs in Python. Together, they let you turn your model into a web service that responds to HTTP requests.

HEALTHCHECK tells Docker to periodically check if your container is actually working, not just running. Every 30 seconds, Docker runs the curl command against the /health endpoint. If it fails three times in a row, Docker marks the container as unhealthy. This matters because your model server might be running but not ready (maybe the model file is still downloading), and you don't want to send traffic to a server that can't respond.

start-period of 60 seconds is important for ML serving containers. Model loading can take time, especially for large models (loading a 2GB model from a registry takes a while). Without start-period, the health check would start failing immediately, count those failures toward the retry limit, and the orchestrator might kill the container before the model finishes loading. The start period gives the container grace time to initialize.

Notice we're using python:3.11-slim here, not the NVIDIA CUDA image. Most trained models can run inference on CPU. If you need GPU inference (for example, running a large language model or doing real-time video processing), use the CUDA base image instead, but be aware that it makes the serving container much larger.

If you want to skip the curl dependency, use Python's built-in urllib for the health check:

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
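
To see why the start period matters, here is a simplified Python simulation of the HEALTHCHECK timing described above. It's a sketch under stated assumptions, not Docker's code: the health_history function is hypothetical, the first check is assumed to run one interval after the container starts, and a check that lands exactly on the start-period boundary is treated as still inside it.

```python
def health_history(results, interval=30, start_period=60, retries=3):
    """Return the container status after each health check result."""
    started = False      # becomes True after the first successful check
    failures = 0         # consecutive failures that count toward retries
    status = "starting"
    history = []
    for index, ok in enumerate(results):
        elapsed = (index + 1) * interval  # seconds since container start
        if ok:
            started = True
            failures = 0
            status = "healthy"
        elif started or elapsed > start_period:
            failures += 1
            if failures >= retries:
                status = "unhealthy"
        # Otherwise the failure falls inside the grace period and is ignored
        history.append(status)
    return history


# A model that takes roughly 100 seconds to load: three failed checks, then success
slow_load = [False, False, False, True]

print(health_history(slow_load, start_period=60))
# ['starting', 'starting', 'starting', 'healthy']
print(health_history(slow_load, start_period=0))
# ['starting', 'starting', 'unhealthy', 'healthy']
```

Without the grace period, the container is flagged unhealthy at the third check, which is the point where an orchestrator might restart it even though the model was seconds away from being ready.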

Decouple Models from Containers

This is one of the most important patterns in this article, and the one beginners most often get wrong.

The temptation is to copy your trained model file (the .pkl, .pt, or .onnx file that contains the learned weights) directly into the Docker image during the build. Don't do this. When you embed model files in your Docker image, every model update requires a new image build and push. For a 2GB model, that means rebuilding the container, uploading 2GB to a registry, and redeploying, even though only the model changed and the code is identical.

Instead, have your serving container download the model from a model registry (like MLflow) or cloud storage (like S3) at startup. The container image stays small and generic. Model updates become a configuration change (pointing to a new model version) rather than a deployment.

Here's a full serving app using FastAPI with the modern lifespan pattern. If you've used Flask, FastAPI is similar but faster and with built-in request validation:

import os
from contextlib import asynccontextmanager

import mlflow
from fastapi import FastAPI

# MODEL_URI points to a specific model version in MLflow's registry
# Format: "models:/<model-name>/<stage>" where stage is Staging or Production
MODEL_URI = os.environ.get("MODEL_URI", "models:/fraud-detector/production")
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # This runs once when the server starts up
    global model
    print(f"Loading model from {MODEL_URI}...")
    model = mlflow.pyfunc.load_model(MODEL_URI)
    print("Model loaded successfully.")
    yield
    # This runs when the server shuts down
    print("Shutting down model server.")


app = FastAPI(lifespan=lifespan)


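@app.get("/health")
async def health():
    """Used by Docker HEALTHCHECK to verify the server is ready."""
    if model is None:
        # FastAPI doesn't interpret a Flask-style (body, status) tuple,
        # so set the 503 status explicitly on the response
        from fastapi.responses import JSONResponse

        return JSONResponse(status_code=503, content={"status": "loading"})
    return {"status": "healthy"}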


@app.post("/predict")
async def predict(features: dict):
    """Accept features as JSON, return model prediction."""
    import pandas as pd

    # Convert the input dict into a DataFrame (what most sklearn/mlflow models expect)
    df = pd.DataFrame([features])
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}

When a client sends a POST request to /predict with JSON like {"amount": 500, "merchant_category": "electronics", "hour": 23}, the model returns a prediction. Because the lifespan hook loads the model before the server starts accepting requests, the Docker HEALTHCHECK fails (connection refused) while the model is loading and passes once /health returns 200. The 503 branch covers any state where the server is up but the model isn't set.

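Promoting a new model version means updating the MODEL_URI environment variable and restarting the container. The MLflow model registry supports stage transitions (Staging, Production, Archived), so you can promote a model in the MLflow UI and then point your serving container at the new version. Note that recent MLflow 2.x releases deprecate stages in favor of model version aliases, which use URIs of the form models:/<model-name>@<alias>, so check which one your MLflow version expects.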

For zero-downtime model updates, implement a reload endpoint that swaps models without restarting:

@app.post("/admin/reload")
async def reload_model():
    global model
    model = mlflow.pyfunc.load_model(MODEL_URI)
    return {"status": "reloaded"}
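
One detail worth getting right in a reload endpoint is what happens when the new model fails to load. The sketch below shows one possible pattern, using only the standard library: load the new model first, and swap the reference only if loading succeeded. ModelHolder and fake_loader are hypothetical names for illustration. In the serving app you would pass mlflow.pyfunc.load_model as the loader.

```python
import threading


class ModelHolder:
    """Keeps the current model and swaps it only after a successful load."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._model = None

    def reload(self, uri):
        new_model = self._loader(uri)  # may raise; the old model stays in place
        with self._lock:
            self._model = new_model
        return new_model

    def get(self):
        with self._lock:
            return self._model


def fake_loader(uri):
    """Stand-in for a real loader so the sketch runs without MLflow."""
    if uri.endswith("/broken"):
        raise RuntimeError("could not load " + uri)
    return "model loaded from " + uri


holder = ModelHolder(fake_loader)
holder.reload("models:/fraud-detector/production")

try:
    holder.reload("models:/fraud-detector/broken")
except RuntimeError as error:
    print("reload failed:", error)

# The previous model is still being served after the failed reload
print(holder.get())
```

Whichever pattern you use, put the reload endpoint behind authentication, since anyone who can reach it can force a model load.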

How to Configure GPU Passthrough for Training

By default, Docker containers can't see the GPU hardware on the host machine. "GPU passthrough" means giving a container access to the host's GPUs so that libraries like PyTorch and TensorFlow can use them for accelerated computation.

This requires two things on the host (the machine running Docker, not inside the container):

NVIDIA GPU drivers installed and working. Verify with nvidia-smi. If that command shows your GPUs, you're good.
NVIDIA Container Toolkit installed. This is the bridge between Docker and the GPU drivers. Install it from the NVIDIA docs and verify with docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi. If you see your GPU listed, the toolkit is working.

Once the host is set up, GPU access in Docker Compose looks like this:

services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000

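The deploy.resources.reservations.devices block is saying: "this container needs all available NVIDIA GPUs." Inside the container, PyTorch and TensorFlow can then see the GPUs. TensorFlow places operations on a visible GPU by default, while PyTorch code still needs to move the model and tensors to the device (for example with .to("cuda")). You can verify that the GPUs are visible by adding print(torch.cuda.is_available()) to your training script, which should print True.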

If you're running Compose v2.30.0+, you can use the shorter gpus syntax:

services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    gpus: all
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000

For multi-GPU training with frameworks like PyTorch's DistributedDataParallel, you can assign specific GPUs using device_ids. This matters when running multiple training jobs at the same time:

services:
  training-job-1:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1

  training-job-2:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["2", "3"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1

Note that CUDA_VISIBLE_DEVICES inside the container is relative to the devices assigned by Docker, not the host GPU indices. Both containers see their GPUs as device 0 and 1, even though they're using different physical GPUs.
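
The index remapping is easy to get wrong, so here is a tiny Python illustration of it. This only models the bookkeeping described above. The physical_gpus helper is a hypothetical name, and nothing here talks to Docker or CUDA.

```python
def physical_gpus(device_ids, cuda_visible_devices):
    """Map each container-side GPU index to the host GPU it refers to.

    device_ids: host GPU ids assigned to the container by Compose
    cuda_visible_devices: the CUDA_VISIBLE_DEVICES value set inside the container
    """
    visible = [int(index) for index in cuda_visible_devices.split(",")]
    return {index: device_ids[index] for index in visible}


job_1 = physical_gpus(["0", "1"], "0,1")
job_2 = physical_gpus(["2", "3"], "0,1")

print(job_1)  # {0: '0', 1: '1'}
print(job_2)  # {0: '2', 1: '3'}
```

Both jobs address their GPUs as 0 and 1, but the second job's device 0 is host GPU 2, so the two jobs never compete for the same hardware.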

How to Tie It All Together with Compose Profiles

If you're new to Compose profiles: by default, docker compose up starts every service defined in your docker-compose.yml. But you don't always want everything running. Your MLflow server and serving API should run all the time, but the training container should only launch when you're actually training a model (and it needs a GPU, which you might not have on your laptop).

Profiles solve this. When you add profiles: ["train"] to a service, that service is excluded from docker compose up by default. It only starts when you explicitly activate the profile, for example with docker compose --profile train up, or when you target the service directly with docker compose run training. This means one file defines your entire ML infrastructure, but you control what runs and when.
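
The selection rule is simple enough to model in a few lines of Python. This is a sketch of the behavior described above, not Compose's implementation. The services_to_start helper is hypothetical, and the service names mirror the ones used in this article.

```python
def services_to_start(services, active_profiles=()):
    """Return the services that start for a given set of active profiles.

    A service with no profiles always starts. A service with profiles
    starts only if at least one of them is active.
    """
    active = set(active_profiles)
    return [
        name
        for name, profiles in services.items()
        if not profiles or active & set(profiles)
    ]


# Service name -> profiles assigned to it (empty list means always-on)
services = {
    "db": [],
    "mlflow": [],
    "serving": [],
    "training": ["train"],
}

print(services_to_start(services))
# ['db', 'mlflow', 'serving']
print(services_to_start(services, active_profiles=["train"]))
# ['db', 'mlflow', 'serving', 'training']
```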

Here's the complete docker-compose.yml that ties every piece from this article together:

services:
  # --- Always-on infrastructure ---
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  serving:
    build: { context: ., dockerfile: Dockerfile.serve }
    ports:
      - "8000:8000"
    environment:
      - MODEL_URI=models:/fraud-detector/production
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      mlflow: { condition: service_started }

  # --- Training (on-demand) ---
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    profiles: ["train"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
      - ./configs:/app/configs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      mlflow: { condition: service_started }

volumes:
  postgres-data:
  mlflow-artifacts:

The day-to-day workflow with this file:

# Step 1: Start the infrastructure (MLflow + Postgres + serving API)
# The -d flag runs everything in the background
docker compose up -d

# Step 2: Open the MLflow UI to see past experiments
open http://localhost:5000    # macOS
# xdg-open http://localhost:5000  # Linux

# Step 3: Check that the serving API is healthy
curl http://localhost:8000/health
# Should return: {"status":"healthy"}

# Step 4: Run a training job (pulls data via DVC, logs to MLflow)
# This only starts the "training" service because of the profile flag
docker compose --profile train run training

# Step 5: Watch training progress in the MLflow UI at localhost:5000
# You'll see metrics updating in real time if your training code logs them

# Step 6: After training completes, promote the model in MLflow UI
# Click the model, go to "Register Model", set stage to "Production"

# Step 7: Restart the serving container to pick up the new model version
docker compose restart serving

# Step 8: Test the new model
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 500, "merchant_category": "electronics", "hour": 23}'

This single-file approach means a new team member can clone the repo, run docker compose up -d, and have the complete ML infrastructure running locally within minutes. The same containers deploy to staging and production with only environment variable changes (database credentials, model URIs, GPU allocation).

Reproducibility: The Whole Point

Everything in this article serves one goal: reproducibility. The ability to take any commit hash, build the same containers, pull the same data, and produce the same model.

Here are the practices that make this work:

Pin Everything

Pin your base images to specific digests, not just tags. Pin your Python packages to exact versions with pip freeze > requirements.txt. Use fixed random seeds in your training code and log them in MLflow.
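
For the random seed part, a small helper keeps the seeding in one place. This is a minimal sketch: set_seed is a hypothetical name, and the NumPy and PyTorch calls are wrapped in try/except so the example also runs in an environment where those libraries aren't installed. Seeding alone doesn't guarantee bit-identical results on GPU, so treat it as a necessary step rather than a complete one.

```python
import random


def set_seed(seed):
    """Seed every random number generator the training code might use."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed in this environment
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed in this environment
    return seed


seed = set_seed(42)
first = random.random()

set_seed(42)
second = random.random()

# Same seed, same sequence
print(first == second)  # True
```

Inside your training run, log the value with mlflow.log_param("seed", seed) so the run can be repeated later.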

Log Everything

Every training run should log the exact library versions (pip freeze), the Git commit hash, the DVC data version, all hyperparameters, and all evaluation metrics to MLflow. You can automate this:

import subprocess
import mlflow

with mlflow.start_run():
    # Log environment info automatically
    pip_freeze = subprocess.check_output(["pip", "freeze"]).decode()
    mlflow.log_text(pip_freeze, "pip_freeze.txt")

    git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.log_param("git_commit", git_hash)

    # ... rest of training ...
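
One caveat: git rev-parse only works where the .git directory is present, and the training image above copies src/ and configs/ but not .git. A possible workaround is to pass the commit hash into the container as an environment variable and fall back to it. The sketch below assumes a variable named GIT_COMMIT, which is a made-up name for this example, and current_git_commit is a hypothetical helper.

```python
import os
import subprocess


def current_git_commit(env_var="GIT_COMMIT"):
    """Return the commit hash from git, or from an environment variable
    when git or the repository metadata isn't available."""
    try:
        output = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        )
        return output.decode().strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # No repository here (or no git binary): use the injected value
        return os.environ.get(env_var, "unknown")


commit = current_git_commit()
print(commit)
```

You could then set the variable when launching the job, for example by exporting it from git rev-parse HEAD on the host before calling docker compose run, and log the result with mlflow.log_param("git_commit", commit).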

Version Everything

Git for code, DVC for data, MLflow for experiments, Docker digests for environments. The combination creates a complete provenance chain. When a stakeholder asks why a model made a particular prediction, you can trace it back to the exact code, data, and hyperparameters that produced it. For regulated industries like finance and healthcare, that traceability is a compliance requirement, not a nice-to-have.

Where This Breaks Down

This approach works well for small-to-medium teams running on single hosts or small clusters. Here's where you'll hit walls:

Large datasets. Don't mount multi-terabyte datasets into containers. Use object storage (S3, GCS) and stream data during training. DVC handles the versioning, but the data itself should live outside Docker entirely.

GPU driver mismatches. Your container's CUDA version must be compatible with the host driver. Test on identical hardware and driver versions to what you'll run in production. Document the minimum driver version in your README.

Multi-node training. When you need to distribute training across multiple machines, you'll outgrow Compose. Kubernetes with Kubeflow or KServe is the standard path for distributed training and auto-scaled serving.

Serving at scale. A single container running uvicorn handles moderate traffic. For high-throughput inference, you'll need a load balancer, multiple replicas, and potentially a dedicated serving framework like NVIDIA Triton Inference Server or TensorFlow Serving. Compose can run multiple replicas with docker compose up --scale serving=3, but it doesn't give you the routing, health-based load balancing, or rolling updates that a real orchestrator provides.

Secrets in production. The Compose file above uses plaintext passwords for local development. In production, use Docker secrets, HashiCorp Vault, or your cloud provider's secret manager. Never commit credentials to your repo.

Conclusion

Containerizing your MLOps pipeline turns fragile, environment-dependent models into reproducible, deployable artifacts. Multi-stage builds keep images lean. MLflow gives you experiment tracking and model lineage. DVC links code to data. GPU passthrough preserves training performance. A single Compose file with profiles ties the whole workflow together.

That fraud detection model I mentioned at the start? We eventually containerized the entire pipeline around it. The next model we shipped went from "notebook finished" to "running in production" in two days instead of three weeks. Most of that time was spent on evaluation and review, not fighting environments.

Containerization doesn't make your models better. It gets the infrastructure out of the way so you can focus on the work that does.

Even with the caveats above, containerized MLOps eliminates the most common source of ML project delays: environment mismatch between development and production. The three weeks we spent debugging that fraud detection model deployment? That doesn't happen anymore.

If you found this useful, you can find me writing about MLOps, containerized workflows, and production AI systems on my blog.

Product Experimentation with Synthetic Control: Causal Inference for Global LLM Rollouts in Python

Table of Contents

Why Global Rollouts Break Naïve Measurement

What Synthetic Control Actually Does

Prerequisites

Setting Up the Working Example

Step 1: Fit Donor Weights with SLSQP

Step 2: Plot Treated vs Synthetic Control Trajectories

Step 3: In-Space Placebo Permutation Test

Step 4: Leave-One-Out Donor Sensitivity

Step 5: Cluster Bootstrap 95% Confidence Intervals

When Synthetic Control Fails

1. Donor Pool Contamination (Violates No Interference)

2. Fundamentally Different Units (Violates Pre-period Fit)

3. Post-Treatment Shocks to Donors (Violate Stable Donor Composition)

4. Overfitting Risk When J Approaches T₀ (Degrades Pre-period Fit in Practice)

What to Do Next

Product Experimentation with Regression Discontinuity: How an LLM Confidence Threshold Creates a Natural Experiment in Python

Table of Contents

Why Threshold Routing is a Natural Experiment

What Regression Discontinuity Actually Does

Prerequisites

Setting Up the Working Example

Step 1: A Sharp RDD with Local Linear Regression

Step 2: Try Different Bandwidths

Step 3: Checking for Manipulation at the Threshold

Step 4: Quadratic Specification as a Robustness Check

Step 5: Bootstrap Confidence Intervals

When Regression Discontinuity Fails

What to Do Next

AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)

Paper Overview

Table of Contents

Prerequisites

Executive Summary

Goals of the Paper

Methodology

Pre-Training

Fine-Tuning (Adapting to Tasks)

Transformer vs. BERT vs. GPT

Transformer vs BERT vs GPT: Key Differences

Model Architecture

Key Techniques

Key Findings

Conclusions

Limitations

Related Work & Context

Final Insight

Resources:

Contact Me

Data Science Insights: Why the Mean Lies When Handling Messy Retail Data

Table Of Contents

Prerequisites

The Dataset

Mean: The Sensitive Giant

Median: The Robust Middle

Beyond Averages: Understanding Spread with Quartiles

The IQR: Detecting Outliers

A Simple Example to Understand IQR

Step 1: Find the Median (Q2):

Step 2: Find Q1 (Lower Quartile):

Step 3: Find Q3 (Upper Quartile):

Step 4: Calculate IQR:

Step 5: Find Outlier Bounds:

Applying IQR to Our Dataset

Revisiting the Mean After Removing Outliers

Final Comparison and Insights

Conclusion

Connect with me

Product Experimentation with Propensity Scores: Causal Inference for LLM-Based Features in Python

Table of Contents

Why Opt-in Features Break Naïve Comparisons

1. Selection on engagement

2. Selection on intent

3. Selection on risk tolerance

What Propensity Scores Actually Do

Prerequisites

Setting Up the Working Example

Step 1: Estimate the Propensity Score