<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Machine Learning - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Machine Learning - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 21 May 2026 10:20:57 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/machine-learning/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Product Experimentation with Synthetic Control: Causal Inference for Global LLM Rollouts in Python ]]>
                </title>
                <description>
                    <![CDATA[ Every product experimentation team doing causal inference on LLM-based features eventually hits the same wall: when the provider ships a new model version, there's no holdout. Your infrastructure team ]]>
                </description>
                <link>https://www.freecodecamp.org/news/product-experimentation-with-synthetic-control-causal-inference-for-global-llm-rollouts-in-python/</link>
                <guid isPermaLink="false">6a02b2a8937b84f7790d481e</guid>
                
                    <category>
                        <![CDATA[ product experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ causal inference ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ synthetic-control ]]>
                    </category>
                
                    <category>
                        <![CDATA[ generative ai ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rudrendu Paul ]]>
                </dc:creator>
                <pubDate>Tue, 12 May 2026 04:55:04 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/06d252e7-e613-46c7-b5ce-c5daa14cec21.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every product experimentation team doing causal inference on LLM-based features eventually hits the same wall: when the provider ships a new model version, there's no holdout.</p>
<p>Your infrastructure team upgrades every workspace from Claude 4.5 to Claude 4.6 overnight. All 50 production workspaces get the new model at the same time. A week later, task completion climbs across the board. The head of product calls it a win.</p>
<p>But you know something's off. No holdout group ran 4.5 through the upgrade week. The naïve before/after picks up whatever else changed that week alongside the model: a new onboarding flow, a seasonal uptick, a high-profile customer onboarding.</p>
<p>This is the Global Rollout Problem. It appears whenever a team ships a model upgrade to the entire user base simultaneously. For product teams running generative AI features, it's one of the most common measurement traps in the stack. Staged rollouts buy you a control group; global rollouts eliminate it.</p>
<p>In 2026, global model upgrades are the norm: every API provider pushes new versions, and every team using Claude, GPT, or Gemini has experienced the sudden jump from one version to the next with no opt-out.</p>
<p>Synthetic control is the tool that data scientists use when the control group is missing. You build a weighted combination of untreated units (other workspaces or regions that weren't upgraded at the same time) whose pre-upgrade behavior matches that of the treated unit. Compare the treated unit to its synthetic twin after the upgrade, and the gap is the causal estimate, conditional on three identification assumptions that we'll name explicitly.</p>
<p>In this tutorial, you'll build a synthetic control from scratch in Python using <code>scipy.optimize</code>, apply it to a 50,000-user synthetic SaaS dataset, and validate with a placebo permutation test, leave-one-out donor sensitivity, and a cluster bootstrap 95% confidence interval.</p>
<p><strong>Companion code:</strong> every code block runs end-to-end in the companion notebook at <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/04_synthetic_control">github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/04_synthetic_control</a>. The notebook (<code>synthetic_control_demo.ipynb</code>) has all outputs pre-executed, so you can read along on GitHub before running anything locally.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-why-global-rollouts-break-naive-measurement">Why Global Rollouts Break Naïve Measurement</a></p>
</li>
<li><p><a href="#heading-what-synthetic-control-actually-does">What Synthetic Control Actually Does</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-setting-up-the-working-example">Setting Up the Working Example</a></p>
</li>
<li><p><a href="#heading-step-1-fit-donor-weights-with-slsqp">Step 1: Fit Donor Weights with SLSQP</a></p>
</li>
<li><p><a href="#heading-step-2-plot-treated-vs-synthetic-control-trajectories">Step 2: Plot Treated vs Synthetic Control Trajectories</a></p>
</li>
<li><p><a href="#heading-step-3-in-space-placebo-permutation-test">Step 3: In-Space Placebo Permutation Test</a></p>
</li>
<li><p><a href="#heading-step-4-leave-one-out-donor-sensitivity">Step 4: Leave-One-Out Donor Sensitivity</a></p>
</li>
<li><p><a href="#heading-step-5-cluster-bootstrap-95-confidence-intervals">Step 5: Cluster Bootstrap 95% Confidence Intervals</a></p>
</li>
<li><p><a href="#heading-when-synthetic-control-fails">When Synthetic Control Fails</a></p>
</li>
<li><p><a href="#heading-what-to-do-next">What to Do Next</a></p>
</li>
</ul>
<h2 id="heading-why-global-rollouts-break-naive-measurement">Why Global Rollouts Break Naïve Measurement</h2>
<p>The math of an A/B test is elegant because of one assumption: treatment assignment is independent of everything else. Flip a coin: half your workspaces get Claude 4.6, and half stay on 4.5. The coin flip breaks every possible confound. The global rollout world has no coin.</p>
<p>Three mechanisms make the naïve before/after comparison misleading.</p>
<ol>
<li><p><strong>Co-occurring product changes:</strong> Shipping a model upgrade rarely happens in isolation. The same week, the onboarding team ships a redesigned tutorial, the pricing team runs a promotion, or customer success reaches out to enterprise accounts about the new capabilities. Your before/after picks up the sum.</p>
</li>
<li><p><strong>Seasonal and market drift:</strong> Weekly usage patterns, monthly billing cycles, and quarterly procurement cycles all move outcome metrics. A 3 pp lift in week 20 can look like the model upgrade when, in fact, users simply returned from spring break.</p>
</li>
<li><p><strong>Peer-company dynamics:</strong> A competitor releases a buggy update, and your users migrate over for a week. Your task completion rate spikes because the new users had easier queries, with zero contribution from the model itself.</p>
</li>
</ol>
<p>All three produce the same symptom: a raw before/after that folds the upgrade's causal effect together with the causal effect of every other week-20 event.</p>
<p>In this tutorial's dataset, the naïve gap is about +0.049 (the post-period treated mean of 0.6421 minus the pre-period mean of 0.5927), nearly equal to the ground-truth +0.05. That coincidence is the scariest failure mode: the naïve number sometimes lands correctly by accident, and without a counterfactual, you can't tell luck from truth.</p>
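<p>To see how the folding works, here is a minimal sketch with made-up numbers. The baseline rate and the sizes of the onboarding and seasonal lifts are assumptions chosen for illustration; they are not taken from the tutorial dataset.</p>
<pre><code class="language-python"># Toy numbers, assumed for illustration only (not from the tutorial dataset).
baseline = 0.60          # pre-period task completion rate
true_effect = 0.05       # causal lift from the model upgrade
onboarding_lift = 0.02   # co-occurring product change in the same week
seasonal_lift = 0.01     # seasonal drift in the same week

pre_mean = baseline
post_mean = baseline + true_effect + onboarding_lift + seasonal_lift

naive_gap = post_mean - pre_mean
contamination = naive_gap - true_effect

print(f"Naive before/after gap:  {naive_gap:+.4f}")
print(f"True upgrade effect:     {true_effect:+.4f}")
print(f"Folded-in contamination: {contamination:+.4f}")
</code></pre>
<p><strong>Here's what's happening:</strong> the before/after difference returns the sum of every change that landed in the same week. Nothing in the raw numbers separates the upgrade's share from the rest, which is the job the counterfactual does in the steps that follow.</p>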
<h2 id="heading-what-synthetic-control-actually-does">What Synthetic Control Actually Does</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/d06bde67-30dd-4bc4-b019-5189ac5424a7.png" alt="d06bde67-30dd-4bc4-b019-5189ac5424a7" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><em>Figure 1 (above): Schematic of the synthetic control construction. The gray curves are donor workspaces that remain on the old model. The dashed navy curve is the weighted combination of donors that best tracks the treated unit (red) during the pre-treatment window marked by the blue bracket below the x-axis.</em></p>
<p><em>After the treatment date (week 20, dotted vertical line), the weights stay frozen, and the dashed curve projects forward as the counterfactual, while the treated unit moves upward. The gap between the two curves in the post-treatment window is the causal-effect estimate.</em></p>
<p><em>The key design choice the figure illustrates is that weights are fit once, using only pre-treatment data, and never refit using post-treatment data.</em></p>
<p>Synthetic control finds a weighted combination of untreated units whose outcome trajectory closely matches the treated unit's in the pre-treatment period. Once the weights are fixed, you project the synthetic unit's trajectory forward into the post-treatment period and read off the gap between the two lines.</p>
<p>In your AI product context: if wave-2 workspaces didn't get the model upgrade at the same time as wave-1 workspaces, each wave-2 workspace is a candidate donor. The optimizer finds the combination of wave-2 workspaces whose weighted pre-upgrade trajectory best matches wave 1's. After week 20 (when wave 1 was upgraded), the gap between wave 1 and its synthetic twin is the causal-effect estimate, provided that the following three identification assumptions hold.</p>
<p>These identification assumptions work together.</p>
<ul>
<li><p>First, <strong>pre-period fit</strong> (the convex-hull condition): the treated unit's pre-treatment trajectory must lie inside the convex hull of the donor trajectories. The non-negativity and sum-to-1 constraints confine the synthetic unit to that hull, so a treated unit outside it can't be matched by any admissible weights.</p>
</li>
<li><p>Second, <strong>no interference for donors</strong> (SUTVA for the donor pool): the treatment on the treated unit must not affect the donors. Shared API rate-limit pools or users migrating between workspaces both break this.</p>
</li>
<li><p>Third, <strong>stable donor composition</strong>: the donors must not experience structural breaks unrelated to the treatment during the post-period.</p>
</li>
</ul>
<p>Violate any one of the three, and the gap is biased even when the pre-period fit looks perfect. The failure modes section walks through each.</p>
<p>One geometric note: with T₀ pre-treatment periods and J donors, pre-period overfitting becomes serious once J approaches or exceeds T₀, because the optimizer then has about as many free weights as pre-period observations to fit. This tutorial runs with T₀ = 20 and J = 25, which sits in the danger zone. The LOO sensitivity step later is the right diagnostic for whether the fit reflects genuine comparability or overfitting.</p>
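<p>The convex-hull condition is easier to see on a toy case. The sketch below assumes just two hypothetical donors and uses a plain grid search over the single free weight instead of SLSQP, so the helper <code>best_convex_fit</code> and all the numbers are illustrative only.</p>
<pre><code class="language-python">import numpy as np

# Two hypothetical donors over four pre-treatment weeks (toy values).
donors = np.array([
    [0.50, 0.60],
    [0.52, 0.62],
    [0.51, 0.63],
    [0.53, 0.61],
])

def best_convex_fit(treated, donors, steps=1001):
    """Grid-search w in [0, 1] for the mix w*donor_0 + (1 - w)*donor_1."""
    grid = np.linspace(0.0, 1.0, steps)
    # Each row of `mixes` is one candidate synthetic series.
    mixes = np.outer(grid, donors[:, 0]) + np.outer(1.0 - grid, donors[:, 1])
    rmse = np.sqrt(np.mean((mixes - treated) ** 2, axis=1))
    best = int(np.argmin(rmse))
    return float(grid[best]), float(rmse[best])

inside = 0.3 * donors[:, 0] + 0.7 * donors[:, 1]   # inside the hull by construction
outside = donors.max(axis=1) + 0.05                # above every donor in every week

w_in, rmse_in = best_convex_fit(inside, donors)
w_out, rmse_out = best_convex_fit(outside, donors)
print(f"inside hull:  w = {w_in:.2f}, RMSE = {rmse_in:.4f}")
print(f"outside hull: w = {w_out:.2f}, RMSE = {rmse_out:.4f}")
</code></pre>
<p><strong>Here's what's happening:</strong> the series built inside the hull is recovered almost exactly at w = 0.30. The series that sits 5 pp above both donors can't be reached by any convex mix, so the best the search can do is put all the weight on the higher donor and accept a 5 pp miss in every week.</p>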
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You'll need Python 3.11 or newer, comfort with pandas and numpy, and familiarity with basic constrained optimization.</p>
<p>Install the packages for this tutorial:</p>
<pre><code class="language-shell">pip install numpy pandas scipy matplotlib
</code></pre>
<p><strong>Here's what's happening:</strong> four packages cover the full pipeline. Pandas loads the user-level log, NumPy handles panel arithmetic, SciPy provides the SLSQP solver to enforce the convex-combination constraint on the donor weights, and matplotlib renders the trajectory plot and the placebo distribution.</p>
<p>Clone the companion repo to get the synthetic dataset:</p>
<pre><code class="language-shell">git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv
</code></pre>
<p><strong>Here's what's happening:</strong> the clone pulls the companion repo, and <code>generate_data.py</code> produces the shared synthetic dataset used across the series. Seed 42 keeps the dataset reproducible, and 50,000 users give a clean signal for the estimator in this tutorial. The output CSV lands at <code>data/synthetic_llm_logs.csv</code>.</p>
<h2 id="heading-setting-up-the-working-example">Setting Up the Working Example</h2>
<p>The synthetic dataset simulates a SaaS product with 50,000 users spread across 50 workspaces. Workspaces 0 through 24 are in wave 1, which received the model upgrade at week 20. Workspaces 25 through 49 are in wave 2, which stayed on the old model through week 29.</p>
<p>The ground-truth causal effect baked into the data generator is a +5 percentage-point increase in task completion for wave-1 users in the post-treatment period. You know the truth, so you can check what the synthetic control recovers.</p>
<p>Load the data and aggregate to a workspace-by-week panel:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")

PRE = 20         # weeks 0-19 are pre-treatment
WINDOW = 30      # analysis window weeks 0-29

df_window = df[df.signup_week &lt; WINDOW].copy()

panel = (
    df_window.groupby(["workspace_id", "signup_week"])
    ["task_completed"].mean().reset_index()
)
panel.columns = ["workspace_id", "week", "task_completed"]

pivot = panel.pivot(
    index="week", columns="workspace_id", values="task_completed"
)
pivot = pivot.interpolate(method="linear", axis=0).ffill().bfill()

ws_wave = df.groupby("workspace_id").wave.first()
wave1_ws = sorted(ws_wave[ws_wave == 1].index.tolist())
wave2_ws = sorted(ws_wave[ws_wave == 2].index.tolist())

treated_series = pivot[wave1_ws].mean(axis=1).values
donor_matrix = pivot[wave2_ws].values

print(f"Treated series shape: {treated_series.shape}")
print(f"Donor matrix shape:   {donor_matrix.shape}")
print(f"Users per workspace-week: ~{len(df_window) / (50 * WINDOW):.1f}")
print(f"Pre-period treated mean  (weeks 0-19):  {treated_series[:PRE].mean():.4f}")
print(f"Post-period treated mean (weeks 20-29): {treated_series[PRE:].mean():.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Treated series shape: (30,)
Donor matrix shape:   (30, 25)
Users per workspace-week: ~19.2
Pre-period treated mean  (weeks 0-19):  0.5927
Post-period treated mean (weeks 20-29): 0.6421
</code></pre>
<p><strong>Here's what's happening:</strong> you restrict to the 30-week window, aggregate user rows to a workspace-by-week panel, and reshape so rows are weeks and columns are workspaces. Interpolation fills any missing cells (each cell averages about 19 users). The treated series is the mean across all 25 wave-1 workspaces, pooling roughly 480 users per week to smooth cell-level noise.</p>
<p>The donor matrix keeps each wave-2 workspace as a separate column: 25 time series, each covering weeks 0 through 29. The pre-period treated mean of 0.5927 and the post-period mean of 0.6421 yield a raw before/after gap of +4.94 pp (0.6421 − 0.5927), which coincidentally sits near the ground-truth +5 pp and is still contaminated by everything else that moved in weeks 20 through 29.</p>
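<p>If the fill chain in the setup code feels opaque, here is a small sketch of what <code>interpolate(...).ffill().bfill()</code> does to missing cells. The two-workspace frame and its values are hypothetical.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Hypothetical 5-week panel for two workspaces, with missing cells (NaN).
toy = pd.DataFrame(
    {
        "ws_a": [np.nan, 0.50, np.nan, 0.70, np.nan],
        "ws_b": [0.40, np.nan, np.nan, 0.55, 0.60],
    },
    index=pd.Index(range(5), name="week"),
)

# Same fill chain as the setup code: linear inside gaps, nearest value at the edges.
filled = toy.interpolate(method="linear", axis=0).ffill().bfill()
print(filled.round(3))
</code></pre>
<p><strong>Here's what's happening:</strong> interior gaps are filled on a straight line between their neighbors, while leading and trailing gaps copy the nearest observed value. Edge fills flatten the trajectory, so it's worth counting how many cells were missing before trusting a workspace as a donor.</p>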
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/9b5d9711-9632-41ec-9c38-5ad531ca676f.png" alt="9b5d9711-9632-41ec-9c38-5ad531ca676f" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><em>Figure 2: The diagnostic on the full 50,000-user synthetic dataset. Top panel: wave 1's trajectory in red and the fitted synthetic control in navy dashed, with pre-period RMSE of 3.74 pp and a post-treatment gap averaging +8.29 pp. Bottom panel: the placebo distribution built by re-fitting the synthetic control with each of the 25 donor workspaces standing in as the placebo treated unit. The observed gap lies outside the full placebo range, which drives the pseudo p-value in Step 3.</em></p>
<p><em>Where Figure 1 schematically showed the method, this figure shows that it produces a pre-period fit tight enough to make the post-period gap interpretable and a placebo distribution that discriminates the observed effect from noise.</em></p>
<h2 id="heading-step-1-fit-donor-weights-with-slsqp">Step 1: Fit Donor Weights with SLSQP</h2>
<p>The synthetic control weight vector <code>w</code> is the solution to a constrained optimization problem: minimize the pre-period mean squared error between the treated series and the weighted combination of donor series, subject to each weight being in [0, 1] and all weights summing to 1. The non-negativity and sum-to-1 constraints together define a convex combination, which is what prevents extrapolation beyond the support of the donor pool.</p>
<pre><code class="language-python">from scipy.optimize import minimize

n_donors = len(wave2_ws)
Y_pre = treated_series[:PRE]
D_pre = donor_matrix[:PRE, :]

def objective(w):
    return np.mean((Y_pre - D_pre @ w) ** 2)

w0 = np.ones(n_donors) / n_donors
bounds = [(0, 1)] * n_donors
constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1}]

result = minimize(
    objective, w0, method="SLSQP", bounds=bounds,
    constraints=constraints,
    options={"ftol": 1e-12, "maxiter": 5000},
)
w_opt = result.x

pre_mse = float(np.mean((Y_pre - D_pre @ w_opt) ** 2))
pre_rmse = float(np.sqrt(pre_mse))
nz = int((w_opt &gt; 0.001).sum())

print(f"Optimization converged: {result.success}")
print(f"Non-zero donor weights (|w| &gt; 0.001): {nz}")
print(f"Pre-period MSE:  {pre_mse:.6f}")
print(f"Pre-period RMSE: {pre_rmse:.4f}  "
      f"({pre_rmse * 100:.2f} percentage points)")

synth_full = donor_matrix @ w_opt
gap = float((treated_series[PRE:] - synth_full[PRE:]).mean())
print(f"\nObserved post-period gap: {gap:+.4f}  (ground truth = +0.0500)")

nz_pairs = sorted(
    [(ws, w_opt[i]) for i, ws in enumerate(wave2_ws) if w_opt[i] &gt; 0.001],
    key=lambda x: -x[1]
)
print("\nTop 5 donor weights:")
for ws_id, weight in nz_pairs[:5]:
    print(f"  workspace {ws_id}: w = {weight:.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Optimization converged: True
Non-zero donor weights (|w| &gt; 0.001): 12
Pre-period MSE:  0.001400
Pre-period RMSE: 0.0374  (3.74 percentage points)

Observed post-period gap: +0.0829  (ground truth = +0.0500)

Top 5 donor weights:
  workspace 35: w = 0.2016
  workspace 40: w = 0.1900
  workspace 25: w = 0.1638
  workspace 32: w = 0.0872
  workspace 36: w = 0.0784
</code></pre>
<p><strong>Here's what's happening:</strong> the <code>objective</code> function computes the mean squared error between the treated pre-period series and the dot product of the donor matrix with the weight vector.</p>
<p>SLSQP handles the [0, 1] bounds and the sum-to-1 equality constraint simultaneously. The <code>w &gt; 0.001</code> threshold classifies 12 donors as non-zero. SLSQP is a numerical solver and doesn't guarantee exact zeros for weights that sit on the lower bound, so the threshold is a display convention. Pre-period RMSE of 3.74 pp measures how closely the weighted donors tracked the treated unit before the upgrade. The observed post-period gap of +0.0829 is the headline estimate, which overshoots the ground-truth +5 pp, as Step 5 quantifies with a confidence interval.</p>
<p>The weights are fixed at the end of the pre-period and never re-estimated using post-treatment data. Any divergence after week 20 reflects movement the optimizer had no opportunity to fit.</p>
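<p>If you want exact zeros when reporting weights, one option is to clean up the solver output afterward. The helper <code>tidy_weights</code> below is a hypothetical sketch, not part of the companion code, and the 0.001 tolerance simply mirrors the display threshold used above.</p>
<pre><code class="language-python">import numpy as np

def tidy_weights(w, tol=1e-3):
    """Zero out weights below `tol`, then rescale so the rest sum to 1."""
    w = np.asarray(w, dtype=float)
    kept = np.where(w >= tol, w, 0.0)
    total = kept.sum()
    if total == 0.0:
        raise ValueError("every weight fell below the tolerance")
    return kept / total

# Hypothetical raw solver output with numerical dust on two donors.
raw = np.array([0.6, 0.3999, 1e-7, 1e-4])
clean = tidy_weights(raw)
print(clean.round(4), clean.sum())
</code></pre>
<p><strong>Here's what's happening:</strong> weights under the tolerance become exact zeros, and the survivors are rescaled so they still form a convex combination. Rescaling shifts the synthetic series slightly, so recompute the pre-period RMSE if you use the cleaned weights for anything beyond display.</p>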
<h2 id="heading-step-2-plot-treated-vs-synthetic-control-trajectories">Step 2: Plot Treated vs Synthetic Control Trajectories</h2>
<p>The primary visual diagnostic for synthetic control is the trajectory overlay: plot both series together, mark the treatment date, and confirm that the synthetic control tracks the treated unit in the pre-period and that a gap opens in the post-period.</p>
<p>A tight pre-period fit is the visible signal that the pre-period fit condition holds. A ragged fit suggests the treated unit sits outside the convex hull of the donors, and the whole exercise is suspect.</p>
<pre><code class="language-python">import matplotlib.pyplot as plt

weeks = np.arange(WINDOW)

fig, ax = plt.subplots(figsize=(9, 4.5))
ax.plot(weeks, treated_series, marker="o", linewidth=1.8,
        color="#C44E52", label="Wave 1 (treated)")
ax.plot(weeks, synth_full, marker="s", linestyle="--",
        linewidth=1.8, color="#4C72B0", label="Synthetic control")
ax.axvline(PRE, color="#555555", linestyle=":", linewidth=1.4,
           label="Model upgrade (week 20)")
ax.set_xlabel("Signup week")
ax.set_ylabel("Mean task completion rate")
ax.set_title("Treated unit vs synthetic control")
ax.legend(frameon=False)
plt.tight_layout()
plt.show()

post_gap = treated_series[PRE:] - synth_full[PRE:]
print("Post-period weekly gaps (treated minus synthetic):")
for wk, g in zip(range(PRE, WINDOW), post_gap):
    print(f"  week {wk}: {g:+.4f}")
print(f"\nMean gap: {post_gap.mean():+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Post-period weekly gaps (treated minus synthetic):
  week 20: +0.0398
  week 21: +0.1663
  week 22: +0.1019
  week 23: +0.1535
  week 24: +0.1071
  week 25: +0.1047
  week 26: +0.0424
  week 27: +0.0326
  week 28: +0.0327
  week 29: +0.0479

Mean gap: +0.0829
</code></pre>
<p><strong>Here's what's happening:</strong> the two lines track each other in the pre-period, confirming the fit assumption. After week 20, the treated series moves above the synthetic control, and the weekly gaps are all positive with a mean of +8.29 pp.</p>
<p>The spread across weeks (from +3.26 pp to +16.63 pp) is how much week-to-week noise the estimator absorbs. A single bad week could swing the mean by a percentage point, which is why the placebo and LOO steps that follow matter more than any single point estimate.</p>
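<p>You can check how much a single week moves the mean directly from the printed gaps. This sketch copies the ten weekly values from the expected output above and recomputes the mean with each week left out in turn.</p>
<pre><code class="language-python"># Weekly gaps copied from the expected output above (weeks 20 to 29).
weekly_gaps = [0.0398, 0.1663, 0.1019, 0.1535, 0.1071,
               0.1047, 0.0424, 0.0326, 0.0327, 0.0479]

mean_gap = sum(weekly_gaps) / len(weekly_gaps)

# Recompute the mean with each week removed in turn.
loo_means = [
    (sum(weekly_gaps) - g) / (len(weekly_gaps) - 1)
    for g in weekly_gaps
]

print(f"Mean gap, all weeks:        {mean_gap:+.4f}")
print(f"Lowest mean, one week out:  {min(loo_means):+.4f}")
print(f"Highest mean, one week out: {max(loo_means):+.4f}")
</code></pre>
<p><strong>Here's what's happening:</strong> dropping week 21 (the largest gap) pulls the mean down to about +7.4 pp, and dropping week 27 (the smallest) pushes it up to about +8.8 pp. One week can move the headline by close to a percentage point.</p>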
<h2 id="heading-step-3-in-space-placebo-permutation-test">Step 3: In-Space Placebo Permutation Test</h2>
<p>You can't run a standard t-test on a single treated unit. The synthetic control has one treated observation (wave 1) and 25 donor observations, so no conventional p-value applies.</p>
<p>The standard validation is the in-space placebo permutation test. Treat each donor in turn as if it were the "treated" unit, re-fit the synthetic control using the remaining 24 donors as its placebo pool, record the placebo post-period gap, and compare the observed gap to the distribution of placebos.</p>
<pre><code class="language-python">placebo_gaps = []

for j in range(n_donors):
    placebo_treated = donor_matrix[:, j]
    placebo_pool = np.delete(donor_matrix, j, axis=1)
    n_p = placebo_pool.shape[1]

    def obj_p(w):
        return np.mean((placebo_treated[:PRE] - placebo_pool[:PRE] @ w) ** 2)

    res_p = minimize(
        obj_p, np.ones(n_p) / n_p, method="SLSQP",
        bounds=[(0, 1)] * n_p,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
        options={"ftol": 1e-12, "maxiter": 5000},
    )
    synth_p = placebo_pool @ res_p.x
    placebo_gaps.append((placebo_treated[PRE:] - synth_p[PRE:]).mean())

placebo_gaps = np.array(placebo_gaps)
observed_gap = gap

rank = int((np.abs(placebo_gaps) &gt;= abs(observed_gap)).sum())
pseudo_p = (rank + 1) / (len(placebo_gaps) + 1)

print(f"Observed gap:      {observed_gap:+.4f}")
print(f"Placebo mean gap:  {placebo_gaps.mean():+.4f}")
print(f"Placebo std gap:   {placebo_gaps.std():.4f}")
print(f"Placebo gap range: [{placebo_gaps.min():+.4f}, "
      f"{placebo_gaps.max():+.4f}]")
print(f"|placebo| &gt;= |observed|: {rank} of {len(placebo_gaps)}")
print(f"Pseudo p-value: {pseudo_p:.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Observed gap:      +0.0829
Placebo mean gap:  -0.0008
Placebo std gap:   0.0380
Placebo gap range: [-0.0748, +0.0707]
|placebo| &gt;= |observed|: 0 of 25
Pseudo p-value: 0.0385
</code></pre>
<p><strong>Here's what's happening:</strong> the loop iterates over all 25 wave-2 workspaces. For each one, you remove it from the donor pool, treat it as a placebo-treated unit, and re-run the SLSQP optimization. After 25 placebo runs, you count how many placebo gaps meet or exceed the observed gap in absolute value and apply the conservative (count + 1) / (N + 1) correction.</p>
<p>None of the 25 placebos produced a gap as extreme as the observed +0.0829, yielding a pseudo p-value of 0.0385. That rejects the null of no effect at the 5% level. The placebo distribution centers near zero (mean -0.0008, std 3.80 pp), which is the noise floor to compare the observed gap against.</p>
<p>The correct statistical statement is: the observed gap is more extreme than every one of the 25 placebo gaps drawn from untreated donors, which gives a pseudo p-value below 5%. The test's resolution depends on the donor pool size: with 25 donors, the smallest possible pseudo p-value is 1/26 ≈ 0.0385, so you can't get a smaller p-value with this donor count. A wider placebo distribution or a smaller observed gap would rank the observation inside the placebo bulk and push the pseudo p-value above any useful threshold.</p>
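<p>The p-value floor is easy to verify in isolation. The sketch below wraps the same (count + 1) / (N + 1) formula in a hypothetical helper, <code>pseudo_p_value</code>, and feeds it made-up placebo gaps.</p>
<pre><code class="language-python">def pseudo_p_value(observed_gap, placebo_gaps):
    """Two-sided permutation p-value with the (count + 1) / (N + 1) correction."""
    count = sum(1 for g in placebo_gaps if abs(g) >= abs(observed_gap))
    return (count + 1) / (len(placebo_gaps) + 1)

# Hypothetical placebo gaps, all within 3 pp of zero.
placebos = [0.01 * ((i % 7) - 3) for i in range(25)]

p_extreme = pseudo_p_value(0.0829, placebos)   # no placebo is this extreme
p_modest = pseudo_p_value(0.0200, placebos)    # several placebos match or exceed it

print(f"Observed 0.0829: pseudo p = {p_extreme:.4f}")
print(f"Observed 0.0200: pseudo p = {p_modest:.4f}")
</code></pre>
<p><strong>Here's what's happening:</strong> when no placebo reaches the observed gap, the count is zero and the result is exactly 1/26. As soon as the observed gap falls inside the placebo spread, the count climbs and the pseudo p-value moves well past 0.05.</p>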
<h2 id="heading-step-4-leave-one-out-donor-sensitivity">Step 4: Leave-One-Out Donor Sensitivity</h2>
<p>A tight point estimate can still be fragile if it hangs on a single donor. The leave-one-out (LOO) sensitivity check drops each non-zero-weight donor in turn, refits the synthetic control on the remaining donors, and records the new gap.</p>
<p>Abadie (2021) recommends this as the first-line robustness check. If removing any single donor swings the gap by a large amount, you don't have a synthetic control&nbsp;– you have a single-donor comparison dressed up with extra weight.</p>
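<pre><code class="language-python">def fit_and_gap(treated, donors, pre=PRE):
    """Fit convex donor weights on the pre-period; return the mean post-period gap."""
    n = donors.shape[1]

    def obj(w):
        # Pre-period mean squared error between treated and weighted donors
        return np.mean((treated[:pre] - donors[:pre] @ w) ** 2)

    res = minimize(
        obj, np.ones(n) / n, method="SLSQP",
        bounds=[(0, 1)] * n,  # each weight in [0, 1]
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],  # sum to 1
        options={"ftol": 1e-12, "maxiter": 5000},
    )
    synth = donors @ res.x  # weights stay frozen for the post-period
    return float((treated[pre:] - synth[pre:]).mean())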


nz_idx = np.where(w_opt &gt; 0.001)[0]
loo_rows = []
for j in nz_idx:
    kept = np.delete(donor_matrix, j, axis=1)
    gap_new = fit_and_gap(treated_series, kept)
    loo_rows.append({
        "dropped_workspace": int(wave2_ws[j]),
        "dropped_weight": float(w_opt[j]),
        "new_gap": gap_new,
    })
loo_df = pd.DataFrame(loo_rows).sort_values("dropped_weight", ascending=False)
print(loo_df.round(4).to_string(index=False))
print(f"\nLOO gap range: [{loo_df.new_gap.min():+.4f}, "
      f"{loo_df.new_gap.max():+.4f}]")
print(f"Original gap:  {gap:+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text"> dropped_workspace  dropped_weight  new_gap
                35          0.2016   0.0945
                40          0.1900   0.0756
                25          0.1638   0.0932
                32          0.0872   0.0868
                36          0.0784   0.0739
                31          0.0718   0.0858
                29          0.0648   0.0782
                26          0.0439   0.0786
                27          0.0364   0.0867
                46          0.0350   0.0794
                39          0.0192   0.0848
                42          0.0078   0.0839

LOO gap range: [+0.0739, +0.0945]
Original gap:  +0.0829
</code></pre>
<p><strong>Here's what's happening:</strong> the loop drops one non-zero-weight donor at a time and refits. All 12 LOO estimates stay positive, with the range [+7.39 pp, +9.45 pp] straddling the original +8.29 pp by about a percentage point in either direction.</p>
<p>No single donor drives the result. Even dropping workspace 35 (the largest weight at 0.2016) only shifts the gap to +9.45 pp because the optimizer redistributes weight across remaining donors.</p>
<p>That redistribution is the point of convex-combination weighting: many near-equivalent donor mixtures produce similar counterfactuals.</p>
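<p>To turn the LOO table into a single pass/fail signal, you can measure the largest relative shift. The sketch below copies the twelve gaps from the expected output above; the 25% tolerance is an assumed threshold for illustration, not a standard from the literature.</p>
<pre><code class="language-python"># LOO gaps copied from the expected output above, in table order.
loo_gaps = [0.0945, 0.0756, 0.0932, 0.0868, 0.0739, 0.0858,
            0.0782, 0.0786, 0.0867, 0.0794, 0.0848, 0.0839]
original_gap = 0.0829
TOLERANCE = 0.25   # assumed: flag any shift above 25% of the original gap

# Relative shift of each refit gap from the original estimate.
shifts = [abs(g - original_gap) / original_gap for g in loo_gaps]
worst = max(shifts)

all_positive = min(loo_gaps) > 0
stable = TOLERANCE >= worst

print(f"Largest relative shift: {worst:.1%}")
print(f"All LOO gaps positive:  {all_positive}")
print(f"Within assumed tolerance: {stable}")
</code></pre>
<p><strong>Here's what's happening:</strong> the worst case comes from dropping workspace 35, which moves the gap by roughly 14% of the original estimate. Every refit keeps the same sign, so the conclusion doesn't depend on any one donor.</p>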
<h2 id="heading-step-5-cluster-bootstrap-95-confidence-intervals">Step 5: Cluster Bootstrap 95% Confidence Intervals</h2>
<p>Point estimates are only half the story. A stakeholder asking "how sure are you" wants an interval. The classical non-parametric bootstrap doesn't apply cleanly to synthetic control on a single treated unit, because resampling the one treated time series with replacement destroys the time-ordering that the estimator depends on.</p>
<p>A valid substitute is the user-level cluster bootstrap: resample users with replacement, rebuild the workspace-by-week panel from the resampled user log, re-fit the donor weights on the pre-period, and record the post-period gap.</p>
<p>Repeat 500 times. The 2.5th and 97.5th percentiles of the resulting distribution are the 95% CI.</p>
<pre><code class="language-python">def build_panel(df_inner):
    dfw = df_inner[df_inner.signup_week &lt; WINDOW].copy()
    panel = (dfw.groupby(["workspace_id", "signup_week"])
             ["task_completed"].mean().reset_index())
    panel.columns = ["workspace_id", "week", "task_completed"]
    piv = panel.pivot(index="week", columns="workspace_id",
                      values="task_completed")
    piv = piv.interpolate(method="linear", axis=0).ffill().bfill()
    ws_wave_b = df_inner.groupby("workspace_id").wave.first()
    w1 = sorted(ws_wave_b[ws_wave_b == 1].index.tolist())
    w2 = sorted(ws_wave_b[ws_wave_b == 2].index.tolist())
    return piv[w1].mean(axis=1).values, piv[w2].values


rng = np.random.default_rng(7)
n = len(df)
n_reps = 500
gaps_boot = np.empty(n_reps)
for i in range(n_reps):
    sample = df.iloc[rng.integers(0, n, size=n)]
    t_b, d_b = build_panel(sample)
    gaps_boot[i] = fit_and_gap(t_b, d_b)

lo = float(np.percentile(gaps_boot, 2.5))
hi = float(np.percentile(gaps_boot, 97.5))
print(f"Post-period gap 95% CI: [{lo:+.4f}, {hi:+.4f}]")
print(f"Observed point estimate: {gap:+.4f}")
print(f"Ground truth +0.0500 inside CI: "
      f"{'YES' if lo &lt;= 0.05 &lt;= hi else 'NO'}")
print(f"Zero inside CI: {'YES' if lo &lt;= 0 &lt;= hi else 'NO'}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Post-period gap 95% CI: [+0.0511, +0.1215]
Observed point estimate: +0.0829
Ground truth +0.0500 inside CI: NO
Zero inside CI: NO
</code></pre>
<p><strong>Here's what's happening:</strong> you resample the user log 500 times, rebuild the panel from each resample, re-fit the weights on the pre-period, and take the 2.5th and 97.5th percentiles of the 500 resulting gaps. The 95% CI is [+5.11 pp, +12.15 pp]. It excludes zero with room to spare, so the effect is statistically meaningful.</p>
<p>The lower bound sits just above the +5 pp ground truth, so the interval narrowly misses the true effect. That overshoot is a finite-sample upward bias typical of synthetic control on small donor panels, where each donor workspace (about 19 users per week) carries more noise than the 25-workspace treated average.</p>
<p>Placebo, LOO, and bootstrap together confirm a real positive effect. The point-estimate bias is the tradeoff for using single-workspace donors.</p>
<p>For a stakeholder report, cite the interval alongside the point estimate and note the bias direction so the team reads the number with the right calibration.</p>
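<p>The resample, recompute, take-percentiles loop is easier to inspect in miniature. This sketch applies it to a plain mean over hypothetical user outcomes; in the tutorial, the statistic being recomputed is the full refit gap, not a simple mean.</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)

# Hypothetical user-level outcomes: 400 users, roughly 60% task completion.
completed = rng.binomial(1, 0.6, size=400)

n_reps = 2000
boot_means = np.empty(n_reps)
for i in range(n_reps):
    # Resample users with replacement, then recompute the statistic.
    idx = rng.integers(0, len(completed), size=len(completed))
    boot_means[i] = completed[idx].mean()

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {completed.mean():.4f}")
print(f"95% percentile interval: [{lo:.4f}, {hi:.4f}]")
</code></pre>
<p><strong>Here's what's happening:</strong> each replicate draws the same number of users with replacement and recomputes the statistic, and the 2.5th and 97.5th percentiles of the replicates bound the interval. Swapping the mean for a call that rebuilds the panel and refits the weights gives you the Step 5 procedure.</p>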
<h2 id="heading-when-synthetic-control-fails">When Synthetic Control Fails</h2>
<p>Synthetic control is a precise tool with narrow failure modes. The four most common map directly to the three identification assumptions.</p>
<h3 id="heading-1-donor-pool-contamination-violates-no-interference">1. Donor Pool Contamination (Violates No Interference)</h3>
<p>If the upgrade shipped to wave 1 spills over to wave 2 (shared API rate-limit pools, shared prompt caches, users migrating between workspaces), the donors are contaminated, and the gap understates the true effect.</p>
<p>The defense is institutional: audit what changed for donor units around the treatment date, explicitly including model-level channels like shared routing, shared caching, and shared monitoring.</p>
<h3 id="heading-2-fundamentally-different-units-violates-pre-period-fit">2. Fundamentally Different Units (Violates Pre-period Fit)</h3>
<p>The convex-hull condition states that the treated unit must lie within the donors' support. If the treated unit is structurally different (for example, enterprise customers where every donor is an SMB), no weighting scheme yields a credible counterfactual, regardless of how tight the pre-period fit appears.</p>
<p>Check the weights: if the optimizer assigns 80 percent to a single donor, that donor is doing the entire job, and you should ask whether it's truly comparable.</p>
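<p>One way to make that weight check routine is to summarize concentration with two numbers: the top donor's share and the effective number of donors (the inverse Herfindahl index). The sketch below is a minimal illustration, not part of the tutorial's pipeline. The function name <code>weight_concentration</code>, the workspace IDs, and the weight values are all hypothetical.</p>

```python
import numpy as np

def weight_concentration(weights, donor_ids):
    """Summarize how concentrated a synthetic-control weight vector is."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize defensively in case weights don't sum to 1
    top = int(np.argmax(w))
    return {
        "top_donor": donor_ids[top],
        "top_share": float(w[top]),
        # Inverse Herfindahl index: 1.0 means a single donor does all the
        # work, len(w) means perfectly equal weights.
        "effective_donors": float(1.0 / np.sum(w ** 2)),
    }

# Hypothetical weights for five donor workspaces (illustrative values only).
example = weight_concentration(
    [0.80, 0.10, 0.05, 0.05, 0.00],
    ["ws_101", "ws_102", "ws_103", "ws_104", "ws_105"],
)
print(example)
```

<p>An effective donor count close to 1 is the numeric version of "one donor is doing the entire job." It doesn't tell you whether that donor is comparable. It only tells you where to look.</p>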
<h3 id="heading-3-post-treatment-shocks-to-donors-violate-stable-donor-composition">3. Post-Treatment Shocks to Donors (Violate Stable Donor Composition)</h3>
<p>The synthetic control projects donor behavior forward from pre-period weights. If a key donor experiences a major shock after treatment (a customer churn, an outage, a competitor release), its post-treatment trajectory is no longer a clean counterfactual. Inspect the time series of high-weight donors for unusual post-treatment patterns.</p>
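<p>A rough screen for donor shocks can be sketched as follows, assuming a weeks-by-donors outcome matrix and a fitted weight vector. The helper <code>flag_donor_shocks</code>, its thresholds, and the simulated panel are hypothetical, and the z-score assumes week-to-week noise is roughly independent, which real usage data often violates. Treat a flag as a prompt to plot the series, not as a verdict.</p>

```python
import numpy as np

def flag_donor_shocks(donor_panel, weights, t0, min_weight=0.05, z_thresh=3.0):
    """Flag high-weight donors whose post-period level shifted sharply."""
    panel = np.asarray(donor_panel, dtype=float)  # shape (T, J)
    w = np.asarray(weights, dtype=float)
    pre, post = panel[:t0], panel[t0:]
    pre_sd = pre.std(axis=0, ddof=1)
    # Standard error of (post mean - pre mean) if the donor's level and
    # noise were unchanged after the treatment date.
    se = pre_sd * np.sqrt(1.0 / post.shape[0] + 1.0 / pre.shape[0])
    z = (post.mean(axis=0) - pre.mean(axis=0)) / se
    flagged = np.flatnonzero(
        np.logical_and(w >= min_weight, np.abs(z) > z_thresh)
    )
    return flagged, z

# Simulated panel: 20 pre-treatment weeks, 8 post-treatment weeks, 4 donors.
rng = np.random.default_rng(0)
t0, n_post, n_donors = 20, 8, 4
panel = 0.60 + 0.02 * rng.standard_normal((t0 + n_post, n_donors))
panel[t0:, 2] -= 0.15  # simulated outage hitting donor 2 after treatment
weights = np.array([0.40, 0.10, 0.45, 0.05])  # hypothetical fitted weights

flagged, z = flag_donor_shocks(panel, weights, t0)
print("Flagged donor columns:", flagged.tolist())
print("z-scores:", np.round(z, 1).tolist())
```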
<h3 id="heading-4-overfitting-risk-when-j-approaches-t-degrades-pre-period-fit-in-practice">4. Overfitting Risk When J Approaches T₀ (Degrades Pre-period Fit in Practice)</h3>
<p>The optimizer can fit the pre-period solely to noise when J ≥ T₀, creating the illusion of comparability. This tutorial runs at T₀/J = 20/25 = 0.8, in the danger zone. The LOO sensitivity check is the practical defense: if the gap holds up across donor drops, the fit reflects genuine comparability.</p>
<p>These failure modes stay invisible in your point estimate. They surface as a synthetic control that looks well-fit on paper and produces a gap that doesn't hold up when treatment rolls out to the next wave. Placebo test, LOO sensitivity, and bootstrap together are your defense.</p>
<h2 id="heading-what-to-do-next">What to Do Next</h2>
<p>Synthetic control is the right tool when your feature ships globally and there's a pool of untreated units resembling the treated unit.</p>
<p>If treated and donor units operate at different scales, <strong>augmented synthetic control</strong> adds a bias-correction term from a linear outcome model. If you have many treated units with staggered adoption, <strong>generalized synthetic control</strong> (the <code>gsynth</code> R package) extends the framework.</p>
<p>For production Python work, <code>pysyncon</code> implements the full Abadie-Diamond-Hainmueller estimator with predictor-weighting via a V-matrix outer loop and adds in-time placebo tests (assigning the treatment to a pre-period date and checking for a spurious gap) that this tutorial doesn't cover. The from-scratch implementation here shows the mechanics; <code>pysyncon</code> is what you ship to a reviewer.</p>
<p>The companion notebook for this tutorial lives at <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/04_synthetic_control">github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/04_synthetic_control</a>. Clone the repo, generate the synthetic dataset, and run <code>synthetic_control_demo.ipynb</code> (or <code>synthetic_control_demo.py</code>) to reproduce every code block, every number, and every figure from this tutorial.</p>
<p>When a model upgrade ships to every user at once, the naive before/after is usually the wrong number. Synthetic control builds "users like yours who didn't get the upgrade" from the data you already have, locks in the weights before the treatment week, and gives you a placebo distribution plus a bootstrap interval you can defend when a stakeholder asks how confident you are.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Product Experimentation with Regression Discontinuity: How an LLM Confidence Threshold Creates a Natural Experiment in Python ]]>
                </title>
                <description>
                    <![CDATA[ Causal inference for LLM-based features starts with one question editors ask before they ship anything: Did the change actually move the metric, or did the metric just move? Let's say that your team b ]]>
                </description>
                <link>https://www.freecodecamp.org/news/gen-ai-product-experimentation-with-regression-discontinuity-design/</link>
                <guid isPermaLink="false">69fe0255f239332df4da1c33</guid>
                
                    <category>
                        <![CDATA[ product experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ causal inference ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ regression-discontinuity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ experimentation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rudrendu Paul ]]>
                </dc:creator>
                <pubDate>Fri, 08 May 2026 15:33:41 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/a6b7e375-8638-4b98-824e-bb94c60e9e57.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Causal inference for LLM-based features starts with one question editors ask before they ship anything: Did the change actually move the metric, or did the metric just move?</p>
<p>Let's say that your team built a routing layer that splits incoming queries between two models: queries with a confidence score below 0.85 go to a premium model, and those at or above 0.85 go to a cheaper distilled model. The premium model costs 5x as much as the cheaper one.</p>
<p>Your boss wants the answer that ends the debate: Is the premium model worth it for the queries it sees?</p>
<p>You can't run a clean A/B test, because routing is deterministic: a query at confidence 0.84 always gets premium, a query at 0.86 always gets cheap, and you can't randomize the assignment.</p>
<p>You also can't trust a naïve comparison of premium-routed users against cheap-routed users. Premium handles the harder queries by design (that's the reason you built the gate), so the two groups differ in query difficulty before either model touches them.</p>
<p>The threshold itself is your free experiment. Right at 0.85, the assignment flips, but the queries on either side of that boundary are essentially identical. A query at confidence 0.849 isn't meaningfully different from a query at 0.851. Any differences in outcomes between the two narrow groups stem solely from the routing decision. That's what regression discontinuity design (RDD) reads.</p>
<p>In this tutorial, you'll use Python to estimate the causal effect of premium routing on task completion using sharp RDD with local linear regression. You'll sweep bandwidths to test estimate stability, run a manipulation diagnostic, check robustness with a quadratic specification, and bootstrap 95% confidence intervals around every point estimate.</p>
<p>The LLM telemetry is a 50,000-user synthetic dataset with the ground-truth premium-routing effect baked in at +6 percentage points, so you can verify that RDD recovers it.</p>
<p><strong>Companion code:</strong> every code block runs end-to-end <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/03_rdd_confidence_threshold">in the companion notebook</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-why-threshold-routing-is-a-natural-experiment">Why Threshold Routing is a Natural Experiment</a></p>
</li>
<li><p><a href="#heading-what-regression-discontinuity-actually-does">What Regression Discontinuity Actually Does</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-setting-up-the-working-example">Setting Up the Working Example</a></p>
</li>
<li><p><a href="#heading-step-1-a-sharp-rdd-with-local-linear-regression">Step 1: A Sharp RDD with Local Linear Regression</a></p>
</li>
<li><p><a href="#heading-step-2-try-different-bandwidths">Step 2: Try Different Bandwidths</a></p>
</li>
<li><p><a href="#heading-step-3-checking-for-manipulation-at-the-threshold">Step 3: Checking for Manipulation at the Threshold</a></p>
</li>
<li><p><a href="#heading-step-4-quadratic-specification-as-a-robustness-check">Step 4: Quadratic Specification as a Robustness Check</a></p>
</li>
<li><p><a href="#heading-step-5-bootstrap-confidence-intervals">Step 5: Bootstrap Confidence Intervals</a></p>
</li>
<li><p><a href="#heading-when-regression-discontinuity-fails">When Regression Discontinuity Fails</a></p>
</li>
<li><p><a href="#heading-what-to-do-next">What to Do Next</a></p>
</li>
</ul>
<h2 id="heading-why-threshold-routing-is-a-natural-experiment">Why Threshold Routing is a Natural Experiment</h2>
<p>The product reason this routing rule exists is to help your team spend the premium model budget where it earns its keep. Low-confidence queries are the harder ones, which is where a stronger model has the most upside. High-confidence queries already look easy enough for the cheap model to handle.</p>
<p>You'll see this routing direction across confidence-score gates for Q&amp;A assistants, query-complexity gates in multi-model gateways like OpenRouter, safety-score gates in content moderation, and latency-budget gates that re-route when the cheap model would exceed a p99 latency budget.</p>
<p>The mechanism is the same in every case: a continuous score, a threshold, and a deterministic routing rule.</p>
<p>What makes this setup useful for causal inference is that users don't pick which model they get. A query lands, the system computes confidence, and the routing layer decides. Right at the threshold, the user's experience flips from premium to cheap based on a difference too small to be meaningful.</p>
<p>Again, a query at 0.849 confidence isn't shipping a different problem to the model than a query at 0.851. Anything that differs in outcomes between those two groups is the routing decision speaking. The underlying query is the same.</p>
<p>That local randomness is the experiment RDD reads from. You don't need a randomized control group, a propensity score, or an instrument. You need a sharp threshold that nobody can game.</p>
<h2 id="heading-what-regression-discontinuity-actually-does">What Regression Discontinuity Actually Does</h2>
<p>The jump at the threshold is the causal effect, which is the number a product team can act on. RDD reads it by fitting two separate regression lines to the outcome: one for users just below the threshold and one for users just above. The vertical difference between those two fitted lines at the cutoff is the local average treatment effect at that point.</p>
<p>Graphically, picture task completion on the y-axis and query confidence on the x-axis. Completion generally trends with confidence (easier queries complete more often). At exactly 0.85, though, users below the cutoff get premium routing, and users above get cheap.</p>
<p>If premium routing helps, task completion sits higher just below 0.85 and drops just above it. Read from left to right with confidence rising, the plot shows a downward step at 0.85, because you're moving from the premium-treated zone into the cheap-treated zone.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/f772c04b-5642-472c-8182-695183027294.png" alt="f772c04b-5642-472c-8182-695183027294" style="display:block;margin:0 auto" width="1517" height="857" loading="lazy">

<p><em>Figure 1. Conceptual schematic. Two outcome trajectories, one for premium-routed queries (confidence below 0.85) and one for cheap-routed queries (confidence above 0.85), meet at the threshold but don't match. The vertical gap between their endpoints at 0.85 is the local causal effect of premium routing.</em></p>
<p>That gap is identified under two named assumptions:</p>
<ol>
<li><p><strong>No manipulation of the running variable:</strong> Users (or your system) can't precisely nudge a query's confidence score across the cutoff. If anyone can game their score to land just below 0.85 and grab premium routing, which side of the cutoff a query lands on is no longer as good as random, and RDD breaks.</p>
</li>
<li><p><strong>Continuity of potential outcomes at the cutoff:</strong> Every other factor that affects task completion (query type, user expertise, workspace tenure, time of day) varies smoothly across 0.85. Only the routing assignment changes discontinuously at exactly the threshold. If a second product rule fires at 0.85 (a different logging level, a separate UI treatment, a retry policy), RDD will attribute that rule's effect to the routing decision.</p>
</li>
</ol>
<p>These are the two assumptions you check before you trust the estimate. Step 3 below tests the first one. The second is a structural property of your system that you have to know cold.</p>
<p>Two practical choices shape every RDD: the <strong>bandwidth</strong> (how close to the cutoff to restrict the analysis) and the <strong>functional form</strong> (linear, quadratic, or local polynomial).</p>
<p>Narrow bandwidths cut potential bias by staying close to the local-randomization zone, but they shrink the sample. Linear specifications are stable, though they assume the underlying relationship can be approximated by a straight line on each side.</p>
<p>You'll try both linear and quadratic specifications at multiple bandwidths to see whether the answer holds.</p>
<p>The article uses sharp RDD throughout, since assignment is a deterministic function of confidence (below 0.85 always premium, above 0.85 always cheap). When the threshold is probabilistic and compliance is partial, the design is a fuzzy RDD, which requires an instrumental variables framework that you can implement using the <code>rdrobust</code> Python package.</p>
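<p>To make the fuzzy case concrete, here is a minimal sketch of the Wald-style ratio behind it: the jump in the outcome at the cutoff divided by the jump in the probability of treatment. It uses plain NumPy on simulated data and is not a substitute for <code>rdrobust</code>, which also handles bandwidth selection and inference. The compliance rates (80% below the cutoff, 10% above), the outcome model, and the helper <code>jump_at_cutoff</code> are assumptions made for illustration.</p>

```python
import numpy as np

def jump_at_cutoff(x, y, cutoff, bw):
    """Local linear jump in y at the cutoff: limit from below minus limit from above."""
    rc = x - cutoff
    keep = bw > np.abs(rc)
    rc, y = rc[keep], y[keep]
    below = (0 > rc).astype(float)
    # Columns: intercept, below-cutoff indicator, slope, slope change below.
    X = np.column_stack([np.ones_like(rc), below, rc, below * rc])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

rng = np.random.default_rng(0)
n = 200_000
conf = rng.beta(5, 2, size=n)

# Hypothetical partial compliance: 80% premium below the cutoff, 10% above.
p_premium = np.where(0.85 > conf, 0.80, 0.10)
premium = (p_premium > rng.random(n)).astype(float)

# Simulated outcome: +0.06 premium effect plus a smooth confidence trend.
p_complete = 0.50 + 0.10 * conf + 0.06 * premium
completed = (p_complete > rng.random(n)).astype(float)

reduced_form = jump_at_cutoff(conf, completed, 0.85, bw=0.10)
first_stage = jump_at_cutoff(conf, premium, 0.85, bw=0.10)
fuzzy_effect = reduced_form / first_stage

print(f"Jump in completion (reduced form): {reduced_form:+.4f}")
print(f"Jump in premium share (first stage): {first_stage:+.4f}")
print(f"Fuzzy RDD effect (ratio):            {fuzzy_effect:+.4f}")
```

<p>In a sharp design the first-stage jump is exactly 1, so the ratio collapses to the reduced-form jump that the rest of this tutorial estimates.</p>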
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You need Python 3.11 or newer, comfort with pandas and statsmodels, and rough familiarity with linear regression and interaction terms.</p>
<p>Install the packages used in this tutorial:</p>
<pre><code class="language-shell">pip install numpy pandas statsmodels matplotlib scipy
</code></pre>
<p><strong>Here's what's happening:</strong> four standard scientific Python libraries plus matplotlib for the diagnostic visualization. Nothing exotic.</p>
<p>Clone the companion repo and generate the synthetic dataset:</p>
<pre><code class="language-shell">git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv
</code></pre>
<p><strong>Here's what's happening:</strong> the data generator draws 50,000 users with a <code>query_confidence</code> score from a Beta(5,2) distribution, applies the routing rule (<code>routed_to_premium = query_confidence &lt; 0.85</code>), and bakes a +6-percentage-point premium routing effect into <code>task_completed</code>. Same seed, same dataset, every time.</p>
<h2 id="heading-setting-up-the-working-example">Setting Up the Working Example</h2>
<p>The dataset simulates a SaaS product that routes queries between a premium and a cheap model based on confidence score. The threshold is 0.85, and the ground-truth causal effect of premium routing is +6 percentage points on task completion. You know the truth, so you can check whether RDD recovers it.</p>
<p>Load the data and look at the routing breakdown:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data/synthetic_llm_logs.csv")
print(f"Loaded {len(df):,} rows, {df.shape[1]} columns")

print("\nRouting breakdown:")
counts = df.routed_to_premium.value_counts().to_dict()
print(f"  Premium-routed (confidence &lt; 0.85):  {counts.get(1, 0):,}")
print(f"  Cheap-routed   (confidence &gt;= 0.85): {counts.get(0, 0):,}")

print("\nQuery confidence distribution:")
print(df.query_confidence.describe().round(3))
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Loaded 50,000 rows, 16 columns

Routing breakdown:
  Premium-routed (confidence &lt; 0.85):  38,874
  Cheap-routed   (confidence &gt;= 0.85): 11,126

Query confidence distribution:
count    50000.000
mean         0.715
std          0.159
min          0.078
25%          0.611
50%          0.736
75%          0.838
max          0.998
</code></pre>
<p><strong>Here's what's happening:</strong> about 78% of queries land below the 0.85 cutoff and get premium routing. The Beta(5,2) distribution is skewed toward the upper end, with a median of 0.736, and most of its mass still sits below 0.85. The remaining 22% are queries that the model already feels confident about, and they go to the cheap model.</p>
<p>Before any regression, look at the naïve comparison every product team is tempted to run:</p>
<pre><code class="language-python">naive = (
    df[df.routed_to_premium == 1].task_completed.mean()
    - df[df.routed_to_premium == 0].task_completed.mean()
)
print(f"Naive premium-vs-cheap effect: {naive:+.4f}  (ground truth = +0.06)")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Naive premium-vs-cheap effect: +0.0632  (ground truth = +0.06)
</code></pre>
<p><strong>Here's what's happening:</strong> the naïve estimate sits at +0.0632, which is suspiciously close to the truth. That's an artifact of this specific synthetic dataset: the only thing that separates the premium and cheap groups is <code>query_confidence</code> itself, and the outcome doesn't depend on confidence except through routing, so there's nothing to confound the comparison.</p>
<p>In production, you almost never get this lucky. User expertise, prompt phrasing, time of day, and a dozen unobserved query traits all correlate with confidence and with completion.</p>
<p>A naïve comparison in a real system can be off by 50% or more in either direction. RDD gives you identification that doesn't depend on the absence of hidden confounders.</p>
<h3 id="heading-step-1-a-sharp-rdd-with-local-linear-regression">Step 1: A Sharp RDD with Local Linear Regression</h3>
<p>The basic sharp RDD estimator is a local linear regression. Restrict to users whose confidence sits within a bandwidth of the cutoff, fit separate linear slopes on each side, and read off the jump at 0.85.</p>
<pre><code class="language-python">cutoff = 0.85
bw = 0.10

near = df[(df.query_confidence &gt; cutoff - bw)
          &amp; (df.query_confidence &lt; cutoff + bw)].copy()
near["below_cutoff"] = (near.query_confidence &lt; cutoff).astype(int)
near["rc"] = near.query_confidence - cutoff

rdd_model = smf.ols(
    "task_completed ~ below_cutoff + rc + below_cutoff:rc",
    data=near,
).fit(cov_type="HC3")

effect = rdd_model.params["below_cutoff"]
print(f"RDD effect at cutoff (LATE): {effect:+.4f}")
print(f"Std error (HC3):             {rdd_model.bse['below_cutoff']:.4f}")
print(f"p-value:                     {rdd_model.pvalues['below_cutoff']:.4f}")
print(f"N users in (0.75, 0.95):     {len(near):,}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">RDD effect at cutoff (LATE): +0.0548
Std error (HC3):             0.0131
p-value:                     0.0000
N users in (0.75, 0.95):     21,689
</code></pre>
<p><strong>Here's what's happening:</strong> the model fits separate intercepts and slopes on each side of 0.85 (<code>below_cutoff</code> is the side indicator, <code>rc</code> is confidence centered at the cutoff). The coefficient on <code>below_cutoff</code> reads off the vertical jump at the threshold, which is the local average treatment effect (LATE) for queries with confidence near 0.85. You get +0.0548, within sampling noise of the +0.06 ground truth.</p>
<p>Three notes on the specification. First, <code>task_completed</code> is binary, so this is a linear probability model. For RDD with a binary outcome at the cutoff, the linear probability model is standard practice because local linearity is the identifying assumption either way. Logit at the cutoff is an alternative if you need bounded predictions globally.</p>
<p>Second, the standard errors use <code>cov_type="HC3"</code> to relax the homoskedasticity assumption, which is almost always wrong for binary outcomes.</p>
<p>Third, the dataset has one query per user with no within-user clustering, so cluster-robust standard errors aren't needed here. In a setting with multiple queries per user, you'd cluster on <code>user_id</code>.</p>
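<p>For the multi-query case, a sketch of the clustered fit might look like the block below. It runs on simulated data with five queries per user and a shared user-level effect, all of which are assumptions for illustration (the tutorial dataset has one query per user). The <code>cov_type="cluster"</code> option with a <code>groups</code> entry in <code>cov_kwds</code> is the statsmodels mechanism for cluster-robust standard errors.</p>

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_users, per_user = 2_000, 5
n_rows = n_users * per_user

user_id = np.repeat(np.arange(n_users), per_user)
# Hypothetical user-level effect shared by all of a user's queries.
user_effect = np.repeat(rng.normal(0, 0.15, n_users), per_user)
conf = rng.uniform(0.75, 0.95, n_rows)
below = (0.85 > conf).astype(int)
p = np.clip(0.50 + 0.06 * below + user_effect, 0, 1)

sim = pd.DataFrame({
    "user_id": user_id,
    "rc": conf - 0.85,
    "below_cutoff": below,
    "task_completed": (p > rng.random(n_rows)).astype(int),
})

formula = "task_completed ~ below_cutoff + rc + below_cutoff:rc"
hc3 = smf.ols(formula, data=sim).fit(cov_type="HC3")
clustered = smf.ols(formula, data=sim).fit(
    cov_type="cluster", cov_kwds={"groups": sim["user_id"]}
)

print(f"Effect (same in both fits): {clustered.params['below_cutoff']:+.4f}")
print(f"HC3 std error:              {hc3.bse['below_cutoff']:.4f}")
print(f"Clustered std error:        {clustered.bse['below_cutoff']:.4f}")
```

<p>The point estimate is identical across the two fits. Only the standard error, and therefore the p-value and interval, changes with the covariance choice.</p>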
<p>The next diagnostic to look at is the confidence distribution near the cutoff. Figure 2 shows what 50,000 queries look like in the bandwidth window:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/9ecb8a4c-6eac-4732-95ae-2a5981917f54.png" alt="9ecb8a4c-6eac-4732-95ae-2a5981917f54" style="display:block;margin:0 auto" width="1483" height="1005" loading="lazy">

<p><em>Figure 2. Real distribution from the 50,000-user synthetic dataset. Unlike the schematic in Figure 1, this shows the actual query density by confidence score, with the routing threshold annotated. The bottom panel counts how many queries land in each 2-percentage-point bin near the cutoff (2,461 / 2,481 / 2,335 / 2,229 / 2,048 across the 0.80–0.90 range). The roughly uniform spread is the visual signal that no manipulation is concentrating users on one side of the threshold.</em></p>
<h3 id="heading-step-2-try-different-bandwidths">Step 2: Try Different Bandwidths</h3>
<p>Bandwidth choice matters. Too narrow and you have too few observations, so the confidence interval blows up. Too wide and you're extrapolating into regions where the linear specification is no longer a reasonable local approximation.</p>
<p>The honest move is to try multiple bandwidths and report whether the estimate holds.</p>
<pre><code class="language-python">results = []
for bw in [0.05, 0.10, 0.15, 0.20]:
    sub = df[(df.query_confidence &gt; cutoff - bw)
             &amp; (df.query_confidence &lt; cutoff + bw)].copy()
    sub["below_cutoff"] = (sub.query_confidence &lt; cutoff).astype(int)
    sub["rc"] = sub.query_confidence - cutoff

    m = smf.ols(
        "task_completed ~ below_cutoff + rc + below_cutoff:rc",
        data=sub,
    ).fit(cov_type="HC3")

    results.append({
        "bandwidth": bw,
        "n": len(sub),
        "effect": m.params["below_cutoff"],
        "se": m.bse["below_cutoff"],
        "p": m.pvalues["below_cutoff"],
    })

print(pd.DataFrame(results).round(4).to_string(index=False))
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text"> bandwidth      n  effect     se       p
      0.05  11554  0.0635  0.0183  0.0005
      0.10  21689  0.0548  0.0131  0.0000
      0.15  29137  0.0618  0.0112  0.0000
      0.20  34074  0.0614  0.0107  0.0000
</code></pre>
<p><strong>Here's what's happening:</strong> four bandwidths from ±0.05 to ±0.20 around the cutoff, refitting the same RDD specification at each. The estimates range from +0.0548 to +0.0635, all in the same neighborhood as the +0.06 ground truth, with standard errors that shrink as the bandwidth widens and grow as it narrows. Every p-value is well below 0.05. Whether the estimates are "stable" depends on the confidence intervals around them, which Step 5 produces with the bootstrap.</p>
<h3 id="heading-step-3-checking-for-manipulation-at-the-threshold">Step 3: Checking for Manipulation at the Threshold</h3>
<p>RDD is valid only if users can't precisely manipulate the running variable around the cutoff. If your users (or your system) can nudge confidence scores just below 0.85 to force premium routing, you get a density spike at the cutoff, and the RDD estimate is contaminated.</p>
<p>The standard diagnostic is the McCrary density test, which checks whether the distribution of the running variable has a sharp jump at the cutoff. The simple version: bin the data tightly around 0.85 and check whether the counts on the two sides are similar.</p>
<pre><code class="language-python">print("User counts in 2-percentage-point bins around 0.85:")
for lo in [0.80, 0.82, 0.84, 0.86, 0.88]:
    hi = lo + 0.02
    cnt = ((df.query_confidence &gt;= lo) &amp; (df.query_confidence &lt; hi)).sum()
    print(f"  [{lo:.2f}, {hi:.2f}):  n = {cnt:,}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">User counts in 2-percentage-point bins around 0.85:
  [0.80, 0.82):  n = 2,461
  [0.82, 0.84):  n = 2,481
  [0.84, 0.86):  n = 2,335
  [0.86, 0.88):  n = 2,229
  [0.88, 0.90):  n = 2,048
</code></pre>
<p><strong>Here's what's happening:</strong> counts trend gently downward across the window because the Beta(5,2) density peaks at 0.80 and tapers as confidence approaches 1.0. There's no spike or dip at the 0.84–0.86 bin that straddles the cutoff. The 433-user spread between the largest and smallest of the five bins is consistent with smooth tapering of the underlying density.</p>
<p>That's the pattern you want when manipulation is absent. For a more rigorous test, the <a href="https://github.com/rdpackages/rddensity"><code>rddensity</code></a> Python package implements a formal manipulation test based on local polynomial density estimation (Cattaneo, Jansson, and Ma), a successor to the McCrary procedure, with robust bias-corrected inference.</p>
<p>What manipulation looks like when it's real: a spike in users at confidences just barely below 0.85 (they're being nudged into premium routing) and a dip just above. If you see that pattern, the RDD estimate overstates the causal effect because the users right below 0.85 differ in motivation from those right above. They cared enough to manipulate the score, and they'd have shown different outcomes even under random routing.</p>
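<p>A bin that straddles the cutoff can hide exactly this spike-and-dip pattern, so a sharper informal check splits the window at the threshold and compares the two counts. The sketch below does that with a simple normal approximation. It is a rough screen that assumes the density is locally flat within a narrow window, not a replacement for <code>rddensity</code>. The helper <code>side_count_test</code> and the simulated manipulation (half of the scores in [0.85, 0.86) nudged down by 0.01) are hypothetical.</p>

```python
import math
import numpy as np

def side_count_test(x, cutoff, h):
    """Compare counts in [cutoff - h, cutoff) with counts in [cutoff, cutoff + h)."""
    x = np.asarray(x, dtype=float)
    n_below = int(np.sum(np.logical_and(x >= cutoff - h, cutoff > x)))
    n_above = int(np.sum(np.logical_and(x >= cutoff, cutoff + h > x)))
    n = n_below + n_above
    # If the density is locally flat, each observation in the window lands
    # on either side with probability 0.5 (normal approximation to a binomial).
    z = (n_below - n_above) / math.sqrt(n)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return n_below, n_above, z, p_value

rng = np.random.default_rng(0)
clean = rng.beta(5, 2, size=50_000)  # simulated scores, no manipulation

# Simulated manipulation: half of the scores in [0.85, 0.86) get nudged
# 0.01 downward so that they land just below the cutoff.
gamed = clean.copy()
in_band = np.logical_and(gamed >= 0.85, 0.86 > gamed)
nudge = np.logical_and(in_band, rng.random(gamed.size) > 0.5)
gamed[nudge] -= 0.01

for label, scores in [("clean", clean), ("gamed", gamed)]:
    n_b, n_a, z, p = side_count_test(scores, cutoff=0.85, h=0.01)
    print(f"{label}: below={n_b:,} above={n_a:,} z={z:+.2f} p={p:.4f}")
```

<p>Keep <code>h</code> small. On a sloped density, a wide window produces an imbalance that has nothing to do with manipulation.</p>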
<h3 id="heading-step-4-quadratic-specification-as-a-robustness-check">Step 4: Quadratic Specification as a Robustness Check</h3>
<p>If the true relationship between confidence and task completion isn't exactly linear, a local linear RDD can mistake the curvature for a jump. The standard robustness check allows quadratic terms on both sides of the cutoff and tests whether the estimate holds.</p>
<pre><code class="language-python">near = df[(df.query_confidence &gt; cutoff - 0.10)
         &amp; (df.query_confidence &lt; cutoff + 0.10)].copy()
near["below_cutoff"] = (near.query_confidence &lt; cutoff).astype(int)
near["rc"] = near.query_confidence - cutoff
near["rc2"] = near.rc ** 2

rdd_quad = smf.ols(
    "task_completed ~ below_cutoff + rc + below_cutoff:rc"
    " + rc2 + below_cutoff:rc2",
    data=near,
).fit(cov_type="HC3")

print(f"Linear RDD    (bw=0.10):  effect = +0.0548, p &lt; 0.0001")
print(f"Quadratic RDD (bw=0.10):  effect = "
      f"{rdd_quad.params['below_cutoff']:+.4f}, "
      f"p = {rdd_quad.pvalues['below_cutoff']:.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Linear RDD    (bw=0.10):  effect = +0.0548, p &lt; 0.0001
Quadratic RDD (bw=0.10):  effect = +0.0569, p = 0.0036
</code></pre>
<p><strong>Here's what's happening:</strong> the quadratic specification adds squared terms and interactions with the cutoff indicator, allowing the relationship to curve differently on each side. The <code>below_cutoff</code> coefficient still captures the jump at the threshold, now under a more flexible specification.</p>
<p>The two estimates differ by 0.0021, both sit close to the +0.06 ground truth, and both are significant at p &lt; 0.01. The answer doesn't change when you let the model bend.</p>
<p>When linear and quadratic specifications disagree noticeably, treat it as a warning that the functional form matters. With small samples (a few thousand at narrow bandwidths), the quadratic version can lose power because its two extra parameters need data to be identified.</p>
<p>The standard move is to widen the bandwidth and re-run both specifications. If they still disagree at wider bandwidths, the linear approximation is wrong, and you should report both numbers.</p>
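<p>That widen-and-compare loop can be sketched in a few lines. The block below uses plain NumPy least squares on simulated data where the outcome curves on the premium side only, so the straight-line fit is slightly wrong there by construction. The helper <code>rdd_jump</code>, the curvature term, and the sample size are assumptions for illustration, not the tutorial's dataset.</p>

```python
import numpy as np

def rdd_jump(x, y, cutoff, bw, degree=1):
    """Jump at the cutoff from separate polynomials of `degree` on each side."""
    rc = x - cutoff
    keep = bw > np.abs(rc)
    rc, y = rc[keep], y[keep]
    below = (0 > rc).astype(float)
    cols = [np.ones_like(rc), below]
    for d in range(1, degree + 1):
        cols += [rc ** d, below * rc ** d]  # side-specific polynomial terms
    beta, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return beta[1]  # coefficient on the below-cutoff indicator

rng = np.random.default_rng(0)
n = 200_000
conf = rng.beta(5, 2, size=n)
rc_all = conf - 0.85

# Simulated outcome: +0.06 jump below the cutoff plus curvature on the
# premium side only (hypothetical data-generating process).
p = np.clip(0.50 + 0.06 * (0 > rc_all) + 1.5 * np.minimum(rc_all, 0) ** 2, 0, 1)
completed = (p > rng.random(n)).astype(float)

for bw in [0.10, 0.15, 0.20]:
    lin = rdd_jump(conf, completed, 0.85, bw, degree=1)
    quad = rdd_jump(conf, completed, 0.85, bw, degree=2)
    print(f"bw={bw:.2f}  linear={lin:+.4f}  quadratic={quad:+.4f}  "
          f"diff={lin - quad:+.4f}")
```

<p>If the <code>diff</code> column grows steadily as the bandwidth widens, curvature is leaking into the linear estimate and both numbers belong in the report.</p>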
<h3 id="heading-step-5-bootstrap-confidence-intervals">Step 5: Bootstrap Confidence Intervals</h3>
<p>Every point estimate in this article is a single number from a finite sample. The bootstrap quantifies how much that number would move under resampling, which is what a confidence interval describes.</p>
<pre><code class="language-python">def bootstrap_ci(df, cutoff, bw, quadratic=False, n_reps=500, seed=7):
    rng = np.random.default_rng(seed)
    near = df[(df.query_confidence &gt; cutoff - bw)
              &amp; (df.query_confidence &lt; cutoff + bw)].copy()
    near["below_cutoff"] = (near.query_confidence &lt; cutoff).astype(int)
    near["rc"] = near.query_confidence - cutoff
    if quadratic:
        near["rc2"] = near.rc ** 2
        formula = ("task_completed ~ below_cutoff + rc + below_cutoff:rc"
                   " + rc2 + below_cutoff:rc2")
    else:
        formula = "task_completed ~ below_cutoff + rc + below_cutoff:rc"

    n = len(near)
    estimates = np.empty(n_reps)
    for i in range(n_reps):
        sample = near.iloc[rng.integers(0, n, size=n)]
        m = smf.ols(formula, data=sample).fit()
        estimates[i] = m.params["below_cutoff"]
    return (np.percentile(estimates, 2.5), np.percentile(estimates, 97.5))


print("Linear RDD (bw=0.10):")
lo, hi = bootstrap_ci(df, cutoff, bw=0.10)
print(f"  effect = +0.0548   95% CI: [{lo:+.4f}, {hi:+.4f}]")

print("\nBandwidth sensitivity:")
for bw, eff in [(0.05, 0.0635), (0.10, 0.0548), (0.15, 0.0618), (0.20, 0.0614)]:
    lo, hi = bootstrap_ci(df, cutoff, bw=bw)
    print(f"  bw = {bw:.2f}   effect = {eff:+.4f}   "
          f"95% CI: [{lo:+.4f}, {hi:+.4f}]")

print("\nQuadratic RDD (bw=0.10):")
lo, hi = bootstrap_ci(df, cutoff, bw=0.10, quadratic=True)
print(f"  effect = +0.0569   95% CI: [{lo:+.4f}, {hi:+.4f}]")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Linear RDD (bw=0.10):
  effect = +0.0548   95% CI: [+0.0278, +0.0817]

Bandwidth sensitivity:
  bw = 0.05   effect = +0.0635   95% CI: [+0.0244, +0.0986]
  bw = 0.10   effect = +0.0548   95% CI: [+0.0278, +0.0817]
  bw = 0.15   effect = +0.0618   95% CI: [+0.0381, +0.0823]
  bw = 0.20   effect = +0.0614   95% CI: [+0.0420, +0.0808]

Quadratic RDD (bw=0.10):
  effect = +0.0569   95% CI: [+0.0205, +0.0959]
</code></pre>
<p><strong>Here's what's happening:</strong> the bootstrap resamples the bandwidth-restricted data with replacement 500 times, refits the RDD on each replicate, and collects the <code>below_cutoff</code> coefficient. The 2.5th and 97.5th percentiles of those 500 estimates form the 95% interval. Every interval covers the +0.06 ground truth, every interval excludes zero, and the bandwidth sweep produces overlapping intervals.</p>
<p>That's quantitative stability, verified by resampling across bandwidths from 0.05 to 0.20. Intervals widen as the bandwidth shrinks and narrow as it grows. The quadratic interval is wider than the linear one because the two extra terms (<code>rc2</code> and <code>below_cutoff:rc2</code>) absorb degrees of freedom.</p>
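<p>To make the widening concrete, here's a small sketch that only reuses the interval endpoints from the expected output above. Nothing is re-estimated, and the variable names are illustrative rather than part of the tutorial's code.</p>
<pre><code class="language-python"># 95% bootstrap intervals copied from the expected output, keyed by bandwidth.
intervals = {
    0.05: (0.0244, 0.0986),
    0.10: (0.0278, 0.0817),
    0.15: (0.0381, 0.0823),
    0.20: (0.0420, 0.0808),
}

# Interval width per bandwidth: a narrower window keeps fewer rows, so the estimate is noisier.
widths = {bw: round(hi - lo, 4) for bw, (lo, hi) in intervals.items()}
for bw, width in widths.items():
    print(f"bw = {bw:.2f}   width = {width:.4f}")

# Region shared by all four intervals: [largest lower bound, smallest upper bound].
overlap = (max(lo for lo, _ in intervals.values()),
           min(hi for _, hi in intervals.values()))
print(f"shared region: [{overlap[0]:+.4f}, {overlap[1]:+.4f}]")

# Linear vs. quadratic interval width at bw = 0.10.
linear_width = 0.0817 - 0.0278
quadratic_width = 0.0959 - 0.0205
print(f"quadratic / linear width = {quadratic_width / linear_width:.2f}")
</code></pre>
<p>The widths shrink steadily from bw = 0.05 to bw = 0.20, the shared region still contains the +0.06 ground truth, and at bw = 0.10 the quadratic interval is roughly 1.4 times as wide as the linear one.</p>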
<p>One thing the intervals do NOT do on this dataset: exclude the naïve +0.0632 estimate. That's because the data generator doesn't bake in confounding by query confidence. In expectation, the only difference between the premium and cheap groups is the +6pp routing effect itself, so the naïve comparison is close to the truth.</p>
<p>Real systems are messier. In a production setting where unobserved query traits affect both the routing assignment and task completion, the naïve estimate would diverge from the RDD estimate, and the bootstrap intervals would tell you which one to trust.</p>
<h2 id="heading-when-regression-discontinuity-fails">When Regression Discontinuity Fails</h2>
<p>RDD looks clean, but several specific failure modes can destroy the identification. Each one maps to a violation of one of the two named assumptions.</p>
<p><strong>Users manipulate the running variable</strong> (violates assumption 1). The whole setup depends on users (or any upstream service) being unable to precisely control which side of the cutoff they land on. Any system that reveals the cutoff and gives users a way to influence their score (a retry mechanism, a prompt engineering workaround, a confidence-inflating trick) breaks RDD.</p>
<p>Run the density check in Step 3 every time. If you find manipulation, switch to a fuzzy RDD that treats the threshold as probabilistic, or abandon the approach.</p>
<p><strong>Other policies fire at the same cutoff</strong> (violates assumption 2). If your product has additional rules that activate at 0.85 (a separate UI treatment, a different logging level, a different retry policy), RDD can't separate the routing effect from those other policy effects. Audit the full rule book for anything that shares the threshold.</p>
<p><strong>The threshold has noise or overrides</strong> (violates assumption 1, in the structural sense). Maybe routing isn't strictly deterministic at 0.85&nbsp;– it may have random jitter, or a second rule may override the main rule in some cases.</p>
<p>If assignment to the premium model isn't a deterministic function of <code>query_confidence</code>, you have a fuzzy RDD, which requires an instrumental variables framework. The <code>rdrobust</code> package handles both sharp and fuzzy designs.</p>
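<p>As a rough illustration of the instrumental variables idea, the sketch below simulates noisy routing and computes a simple Wald ratio: the jump in the outcome at the cutoff divided by the jump in the share of queries that actually reach the premium model. Everything here is an assumption made for the example (the 90%/10% routing shares, the baseline completion rate, the helper names), and it reports no standard errors, so treat it as a picture of the mechanics rather than a replacement for <code>rdrobust</code>.</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
cutoff, bw, n_side = 0.85, 0.10, 200_000
true_effect = 0.06  # hypothetical effect of premium routing on task completion


def simulate_side(lo, hi, p_premium):
    """Draw one side of the cutoff with noisy (non-deterministic) routing."""
    conf = rng.uniform(lo, hi, size=n_side)
    premium = rng.binomial(1, p_premium, size=n_side)
    p_done = 0.55 + 0.10 * (conf - cutoff) + true_effect * premium
    done = rng.binomial(1, p_done)
    return conf - cutoff, premium, done


def value_at_cutoff(rc, y):
    """Fit a line on one side and return its intercept, the fitted value at rc = 0."""
    slope, intercept = np.polyfit(rc, y, deg=1)
    return intercept


# Below the cutoff 90% of queries reach premium; above it 10% still do (overrides).
rc_lo, prem_lo, done_lo = simulate_side(cutoff - bw, cutoff, 0.90)
rc_hi, prem_hi, done_hi = simulate_side(cutoff, cutoff + bw, 0.10)

jump_outcome = value_at_cutoff(rc_lo, done_lo) - value_at_cutoff(rc_hi, done_hi)
jump_treatment = value_at_cutoff(rc_lo, prem_lo) - value_at_cutoff(rc_hi, prem_hi)
fuzzy_effect = jump_outcome / jump_treatment  # Wald ratio

print(f"jump in completion rate: {jump_outcome:+.4f}")
print(f"jump in premium share:   {jump_treatment:+.4f}")
print(f"fuzzy RDD estimate:      {fuzzy_effect:+.4f}")
</code></pre>
<p>Because only part of the traffic changes model at the threshold, the raw jump in completion understates the routing effect. Dividing by the jump in premium share rescales it to the effect for queries whose routing the threshold actually changes.</p>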
<p><strong>Curvature masquerading as a jump</strong> (breaks the linear approximation that supports identification at the cutoff). Sharp RDD assumes linearity is a reasonable local approximation. When the underlying outcome-confidence relationship is strongly curved, the linear specification can mistake the bend for a jump.</p>
<p>Step 4's quadratic robustness check is the standard diagnostic. If linear and quadratic disagree, narrow the bandwidth so the linear fit only has to hold close to the cutoff, then re-run both.</p>
<p><strong>Extrapolation bias</strong> (a continuity issue, reframed). RDD estimates are strictly local to the cutoff. The +0.06 effect at 0.85 tells you nothing about what premium routing would do for queries with confidence 0.30 or 0.99.</p>
<p>If you want a global average effect, you need a different technique: propensity methods, regression with confounder adjustment, or an actual experiment.</p>
<h2 id="heading-what-to-do-next">What to Do Next</h2>
<p>RDD is the right tool when your AI feature is gated by a continuous score and a sharp threshold.</p>
<p>If your feature is gated by a user-controlled toggle, propensity score methods are a better fit. If it's gated by a staged rollout across workspaces, difference-in-differences handles it. If it's gated by rules you can't observe directly but that have a random component, instrumental variables is the right choice.</p>
<p>For production RDD analyses, use the <a href="https://github.com/rdpackages/rdrobust"><code>rdrobust</code></a> Python package. It gives you optimal bandwidth selection (Calonico, Cattaneo, and Titiunik 2014), robust bias-corrected confidence intervals, and a built-in plotting utility. The companion <a href="https://github.com/rdpackages/rddensity"><code>rddensity</code></a> package implements a formal manipulation test based on local polynomial density estimation, a more rigorous version of the McCrary-style density check you saw informally in Step 3.</p>
<p>The from-scratch version in this tutorial shows the mechanics. The rd-packages stack is what you ship to a reviewer.</p>
<p>One thing the LATE doesn't do: tell you the effect for users far from the cutoff. If a +0.06 LATE at 0.85 is enough to keep premium routing in the pipeline, you're done. If you need to know what premium would do for the easy queries you're currently sending to cheap (or the hardest queries near the floor), the next step is a small randomized rollout in those zones, scored against the RDD estimate as a calibration check. Don't generalize the LATE without evidence.</p>
<p>The companion notebook for this tutorial <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/03_rdd_confidence_threshold">lives here on GitHub</a>. Clone the repo, generate the synthetic dataset, and run <code>rdd_demo.ipynb</code> to reproduce every code block from this tutorial.</p>
<p>Threshold routing is one of the most common patterns in production LLM systems, and every confidence-gated routing decision in your stack is a potential RDD. Run the analysis.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1) ]]>
                </title>
                <description>
                    <![CDATA[ We use AI tools all the time, whether it’s asking questions, generating images, or getting help with everyday tasks. But most of these tools didn’t appear out of nowhere. They were developed based on  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/ai-paper-review-improving-language-understanding-by-generative-pre-training-gpt-1/</link>
                <guid isPermaLink="false">69fb84ad50ecad45335e5367</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ academic writing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ transformers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mohammed Fahd Abrah ]]>
                </dc:creator>
                <pubDate>Wed, 06 May 2026 18:13:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/0998e844-4017-49b9-a68d-2d6c73fceb78.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>We use AI tools all the time, whether it’s asking questions, generating images, or getting help with everyday tasks. But most of these tools didn’t appear out of nowhere. They grew out of research papers where the original ideas were first proposed and tested.</p>
<p>Now, not everyone enjoys reading research papers or has the time to comb through and digest all that (sometimes very dense) info. So I decided to do the hard work for you and share the key insights in a series of AI paper reviews.</p>
<p>The goal isn’t to turn this into a heavy academic discussion, but to explain the main ideas in a clear and practical way. You'll learn what problem the paper was trying to solve, what approach it introduced, and why it mattered.</p>
<p>In each article, you’ll get a simple breakdown of the paper, how it works, and what you should take away from it. By the end, you should understand the key idea without needing to go through the full research paper yourself.</p>
<h2 id="heading-paper-overview">Paper Overview</h2>
<p>The first paper I'll be reviewing is "Improving Language Understanding by Generative Pre-Training", by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.</p>
<p>Here's the actual paper if you want to read it yourself: <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">Read the paper</a>.</p>
<p>And here's a little infographic of what we'll cover here:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/0466e09f-c2a3-41fa-939d-f67d53f900e1.png" alt="Infographic summarizing this review of the GPT-1 paper" style="display:block;margin:0 auto" width="1414" height="2000" loading="lazy">

<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a href="#heading-executive-summary">Executive Summary</a></p>
</li>
<li><p><a href="#heading-goals-of-the-paper">Goals of the Paper</a></p>
</li>
<li><p><a href="#heading-methodology">Methodology</a></p>
</li>
<li><p><a href="#heading-transformer-vs-bert-vs-gpt">Transformer vs. BERT vs. GPT</a></p>
</li>
<li><p><a href="#heading-model-architecture">Model Architecture</a></p>
</li>
<li><p><a href="#heading-key-techniques">Key Techniques</a></p>
</li>
<li><p><a href="#heading-key-findings">Key Findings</a></p>
</li>
<li><p><a href="#heading-conclusions">Conclusions</a></p>
</li>
<li><p><a href="#heading-limitations">Limitations</a></p>
</li>
<li><p><a href="#heading-related-work-amp-context">Related Work &amp; Context</a></p>
</li>
<li><p><a href="#heading-final-insight">Final Insight</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this breakdown, it helps to be familiar with a few basic ideas:</p>
<ul>
<li><p>A general understanding of natural language processing (NLP) and how machines work with text</p>
</li>
<li><p>A high-level idea of what a Transformer model is (you don’t need deep details, just the concept)</p>
</li>
<li><p>The difference between supervised and unsupervised learning</p>
</li>
<li><p>Basic machine learning concepts like training data and models</p>
</li>
</ul>
<p>If you’re not fully comfortable with all of these, that’s okay. You can still follow along. The goal here is to keep things clear and intuitive.</p>
<h2 id="heading-executive-summary">Executive Summary</h2>
<p>Before models like GPT became what we know today, there was a key limitation: AI systems were good at specific tasks, but struggled with general understanding.</p>
<p>In this paper, the authors introduce a simple but powerful idea. Instead of training a model separately for each task, they first train it on a large amount of unlabeled text to learn the structure of language. Then, they adapt it to specific tasks using smaller labeled datasets.</p>
<p>According to the authors, this two-step approach (pre-training followed by fine-tuning) allows a single model to handle many different tasks with minimal changes.</p>
<p>In practice, this marked a major shift: rather than building a new model for every problem, we can train one general model that learns language itself and then reuse it across tasks.</p>
<h2 id="heading-goals-of-the-paper">Goals of the Paper</h2>
<p>To understand the motivation behind this work, it helps to look at the main limitations in NLP at the time.</p>
<p>Most models depended heavily on large labeled datasets, which weren’t always available. Many tasks simply didn’t have enough labeled data to train effective systems. On top of that, existing models were usually designed for a single task, making them hard to reuse or adapt.</p>
<p>Because of this, the authors aimed to reduce the reliance on labeled data and move toward a more general approach. Their goal was to build a language model that could learn from large amounts of raw text and then be applied across different tasks.</p>
<p>According to the paper, they also wanted to enable transfer learning: the ability to take knowledge learned from one task and apply it to others. A further aim was to improve performance without having to design a new model each time.</p>
<h2 id="heading-methodology">Methodology</h2>
<p>To understand how the authors approached this problem, let’s look at the core idea behind their method.</p>
<h3 id="heading-pre-training">Pre-Training</h3>
<p>At the heart of the paper is a simple but powerful approach built in two stages. The first stage is pre-training, where the model learns directly from raw text.</p>
<p>According to the authors, the model is trained on a large corpus of unlabeled text using a language modeling objective: predicting the next word in a sequence from the words that come before it. Breaking the text into one-word-at-a-time predictions avoids the otherwise intractable problem of modeling <a href="https://en.wikipedia.org/wiki/High-dimensional_statistics">high-dimensional probabilities</a> over whole sequences directly. Through this process, the model gradually learns important aspects of language, such as grammar, context, structure, and general patterns.</p>
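<p>To make the objective concrete: the paper maximizes the sum of <code>log P(u_i | u_(i-k), ..., u_(i-1))</code> over every position <code>i</code> in the corpus, where <code>k</code> is the context window. The sketch below computes that sum on a toy sentence. As a loud simplification, it swaps the Transformer for a bigram count model (a context window of one token), and the corpus and function name are made up for illustration.</p>
<pre><code class="language-python">import math
from collections import Counter

tokens = "the cat sat on the mat".split()  # toy corpus (hypothetical)

# Count how often each token follows each one-token context.
pair_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])


def next_token_prob(context, token):
    """Estimate P(token | context) from raw counts."""
    return pair_counts[(context, token)] / context_counts[context]


# Language modeling objective: sum of log P(next token | previous token).
log_likelihood = sum(
    math.log(next_token_prob(prev, nxt)) for prev, nxt in zip(tokens, tokens[1:])
)
print(f"log-likelihood = {log_likelihood:.4f}")
</code></pre>
<p>In this toy corpus, "the" is followed once by "cat" and once by "mat", so each of those predictions gets probability 0.5 and every other prediction gets probability 1. The total is therefore 2 × log(0.5), about -1.3863. Pre-training adjusts the model's parameters to push this number as high as possible across the whole corpus.</p>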
<p>The paper highlights that datasets like BooksCorpus are used in this stage because they contain long, continuous text. This is important, since it helps the model understand relationships across sentences rather than just short fragments.</p>
<h3 id="heading-fine-tuning-adapting-to-tasks">Fine-Tuning (Adapting to Tasks)</h3>
<p>Once the model has learned general language patterns, the next step is fine-tuning, where it is adapted to specific tasks using labeled data.</p>
<p>According to the authors, this includes tasks like question answering, text classification, natural language inference, and semantic similarity. Instead of building a new model for each task, the same pre-trained model is reused with only small adjustments.</p>
<p>In practice, this is what makes the approach powerful: the model already understands language at a general level, so it can quickly adapt to different tasks without needing to be redesigned from scratch.</p>
<h2 id="heading-transformer-vs-bert-vs-gpt">Transformer vs. BERT vs. GPT</h2>
<p>Before diving into GPT-1, it helps to understand how modern language models are structured. Most of them are based on the Transformer architecture, but they use it in different ways: encoder-only models (like BERT), decoder-only models (like GPT), or full encoder–decoder models.</p>
<p>The original encoder–decoder Transformer was mainly used for tasks like machine translation. Encoder-only models are typically used for understanding tasks such as text classification and sentiment analysis, while decoder-only models are designed for generation tasks like text creation, powering systems such as ChatGPT, Gemini, and Claude.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/e7348479-5fa0-4adf-92e1-644ae2039b03.png" alt="Comparison of the encoder-decoder Transformer, decoder-only GPT, and encoder-only BERT architectures" style="display:block;margin:0 auto" width="700" height="449" loading="lazy">

<p><em>Illustration comparing Transformer, GPT, and BERT architectures, adapted from</em> <a href="https://automotivevisions.wordpress.com/2025/03/21/comparing-large-language-models-gpt-vs-bert-vs-t5/">Comparing Large Language Models: GPT vs. BERT vs. T5</a><em>, showing encoder-decoder, decoder-only, and encoder-only designs.</em></p>
<h3 id="heading-transformer-vs-bert-vs-gpt-key-differences">Transformer vs BERT vs GPT: Key Differences</h3>
<table style="min-width:100px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Aspect</strong></p></td><td><p><strong>Transformer (Original)</strong></p></td><td><p><strong>BERT</strong></p></td><td><p><strong>GPT</strong></p></td></tr><tr><td><p><strong>Paper</strong></p></td><td><p>Attention Is All You Need (2017)</p></td><td><p>BERT (2018)</p></td><td><p>GPT (2018–2019)</p></td></tr><tr><td><p><strong>Architecture Type</strong></p></td><td><p>Encoder + Decoder</p></td><td><p>Encoder-only</p></td><td><p>Decoder-only</p></td></tr><tr><td><p><strong>Primary Goal</strong></p></td><td><p>Sequence-to-sequence tasks (for example, translation)</p></td><td><p>Language understanding</p></td><td><p>Language generation</p></td></tr><tr><td><p><strong>Training Objective</strong></p></td><td><p>Predict next token (seq2seq setup)</p></td><td><p>Masked language modeling (fill in blanks)</p></td><td><p>Predict next token (autoregressive)</p></td></tr><tr><td><p><strong>Directionality</strong></p></td><td><p>Bidirectional (encoder) + left-to-right (decoder)</p></td><td><p>Fully bidirectional</p></td><td><p>Left-to-right only</p></td></tr><tr><td><p><strong>Context Understanding</strong></p></td><td><p>Strong (via attention)</p></td><td><p>Very strong (full bidirectional context)</p></td><td><p>Strong (but only past context)</p></td></tr><tr><td><p><strong>Input/Output Style</strong></p></td><td><p>Input → Output sequence</p></td><td><p>Input → Representation</p></td><td><p>Input → Generated text</p></td></tr><tr><td><p><strong>Fine-tuning</strong></p></td><td><p>Required for each task</p></td><td><p>Required for each task</p></td><td><p>Optional (GPT-2+ supports zero-shot)</p></td></tr><tr><td><p><strong>Typical Tasks</strong></p></td><td><p>Translation, summarization</p></td><td><p>Classification, QA, NLI</p></td><td><p>Text generation, QA, 
chat</p></td></tr><tr><td><p><strong>Strength</strong></p></td><td><p>Flexible architecture foundation</p></td><td><p>Deep understanding of text</p></td><td><p>General-purpose generation</p></td></tr><tr><td><p><strong>Limitation</strong></p></td><td><p>Not directly usable without adaptation</p></td><td><p>Cannot generate text naturally</p></td><td><p>Limited bidirectional context</p></td></tr><tr><td><p><strong>Key Innovation</strong></p></td><td><p>Self-attention mechanism</p></td><td><p>Deep bidirectional encoding</p></td><td><p>Scaled generative pre-training</p></td></tr><tr><td><p><strong>Evolution Role</strong></p></td><td><p>Foundation of all modern LLMs</p></td><td><p>Specialized understanding models</p></td><td><p>Path to general-purpose AI</p></td></tr></tbody></table>

<h2 id="heading-model-architecture">Model Architecture</h2>
<p>To support this pre-training and fine-tuning approach, the GPT-1 model is built on a Transformer (decoder) architecture.</p>
<p>According to the authors, this choice is important for a few reasons. Unlike older models such as LSTMs, Transformers handle long-range dependencies more effectively, meaning they can better understand relationships between words that are far apart in a sentence.</p>
<p>They also rely on self-attention, a mechanism that allows the model to focus on the most relevant parts of the text when processing each word. This helps the model capture context more accurately.</p>
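<p>Here's a minimal sketch of the masked (left-to-right) self-attention used in decoder-style models, assuming NumPy. It's heavily simplified: a single head, random toy embeddings, and no learned query/key/value projections, so it shows the mechanism rather than the paper's actual layer.</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))  # one toy embedding per token

# Simplification: queries, keys, and values are the raw embeddings (no learned projections).
queries, keys, values = x, x, x

# Scaled dot-product scores: how strongly each token matches every other token.
scores = queries @ keys.T / np.sqrt(d_model)

# Causal mask: -inf above the diagonal so a token never attends to later positions.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
scores = scores + mask

# Row-wise softmax turns scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)

output = weights @ values  # each row mixes the current token with earlier ones
print(np.round(weights, 2))
</code></pre>
<p>The printed matrix is lower-triangular: the first token can only attend to itself, while the last token spreads its attention over the whole sequence. That triangular pattern is what lets the model be trained to predict the next word without peeking ahead.</p>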
<p>Another key advantage is that Transformers make transfer learning more effective, since the same learned representations can be reused across different tasks with minimal changes.</p>
<p>The paper highlights that, in these transfer learning scenarios, Transformers outperform LSTM-based models.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/59df10f6-d843-4db7-9def-e302594d0b7e.png" alt="Figure 1 from the GPT-1 paper: the Transformer architecture and task-specific input transformations" style="display:block;margin:0 auto" width="1793" height="831" loading="lazy">

<p><em>Figure 1 from</em> “Improving Language Understanding by Generative Pre-Training” <em>(Radford et al., 2018), showing the Transformer architecture and task-specific input transformations.</em></p>
<h2 id="heading-key-techniques">Key Techniques</h2>
<p>Along with the main approach, the authors introduce a few practical techniques that make the model more flexible across tasks.</p>
<p>According to the paper, different tasks are handled by converting them into text-based formats, so they can all be processed in a similar way. This makes it easier to use the same model across multiple problems without redesigning it each time.</p>
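<p>The paper's Figure 1 lays out these conversions: each task's inputs are packed into one token sequence with a start token, a delimiter between text segments, and an extract token at the end. A rough sketch of that packing is shown here. The bracketed token strings and function names are placeholders invented for this example, and whitespace splitting stands in for the real tokenizer.</p>
<pre><code class="language-python"># Placeholder special tokens (hypothetical names; the real model learns embeddings for them).
START, DELIM, EXTRACT = "[START]", "[DELIM]", "[EXTRACT]"


def classification_input(text):
    """Single text: start, tokens, extract."""
    return [START, *text.split(), EXTRACT]


def pair_input(first, second):
    """Two segments joined by a delimiter, as used for entailment (premise, hypothesis)."""
    return [START, *first.split(), DELIM, *second.split(), EXTRACT]


def similarity_inputs(text_a, text_b):
    """No natural ordering between the two texts, so both orderings are encoded."""
    return [pair_input(text_a, text_b), pair_input(text_b, text_a)]


def multiple_choice_inputs(context, answers):
    """One sequence per candidate answer; the model scores each one."""
    return [pair_input(context, answer) for answer in answers]


print(pair_input("a man is sleeping", "a person rests"))
</code></pre>
<p>Because every task ends up as a plain token sequence, the same pre-trained network can read all of them. Only a small output layer on top of the final token's representation changes from task to task.</p>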
<p>Another important point is that the model requires only minimal architectural changes when switching between tasks. Most of the knowledge learned during pre-training is reused as-is.</p>
<p>The authors also include an auxiliary language modeling objective during fine-tuning, which helps the model retain its general understanding of language while adapting to specific tasks.</p>
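<p>In the paper, this combined fine-tuning objective is written as <code>L3 = L2 + λ · L1</code>, where <code>L2</code> is the supervised task objective, <code>L1</code> is the language modeling objective, and the weight λ is reported as 0.5. The sketch below expresses the same idea as a loss to minimize. The function name and the numbers passed in are made up for illustration.</p>
<pre><code class="language-python">def combined_loss(task_loss, lm_loss, lm_weight=0.5):
    """Fine-tuning loss: supervised task loss plus a weighted language modeling loss."""
    return task_loss + lm_weight * lm_loss


# Hypothetical per-batch values, for illustration only.
total = combined_loss(task_loss=0.70, lm_loss=3.20)
print(f"combined loss = {total:.2f}")
</code></pre>
<p>Setting <code>lm_weight</code> to 0 would recover plain supervised fine-tuning. A positive weight keeps some pressure on the model to stay good at predicting the next word while it learns the task.</p>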
<h2 id="heading-key-findings">Key Findings</h2>
<p>After training and evaluation, the results were surprisingly strong for a single general-purpose model.</p>
<p>According to the authors, the model outperformed state-of-the-art systems in 9 out of 12 tasks. It also showed clear improvements, including absolute gains of 8.9% on commonsense reasoning (Stories Cloze Test) and 5.7% on question answering (RACE).</p>
<p>Another important observation is that the model performed well across datasets of different sizes, although performance was weaker on some smaller datasets.</p>
<p>This suggests that the pre-training step helped it generalize better, even when labeled data was limited.</p>
<p>In practice, what makes these results significant is that a single model was able to compete with specialized systems that were specifically designed for each individual task.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/14e5a9dd-9919-4b2a-ad42-6b011770b7fe.png" alt="Figure 2 from the GPT-1 paper: performance gains from layer transfer and zero-shot behavior" style="display:block;margin:0 auto" width="1866" height="815" loading="lazy">

<p><em>Figure 2 from</em> “Improving Language Understanding by Generative Pre-Training” <em>(Radford et al., 2018), illustrating performance gains from layer transfer and zero-shot learning behavior.</em></p>
<h2 id="heading-conclusions">Conclusions</h2>
<p>To wrap things up, this paper introduced a major shift in how AI systems are built.</p>
<p>According to the authors, instead of training a new model from scratch for every task, we can first teach a model the structure of language through pre-training, and then adapt it to specific tasks through fine-tuning. This simple idea turns out to be highly effective.</p>
<p>The key takeaway is that language models can develop a general understanding of text, especially when combined with Transformer architectures and large-scale data. This makes transfer learning practical across many different tasks.</p>
<p>In my view, this is what makes the paper so impactful. It doesn’t just improve performance on a few benchmarks. It changes the overall approach to building AI systems.</p>
<p>This idea later became the foundation for models like GPT-2, GPT-3, and ChatGPT, and continues to shape modern large language models today.</p>
<h2 id="heading-limitations">Limitations</h2>
<p>Like any approach, this method comes with its own limitations.</p>
<p>According to the paper, one of the main challenges is the need for large amounts of unlabeled data during the pre-training stage, which may not always be easy to get. The model’s performance also depends heavily on how well the fine-tuning step is done.</p>
<p>The authors also note that multi-task learning was not fully explored in this work, leaving some open questions about how well the model can handle multiple tasks at the same time.</p>
<p>In practice, another limitation is that performance can be weaker when working with very small datasets, especially if the fine-tuning process is not carefully handled.</p>
<h2 id="heading-related-work-amp-context">Related Work &amp; Context</h2>
<p>To better understand where this paper fits, it helps to look at the ideas it builds on.</p>
<p>According to the authors, earlier approaches such as word embeddings (like Word2Vec and GloVe), LSTM-based language models, and semi-supervised learning had already made progress in understanding language. But these methods were often limited to learning representations at the word level or required more task-specific design.</p>
<p>What this paper does differently is move beyond that. Instead of focusing only on individual words, it learns broader language representations that capture context and meaning across entire sequences. This shift is what enables the model to generalize better across different tasks.</p>
<h2 id="heading-final-insight">Final Insight</h2>
<p>If there’s one idea to take away from this paper, it’s this: you don’t need to teach an AI system every task separately.</p>
<p>According to the authors, once a model learns the structure of language, it can adapt to a wide range of tasks with minimal changes. That shift – from task-specific models to general language understanding – is what makes this work so important.</p>
<p>In my view, this is the moment where things really changed. What started here with GPT-1 became the foundation for the systems we use today, including ChatGPT and other modern language models.</p>
<h2 id="heading-resources">Resources</h2>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD/Pytorch-Collections/tree/main/GPT">PyTorch Projects for GPT series</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1301.3781">Word2Vec (Mikolov et al., 2013)</a></p>
</li>
<li><p><a href="https://aclanthology.org/D14-1162.pdf">GloVe (Pennington et al., 2014)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1706.03762">Attention Is All You Need (Vaswani et al., 2017)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1511.01432">Semi-supervised Sequence Learning (Dai and Le, 2015)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1801.06146">Universal Language Model Fine-tuning for Text Classification (Howard and Ruder, 2018)</a></p>
</li>
<li><p><a href="https://aclanthology.org/N18-1202.pdf">Deep Contextualized Word Representations (Peters et al., 2018)</a></p>
</li>
<li><p><a href="https://aclanthology.org/P17-1194.pdf">Semi-supervised Multitask Learning for Sequence Labeling (Rei, 2017)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1506.06726">Skip-Thought Vectors (Kiros et al., 2015)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1705.02364">Supervised Learning of Universal Sentence Representations (Conneau et al., 2017)</a></p>
</li>
</ul>
<h3 id="heading-contact-me">Contact Me</h3>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD"><strong>GitHub</strong></a></p>
</li>
<li><p><a href="https://x.com/programmingoce"><strong>X</strong></a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/mohammed-abrah-6435a63ba/"><strong>LinkedIn</strong></a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Data Science Insights: Why the Mean Lies When Handling Messy Retail Data ]]>
                </title>
                <description>
                    <![CDATA[ In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on. Let's take the case of a retail shop. If we're looking at the average order value to u ]]>
                </description>
                <link>https://www.freecodecamp.org/news/data-science-insights-why-the-mean-lies-when-handling-messy-retail-data/</link>
                <guid isPermaLink="false">69fa21e5a386d7f121b5fe8c</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ statistics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rakshath Naik ]]>
                </dc:creator>
                <pubDate>Tue, 05 May 2026 16:59:17 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4441dcfc-d100-4613-9937-9c62449c6780.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on.</p>
<p>Let's take the case of a retail shop. If we're looking at the average order value to understand customer spending, we'd load the data, run the code, and get a result of $20 per order.</p>
<p>Done.</p>
<p>Except something looks odd.</p>
<p>When we take a closer look, we see that most customers are buying items worth $8 to $15. So where's $20 coming from?</p>
<p>In that case, the problem isn’t the data – it’s the average. This is the classic textbook trap: the mean behaves perfectly on tidy textbook examples, but real-world data doesn’t behave so nicely.</p>
<p>Some customers buy in bulk (very large orders), some return orders (negative quantities), and a few anomalies distort the entire picture.</p>
<p>In this article, we'll use the Online Retail Dataset to answer a simple but tricky question: What does “average” really mean in the real world?</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-dataset">The Dataset</a></p>
</li>
<li><p><a href="#heading-mean-the-sensitive-giant">Mean: The Sensitive Giant</a></p>
</li>
<li><p><a href="#heading-median-the-robust-middle">Median: The Robust Middle</a></p>
</li>
<li><p><a href="#heading-beyond-averages-understanding-spread-with-quartiles">Beyond Averages: Understanding Spread with Quartiles</a></p>
</li>
<li><p><a href="#heading-applying-iqr-to-our-dataset">Applying IQR to Our Dataset</a></p>
</li>
<li><p><a href="#heading-final-comparison-and-insights">Final Comparison and Insights</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-connect-with-me">Connect with me</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along here, you'll need:</p>
<p><strong>Basic Python knowledge:</strong> Understanding of variables and functions.</p>
<p><strong>The Pandas library:</strong> Familiarity with loading data and basic DataFrame operations.</p>
<p><strong>A development environment:</strong> Access to a tool like Jupyter Notebook, VS Code, or Google Colab.</p>
<p><strong>A Dataset:</strong> For this analysis, I used the Online Retail Dataset, which is available for download <a href="https://archive.ics.uci.edu/dataset/352/online+retail">here</a>.</p>
<h2 id="heading-the-dataset">The Dataset</h2>
<p>We'll work with the Online Retail Dataset, a real-world transactional dataset containing purchase records from a UK-based online retail store.</p>
<ol>
<li><p><strong>Source:</strong> UCI Machine Learning Repository</p>
</li>
<li><p><strong>Collected by:</strong> UK-based online retail company (2010–2011)</p>
</li>
<li><p><strong>Size:</strong> 541,909 transactions</p>
</li>
<li><p><strong>Features:</strong> 8 attributes (InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country)</p>
</li>
<li><p><strong>Ownership:</strong> Public dataset hosted by UCI</p>
</li>
<li><p><strong>License:</strong> Open for research and educational use</p>
</li>
</ol>
<h2 id="heading-mean-the-sensitive-giant">Mean: The Sensitive Giant</h2>
<p>In statistics and data analysis, the terms "<strong>average</strong>" and "<strong>arithmetic mean</strong>" are often used interchangeably. Here, we want the mean total price per transaction. For the Online Retail Dataset, that's:</p>
<p>$$\text{Average Order Value} = \frac{\text{Sum of all TotalPrice values}}{\text{Number of transactions}}$$</p>
<p>In our dataset, the mean is calculated by summing all transaction values (including bulk purchases and returns) and dividing by the total number of transactions. This means every value, irrespective of unusually high or any negative values, directly influences the final average.</p>
<pre><code class="language-python">import pandas as pd

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
df = pd.read_excel(url, engine='openpyxl')

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate the Mean (Average Order Value)
mean_value = df['TotalPrice'].mean()
print(f"Average Order Value (Mean): {mean_value:.2f}")
</code></pre>
<p>The results are as follows:</p>
<pre><code class="language-python">Average Order Value (Mean): 20.40
</code></pre>
<p>At first glance, the result may look reasonable: every transaction contributes equally. But that's where the problem lies. A few extremely high or low transactions can pull the mean away from the range where most customers actually spend.</p>
<p>Take a look at the graph for the mean below.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/583bebff-0e5e-44b8-80cb-48e4662b9abf.png" alt="The graph shows the calculated mean for the Online Retail Dataset, where we get a mean of 20.40" style="display:block;margin:0 auto" width="876" height="547" loading="lazy">

<p>The graph shows the mean Total Price for the Online Retail Dataset. We get a mean of 20.40. (Image by Author)</p>
<p>The graph shows <strong>a right-skewed distribution</strong> where the calculated mean of 20.40 is actually a textbook trap. The tallest bar clearly shows that the majority of transactions lie in the $8 to $15 range, but the <strong>red line</strong> is being dragged to the right by the <strong>long tail</strong> of high-value bulk orders by some customers.</p>
<p>In this scenario, the average price is well above what a typical customer actually spends because it's highly sensitive to outliers – and in reality, the bulk of the data lives in the lower price range.</p>
<p>In simple words, the mean is being pulled by some extreme values to the right, especially by some lying in the range of 200–300, which is noticeable in the graph.</p>
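<p>To see this sensitivity in isolation, here is a minimal sketch on made-up numbers. The order values below are hypothetical and aren't taken from the Online Retail Dataset. Adding a single bulk order to five ordinary ones is enough to move the mean far away from where most values sit:</p>
<pre><code class="language-python">import pandas as pd

# Hypothetical order values, for illustration only
typical_orders = pd.Series([8.0, 10.0, 11.0, 12.0, 14.0])
with_bulk_order = pd.concat([typical_orders, pd.Series([300.0])], ignore_index=True)

mean_before = typical_orders.mean()
mean_after = with_bulk_order.mean()

print(f"Mean without the bulk order: {mean_before:.2f}")
print(f"Mean with the bulk order: {mean_after:.2f}")
</code></pre>
<p>In this toy sample the mean jumps from 11.00 to 59.17, even though five of the six orders are still between 8 and 14.</p>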
<h2 id="heading-median-the-robust-middle">Median: The Robust Middle</h2>
<p>When the mean is distorted by extreme values, we need a metric that remains unaffected by such outliers. This is where the median comes into play.</p>
<p>Median is defined as the <strong>middle value after sorting the data.</strong></p>
<p>In our dataset, we sort all the transactions and pick the middle one.</p>
<p>The formula for calculating the median is:</p>
<p>$$\text{Median} = \begin{cases} X_{\left[ \frac{n+1}{2} \right]} &amp; \text{if } n \text{ is odd} \\ \frac{X_{\left[ \frac{n}{2} \right]} + X_{\left[ \frac{n}{2} + 1 \right]}}{2} &amp; \text{if } n \text{ is even} \end{cases}$$</p>
<p>Unlike the mean, the median doesn't depend on extreme values, and it cares only about the position of the data, not the magnitude.</p>
<pre><code class="language-python"># Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate only the Median
median_value = df['TotalPrice'].median()
print(f"Typical Order Value (Median): {median_value:.2f}")
</code></pre>
<p>The results are as follows:</p>
<pre><code class="language-python">Typical Order Value (Median): 11.10
</code></pre>
<p>Now you'll notice that the result lies in the $8 to $15 range, where most of the transactions lie.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/d89a4912-0e44-485e-8ea0-ff559cea6eba.png" alt="The figure demonstrates the graph for the median, where we get an accurate value of the transactions by the customers." style="display:block;margin:0 auto" width="876" height="547" loading="lazy">

<p>The figure demonstrates the graph for the median, where we get an accurate value of the transactions by the customers. (Image by Author)</p>
<p>In the previous graph, the mean was pulled to the right by large orders, but the median just asks what the middle customer spends. So even if someone spends $300 or some transactions are negative, the median stays stable.</p>
<p>In the above figure <strong>the median graph</strong> accurately highlights the range where most of the customers lie.</p>
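<p>The odd/even rule in the median formula can be sketched in a few lines of plain Python. The helper name <code>median_by_position</code> and the sample values are hypothetical, chosen only to include one return and one bulk order. The standard library's <code>statistics.median</code> is printed alongside as a cross-check:</p>
<pre><code class="language-python">import statistics

def median_by_position(values):
    """Return the median using the odd/even positional rule."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the single middle element (position (n + 1) / 2, 1-indexed)
        return ordered[middle]
    # Even n: average of positions n / 2 and n / 2 + 1 (1-indexed)
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical order values, including one return and one bulk order
odd_sample = [12.0, -5.0, 8.0, 300.0, 10.0]
even_sample = [12.0, -5.0, 8.0, 300.0, 10.0, 11.0]

odd_median = median_by_position(odd_sample)
even_median = median_by_position(even_sample)
print(odd_median, even_median)
print(statistics.median(odd_sample), statistics.median(even_sample))
</code></pre>
<p>Both samples give a median near 10 despite the 300 and the -5, because only the position of the middle values matters.</p>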
<h2 id="heading-beyond-averages-understanding-spread-with-quartiles"><strong>Beyond Averages: Understanding Spread with Quartiles</strong></h2>
<p>So far, we've studied the median, but knowing the center is not enough.</p>
<p>To truly understand customer spending, we need to know how the data is spread, and this is where quartiles come into play.</p>
<p>Quartiles divide the dataset into the following parts:</p>
<ol>
<li><p><strong>Q1 (25th percentile):</strong> 25% of transactions are below this.</p>
</li>
<li><p><strong>Q2 (50th percentile):</strong> Median</p>
</li>
<li><p><strong>Q3 (75th percentile):</strong> 75% of transactions are below this.</p>
</li>
</ol>
<p>The distance between Q3 and Q1 is called the Interquartile Range (IQR):</p>
<p>$$IQR = Q_3 - Q_1$$</p>
<h3 id="heading-the-iqr-detecting-outliers"><strong>The IQR: Detecting Outliers</strong></h3>
<p>The IQR measures the spread of the middle 50%.</p>
<p>If the IQR is small, then the data is concentrated. If it's large, the data is spread out. The IQR also helps us identify outliers mathematically.</p>
<p>Outlier Rule:</p>
<ol>
<li><p><strong>Lower Bound = Q1 - 1.5 * IQR</strong></p>
</li>
<li><p><strong>Upper Bound = Q3 + 1.5 * IQR</strong></p>
</li>
</ol>
<h4 id="heading-a-simple-example-to-understand-iqr">A Simple Example to Understand IQR</h4>
<p>Consider the following transaction values:</p>
<p>$$\left[ 5, 8, 10, 12, 15, 18, 20 \right]$$</p>
<h4 id="heading-step-1-find-the-median-q2">Step 1: Find the Median (Q2):</h4>
<p>The middle value is:</p>
<p>$$Q_2 = 12$$</p>
<h4 id="heading-step-2-find-q1-lower-quartile">Step 2: Find Q1 (Lower Quartile):</h4>
<p>The lower half is [5, 8, 10]. The median of the lower half is:</p>
<p>$$Q_1 = 8$$</p>
<h4 id="heading-step-3-find-q3-upper-quartile">Step 3: Find Q3 (Upper Quartile):</h4>
<p>The upper half is [15, 18, 20]. The median of the upper half is:</p>
<p>$$Q_3 = 18$$</p>
<h4 id="heading-step-4-calculate-iqr">Step 4: Calculate IQR:</h4>
<p>$$IQR = Q_3 - Q_1 = 18 - 8 = 10$$</p>
<h4 id="heading-step-5-find-outlier-bounds">Step 5: Find Outlier Bounds:</h4>
<p>$$\begin{aligned} \text{Lower Bound} &amp;= Q_1 - 1.5 \times IQR = 8 - 15 = -7 \\ \text{Upper Bound} &amp;= Q_3 + 1.5 \times IQR = 18 + 15 = 33 \end{aligned}$$</p>
<p>Any value <strong>below -7 or above 33</strong> is an outlier (but in this demo problem, no outliers exist).</p>
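<p>The five steps above can be reproduced in code. This is a minimal sketch of the median-of-halves method used in the worked example, written with the standard library. One caveat worth knowing: pandas' <code>quantile</code> interpolates linearly by default, so on a sample this small it returns slightly different quartiles than the hand calculation.</p>
<pre><code class="language-python">import statistics
import pandas as pd

values = [5, 8, 10, 12, 15, 18, 20]
ordered = sorted(values)
half = len(ordered) // 2

# Median-of-halves: for an odd count, the middle value is left out of both halves
q1 = statistics.median(ordered[:half])
q3 = statistics.median(ordered[-half:])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(q1, q3, iqr, lower_bound, upper_bound)

# pandas interpolates linearly between neighbouring values by default
series = pd.Series(values)
pandas_q1 = series.quantile(0.25)
pandas_q3 = series.quantile(0.75)
print(pandas_q1, pandas_q3)
</code></pre>
<p>The hand method gives Q1 = 8 and Q3 = 18, while pandas reports 9.0 and 16.5 for the same seven values. On a dataset with hundreds of thousands of rows the two conventions give nearly identical results, but the difference is visible on tiny samples like this one.</p>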
<h2 id="heading-applying-iqr-to-our-dataset"><strong>Applying IQR to Our Dataset</strong></h2>
<p>In our retail dataset, instead of neat values, we have bulk values and even negative returns.</p>
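<pre><code class="language-python"># 1. Calculate IQR and Bounds
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# 2. Count the transactions that fall outside the bounds
outliers = df[(df['TotalPrice'] &lt; lower_bound) | (df['TotalPrice'] &gt; upper_bound)]

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
print(f"Number of Outliers: {len(outliers)}")
</code></pre>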
<p>When we calculate IQR for our dataset, we get:</p>
<pre><code class="language-python">Lower Bound: -18.75
Upper Bound: 42.45
Number of Outliers: 33180
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/e528db9b-57f9-4ee4-b331-143c2b1947fb.png" alt="The figure demonstrates the outlier range for our dataset" style="display:block;margin:0 auto" width="1036" height="547" loading="lazy">

<p>The graph demonstrates outliers, which are any values falling outside the range of -18.75 to 42.45. (Image by Author)</p>
<p>As the graph shows, the values outside the range -18.75 to 42.45 are considered outliers. These values will be removed.</p>
<h3 id="heading-revisiting-the-mean-after-removing-outliers">Revisiting the Mean After Removing Outliers</h3>
<p>Using the IQR method, we've removed extreme transactions that fell outside the typical spending range.</p>
<pre><code class="language-python"># Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Original Mean
mean_value = df['TotalPrice'].mean()
print(f"Original Mean: {mean_value:.2f}")

# IQR Calculation
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")

# Remove Outliers
df_no_outliers = df[(df['TotalPrice'] &gt;= lower_bound) &amp; (df['TotalPrice'] &lt;= upper_bound)]

# New Mean after removing outliers
new_mean = df_no_outliers['TotalPrice'].mean()
print(f"Mean after removing outliers: {new_mean:.2f}")
</code></pre>
<p>After recomputing, we get:</p>
<pre><code class="language-python">Original Mean: 20.40
Lower Bound: -18.75
Upper Bound: 42.45
Mean after removing outliers: 11.63
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/17e6c2d0-883f-4e48-b45b-d1bf93164c63.png" alt="The graph demonstrates that the mean improves significantly after all outliers are removed. (Image by Author)" style="display:block;margin:0 auto" width="876" height="547" loading="lazy">

<p>Removing outliers significantly shifts the mean toward the region where most transactions occur. We now have a much better mean of 11.63 as opposed to the right-stretched mean of 20.40 we got with outliers.</p>
<h2 id="heading-final-comparison-and-insights"><strong>Final Comparison and Insights</strong></h2>
<p>Looking at the results from all the graphs, we get a complete understanding of the dataset. The original mean was 20.40, which is significantly higher than the value of most transactions. The mean was pulled upward by a small number of high-value transactions and was distorted by these outliers.</p>
<p>The median, on the other hand, was 11.10, which lies within the range where most transactions are concentrated. This shows that the median is a much better representation of what a typical customer spends, as it's not affected by extreme values.</p>
<p>After removing the outliers using the IQR, the mean dropped to 11.63, bringing it very close to the median. This confirms that the earlier mean was not inherently wrong, but was simply influenced by extreme values in the data. Once those values were handled, the mean became a much more reliable measure of central tendency.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The results show that the mean can be misleading when data contains outliers. In our dataset, the original mean of 20.40 overstated customer spending, while the median (11.10) gave a more realistic picture. After removing outliers, the mean shifted to 11.63, aligning closely with the median.</p>
<p>This highlights a key lesson: <strong>The mean isn't wrong, but it must be used with an understanding of the data.</strong></p>
<p>Choosing the right measure of average depends on the dataset, and in messy real-world scenarios, the median or a cleaned mean often tells the true story.</p>
<h2 id="heading-connect-with-me"><strong>Connect with me</strong></h2>
<ol>
<li><p><a href="https://medium.com/@rakshathnaik62">Medium</a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/rakshath-/">LinkedIn</a></p>
</li>
</ol>
<p>If you want to dive deeper, you can visit: <a href="https://qubrica.com/mean-median-mode-python-guide/"><strong>Mean vs Median vs Mode: Understanding Central Tendency in Data Analysis</strong></a><strong>.</strong></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Product Experimentation with Propensity Scores: Causal Inference for LLM-Based Features in Python ]]>
                </title>
                <description>
                    <![CDATA[ Every product experimentation team running causal inference on LLM-based features eventually hits the same wall: when users click "Try our AI assistant," the volunteers aren't a random sample. Your pr ]]>
                </description>
                <link>https://www.freecodecamp.org/news/product-experimentation-with-propensity-scores-causal-inference-for-llm-based-features-in-python/</link>
                <guid isPermaLink="false">69f3df46909e64ad07425413</guid>
                
                    <category>
                        <![CDATA[ product experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ causal inference ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ propensity-score-matching ]]>
                    </category>
                
                    <category>
                        <![CDATA[ experimentation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rudrendu Paul ]]>
                </dc:creator>
                <pubDate>Thu, 30 Apr 2026 23:01:26 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/6a8936be-7f43-4977-9baf-6021dc892b2d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every product experimentation team running causal inference on LLM-based features eventually hits the same wall: when users click "Try our AI assistant," the volunteers aren't a random sample.</p>
<p>Your product shipped a new agent mode last quarter. Users have to tap the "Try agent mode" toggle to enable it. The dashboard numbers look stunning: agent-mode users complete 21 percentage points more tasks than non-users. The CPO calls it the best feature launch of the year.</p>
<p>But you know something's off. Heavy-engagement users opt into new features constantly, while light users ignore toggles entirely. That 21-point gap measures the agent's effect combined with the pre-existing gap between power users and the rest of your base.</p>
<p>This is the Opt-In Trap. It shows up in every generative AI product that ships features behind a user-controlled toggle: "Try our AI assistant," "Enable smart replies," "Turn on code suggestions." Users who click to opt in differ systematically from those who scroll past. Any naïve comparison between the two groups collapses the feature's causal effect into whatever made those users opt in in the first place.</p>
<p>Running an AI feature behind a toggle is a product experiment. The hypothesis: the feature improves outcomes for users who adopt it.</p>
<p>Unlike an A/B test, where the coin flip creates two otherwise-identical populations, the toggle creates two populations that differ before they even make a choice. That pre-existing difference is the measurement problem, and a t-test on dashboard numbers can't fix it.</p>
<p>Propensity score methods are statistical tools that data scientists use to separate adoption bias from the feature's actual effect. They reweight (or rematch) your comparison so that opted-in and non-opted-in groups look comparable on observable characteristics, approximating what a randomized experiment would have given you.</p>
<p>This tutorial walks through the full pipeline (propensity estimation, inverse-probability weighting, nearest-neighbor matching, balance diagnostics, and bootstrap confidence intervals) on a 50,000-user synthetic SaaS dataset where the ground-truth causal effect is known. You'll estimate it, quantify uncertainty, and see where the approach silently breaks.</p>
<p><strong>Companion code:</strong> every code block runs end-to-end in the companion notebook at <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/02_propensity_opt_in">github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/02_propensity_opt_in</a>. The notebook (<code>psm_demo.ipynb</code>) has all outputs pre-executed, so you can read along on GitHub before running anything locally.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-why-opt-in-features-break-naive-comparisons">Why Opt-in Features Break Naïve Comparisons</a></p>
</li>
<li><p><a href="#heading-what-propensity-scores-actually-do">What Propensity Scores Actually Do</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-setting-up-the-working-example">Setting Up the Working Example</a></p>
</li>
<li><p><a href="#heading-step-1-estimate-the-propensity-score">Step 1: Estimate the Propensity Score</a></p>
</li>
<li><p><a href="#heading-step-2-inverse-probability-weighting">Step 2: Inverse-Probability Weighting</a></p>
</li>
<li><p><a href="#heading-step-3-nearest-neighbor-matching">Step 3: Nearest-Neighbor Matching</a></p>
</li>
<li><p><a href="#heading-step-4-check-covariate-balance">Step 4: Check Covariate Balance</a></p>
</li>
<li><p><a href="#heading-step-5-bootstrap-confidence-intervals">Step 5: Bootstrap Confidence Intervals</a></p>
</li>
<li><p><a href="#heading-when-propensity-score-methods-fail">When Propensity Score Methods Fail</a></p>
</li>
<li><p><a href="#heading-what-to-do-next">What to Do Next</a></p>
</li>
</ul>
<h2 id="heading-why-opt-in-features-break-naive-comparisons">Why Opt-in Features Break Naïve Comparisons</h2>
<p>The math of an A/B test is elegant because of one assumption: treatment is assigned independent of everything else. Flip a coin: half your users get agent mode, and the coin flip breaks every possible confound by construction. The opt-in world has no coin.</p>
<p>Three mechanisms make opt-in comparisons misleading.</p>
<h4 id="heading-1-selection-on-engagement">1. Selection on engagement</h4>
<p>Power users click everything. If your heavy-engagement cohort opts into agent mode at 65 percent and your light-engagement cohort opts in at 12 percent, you've stacked the opt-in group with users who were going to complete more tasks anyway.</p>
<p>That compositional imbalance accounts for most of the observed lift on its own, before the agent does any work.</p>
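<p>A back-of-the-envelope calculation makes the composition effect concrete. In the sketch below, only the 65 and 12 percent opt-in rates come from the text above. The equal tier sizes and the baseline completion rates are hypothetical assumptions, the medium tier is left out for simplicity, and the feature is given no effect at all:</p>
<pre><code class="language-python"># Opt-in rates from the article; everything else below is a hypothetical assumption
opt_in_rate = {"heavy": 0.65, "light": 0.12}
population_share = {"heavy": 0.5, "light": 0.5}       # assumed equal-sized tiers
baseline_completion = {"heavy": 0.70, "light": 0.30}  # assumed, with NO feature effect

def group_mean(opted_in):
    """Average completion rate in the opted-in (True) or opted-out (False) group."""
    weights = {}
    for tier, share in population_share.items():
        rate = opt_in_rate[tier] if opted_in else 1 - opt_in_rate[tier]
        weights[tier] = share * rate
    total = sum(weights.values())
    return sum(weights[t] * baseline_completion[t] for t in weights) / total

naive_gap = group_mean(True) - group_mean(False)
print(f"Naive gap with zero true effect: {naive_gap:+.4f}")
</code></pre>
<p>Under these assumptions the opted-in group completes roughly 22 percentage points more tasks than the opted-out group, purely because it contains a larger share of heavy users.</p>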
<h4 id="heading-2-selection-on-intent">2. Selection on intent</h4>
<p>Users who opt into a new feature often have a specific use case in mind. A developer who clicks "Try code suggestions" already has code to write. That user would have shown higher task completion even with the control UI.</p>
<h4 id="heading-3-selection-on-risk-tolerance">3. Selection on risk tolerance</h4>
<p>Early adopters tolerate rough edges. A user who clicks "Try beta" and sees slow latency sticks around, but a risk-averse user bounces.</p>
<p>Your opt-in group is enriched for people willing to put up with bad experiences, which affects every downstream metric you might measure.</p>
<p>All three produce the same symptom: a raw comparison of opted-in users against everyone else that can overstate the feature's causal effect by 2x or more, depending on how concentrated opt-in is among your heaviest users.</p>
<p>On the synthetic dataset in this tutorial, the naïve comparison inflates a true +8pp effect to +21pp, a 2.6x overshoot. Propensity score methods exist to correct this.</p>
<h2 id="heading-what-propensity-scores-actually-do">What Propensity Scores Actually Do</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/df8f4e49-98f3-4cd2-b4a8-f9b49d18f60a.png" alt="Schematic propensity score distributions for two hypothetical groups" style="display:block;margin:0 auto" width="1469" height="822" loading="lazy">

<p><em>Figure 1: Schematic propensity score distributions for two hypothetical groups. The opted-in group (red) skews toward higher propensities, while the non-opted-in group (blue) skews lower.</em></p>
<p>In the above figure, the bracketed strip below the x-axis splits the score range into three zones: a control-heavy region at low propensities where few treated users exist, a region of common support in the middle where both groups are well represented, and a treatment-heavy region at high propensities where few controls exist. Propensity score methods operate within the common-support region by reweighting or rematching so that the two groups appear balanced on observables. The extremes are either trimmed out or handled with caution.</p>
<p>The propensity score is the probability that a user opts in given their observable characteristics. Estimate this probability well, and you can use it to reweight your sample so that opted-in and non-opted-in users look similar on observables, just as they would have if opt-in had been randomized.</p>
<p>Two practical strategies use the propensity score:</p>
<ul>
<li><p><strong>Inverse-probability weighting (IPW)</strong> assigns each user a weight equal to the inverse of their probability of receiving the treatment they actually received. Opted-in users get weighted by 1/P(opt-in). Non-opted-in users get weighted by 1/P(no opt-in). After weighting, the two groups are balanced on observables, and the weighted difference in outcomes approximates the average treatment effect.</p>
</li>
<li><p><strong>Matching</strong> pairs each opted-in user with one or more non-opted-in users who have similar propensity scores. The average outcome difference between matched pairs estimates the average treatment effect on the treated (ATT): what opt-in users actually gained by opting in.</p>
</li>
</ul>
<p>Both methods rest on three identification assumptions working together.</p>
<ol>
<li><p>First, <strong>unconfoundedness</strong>: every variable that drives opt-in and affects the outcome is observed and included in your propensity model.</p>
</li>
<li><p>Second, <strong>overlap</strong> (also called positivity): every user has some nonzero probability of opting in and some nonzero probability of staying out.</p>
</li>
<li><p>Third, <strong>no interference</strong>: one user's opt-in decision does not affect another user's outcome (the stable-unit-treatment-value assumption, or SUTVA).</p>
</li>
</ol>
<p>Violate any one of these and the estimate is biased even when the other two hold. The failure modes at the end of this tutorial walk through each one.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You'll need Python 3.11 or newer, comfort with pandas and scikit-learn, and rough familiarity with logistic regression.</p>
<p>Install the packages for this tutorial:</p>
<pre><code class="language-shell">pip install numpy pandas scikit-learn matplotlib
</code></pre>
<p><strong>Here's what's happening:</strong> four packages cover the full pipeline. Pandas loads the data, NumPy handles weights and array arithmetic, scikit-learn fits the propensity model and runs nearest-neighbor matching, and matplotlib renders the overlap diagnostic.</p>
<p>Clone the companion repo to get the synthetic dataset:</p>
<pre><code class="language-shell">git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv
</code></pre>
<p><strong>Here's what's happening:</strong> the clone pulls the companion repo, and <code>generate_data.py</code> produces the shared synthetic dataset used across the series. Seed 42 keeps the dataset reproducible, and 50,000 users give clean signal for every estimator in this tutorial. The output CSV lands at <code>data/synthetic_llm_logs.csv</code>.</p>
<h2 id="heading-setting-up-the-working-example">Setting Up the Working Example</h2>
<p>The synthetic dataset simulates a SaaS product where users can opt into an agent mode that uses a more expensive model. With fifty thousand users, opt-in rates differ sharply by engagement tier: heavy users opt in at 65 percent, medium users at 35 percent, and light users at 12 percent.</p>
<p>The ground-truth causal effect baked into the data generator is +8 percentage points on task completion for users who opted in. The naïve comparison inflates this to around +21 percentage points because selection bias stacks the opted-in group with your most engaged users.</p>
<p>Knowing the ground truth is what lets you verify that your propensity score method recovers it.</p>
<p>Load the data and see the selection problem:</p>
<pre><code class="language-python">import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")

print(df.groupby("engagement_tier").opt_in_agent_mode.mean().round(3))

naive_effect = (
    df[df.opt_in_agent_mode == 1].task_completed.mean()
    - df[df.opt_in_agent_mode == 0].task_completed.mean()
)
print(f"\nNaive opt-in effect: {naive_effect:+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">engagement_tier
heavy     0.647
light     0.120
medium    0.353
Name: opt_in_agent_mode, dtype: float64

Naive opt-in effect: +0.2106
</code></pre>
<p><strong>Here's what's happening:</strong> you load 50,000 rows, group by engagement tier, and print the opt-in rate inside each group. Heavy users opt in far more than light users, which is the selection-on-engagement pattern baked into the data. The naïve effect lands at +0.2106 (21 percentage points), nearly three times the ground truth of +0.08. That gap is exactly what propensity score methods have to remove.</p>
<h2 id="heading-step-1-estimate-the-propensity-score">Step 1: Estimate the Propensity Score</h2>
<p>The propensity score is the output of a model that predicts opt-in from observable characteristics. Logistic regression is the right starting point because it's interpretable and fast, but watch the balance diagnostics in Step 4: if any weighted standardized mean difference (SMD) stays above 0.1, the logistic model is probably missing an interaction or nonlinearity, and gradient boosting is the next move.</p>
<p>For this dataset, the relevant observables are engagement tier and query confidence. In a real product, you'd include every variable you think drives opt-in: device type, tenure, plan tier, and historical usage patterns.</p>
<pre><code class="language-python">from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X = pd.get_dummies(
    df[["engagement_tier", "query_confidence"]],
    drop_first=True
).astype(float)
y_treat = df.opt_in_agent_mode

ps_model = LogisticRegression(max_iter=1000).fit(X, y_treat)
df["propensity"] = ps_model.predict_proba(X)[:, 1]

# Basic sanity checks
print(df.groupby("engagement_tier").propensity.mean().round(3))
print(
    f"\nPropensity range (treated):  "
    f"{df[df.opt_in_agent_mode == 1].propensity.min():.3f} - "
    f"{df[df.opt_in_agent_mode == 1].propensity.max():.3f}"
)
print(
    f"Propensity range (control):  "
    f"{df[df.opt_in_agent_mode == 0].propensity.min():.3f} - "
    f"{df[df.opt_in_agent_mode == 0].propensity.max():.3f}"
)
print(f"Propensity model AUC: {roc_auc_score(y_treat, df.propensity):.3f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">engagement_tier
heavy     0.646
light     0.120
medium    0.353
Name: propensity, dtype: float64

Propensity range (treated):  0.114 - 0.675
Propensity range (control):  0.114 - 0.673
Propensity model AUC: 0.744
</code></pre>
<p><strong>Here's what's happening:</strong> you encode the engagement tier as dummy variables, keep query confidence continuous, and fit a logistic regression model. The predicted probability from the model is each user's propensity score.</p>
<p>Scikit-learn <code>LogisticRegression</code> applies L2 regularization by default (<code>C=1.0</code>), which shrinks propensities slightly toward 0.5. For production use, you can set <code>penalty=None</code> if you want an unregularized fit.</p>
<p>Mean propensity inside each engagement tier recovers the true opt-in rate for that tier almost exactly, so the model is calibrated. The AUC of 0.744 confirms the model discriminates between opt-ins and non-opt-ins well above chance (0.5).</p>
<p>And the propensity ranges overlap between treated and control groups (both span roughly 0.11 to 0.67), which is the visual overlap condition.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/0ad957a6-1d24-4332-b033-aae6e91c4162.png" alt="Two views of the same positivity check on the real 50,000-user synthetic dataset." style="display:block;margin:0 auto" width="1283" height="942" loading="lazy">

<p><em>Figure 2: Two views of the same positivity check on the real 50,000-user synthetic dataset.</em></p>
<p>In the figure above, the top panel plots smooth kernel density curves of the fitted propensity scores for each group. The three peaks align with the three engagement tiers (light at p ≈ 0.12, medium at p ≈ 0.35, heavy at p ≈ 0.65), as expected, because the opt-in rate is tier-driven. The bottom panel translates that same distribution into raw counts per tier: every tier contains thousands of both opted-in and non-opted-in users, which is exactly what positivity requires.</p>
<p>Where Figure 1 schematically illustrated the idea, this figure shows that it holds for the data, so the weighting and matching that follow will have real counterfactuals to work with.</p>
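<p>When the two ranges don't line up this neatly, one simple convention is to trim to the region both groups cover. The sketch below is an illustration only and isn't taken from the companion notebook: the helper name <code>trim_to_common_support</code>, the column names, and the toy data are all hypothetical, and the min/max rule is just one of several ways to define common support.</p>
<pre><code class="language-python">import pandas as pd

def trim_to_common_support(frame, treat_col="treated", ps_col="propensity"):
    """Keep rows whose propensity lies in the range covered by BOTH groups."""
    treated_ps = frame.loc[frame[treat_col] == 1, ps_col]
    control_ps = frame.loc[frame[treat_col] == 0, ps_col]
    low = max(treated_ps.min(), control_ps.min())
    high = min(treated_ps.max(), control_ps.max())
    return frame[frame[ps_col].between(low, high)]

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "treated":    [0,    0,    0,    1,    1,    1],
    "propensity": [0.05, 0.30, 0.60, 0.25, 0.55, 0.95],
})
trimmed = trim_to_common_support(toy)
print(trimmed)
</code></pre>
<p>Here the control at 0.05 and the treated user at 0.95 are dropped because no one in the other group has a comparable score. On the tutorial's dataset the treated and control ranges nearly coincide, so this kind of trimming would remove almost nothing.</p>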
<h2 id="heading-step-2-inverse-probability-weighting">Step 2: Inverse-Probability Weighting</h2>
<p>IPW assigns each user a weight inversely proportional to their propensity. An opted-in user with a 0.12 propensity is rare (a light user who still opted in despite low engagement) and carries information about 1 / 0.12 ≈ 8 similar users in the population. A control user with a 0.12 propensity is the expected case for light users who stayed out, so they're common and get a weight of 1 / (1 - 0.12) ≈ 1.14.</p>
<pre><code class="language-python">import numpy as np

# ATE weights: 1/P(treat) for treated, 1/P(no treat) for control
df["ipw"] = np.where(
    df.opt_in_agent_mode == 1,
    1 / df.propensity,
    1 / (1 - df.propensity)
)

t = df[df.opt_in_agent_mode == 1]
c = df[df.opt_in_agent_mode == 0]
ate_ipw = (
    (t.task_completed * t.ipw).sum() / t.ipw.sum()
    - (c.task_completed * c.ipw).sum() / c.ipw.sum()
)
print(f"IPW average treatment effect (ATE): {ate_ipw:+.4f}")

# ATT: what opt-in users actually gained
df["ipw_att"] = np.where(
    df.opt_in_agent_mode == 1,
    1,
    df.propensity / (1 - df.propensity)
)
t = df[df.opt_in_agent_mode == 1]   # re-slice now that ipw_att is in df
c = df[df.opt_in_agent_mode == 0]
treated_mean = t.task_completed.mean()
control_w_mean = (c.task_completed * c.ipw_att).sum() / c.ipw_att.sum()
att_ipw = treated_mean - control_w_mean
print(f"IPW average treatment effect on treated (ATT): {att_ipw:+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">IPW average treatment effect (ATE): +0.0851
IPW average treatment effect on treated (ATT): +0.0770
</code></pre>
<p><strong>Here's what's happening:</strong> first, you compute ATE weights for every user and take the weighted difference in task completion between opted-in and non-opted-in groups. Then you compute ATT weights, which reweight only the control group to match the treated group's covariate distribution, and compute the average treatment effect on the treated.</p>
<p>ATE answers the population question: what's the effect on a random user who might or might not have opted in anyway? ATT answers the user question: what did opt-in users actually gain? On this dataset, ATE lands at +0.0851 and ATT at +0.0770, both close to the ground-truth +0.08 and a massive improvement over the naïve +0.2106.</p>
<p>The distinction matters in practice. Deciding whether to roll the feature out to users who haven't opted in calls for ATE. Reporting on the value opt-in users captured calls for ATT.</p>
<h2 id="heading-step-3-nearest-neighbor-matching">Step 3: Nearest-Neighbor Matching</h2>
<p>Matching takes a different approach: pair each opted-in user with the non-opted-in user whose propensity score is closest, then take the average outcome difference across matched pairs. The result estimates ATT.</p>
<pre><code class="language-python">from sklearn.neighbors import NearestNeighbors

treated_ps = df[df.opt_in_agent_mode == 1][["propensity"]].values
control_ps = df[df.opt_in_agent_mode == 0][["propensity"]].values

nn = NearestNeighbors(n_neighbors=1).fit(control_ps)
_, idx = nn.kneighbors(treated_ps)

treated_outcomes = df[df.opt_in_agent_mode == 1].task_completed.values
matched_control_outcomes = (
    df[df.opt_in_agent_mode == 0].task_completed.values[idx.flatten()]
)

att_match = (treated_outcomes - matched_control_outcomes).mean()
print(f"1-NN matching ATT: {att_match:+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">1-NN matching ATT: +0.0752
</code></pre>
<p><strong>Here's what's happening:</strong> you extract propensity scores for each group, fit a nearest-neighbor index on the control group, and find the single closest control user for every treated user.</p>
<p>The <code>NearestNeighbors</code> index allows the same control user to be selected as the match for multiple treated users, so this is a matching-with-replacement case.</p>
<p>You pull the outcomes for each treated user and their matched control, take the difference per pair, and average across pairs. The result estimates what opt-in users gained compared to very similar users who did not opt in.</p>
<p>The +0.0752 result lands close to the ground truth of +0.08 but slightly below the IPW ATT. Small gaps like this are typical of 1-NN matching, because a single nearest neighbor is a high-variance estimator.</p>
<p>Two variants are worth knowing. Matching with replacement (what you just ran) allows a single control user to serve as a match for multiple treated users, reducing bias when good matches are scarce but inflating variance.</p>
<p>Matching without replacement assigns each control user to at most one treated user, which keeps variance lower but forces poor-quality pairings when the treated group dwarfs the available controls.</p>
<p>For most production analyses, k-nearest-neighbor matching with k = 3-5 and replacement is a sensible default.</p>
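<p>As a rough illustration of the k-nearest-neighbor variant, here is a minimal standard-library sketch. The function name <code>knn_match_att</code> and the toy data are made up for this example and are not part of the tutorial's dataset or notebook. It matches each treated user to the k controls with the closest propensity, with replacement, and averages the outcome differences.</p>

```python
def knn_match_att(treated, controls, k=3):
    """ATT from k-nearest-neighbor matching on the propensity score.

    treated, controls: lists of (propensity, outcome) pairs.
    Matching is with replacement: a control can serve many treated users.
    """
    if k > len(controls):
        raise ValueError("k cannot exceed the number of control users")
    diffs = []
    for p_t, y_t in treated:
        # The k controls whose propensity is closest to this treated user's
        nearest = sorted(controls, key=lambda c: abs(c[0] - p_t))[:k]
        matched_mean = sum(y for _, y in nearest) / k
        diffs.append(y_t - matched_mean)
    return sum(diffs) / len(diffs)


# Toy data (made up for illustration): (propensity, task_completed)
toy_treated = [(0.30, 1), (0.70, 1)]
toy_controls = [(0.28, 0), (0.31, 1), (0.35, 0), (0.68, 1), (0.72, 0), (0.75, 1)]

att_k3 = knn_match_att(toy_treated, toy_controls, k=3)
print(f"3-NN matching ATT: {att_k3:+.4f}")
```

<p>Averaging over several neighbors smooths out the noise of any single match, at the cost of using controls that sit a little further away in propensity.</p>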
<h2 id="heading-step-4-check-covariate-balance">Step 4: Check Covariate Balance</h2>
<p>Propensity score methods work only if they actually balance the covariates between groups. You need to verify that they did, because if the balance fails, your estimate is wrong.</p>
<p>The standard diagnostic is the standardized mean difference (SMD) for each covariate. SMD compares the treated group mean to the control group mean, divided by the pooled standard deviation.</p>
<p>Before weighting, SMDs tell you how imbalanced the raw groups are. After weighting, they should be small (|SMD| &lt; 0.1 is the conventional cutoff).</p>
<pre><code class="language-python">def smd(treated_vals, control_vals, treated_w=None, control_w=None):
    """Standardized mean difference, optionally with weights."""
    if treated_w is None:
        treated_w = np.ones(len(treated_vals))
    if control_w is None:
        control_w = np.ones(len(control_vals))
    t_mean = np.average(treated_vals, weights=treated_w)
    c_mean = np.average(control_vals, weights=control_w)
    pooled_std = np.sqrt((treated_vals.var() + control_vals.var()) / 2)
    return (t_mean - c_mean) / pooled_std

engagement_heavy = (df.engagement_tier == "heavy").astype(float).values
qc = df.query_confidence.values
tr = (df.opt_in_agent_mode == 1).values

covariates = {
    "engagement_tier_heavy": engagement_heavy,
    "query_confidence": qc,
}

print(f"{'Covariate':&lt;30} {'Raw SMD':&gt;10} {'Weighted SMD':&gt;15}")
for name, vals in covariates.items():
    smd_raw = smd(vals[tr], vals[~tr])
    smd_weighted = smd(
        vals[tr], vals[~tr],
        treated_w=df[tr].ipw.values,
        control_w=df[~tr].ipw.values,
    )
    print(f"{name:&lt;30} {smd_raw:&gt;+10.3f} {smd_weighted:&gt;+15.3f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Covariate                         Raw SMD    Weighted SMD
engagement_tier_heavy              +0.742          +0.002
query_confidence                   -0.032          -0.003
</code></pre>
<p><strong>Here's what's happening:</strong> the helper computes the standardized mean difference for any covariate, with optional IPW weights.</p>
<p>You then print raw and weighted SMDs for each covariate. The raw SMD on <code>engagement_tier_heavy</code> is +0.742 (heavy users opt in far more than everyone else), and the weighted SMD drops to +0.002, a clean pass. Query confidence was already close to balanced on the raw data, and weighting keeps it that way. If any weighted SMD came back above 0.1 in absolute value, your propensity model would be missing something; the fix is usually richer features or interaction terms in the logistic regression.</p>
<p>Visually, Figure 2 above confirmed what the SMDs now confirm numerically: the overlap condition holds, and balance is achievable.</p>
<h2 id="heading-step-5-bootstrap-confidence-intervals">Step 5: Bootstrap Confidence Intervals</h2>
<p>Point estimates are only half the story. Any estimate you report to a product team needs an interval that tells them whether +0.08 is distinguishable from +0.03 or from +0.12. Analytic standard errors for IPW and matching are tricky because of the estimated propensity score, so the simplest and most honest move is the non-parametric bootstrap.</p>
<pre><code class="language-python">def estimate_all(sample):
    """Return (ATE_IPW, ATT_IPW, ATT_match) on a bootstrap sample."""
    s = sample.copy()
    X_s = pd.get_dummies(
        s[["engagement_tier", "query_confidence"]], drop_first=True
    ).astype(float)
    ps = LogisticRegression(max_iter=1000).fit(X_s, s.opt_in_agent_mode)
    s["p"] = ps.predict_proba(X_s)[:, 1]

    s["w_ate"] = np.where(
        s.opt_in_agent_mode == 1, 1 / s.p, 1 / (1 - s.p)
    )
    s["w_att"] = np.where(
        s.opt_in_agent_mode == 1, 1, s.p / (1 - s.p)
    )
    t, c = s[s.opt_in_agent_mode == 1], s[s.opt_in_agent_mode == 0]

    ate = (
        (t.task_completed * t.w_ate).sum() / t.w_ate.sum()
        - (c.task_completed * c.w_ate).sum() / c.w_ate.sum()
    )
    att = t.task_completed.mean() - (
        (c.task_completed * c.w_att).sum() / c.w_att.sum()
    )
    nn_b = NearestNeighbors(n_neighbors=1).fit(c[["p"]].values)
    _, idx_b = nn_b.kneighbors(t[["p"]].values)
    match = (
        t.task_completed.values
        - c.task_completed.values[idx_b.flatten()]
    ).mean()
    return ate, att, match

rng = np.random.default_rng(7)
n_reps = 500
results = np.zeros((n_reps, 3))
for i in range(n_reps):
    boot = df.iloc[rng.integers(0, len(df), size=len(df))]
    results[i] = estimate_all(boot)

for name, col in zip(["IPW ATE", "IPW ATT", "1-NN ATT"], range(3)):
    lo, hi = np.percentile(results[:, col], [2.5, 97.5])
    print(f"{name:&lt;10} 95% CI: [{lo:+.4f}, {hi:+.4f}]")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">IPW ATE    95% CI: [+0.0745, +0.0954]
IPW ATT    95% CI: [+0.0687, +0.0865]
1-NN ATT   95% CI: [+0.0659, +0.0940]
</code></pre>
<p><strong>Here's what's happening:</strong> you resample the dataset with replacement 500 times, refit the propensity model and recompute each estimator on every resample, and then take the 2.5th and 97.5th percentiles of the bootstrap distribution as the 95% confidence interval. All three intervals cover the ground-truth +0.08 and exclude the naive +0.21 by a wide margin.</p>
<p>The IPW ATT interval is the tightest because ATT reweights only the control group. The 1-NN matching interval is the widest because single-neighbor matching discards control users outside the matched set.</p>
<p>Running this once takes about 90 seconds on a laptop. For a stakeholder report, anchor the headline to the point estimate and cite the interval so the team sees the uncertainty alongside the number.</p>
<h2 id="heading-when-propensity-score-methods-fail">When Propensity Score Methods Fail</h2>
<p>Propensity scores make opt-in comparisons rigorous when their assumptions hold. They produce biased estimates that look clean when those assumptions fail.</p>
<p>Four common failure modes map to the three identification assumptions from earlier.</p>
<h3 id="heading-1-unmeasured-confounders-violate-unconfoundedness">1. Unmeasured Confounders (Violate Unconfoundedness)</h3>
<p>If something drives both opt-in and your outcome but isn't in your propensity model, IPW and matching produce biased estimates. This is the most common failure in practice.</p>
<p>An example: users who opt into agent mode are also the users who follow your engineering blog and read release notes. If blog-reading behavior raises task completion independently of the feature, missing that signal attributes the effect to agent mode, inflating your estimate.</p>
<p>The only real defense is domain knowledge about what drives opt-in, richer feature engineering in your propensity model, and formal sensitivity tools (Rosenbaum bounds, E-values) that quantify how strong an unmeasured confounder would have to be to overturn the result.</p>
<h3 id="heading-2-positivity-overlap-failures-violates-overlap">2. Positivity (Overlap) Failures (Violates Overlap)</h3>
<p>If some users have near-zero probability of opting in (or near-one), you've got no comparable counterfactual for them.</p>
<p>IPW creates extreme weights (1 / 0.001 = 1,000) that let a single outlier dominate the estimate, and matching is forced into poor-quality pairings.</p>
<p>Check propensity histograms and, if extreme values exist, trim users whose propensities fall outside [0.05, 0.95] before weighting.</p>
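<p>The trimming step can be sketched in a few lines. This is a minimal standard-library illustration, not code from the companion notebook: the helper names <code>trim_by_propensity</code> and <code>max_ate_weight</code> and the toy rows are assumptions made up for this example.</p>

```python
def trim_by_propensity(rows, low=0.05, high=0.95):
    """Split rows into (kept, dropped) by whether propensity is in [low, high]."""
    kept, dropped = [], []
    for row in rows:
        inside = high >= row["propensity"] >= low
        (kept if inside else dropped).append(row)
    return kept, dropped


def max_ate_weight(rows):
    """Largest ATE weight: 1/p for treated users, 1/(1-p) for controls."""
    return max(
        1 / r["propensity"] if r["treated"] else 1 / (1 - r["propensity"])
        for r in rows
    )


# Toy rows (made up): one treated user with a near-zero propensity
toy_rows = [
    {"treated": 1, "propensity": 0.001},
    {"treated": 1, "propensity": 0.40},
    {"treated": 0, "propensity": 0.10},
    {"treated": 0, "propensity": 0.97},
]

kept, dropped = trim_by_propensity(toy_rows)
print(f"kept {len(kept)} rows, dropped {len(dropped)}")
print(f"max weight before: {max_ate_weight(toy_rows):.1f}")
print(f"max weight after:  {max_ate_weight(kept):.1f}")
```

<p>Keep in mind that trimming changes who the estimate describes: after dropping users outside the range, the result applies to the overlap population rather than to everyone.</p>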
<h3 id="heading-3-misspecified-propensity-models-degrade-unconfoundedness-in-practice">3. Misspecified Propensity Models (Degrade Unconfoundedness in Practice)</h3>
<p>A linear logistic regression can't capture nonlinear relationships. If opt-in depends on the interaction between engagement tier and query confidence (power users with complex queries opt in, while light users pass), a main-effects model misses that and produces poor balance.</p>
<p>Use flexible models (for example, gradient boosting on the propensity score or regression adjustment on top of weighting) and always check the balance after weighting. Poor balance after weighting is the primary signal of misspecification.</p>
<h3 id="heading-4-spillovers-between-users-violates-sutva">4. Spillovers Between Users (Violates SUTVA)</h3>
<p>Propensity score methods assume your users are independent. If one user opting into agent mode affects another user's task completion (for example, teammates adopting the feature together in shared workspaces), your estimated effect includes the spillover.</p>
<p>This violates the stable unit treatment value assumption (SUTVA), and handling it cleanly requires a different toolkit: either cluster randomization for features adopted at the workspace level or network-aware experimental designs for user-level spillovers.</p>
<p>These failure modes stay invisible in your regression coefficients. They surface as estimates that look good on paper but don't hold up when the feature rolls out to a broader audience.</p>
<p>Run balance diagnostics, check overlap plots, and document what you might have missed: those are your only real defenses.</p>
<h2 id="heading-what-to-do-next">What to Do Next</h2>
<p>Propensity score methods are the right tool when your feature ships behind an opt-in toggle and you've got rich covariates to model selection with.</p>
<p>If opt-in follows a crisp rule (a threshold on query complexity, a paid-tier gate), regression discontinuity fits better. If you suspect unobserved confounders and have an external randomization source (randomized rollout noise, rate-limit-triggered routing), instrumental variables will do better.</p>
<p>To guard your estimate against propensity misspecification, doubly robust estimators combine propensity weighting with regression adjustment and stay consistent if at least one of the two component models is correctly specified.</p>
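<p>One common doubly robust estimator is augmented IPW (AIPW). The sketch below is a minimal standard-library illustration under stated assumptions: the function name <code>aipw_ate</code> is made up, and the propensities <code>p</code> and outcome-model predictions <code>mu1</code> and <code>mu0</code> are toy constants here, whereas in practice they would come from a fitted propensity model and a fitted outcome regression.</p>

```python
def aipw_ate(y, t, p, mu1, mu0):
    """Augmented IPW (doubly robust) estimate of the ATE.

    y: observed outcomes, t: 0/1 treatment flags, p: propensity scores,
    mu1 / mu0: outcome-model predictions under treatment / control.
    """
    n = len(y)
    total = 0.0
    for yi, ti, pi, m1, m0 in zip(y, t, p, mu1, mu0):
        # Outcome-model contrast plus an inverse-propensity-weighted
        # correction built from the residuals of the observed arm
        correction = ti * (yi - m1) / pi - (1 - ti) * (yi - m0) / (1 - pi)
        total += (m1 - m0) + correction
    return total / n


# Toy inputs (made up): four users, constant propensities and predictions
y = [1, 1, 1, 0]
t = [1, 1, 0, 0]
p = [0.5, 0.5, 0.5, 0.5]
mu1 = [0.6, 0.6, 0.6, 0.6]
mu0 = [0.5, 0.5, 0.5, 0.5]

ate_dr = aipw_ate(y, t, p, mu1, mu0)
print(f"AIPW ATE: {ate_dr:+.4f}")
```

<p>The residual correction is what gives the estimator its second chance: if the outcome model is off, the propensity-weighted residuals pull the estimate back, and if the propensity model is off, a good outcome model leaves little residual to misweight.</p>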
<p>The companion notebook for this tutorial <a href="http://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/02_propensity_opt_in">lives here</a>. Clone the repo, generate the synthetic dataset, and run <code>psm_demo.ipynb</code> (or <code>psm_demo.py</code>) to reproduce every code block, every number, and every figure from this tutorial.</p>
<p>When an AI feature ships behind a toggle, the naïve opt-in comparison is usually the wrong number. Propensity score methods give you "users comparable to those who clicked this" as your counterfactual, and the bootstrap gives you an interval you can defend when a stakeholder asks how sure you are.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Deploy a Serverless Spam Classifier Using Scikit-Learn, AWS Lambda, & API Gateway ]]>
                </title>
                <description>
                    <![CDATA[ In today's digital world, spam is no longer just an annoyance - it's a growing security threat. To combat this, developers often turn to machine learning to build intelligent filters that can distingu ]]>
                </description>
                <link>https://www.freecodecamp.org/news/deploying-serverless-spam-classifier/</link>
                <guid isPermaLink="false">69f2e347b18c978233780179</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ serverless ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Architecture ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rakshath Naik ]]>
                </dc:creator>
                <pubDate>Thu, 30 Apr 2026 05:06:15 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/08672d22-a4df-4b99-8ef7-fffd18f5dc07.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In today's digital world, spam is no longer just an annoyance - it's a growing security threat. To combat this, developers often turn to machine learning to build intelligent filters that can distinguish legitimate emails from malicious ones.</p>
<p>While building a machine learning model in a notebook is relatively straightforward, the real challenge lies in the last mile: deploying that model into a scalable, production-ready system that users can actually interact with.</p>
<p>In this project, I built an end-to-end serverless spam classifier, combining Scikit-learn for model development with AWS Lambda, Amazon S3, and Amazon API Gateway for deployment. The result is a lightweight, scalable API that can classify messages in real time.</p>
<p>The system is designed to be modular and cost-efficient, allowing the model to be retrained and updated independently without affecting the live API. From detecting "free iPhone" scams to identifying phishing attempts, this project demonstrates how to bridge the gap between machine learning experimentation and real-world deployment.</p>
<h3 id="heading-table-of-contents">Table of&nbsp;Contents</h3>
<ul>
<li><p><a href="#heading-1-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-2-building-the-brain-the-model">Building the Brain: The Model</a></p>
</li>
<li><p><a href="#heading-3-deploying-the-model-to-aws">Deploying the Model to AWS</a></p>
</li>
<li><p><a href="#heading-4-how-to-run-the-project-locally">How to Run The Project Locally</a></p>
</li>
<li><p><a href="#heading-5-our-project-architecture">Our Project Architecture</a></p>
</li>
<li><p><a href="#heading-6-conclusion-the-power-of-serverless-ai">Conclusion: The Power of Serverless AI</a></p>
</li>
<li><p><a href="#heading-7-acknowledgment-references">Acknowledgment / References</a></p>
</li>
</ul>
<h2 id="heading-1-prerequisites">1. Prerequisites</h2>
<ol>
<li><p><strong>Fundamental skills:</strong> Basic proficiency in Python and understanding of Machine Learning concepts like classification.</p>
</li>
<li><p><strong>AWS account:</strong> Access to an AWS account with permissions for Lambda, S3, and API Gateway.</p>
</li>
<li><p><strong>Environment:</strong> Python 3.11 installed, along with libraries like scikit-learn, pandas, and joblib.</p>
</li>
<li><p><strong>AWS CLI:</strong> Configured on your local machine for file uploads.</p>
</li>
<li><p><strong>HuggingFace account:</strong> You can directly download the model from my account.</p>
</li>
</ol>
<h2 id="heading-2-building-the-brain-the-model">2. Building the Brain: The&nbsp;Model</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/b43af198-1472-4914-9469-6cd5ca5384e2.png" alt="Demonstrational image to show the brain of AI." style="display:block;margin:0 auto" width="1000" height="563" loading="lazy">

<p><em>Photo by</em> <a href="https://unsplash.com/@steve_j?utm_source=medium&amp;utm_medium=referral"><em>Steve A Johnson</em></a> <em>on</em> <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral"><em>Unsplash</em></a></p>
<p>At the heart of this project lies a supervised learning approach. Instead of simply specifying which words are considered spam, we'll provide the computer with a dataset and an algorithm, enabling it to learn and identify spam patterns on its own.</p>
<h3 id="heading-1-vectorization-turning-text-into-math">1. Vectorization: Turning Text into&nbsp;Math</h3>
<p>Machine Learning models can't <strong>read</strong> text. They require numerical input. To solve this, we used the <a href="https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/">TF-IDF</a> (Term Frequency-Inverse Document Frequency) Vectorizer.</p>
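<pre><code class="language-python">from sklearn.feature_extraction.text import TfidfVectorizer
# Learn the vocabulary and IDF weights from the training emails only
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)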
</code></pre>
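<p>Here's the textbook formula (scikit-learn's default settings use a smoothed IDF and L2-normalize each document vector, so its exact values differ slightly):</p>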
<p>$$w_{i,j} = tf_{i,j} \times \log \left( \frac{N}{df_i} \right)$$</p>
<p>TF-IDF term definitions:</p>
<ul>
<li><p><strong>wᵢ,ⱼ (Weight):</strong> The final importance score of a specific word in a document.</p>
</li>
<li><p><strong>tfᵢ,ⱼ (Term Frequency):</strong> How often a word appears in a single email.</p>
</li>
<li><p><strong>N (Total Documents):</strong> The total count of all emails in your dataset.</p>
</li>
<li><p><strong>dfᵢ (Document Frequency):</strong> The number of different emails that contain this specific word.</p>
</li>
<li><p><strong>log(N/dfᵢ) (IDF):</strong> A penalty that lowers the score of common words like <strong>the</strong> or <strong>is</strong> that appear everywhere.</p>
</li>
</ul>
<p>The vectorizer cleans the data by removing common English stop words, converts all text to lowercase for consistency, and assigns more importance to rare, meaningful words and less importance to frequently used ones.</p>
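<p>To make the textbook formula concrete, here is a small worked example. The counts are hypothetical, the helper name <code>tf_idf</code> is made up, and the natural logarithm is assumed since the formula doesn't fix a base. It is only meant to show how a rare word ends up with a much larger weight than a common one.</p>

```python
import math

def tf_idf(tf, n_docs, df):
    """Textbook TF-IDF weight: tf * log(N / df)."""
    return tf * math.log(n_docs / df)

# Hypothetical counts: 1,000 emails; "winner" is in 10, "the" is in 950
w_winner = tf_idf(tf=2, n_docs=1000, df=10)
w_the = tf_idf(tf=5, n_docs=1000, df=950)
print(f"winner: {w_winner:.3f}")
print(f"the:    {w_the:.3f}")
```

<p>Even though "the" appears more often in the email, its IDF term is close to zero because almost every document contains it.</p>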
<h3 id="heading-2-training-the-logistic-regression-engine">2. Training: The Logistic Regression Engine</h3>
<p>We'll use <strong>Logistic Regression</strong> here, a classification algorithm that predicts the probability of an outcome.</p>
<p>In this stage, we feed our vectorized training data into the Logistic Regression algorithm. The goal is to establish a mathematical relationship between specific word weights and the <strong>Spam</strong> or <strong>Ham</strong> label.</p>
<p>During training, the model iteratively adjusts its internal parameters to minimize error, eventually learning that words like winner or free correlate highly with spam, while conversational language correlates with legitimate messages.</p>
<pre><code class="language-python">from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_features, Y_train)
</code></pre>
<p>In our case, it calculates the probability that an email is spam or ham.</p>
<p>The algorithm uses the Sigmoid function to map any real-valued number into a value between 0 and 1.</p>
<p>$$P(y=1|x) = \frac{1}{1 + e^{-z}}$$</p>
<p>where z = β₀ + β₁x₁ +&nbsp;… + βₙxₙ.</p>
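<p>The sigmoid step can be sketched with the standard library alone. The helper names and the two-feature weights below are made up for illustration and are not the trained model's parameters. Which class counts as y = 1 depends on how the labels were encoded (the Lambda code later in this article maps 1 to HAM).</p>

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """P(y = 1 | x) with z = intercept + sum of weight * feature."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical two-feature example (weights are made up, not learned)
prob = predict_proba(features=[0.8, 0.1], weights=[2.0, -1.0], intercept=-0.5)
print(f"P(y=1|x) = {prob:.3f}")
```

<p>A score of z = 0 maps to exactly 0.5, which is the default decision boundary between the two classes.</p>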
<h3 id="heading-3-evaluation-testing-the-intelligence">3. Evaluation: Testing the Intelligence</h3>
<p>After training, we need to verify if the brain actually works on data it hasn't seen before.</p>
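<pre><code class="language-python">from sklearn.metrics import accuracy_score
X_test_features = feature_extraction.transform(X_test)  # reuse the fitted vocabulary
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)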
</code></pre>
<p>By comparing the model’s predictions against the actual labels in our test set, we calculate an accuracy score. In our tests the model reached ~94% accuracy, which gave us confidence that it was ready for real-world messages.</p>
<h3 id="heading-4-exporting-the-logic-serialization">4. Exporting the Logic (Serialization)</h3>
<p>To move this brain from our local Python environment to the AWS Cloud, we'll use Joblib to save our work into binary files (.pkl).</p>
<pre><code class="language-python">joblib.dump(model, 'spam_model.pkl')
joblib.dump(feature_extraction, 'vectorizer.pkl')
</code></pre>
<p>We use the Pickle format because it allows us to freeze complex Python objects (mathematical weights and word mappings) into a portable binary format that can be instantly re-animated in the cloud.</p>
<p>We need the Vectorizer to translate new user text into the exact numerical coordinates the Model was trained to understand. Using one without the other is like having a key but no lock.</p>
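<p>To illustrate the save-and-restore round trip without needing scikit-learn installed, here is a minimal sketch that uses the standard-library <code>pickle</code> module as a stand-in for Joblib. The toy "vectorizer" and "model" objects are plain dictionaries made up for this example, not the real trained artifacts.</p>

```python
import pickle
from io import BytesIO

# Toy stand-ins (made up): a "vectorizer" vocabulary and "model" weights
toy_vectorizer = {"free": 0, "winner": 1, "meeting": 2}
toy_model = {"weights": [1.5, 2.0, -1.0], "intercept": -0.5}

def to_bytes(obj):
    """Serialize an object into an in-memory binary blob."""
    buffer = BytesIO()
    pickle.dump(obj, buffer)
    return buffer.getvalue()

def from_bytes(blob):
    """Rebuild the object from bytes, as one would after reading a file."""
    return pickle.load(BytesIO(blob))

restored_vectorizer = from_bytes(to_bytes(toy_vectorizer))
restored_model = from_bytes(to_bytes(toy_model))

# Both halves must come from the same training run: the model's weights
# are indexed by the vectorizer's vocabulary positions.
print(len(restored_vectorizer) == len(restored_model["weights"]))
```

<p>Because unpickling can execute arbitrary code, only load <code>.pkl</code> files from sources you trust.</p>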
<p>The trained Logistic Regression model and TF-IDF vectorizer are openly available for the community on Hugging Face here: <a href="https://huggingface.co/rakshath1/mail-spam-detector">Get the model on HuggingFace</a>.</p>
<h2 id="heading-3-deploying-the-model-to-aws">3. Deploying the Model to&nbsp;AWS</h2>
<p>Training a model is science, while deploying it is engineering. To make this classifier accessible to the world, we'll use a serverless stack that scales automatically and requires almost no maintenance.</p>
<h3 id="heading-1-model-storage-amazon-s3">1. Model Storage: Amazon&nbsp;S3</h3>
<p>First, we'll upload our&nbsp;.pkl files to an S3 bucket. By decoupling the model from the code, we can update the AI's intelligence (simply by overwriting the file in S3) without redeploying the backend code. This makes the system highly maintainable.</p>
<h3 id="heading-2-the-production-backend-aws-lambda">2. The Production Backend: AWS&nbsp;Lambda</h3>
<p>To make the AI accessible, we'll move from a local script to a Serverless Cloud Architecture. This ensures the model is always available without the cost of a 24/7 server.</p>
<p>The deployment environment is AWS Lambda (Python 3.11). Since Lambda is a lightweight environment, it doesn't include Scikit-Learn or Joblib. To provide these, we'll package them as a ZIP file, store it in our S3 bucket, and attach it to the function as a Lambda layer.</p>
<p><strong>Shell commands (the last step uses the AWS CLI):</strong></p>
<pre><code class="language-shell">
# 1. Create a workspace
mkdir ml_layer &amp;&amp; cd ml_layer

# 2. Install scikit-learn and its dependencies into a folder
pip install \
    --platform manylinux2014_x86_64 \
    --target=python/lib/python3.11/site-packages \
    --implementation cp \
    --python-version 3.11 \
    --only-binary=:all: \
    scikit-learn joblib

# 3. Zip the folder
zip -r sklearn_lib.zip python

# 4. Upload to S3 (Using AWS CLI)
aws s3 cp sklearn_lib.zip s3://YOUR-BUCKET-NAME/
</code></pre>
<p>We store the Scikit-Learn library as a ZIP in S3 to get around the size limit on direct uploads to AWS Lambda. Publishing it from S3 as a layer also keeps the heavy dependencies separate from the function code, so the handler itself stays small.</p>
<p><strong>The Lambda Function:</strong></p>
<pre><code class="language-python">
import json
import boto3
import os
import sys
from io import BytesIO

# Ensure the custom Lambda layer (containing sklearn/joblib) is on the import path
sys.path.append('/opt/python')

try:
    import joblib
except ImportError:
    # Fallback for specific Scikit-Learn distributions
    from sklearn.utils import _joblib as joblib

# Initialize S3 client
s3 = boto3.client('s3')

# Use placeholders for the article so readers can insert their own values
BUCKET_NAME = 'YOUR_S3_BUCKET_NAME' 
MODEL_KEY = 'spam_model.pkl'
VECTORIZER_KEY = 'vectorizer.pkl'

# Global variables for 'Warm Start' caching (improves performance by keeping model in RAM)
model = None
vectorizer = None

def load_model():
    """Downloads model files from S3 only if they aren't already in RAM"""
    global model, vectorizer
    if model is None or vectorizer is None:
        try:
            # 1. Load the Logistic Regression Model from S3
            m_obj = s3.get_object(Bucket=BUCKET_NAME, Key=MODEL_KEY)
            model = joblib.load(BytesIO(m_obj['Body'].read()))
            
            # 2. Load the TF-IDF Vectorizer directly from S3
            v_obj = s3.get_object(Bucket=BUCKET_NAME, Key=VECTORIZER_KEY)
            vectorizer = joblib.load(BytesIO(v_obj['Body'].read()))
        except Exception as e:
            raise Exception(f"Failed to load .pkl files from S3: {str(e)}")

def lambda_handler(event, context):
    try:
        # Ensure model and vectorizer are ready before processing
        load_model()
        
        # Handles both direct Lambda tests and API Gateway POST requests
        body = event.get('body', event)
        if isinstance(body, str):
            body = json.loads(body)
            
        text = body.get('text', '')
            
        if not text:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'No text provided.'})
            }

        # 1. Transform input text to numeric features using the trained Vectorizer
        data_vec = vectorizer.transform([text])
        
        # 2. Predict using the Logistic Regression Model 
        prediction = int(model.predict(data_vec)[0])
        
        # 3. Map numeric result to human-readable label
        result_label = "HAM" if prediction == 1 else "SPAM"
        
        # RESPONSE WITH CORS
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*' # needed for cross-domain web integration
            },
            'body': json.dumps({
                'status': 'success',
                'classification': result_label,
                'input_text': text
            })
        }
        
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error_message': f"Inference Error: {str(e)}"})
        }
</code></pre>
<p>Key features of the Lambda function:</p>
<ol>
<li><p><strong>Warm start caching:</strong> By defining the model and vectorizer variables outside the lambda_handler, we keep them in the container's memory between invocations. Only the first (cold) request downloads the files from S3. Subsequent warm requests reuse the cached objects, which significantly reduces their latency.</p>
</li>
<li><p><strong>Dynamic dependency loading:</strong> The <strong>sys.path.append('/opt/python')</strong> line makes sure Python searches the layer directory, so the function can import Scikit-Learn and Joblib from the layer instead of bundling them with the function code.</p>
</li>
<li><p><strong>Bimodal input handling:</strong> The function is designed to handle both direct JSON testing from the AWS console and stringified payloads sent via API Gateway.</p>
</li>
</ol>
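<p>The bimodal input handling described above can be isolated into a small helper. This is a minimal sketch: the function name <code>parse_body</code> and the sample events are made up for illustration, and the extra guard for an empty body is an addition that the handler above doesn't have.</p>

```python
import json

def parse_body(event):
    """Return the request payload as a dict for both invocation styles."""
    body = event.get("body", event)   # proxy events wrap the payload in "body"
    if isinstance(body, str):          # API Gateway delivers it as a JSON string
        body = json.loads(body)
    return body or {}                  # treat a missing/empty body as no input

# Console-style test event: the payload is the event itself
direct_event = {"text": "Win a free iPhone now"}
# Proxy-style event: the payload is a JSON string under "body"
proxy_event = {"body": json.dumps({"text": "Win a free iPhone now"})}

print(parse_body(direct_event)["text"])
print(parse_body(proxy_event)["text"])
```

<p>Keeping the parsing in its own function makes it easy to test locally without deploying anything to AWS.</p>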
<h3 id="heading-3-the-api-gateway-the-bridge-to-the-web">3. The API Gateway - The Bridge to the&nbsp;Web</h3>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/8aa3e8d7-569a-4dd5-a6ac-184922474952.png" alt="Demonstrational image to show the API Gateway." style="display:block;margin:0 auto" width="1000" height="563" loading="lazy">

<p>Photo by <a href="https://unsplash.com/@growtika?utm_source=medium&amp;utm_medium=referral">Growtika</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></p>
<h4 id="heading-creating-the-rest-api">Creating the REST API</h4>
<p>Next we'll create a REST API with a single POST method. Why POST, you might be wondering? Well, POST lets us send the user’s text message as a JSON payload in the request body instead of exposing it in the URL.</p>
<ol>
<li><p>First navigate to the Amazon API Gateway console and select Create API -&gt; REST API.</p>
</li>
<li><p>Give your API a name, such as EmailSpamPredictor-API, and set the Endpoint Type to Regional.</p>
</li>
<li><p>Then in the left sidebar, click Resources and enter a resource name (for example, <strong>/predict</strong>, which is what I used).</p>
</li>
<li><p>Next, click Create method, select POST, and then select Lambda Function as the integration type.</p>
</li>
<li><p>Ensure Lambda Proxy integration is enabled (this allows the full request to pass through to your code).</p>
</li>
</ol>
<p><strong>The CORS Configuration (The Troubleshooting Hub)</strong><br>This is where many developers encounter the dreaded <strong>Connection Error</strong>. Our API is hosted on AWS, so if your front-end is served from a different origin, the browser’s Same-Origin Policy will block the request by default.</p>
<p>To fix this, we'll enable <strong>CORS:</strong></p>
<ol>
<li><p><strong>Access-Control-Allow-Origin:</strong> Set to * (or specifically to your domain) to tell the browser that the API is allowed to talk to your front-end.</p>
</li>
<li><p><strong>The OPTIONS method:</strong> When you enable CORS in the console, API Gateway creates an OPTIONS method for the resource. This handles the preflight request, where the browser asks, “Are you allowed to receive data from me?” before sending the actual text.</p>
</li>
<li><p><strong>Access-Control-Allow-Headers:</strong> In the screenshot, you'll notice headers like Content-Type and Authorization are allowed. This ensures that when our JavaScript fetch() call sets the content type to application/json, the API Gateway doesn't reject it.</p>
</li>
</ol>
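<p>With Lambda proxy integration, the function's own response also needs to carry the CORS header, and the handler shown earlier only adds it to the 200 response. One possible way to keep this consistent is a small helper that attaches the same headers to every response, including errors. The names <code>CORS_HEADERS</code> and <code>make_response</code> are made up for this sketch.</p>

```python
import json

CORS_HEADERS = {
    "Content-Type": "application/json",
    "Access-Control-Allow-Origin": "*",  # or a specific front-end origin
}

def make_response(status_code, payload):
    """Build a proxy-integration response that always carries CORS headers."""
    return {
        "statusCode": status_code,
        "headers": dict(CORS_HEADERS),  # copy so callers can't mutate the shared dict
        "body": json.dumps(payload),
    }

ok = make_response(200, {"status": "success", "classification": "SPAM"})
bad = make_response(400, {"error": "No text provided."})
print(bad["headers"]["Access-Control-Allow-Origin"])
```

<p>Without the header on error responses, the browser hides the 400 or 500 body from your JavaScript, and the failure shows up as a generic network error instead of a readable message.</p>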
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/cf5c87c6-f374-4dda-8001-77a0aab52672.png" alt="Image illustrates the CORS configuration for our project. " style="display:block;margin:0 auto" width="1487" height="617" loading="lazy">

<p>Image illustrates the CORS configuration for our project. (Image by author)</p>
<h4 id="heading-deployment-stages">Deployment Stages</h4>
<p>Once the API is deployed to a production stage, AWS generates a permanent Invoke URL. This acts as the public gateway to our model and typically follows this structure: <a href="https://%5Bapi-id%5D.execute-api.%5Bregion%5D.amazonaws.com/prod/classify">https://[api-id].execute-api.[region].amazonaws.com/prod/classify</a>.</p>
<h4 id="heading-connecting-the-frontend-the-javascript-layer">Connecting the Frontend (The JavaScript Layer)</h4>
<p>With the API live, we can now write a simple JavaScript function to talk to our model. This script runs whenever a user clicks the <strong>Analyze</strong> button on your site.</p>
<pre><code class="language-javascript">
async function checkSpam() {
    const message = document.getElementById("userInput").value;
    const apiUrl = "YOUR_API_GATEWAY_INVOKE_URL";

    try {
        const response = await fetch(apiUrl, {
            method: "POST",
            headers: {
                "Content-Type": "application/json"
            },
            body: JSON.stringify({ "text": message })
        });

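        // fetch() only rejects on network failures, so check the HTTP status explicitly
        if (!response.ok) {
            throw new Error(`API returned HTTP ${response.status}`);
        }

        const data = await response.json();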
        
        // Display result on the webpage
        const resultElement = document.getElementById("result");
        resultElement.innerText = `Prediction: ${data.classification}`;
        resultElement.style.color = data.classification === "SPAM" ? "red" : "green";

    } catch (error) {
        console.error("Error:", error);
        alert("Could not connect to the Spam Detector API.");
    }
}
</code></pre>
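<p>Before wiring up the browser, it can help to send the same request from Python. A non-browser client doesn't enforce CORS, so if this call works and the browser call fails, the problem is most likely the CORS configuration and not the model. This is a sketch using only the standard library. <code>API_URL</code> is a placeholder, and <code>build_request</code> and <code>classify</code> are hypothetical helper names:</p>
<pre><code class="language-python">import json
import urllib.request

# Placeholder: paste the Invoke URL from your own deployment stage.
API_URL = "https://example.invalid/prod/classify"


def build_request(api_url, message):
    """Build the same POST request the front-end sends, without sending it."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(api_url, message, timeout=10):
    """Send the request and return the parsed JSON body."""
    request = build_request(api_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a live endpoint):
# print(classify(API_URL, "WIN free iPhone now"))
</code></pre>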
<h2 id="heading-4-how-to-run-the-project-locally">4. How to Run The Project&nbsp;Locally</h2>
<p>You can store the front-end as a single HTML file. Once it's ready, don't just double-click the <code>.html</code> file. A page opened from a <code>file://</code> path has no regular web origin, so the browser can apply extra security restrictions to its requests. Instead, serve it from a simple local server.</p>
<p><strong>Step 1:</strong> Open the terminal or Command Prompt.</p>
<p><strong>Step 2:</strong> Navigate to your project folder.</p>
<pre><code class="language-shell">cd [PATH_TO_YOUR_FOLDER]
</code></pre>
<p><strong>Step 3:</strong> Start a local Python web server.</p>
<pre><code class="language-shell">python -m http.server 8000
</code></pre>
<p><strong>Step 4:</strong> Access the application.</p>
<p>Open your browser and navigate to:<br><a href="http://localhost:8000/your-file-name.html">http://localhost:8000/your-file-name.html</a></p>
<p><strong>Watch the Demo:</strong></p>
<div class="embed-wrapper"><iframe width="560" height="315" src="https://www.youtube.com/embed/q2X_azntmzY" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>

<h2 id="heading-5-our-project-architecture">5. Our Project Architecture</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/c17673d4-5dd0-43dc-8e8d-3015bcd31864.png" alt="Image showing the Architecture Diagram of our Project." style="display:block;margin:0 auto" width="1000" height="563" loading="lazy">

<p>The image illustrates the architecture of our project (Building a Serverless Spam Classifier). It shows the process that takes place from the client input to the final model output. (Image by Author)</p>
<ol>
<li><p><strong>Client Front-End Interaction:</strong> The process starts on the far left. A user interacts with the web interface (for example, a website or a desktop app). They input text like <strong>WIN free iPhone now</strong> and trigger a request.</p>
</li>
<li><p><strong>The Entry Point: API Gateway:</strong> The request hits the Amazon API Gateway, which acts as the <strong>security guard</strong> and translator.&nbsp;<br><strong>(a)</strong> CORS OPTIONS handles the pre-flight handshake to ensure the browser has permission to talk to the AWS cloud.&nbsp;<br><strong>(b)</strong> Classification Request (POST) routes the actual message data to your backend logic.</p>
</li>
<li><p><strong>The Engine: AWS Lambda (Python 3.11):</strong>&nbsp;The central “<strong>lightbulb</strong>” represents your Lambda function. This is where the code you wrote lives. It doesn’t run 24/7 – it only wakes up when a request arrives.</p>
</li>
<li><p><strong>Storage &amp; Retrieval: S3 Bucket:</strong> Since Lambda is lightweight, it doesn’t store your heavy Machine Learning files internally.<br><strong>Dependency and Model Download:</strong> The function reaches out to the S3 Bucket to pull in <code>sklearn_lib.zip</code> (the engine) and the <code>.pkl</code> files (the intelligence).&nbsp;<br><strong>Required Dependency and Model:</strong> These assets are loaded into the Lambda’s temporary memory to prepare for the prediction.</p>
</li>
<li><p><strong>The Inference Pipeline:</strong>&nbsp;Inside the Lambda, a three-step mathematical cycle occurs:<br><strong>(a) Text Vectorizer:</strong> Translates the words into numbers.<br><strong>(b) Logistic Regression:</strong> Calculates the probability of spam based on those numbers.<br><strong>(c) Label:</strong> Assigns a final result (Spam or Ham).</p>
</li>
<li><p><strong>The Result Delivery:</strong> The result is sent back through the API Gateway, including the necessary CORS Headers to ensure the browser accepts it. The front-end then updates to show the “<strong>Result: SPAM</strong>” with a visual indicator.</p>
</li>
</ol>
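<p>The three-step inference cycle in item 5 can be sketched in plain Python. This is a toy illustration, not the deployed model: the vocabulary, the weights in <code>WEIGHTS</code>, and the <code>BIAS</code> value are made up, whereas the real function loads a fitted vectorizer and classifier from the <code>.pkl</code> files.</p>
<pre><code class="language-python">import math

# Made-up weights for illustration only; a trained model learns these from data.
WEIGHTS = {"win": 1.8, "free": 2.1, "now": 0.6, "meeting": -1.5}
BIAS = -1.0


def vectorize(text):
    """Step (a): turn words into numbers (counts over the known vocabulary)."""
    tokens = text.lower().split()
    return {word: tokens.count(word) for word in WEIGHTS}


def spam_probability(features):
    """Step (b): logistic regression, a weighted sum passed through a sigmoid."""
    score = BIAS + sum(WEIGHTS[word] * count for word, count in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def label(probability, threshold=0.5):
    """Step (c): map the probability to a final label."""
    return "SPAM" if probability >= threshold else "HAM"


for message in ["WIN free iPhone now", "See you at the meeting"]:
    prob = spam_probability(vectorize(message))
    print(f"{message!r}: {label(prob)} (p={prob:.2f})")
</code></pre>
<p>With these toy weights, the first message scores well above the 0.5 threshold and the second well below it. The structure is the part that carries over: counts in, probability out, label last.</p>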
<h2 id="heading-6-conclusion-the-power-of-serverless-ai">6. Conclusion: The Power of Serverless AI</h2>
<p>By merging the mathematical simplicity of Logistic Regression with the industrial strength of AWS Serverless Architecture, we have transformed a static Python script into a globally accessible, scalable API.</p>
<p>This project demonstrates that you don’t need a massive budget or a 24/7 dedicated server to deploy high-quality Machine Learning.</p>
<p>Using the S3-to-Lambda workaround allowed us to bypass common storage hurdles, ensuring that our Brain (the model) and its Muscle (Scikit-Learn) could function seamlessly within the cloud’s ephemeral environment. It bridges the gap between experimentation and real-world applications, making AI systems practical, efficient, and accessible.</p>
<h2 id="heading-7-acknowledgment-references">7. Acknowledgment / References</h2>
<ul>
<li><p>Pre-trained spam classification model: <a href="https://huggingface.co/rakshath1/mail-spam-detector">rakshath1/mail-spam-detector on Hugging Face</a></p>
</li>
<li><p>Scikit-learn <a href="https://scikit-learn.org/stable/api/index.html">Documentation</a></p>
</li>
<li><p>AWS Lambda <a href="https://docs.aws.amazon.com/lambda/latest/api/welcome.html">Documentation</a></p>
</li>
<li><p>Amazon S3 <a href="https://aws.amazon.com/documentation-overview/s3/">Documentation</a></p>
</li>
<li><p>Amazon API Gateway <a href="https://docs.aws.amazon.com/apigateway/">Documentation</a></p>
</li>
</ul>
<h3 id="heading-connect-with-me">Connect With Me</h3>
<ul>
<li><p><a href="https://medium.com/@rakshathnaik62">Medium</a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/rakshath-/">LinkedIn</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Product Experimentation for AI Rollouts: Why A/B Testing Breaks and How Difference-in-Differences in Python Fixes It ]]>
                </title>
                <description>
                    <![CDATA[ Your team shipped an LLM-based summaries feature to wave 1 workspaces at week 20 and now the post-launch doc is due. You need a causal effect number, a specific estimate you can defend to a statistici ]]>
                </description>
                <link>https://www.freecodecamp.org/news/why-ab-testing-breaks-in-ai-rollouts-and-how-to-fix-it/</link>
                <guid isPermaLink="false">69e94caed5f8830e7dae1569</guid>
                
                    <category>
                        <![CDATA[ product experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ causal inference ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rudrendu Paul ]]>
                </dc:creator>
                <pubDate>Wed, 22 Apr 2026 22:33:18 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/ed63a287-c756-4dfd-a270-3c5f5ee0c1d0.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Your team shipped an LLM-based summaries feature to wave 1 workspaces at week 20 and now the post-launch doc is due. You need a causal effect number, a specific estimate you can defend to a statistician.</p>
<p>The problem is that wave 2 workspaces are still waiting, a product-wide onboarding redesign shipped the same Tuesday, and week 20 also coincided with a quarterly engagement bump. Any comparison between the two groups after week 20 mixes the feature's causal effect with the redesign, the seasonality, and whatever selection criteria determined which workspaces landed in wave 1 in the first place.</p>
<p>This is how most enterprise SaaS teams ship AI features in 2026: one workspace at a time, in waves, on a rollout calendar. Randomization doesn't happen, and because randomization doesn't happen, A/B testing can't give you a clean causal effect. The result is a number on a dashboard that everyone argues over.</p>
<p>Call this the <strong>Rollout Calendar Trap</strong>: you have real data, a real experiment structure, and a completely invalid comparison. For data scientists shipping AI features in waves, it's the primary source of bad causal claims downstream.</p>
<p>Product experimentation for generative AI features follows this exact pattern: the hypothesis is that the AI feature causes higher engagement, and the wave structure is supposed to test it.</p>
<p>The wave calendar replaced the coin flip, and that substitution breaks the math. A simple A/B comparison assumes randomized assignment that the rollout never produced, so the measurement tool fails even when the experiment design is sound.</p>
<p>Difference-in-differences is the causal inference method that fixes this. It subtracts the time trend by comparing how outcomes shift across time periods for each group, giving you a defensible causal estimate even without randomization.</p>
<p>In this tutorial you'll use it to measure the true causal effect of an AI feature rolled out across enterprise workspaces, with working Python code against a synthetic SaaS product dataset.</p>
<p>By the end you'll know how to run a DiD estimate, how to test its parallel-trends assumption, and what to do when that assumption fails.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-why-ab-testing-breaks-for-staged-rollouts">Why A/B Testing Breaks for Staged Rollouts</a></p>
</li>
<li><p><a href="#heading-what-difference-in-differences-does">What Difference-in-Differences Does</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-setting-up-the-working-example">Setting Up the Working Example</a></p>
</li>
<li><p><a href="#heading-step-1-a-simple-2x2-did">Step 1: A Simple 2x2 DiD</a></p>
</li>
<li><p><a href="#heading-step-2-regression-did-with-fixed-effects">Step 2: Regression DiD with Fixed Effects</a></p>
</li>
<li><p><a href="#heading-step-3-checking-the-parallel-trends-assumption">Step 3: Checking the Parallel-Trends Assumption</a></p>
</li>
<li><p><a href="#heading-when-difference-in-differences-fails">When Difference-in-Differences Fails</a></p>
</li>
<li><p><a href="#heading-what-to-do-next">What to Do Next</a></p>
</li>
</ul>
<h2 id="heading-why-ab-testing-breaks-for-staged-rollouts">Why A/B Testing Breaks for Staged Rollouts</h2>
<p>Random assignment is the engine that makes A/B testing a valid causal method. When you flip a coin to decide which user gets the feature, the treatment and control groups end up, in expectation, with the same distribution of every <strong>confounder</strong> (any variable that affects both who gets treatment and what outcome you measure). Any difference in outcomes beyond sampling noise is the causal effect of the treatment. Full stop.</p>
<p>A staged rollout across enterprise workspaces breaks that engine in three ways:</p>
<h4 id="heading-1-the-wave-assignment-isnt-random">1. The wave assignment isn't random</h4>
<p>Product teams choose wave 1 workspaces for various reasons: they have the most engaged admins, the largest seat counts, or the best relationship with customer success. Those reasons correlate directly with your outcome. Wave 1 workspaces were going to show higher engagement anyway, feature or no feature.</p>
<h4 id="heading-2-the-calendar-introduces-a-time-trend">2. The calendar introduces a time trend</h4>
<p>Between week 20 (wave 1 launch) and week 30 (wave 2 launch), your product gets better, your onboarding improves, your sales team lands bigger customers. Any naïve "engagement after week 20 minus engagement before week 20" comparison picks up all of that along with the feature's effect.</p>
<h4 id="heading-3-adoption-inside-treated-workspaces-is-itself-selective">3. Adoption inside treated workspaces is itself selective</h4>
<p>Even inside a workspace that received the feature, not every user turns it on. Power users do, and less engaged users often wait months. Comparing users who used the feature against users who didn't introduces <strong>selection bias</strong>, where the groups differ systematically before you even measure the outcome, on top of the non-random workspace assignment.</p>
<p>A/B testing assumes none of these three problems exist. Staged rollouts guarantee all three. The naïve comparison gives you a number, and that number measures engagement theater.</p>
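<p>To see how much damage the first problem alone can do, here's a small simulation sketch using only the standard library. Everything in it is hypothetical: 10,000 workspaces with made-up baseline engagement, an assumed true effect of 0.05, and no noise beyond the baselines themselves. It compares a coin-flip assignment against a hand-picked wave 1 made up of the most engaged half.</p>
<pre><code class="language-python">import random
import statistics

random.seed(0)
TRUE_EFFECT = 0.05  # assumed effect, in completion-rate points

# Hypothetical baseline engagement for 10,000 workspaces
baselines = [random.uniform(0.4, 0.8) for _ in range(10_000)]


def naive_gap(treated_flags):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    treated = [b + TRUE_EFFECT for b, t in zip(baselines, treated_flags) if t]
    control = [b for b, t in zip(baselines, treated_flags) if not t]
    return statistics.mean(treated) - statistics.mean(control)


# Coin flip: treatment is independent of baseline engagement
coin_flip = [random.random() > 0.5 for _ in baselines]

# Hand-picked wave 1: the most engaged half gets the feature first
cutoff = statistics.median(baselines)
hand_picked = [b > cutoff for b in baselines]

print(f"Coin-flip gap:   {naive_gap(coin_flip):+.3f}")
print(f"Hand-picked gap: {naive_gap(hand_picked):+.3f}")
</code></pre>
<p>The coin-flip gap should land close to the assumed 0.05. The hand-picked gap should come out around 0.25, because it adds roughly 0.20 of pre-existing engagement difference on top of the effect.</p>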
<h2 id="heading-what-difference-in-differences-does">What Difference-in-Differences Does</h2>
<p>Difference-in-differences compares the <em>change</em> in outcomes over time between a treated group and a control group. Subtracting one change from the other cancels any shared time trend (product improvements, seasonality, onboarding changes) because both groups experience it equally, leaving you with just the treatment effect.</p>
<p>Here's a concrete example. Imagine tracking quarterly revenue for coffee shops in two neighborhoods. One neighborhood gets a new competitor in Q3, the other doesn't.</p>
<p>Both neighborhoods experience the same underlying market trends, a local economic upturn, and holiday seasonality. DiD isolates the competitor's impact by subtracting whatever revenue shift happened in both neighborhoods.</p>
<p>Your staged rollout sets up the exact same structure: wave 1 workspaces are the neighborhood with the new entrant, wave 2 is the comparison.</p>
<p>The math formalizes this as a 2x2 table, where rows are groups (treated, control), columns are time periods (pre, post), and each cell holds the mean outcome for that group in that period:</p>
<ul>
<li><p><strong>A</strong> = mean task completion for wave 1 users <em>before</em> week 20 (coffee shops: Q2 revenue, neighborhood with incoming competitor)</p>
</li>
<li><p><strong>B</strong> = mean task completion for wave 1 users <em>after</em> week 20 (coffee shops: Q3 revenue, same neighborhood)</p>
</li>
<li><p><strong>C</strong> = mean task completion for wave 2 users before week 20 (coffee shops: Q2 revenue, the untouched neighborhood)</p>
</li>
<li><p><strong>D</strong> = mean task completion for wave 2 users after week 20 (coffee shops: Q3 revenue, same)</p>
</li>
</ul>
<pre><code class="language-text">                         Pre     Post
Treated (wave 1):         A       B
Control (wave 2):         C       D

Naive post-period gap:   B - D     (contaminated by group differences)
Naive treated change:    B - A     (contaminated by time trend)
DiD:                 (B - A) - (D - C)   ← the causal effect
</code></pre>
<p><code>B - A</code> is wave 1's change, but it includes both the treatment effect and whatever time trend moved everyone. <code>D - C</code> is wave 2's change over the same window, same time trend, no treatment. Subtracting one from the other leaves only the treatment effect.</p>
<p>The <strong>counterfactual</strong> is what wave 1 would have looked like without the treatment. DiD constructs it by saying: wave 1's counterfactual trajectory = wave 1's pre-period level, carried forward with wave 2's post-period trend. The gap between the actual wave 1 trajectory and that counterfactual is the DiD estimate.</p>
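<p>The arithmetic is short enough to check by hand. In the sketch below, the function name <code>did_2x2</code> and the four cell means are hypothetical, chosen only to make the numbers easy to follow:</p>
<pre><code class="language-python">def did_2x2(a, b, c, d):
    """Return the naive gaps, the counterfactual, and the DiD estimate.

    a, b: treated group mean before and after treatment
    c, d: control group mean before and after treatment
    """
    control_change = d - c
    # Treated pre-period level carried forward with the control group's change
    counterfactual_b = a + control_change
    return {
        "naive_post_gap": b - d,
        "naive_treated_change": b - a,
        "counterfactual_b": counterfactual_b,
        "did": b - counterfactual_b,
    }


# Hypothetical cell means
result = did_2x2(a=0.60, b=0.68, c=0.55, d=0.58)
for name, value in result.items():
    print(f"{name}: {value:+.3f}")
</code></pre>
<p>With these inputs the naive post-period gap is 0.10 and the naive treated change is 0.08, but the counterfactual sits at 0.63, so the DiD estimate is 0.05. Note that <code>b - counterfactual_b</code> is algebraically the same as <code>(b - a) - (d - c)</code>.</p>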
<img src="https://raw.githubusercontent.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/main/images/article-1/did_parallel_trends.png" alt="Causal inference with difference-in-differences: parallel trends and treatment effect" style="display:block;margin:0 auto" width="1485" height="807" loading="lazy">

<p><em>Figure 1: Causal inference with difference-in-differences. Blue solid: Wave 1 actual trajectory. Orange dashed: Wave 2 (control, untreated during this window). Blue dotted: the counterfactual, where Wave 1 would have gone based on Wave 2's post-period trend. The green arrow is the DiD estimate: the gap between the actual Wave 1 trajectory and the counterfactual in the post-treatment period. A, B, C, D correspond to the four cells in the table above.</em></p>
<p>Before week 20, wave 1 and wave 2 track each other closely. That's the parallel-trends requirement at work. At week 20, wave 1 pulls ahead of both wave 2 and its own counterfactual (the dotted line). That post-treatment divergence is the DiD estimate.</p>
<p>The DiD estimate handles two types of bias at once. Permanent differences between treated and control groups (wave 1 workspaces were always more engaged) cancel out because DiD focuses on <em>changes</em> in outcomes across time periods. Time trends that affect both groups (product improvements, market seasonality) cancel out because both groups experience them.</p>
<p>DiD asks one thing in return: parallel pre-treatment trends. The treated and control groups have to be moving in the same direction at the same rate before treatment starts. When that holds, you can extrapolate the shared trend forward and attribute any post-treatment divergence to the treatment. If the trends were already diverging before treatment, DiD is biased, and no amount of clever regression fixes it.</p>
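<p>A noise-free sketch makes the size of that bias concrete. The levels and trends below are hypothetical, and <code>did_estimate</code> is an illustrative helper. The only thing that changes between the two calls is whether the treated group has an extra trend of its own:</p>
<pre><code class="language-python">def did_estimate(true_effect, shared_trend, extra_treated_trend):
    """DiD on noise-free cell means built from known ingredients."""
    a, c = 0.60, 0.55  # hypothetical pre-period levels (treated, control)
    b = a + shared_trend + extra_treated_trend + true_effect
    d = c + shared_trend
    return (b - a) - (d - c)


parallel = did_estimate(true_effect=0.05, shared_trend=0.03, extra_treated_trend=0.0)
diverging = did_estimate(true_effect=0.05, shared_trend=0.03, extra_treated_trend=0.02)

print(f"Parallel trends:  {parallel:+.3f}")
print(f"Diverging trends: {diverging:+.3f}")
</code></pre>
<p>Under parallel trends the estimate equals the true 0.05. With a 0.02 differential trend it comes out at 0.07: the bias is exactly the part of the treated group's trend that the control group doesn't share.</p>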
<p>Parallel trends is the assumption you'll test in step 3.</p>
<h3 id="heading-companion-notebook">Companion Notebook</h3>
<p>All the code in this tutorial, including the synthetic dataset, the DiD regression, the parallel-trends plot, and the placebo pre-trend test, lives in a single executable Jupyter notebook in the GitHub repo for this series on product experimentation and causal inference for GenAI and LLM applications.</p>
<p>You can clone it, run <code>generate_data.py</code> once, and every output in this article reproduces exactly: <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/01_did_staged_rollouts">github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm</a></p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You'll need Python 3.11 or newer and comfort with pandas and basic regression. You can follow along without prior causal inference experience, as the article defines confounders and selection bias inline when they first appear. You'll encounter clustered standard errors and fixed effects in step 2. The article explains what they do and why they matter, but it doesn't derive them from scratch.</p>
<p>Install the packages for this tutorial:</p>
<pre><code class="language-bash">pip install numpy pandas statsmodels linearmodels matplotlib
</code></pre>
<p>Clone the companion repo to get the synthetic dataset:</p>
<pre><code class="language-bash">git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv
</code></pre>
<h2 id="heading-setting-up-the-working-example">Setting Up the Working Example</h2>
<p>The dataset simulates a SaaS product with an AI summaries feature launched in two waves: wave 1 workspaces get it at week 20, wave 2 at week 30, with 50,000 users total, each with one row of <a href="https://www.freecodecamp.org/news/how-to-use-opentelemetry/">telemetry</a>.</p>
<p>The data generator bakes in a +5 percentage point causal effect on task completion for users in their workspace's post-treatment period. You know the truth upfront, so you can check whether your DiD estimator actually recovers it.</p>
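<p>The plant-then-recover idea is easy to try on a toy scale. The sketch below is not the repository's <code>generate_data.py</code>. It's a hypothetical generator with made-up base rates that plants a 0.05 effect for wave 1 users from week 20 onward, then checks whether a 2x2 DiD finds it:</p>
<pre><code class="language-python">import random

random.seed(42)
TRUE_EFFECT = 0.05
LAUNCH_WEEK = 20


def simulate_user():
    """One hypothetical user: wave, post-period flag, and a binary outcome."""
    wave = random.choice([1, 2])
    week = random.randrange(0, 30)
    is_post = week >= LAUNCH_WEEK
    base = 0.59 if wave == 1 else 0.55  # permanent group difference
    p = base + (TRUE_EFFECT if (wave == 1 and is_post) else 0.0)
    completed = int(p > random.random())  # 1 with probability p
    return wave, is_post, completed


users = [simulate_user() for _ in range(60_000)]


def cell_mean(wave, post):
    """Mean completion rate for one cell of the 2x2 table."""
    rows = [y for w, is_post, y in users if w == wave and is_post == post]
    return sum(rows) / len(rows)


did = (cell_mean(1, True) - cell_mean(1, False)) - (cell_mean(2, True) - cell_mean(2, False))
print(f"Recovered DiD: {did:+.3f} (planted effect: {TRUE_EFFECT:+.2f})")
</code></pre>
<p>Because the outcome is binary and sampled, the recovered value won't match 0.05 exactly. It should land within a couple of percentage points, which is the same kind of gap you'll see between the estimate and the ground truth on the real dataset.</p>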
<p>Load the data and inspect the structure:</p>
<pre><code class="language-python">import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")
print(df.shape)
print(df[["wave", "signup_week", "workspace_id", "task_completed"]].head())
print("\nWave sizes:", df.wave.value_counts().to_dict())
print("Treatment weeks per wave:",
      df.groupby("wave").treatment_week.first().to_dict())
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">(50000, 16)
   wave  signup_week  workspace_id  task_completed
0     2           10            36               0
1     2           51            44               1
2     2            2            28               1
3     1           15            20               1
4     1           29             0               1
Wave sizes: {2: 25063, 1: 24937}
Treatment weeks per wave: {1: 20, 2: 30}
</code></pre>
<p>Here's what's happening: you load 50,000 rows, one per user. Wave 1 has 24,937 users across 25 workspaces; wave 2 has 25,063 users across 25 different workspaces. The <code>treatment_week</code> column records when each user's workspace got the AI summaries feature (week 20 for wave 1, week 30 for wave 2). The <code>task_completed</code> column is your outcome: did the AI successfully complete the user's task?</p>
<p>One important detail: <code>signup_week</code> in this dataset records which calendar week a user first joined the product, and we're using it as a time index to assign users to pre- or post-treatment cohorts.</p>
<p>A user who signed up in week 22 joined after the feature launched, so their experience is "post-treatment." A user who signed up in week 14 joined before the launch, so their experience is "pre-treatment."</p>
<p>This works here because each user has one row of telemetry tied to their initial product experience. In a panel dataset with multiple observations per user across time, you'd use an observation timestamp column tied to when each row was recorded.</p>
<p>To keep the analysis clean, restrict to users who signed up before the wave 2 launch (<code>signup_week &lt; 30</code>). Wave 2 then works as a proper control group, since it hasn't been treated yet, while wave 1 has been treated for 10 weeks.</p>
<pre><code class="language-python">analysis = df[df.signup_week &lt; 30].copy()
analysis["post"] = (analysis.signup_week &gt;= 20).astype(int)
analysis["treated"] = (analysis.wave == 1).astype(int)

print(analysis.groupby(["treated", "post"])
              .agg(n=("user_id", "count"),
                   mean_completion=("task_completed", "mean"))
              .round(3))
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">                 n  mean_completion
treated post
0       0     9590            0.556
        1     4878            0.555
1       0     9633            0.592
        1     4738            0.643
</code></pre>
<p>Here's what's happening: you filter the data to the analysis window (weeks 0 to 29) and create two indicator variables. <code>post</code> is 1 for users in the post-week-20 period, 0 otherwise. <code>treated</code> is 1 for wave 1 users, 0 for wave 2. The groupby shows the four cells of the DiD 2x2 table: (treated=0, post=0), (treated=0, post=1), (treated=1, post=0), (treated=1, post=1). Those four means are everything you need for a first-pass DiD estimate.</p>
<h2 id="heading-step-1-a-simple-2x2-did">Step 1: A Simple 2x2 DiD</h2>
<p>Start with the cleanest version. Compute the four cell means by hand, then take the difference of differences:</p>
<pre><code class="language-python">cells = analysis.groupby(["treated", "post"]).task_completed.mean()

wave2_pre  = cells.loc[(0, 0)]   # control, pre
wave2_post = cells.loc[(0, 1)]   # control, post
wave1_pre  = cells.loc[(1, 0)]   # treated, pre
wave1_post = cells.loc[(1, 1)]   # treated, post

did_effect = (wave1_post - wave1_pre) - (wave2_post - wave2_pre)
print(f"Wave 1 change: {wave1_post - wave1_pre:+.4f}")
print(f"Wave 2 change: {wave2_post - wave2_pre:+.4f}")
print(f"DiD effect:    {did_effect:+.4f}  (ground truth = +0.05)")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Wave 1 change: +0.0515
Wave 2 change: -0.0013
DiD effect:    +0.0527  (ground truth = +0.05)
</code></pre>
<p>Here's what's happening: you pull the four cell means, compute wave 1's change in task completion from pre to post, compute wave 2's change over the same calendar window (wave 2 hasn't been treated yet), and take the difference. The DiD estimate is the piece of wave 1's change that can't be explained by whatever time trend also moved wave 2.</p>
<p>On this dataset the simple 2x2 estimate lands at +0.053, which is very close to the true +0.05. But you can't take this number to a product review. You have no standard errors, which means you can't say whether +0.053 is a real signal or within sampling noise. You have no covariate adjustment, so if wave 1 happened to have more heavy users in this cohort, some of that +0.053 could be engagement-tier composition. And you have no way to handle the workspace-level correlation in your data. Step 2 fixes all three.</p>
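<p>For a rough sense of the sampling noise, you can compute the standard error the 2x2 estimate would have if every user were independent. This is a sketch under that assumption only: it uses the rounded cell counts and means from the groupby output above, treats the outcome as binary, and ignores both workspace-level correlation and covariates. The helper name <code>naive_did_se</code> is illustrative.</p>
<pre><code class="language-python">import math

# (n, mean) for each cell, copied from the groupby output above
cells = {
    ("control", "pre"): (9590, 0.556),
    ("control", "post"): (4878, 0.555),
    ("treated", "pre"): (9633, 0.592),
    ("treated", "post"): (4738, 0.643),
}


def naive_did_se(cells):
    """Standard error of the 2x2 DiD if every user were independent."""
    # For a binary outcome, the variance of a cell mean is p * (1 - p) / n
    variance = sum(p * (1 - p) / n for n, p in cells.values())
    return math.sqrt(variance)


se = naive_did_se(cells)
print(f"Naive SE: {se:.4f}")
</code></pre>
<p>This comes out at about 0.012. Treat it as a back-of-the-envelope figure and not something to report: the independence assumption doesn't hold when users share a workspace.</p>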
<h2 id="heading-step-2-regression-did-with-fixed-effects">Step 2: Regression DiD with Fixed Effects</h2>
<p>The regression formulation of DiD produces the same point estimate as the 2x2 table when there are no covariates. But it also buys you three things:</p>
<ul>
<li><p><strong>Standard errors and p-values</strong> computed correctly</p>
</li>
<li><p><strong>Covariate adjustment</strong> to reduce variance and sharpen your estimate</p>
</li>
<li><p><strong>Cluster-robust errors</strong> that handle correlation within workspaces, which a staged rollout always has</p>
</li>
</ul>
<p>The regression is: <code>outcome ~ treated + post + treated:post + controls</code>. The coefficient on the <code>treated:post</code> interaction is your DiD estimate.</p>
<pre><code class="language-python">import statsmodels.formula.api as smf

did_model = smf.ols(
    "task_completed ~ treated * post + C(engagement_tier)",
    data=analysis
).fit(
    cov_type="cluster",
    cov_kwds={"groups": analysis.workspace_id}
)

print(did_model.summary().tables[1])
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">================================================================================================
                                   coef    std err          z      P&gt;|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                        0.8301      0.007    126.538      0.000       0.817       0.843
C(engagement_tier)[T.light]     -0.4027      0.006    -63.168      0.000      -0.415      -0.390
C(engagement_tier)[T.medium]    -0.1766      0.007    -25.931      0.000      -0.190      -0.163
treated                          0.0367      0.005      6.885      0.000       0.026       0.047
post                            -0.0056      0.008     -0.684      0.494      -0.022       0.011
treated:post                     0.0541      0.011      4.981      0.000       0.033       0.075
================================================================================================
</code></pre>
<p>Here's what's happening: you fit an ordinary least squares regression of task completion on the <code>treated</code> indicator, the <code>post</code> indicator, their interaction, and a categorical control for engagement tier.</p>
<p>The <code>treated:post</code> coefficient is the DiD estimate. Users in the same workspace share common shocks, which makes their outcomes correlated. Clustering the standard errors by <code>workspace_id</code> (the <code>cov_type="cluster"</code> argument) accounts for that.</p>
<p>On this dataset the <code>treated:post</code> coefficient comes out at +0.054 with a clustered p-value of &lt;0.001. The ground truth is +0.050. At 0.4 percentage points from the true effect, with a standard error that accounts for workspace-level correlation, that's a number you can put in a product review.</p>
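<p>One quick sanity check is to rebuild the 95% interval from the coefficient and its standard error and confirm that the planted effect falls inside it. The sketch below uses the rounded values from the <code>treated:post</code> row and a normal approximation. <code>wald_interval</code> is an illustrative helper, not a statsmodels function:</p>
<pre><code class="language-python">def wald_interval(coef, std_err, z=1.96):
    """Two-sided 95% interval under a normal approximation."""
    return coef - z * std_err, coef + z * std_err


# Rounded values from the treated:post row of the regression table
low, high = wald_interval(coef=0.0541, std_err=0.011)
GROUND_TRUTH = 0.05

print(f"95% CI: [{low:.3f}, {high:.3f}]")
print("Ground truth inside interval:", high >= GROUND_TRUTH >= low)
</code></pre>
<p>Because the inputs are rounded, the bounds can differ from the regression table in the third decimal place. The conclusion doesn't change: the interval excludes zero and contains 0.05.</p>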
<p>A few practical notes on this regression:</p>
<ul>
<li><p><strong>Controls should be time-invariant</strong> (engagement tier, signup cohort). Time-varying controls that are themselves affected by treatment will bias the estimate.</p>
</li>
<li><p><strong>Only the interaction has a causal interpretation.</strong> The intercept and level terms describe baseline differences between groups, nothing more.</p>
</li>
<li><p><strong>Clustered errors are mandatory.</strong> Skip clustering and your standard errors can be 3 to 10x too small, which inflates test statistics and makes results look far more significant than they are.</p>
</li>
</ul>
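<p>To see why only the interaction carries a causal reading, it helps to write out what each coefficient equals when the regression has no controls. In that case the model is saturated, and each coefficient is a simple combination of the four cell means. The sketch below assumes that no-controls case and uses the rounded cell means from the setup section. <code>saturated_coefficients</code> is a hypothetical helper name:</p>
<pre><code class="language-python">def saturated_coefficients(a, b, c, d):
    """Coefficients of outcome ~ treated + post + treated:post, no controls.

    a, b: treated group mean pre and post
    c, d: control group mean pre and post
    """
    return {
        "Intercept": c,                      # control group, pre-period level
        "treated": a - c,                    # baseline gap between groups
        "post": d - c,                       # time trend seen in the control group
        "treated:post": (b - a) - (d - c),   # the DiD estimate
    }


# Rounded cell means from the 2x2 groupby output
coefs = saturated_coefficients(a=0.592, b=0.643, c=0.556, d=0.555)
for name, value in coefs.items():
    print(f"{name:14s}{value:+.3f}")
</code></pre>
<p>The <code>treated</code> and <code>post</code> values here (about +0.036 and -0.001) are just descriptive gaps. The regression above also includes <code>C(engagement_tier)</code>, so its intercept and level terms shift, but the interaction stays close to the 2x2 estimate.</p>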
<h2 id="heading-step-3-checking-the-parallel-trends-assumption">Step 3: Checking the Parallel-Trends Assumption</h2>
<p>DiD is only valid if wave 1 and wave 2 were moving in the same direction at the same rate <em>before</em> treatment started. You check this by plotting (or tabulating) weekly means for the two waves across the pre-treatment window.</p>
<pre><code class="language-python">import matplotlib.pyplot as plt
import numpy as np

df_plot = df[df.signup_week &lt; 30].copy()
weekly = (df_plot.groupby(["signup_week", "wave"])
             .task_completed.mean()
             .reset_index()
             .pivot(index="signup_week", columns="wave", values="task_completed"))

# 3-week rolling average to smooth week-to-week sampling noise
smoothed = weekly.rolling(3, center=True, min_periods=2).mean()

TREATMENT_WEEK = 20
pre_idx = smoothed.index[smoothed.index &lt; TREATMENT_WEEK]
post_idx = smoothed.index[smoothed.index &gt;= TREATMENT_WEEK]

# DiD counterfactual: wave 1 pre-period mean + wave 2's post-period change
wave1_pre_mean = smoothed.loc[pre_idx, 1].mean()
wave2_pre_mean = smoothed.loc[pre_idx, 2].mean()
counterfactual = wave1_pre_mean + (smoothed.loc[post_idx, 2].values - wave2_pre_mean)

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.axvspan(-0.5, TREATMENT_WEEK, alpha=0.04, color="#94A3B8", zorder=0)
ax.axvspan(TREATMENT_WEEK, 29.5, alpha=0.06, color="#3B82F6", zorder=0)
ax.plot(smoothed.index, smoothed[2], "s--", color="#F59E0B", linewidth=2,
        markersize=4, label="Wave 2 — control (untreated during this window)", zorder=3)
ax.plot(smoothed.index, smoothed[1], "o-", color="#2563EB", linewidth=2.2,
        markersize=4, label="Wave 1 — treated (AI feature on at week 20)", zorder=4)
ax.plot(post_idx, counterfactual, ":", color="#2563EB", linewidth=2.2,
        label="Wave 1 counterfactual (projected without treatment)", zorder=4)
ax.axvline(TREATMENT_WEEK, color="#DC2626", linestyle="--", linewidth=1.8,
           label="AI feature launched (week 20)")

ax.text(9.5, 0.508, "Pre-treatment period\n(parallel trends required)",
        fontsize=9, ha="center", color="#64748B", style="italic")
ax.text(24, 0.508, "Post-treatment",
        fontsize=9, ha="center", color="#64748B", style="italic")
ax.set_xlabel("Week", fontsize=11)
ax.set_ylabel("Mean task completion rate", fontsize=11)
ax.set_title("Figure 2: Data-Driven Parallel-Trends Check\n(3-week rolling average, 50k users)",
             fontsize=12, fontweight="bold", pad=14)
ax.legend(loc="upper left", fontsize=9, framealpha=0.92)
ax.set_xlim(-0.5, 29.5)
ax.set_ylim(0.50, 0.72)
ax.grid(True, alpha=0.18, linestyle=":")
ax.tick_params(labelsize=10)
plt.tight_layout()
plt.savefig("parallel_trends.png", dpi=150, bbox_inches="tight")
print("Saved parallel_trends.png")
</code></pre>
<p><strong>Expected output (Figure 2, data-driven verification):</strong></p>
<pre><code class="language-text">Saved parallel_trends.png
</code></pre>
<img src="https://raw.githubusercontent.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/main/images/article-1/parallel_trends.png" alt="Parallel trends visual check, data-driven verification" style="display:block;margin:0 auto" width="1486" height="804" loading="lazy">

<p><em>Figure 2 is the data-driven parallel-trends check from your actual dataset, plotted as a 3-week rolling average to smooth week-to-week sampling noise. Both waves track each other closely before week 20, and small wiggles in the pre-period affect both groups at the same time, which is exactly what parallel trends looks like. After week 20, wave 1 separates cleanly above the dotted counterfactual line. The gap between the solid blue line and the dotted line in the post-treatment window is the DiD estimate playing out in your actual data.</em></p>
<p>Here's what's happening: you group by signup week and wave, compute the mean task completion rate per cell, pivot so each wave is a column, and plot the two time series together.</p>
<p>A vertical dashed line marks week 20 when wave 1 got treatment. In the pre-treatment window (weeks 0 to 19) the two series should track each other closely. After week 20, wave 1 should pull ahead of wave 2 by roughly the treatment effect.</p>
<p>To put a number on it, run a placebo regression on the pre-treatment period only. Regress the outcome on a linear time trend interacted with the treated indicator. If the interaction coefficient is near zero and insignificant, the two groups were moving in parallel before treatment:</p>
<pre><code class="language-python">pre_only = analysis[analysis.post == 0].copy()
pre_only["weeks_since_start"] = pre_only.signup_week - 10  # center

placebo_model = smf.ols(
    "task_completed ~ treated * weeks_since_start + C(engagement_tier)",
    data=pre_only
).fit(
    cov_type="cluster",
    cov_kwds={"groups": pre_only.workspace_id}
)

print("Pre-trend slope difference:",
      placebo_model.params["treated:weeks_since_start"])
print("p-value:",
      placebo_model.pvalues["treated:weeks_since_start"])
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Pre-trend slope difference: -0.00095...
p-value: 0.4435...
</code></pre>
<p>Here's what's happening: you restrict to pre-treatment observations, fit a regression that lets wave 1 and wave 2 follow different linear trends in the pre-period, and read off the interaction coefficient.</p>
<p>A coefficient close to zero with p &gt; 0.05 is consistent with the two waves moving in parallel before treatment. It's a failure to detect a differential trend, not proof that none exists, so read it alongside the plot. If that coefficient is large and statistically significant, the parallel-trends assumption is broken: your DiD estimate is absorbing whatever differential trend separated the groups before week 20.</p>
<p>If the placebo test fails, stop and rethink. Your options: restrict to a narrower pre-window where trends were parallel, find a better control group, or switch to synthetic control, which builds a weighted counterfactual from multiple untreated units.</p>
<p>On this synthetic dataset the placebo test passes: the pre-trend slope difference is -0.00095 with p = 0.44, so there's no evidence against parallel trends and the +0.054 estimate from step 2 can be read as the treatment effect.</p>
<h2 id="heading-when-difference-in-differences-fails">When Difference-in-Differences Fails</h2>
<p>DiD is a precise accounting method, and every precise method has specific failure modes worth knowing before you trust its output. Here are four common ones:</p>
<h3 id="heading-1-non-parallel-pre-trends">1. Non-parallel Pre-trends</h3>
<p>When the treated and control groups were already diverging before treatment started, DiD mistakes that pre-existing drift for a treatment effect.</p>
<p>The placebo test in step 3 is your guard. Run it every time. If it fails, you have three options:</p>
<ol>
<li><p>Restrict the analysis to a shorter pre-window where trends were parallel and re-run the placebo</p>
</li>
<li><p>Find a better control group whose pre-trend matches the treated group</p>
</li>
<li><p>Switch to synthetic control, which builds a weighted counterfactual from multiple untreated units and picks the weights to match the treated group's pre-treatment trajectory</p>
</li>
</ol>
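<p>To see why this failure mode matters, here's a minimal sketch with made-up numbers (not drawn from the tutorial's dataset). Both waves have zero treatment effect, but the treated wave drifts upward slightly faster. The helpers <code>did_estimate</code> and <code>mean</code> are illustrative names, not part of the companion notebook.</p>
<pre><code class="language-python">def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Classic 2x2 difference-in-differences on four cell means."""
    return (treated_post - treated_pre) - (control_post - control_pre)


def mean(values):
    return sum(values) / len(values)


# Hypothetical weekly completion rates: no treatment effect at all, but the
# treated wave drifts upward 0.002 per week faster than the control wave.
TREATMENT_WEEK = 20
weeks = range(30)
control = [0.55 + 0.001 * w for w in weeks]
treated = [0.55 + 0.003 * w for w in weeks]

estimate = did_estimate(
    treated_pre=mean(treated[:TREATMENT_WEEK]),
    treated_post=mean(treated[TREATMENT_WEEK:]),
    control_pre=mean(control[:TREATMENT_WEEK]),
    control_post=mean(control[TREATMENT_WEEK:]),
)
print(f"Spurious DiD estimate: {estimate:+.3f}")  # +0.030
</code></pre>
<p>The 2x2 arithmetic reports a +0.030 "effect" that is entirely pre-existing drift: the midpoints of the pre and post windows are 15 weeks apart, and 15 weeks times the 0.002 slope gap is 0.030.</p>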
<h3 id="heading-2-staggered-adoption">2. Staggered Adoption</h3>
<p>A staged rollout with three or more waves demands a different approach than a clean 2x2. Wave 1 gets treated at week 20, wave 2 at week 30, wave 3 at week 40. Once wave 2 is treated, it's no longer a valid control for wave 1 comparisons that span weeks 30 and beyond. Earlier treated units start acting as controls for later ones, which contaminates the estimate.</p>
<p>That's the Goodman-Bacon decomposition problem, and the standard two-way fixed effects estimator from step 2 will silently absorb it. The Callaway-Sant'Anna estimator (see <a href="https://www.sciencedirect.com/science/article/abs/pii/S0304407620303948">their 2021 paper</a>) fixes this by averaging only the clean 2x2 comparisons and discarding the contaminated ones. The <code>differences</code> package in Python implements it.</p>
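<p>The bookkeeping behind "clean" comparisons can be sketched in a few lines. This isn't the Callaway-Sant'Anna estimator or the <code>differences</code> API, only an illustration of which wave pairs stay uncontaminated when you use not-yet-treated waves as controls. The function name <code>clean_comparisons</code> and the <code>last_week=49</code> window are assumptions made for the example.</p>
<pre><code class="language-python">def clean_comparisons(treatment_week_by_wave, last_week):
    """For each treated wave, list the waves that are still untreated and
    the weeks during which that comparison stays clean."""
    comparisons = []
    waves = sorted(treatment_week_by_wave.items())
    for wave, start in waves:
        for control, control_start in waves:
            if control == wave:
                continue
            # The control is usable from the treated wave's start until the
            # week before the control itself is treated.
            clean_weeks = range(start, min(control_start, last_week + 1))
            if len(clean_weeks) == 0:
                continue  # control was already treated by then
            comparisons.append((wave, control, clean_weeks[0], clean_weeks[-1]))
    return comparisons


# Rollout calendar from the text: wave 1 at week 20, wave 2 at 30, wave 3 at 40.
# last_week=49 is an assumed end of the observation window.
rollout = {1: 20, 2: 30, 3: 40}
for wave, control, first, last in clean_comparisons(rollout, last_week=49):
    print(f"wave {wave} vs wave {control}: weeks {first}-{last}")
</code></pre>
<p>With this calendar, wave 2 is a clean control for wave 1 only through week 29, and wave 3 never gets a clean control because every other wave is already treated by week 40.</p>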
<h3 id="heading-3-time-varying-confounders-that-hit-only-the-treated-group">3. Time-varying Confounders that Hit Only the Treated Group</h3>
<p>If your marketing team runs a targeted campaign in wave 1 workspaces during week 22, you've got a treatment-specific shock DiD can't net out.</p>
<p>Parallel trends certifies the pre-treatment period, but the post-treatment window remains your responsibility to audit.</p>
<p>Check every product or marketing event inside the analysis window. If you find one, the only options are to redesign the study, restrict the analysis to the window before the shock, or model the shock explicitly as a second treatment variable.</p>
<h3 id="heading-4-anticipation-effects">4. Anticipation Effects</h3>
<p>If wave 1 customers knew in week 18 that the feature was coming in week 20, some will have started behaving differently before treatment technically started: signing up more, pre-configuring settings, contacting support. That contaminates the "pre" period. The tell is a bump or dip in wave 1 in the weeks immediately before week 20 on the event-study plot.</p>
<p>The fix is to push the pre-period cutoff back. Treat week 18 as the "treatment" start for purposes of the analysis, which removes the anticipation window from your pre-period baseline.</p>
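<p>As a minimal sketch of that cutoff shift, the helper below recomputes the post-treatment flag with a two-week anticipation window. The function <code>post_indicator</code> and the constant <code>ANTICIPATION_WEEKS</code> are hypothetical names for illustration. In the tutorial's DataFrame you would apply the same rule to <code>signup_week</code> before refitting the step 2 regression.</p>
<pre><code class="language-python">TREATMENT_WEEK = 20      # launch week used throughout the tutorial
ANTICIPATION_WEEKS = 2   # assumption: customers heard about the launch in week 18


def post_indicator(signup_week, treatment_week=TREATMENT_WEEK,
                   anticipation_weeks=ANTICIPATION_WEEKS):
    """Return 1 once the (pulled-back) treatment window has started, else 0."""
    effective_start = treatment_week - anticipation_weeks
    return int(signup_week >= effective_start)


labels = {week: post_indicator(week) for week in (17, 18, 19, 20)}
print(labels)  # {17: 0, 18: 1, 19: 1, 20: 1}
</code></pre>
<p>Weeks 18 and 19 now count as treated, so any anticipation bump no longer inflates the pre-period baseline. The cost is that the estimate averages over two weeks in which the feature wasn't live yet.</p>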
<p>Each of these failure modes has a diagnostic and a specific remedy. Naming them in your analysis builds credibility with skeptical reviewers. DiD is a careful accounting identity – it produces reliable estimates exactly as long as its inputs are clean.</p>
<h2 id="heading-what-to-do-next">What to Do Next</h2>
<p>The regression DiD above is the right tool for a two-wave rollout. If your rollout has three or more waves, switch to the Callaway-Sant'Anna estimator. If your rollout crosses a treatment threshold you set deliberately (confidence scores, query complexity), look into regression discontinuity. If you want to compare a single treated unit against a constructed counterfactual, synthetic control is the right choice.</p>
<p>The <a href="http://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm">companion notebook for this tutorial is here</a>. Clone the repo, generate the synthetic dataset with <code>generate_data.py</code>, and open <code>did_demo.ipynb</code> to reproduce every code block with pre-saved outputs.</p>
<p>If you ship AI features in waves, your rollout calendar is already a DiD study. The only question is whether you run the analysis.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create a GPU-Optimized Machine Image with HashiCorp Packer on GCP ]]>
                </title>
                <description>
                    <![CDATA[ Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers, DCGM, apply OS‑level GPU tuning, and fight dependency issues. Same old ritual every single time, wasting expensi ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-a-gpu-optimized-machine-image-with-hashicorp-packer-on-gcp/</link>
                <guid isPermaLink="false">69e93606d5f8830e7d9fbad6</guid>
                
                    <category>
                        <![CDATA[ GPU ]]>
                    </category>
                
                    <category>
                        <![CDATA[ VM Image ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GCP ]]>
                    </category>
                
                    <category>
                        <![CDATA[ hashicorp packer ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Cloud Computing ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rasheedat Atinuke Jamiu ]]>
                </dc:creator>
                <pubDate>Wed, 22 Apr 2026 20:30:00 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/fd393878-fe7c-458a-addf-7cd22d8280ac.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every time you spin up GPU infrastructure, you do the same thing: install CUDA drivers, DCGM, apply OS‑level GPU tuning, and fight dependency issues. Same old ritual every single time, wasting expensive cloud credits and getting frustrated before actual work begins.</p>
<p>In this article, you'll build a reusable GPU-optimized machine image using Packer, pre-loaded with NVIDIA drivers, CUDA Toolkit, NVIDIA Container Toolkit, DCGM, and system-level GPU tuning like persistence mode.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a href="#heading-step-1-install-packer">Step 1: Install Packer</a></p>
</li>
<li><p><a href="#heading-step-2-set-up-project-directory">Step 2: Set Up Project Directory</a></p>
</li>
<li><p><a href="#heading-step-3-install-packers-plugins">Step 3: Install Packer's Plugins</a></p>
</li>
<li><p><a href="#heading-step-4-define-your-source">Step 4: Define Your Source</a></p>
</li>
<li><p><a href="#heading-step-5-writing-the-build-template">Step 5: Writing the Build Template</a></p>
</li>
<li><p><a href="#heading-step-6-writing-the-gpu-provisioning-script">Step 6: Writing the GPU Provisioning Script</a></p>
<ul>
<li><p><a href="#heading-section-1-pre-installation-kernel-headers">Section 1: Pre-Installation (Kernel Headers)</a></p>
</li>
<li><p><a href="#heading-section-2-installing-nvidias-apt-repository">Section 2: Installing NVIDIA's Apt Repository</a></p>
</li>
<li><p><a href="#heading-section-3-pinning-nvidia-drivers-version">Section 3: Pinning NVIDIA Drivers Version</a></p>
</li>
<li><p><a href="#heading-section-4-installing-the-driver">Section 4: Installing the Driver</a></p>
</li>
<li><p><a href="#heading-section-5-cuda-toolkit-installation">Section 5: CUDA Toolkit Installation</a></p>
</li>
<li><p><a href="#heading-section-6-nvidia-container-toolkit">Section 6: NVIDIA Container Toolkit</a></p>
</li>
<li><p><a href="#heading-section-7-installing-dcgm-data-center-gpu-manager">Section 7: Installing DCGM (Data Center GPU Manager)</a></p>
</li>
<li><p><a href="#heading-section-8-enabling-persistence-mode">Section 8: Enabling Persistence Mode</a></p>
</li>
<li><p><a href="#heading-section-9-system-tuning-for-gpu-compute-workloads">Section 9: System Tuning for GPU Compute Workloads</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-7assembling-and-running-the-build">Step 7: Assembling and Running the Build</a></p>
</li>
<li><p><a href="#heading-step-8-test-the-image-and-verify-the-gpu-stack">Step 8: Test the Image and Verify the GPU Stack</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p><a href="https://www.packer.io/">HashiCorp Packer</a> &gt;= 1.9</p>
</li>
<li><p><a href="https://github.com/hashicorp/packer-plugin-googlecompute">Google Compute Packer plugin</a> (installed via <code>packer init</code>)</p>
</li>
<li><p>Optionally, the <a href="https://github.com/hashicorp/packer-plugin-amazon">AWS Packer plugin</a> can be used for EC2 builds by adding an <code>amazon-ebs</code> source to <code>node.pkr.hcl</code></p>
</li>
<li><p>GCP project with Compute Engine API enabled (or AWS account with EC2 access)</p>
</li>
<li><p>GCP authentication (<code>gcloud auth application-default login</code>) or AWS credentials</p>
</li>
<li><p>Access to an NVIDIA GPU instance type (for example, A100, H100, or L4 on GCP; p4d, p5, or g6 on AWS)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<h3 id="heading-step-1-install-packer">Step 1: Install Packer</h3>
<p>The steps below install Packer on macOS with Homebrew. For Linux and Windows, follow the official installation <a href="https://developer.hashicorp.com/packer/tutorials/docker-get-started/get-started-install-cli#:~:text=Chocolatey%20on%20Windows-,Linux,-HashiCorp%20officially%20maintains">guides</a>.</p>
<p>First, add the HashiCorp tap, Homebrew's repository of HashiCorp packages:</p>
<pre><code class="language-plaintext">$ brew tap hashicorp/tap
</code></pre>
<p>Now, install Packer with <code>hashicorp/tap/packer</code>.</p>
<pre><code class="language-plaintext">$ brew install hashicorp/tap/packer
</code></pre>
<h3 id="heading-step-2-set-up-project-directory">Step 2: Set Up Project Directory</h3>
<p>With Packer installed, create your project directory. To keep concerns separated, each part of the template lives in its own file. Create the <code>packer_demo</code> folder and its files with the command below:</p>
<pre><code class="language-plaintext">mkdir -p packer_demo/script &amp;&amp; touch packer_demo/{build.pkr.hcl,source.pkr.hcl,variable.pkr.hcl,local.pkr.hcl,plugins.pkr.hcl,values.pkrvars.hcl} packer_demo/script/base.sh
</code></pre>
<p>Your file directory should look like this:</p>
<pre><code class="language-plaintext">packer_demo
├── build.pkr.hcl                 # Build pipeline — provisioner ordering
├── source.pkr.hcl                # GCP source definition (googlecompute)
├── variable.pkr.hcl              # Variable definitions with defaults
├── local.pkr.hcl                 # Local values
├── plugins.pkr.hcl               # Packer plugin requirements
├── values.pkrvars.hcl            # Variable values (copy and customize)
└── script/
    └── base.sh                   # GPU provisioning script
</code></pre>
<h3 id="heading-step-3-install-packers-plugins">Step 3: Install Packer's Plugins</h3>
<p>In your <code>plugins.pkr.hcl</code> file, define your plugins in the <code>packer {}</code> block. This block holds Packer settings, including the <code>required_plugins</code> block, which lists every plugin (and its version constraint) the template needs to build your image. If you're on Azure or AWS, you can check for the latest plugin <a href="https://developer.hashicorp.com/packer/integrations">here</a>.</p>
<pre><code class="language-hcl">packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = "~&gt; 1"
    }
  }
}
</code></pre>
<p>Then, initialize your Packer plugin with the command below:</p>
<pre><code class="language-plaintext">packer init .
</code></pre>
<h3 id="heading-step-4-define-your-source">Step 4: Define Your Source</h3>
<p>With your plugin initialized, you can now define your source block. The source block configures a specific builder plugin, which is then invoked by a build block. Here it contains your <code>project_id</code>, the zone where the build VM will be created, the <code>source_image_family</code> (your base image, such as Debian or Ubuntu), and the <code>source_image_project_id</code>.</p>
<p>In GCP, each public image family lives in an image project, such as <code>ubuntu-os-cloud</code> for Ubuntu. Set <code>machine_type</code> to a GPU machine type: the provisioning script installs and starts GPU services, so the build VM needs to be able to run those commands.</p>
<pre><code class="language-hcl">source "googlecompute" "gpu-node" {
  project_id              = var.project_id
  zone                    = var.zone
  source_image_family     = var.image_family
  source_image_project_id = var.image_project_id
  ssh_username            = var.ssh_username
  machine_type            = var.machine_type



  image_name        = var.image_name
  image_description = var.image_description

  disk_size           = var.disk_size
  on_host_maintenance = "TERMINATE"

  tags = ["gpu-node"]

}
</code></pre>
<p>Setting <code>on_host_maintenance = "TERMINATE"</code> tells Compute Engine to stop the VM during host maintenance instead of live-migrating it. VMs with attached GPUs can't be live-migrated, so GPU instances need this setting.</p>
<p>You'll define all your variables in the <code>variable.pkr.hcl</code> file and set their values in <code>values.pkrvars.hcl</code>. Since the values file holds project-specific settings, add <code>values.pkrvars.hcl</code> to your <code>.gitignore</code>.</p>
<pre><code class="language-hcl">variable "image_name" {
  type        = string
  description = "The name of the resulting image"
}

variable "image_description" {
  type        = string
  description = "Description of the image"
}

variable "project_id" {
  type        = string
  description = "The GCP project ID where the image will be created"
}

variable "image_family" {
  type        = string
  description = "The image family to which the resulting image belongs"
}

variable "image_project_id" {
  type        = list(string)
  description = "The project ID(s) to search for the source image"
}

variable "zone" {
  type        = string
  description = "The GCP zone where the build instance will be created"
}

variable "ssh_username" {
  type        = string
  description = "The SSH username to use for connecting to the instance"
}

variable "machine_type" {
  type        = string
  description = "The machine type to use for the build instance"
}

variable "cuda_version" {
  type        = string
  description = "CUDA toolkit version"
  default     = "13.1"
}

variable "driver_version" {
  type        = string
  description = "NVIDIA driver version"
  default     = "590.48.01"
}

variable "disk_size" {
  type        = number
  description = "Boot disk size in GB"
  default     = 50
}
</code></pre>
<p><code>values.pkrvars.hcl</code></p>
<pre><code class="language-hcl">image_name        = "base-gpu-image-{{timestamp}}"
image_description = "Ubuntu 24.04 LTS with gpu drivers and health checks"
project_id        = "your gcp project id"
image_family      = "ubuntu-2404-lts-amd64"
image_project_id  = ["ubuntu-os-cloud"]
zone              = "us-central1-a"
ssh_username      = "packer"
machine_type      = "g2-standard-4"
disk_size         = 50
driver_version    = "590.48.01"
cuda_version      = "13.1"
</code></pre>
<h3 id="heading-step-5-writing-the-build-template">Step 5: Writing the Build Template</h3>
<p>Create <code>build.pkr.hcl</code>. The <code>build</code> block creates a temporary instance, runs provisioners, and produces an image.</p>
<p>Provisioners in this template are organized as follows:</p>
<ul>
<li><p><strong>First provisioner</strong> runs system updates and upgrades.</p>
</li>
<li><p><strong>Second provisioner</strong> reboots the instance (<code>expect_disconnect = true</code>).</p>
</li>
<li><p><strong>Third provisioner</strong> waits for the instance to come back (<code>pause_before</code>), then runs <code>script/base.sh</code>. This provisioner sets <code>max_retries</code> to handle transient SSH timeouts and passes <code>DRIVER_VERSION</code> and <code>CUDA_VERSION</code> to the script as environment variables.</p>
</li>
</ul>
<p>Lastly, you have the post-processor to tell you the image ID and completion status:</p>
<pre><code class="language-hcl">build {
  sources = ["source.googlecompute.gpu-node"]

  provisioner "shell" {
    inline = [
      "set -e",
      "sudo apt update",
      "sudo apt -y dist-upgrade"
    ]
  }

  provisioner "shell" {
    expect_disconnect = true
    inline            = ["sudo reboot"]
  }

  # Base: NVIDIA drivers, CUDA, DCGM
  provisioner "shell" {
    pause_before = "60s"
    script       = "script/base.sh"
    max_retries  = 2
    environment_vars = [
      "DRIVER_VERSION=${var.driver_version}",
      "CUDA_VERSION=${var.cuda_version}"
    ]
  }

  post-processor "shell-local" {
    inline = [
      "echo '=== Image Build Complete ==='",
      "echo 'Image ID: ${build.ID}'",
      "date"
    ]
  }
}
</code></pre>
<h3 id="heading-step-6-writing-the-gpu-provisioning-script">Step 6: Writing the GPU Provisioning Script</h3>
<p>Now we'll go through the base script, and break down some parts of it.</p>
<h3 id="heading-section-1-pre-installation-kernel-headers">Section 1: Pre-Installation (Kernel Headers)</h3>
<p>Before installing NVIDIA drivers, the system needs kernel headers and build tools. The NVIDIA driver compiles a kernel module during installation via DKMS, so if the headers for your running kernel aren't present, the build will fail silently, and the driver won't load on boot.</p>
<pre><code class="language-shellscript">log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget
</code></pre>
<h3 id="heading-section-2-installing-nvidias-apt-repository">Section 2: Installing NVIDIA's Apt Repository</h3>
<p>This snippet downloads and installs NVIDIA’s official keyring package for your Linux distribution and architecture, which adds the trusted signing keys the system needs to verify CUDA packages.</p>
<pre><code class="language-shellscript">log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq
</code></pre>
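<p>To make the URL construction concrete, here's a small Python sketch of how the <code>DISTRO</code> path segment is formed. The full script derives it from <code>/etc/os-release</code> by joining <code>ID</code> and <code>VERSION_ID</code> and dropping the dot. The sample file contents, the helper name <code>cuda_repo_distro</code>, and the <code>x86_64</code> architecture value are assumptions for illustration; the provisioning script itself stays in Bash.</p>
<pre><code class="language-python">def cuda_repo_distro(os_release_text):
    """Join ID and VERSION_ID from os-release text, dropping the dot."""
    fields = {}
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key] = value.strip('"')
    return fields["ID"] + fields["VERSION_ID"].replace(".", "")


# Illustrative os-release excerpt for the Ubuntu 24.04 base image.
sample = 'NAME="Ubuntu"\nID=ubuntu\nVERSION_ID="24.04"\n'
distro = cuda_repo_distro(sample)
arch = "x86_64"  # assumption: the ARCH value is not shown in this section

url = (
    "https://developer.download.nvidia.com/compute/cuda/repos/"
    f"{distro}/{arch}/cuda-keyring_1.1-1_all.deb"
)
print(url)
</code></pre>
<p>For the Ubuntu 24.04 image used in this tutorial, the path segment comes out as <code>ubuntu2404</code>.</p>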
<h3 id="heading-section-3-pinning-nvidia-drivers-version">Section 3: Pinning NVIDIA Drivers Version</h3>
<p>Pinning the NVIDIA driver to a specific version ensures that the system always installs and keeps using exactly that driver version, even when newer drivers appear in the repository.</p>
<p>NVIDIA drivers are tightly coupled with CUDA toolkit versions, kernel versions, and container runtimes such as Docker or the NVIDIA Container Toolkit.</p>
<p>A mismatch, such as the system auto‑upgrading to a newer driver, can cause CUDA to stop working, break GPU acceleration, or make the machine image inconsistent across deployments.</p>
<pre><code class="language-shellscript">log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"
</code></pre>
<h3 id="heading-section-4-installing-the-driver">Section 4: Installing the Driver</h3>
<p>The <code>libnvidia-compute</code> package installs only the compute‑related user‑space libraries (CUDA driver components), while <code>nvidia-dkms-open</code> installs the <strong>open‑source NVIDIA kernel module</strong>, built locally via DKMS.</p>
<p>Together, these two packages give you a fully functional CUDA driver environment without any GUI or graphics dependencies.</p>
<p>Here, we're using <strong>NVIDIA’s compute‑only driver stack with the open‑source kernel modules</strong>, which deliberately skips the display-related components a headless GPU server doesn't need.</p>
<p>Because the kernel module is built through DKMS, it fits the distro's packaging model and is rebuilt when the kernel is updated, while the install stays lightweight and compute-focused.</p>
<pre><code class="language-shellscript">log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open
</code></pre>
<h3 id="heading-section-5-cuda-toolkit-installation">Section 5: CUDA Toolkit Installation</h3>
<p>This part of the script installs the <strong>CUDA Toolkit</strong> for the specified version and then makes sure that CUDA’s executables and libraries are available system‑wide for every user and every shell session.</p>
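<p>It adds CUDA binaries to <code>PATH</code>, so commands like <code>nvcc</code>, <code>cuda-gdb</code>, and <code>compute-sanitizer</code> work without full paths (<code>compute-sanitizer</code> replaces the older <code>cuda-memcheck</code>, which was removed in CUDA 12). It also adds CUDA libraries to <code>LD_LIBRARY_PATH</code> and registers them with <code>ldconfig</code>, so applications can find CUDA’s shared libraries at runtime.</p>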
<pre><code class="language-shellscript">log "Installing CUDA Toolkit ${CUDA_VERSION}..."
# NVIDIA's apt packages use hyphens in the version (for example cuda-toolkit-13-1)
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION//./-}"

# Persist CUDA paths for all users and sessions
cat &lt;&lt;'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig
</code></pre>
<h3 id="heading-section-6-nvidia-container-toolkit">Section 6: NVIDIA Container Toolkit</h3>
<p>This block installs the NVIDIA Container Toolkit and configures it so that containers (Docker or containerd) can access the GPU safely and correctly. It’s a critical step for Kubernetes GPU nodes, Docker GPU workloads, and any system that needs GPU acceleration inside containers.</p>
<pre><code class="language-shellscript">log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi
</code></pre>
<h3 id="heading-section-7-installing-dcgm-data-center-gpu-manager">Section 7: Installing DCGM (Data Center GPU Manager)</h3>
<p>This section covers the installation and validation of NVIDIA DCGM (Data Center GPU Manager), which is NVIDIA’s official management and telemetry framework for data center GPUs.</p>
<p>It offers health monitoring and diagnostics, telemetry (including temperature, clocks, power, and utilization), error reporting, and integration with Kubernetes, Prometheus, and monitoring agents. Your GPU monitoring stack relies on this.</p>
<p>The script extracts the installed version and enforces the <strong>minimum required version</strong> for NVIDIA driver 590+, failing the build if DCGM is older than 4.3. This prevents a mismatch between the GPU driver and DCGM, which would break monitoring and health checks. It also enables Fabric Manager for NVLink/NVSwitch systems if the service is present, which applies to multi‑GPU topologies such as A100/H100 DGX machines or other multi‑GPU servers.</p>
<pre><code class="language-shellscript">log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager

DCGM_VER=$(dpkg -s datacenter-gpu-manager 2&gt;/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] &amp;&amp; [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi
</code></pre>
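<p>The version gate in that script boils down to "strip the epoch, then compare major and minor against 4.3". As a sketch of the same logic outside Bash, here it is as a tuple comparison in Python. The helper names and the sample version strings are made up for illustration and aren't real package listings.</p>
<pre><code class="language-python">MINIMUM_DCGM = (4, 3)  # minimum stated in the script's error message


def parse_major_minor(package_version):
    """Drop an optional Debian epoch ("1:") and return (major, minor) as ints."""
    upstream = package_version.split(":", 1)[-1]
    major, minor = upstream.split(".")[:2]
    return int(major), int(minor)


def meets_minimum(package_version, minimum=MINIMUM_DCGM):
    # Tuples compare element by element: major first, then minor.
    return parse_major_minor(package_version) >= minimum


# Illustrative version strings only.
for candidate in ("1:3.3.9", "1:4.2.3-2", "1:4.3.1-1", "5.0.0"):
    print(candidate, meets_minimum(candidate))
</code></pre>
<p>Comparing <code>(major, minor)</code> as a pair avoids the nested condition the Bash version needs, and it handles a future major version (5.0) without extra cases.</p>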
<h3 id="heading-section-8-enabling-persistence-mode">Section 8: Enabling Persistence Mode</h3>
<p>Without persistence mode, the NVIDIA driver tears down GPU state when the last client process exits. When a new workload starts, the driver must reinitialize the GPU and set up memory mappings again. This adds a delay of a few hundred milliseconds to several seconds, depending on the GPU and system.</p>
<p>Enabling nvidia‑persistenced keeps the NVIDIA driver loaded in memory even when no GPU workloads are running.</p>
<pre><code class="language-shellscript">log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced
</code></pre>
<h3 id="heading-section-9-system-tuning-for-gpu-compute-workloads">Section 9: System Tuning for GPU Compute Workloads</h3>
<p>This block applies a set of <strong>system‑level performance and stability tunings</strong> that are standard for high‑performance GPU servers, Kubernetes GPU nodes, and ML/AI workloads.</p>
<p>Each line targets a specific bottleneck or instability pattern that appears in real GPU production environments.</p>
<ul>
<li><p>Swap and memory behavior: Disabling swap and setting <code>vm.swappiness=0</code> prevents the kernel from pushing GPU‑bound processes into swap. GPU workloads are extremely sensitive to latency, and swapping can cause CUDA context resets and GPU driver timeouts.</p>
</li>
<li><p>Hugepages for large memory allocations: Setting <code>vm.nr_hugepages=2048</code> allocates a pool of hugepages, which reduces TLB pressure for large contiguous memory allocations.</p>
<p>CUDA, NCCL, and deep‑learning frameworks frequently allocate large buffers, and hugepages reduce page‑table overhead, improving memory bandwidth and lowering latency for large tensor operations. This is especially useful on multi‑GPU servers.</p>
</li>
<li><p>CPU frequency governor: Installing <code>cpupower</code> and forcing the CPU governor to <code>performance</code> ensures the CPU stays at maximum frequency instead of scaling down.</p>
<p>GPU workloads often become CPU‑bound during data preprocessing, kernel launches, and NCCL communication. Keeping CPUs at full speed reduces jitter and improves throughput.</p>
</li>
<li><p>NUMA and topology tools: Installing <code>numactl</code>, <code>libnuma-dev</code>, and <code>hwloc</code> provides tools for pinning processes to NUMA nodes, understanding CPU–GPU affinity, and optimizing multi‑GPU placement.</p>
</li>
<li><p>Disabling irqbalance: Stopping and disabling <code>irqbalance</code> lets the NVIDIA driver manage interrupt affinity. On GPU servers, irqbalance can move GPU interrupts to suboptimal CPUs, causing higher latency and lower throughput.</p>
</li>
</ul>
<pre><code class="language-shell">log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system
</code></pre>
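<p>To make the hugepages setting concrete: <code>vm.nr_hugepages</code> counts pages, not bytes, so the memory it reserves depends on the hugepage size. The sketch below does that arithmetic, assuming the common x86_64 default of 2048 kB per hugepage (check <code>Hugepagesize</code> in <code>/proc/meminfo</code> on your image). The function name <code>hugepage_pool_mib</code> is illustrative and isn't part of <code>base.sh</code>.</p>
<pre><code class="language-shell">#!/bin/bash
# Illustrative helper (not part of base.sh): how much RAM does a
# hugepage pool reserve?
# Args: number of pages, hugepage size in kB (assumed default: 2048 kB).
hugepage_pool_mib() {
  local pages="$1"
  local size_kb="${2:-2048}"
  echo $(( pages * size_kb / 1024 ))
}

# Example: the 2048 pages configured above, at 2048 kB each.
hugepage_pool_mib 2048 2048   # prints 4096 (MiB), i.e. 4 GiB
</code></pre>
<p>Memory reserved this way is unavailable to ordinary allocations, so on smaller machine types it's worth sizing the pool to the instance rather than copying the value as-is.</p>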
<p>Full base.sh script here:</p>
<pre><code class="language-shell">#!/bin/bash
set -euo pipefail

log()   { echo "[BASE] $1"; }
error() { echo "[BASE][ERROR] $1" &gt;&amp;2; exit 1; }

###############################################################
# 0. Required variables
###############################################################
[[ -z "${DRIVER_VERSION:-}" ]] &amp;&amp; error "DRIVER_VERSION is not set."
[[ -z "${CUDA_VERSION:-}"   ]] &amp;&amp; error "CUDA_VERSION is not set."

log "DRIVER_VERSION : ${DRIVER_VERSION}"
log "CUDA_VERSION   : ${CUDA_VERSION}"

DISTRO=$(. /etc/os-release &amp;&amp; echo "${ID}${VERSION_ID}" | tr -d '.')
ARCH="x86_64"

export DEBIAN_FRONTEND=noninteractive

###############################################################
# 1. System update
###############################################################
log "Updating system packages..."
sudo apt-get update -qq
sudo apt-get upgrade -qq -y

###############################################################
# 2. Pre-installation — kernel headers
#    Source: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
###############################################################
log "Installing kernel headers and build tools..."
sudo apt-get install -qq -y \
  "linux-headers-$(uname -r)" \
  build-essential \
  dkms \
  curl \
  wget

###############################################################
# 3. NVIDIA CUDA Network Repository
###############################################################
log "Adding NVIDIA CUDA apt repository (${DISTRO})..."
wget -q "https://developer.download.nvidia.com/compute/cuda/repos/${DISTRO}/${ARCH}/cuda-keyring_1.1-1_all.deb" \
  -O /tmp/cuda-keyring.deb
sudo dpkg -i /tmp/cuda-keyring.deb
rm /tmp/cuda-keyring.deb
sudo apt-get update -qq

###############################################################
# 4. Pin driver version BEFORE installation (590+ requirement)
###############################################################
log "Pinning driver to version ${DRIVER_VERSION}..."
sudo apt-get install -qq -y "nvidia-driver-pinning-${DRIVER_VERSION}"

###############################################################
# 5. Compute-only (headless) driver — Open Kernel Modules
#    Source: NVIDIA Driver Installation Guide — Compute-only System (Open Kernel Modules)
#
#    libnvidia-compute  = compute libraries only (no GL/Vulkan/display)
#    nvidia-dkms-open   = open-source kernel module built via DKMS
#
#    Open kernel modules are the NVIDIA-recommended choice for
#    Ampere, Hopper, and Blackwell data centre GPUs (A100, H100, etc.)
###############################################################
log "Installing NVIDIA compute-only driver (open kernel modules)..."
sudo apt-get -V install -y \
  libnvidia-compute \
  nvidia-dkms-open

###############################################################
# 6. CUDA Toolkit
###############################################################
log "Installing CUDA Toolkit ${CUDA_VERSION}..."
sudo apt-get install -qq -y "cuda-toolkit-${CUDA_VERSION}"

# Persist CUDA paths for all users and sessions
cat &lt;&lt;'EOF' | sudo tee /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig

###############################################################
# 7. NVIDIA Container Toolkit
#    Required for GPU workloads in Docker / containerd / Kubernetes
###############################################################
log "Installing NVIDIA Container Toolkit..."
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update -qq
sudo apt-get install -qq -y nvidia-container-toolkit

# Configure for containerd (primary Kubernetes runtime)
sudo nvidia-ctk runtime configure --runtime=containerd

# Configure for Docker if present on this image
if systemctl list-unit-files | grep -q "^docker.service"; then
  sudo nvidia-ctk runtime configure --runtime=docker
fi

###############################################################
# 8. DCGM — DataCenter GPU Manager
###############################################################
log "Installing DCGM..."
sudo apt-get install -qq -y datacenter-gpu-manager
 
DCGM_VER=$(dpkg -s datacenter-gpu-manager 2&gt;/dev/null | awk '/^Version:/{print $2}' | sed 's/^[0-9]*://')
DCGM_MAJOR=$(echo "${DCGM_VER}" | cut -d. -f1)
DCGM_MINOR=$(echo "${DCGM_VER}" | cut -d. -f2)
if [[ "${DCGM_MAJOR}" -lt 4 ]] || { [[ "${DCGM_MAJOR}" -eq 4 ]] &amp;&amp; [[ "${DCGM_MINOR}" -lt 3 ]]; }; then
  error "DCGM ${DCGM_VER} is below the 4.3 minimum required for driver 590+. Check your CUDA repo."
fi
log "DCGM installed: ${DCGM_VER}"

sudo systemctl enable nvidia-dcgm
sudo systemctl start  nvidia-dcgm

# Fabric Manager — only needed for NVLink/NVSwitch GPUs (A100/H100 multi-GPU nodes)
if systemctl list-unit-files | grep -q "^nvidia-fabricmanager.service"; then
  log "Enabling nvidia-fabricmanager for NVLink GPUs..."
  sudo systemctl enable nvidia-fabricmanager
  sudo systemctl start  nvidia-fabricmanager
fi

###############################################################
# 9. NVIDIA Persistence Daemon
#    Keeps the driver loaded between jobs — reduces cold-start
#    latency on the first CUDA call in each new workload
###############################################################
log "Enabling nvidia-persistenced..."
sudo systemctl enable nvidia-persistenced
sudo systemctl start  nvidia-persistenced

###############################################################
# 10. System tuning for GPU compute workloads
###############################################################
log "Applying system tuning..."

# Disable swap (critical for Kubernetes scheduler and ML stability)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
echo "vm.swappiness=0"     | sudo tee /etc/sysctl.d/99-gpu-swappiness.conf

# Hugepages — reduces TLB pressure for large memory allocations
echo "vm.nr_hugepages=2048" | sudo tee /etc/sysctl.d/99-gpu-hugepages.conf

# CPU performance governor
sudo apt-get install -qq -y linux-tools-common "linux-tools-$(uname -r)" || true
sudo cpupower frequency-set -g performance || true

# NUMA and topology tools for GPU affinity tuning
sudo apt-get install -qq -y numactl libnuma-dev hwloc

# Disable irqbalance — let NVIDIA driver manage interrupt affinity
sudo systemctl disable irqbalance || true
sudo systemctl stop    irqbalance || true

# Apply all sysctl settings now
sudo sysctl --system

###############################################################
# Done
###############################################################
log "============================================"
log "Base layer provisioning complete."
log "  OS      : ${DISTRO}"
log "  Driver  : ${DRIVER_VERSION} (open kernel modules, compute-only)"
log "  CUDA    : cuda-toolkit-${CUDA_VERSION}"
log "  DCGM    : ${DCGM_VER}"
log "============================================"
</code></pre>
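<p>The DCGM version gate in step 8 of the script is the easiest part to get subtly wrong, so it can help to isolate it. Here's a minimal sketch of the same comparison as a reusable function. The name <code>version_at_least</code> is hypothetical, and it strips anything up to the first colon as the epoch, which is similar to (but slightly looser than) the <code>sed</code> step in the script.</p>
<pre><code class="language-shell">#!/bin/bash
# Illustrative helper (not part of base.sh): succeed when a Debian-style
# version string is at least MAJOR.MINOR. An epoch prefix such as "1:" is
# stripped first.
version_at_least() {
  local ver="${1#*:}"        # drop "epoch:" if present
  local want_major="$2"
  local want_minor="$3"
  local major="${ver%%.*}"   # text before the first dot
  local rest="${ver#*.}"     # text after the first dot
  local minor="${rest%%.*}"  # text before the next dot

  if [ "$major" -gt "$want_major" ]; then return 0; fi
  if [ "$major" -lt "$want_major" ]; then return 1; fi
  [ "$minor" -ge "$want_minor" ]
}

# Example: check two version strings against a 4.3 minimum.
if version_at_least "1:4.3.1" 4 3; then echo "ok"; else echo "too old"; fi
if version_at_least "3.3.9" 4 3; then echo "ok"; else echo "too old"; fi
</code></pre>
<p>The first call prints <code>ok</code> and the second prints <code>too old</code>. Comparing major and minor as integers avoids the string-comparison trap where <code>4.10</code> sorts before <code>4.3</code>.</p>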
<h2 id="heading-step-7-assembling-and-running-the-build">Step 7: Assembling and Running the Build</h2>
<p>Validate the template first, then run the build. Validation catches syntax or variable errors early, so the build doesn’t start on a broken config.</p>
<pre><code class="language-shellscript">packer validate -var-file=values.pkrvars.hcl .
</code></pre>
<p>If validation succeeds, you’ll see a short confirmation like <code>The configuration is valid.</code> After that, start the build. Expect the process to create a temporary VM, run your provisioners, and produce an image:</p>
<pre><code class="language-plaintext">packer build -var-file=values.pkrvars.hcl .
</code></pre>
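<p>If you run these two commands often, you may want to chain them so a build never starts on an invalid template. The following is one possible wrapper, not part of the original project. <code>PACKER_BIN</code>, <code>VAR_FILE</code>, and <code>validate_then_build</code> are names assumed for this sketch.</p>
<pre><code class="language-shell">#!/bin/bash
# Illustrative wrapper: only build when validation passes.
# PACKER_BIN and VAR_FILE are assumptions for this sketch; override as needed.
PACKER_BIN="${PACKER_BIN:-packer}"
VAR_FILE="${VAR_FILE:-values.pkrvars.hcl}"

validate_then_build() {
  if "$PACKER_BIN" validate -var-file="$VAR_FILE" .; then
    "$PACKER_BIN" build -var-file="$VAR_FILE" .
  else
    echo "Validation failed; build skipped."
    return 1
  fi
}

# Usage (run from the Packer project directory):
#   validate_then_build
</code></pre>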
<p>The build typically takes <strong>15–20 minutes</strong>, depending on network speed and package installs. Watch the Packer log for three key checkpoints:</p>
<ul>
<li><p><strong>Instance creation</strong> — confirms the temporary VM was provisioned.</p>
</li>
<li><p><strong>Provisioner output</strong> — shows each script step (updates, reboot, <code>script/base.sh</code>) and any errors.</p>
</li>
<li><p><strong>Image creation</strong> — indicates the build finished and an image artifact was written.</p>
</li>
</ul>
<p>If the build fails, copy the failing provisioner’s log lines and re-run the build after fixing the script or variables. For quick troubleshooting, re-run the failing provisioner locally on a matching test VM to iterate faster.</p>
<pre><code class="language-plaintext">googlecompute.gpu-node: output will be in this color.

==&gt; googlecompute.gpu-node: Checking image does not exist...
==&gt; googlecompute.gpu-node: Creating temporary RSA SSH key for instance...
==&gt; googlecompute.gpu-node: no persistent disk to create
==&gt; googlecompute.gpu-node: Using image: ubuntu-2404-noble-amd64-v20260225
==&gt; googlecompute.gpu-node: Creating instance...
==&gt; googlecompute.gpu-node: Loading zone: us-central1-a
==&gt; googlecompute.gpu-node: Loading machine type: g2-standard-4
==&gt; googlecompute.gpu-node: Requesting instance creation...
==&gt; googlecompute.gpu-node: Waiting for creation operation to complete...
==&gt; googlecompute.gpu-node: Instance has been created!
==&gt; googlecompute.gpu-node: Waiting for the instance to become running...
==&gt; googlecompute.gpu-node: IP: 34.58.58.214
==&gt; googlecompute.gpu-node: Using SSH communicator to connect: 34.58.58.214
==&gt; googlecompute.gpu-node: Waiting for SSH to become available...
systemd-logind.service
==&gt; googlecompute.gpu-node:  systemctl restart unattended-upgrades.service
==&gt; googlecompute.gpu-node:
==&gt; googlecompute.gpu-node: No containers need to be restarted.
==&gt; googlecompute.gpu-node:
==&gt; googlecompute.gpu-node: User sessions running outdated binaries:
==&gt; googlecompute.gpu-node:  packer @ session #1: sshd[1535]
==&gt; googlecompute.gpu-node:  packer @ user manager service: systemd[1540]
==&gt; googlecompute.gpu-node: Pausing 1m0s before the next provisioner...
==&gt; googlecompute.gpu-node: Provisioning with shell script: script/base.sh
==&gt; googlecompute.gpu-node: [BASE] DRIVER_VERSION : 590.48.01
==&gt; googlecompute.gpu-node: [BASE] CUDA_VERSION   : 13.1
==&gt; googlecompute.gpu-node: [BASE] Updating system packages...
==&gt; googlecompute.gpu-node: [BASE] Installing kernel headers and build tools...
==&gt; googlecompute.gpu-node: [BASE] Installing CUDA Toolkit 13.1...
==&gt; googlecompute.gpu-node: [BASE] Installing DCGM...
==&gt; googlecompute.gpu-node: [BASE] Enabling nvidia-persistenced...
==&gt; googlecompute.gpu-node: [BASE] Applying system tuning...
==&gt; googlecompute.gpu-node: vm.swappiness=0
==&gt; googlecompute.gpu-node: vm.nr_hugepages=2048
==&gt; googlecompute.gpu-node: Setting cpu: 0
==&gt; googlecompute.gpu-node: Error setting new values. Common errors:
==&gt; googlecompute.gpu-node: [BASE] ============================================
==&gt; googlecompute.gpu-node: [BASE] Base layer provisioning complete.
==&gt; googlecompute.gpu-node: [BASE]   OS      : ubuntu2404
==&gt; googlecompute.gpu-node: [BASE]   Driver  : 590.48.01 (open kernel modules, compute-only)
==&gt; googlecompute.gpu-node: [BASE]   CUDA    : cuda-toolkit-13.1
==&gt; googlecompute.gpu-node: [BASE]   DCGM    : 1:3.3.9
==&gt; googlecompute.gpu-node: [BASE] ============================================
==&gt; googlecompute.gpu-node: Deleting instance...
==&gt; googlecompute.gpu-node: Instance has been deleted!
==&gt; googlecompute.gpu-node: Creating image...
==&gt; googlecompute.gpu-node: Deleting disk...
==&gt; googlecompute.gpu-node: Disk has been deleted!
==&gt; googlecompute.gpu-node: Running post-processor:  (type shell-local)
==&gt; googlecompute.gpu-node (shell-local): Running local shell script: 
==&gt; googlecompute.gpu-node (shell-local): === Image Build Complete ===
==&gt; googlecompute.gpu-node (shell-local): Image ID: packer-69b6c2ee-883a-3602-7bb5-059f1ba27c8b
==&gt; googlecompute.gpu-node (shell-local): Sun Mar 15 15:50:09 WAT 2026
Build 'googlecompute.gpu-node' finished after 17 minutes 55 seconds.

==&gt; Wait completed after 17 minutes 55 seconds

==&gt; Builds finished. The artifacts of successful builds are:
--&gt; googlecompute.gpu-node: A disk image was created in the 'my_project-00000' project: base-gpu-image-1773585134
</code></pre>
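<p>The three checkpoints can also be checked mechanically against a saved log. This sketch assumes you captured the build output to a file (the name <code>packer-build.log</code> is hypothetical), and it uses marker strings taken from the sample log above. Packer's wording may differ between versions, so treat the markers as adjustable.</p>
<pre><code class="language-shell">#!/bin/bash
# Illustrative check (assumes the build output was saved to a file, for
# example with: packer build ... | tee packer-build.log).
# The marker strings are taken from the sample log above.
check_checkpoints() {
  local log_file="$1"
  local missing=0
  local marker
  for marker in \
    "Instance has been created!" \
    "Provisioning with shell script" \
    "Creating image..."; do
    if grep -qF -- "$marker" "$log_file"; then
      echo "found:   $marker"
    else
      echo "missing: $marker"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_checkpoints packer-build.log
</code></pre>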
<h2 id="heading-step-8-test-the-image-and-verify-the-gpu-stack">Step 8: Test the Image and Verify the GPU Stack</h2>
<p>Confirm the image exists in the GCP Console: <strong>Compute → Storage → Images</strong> and locate your newly created OS image.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/90f304eb-3fe7-4304-b2ad-d86701dde607.png" alt="Your Image information on GCP" style="display:block;margin:0 auto" width="1686" height="692" loading="lazy">

<p>Create a test VM from the image:</p>
<pre><code class="language-plaintext">gcloud compute instances create my-gpu-vm \
  --machine-type=g2-standard-4 \
  --accelerator=count=1,type=nvidia-l4 \
  --image=base-gpu-image-1772718104 \
  --image-project=YOUR_PROJECT_ID \
  --boot-disk-size=50GB \
  --maintenance-policy=TERMINATE \
  --restart-on-failure \
  --zone=us-central1-a

Created [https://www.googleapis.com/compute/v1/projects/my-project-000/zones/us-central1-a/instances/my-gpu-vm].
NAME       ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP    EXTERNAL_IP      STATUS
my-gpu-vm  us-central1-a  g2-standard-4               10.128.15.227  104.154.184.217  RUNNING
</code></pre>
<p>Once the instance is <code>RUNNING</code>, SSH into it and run <code>nvidia-smi</code> to verify that the NVIDIA driver and GPU are visible:</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/364df8fc-7584-40df-8ab7-b3fe349d5065.png" alt="Output from the Nvidia-SMI command showing Driver and CUDA Version" style="display:block;margin:0 auto" width="1508" height="630" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/0912c303-3bb0-47fa-aa34-1c91ff26874f.png" alt="Image verifying the persistence mode is enabled" style="display:block;margin:0 auto" width="1508" height="80" loading="lazy">

<p><strong>The</strong> <code>nvidia-smi</code> <strong>output confirms:</strong></p>
<ul>
<li><p>Driver 590.48.01 loaded</p>
</li>
<li><p>CUDA 13.1 available</p>
</li>
<li><p>Persistence Mode is <code>On</code></p>
</li>
<li><p>The L4 GPU is detected with 23GB VRAM</p>
</li>
<li><p>Zero ECC errors</p>
</li>
<li><p>No running processes (clean idle state).</p>
</li>
</ul>
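<p>The same facts can be read in a script-friendly form using the <code>nvidia-smi</code> query interface. The sketch below checks a single CSV line for the expected driver version and persistence state. It assumes the fields are separated by a comma and a space, and the function name <code>gpu_line_ok</code> is illustrative.</p>
<pre><code class="language-shell">#!/bin/bash
# Illustrative check: compare one CSV line from nvidia-smi against the
# driver version you expect. On the VM, the line would come from:
#   nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader
gpu_line_ok() {
  local line="$1"
  local want_driver="$2"
  case "$line" in
    *", ${want_driver}, Enabled"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on the test VM (590.48.01 is the driver version used in this build):
#   line="$(nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader)"
#   gpu_line_ok "$line" "590.48.01"
</code></pre>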
<p>This is exactly what a healthy base image should look like. Notice <code>Disp.A: Off</code>? That confirms our compute-only driver choice is working — no display adapter is active.</p>
<p>Confirm the installed CUDA toolkit by running <code>nvcc --version</code>. The output should show version 13.1, as specified.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/cc744624-9408-4348-88d7-61da04b5e1d0.png" alt="Output from the NVCC -Version command" style="display:block;margin:0 auto" width="1508" height="202" loading="lazy">

<p>Let's confirm DCGM installation by running <code>dcgmi discovery -l</code>. Successful output indicates DCGM is running and communicating with the driver.</p>
<img src="https://cdn.hashnode.com/uploads/covers/5eacc4c926e78ca711dfbbdc/114996c6-1f28-43d4-a3fa-13aa7ccd2c82.png" alt="Output from the dcgmi discovery -l command showing device information" style="display:block;margin:0 auto" width="1508" height="714" loading="lazy">
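<p>The checks so far cover the driver, CUDA, and DCGM, but not the OS tuning that <code>base.sh</code> applied. Here is a small sketch for spot-checking those settings on the test VM. The helper name <code>expect_equal</code> is hypothetical. Note that the kernel may allocate fewer hugepages than requested if memory is fragmented, so a mismatch there isn't always a script bug.</p>
<pre><code class="language-shell">#!/bin/bash
# Illustrative check for the tuning applied by base.sh.
expect_equal() {
  local label="$1" expected="$2" actual="$3"
  if [ "$expected" = "$actual" ]; then
    echo "OK   $label=$actual"
  else
    echo "FAIL $label: expected $expected, got $actual"
    return 1
  fi
}

# Usage on the test VM:
#   expect_equal vm.swappiness 0 "$(cat /proc/sys/vm/swappiness)"
#   expect_equal vm.nr_hugepages 2048 "$(cat /proc/sys/vm/nr_hugepages)"
#   expect_equal irqbalance inactive "$(systemctl is-active irqbalance)"
</code></pre>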

<h2 id="heading-conclusion">Conclusion</h2>
<p>You now have a production‑grade, GPU‑optimized base image that includes the NVIDIA compute‑only driver built with open kernel modules, DCGM for monitoring, and the CUDA Toolkit. You also applied OS‑level tuning tailored to GPU compute workloads, providing a consistent, reproducible environment with no manual setup.</p>
<p>From here, you can extend the build by adding an application‑layer script to install frameworks such as PyTorch, TensorFlow, or vLLM, or create an instance template that uses this image to scale your GPU infrastructure.</p>
<p>The full Packer project includes additional scripts for training and inference workloads that you can use to extend your image.</p>
<h2 id="heading-references">References</h2>
<ul>
<li><p>NVIDIA Driver Installation Guide (Ubuntu): <a href="https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/">https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/</a></p>
</li>
<li><p>NVIDIA CUDA Toolkit Documentation: <a href="https://docs.nvidia.com/cuda/">https://docs.nvidia.com/cuda/</a></p>
</li>
<li><p>NVIDIA Container Toolkit Installation Guide: <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html">https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html</a></p>
</li>
<li><p>NVIDIA DCGM Documentation: <a href="https://docs.nvidia.com/datacenter/dcgm/latest/index.html">https://docs.nvidia.com/datacenter/dcgm/latest/index.html</a></p>
</li>
<li><p>NVIDIA Persistence Daemon: <a href="https://docs.nvidia.com/deploy/driver-persistence/index.html">https://docs.nvidia.com/deploy/driver-persistence/index.html</a></p>
</li>
<li><p>HashiCorp Packer Documentation: <a href="https://developer.hashicorp.com/packer/docs">https://developer.hashicorp.com/packer/docs</a></p>
</li>
<li><p>Packer Google Compute Builder: <a href="https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute">https://developer.hashicorp.com/packer/integrations/hashicorp/googlecompute</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Context Hub (chub) to Build a Companion Relevance Engine
 ]]>
                </title>
                <description>
                    <![CDATA[ Large language models can write code quickly, but they still misremember APIs, miss version-specific details, and forget what they learned at the end of a session. That is the problem Context Hub is t ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-context-hub-chub-to-build-a-companion-relevance-engine/</link>
                <guid isPermaLink="false">69e299d0fd22b8ad6276817b</guid>
                
                    <category>
                        <![CDATA[ context-hub ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Developer Tools ]]>
                    </category>
                
                    <category>
                        <![CDATA[ search ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agentic AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agents ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Open Source ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nataraj Sundar ]]>
                </dc:creator>
                <pubDate>Fri, 17 Apr 2026 20:36:32 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/14f9768e-436d-4c7e-b86c-3d380e821354.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Large language models can write code quickly, but they still misremember APIs, miss version-specific details, and forget what they learned at the end of a session.</p>
<p>That is the problem Context Hub is trying to solve.</p>
<p>Context Hub (<code>chub</code>) gives coding agents curated, versioned documentation and skills that they can search and fetch through a CLI. It also gives them two learning loops: local annotations for agent memory and feedback for maintainers.</p>
<p>In this tutorial, you'll learn how the official <code>chub</code> workflow works, how Context Hub organizes docs and skills, how annotations and feedback create a memory loop, and how to build a <a href="https://github.com/natarajsundar/context-hub-relevance-engine/">companion relevance engine</a> that improves retrieval without breaking the upstream content model.</p>
<p>This tutorial uses two public repositories side by side:</p>
<ul>
<li><p>the official upstream project: <a href="https://github.com/andrewyng/context-hub">andrewyng/context-hub</a></p>
</li>
<li><p>the companion implementation for this article: <a href="https://github.com/natarajsundar/context-hub-relevance-engine/">natarajsundar/context-hub-relevance-engine</a></p>
</li>
</ul>
<p>I've also opened a corresponding upstream pull request from my fork to the main project. If you want to track that work from the article, use the upstream pull request list filtered by author: <a href="https://github.com/andrewyng/context-hub/pulls?q=is%3Apr+author%3Anatarajsundar">andrewyng/context-hub pull requests by <code>natarajsundar</code></a>.</p>
<h2 id="heading-what-well-build">What We'll Build</h2>
<p>By the end of this tutorial, you'll have:</p>
<ul>
<li><p>a clear mental model for how Context Hub works</p>
</li>
<li><p>a working local install of the official <code>chub</code> CLI</p>
</li>
<li><p>a repeatable workflow for search, fetch, annotations, and feedback</p>
</li>
<li><p>a companion repo that adds an additive reranking layer on top of a Context-Hub-style content tree</p>
</li>
<li><p>a small benchmark and local comparison UI you can run end to end</p>
</li>
<li><p>a clear bridge between the companion repo and the smaller upstream PR</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you start, make sure you have:</p>
<ul>
<li><p>Node.js 18 or newer</p>
</li>
<li><p>npm</p>
</li>
<li><p>comfort with the terminal</p>
</li>
<li><p>basic familiarity with Markdown</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-how-to-understand-context-hub">How to Understand Context Hub</a></p>
</li>
<li><p><a href="#heading-how-to-understand-the-official-repo-the-companion-repo-and-the-upstream-pr">How to Understand the Official Repo, the Companion Repo, and the Upstream PR</a></p>
</li>
<li><p><a href="#heading-how-to-install-and-use-the-official-cli">How to Install and Use the Official CLI</a></p>
</li>
<li><p><a href="#heading-how-to-understand-docs-skills-and-the-content-layout">How to Understand Docs, Skills, and the Content Layout</a></p>
</li>
<li><p><a href="#heading-how-to-use-incremental-fetch-and-layered-sources">How to Use Incremental Fetch and Layered Sources</a></p>
</li>
<li><p><a href="#heading-how-to-use-annotations-and-feedback-to-create-a-memory-loop">How to Use Annotations and Feedback to Create a Memory Loop</a></p>
</li>
<li><p><a href="#heading-how-to-see-where-relevance-still-misses">How to See Where Relevance Still Misses</a></p>
</li>
<li><p><a href="#heading-how-the-companion-relevance-engine-improves-retrieval">How the Companion Relevance Engine Improves Retrieval</a></p>
</li>
<li><p><a href="#heading-how-to-run-the-companion-repo-end-to-end">How to Run the Companion Repo End to End</a></p>
</li>
<li><p><a href="#heading-how-to-read-the-benchmark-honestly">How to Read the Benchmark Honestly</a></p>
</li>
<li><p><a href="#heading-how-to-connect-the-companion-repo-to-the-upstream-pr">How to Connect the Companion Repo to the Upstream PR</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-sources">Sources</a></p>
</li>
</ol>
<h2 id="heading-how-to-understand-context-hub">How to Understand Context Hub</h2>
<p>Context Hub is easiest to understand as a workflow for turning fast-moving documentation into a reliable input for coding agents.</p>
<p>Instead of asking an agent to rely on whatever it remembers from training data, you give it a predictable contract:</p>
<ol>
<li><p>search for the right entry</p>
</li>
<li><p>fetch the right doc or skill</p>
</li>
<li><p>write code against that curated content</p>
</li>
<li><p>save local lessons as annotations</p>
</li>
<li><p>send doc-quality feedback back to maintainers</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/09d75c85-fbb0-4c9a-86d5-8acdff4e1abf.png" alt="Diagram showing the Context Hub loop from developer prompt to agent search and fetch, then annotations and maintainer feedback." style="display:block;margin:0 auto" width="1654" height="307" loading="lazy">

<p>That system boundary matters.</p>
<p>It makes the agent easier to audit, easier to improve, and easier to extend. It also keeps the interface small enough that you can reason about where the failures happen. If the agent still misses the answer, you can ask whether the problem happened during search, fetch, context selection, or generation.</p>
<h2 id="heading-how-to-understand-the-official-repo-the-companion-repo-and-the-upstream-pr">How to Understand the Official Repo, the Companion Repo, and the Upstream PR</h2>
<p>This tutorial is intentionally split across two codebases and one contribution path.</p>
<p>The official upstream project, <a href="https://github.com/andrewyng/context-hub">andrewyng/context-hub</a>, is the source of truth for the real CLI, the content model, and the documented workflows. That's the codebase you should use to learn how <code>chub</code> works today.</p>
<p>The companion repository, <a href="https://github.com/natarajsundar/context-hub-relevance-engine/">natarajsundar/context-hub-relevance-engine</a>, is where the relevance ideas in this article are made concrete. It's a companion implementation, not a replacement product. Its job is to make retrieval tradeoffs visible, measurable, and easy to run locally.</p>
<p>The upstream PR is the bridge between those two worlds. The companion repo is where you can iterate faster on benchmarks, reranking, and the comparison UI. The upstream PR is where the smallest reviewable slices can be proposed back to the main project. You can track that thread here: <a href="https://github.com/andrewyng/context-hub/pulls?q=is%3Apr+author%3Anatarajsundar">upstream PR search filtered by author</a>.</p>
<p>That three-part framing keeps the article honest:</p>
<ul>
<li><p><strong>use the upstream repo</strong> to understand the current system</p>
</li>
<li><p><strong>use the companion repo</strong> to explore relevance improvements end to end</p>
</li>
<li><p><strong>use the upstream PR</strong> to show how a larger idea can be broken into reviewable pieces</p>
</li>
</ul>
<h2 id="heading-how-to-install-and-use-the-official-cli">How to Install and Use the Official CLI</h2>
<p>The official quick start is intentionally small.</p>
<pre><code class="language-bash">npm install -g @aisuite/chub
</code></pre>
<p>Once the CLI is installed, you can search for what is available and fetch a specific entry:</p>
<pre><code class="language-bash">chub search openai
chub get openai/chat --lang py
</code></pre>
<p>That's the happy path, but it helps to think through the request flow.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/c5ff71d4-5e51-48b8-bbd3-fc2aafa93b9d.png" alt="Sequence diagram showing the developer asking the agent for current docs, the agent calling chub search and chub get, and the CLI fetching docs from the registry." style="display:block;margin:0 auto" width="1416" height="683" loading="lazy">

<p>In practice, the most useful detail is that the CLI is designed for the <strong>agent</strong> to use, not just for the human to use by hand.</p>
<p>That's why the upstream CLI also ships a <code>get-api-docs</code> skill. For example, if you use Claude Code, you can copy the skill into your local project like this:</p>
<pre><code class="language-bash">mkdir -p .claude/skills
cp "$(npm root -g)/@aisuite/chub/skills/get-api-docs/SKILL.md" \
  .claude/skills/get-api-docs.md
</code></pre>
<p>That step teaches the agent a retrieval habit:</p>
<blockquote>
<p>Before you write code against a third-party SDK or API, use <code>chub</code> instead of guessing.</p>
</blockquote>
<p>That behavioral rule is often as important as the docs themselves.</p>
<h2 id="heading-how-to-understand-docs-skills-and-the-content-layout">How to Understand Docs, Skills, and the Content Layout</h2>
<p>Context Hub separates content into two categories:</p>
<ul>
<li><p><strong>docs</strong>, which answer “what should the agent know?”</p>
</li>
<li><p><strong>skills</strong>, which answer “how should the agent behave?”</p>
</li>
</ul>
<p>That distinction makes the content model easier to scale. Docs can be versioned and language-specific. Skills can stay short and operational.</p>
<p>The directory structure is also predictable. The content guide organizes entries by author, then by <code>docs</code> or <code>skills</code>, then by entry name.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/3ac72bc2-c869-4e2e-9294-d63b35991135.png" alt="Diagram showing the content tree from author to docs and skills, with DOC.md and SKILL.md feeding a build step that emits registry and search artifacts." style="display:block;margin:0 auto" width="674" height="739" loading="lazy">

<p>A small example looks like this:</p>
<pre><code class="language-text">author/docs/payments/python/DOC.md
author/docs/payments/python/references/errors.md
author/skills/login-flows/SKILL.md
</code></pre>
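<p>If you want to see that shape on disk, the sketch below creates the same three paths under a directory of your choice. The function name <code>scaffold_example_tree</code> is hypothetical, and the files are empty placeholders. Real entries need content and whatever metadata the upstream content guide requires.</p>
<pre><code class="language-bash">#!/bin/bash
# Illustrative scaffold: create the example layout above under a root
# directory. The author and entry names are the placeholder names from the
# example, not real registry entries.
scaffold_example_tree() {
  local root="$1"
  mkdir -p "$root/author/docs/payments/python/references"
  mkdir -p "$root/author/skills/login-flows"
  touch "$root/author/docs/payments/python/DOC.md"
  touch "$root/author/docs/payments/python/references/errors.md"
  touch "$root/author/skills/login-flows/SKILL.md"
}

# Usage:
#   scaffold_example_tree ./content
#   find ./content -type f | sort
</code></pre>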
<p>This is one of the reasons Context Hub is easy to work with.</p>
<p>The shape of the content is plain Markdown, the main entry file is predictable, and the build output is inspectable. You don't have to reverse engineer a hidden prompt layer to figure out what the agent is reading.</p>
<h2 id="heading-how-to-use-incremental-fetch-and-layered-sources">How to Use Incremental Fetch and Layered Sources</h2>
<p>One of the best design choices in Context Hub is that it doesn't force you to inject every file into the model on every request.</p>
<p>Instead, the entry file gives you the overview, and the reference files hold the deeper material.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/88d80a48-c991-495a-af25-14a0c0ac9868.png" alt="Diagram showing how chub get can fetch just the main entry file, a specific reference file, or the full entry directory." style="display:block;margin:0 auto" width="592" height="460" loading="lazy">

<p>That lets you fetch content in progressively larger slices.</p>
<pre><code class="language-bash">chub get stripe/webhooks --lang py
chub get stripe/webhooks --lang py --file references/raw-body.md
chub get stripe/webhooks --lang py --full
</code></pre>
<p>This is a token-budget feature as much as it is a documentation feature. A good agent should first load the overview, decide what part of the task matters, and only then fetch the specific supporting file.</p>
<p>Context Hub also supports layered sources. You can merge public content with your own local build output through <code>~/.chub/config.yaml</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/67465254-7a7c-4cfc-b9f0-9e94d8c3e2f3.png" alt="Diagram showing community, official, and local team sources merging into one search surface for chub search and chub get." style="display:block;margin:0 auto" width="774" height="460" loading="lazy">

<p>A minimal configuration looks like this:</p>
<pre><code class="language-yaml">sources:
  - name: community
    url: https://cdn.aichub.org/v1
  - name: my-team
    path: /opt/team-docs/dist
</code></pre>
<p>That means you can keep public docs in one lane and team-specific runbooks in another lane while still giving the agent one search surface.</p>
<h2 id="heading-how-to-use-annotations-and-feedback-to-create-a-memory-loop">How to Use Annotations and Feedback to Create a Memory Loop</h2>
<p>Context Hub has two different improvement channels.</p>
<p>Annotations are local. They help your agent remember what worked last time. Feedback is shared. It helps maintainers improve the docs for everyone.</p>
<p>That distinction matters because not every lesson belongs in the shared registry. Some lessons are environment-specific. Others point to content quality issues that should be fixed centrally.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/a8514430-08cb-4085-8047-64df25c603c7.png" alt="Diagram showing the agent fetch/write cycle, then branching to local annotations or maintainer feedback before the next task." style="display:block;margin:0 auto" width="808" height="798" loading="lazy">

<p>Here is what local memory looks like in practice:</p>
<pre><code class="language-bash">chub annotate stripe/webhooks \
  "Remember: Flask request.data must stay raw for Stripe signature verification."
</code></pre>
<p>And here's the feedback path:</p>
<pre><code class="language-bash">chub feedback stripe/webhooks up
</code></pre>
<p>That loop is simple, but it's one of the most important ideas in the project. It turns a one-off debugging lesson into either persistent local memory or a signal that the shared docs need to improve.</p>
<h2 id="heading-how-to-see-where-relevance-still-misses">How to See Where Relevance Still Misses</h2>
<p>The upstream project already has a real ranking story. It uses BM25 and lexical rescue so that package-like identifiers, exact tokens, and fuzzy matches still have a chance to surface.</p>
<p>That is a strong baseline.</p>
<p>But developer queries are often much messier than package names.</p>
<p>People search for:</p>
<ul>
<li><p><code>rrf</code></p>
</li>
<li><p><code>signin</code></p>
</li>
<li><p><code>pg vector</code></p>
</li>
<li><p><code>hnsw</code></p>
</li>
<li><p><code>raw body stripe</code></p>
</li>
</ul>
<p>Those aren't “bad” queries. They're realistic shorthand.</p>
<p>And they expose an opportunity in the content model itself: many of the exact answers live in reference files such as <code>references/rrf.md</code>, <code>references/raw-body.md</code>, and <code>references/hnsw.md</code>.</p>
<p>So the question is not whether the current search works at all. It clearly does. The better question is this:</p>
<blockquote>
<p>How can you improve retrieval without breaking the content contract that already makes Context Hub useful?</p>
</blockquote>
<p>The answer in the companion repo is to keep the current model and add a reranking layer on top of it.</p>
<h2 id="heading-how-the-companion-relevance-engine-improves-retrieval">How the Companion Relevance Engine Improves Retrieval</h2>
<p>The companion repository in this article is <a href="https://github.com/natarajsundar/context-hub-relevance-engine/"><code>context-hub-relevance-engine</code></a>.</p>
<p>It keeps the same broad ideas that make Context Hub attractive:</p>
<ul>
<li><p>plain Markdown content</p>
</li>
<li><p><code>DOC.md</code> and <code>SKILL.md</code> entry points</p>
</li>
<li><p>build artifacts you can inspect</p>
</li>
<li><p>local annotations and feedback</p>
</li>
<li><p>progressive fetch behavior</p>
</li>
</ul>
<p>Then it adds one new build artifact: <code>signals.json</code>.</p>
<p>At build time, the engine extracts extra signals such as:</p>
<ul>
<li><p>headings from the main file</p>
</li>
<li><p>titles and tokens from reference files</p>
</li>
<li><p>language and version metadata</p>
</li>
<li><p>source metadata and freshness</p>
</li>
<li><p>annotation overlap</p>
</li>
<li><p>feedback priors</p>
</li>
</ul>
<p>The first pass stays cheap and transparent. The reranker only runs after the baseline has done its work.</p>
<img src="https://cdn.hashnode.com/uploads/covers/694ca88d5ac09a5d68c63854/2ed2dadb-8fff-41ee-904b-0792cafcf744.png" alt="Diagram showing the relevance pipeline from query to BM25 and lexical rescue, then synonym expansion, candidate set building, reranking signals, and final results." style="display:block;margin:0 auto" width="1399" height="541" loading="lazy">

<p>That approach matters for two reasons.</p>
<p>First, it's additive. You don't have to redesign the content tree.</p>
<p>Second, it's measurable. You can define concrete failure modes, fix them one by one, and run the same benchmark every time you change the scorer.</p>
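<p>To make the additive idea concrete, here is a minimal Python sketch. It is not the companion repo's implementation (that project is written as Node scripts). The <code>Entry</code> shape, the token-overlap baseline, and the bonus value of <code>2.0</code> are all assumptions chosen for illustration. It only shows how a signal from a reference-file title can rescue a query that the description-only pass misses.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One registry entry plus extra build-time signals (hypothetical shape)."""
    entry_id: str
    description: str
    reference_titles: list[str] = field(default_factory=list)


def baseline_score(query: str, entry: Entry) -> float:
    """Stand-in for the first pass: count query tokens found in the description."""
    words = set(entry.description.lower().split())
    return float(sum(1 for token in query.lower().split() if token in words))


def rerank_bonus(query: str, entry: Entry) -> float:
    """Additive signal: reward an exact match on a reference-file title."""
    titles = {title.lower() for title in entry.reference_titles}
    return 2.0 if query.lower() in titles else 0.0


def search(query: str, entries: list[Entry]) -> list[tuple[str, float]]:
    """Score every entry, drop zero scores, and sort best-first."""
    scored = [(e.entry_id, baseline_score(query, e) + rerank_bonus(query, e)) for e in entries]
    return sorted([s for s in scored if s[1] > 0], key=lambda s: s[1], reverse=True)


# Illustrative entries; descriptions and reference titles are made up for this sketch.
entries = [
    Entry("langchain/retrievers", "Composable retrieval patterns for hybrid search", ["rrf"]),
    Entry("stripe/webhooks", "Verify webhook signatures", ["raw-body"]),
]
print(search("rrf", entries))
```

<p>Because the bonus is added on top of the baseline score rather than replacing it, the content tree and the first-pass scorer stay untouched, which is the property the section above describes.</p>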
<h2 id="heading-how-to-run-the-companion-repo-end-to-end">How to Run the Companion Repo End to End</h2>
<p>Open the repository on <a href="https://github.com/natarajsundar/context-hub-relevance-engine/">GitHub</a>, clone it using GitHub’s normal clone flow, and then run the commands below from the project root.</p>
<pre><code class="language-bash">cd context-hub-relevance-engine
npm install
npm run build
npm test
</code></pre>
<p>The repository has no third-party runtime dependencies, so <code>npm install</code> is mostly there to keep the workflow familiar. The main commands are all plain Node scripts.</p>
<h3 id="heading-how-to-reproduce-a-baseline-miss">How to Reproduce a Baseline Miss</h3>
<p>Start with the query <code>rrf</code>.</p>
<pre><code class="language-bash">node bin/chub-lab.mjs search rrf --mode baseline --lang python
</code></pre>
<p>Expected output:</p>
<pre><code class="language-text">No results.
</code></pre>
<p>Now run the improved mode.</p>
<pre><code class="language-bash">node bin/chub-lab.mjs search rrf --mode improved --lang python
</code></pre>
<p>Expected top result:</p>
<pre><code class="language-text">langchain/retrievers [doc] score=320.24
  Composable retrieval patterns for hybrid search, parent documents, query expansion, and reranking.
</code></pre>
<p>That win happens because the improved mode looks beyond the top-level entry description. It also sees the reference file title <code>rrf</code>, the related terms from query expansion, and the broader token overlap in the extracted signals.</p>
<h3 id="heading-how-to-reproduce-a-workflow-intent-win">How to Reproduce a Workflow-intent Win</h3>
<p>Try a sign-in query.</p>
<pre><code class="language-bash">node bin/chub-lab.mjs search signin --mode baseline
node bin/chub-lab.mjs search signin --mode improved
</code></pre>
<p>The baseline misses. The improved mode returns <code>playwright-community/login-flows</code> because the reranker treats <code>signin</code>, <code>sign in</code>, <code>login</code>, and <code>authentication</code> as related intent.</p>
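<p>One way to picture that related-intent behavior is a small expansion table. The Python sketch below is an assumption-heavy illustration: the <code>INTENT_GROUPS</code> table, the helper names, and the sample descriptions are hypothetical, and the companion repo may expand queries differently.</p>

```python
# Hypothetical synonym groups; the companion repo's real expansion table may differ.
INTENT_GROUPS = [
    {"signin", "sign in", "login", "authentication"},
]


def expand_query(query: str) -> set[str]:
    """Return the query plus every term that shares an intent group with it."""
    normalized = " ".join(query.lower().split())
    expanded = {normalized}
    for group in INTENT_GROUPS:
        if normalized in group:
            expanded |= group
    return expanded


def matches(query: str, text: str) -> bool:
    """True when any expanded term appears in the text."""
    lowered = text.lower()
    return any(term in lowered for term in expand_query(query))


# Made-up descriptions for illustration.
print(matches("signin", "Reusable login flows for browser tests"))  # True
print(matches("signin", "Verify webhook signatures"))               # False
```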
<h3 id="heading-how-to-test-the-memory-loop">How to Test the Memory Loop</h3>
<p>Write a local note:</p>
<pre><code class="language-bash">node bin/chub-lab.mjs annotate stripe/webhooks \
  "Remember: Flask request.data must stay raw for Stripe signature verification."
</code></pre>
<p>Then fetch the doc:</p>
<pre><code class="language-bash">node bin/chub-lab.mjs get stripe/webhooks --lang python
</code></pre>
<p>You will see the main doc content, the list of available reference files, and the appended annotation.</p>
<p>That's the behavior you want from an agent memory loop: learn once, reuse many times.</p>
<h3 id="heading-how-to-run-the-benchmark">How to Run the Benchmark</h3>
<p>Start from an empty store:</p>
<pre><code class="language-bash">npm run reset-store
node bin/chub-lab.mjs evaluate
</code></pre>
<p>The included synthetic stress set reports the following summary with an empty store:</p>
<table>
<thead>
<tr>
<th>Mode</th>
<th>Top-1 Accuracy</th>
<th>MRR</th>
</tr>
</thead>
<tbody><tr>
<td>baseline</td>
<td>0.333</td>
<td>0.333</td>
</tr>
<tr>
<td>improved</td>
<td>1.000</td>
<td>1.000</td>
</tr>
</tbody></table>
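<p>If the two metrics are unfamiliar, the sketch below shows how they are commonly computed. The three-query rank lists are hypothetical and are not read from the repo's benchmark files. They are simply one input that reproduces the numbers in the table, with a miss counted as a reciprocal rank of 0.</p>

```python
from typing import Optional


def top1_accuracy(ranks: list[Optional[int]]) -> float:
    """Fraction of queries whose expected entry came back at rank 1."""
    return sum(1 for rank in ranks if rank == 1) / len(ranks)


def mean_reciprocal_rank(ranks: list[Optional[int]]) -> float:
    """Average of 1/rank, counting a miss (None) as 0."""
    return sum(1.0 / rank for rank in ranks if rank) / len(ranks)


# Hypothetical data: rank of the expected entry per query, None for a miss.
baseline_ranks = [1, None, None]
improved_ranks = [1, 1, 1]

print(round(top1_accuracy(baseline_ranks), 3), round(mean_reciprocal_rank(baseline_ranks), 3))
print(round(top1_accuracy(improved_ranks), 3), round(mean_reciprocal_rank(improved_ranks), 3))
```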
<p>You can also seed the store and rerun the evaluation:</p>
<pre><code class="language-bash">npm run seed-demo
node bin/chub-lab.mjs evaluate
</code></pre>
<p>That demonstrates how annotations and feedback can push relevant entries even higher when the query overlaps with the agent’s own history.</p>
<h3 id="heading-how-to-launch-the-local-comparison-ui">How to Launch the Local Comparison UI</h3>
<pre><code class="language-bash">npm run serve
</code></pre>
<p>Then open <code>http://localhost:8787</code> in your browser.</p>
<p>The UI lets you compare baseline and improved retrieval, inspect stored annotations and feedback, rebuild the local artifacts, and rerun the benchmark from one place.</p>
<h2 id="heading-how-to-read-the-benchmark-honestly">How to Read the Benchmark Honestly</h2>
<p>The benchmark in this repo is intentionally small.</p>
<p>That is a feature, not a flaw.</p>
<p>The point is not to claim universal search quality. The point is to make a handful of realistic failure modes easy to reproduce:</p>
<ul>
<li><p>acronym queries</p>
</li>
<li><p>shorthand workflow queries</p>
</li>
<li><p>reference-file topic queries</p>
</li>
<li><p>memory-aware reranking</p>
</li>
</ul>
<p>That keeps the evaluation honest.</p>
<p>If a future scoring change breaks <code>rrf</code>, <code>signin</code>, or <code>raw body stripe</code>, you'll know immediately. And if you add a stronger dataset later, you can keep these tests as regression guards.</p>
<p>The benchmark files included in the repo are:</p>
<ul>
<li><p><code>demo/benchmark.json</code></p>
</li>
<li><p><code>docs/benchmark-empty-store.json</code></p>
</li>
<li><p><code>docs/benchmark-seeded-store.json</code></p>
</li>
<li><p><code>docs/relevance-improvement-plan.md</code></p>
</li>
</ul>
<h2 id="heading-how-to-connect-the-companion-repo-to-the-upstream-pr">How to Connect the Companion Repo to the Upstream PR</h2>
<p>A good companion repo is broad enough to explore ideas quickly. A good upstream PR is narrow enough to review.</p>
<p>That's why the two shouldn't be identical.</p>
<p>The companion repository is where you can keep the full relevance story together:</p>
<ul>
<li><p>the local comparison UI</p>
</li>
<li><p>the synthetic benchmark</p>
</li>
<li><p>the richer reranking signals</p>
</li>
<li><p>the debug and explain surfaces</p>
</li>
<li><p>the documentation that walks through tradeoffs end to end</p>
</li>
</ul>
<p>The upstream PR should be smaller and more surgical. In practice, that usually means proposing the most reviewable slices first, such as:</p>
<ol>
<li><p>reference-file signal extraction</p>
</li>
<li><p>explainable score output for debugging</p>
</li>
<li><p>a lightweight benchmark fixture format</p>
</li>
<li><p>one additive reranking hook behind a flag</p>
</li>
</ol>
<p>That keeps the main repository maintainable while still letting the article and companion repo tell the full engineering story. The upstream thread for this work lives here: <a href="https://github.com/andrewyng/context-hub/pulls?q=is%3Apr+author%3Anatarajsundar">andrewyng/context-hub pull requests by <code>natarajsundar</code></a>.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>What makes Context Hub interesting is not just that it stores documentation. It gives you a clear system boundary for improving coding agents.</p>
<p>You can inspect what the agent reads. You can decide when it should retrieve. You can layer public and private sources. You can persist local lessons. And you can improve ranking without tearing the whole model apart.</p>
<p>The companion relevance engine shows how to keep what already works, make one part of the system measurably better, and package the result in a way other developers can run, inspect, and extend. The upstream PR, in turn, shows how to turn a broad idea into smaller pieces that are realistic to review in the main project.</p>
<h2 id="heading-diagram-attribution">Diagram Attribution</h2>
<p>All diagrams used in this article were created by the author specifically for this tutorial and its companion repository.</p>
<h2 id="heading-sources">Sources</h2>
<ul>
<li><p><a href="https://github.com/andrewyng/context-hub">Context Hub repository</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/README.md">Context Hub README</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/cli/README.md">Context Hub CLI README</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/docs/cli-reference.md">Context Hub CLI reference</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/docs/content-guide.md">Context Hub content guide</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/docs/byod-guide.md">Context Hub bring-your-own-docs guide</a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/blob/main/docs/feedback-and-annotations.md">Context Hub feedback and annotations guide</a></p>
</li>
<li><p><a href="https://github.com/natarajsundar/context-hub-relevance-engine/">Companion repository: <code>context-hub-relevance-engine</code></a></p>
</li>
<li><p><a href="https://github.com/andrewyng/context-hub/pulls?q=is%3Apr+author%3Anatarajsundar">Upstream pull request search filtered by author</a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Fashion App That Helps You Organize Your Wardrobe  ]]>
                </title>
                <description>
                    <![CDATA[ I used to spend too long deciding what to wear, even when my closet was full. That frustration made the problem feel very clear to me: it was not about having fewer clothes. It was about having better ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-fashion-app-to-organize-your-wardrobe/</link>
                <guid isPermaLink="false">69de6abf91716f3cfb5448a1</guid>
                
                    <category>
                        <![CDATA[ webdev ]]>
                    </category>
                
                    <category>
                        <![CDATA[ JavaScript ]]>
                    </category>
                
                    <category>
                        <![CDATA[ React ]]>
                    </category>
                
                    <category>
                        <![CDATA[ full stack ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mokshita V P ]]>
                </dc:creator>
                <pubDate>Tue, 14 Apr 2026 16:26:39 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/bf593ff6-6de8-4b30-ab0a-700c3410ccb1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>I used to spend too long deciding what to wear, even when my closet was full.</p>
<p>That frustration made the problem feel very clear to me: it was not about having fewer clothes. It was about having better organization, better visibility, and better guidance when making outfit decisions.</p>
<p>So I built a fashion web app that helps users organize their wardrobe, get outfit suggestions, evaluate shopping decisions, and improve recommendations over time using feedback.</p>
<p>In this article, I’ll walk through what the app does, how I built it, the decisions I made along the way, and the challenges that shaped the final result.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-table-of-contents">Table of Contents</a></p>
</li>
<li><p><a href="#heading-what-the-app-does">What the App Does</a></p>
</li>
<li><p><a href="#heading-why-i-built-it">Why I Built It</a></p>
</li>
<li><p><a href="#heading-tech-stack">Tech Stack</a></p>
</li>
<li><p><a href="#heading-product-walkthrough-what-users-see">Product Walkthrough (What Users See)</a></p>
</li>
<li><p><a href="#heading-how-i-built-it">How I Built It</a></p>
</li>
<li><p><a href="#heading-challenges-i-faced">Challenges I Faced</a></p>
</li>
<li><p><a href="#heading-what-i-learned">What I Learned</a></p>
</li>
<li><p><a href="#heading-what-i-want-to-improve-next">What I Want to Improve Next</a></p>
</li>
<li><p><a href="#heading-future-improvements">Future Improvements</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-the-app-does">What the App Does</h2>
<p>At a high level, the app combines six core capabilities:</p>
<ol>
<li><p>Wardrobe management</p>
</li>
<li><p>Outfit recommendations</p>
</li>
<li><p>Shopping suggestions</p>
</li>
<li><p>Discard recommendations</p>
</li>
<li><p>Feedback and usage tracking</p>
</li>
<li><p>Secure multi-user accounts</p>
</li>
</ol>
<p>Users can upload clothing items, explore suggested outfits, and mark recommendations as helpful or not helpful. They can also rate outfits and track whether items are worn, kept, or discarded.</p>
<p>That feedback becomes structured data for improving future recommendation quality.</p>
<h2 id="heading-why-i-built-it">Why I Built It</h2>
<p>I wanted to create something that felt personal and actually useful. A lot of fashion apps look polished, but they do not always help with everyday decisions. My goal was to build something that could make wardrobe management easier and outfit selection less overwhelming. The app needed to do three things well:</p>
<ul>
<li><p>store each user’s wardrobe data</p>
</li>
<li><p>personalize recommendations</p>
</li>
<li><p>learn from user feedback over time</p>
</li>
</ul>
<p>That feedback loop mattered to me because it makes the app feel more alive instead of static.</p>
<h2 id="heading-tech-stack">Tech Stack</h2>
<p>Here are the tools I used to build the app:</p>
<ul>
<li><p>Frontend: React + Vite</p>
</li>
<li><p>Backend: FastAPI</p>
</li>
<li><p>Database: SQLite (local development)</p>
</li>
<li><p>Background jobs: Celery + Redis</p>
</li>
<li><p>Authentication: JWT (access + refresh token flow)</p>
</li>
<li><p>Deployment support: Docker and GitHub Codespaces</p>
</li>
</ul>
<p>This gave me a modular setup that held up as the feature list grew: fast frontend iteration, clean API boundaries, and room to evolve recommendations separately from the UI.</p>
<h2 id="heading-product-walkthrough-what-users-see">Product Walkthrough (What Users See)</h2>
<h3 id="heading-1-onboarding-and-account-setup">1. Onboarding and Account Setup</h3>
<p>To start using the app, a user needs to register, verify their email, and complete some profile basics.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/1ff4fb0d-dc97-4088-b720-db917b53ba5b.png" alt="Onboarding screen showing account creation, email verification, and profile fields for body shape, height, weight, and style preferences." style="display:block;margin:0 auto" width="1319" height="850" loading="lazy">

<p>Each account is isolated, so wardrobe history and recommendations stay user-specific.</p>
<p>The onboarding screen above shows account creation, email verification, and profile fields for body shape, height, weight, and style preferences.</p>
<h3 id="heading-2-wardrobe-upload">2. Wardrobe Upload</h3>
<p>Users can upload clothing images.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/d69bf10b-b79b-4294-923c-5c9e5840098a.png" alt="Wardrobe upload form showing clothing image analysis results with category, dominant color, secondary color, and pattern details." style="display:block;margin:0 auto" width="1320" height="625" loading="lazy">

<p>Image analysis labels each item and makes it searchable for recommendations. The wardrobe upload form shows image analysis results with category, dominant color, secondary color, and pattern details listed.</p>
<h3 id="heading-3-outfit-recommendations">3. Outfit Recommendations</h3>
<p>Users can request recommendations, then rate the results.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/61527ddf-11e4-4284-92fd-2d0c948ae2db.png" alt="Outfit recommendation dashboard showing ranked outfit cards with feedback and rating actions." style="display:block;margin:0 auto" width="1011" height="692" loading="lazy">

<p>The outfit recommendation dashboard above shows ranked outfit cards with feedback and rating actions. Recommendations are ranked by a weighted scoring model.</p>
<h3 id="heading-4-shopping-and-discard-assistants">4. Shopping and Discard Assistants</h3>
<p>The app evaluates new items against existing wardrobe data and flags low-value wardrobe items that may be worth removing.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68ab1274684dc97382d342ea/88ed83c4-fdba-40e7-ad32-f77bdf21cb4d.png" alt="Shopping and discard analysis screen showing recommendation scores, written reasons, and styling guidance for each item." style="display:block;margin:0 auto" width="1324" height="852" loading="lazy">

<p>The screen above shows the recommendation scores, written reasons (not just a binary decision), and styling guidance for each item. It also includes a "how to style it" section in case the user still wants to keep the item.</p>
<h2 id="heading-how-i-built-it">How I Built It</h2>
<h3 id="heading-1-frontend-setup-react-vite">1. Frontend Setup (React + Vite)</h3>
<p>I used React + Vite because I wanted fast iteration and a clean component structure.</p>
<p>The frontend is split into feature areas like onboarding, wardrobe management, outfits, shopping, and discarded-item suggestions. I also keep API calls in a service layer so the UI components stay focused on rendering and interaction.</p>
<p>The snippet below is a simplified example of the API service pattern used in the app. It is not meant to be copy-pasted as-is, but it shows the same structure the frontend uses when talking to the backend.</p>
<p>Example API client pattern:</p>
<pre><code class="language-javascript">export async function getOutfitRecommendations(userId, params = {}) {
  const query = new URLSearchParams(params).toString();
  const url = `/users/${userId}/outfits/recommend${query ? `?${query}` : ""}`;

  const response = await fetch(url, {
    headers: {
      Authorization: `Bearer ${localStorage.getItem("access_token")}`,
    },
  });

  if (!response.ok) {
    throw new Error("Failed to fetch outfit recommendations");
  }

  return response.json();
}
</code></pre>
<p>Here's what's happening in that snippet:</p>
<ul>
<li><p><code>URLSearchParams</code> builds optional query strings like <code>occasion</code>, <code>season</code>, or <code>limit</code>.</p>
</li>
<li><p>The request path is user-scoped, which keeps each user’s recommendations isolated.</p>
</li>
<li><p>The <code>Authorization</code> header sends the access token so the backend can verify the session.</p>
</li>
<li><p>The response is checked before parsing so the UI can surface a useful error if the request fails.</p>
</li>
</ul>
<p>This pattern kept the frontend simple and reusable as the number of API calls grew.</p>
<h3 id="heading-2-backend-architecture-with-fastapi">2. Backend Architecture with FastAPI</h3>
<p>The backend is organized around clear route groups:</p>
<ul>
<li><p>auth routes for register, login, refresh, logout, and sessions</p>
</li>
<li><p>user analysis routes</p>
</li>
<li><p>wardrobe CRUD routes</p>
</li>
<li><p>recommendation routes for outfits, shopping, and discard analysis</p>
</li>
<li><p>feedback routes for ratings and helpfulness signals</p>
</li>
</ul>
<p>One of the most important design choices was enforcing ownership checks on user-scoped resources. That prevented one user from accessing another user’s wardrobe or feedback data.</p>
<p>The backend snippet below is another simplified example from the app’s route layer. It shows the request validation and orchestration logic, while the actual scoring work stays in the recommendation service.</p>
<pre><code class="language-python">@app.get("/users/{user_id}/outfits/recommend")
def recommend_outfits(user_id: int, occasion: str | None = None, season: str | None = None, limit: int = 10):
    user = get_user_or_404(user_id)
    wardrobe_items = get_user_wardrobe(user_id)

    if len(wardrobe_items) &lt; 2:
        raise HTTPException(status_code=400, detail="Not enough wardrobe items")

    recommendations = outfit_generator.generate_outfit_recommendations(
        wardrobe_items=wardrobe_items,
        body_shape=user.body_shape,
        undertone=user.undertone,
        occasion=occasion,
        season=season,
        top_k=limit,
    )

    return {"user_id": user_id, "recommendations": recommendations}
</code></pre>
<p>Here's how to read that code:</p>
<ul>
<li><p><code>get_user_or_404</code> loads the profile data needed for personalization.</p>
</li>
<li><p><code>get_user_wardrobe</code> fetches only the current user’s items.</p>
</li>
<li><p>The minimum wardrobe check prevents the recommendation logic from running on incomplete data.</p>
</li>
<li><p><code>generate_outfit_recommendations</code> handles the scoring logic separately, which keeps the route handler small and easier to test.</p>
</li>
<li><p>The response returns the results in a shape the frontend can consume directly.</p>
</li>
</ul>
<p>That separation helped keep the API layer readable while the recommendation logic stayed isolated in its own service.</p>
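<p>The ownership check mentioned earlier does not appear in the simplified route, so here is a minimal, framework-free sketch of the idea. The names <code>Forbidden</code>, <code>require_owner</code>, and <code>list_wardrobe</code> are hypothetical. In a FastAPI app, the authenticated ID would typically come from a dependency that decodes the access token, and that wiring is left out here.</p>

```python
class Forbidden(Exception):
    """Raised when a caller asks for a resource that belongs to someone else."""


def require_owner(authenticated_user_id: int, requested_user_id: int) -> None:
    """Reject the request unless the token's user matches the user in the path."""
    if authenticated_user_id != requested_user_id:
        raise Forbidden("You can only access your own resources")


def list_wardrobe(authenticated_user_id: int, requested_user_id: int, store: dict) -> list:
    """Run the ownership check before touching any user-scoped data."""
    require_owner(authenticated_user_id, requested_user_id)
    return store.get(requested_user_id, [])


# Toy in-memory data standing in for the wardrobe table.
wardrobes = {1: ["navy blazer"], 2: ["red scarf"]}
print(list_wardrobe(1, 1, wardrobes))  # user 1 reading their own wardrobe
```

<p>Running the check first means a mismatched request fails before any query runs, so there is no code path that loads another user's rows and filters them afterward.</p>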
<h3 id="heading-3-recommendation-logic">3. Recommendation Logic</h3>
<p>I intentionally started with deterministic rules before introducing heavy ML. That made behavior easier to debug and explain.</p>
<p>The outfit recommender scores combinations using weighted signals:</p>
<p>$$\text{outfit score} = 0.4 \cdot \text{color harmony} + 0.4 \cdot \text{body-shape fit} + 0.2 \cdot \text{undertone fit}$$</p>
<p>The snippet below is a simplified example from the recommendation engine. It shows how the app combines multiple signals into a single score:</p>
<pre><code class="language-python">def score_outfit(combo, user_context):
    color_score = color_harmony.score(combo)
    shape_score = body_shape_rules.score(combo, user_context.body_shape)
    undertone_score = undertone_rules.score(combo, user_context.undertone)

    total = 0.4 * color_score + 0.4 * shape_score + 0.2 * undertone_score
    return round(total, 3)
</code></pre>
<p>The logic behind this approach is straightforward:</p>
<ul>
<li><p>color harmony helps the outfit feel visually coherent</p>
</li>
<li><p>body-shape scoring helps the outfit feel flattering</p>
</li>
<li><p>undertone scoring helps the colors work better with the user’s profile</p>
</li>
</ul>
<p>I used a similar structure for discard recommendations and shopping suggestions, but with different factors and thresholds.</p>
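<p>To show how a weighted score turns into a ranked list, here is a small self-contained sketch. It reuses the 0.4 / 0.4 / 0.2 weights from the formula above, but everything else is an assumption: <code>rank_outfits</code>, <code>toy_signals</code>, and the sample items are placeholders, not the app's real scoring rules.</p>

```python
import heapq
from itertools import product

WEIGHTS = {"color": 0.4, "shape": 0.4, "undertone": 0.2}


def weighted_score(signals: dict[str, float]) -> float:
    """Combine per-signal scores in [0, 1] using the fixed weights."""
    return round(sum(WEIGHTS[name] * signals[name] for name in WEIGHTS), 3)


def rank_outfits(tops, bottoms, score_signals, top_k=3):
    """Score every top/bottom pair and keep the top_k highest-scoring combos."""
    scored = ((weighted_score(score_signals(top, bottom)), top, bottom)
              for top, bottom in product(tops, bottoms))
    return heapq.nlargest(top_k, scored)


def toy_signals(top: str, bottom: str) -> dict[str, float]:
    """Placeholder scorer: pretend navy-with-beige is the most harmonious pair."""
    harmonious = (top, bottom) == ("navy shirt", "beige chinos")
    return {"color": 1.0 if harmonious else 0.5, "shape": 0.8, "undertone": 0.6}


best = rank_outfits(["navy shirt", "white tee"], ["beige chinos", "black jeans"], toy_signals, top_k=2)
print(best[0])  # (0.84, 'navy shirt', 'beige chinos')
```

<p>Keeping the weights in one dictionary makes it easy to see why an outfit ranked where it did, which is the main benefit of starting with deterministic rules.</p>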
<h3 id="heading-4-authentication-and-secure-multi-user-design">4. Authentication and Secure Multi-user Design</h3>
<p>Security was one of the most important parts of this build.</p>
<p>I implemented:</p>
<ul>
<li><p>short-lived access tokens</p>
</li>
<li><p>refresh tokens with JTI tracking</p>
</li>
<li><p>token rotation on refresh</p>
</li>
<li><p>session revocation (single session and all sessions)</p>
</li>
<li><p>email verification and password reset flows</p>
</li>
</ul>
<p>The snippet below is a simplified example of the refresh-token lifecycle used in the app. It shows the important control points rather than every helper function:</p>
<pre><code class="language-python">def refresh_access_token(refresh_token: str):
    payload = decode_jwt(refresh_token)
    jti = payload["jti"]

    token_record = db.get_refresh_token(jti)
    if not token_record or token_record.revoked:
        raise AuthError("Invalid refresh token")

    new_refresh, new_jti = issue_refresh_token(payload["sub"])
    token_record.revoked = True
    token_record.replaced_by_jti = new_jti

    new_access = issue_access_token(payload["sub"])
    return {"access_token": new_access, "refresh_token": new_refresh}
</code></pre>
<p>What this code is doing:</p>
<ul>
<li><p>It decodes the refresh token and looks up its JTI in the database.</p>
</li>
<li><p>It rejects reused or revoked sessions, which helps prevent replay attacks.</p>
</li>
<li><p>It rotates the refresh token instead of reusing it.</p>
</li>
<li><p>It issues a fresh access token so the session stays valid without forcing the user to log in again.</p>
</li>
</ul>
<p>This design made multi-device sessions safer and gave me server-side control over logout behavior.</p>
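<p>The rotation rule is easier to trust once you see a replayed token fail. The sketch below uses an in-memory dictionary instead of a database and skips JWT signing entirely. <code>RefreshTokenStore</code> and its methods are hypothetical names for illustration, not the app's actual helpers.</p>

```python
import secrets


class RefreshTokenStore:
    """In-memory stand-in for the refresh-token table, keyed by JTI."""

    def __init__(self):
        self.records = {}  # jti -> {"user_id": ..., "revoked": bool}

    def issue(self, user_id: int) -> str:
        """Create a new token record and return its JTI."""
        jti = secrets.token_urlsafe(16)
        self.records[jti] = {"user_id": user_id, "revoked": False}
        return jti

    def rotate(self, jti: str) -> str:
        """Revoke the presented token and hand back a replacement."""
        record = self.records.get(jti)
        if record is None or record["revoked"]:
            raise PermissionError("Invalid refresh token")
        record["revoked"] = True
        return self.issue(record["user_id"])


store = RefreshTokenStore()
first = store.issue(user_id=42)
second = store.rotate(first)  # a normal refresh succeeds

try:
    store.rotate(first)  # replaying the old token is rejected
except PermissionError as error:
    print(error)
```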
<h3 id="heading-5-background-jobs-for-long-running-operations">5. Background Jobs for Long-running Operations</h3>
<p>Image analysis can be expensive, especially when the app needs to classify clothing, analyze colors, and estimate body-shape-related signals. To keep the request path responsive, I added Celery + Redis support for background tasks.</p>
<p>That gave the app two modes:</p>
<ul>
<li><p>synchronous processing for simpler local development</p>
</li>
<li><p>queued processing for heavier or slower jobs</p>
</li>
</ul>
<p>That tradeoff mattered because it let me keep the developer experience simple without blocking the app during more expensive work.</p>
<h3 id="heading-6-data-model-and-feedback-capture">6. Data Model and Feedback Capture</h3>
<p>A recommendation system only improves if it captures the right signals.</p>
<p>So I added dedicated feedback tables for:</p>
<ul>
<li><p>outfit ratings (1-5 + optional comments)</p>
</li>
<li><p>recommendation helpful/unhelpful feedback</p>
</li>
<li><p>item usage actions (worn/kept/discarded)</p>
</li>
</ul>
<p>Here is the shape of one of those models:</p>
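<pre><code class="language-python">from datetime import datetime

from sqlalchemy import Boolean, Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()  # assumption: the app defines one shared Base elsewhere


class RecommendationFeedback(Base):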
    __tablename__ = "recommendation_feedback"

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.id"), nullable=False)
    recommendation_type = Column(String(50), nullable=False)
    recommendation_id = Column(Integer, nullable=False)
    helpful = Column(Boolean, nullable=False)
    created_at = Column(DateTime, default=datetime.utcnow)
</code></pre>
<p>How to read this model:</p>
<ul>
<li><p><code>user_id</code> ties feedback to the person who gave it.</p>
</li>
<li><p><code>recommendation_type</code> tells me whether the feedback belongs to outfits, shopping, or discard suggestions.</p>
</li>
<li><p><code>recommendation_id</code> identifies the exact recommendation.</p>
</li>
<li><p><code>helpful</code> stores the user’s direct response.</p>
</li>
<li><p><code>created_at</code> makes it possible to analyze feedback trends over time.</p>
</li>
</ul>
<p>This part of the system gives the app a real learning foundation, even though the feedback-to-model-update loop is still a future improvement.</p>
<h2 id="heading-challenges-i-faced">Challenges I Faced</h2>
<p>This was the part of the project that taught me the most.</p>
<h3 id="heading-1-image-heavy-endpoints-were-slower-than-i-wanted">1. Image-heavy endpoints were slower than I wanted</h3>
<p>The analyze and wardrobe upload flows were doing a lot of work at once: image validation, classification, color extraction, storage, and database writes.</p>
<p>At first, that made the request flow feel heavier than it should have.</p>
<p>What I changed:</p>
<ul>
<li><p>I bounded concurrent image jobs so the app wouldn't try to do too much at once.</p>
</li>
<li><p>I separated slower jobs into background processing where possible.</p>
</li>
<li><p>I used load-test results to confirm which endpoints were actually expensive.</p>
</li>
</ul>
<p>The practical effect was that heavy image requests stopped competing with each other so aggressively. Instead of letting many expensive tasks pile up inside the same request cycle, I limited the active work and pushed slower operations into the queue when needed.</p>
<p>Why this fixed it:</p>
<ul>
<li><p>Bounding concurrency prevented the system from overloading CPU-bound tasks.</p>
</li>
<li><p>Moving expensive work into async jobs kept the main request/response cycle more responsive.</p>
</li>
<li><p>Load testing gave me evidence instead of guesswork, so I could tune the system based on real performance behavior.</p>
</li>
</ul>
<p>In other words, I didn't just “optimize” the endpoint in theory. I changed the execution model so expensive analysis could not block every other request behind it.</p>
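<p>To make the bounded-concurrency idea concrete, here is a minimal sketch using <code>asyncio.Semaphore</code>. The names <code>MAX_IMAGE_JOBS</code> and <code>analyze_image</code> are hypothetical, the sleep stands in for the real image work, and this is an illustration of the pattern rather than the project's actual code:</p>
<pre><code class="language-python">import asyncio

MAX_IMAGE_JOBS = 2  # hypothetical limit; tune it from load-test results

active = 0  # jobs currently inside the guarded section
peak = 0    # highest number of jobs observed running at once


async def analyze_image(name, limiter):
    """Placeholder for an expensive image job, guarded by the semaphore."""
    global active, peak
    async with limiter:  # waits here while MAX_IMAGE_JOBS jobs are running
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for classification and color extraction
        active -= 1
    return f"{name}: done"


async def main():
    limiter = asyncio.Semaphore(MAX_IMAGE_JOBS)
    jobs = [analyze_image(f"img_{i}.jpg", limiter) for i in range(6)]
    return await asyncio.gather(*jobs)


results = asyncio.run(main())
print(results, "peak concurrency:", peak)
</code></pre>
<p>Six jobs are submitted, but no more than two ever run at the same time. In a real service, the CPU-heavy step would more likely be handed to a worker pool or a task queue, with the same kind of limit applied in front of it.</p>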
<h3 id="heading-2-jwt-sessions-needed-real-server-side-control">2. JWT sessions needed real server-side control</h3>
<p>A basic JWT setup is easy to get working, but it becomes less useful if you cannot revoke sessions or manage multiple devices cleanly.</p>
<p>What I changed:</p>
<ul>
<li><p>I stored refresh tokens in the database.</p>
</li>
<li><p>I tracked token JTI values.</p>
</li>
<li><p>I rotated refresh tokens when users refreshed their session.</p>
</li>
<li><p>I added endpoints for logging out a single session or all sessions.</p>
</li>
</ul>
<p>The important shift here was moving from “token exists, therefore session is valid” to “token exists, matches the database record, and has not been revoked or replaced.” That gave the server the authority to invalidate old sessions immediately.</p>
<p>Why this fixed it:</p>
<ul>
<li><p>Server-side token tracking made revocation possible.</p>
</li>
<li><p>Rotation reduced the chance of token reuse.</p>
</li>
<li><p>Session management became visible to the user, which made the app feel more trustworthy.</p>
</li>
</ul>
<p>This is what made logout-all and multi-device management work in a real way instead of just being cosmetic UI actions.</p>
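<p>Here is a small sketch of that server-side check, using an in-memory dictionary as a stand-in for the refresh-token table. The function names are hypothetical and token encoding is left out, so treat it as an outline of the idea rather than the app's real implementation:</p>
<pre><code class="language-python">import uuid

# In-memory stand-in for the refresh-token table: jti -> session record.
sessions = {}


def issue_session(user_id):
    """Create a session record and return its JTI (token encoding omitted)."""
    jti = uuid.uuid4().hex
    sessions[jti] = {"user_id": user_id, "revoked": False}
    return jti


def is_session_valid(jti):
    """A token only counts if its JTI is on record and has not been revoked."""
    record = sessions.get(jti)
    return record is not None and not record["revoked"]


def rotate(jti):
    """Revoke the presented JTI and hand back a replacement."""
    if not is_session_valid(jti):
        raise PermissionError("refresh token reused or revoked")
    sessions[jti]["revoked"] = True
    return issue_session(sessions[jti]["user_id"])


def logout_all(user_id):
    """Revoke every session that belongs to one user."""
    for record in sessions.values():
        if record["user_id"] == user_id:
            record["revoked"] = True


phone = issue_session(user_id=1)
laptop = issue_session(user_id=1)
phone = rotate(phone)   # the old phone JTI is now unusable
logout_all(user_id=1)   # both devices are signed out at once
print(is_session_valid(phone), is_session_valid(laptop))
</code></pre>
<p>Because validity is decided by the stored record rather than by the token alone, revoking a row is enough to end a session on any device.</p>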
<h3 id="heading-3-user-data-isolation-had-to-be-explicit">3. User data isolation had to be explicit</h3>
<p>Because this is a multi-user app, I had to be careful that one account could never accidentally see another account’s wardrobe data.</p>
<p>What I changed:</p>
<ul>
<li><p>I added ownership checks to user-scoped routes.</p>
</li>
<li><p>I kept all wardrobe and feedback queries filtered by <code>user_id</code>.</p>
</li>
<li><p>I used encrypted image storage instead of exposing raw paths.</p>
</li>
</ul>
<p>In practice, this meant every route had to ask the same question: “Does this user own the resource they are trying to access?” If the answer was no, the request stopped immediately.</p>
<p>Why this fixed it:</p>
<ul>
<li><p>Ownership checks made data access rules explicit.</p>
</li>
<li><p>User-filtered queries prevented accidental cross-account reads.</p>
</li>
<li><p>Encrypted storage improved privacy and reduced the risk of exposing image data directly.</p>
</li>
</ul>
<p>That combination is what kept wardrobe data, feedback history, and images separated correctly across accounts.</p>
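<p>As a rough sketch of the ownership question each route asks, the example below uses a plain list in place of the database. <code>WardrobeItem</code>, <code>Forbidden</code>, and the helper names are assumptions made for illustration, not the project's actual models:</p>
<pre><code class="language-python">from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when a user requests a resource they do not own."""


@dataclass
class WardrobeItem:
    id: int
    user_id: int
    name: str


# In-memory stand-in for the wardrobe table.
ITEMS = [
    WardrobeItem(id=1, user_id=10, name="denim jacket"),
    WardrobeItem(id=2, user_id=20, name="black boots"),
]


def list_items(current_user_id):
    """Always filter by owner, mirroring a WHERE user_id = ... clause."""
    return [item for item in ITEMS if item.user_id == current_user_id]


def get_item(item_id, current_user_id):
    """Fetch one item, stopping the request if the caller is not the owner."""
    item = next((i for i in ITEMS if i.id == item_id), None)
    if item is None or item.user_id != current_user_id:
        raise Forbidden("resource not found for this user")
    return item


print(list_items(10))
print(get_item(1, current_user_id=10).name)
</code></pre>
<p>Returning the same error for "missing" and "not yours" is a deliberate choice in this sketch: it avoids revealing which item IDs exist in other accounts.</p>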
<h3 id="heading-4-docker-made-the-project-easier-to-share-but-only-after-the-stack-was-organized">4. Docker made the project easier to share, but only after the stack was organized</h3>
<p>The app includes the frontend, backend, Redis, Celery worker, and Celery Beat, so the first challenge was making the setup feel reproducible instead of fragile.</p>
<p>What I changed:</p>
<ul>
<li><p>I defined the stack in Docker Compose.</p>
</li>
<li><p>I documented the required environment variables.</p>
</li>
<li><p>I kept the dev stack aligned with how the app runs in practice.</p>
</li>
</ul>
<p>This removed a lot of setup ambiguity. Instead of asking someone to manually figure out how the frontend, backend, Redis, and workers fit together, I made the stack describe itself.</p>
<p>Why this fixed it:</p>
<ul>
<li><p>Docker let contributors start the project with fewer manual steps.</p>
</li>
<li><p>Clear environment configuration reduced setup mistakes.</p>
</li>
<li><p>Matching the stack to the architecture made the app easier to understand and test.</p>
</li>
</ul>
<p>That was important because the app depends on several moving parts, and the simplest way to make the project approachable was to make startup behavior predictable.</p>
<h2 id="heading-what-i-learned">What I Learned</h2>
<p>This project taught me a few important lessons:</p>
<ul>
<li><p>Small features become much more valuable when they work together.</p>
</li>
<li><p>Feedback data is one of the strongest signals for improving recommendations.</p>
</li>
<li><p>Clean data modeling matters a lot when multiple users are involved.</p>
</li>
<li><p>Docker and clear setup instructions make a project much easier for other people to try.</p>
</li>
</ul>
<p>I also learned that a project does not need to be huge to be useful. A focused app that solves one problem well can still feel meaningful.</p>
<h2 id="heading-what-i-want-to-improve-next">What I Want to Improve Next</h2>
<p>My roadmap from here:</p>
<ol>
<li><p>Integrate feedback directly into ranking updates</p>
</li>
<li><p>Add visual analytics for recommendation quality trends</p>
</li>
<li><p>Improve mobile UX parity</p>
</li>
<li><p>Deploy with persistent cloud storage and production database defaults</p>
</li>
<li><p>Provide a public demo mode for easier evaluation</p>
</li>
</ol>
<h2 id="heading-future-improvements">Future Improvements</h2>
<p>There are still a few things I would like to add later:</p>
<ul>
<li><p>a more advanced recommendation engine</p>
</li>
<li><p>visual analytics for user feedback</p>
</li>
<li><p>better mobile support</p>
</li>
<li><p>live deployment with persistent cloud storage</p>
</li>
<li><p>a public demo mode for easier testing</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This project began as a personal frustration and turned into a full web application with authentication, wardrobe storage, recommendation logic, and feedback infrastructure.</p>
<p>The most rewarding part was seeing how practical software decisions, not just flashy UI, can help people make everyday choices faster.</p>
<p>If you want to explore or run the project, <a href="https://github.com/Mokshitavp1/fashion_assistant">check out the repo</a>. You can try the flows and share feedback. I would especially love input on recommendation quality, UX clarity, and what features would make this genuinely useful in daily life.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How the Mixture of Experts Architecture Works in AI Models ]]>
                </title>
                <description>
                    <![CDATA[ Artificial intelligence (AI) has seen remarkable advancements over the years, with AI models growing in size and complexity. Among the innovative approaches gaining traction today is the Mixture of Ex ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-the-mixture-of-experts-architecture-works-in-ai-models/</link>
                <guid isPermaLink="false">69d53c4d5da14bc70e77ff78</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Tue, 07 Apr 2026 17:18:05 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/21b2975b-e6ad-462c-84c7-d966bf2092cb.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Artificial intelligence (AI) has seen remarkable advancements over the years, with AI models growing in size and complexity.</p>
<p>Among the innovative approaches gaining traction today is the <a href="https://www.ibm.com/think/topics/mixture-of-experts">Mixture of Experts (MoE)</a> architecture. This method optimizes AI model performance by distributing processing tasks across specialized subnetworks known as “experts.”</p>
<p>In this article, we’ll explore how this architecture works, the role of sparsity, routing strategies, and its real-world application in the Mixtral model. We’ll also discuss the challenges these systems face and the solutions developed to address them.</p>
<h3 id="heading-well-cover">We'll Cover:</h3>
<ul>
<li><p><a href="#heading-understanding-the-mixture-of-experts-moe-approach">Understanding the Mixture of Experts (MoE) Approach</a></p>
</li>
<li><p><a href="#heading-the-role-of-sparsity-in-ai-models">The Role of Sparsity in AI Models</a></p>
</li>
<li><p><a href="#heading-the-art-of-routing-in-moe-architectures">The Art of Routing in MoE Architectures</a></p>
</li>
<li><p><a href="#heading-load-balancing-challenges-and-solutions">Load Balancing Challenges and Solutions</a></p>
</li>
<li><p><a href="#heading-real-world-application-the-mixtral-model">Real-World Application: The Mixtral Model</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-understanding-the-mixture-of-experts-moe-approach">Understanding the Mixture of Experts (MoE) Approach</h2>
<img src="https://cdn.hashnode.com/uploads/covers/66c6d8f04fa7fe6a6e337edd/71385c3e-47b8-4040-adfd-30d5cb57fcd3.jpg" alt="71385c3e-47b8-4040-adfd-30d5cb57fcd3" style="display:block;margin:0 auto" width="1920" height="1080" loading="lazy">

<p>The Mixture of Experts (MoE) is a machine learning technique that divides an AI model into smaller, specialized networks, each focusing on specific tasks.</p>
<p>This is akin to assembling a team where each member possesses unique skills suited for particular challenges.</p>
<p>The idea isn't new. It dates back to a groundbreaking <a href="https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf">1991 paper</a> that highlighted the benefits of having separate networks specialize in different training cases.</p>
<p>Fast forward to today, and MoE is experiencing a resurgence, particularly among large language models, which utilize this approach to enhance efficiency and effectiveness.</p>
<p>At its core, this system comprises several components: an input layer, multiple expert networks, a gating network, and an output layer.</p>
<p>The gating network serves as a coordinator, determining which expert networks should be activated for a given task.</p>
<p>By doing so, MoE significantly reduces the need to engage the entire network for every operation. This improves performance and reduces computational overhead.</p>
<h2 id="heading-the-role-of-sparsity-in-ai-models">The Role of Sparsity in AI Models</h2>
<p>An essential concept within MoE architecture is sparsity, which refers to activating only a subset of experts for each processing task.</p>
<p>Instead of engaging all network resources, sparsity ensures that only the relevant experts and their parameters are used. This targeted selection significantly reduces computation needs, especially when dealing with complex, high-dimensional data such as natural language processing tasks.</p>
<p>Sparse models excel because they allow for specialized processing. For example, different parts of a sentence may require distinct types of analysis: one expert might be adept at understanding idioms, while another could specialize in parsing complex grammar structures.</p>
<p>By activating only the necessary experts, MoE models can provide more precise and efficient analysis of the input data.</p>
<h2 id="heading-the-art-of-routing-in-moe-architectures">The Art of Routing in MoE Architectures</h2>
<p>Routing is another critical component of the Mixture of Experts model.</p>
<img src="https://cdn.hashnode.com/uploads/covers/66c6d8f04fa7fe6a6e337edd/15cad578-a77d-464b-a97a-8c7240ba6263.png" alt="MoE Router" style="display:block;margin:0 auto" width="1000" height="715" loading="lazy">

<p>The gating network plays a crucial role here, as it determines which experts to activate for each input. A good routing strategy selects the most suitable experts for each input while keeping the workload balanced across the network.</p>
<p>Typically, the gating network computes a relevance score for every expert from the input's representation, often with a small learned layer followed by a softmax, and then routes the input to the experts with the highest scores.</p>
<p>One popular strategy is the <a href="https://mbrenndoerfer.com/writing/top-k-routing-mixture-of-experts-expert-selection">“top-k” routing</a> method, where the k most suitable experts are chosen for a task. In practice, a variant known as “top-2” routing is often used, activating the best two experts, which balances effectiveness and computational cost.</p>
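<p>To illustrate top-k routing, here is a minimal sketch in plain Python. The gate scores are made-up numbers, and renormalizing with a softmax over only the selected experts is one common variant rather than the only way to do it:</p>
<pre><code class="language-python">import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    # Renormalize over the chosen experts only, so their weights sum to 1.
    weights = softmax([gate_logits[i] for i in chosen])
    return dict(zip(chosen, weights))


# Made-up gate scores for one token across four experts.
gate_logits = [0.2, 2.1, -0.5, 1.3]
routing = top_k_route(gate_logits, k=2)
print(routing)
</code></pre>
<p>With these scores, experts 1 and 3 are selected and the other two are skipped entirely for this token. The returned weights are what the model later uses to mix the two experts' outputs.</p>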
<h2 id="heading-load-balancing-challenges-and-solutions">Load Balancing Challenges and Solutions</h2>
<p>While MoE models have clear advantages, they also introduce specific challenges, particularly regarding load balancing.</p>
<p>The potential issue is that the gating network might consistently select only a few experts, leading to an uneven distribution of tasks. This imbalance can leave a handful of experts overused and over-trained while the rest remain underused and under-trained.</p>
<p>To address this challenge, researchers have developed <a href="https://apxml.com/courses/mixture-of-experts-advanced-implementation/chapter-2-advanced-routing-mechanisms/noisy-top-k-gating">“noisy top-k”</a> gating, a technique introducing Gaussian noise to the selection process. This introduces an element of controlled randomness, promoting a more balanced activation of experts.</p>
<p>By distributing the workload more evenly across experts, this approach mitigates the risk of inefficiencies and ensures that the entire network remains effective.</p>
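<p>The effect of the noise can be seen in a small simulation. This sketch assumes a fixed noise scale (<code>noise_std</code>) for simplicity, whereas the original formulation learns the noise scale per expert, and the gate scores are invented for the example:</p>
<pre><code class="language-python">import random
from collections import Counter


def noisy_top_k(gate_logits, k=2, noise_std=1.0, rng=random):
    """Add Gaussian noise to each score, then keep the k highest."""
    noisy = [score + rng.gauss(0.0, noise_std) for score in gate_logits]
    ranked = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)
    return ranked[:k]


rng = random.Random(0)  # seeded so the sketch is repeatable
gate_logits = [2.0, 1.8, 1.7, 1.6]  # made-up scores that slightly favor experts 0 and 1

plain_counts = Counter()
noisy_counts = Counter()
for _ in range(1000):
    # Without noise, every token goes to the same two experts.
    plain_counts.update(noisy_top_k(gate_logits, noise_std=0.0, rng=rng))
    # With noise, near-ties are broken differently from token to token.
    noisy_counts.update(noisy_top_k(gate_logits, noise_std=1.0, rng=rng))

print("no noise:  ", dict(plain_counts))
print("with noise:", dict(noisy_counts))
</code></pre>
<p>Without noise, experts 2 and 3 are never selected even though their scores are almost as high. With noise, all four experts receive a share of the tokens, which gives each of them a chance to be trained.</p>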
<h3 id="heading-what-actually-happens-during-an-moe-inference">What Actually Happens During an MoE Inference</h3>
<p>To make the Mixture of Experts architecture more concrete, it helps to walk through what happens during a single request.</p>
<p>Consider a prompt like:</p>
<blockquote>
<p>“Explain why startups fail due to poor cash flow management.”</p>
</blockquote>
<p>In a traditional dense model, every layer and every parameter contribute to generating the response. In an MoE model, the process is more selective.</p>
<p>As the input is processed, each layer passes the token representations to the gating network. This component evaluates all available experts and assigns them scores based on how relevant they are to the input. Instead of activating the full network, the model selects only the top-k experts (commonly two).</p>
<p>For this example, the gating network might select:</p>
<ul>
<li><p>One expert specialized in financial reasoning</p>
</li>
<li><p>Another expert better at structuring causal explanations</p>
</li>
</ul>
<p>Only these selected experts process the input, producing intermediate outputs that are then combined and passed to the next layer. The rest of the experts remain inactive for that token.</p>
<p>This selection and combination process repeats across layers, meaning that at any given point, only a small fraction of the model’s total parameters are being used.</p>
<p>The result is a system that behaves like a large, highly capable model, but executes more like a smaller one in terms of compute. This is the practical advantage of MoE: it doesn’t just improve model capacity, it ensures that capacity is used selectively and efficiently for each request.</p>
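<p>The walkthrough above can be sketched as a toy MoE layer. Everything here is a simplification for illustration: the "experts" just scale their input (a real expert is a feed-forward network), the gate weights are made-up numbers, and the same layer is reused three times where a real model would have separate weights per layer:</p>
<pre><code class="language-python">import math

NUM_EXPERTS = 4
TOP_K = 2


def make_expert(scale):
    """Toy expert: scales every feature. A real expert is a feed-forward network."""
    return lambda vector: [scale * v for v in vector]


experts = [make_expert(s) for s in (0.5, 1.0, 1.5, 2.0)]

# Hypothetical gate weights: one row of scores per expert.
gate_weights = [[0.1, 0.3], [0.9, -0.2], [-0.4, 0.8], [0.5, 0.5]]


def moe_layer(token):
    """Route one token vector through the top-k experts and mix their outputs."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    chosen = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
    exps = [math.exp(logits[i]) for i in chosen]
    weights = [e / sum(exps) for e in exps]
    output = [0.0] * len(token)
    for index, weight in zip(chosen, weights):
        expert_out = experts[index](token)  # only the chosen experts run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen


token = [1.0, 2.0]
for layer in range(3):
    token, active = moe_layer(token)
    print(f"layer {layer}: active experts {active}, idle {NUM_EXPERTS - TOP_K}")
</code></pre>
<p>At every layer, two of the four experts do the work and the other two are never called, which is where the compute savings come from.</p>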
<h2 id="heading-real-world-application-the-mixtral-model">Real-World Application: The Mixtral Model</h2>
<p>A compelling example of the Mixture of Experts architecture in action is the <a href="https://huggingface.co/docs/transformers/en/model_doc/mixtral">Mixtral model</a>. This open-source large language model exemplifies how MoE can enhance efficiency in processing tasks.</p>
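<p>Each layer of the Mixtral 8x7B model contains eight expert feed-forward networks. As the model processes each token of input data, the gating network selects the two most suitable experts in that layer. These experts handle the token, and their outputs are combined before moving to the next model layer. Because the attention layers are shared and only the feed-forward blocks are replicated, the model has roughly 47 billion parameters in total rather than 8 × 7 = 56 billion, and only about 13 billion of them are active for any given token.</p>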
<p>This approach allows Mixtral to deliver strong performance while computing with only a fraction of its parameters for each token. By using resources efficiently and ensuring specialized processing, Mixtral stands as a testament to the potential of MoE architectures in advancing AI technology.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The Mixture of Experts architecture represents a significant step forward in developing efficient AI systems. With its focus on specialized processing and resource optimization, MoE offers numerous benefits, particularly for large-scale language models.</p>
<p>Key concepts like sparsity and effective routing ensure that these models can handle complex tasks with precision, while innovations like noisy top-k gating address the common challenges of load balancing.</p>
<p>Despite its complexity and the need for careful tuning, the MoE approach remains promising in elevating AI model performance. As AI continues to advance, architectures like MoE could play a crucial role in powering the next generation of intelligent systems, offering improved efficiency and specialized processing capabilities.</p>
<p>Hope you enjoyed this article. Sign up for <a href="https://www.manishmshiva.me/">my free newsletter</a> to get more articles delivered to your inbox. You can also <a href="https://www.linkedin.com/in/manishmshiva">connect with me</a> on LinkedIn.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use MLflow to Manage Your Machine Learning Lifecycle ]]>
                </title>
                <description>
                    <![CDATA[ Training machine learning models usually starts out being organized and ends up in absolute chaos. We’ve all been there: dozens of experiments scattered across random notebooks, and model files saved  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-mlflow-to-manage-your-machine-learning-lifecycle/</link>
                <guid isPermaLink="false">69c18bfc30a9b81e3a92bbbd</guid>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ containers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Temitope Oyedele ]]>
                </dc:creator>
                <pubDate>Mon, 23 Mar 2026 18:52:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/f829ab55-926d-43cd-b027-16c754445b09.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Training machine learning models usually starts out being organized and ends up in absolute chaos.</p>
<p>We’ve all been there: dozens of experiments scattered across random notebooks, and model files saved as <code>model_v2_final_FINAL.pkl</code> because no one is quite sure which version actually worked.</p>
<p>Once you move from a solo project to a team, or try to push something to production, that "organized chaos" quickly becomes a serious bottleneck.</p>
<p>Solving this mess requires more than just better naming conventions: it requires a way to standardize how we track and hand off our work. This is the specific gap MLflow was built to fill.</p>
<p>Originally released by the team at Databricks in 2018, it has become a standard open-source platform for managing the entire machine learning lifecycle. It acts as a central hub where your experiments, code, and models live together, rather than being tucked away in forgotten folders.</p>
<p>In this tutorial, we'll cover the core philosophy behind MLflow and how its modular architecture solves the 'dependency hell' of machine learning. We'll break down the four primary pillars of Tracking, Projects, Models, and the Model Registry, and walk through a practical implementation of each so you can move your projects from local notebooks to a production-ready lifecycle.</p>
<h3 id="heading-table-of-contents">Table of Contents:</h3>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites:</a></p>
</li>
<li><p><a href="#heading-mlflow-architecture-the-big-picture">MLflow Architecture: The Big Picture</a></p>
</li>
<li><p><a href="#heading-understanding-mlflow-tracking">Understanding MLflow Tracking</a></p>
<ul>
<li><p><a href="#heading-a-tracking-example">A Tracking Example</a></p>
</li>
<li><p><a href="#heading-where-does-the-data-actually-go">Where Does the Data Actually Go?</a></p>
</li>
<li><p><a href="#heading-why-bother-with-this-setup">Why Bother with This Setup?</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-understanding-mlflow-projects">Understanding MLflow Projects</a></p>
<ul>
<li><p><a href="#heading-the-mlproject-file">The MLproject File</a></p>
</li>
<li><p><a href="#heading-why-this-actually-matters">Why this Actually Matters</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-understanding-the-mlflow-model-registry">Understanding the MLflow Model Registry</a></p>
</li>
<li><p><a href="#heading-moving-a-model-through-the-pipeline">Moving a Model through the Pipeline</a></p>
<ul>
<li><a href="#heading-why-does-this-matter">Why Does This Matter?</a></li>
</ul>
</li>
<li><p><a href="#heading-how-the-components-fit-together">How the Components Fit Together</a></p>
</li>
<li><p><a href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites:</h3>
<p>To get the most out of this tutorial, you should have:</p>
<ul>
<li><p><strong>Basic Python proficiency:</strong> Comfort with context managers (<code>with</code> statements) and decorators.</p>
</li>
<li><p><strong>Machine Learning fundamentals:</strong> A general understanding of training/testing splits and model evaluation metrics (like accuracy or loss).</p>
</li>
<li><p><strong>Local Environment:</strong> Python 3.8+ installed. Familiarity with <code>pip</code> or <code>conda</code> for installing packages is helpful.</p>
</li>
</ul>
<h2 id="heading-mlflow-architecture-the-big-picture">MLflow Architecture: The Big Picture</h2>
<p>To understand why MLflow is so effective, you have to look at how it's actually put together. MLflow isn't one giant or rigid tool. It’s a modular system designed around four loosely coupled components that are its core pillars.</p>
<p>This is a big deal because it means you don’t have to commit to the entire ecosystem at once. If you only need to track experiments and don't care about the other features, you can just use that part and ignore the rest.</p>
<p>To make this a bit more concrete, here is how those pieces map to things you probably already use:</p>
<ul>
<li><p><strong>MLflow Tracking:</strong> Logs experiments, metrics, and parameters. (Think: <strong>Git commits for ML runs</strong>)</p>
</li>
<li><p><strong>MLflow Projects:</strong> Packages code for reproducibility. (Think: <strong>A Docker image for ML code</strong>)</p>
</li>
<li><p><strong>MLflow Models:</strong> A standard format for multiple frameworks. (Think: <strong>A universal adapter</strong>)</p>
</li>
<li><p><strong>Model Registry:</strong> Handles model versioning and governance. (Think: <strong>A CI/CD pipeline for models</strong>)</p>
</li>
</ul>
<p>Architecturally, you can think of MLflow in two layers: the Client and the Server.</p>
<p>The Client is where you spend most of your time. It’s your training script or your Jupyter notebook where you log metrics or register a model.</p>
<p>The Server is the brain in the background that handles storage. It consists of a Tracking Server, a Backend Store (usually a database like PostgreSQL), and an Artifact Store (such as S3 or GCS), which is where big files like model weights live.</p>
<p>This separation is why MLflow is so flexible. You can start with everything running locally on your laptop using just your file system. When you're ready to scale up to a larger team, you can swap that out for a centralized server and cloud storage with almost no changes to your actual code. It grows with your project instead of forcing you to start over once things get serious.</p>
<p>Now, let's look at each of these four pillars of MLflow so you understand how they work.</p>
<h2 id="heading-understanding-mlflow-tracking">Understanding MLflow Tracking</h2>
<p>For most teams, the <strong>Tracking</strong> component is the front door to MLflow. Its job is simple: it acts as a digital lab notebook that records everything happening during a training run.</p>
<p>Instead of you frantically trying to remember what your learning rate was or where you saved that accuracy plot, MLflow just sits in the background and logs it for you.</p>
<p>The core unit here is the <strong>run</strong>. Think of a run as a single execution of your training code. During that run, the architecture captures four specific types of information:</p>
<ul>
<li><p><strong>Parameters:</strong> Your inputs, like batch size or the number of trees in a forest.</p>
</li>
<li><p><strong>Metrics:</strong> Your outputs, like accuracy or loss, which can be tracked over time.</p>
</li>
<li><p><strong>Artifacts:</strong> The "heavy" stuff, such as model weights, confusion matrices, or images.</p>
</li>
<li><p><strong>Tags and Metadata:</strong> Context like which developer ran the code and which Git commit was used.</p>
</li>
</ul>
<h3 id="heading-a-tracking-example">A Tracking Example</h3>
<p>Seeing this in practice is the best way to understand how the architecture actually works. You don't need to rebuild your entire pipeline – you just wrap your training logic in a context manager.</p>
<p>Here is what a basic integration looks like in Python:</p>
<pre><code class="language-python">import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Example data so the snippet runs on its own (swap in your own dataset)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# This block opens the run and keeps things organized
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)

    # Train the model
    model = RandomForestClassifier(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)

    # Log metrics
    accuracy = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", accuracy)

    # Log the model as an artifact
    mlflow.sklearn.log_model(model, "random_forest_model")
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/0c63f9c4-3f16-4591-be58-51a0acca5f80.png" alt="A comparison table in the MLflow UI showing three training runs side-by-side, highlighting differences in parameters and metrics." style="display:block;margin:0 auto" width="2862" height="1384" loading="lazy">

<p>The <code>mlflow.start_run()</code> context manager creates a new run and automatically closes it when the block exits. Everything logged inside that block is associated with that run: parameters and metrics go to the Backend Store, while the model file goes to the Artifact Store.</p>
<h3 id="heading-where-does-the-data-actually-go">Where Does the Data Actually Go?</h3>
<p>When you’re just starting out on your laptop, MLflow keeps things simple by creating a local <code>./mlruns</code> directory. The real power shows up when you move to a team environment and point everyone to a centralized Tracking Server.</p>
<p>The system splits the data based on how "heavy" it is. Your structured data (parameters and metrics) is small and needs to be searchable, so it goes into a SQL database like PostgreSQL. Your unstructured data (the actual model files or large plots) is too bulky for a database. The architecture ships that off to an Artifact Store like Amazon S3 or Google Cloud Storage.</p>
<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/e8aa2e4e-09a8-4767-a1f3-b07810680615.png" alt="The MLflow Artifact Store view showing the directory structure for a logged model, including the MLmodel metadata and model.pkl file." style="display:block;margin:0 auto" width="2862" height="1384" loading="lazy">

<h3 id="heading-why-bother-with-this-setup">Why Bother with This Setup?</h3>
<p>Relying on "vibes" and messy naming conventions is a recipe for disaster once your project grows. It might work for a day or two, but it falls apart the moment you need to compare twenty different versions of a model.</p>
<p>By separating the tracking into its own architectural pillar, MLflow gives you a queryable history. Instead of digging through old notebooks, you can just hop into the UI, filter for the best results, and see exactly which configuration got you there. It takes the guesswork out of the "science" part of data science.</p>
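<p>The same history can also be queried from code. The sketch below is a minimal example under a few assumptions: it writes to a throwaway local directory instead of a real server, the experiment name <code>tracking-demo</code> is made up, and the accuracy values are invented purely to have something to filter:</p>
<pre><code class="language-python">import tempfile
from pathlib import Path

import mlflow

# Use a throwaway local store so the sketch does not touch a real server.
tracking_dir = Path(tempfile.mkdtemp())
mlflow.set_tracking_uri(tracking_dir.as_uri())
mlflow.set_experiment("tracking-demo")  # hypothetical experiment name

# Log three runs with made-up accuracy values (illustrative only).
for n_estimators, accuracy in [(50, 0.91), (100, 0.94), (200, 0.93)]:
    with mlflow.start_run():
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_metric("accuracy", accuracy)

# Query the history: keep runs above a threshold, best first.
runs = mlflow.search_runs(
    experiment_names=["tracking-demo"],
    filter_string="metrics.accuracy > 0.92",
    order_by=["metrics.accuracy DESC"],
)
print(runs[["params.n_estimators", "metrics.accuracy"]])
</code></pre>
<p><code>mlflow.search_runs</code> returns a pandas DataFrame, so the results can be sorted, plotted, or exported like any other table. Note that parameters come back as strings, while metrics come back as numbers.</p>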
<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/cd83e4b7-38b7-4644-8166-e48ba00d581a.png" alt="An MLflow Parallel Coordinates plot visualizing the relationship between the number of estimators and model accuracy across multiple runs." style="display:block;margin:0 auto" width="2862" height="1384" loading="lazy">

<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/6d1383f5-7ace-4b9d-a566-64a3807cdcd7.png" alt="An MLflow scatter plot illustrating the positive correlation between the n_estimators parameter and the resulting model accuracy." style="display:block;margin:0 auto" width="2862" height="1384" loading="lazy">

<h2 id="heading-understanding-mlflow-projects">Understanding MLflow Projects</h2>
<p>You can train the most accurate model in the world, but if your colleague can’t reproduce your results on their machine, that model isn't worth much.</p>
<p>This is where MLflow Projects come in. They solve the reproducibility headache by providing a standard way to package your code, your dependencies, and your entry points into one neat bundle.</p>
<p>Think of an MLflow Project as a directory (or a Git repo) with a special "instruction manual" at its root called an <code>MLproject</code> file. This file tells anyone (or any server) exactly what environment is needed and how to kick off the execution.</p>
<h3 id="heading-the-mlproject-file">The MLproject File</h3>
<p>Instead of sending someone a long README with installation steps, you just give them this file. Here is what a typical <code>MLproject</code> setup looks like for a training pipeline:</p>
<pre><code class="language-yaml">name: my_ml_project
conda_env: conda.yaml

entry_points:
  train:
    parameters:
      learning_rate: {type: float, default: 0.01}
      epochs: {type: int, default: 50}
      data_path: {type: str}
    command: "python train.py --lr {learning_rate} --epochs {epochs} --data {data_path}"
  
  evaluate:
    parameters:
      model_path: {type: str}
    command: "python evaluate.py --model {model_path}"
</code></pre>
<p>The <code>conda_env</code> line points to a <code>conda.yaml</code> file that lists the exact Python packages and versions your code needs. If you want even more isolation, MLflow supports Docker environments too.</p>
<p>The beauty of this setup is the simplicity. Anyone with MLflow installed can run your entire project with a single command:</p>
<pre><code class="language-bash">mlflow run . -P learning_rate=0.001 -P epochs=100 -P data_path=./data/train.csv
</code></pre>
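<p>If it helps to see what happens to those <code>-P</code> values, the sketch below mimics the idea in plain Python: overrides are merged with the declared defaults and substituted into the command template. This is an illustration only, not MLflow's internal code, and <code>resolve_command</code> is a hypothetical helper.</p>

```python
# Toy copy of the `train` entry point from the MLproject file above.
ENTRY_POINT = {
    "parameters": {
        "learning_rate": {"type": "float", "default": 0.01},
        "epochs": {"type": "int", "default": 50},
        "data_path": {"type": "str"},  # no default, so the caller must supply it
    },
    "command": "python train.py --lr {learning_rate} --epochs {epochs} --data {data_path}",
}


def resolve_command(entry_point, overrides):
    """Merge -P style overrides with declared defaults, then fill the template."""
    values = {}
    for name, spec in entry_point["parameters"].items():
        if name in overrides:
            values[name] = overrides[name]
        elif "default" in spec:
            values[name] = spec["default"]
        else:
            raise ValueError(f"Missing required parameter: {name}")
    return entry_point["command"].format(**values)


command = resolve_command(
    ENTRY_POINT,
    {"learning_rate": 0.001, "epochs": 100, "data_path": "./data/train.csv"},
)
print(command)
```

<p>Leaving out <code>data_path</code> raises an error in this sketch because it has no default, which is the behavior you want from a required parameter.</p>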
<h3 id="heading-why-this-actually-matters">Why this Actually Matters</h3>
<p>MLflow Projects really shine in two specific scenarios. The first is onboarding. A new team member can clone your repo and be up and running in minutes, rather than spending their entire first day debugging library version conflicts.</p>
<p>The second is CI/CD. Because these projects are triggered programmatically, they fit perfectly into automated retraining pipelines. When reproducibility is non-negotiable, having a "single source of truth" for how to run your code makes life a lot easier for everyone involved.</p>
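<p>For the CI/CD case, a retraining job usually just needs to assemble the same <code>mlflow run</code> command from its own configuration. Here is a small sketch of that step. The helper name <code>build_mlflow_run_args</code> is an assumption for this example, and the code only builds the argument list rather than executing it.</p>

```python
import shlex


def build_mlflow_run_args(project_uri, entry_point, params):
    """Build the argument list for `mlflow run` from a dict of parameters."""
    args = ["mlflow", "run", project_uri, "-e", entry_point]
    for key, value in sorted(params.items()):
        args += ["-P", f"{key}={value}"]
    return args


args = build_mlflow_run_args(
    ".", "train", {"epochs": 100, "data_path": "./data/train.csv"}
)
print(shlex.join(args))

# In a real pipeline you would hand the list to subprocess.run(args, check=True)
# so that a failed training run fails the CI job.
```

<p>Passing a list instead of a single shell string avoids quoting problems when a parameter value contains spaces.</p>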
<h2 id="heading-understanding-the-mlflow-model-registry">Understanding the MLflow Model Registry</h2>
<p>Tracking experiments tells you which model is the "winner," but the Model Registry is where you actually manage that winner’s journey from your notebook to a live production environment.</p>
<p>Think of it as the governance layer. It handles versioning, stage management, and creates a clear audit trail so you never have to guess which model is currently running in the wild.</p>
<p>The Registry uses a few simple concepts to keep things organized:</p>
<ul>
<li><p><strong>Registered Model:</strong> This is the overall name under which all versions of a model are grouped, like <code>CustomerChurnPredictor</code>.</p>
</li>
<li><p><strong>Model Version:</strong> Every time you push a new iteration, MLflow auto-increments the version (v1, v2, and so on).</p>
</li>
<li><p><strong>Stage:</strong> These are labels like <strong>Staging</strong>, <strong>Production</strong>, or <strong>Archived</strong>. They tell your team exactly where a model stands in its lifecycle.</p>
</li>
<li><p><strong>Annotations:</strong> These are just notes and tags. They’re great for documenting why a specific version was promoted or what its quirks are.</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/627d043a4903bec29b5871be/bcd77d8f-a37c-4b0f-a112-9e2ad36d8cc2.png" alt="The MLflow Model Registry interface showing Version 1 of the IrisClassifier model officially transitioned to the Production stage." style="display:block;margin:0 auto" width="2862" height="1384" loading="lazy">

<h2 id="heading-moving-a-model-through-the-pipeline">Moving a Model through the Pipeline</h2>
<p>In a real-world workflow, you don't just "deploy" a file. You transition it through stages. Here's how that looks using the MLflow Client:</p>
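<pre><code class="language-python">import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# ID of the tracking run that logged the model
# (placeholder: copy the real value from the MLflow UI)
run_id = "YOUR_RUN_ID"

# First, we register the model from a run that went well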
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/random_forest_model",
    name="CustomerChurnPredictor"
)

# Then, we move Version 1 to Staging so the QA team can look at it
client.transition_model_version_stage(
    name="CustomerChurnPredictor",
    version=1,
    stage="Staging"
)

# Once everything checks out, we promote it to Production
client.transition_model_version_stage(
    name="CustomerChurnPredictor",
    version=1,
    stage="Production"
)
</code></pre>
<h3 id="heading-why-does-this-matter">Why Does This Matter?</h3>
<p>The Model Registry solves a problem that usually gets messy the moment a team grows: knowing exactly which version is live, who approved it, and what it was compared against. Without this, that information usually ends up buried in Slack threads or outdated spreadsheets.</p>
<p>It also makes rollbacks much less painful. If Version 3 starts acting up in production, you don't need to redeploy your entire stack. You can just transition Version 2 back to the "Production" stage in the registry. As long as your serving infrastructure is set up to load whichever version currently holds the "Production" stage, it will swap back to the stable version the next time it reloads the model.</p>
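<p>The rollback logic is easier to see in a toy version. The sketch below is an in-memory stand-in for a registry, not MLflow's implementation. The <code>ToyRegistry</code> class and its rule that only one version holds Production at a time are assumptions made for this example.</p>

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """In-memory stand-in for a model registry (not MLflow's implementation)."""

    stages: dict = field(default_factory=dict)  # version number -> stage name

    def register(self):
        version = len(self.stages) + 1  # auto-increment: v1, v2, ...
        self.stages[version] = "None"
        return version

    def transition(self, version, stage):
        # Assumption of this sketch: only one version holds Production at a
        # time, so the previous holder is archived when another is promoted.
        if stage == "Production":
            for other, current in self.stages.items():
                if current == "Production":
                    self.stages[other] = "Archived"
        self.stages[version] = stage

    def production_version(self):
        """What a serving layer would ask for each time it loads a model."""
        for version, stage in self.stages.items():
            if stage == "Production":
                return version
        return None


registry = ToyRegistry()
for _ in range(3):
    registry.register()

registry.transition(2, "Production")
registry.transition(3, "Production")  # v3 goes live, v2 is archived
live_before = registry.production_version()

registry.transition(2, "Production")  # rollback: v3 misbehaves, v2 returns
live_after = registry.production_version()
print(live_before, live_after)
```

<p>The serving code never names a version number. It only asks for "whatever is in Production", which is why a rollback is a registry change rather than a redeploy.</p>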
<h2 id="heading-how-the-components-fit-together">How the Components Fit Together</h2>
<p>To see how all of this actually works in the real world, it helps to walk through a typical workflow from start to finish. It's essentially a relay race where each component hands off the baton to the next one.</p>
<p>It starts with a data scientist running a handful of experiments. Every time they hit run, MLflow Tracking is in the background taking notes. It logs parameters and metrics to the Backend Store and saves model artifacts to the Artifact Store. At this stage, everything is about exploration and finding that one winner.</p>
<p>Once that best run is identified, the model gets officially registered in the Model Registry. This is where the team takes over. They can hop into the UI to check the annotations, review the evaluation results, and move the model into Staging. After it passes a few more validation tests, it gets the green light and is promoted to Production.</p>
<p>When it is time to actually serve the model, the deployment system simply asks the Registry for the current Production version. This happens whether you are using Kubernetes, a cloud endpoint, or MLflow’s built-in server.</p>
<p>Because the MLproject file handled the dependencies and the MLflow Models format handled the framework details, the serving infrastructure does not have to care if the model was built with Scikit-learn or PyTorch. The hand-off is smooth because all the necessary info is already there.</p>
<p>This flow is what turns MLflow from a collection of useful utilities into a full MLOps platform. It connects the messy experimental phase of data science to the rigid world of production software.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>At the end of the day, MLflow architecture is built to stay out of your way. It doesn't force you to change how you write your code or which libraries you use. Instead, it just provides the structure needed to make your machine learning projects reproducible and easier to manage as a team.</p>
<p>Whether you're just trying to get away from naming files model_final_v2.pkl or you are building a complex CI/CD pipeline for your models, understanding these four pillars is the best place to start. The best way to learn is to just fire up a local tracking server and start logging. You will probably find that once you have that "source of truth" for your experiments, you will never want to go back to the old way of doing things.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an End-to-End ML Platform Locally: From Experiment Tracking to CI/CD ]]>
                </title>
                <description>
                    <![CDATA[ Machine learning projects don’t end at training a model in a Jupyter notebook. The hard part is the “last mile”: turning that notebook model into something you can run reliably, update safely, and tru ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-end-to-end-ml-platform-locally-from-experiment-tracking-to-cicd/</link>
                <guid isPermaLink="false">69b9bab4c22d3eeb8afd5284</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Platform Engineering  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ FastAPI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sandeep Bharadwaj Mannapur ]]>
                </dc:creator>
                <pubDate>Tue, 17 Mar 2026 20:33:56 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/8401d978-0bed-4534-af93-f6bfc1b77c89.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Machine learning projects don’t end at training a model in a Jupyter notebook. The hard part is the “last mile”: turning that notebook model into something you can run reliably, update safely, and trust over time.</p>
<p>Most ML systems fail in production for boring (and painful) reasons: the training code and the serving code drift apart, input data changes shape, a “small” preprocessing tweak breaks predictions, or the model silently degrades because real-world behavior shifts. None of these problems is solved by a better algorithm. They’re solved by engineering: repeatable pipelines, validation, versioning, monitoring, and automated checks.</p>
<p>In this hands-on handbook, you’ll build a complete mini ML platform on your local machine, an end-to-end project that takes a model from training to deployment with the core “last mile” infrastructure in place. We’ll use a fraud detection example (predicting fraudulent transactions), but the same workflow works for churn prediction or any binary classification problem. Everything runs locally (no cloud required), and every step is copy-paste runnable so you can follow along and verify outputs as you go.</p>
<p>By the end, you'll have a production-style ML pipeline running on your machine, from training the model to serving predictions, with the infrastructure to test, monitor, and iterate with confidence. Let's dive in!</p>
<p>📦 <strong>Get the Complete Code</strong><br>All code from this handbook is available in a ready-to-run repository:<br><strong>Repository:</strong> <a href="https://github.com/sandeepmb/freecodecamp-local-ml-platform">https://github.com/sandeepmb/freecodecamp-local-ml-platform</a><br>Clone it and follow along, or use it as a reference implementation.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-project-overview-and-setup">Project Overview and Setup</a></p>
</li>
<li><p><a href="#heading-1-build-a-simple-model-and-api-the-naive-approach">Build a Simple Model and API (The Naive Approach)</a></p>
<ul>
<li><p><a href="#heading-11-train-a-quick-model">Train a Quick Model</a></p>
</li>
<li><p><a href="#heading-12-serve-predictions-with-fastapi">Serve Predictions with FastAPI</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-2-where-the-naive-approach-breaks">Where the Naive Approach Breaks</a></p>
<ul>
<li><p><a href="#heading-problem-1-no-experiment-tracking-reproducibility">Problem 1: No Experiment Tracking (Reproducibility)</a></p>
</li>
<li><p><a href="#heading-problem-2-model-versioning-and-deployment-chaos">Problem 2: Model Versioning and Deployment Chaos</a></p>
</li>
<li><p><a href="#heading-problem-3-no-data-validation-garbage-in-garbage-out">Problem 3: No Data Validation – Garbage In, Garbage Out</a></p>
</li>
<li><p><a href="#heading-problem-4-model-drift-performance-decay-over-time">Problem 4: Model Drift – Performance Decay Over Time</a></p>
</li>
<li><p><a href="#heading-problem-5-no-ci-cd-or-deployment-safety">Problem 5: No CI/CD or Deployment Safety</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-3-add-experiment-tracking-and-model-registry-with-mlflow">Add Experiment Tracking and Model Registry with MLflow</a></p>
<ul>
<li><p><a href="#heading-31-how-to-set-up-the-mlflow-tracking-server">How to Set Up the MLflow Tracking Server</a></p>
</li>
<li><p><a href="#heading-32-how-to-log-experiments-in-code">How to Log Experiments in Code</a></p>
</li>
<li><p><a href="#heading-33-how-to-use-the-model-registry">How to Use the Model Registry</a></p>
</li>
<li><p><a href="#heading-34-update-api-to-load-from-registry">Update API to Load from Registry</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-4-ensure-feature-consistency-with-feast">Ensure Feature Consistency with Feast</a></p>
<ul>
<li><p><a href="#heading-41-what-is-feast-and-why-use-it">What Is Feast and Why Use It?</a></p>
</li>
<li><p><a href="#heading-42-install-and-initialize-feast">Install and Initialize Feast</a></p>
</li>
<li><p><a href="#heading-43-define-feature-definitions">Define Feature Definitions</a></p>
</li>
<li><p><a href="#heading-44-materialize-features-to-the-online-store">Materialize Features to the Online Store</a></p>
</li>
<li><p><a href="#heading-45-retrieve-features-for-training-and-serving">Retrieve Features for Training and Serving</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-5-add-data-validation-with-great-expectations">Add Data Validation with Great Expectations</a></p>
<ul>
<li><p><a href="#heading-51-define-expectations">Define Expectations</a></p>
</li>
<li><p><a href="#heading-52-integrate-validation-into-fastapi">Integrate Validation into FastAPI</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-6-monitor-model-performance-and-data-drift">Monitor Model Performance and Data Drift</a></p>
<ul>
<li><p><a href="#heading-61-the-four-pillars-of-ml-observability">The Four Pillars of ML Observability</a></p>
</li>
<li><p><a href="#heading-62-build-a-drift-monitor-with-evidently">Build a Drift Monitor with Evidently</a></p>
</li>
<li><p><a href="#heading-63-production-monitoring-strategy">Production Monitoring Strategy</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-7-automate-testing-and-deployment-with-ci-cd">Automate Testing and Deployment with CI/CD</a></p>
<ul>
<li><p><a href="#heading-71-write-tests-for-data-and-model">Write Tests for Data and Model</a></p>
</li>
<li><p><a href="#heading-72-github-actions-workflow">GitHub Actions Workflow</a></p>
</li>
<li><p><a href="#heading-73-dockerize-the-application">Dockerize the Application</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-8-incident-response-playbook">Incident Response Playbook</a></p>
<ul>
<li><p><a href="#heading-scenario-false-positive-spike">Scenario: False Positive Spike</a></p>
</li>
<li><p><a href="#heading-scenario-gradual-performance-decay">Scenario: Gradual Performance Decay</a></p>
</li>
<li><p><a href="#heading-scenario-upstream-data-schema-change">Scenario: Upstream Data Schema Change</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-9-how-to-put-it-all-together">How to Put It All Together</a></p>
</li>
<li><p><a href="#heading-10-whats-next-scale-to-production">What’s Next: Scale to Production</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ol>
<h2 id="heading-project-overview-and-setup"><strong>Project Overview and Setup</strong></h2>
<p>Before we jump into coding, let's set the stage. Our use case is <strong>credit card fraud detection</strong> – a binary classification problem where we predict whether a transaction is fraudulent (<code>is_fraud = 1</code>) or legitimate (<code>is_fraud = 0</code>). This is a common ML task and a good proxy for production ML challenges: fraud patterns change over time (which lets us discuss model drift), and bad input data (for example, malformed transaction info) can cause serious issues if it isn't handled properly.</p>
<h3 id="heading-tech-stack"><strong>Tech Stack</strong></h3>
<p>We will use Python-based tools that are popular in MLOps but still beginner-friendly:</p>
<table>
<thead>
<tr>
<th><strong>Tool</strong></th>
<th><strong>Purpose</strong></th>
<th><strong>Why We Chose It</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>MLflow</strong></td>
<td>Experiment tracking and model registry</td>
<td>Open-source, widely adopted, great UI</td>
</tr>
<tr>
<td><strong>Feast</strong></td>
<td>Feature store for consistent feature serving</td>
<td>Production-grade, runs locally, same API for offline/online</td>
</tr>
<tr>
<td><strong>FastAPI</strong></td>
<td>High-performance web framework for serving predictions</td>
<td>Fast, automatic docs, modern Python</td>
</tr>
<tr>
<td><strong>Great Expectations</strong></td>
<td>Data validation framework</td>
<td>Declarative expectations, great reports</td>
</tr>
<tr>
<td><strong>Evidently</strong></td>
<td>Monitoring for data drift and model decay</td>
<td>Beautiful reports, easy to integrate</td>
</tr>
<tr>
<td><strong>Docker</strong></td>
<td>Containerization for environment consistency</td>
<td>Industry standard, works everywhere</td>
</tr>
<tr>
<td><strong>GitHub Actions</strong></td>
<td>CI/CD automation</td>
<td>Free for public repos, tight GitHub integration</td>
</tr>
</tbody></table>
<p>Let me explain each tool briefly:</p>
<p><strong>MLflow</strong> is an open-source platform designed to manage the ML lifecycle. It provides experiment tracking (logging parameters, metrics, and artifacts), a model registry (versioning models with aliases), and model serving capabilities. We'll use it to ensure our experiments are reproducible and our models are versioned.</p>
<p><strong>Feast</strong> (Feature Store) is an open-source feature store that helps manage and serve features consistently between training and inference. This prevents a common problem called "training-serving skew" where the features used in production differ slightly from those used in training, causing silent accuracy degradation.</p>
<p><strong>FastAPI</strong> is a modern, fast web framework for building APIs with Python. It's known for being easy to use, efficient, and producing automatic interactive documentation. We'll use it to serve our model predictions.</p>
<p><strong>Great Expectations</strong> is an open-source tool for data quality testing. It allows us to define "expectations" on data (like "amount should be positive" or "hour should be between 0 and 23") and test incoming data against them.</p>
<p><strong>Evidently</strong> is an open-source library for monitoring data and model performance over time. It can detect data drift (when input distributions change) and model decay (when accuracy drops).</p>
<p><strong>Docker</strong> ensures the same environment and dependencies in development and deployment, avoiding the classic "works on my machine" problem.</p>
<p><strong>GitHub Actions</strong> provides CI/CD automation. An efficient CI/CD pipeline helps integrate and deploy changes faster and with fewer errors.</p>
<p>💡 <strong>Mental Model</strong>: Think of this as building a "safety net" around your ML model. Each tool we add catches a different failure mode, like defensive driving for machine learning.</p>
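<p>To make the "safety net" idea concrete, here is the kind of check described for Great Expectations above, written in plain Python. This is only a sketch of the concept, not the Great Expectations API, and <code>validate_transaction</code> is a hypothetical helper.</p>

```python
def validate_transaction(tx):
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    amount = tx.get("amount")
    hour = tx.get("hour")
    if not isinstance(amount, (int, float)) or not amount > 0:
        problems.append("amount should be positive")
    if hour not in range(24):
        problems.append("hour should be between 0 and 23")
    return problems


good = {"amount": 42.5, "hour": 14}
bad = {"amount": -5, "hour": 27}

print(validate_transaction(good))
print(validate_transaction(bad))
```

<p>Each rule catches one specific failure before it reaches the model, which is the pattern every tool in this handbook follows.</p>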
<h3 id="heading-prerequisites"><strong>Prerequisites</strong></h3>
<p>You'll need:</p>
<ul>
<li><p><strong>Python 3.9+</strong> installed on your machine</p>
</li>
<li><p><strong>Docker Desktop</strong> installed and running</p>
</li>
<li><p><strong>GitHub account</strong> (if you want to try the CI/CD pipeline)</p>
</li>
<li><p><strong>Basic familiarity with Python</strong> and ML concepts (what training and prediction mean)</p>
</li>
</ul>
<p>You don't need MLOps or Kubernetes experience. Everything will be done locally with just Python and Docker – <strong>no cloud and no Kubernetes needed</strong>.</p>
<h3 id="heading-project-structure"><strong>Project Structure</strong></h3>
<p>Let's set up a basic project structure on your local machine. Open your terminal and run:</p>
<pre><code class="language-bash"># Create project directory and subfolders
mkdir ml-platform-tutorial &amp;&amp; cd ml-platform-tutorial
mkdir -p data models src tests feature_repo

# Set up a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
</code></pre>
<p>Your project structure should look like this:</p>
<pre><code class="language-plaintext">ml-platform-tutorial/
├── data/              # Training and test datasets
├── models/            # Saved model files
├── src/               # Source code
├── tests/             # Test files
├── feature_repo/      # Feast feature repository
├── venv/              # Virtual environment
└── requirements.txt   # Dependencies
</code></pre>
<p>Next, create a <code>requirements.txt</code> with all the necessary libraries:</p>
<pre><code class="language-plaintext"># requirements.txt

# Core ML libraries
pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0

# Experiment tracking and model registry
mlflow==2.10.0

# Feature store
feast==0.36.0

# API framework
fastapi==0.109.0
uvicorn==0.27.0
httpx==0.26.0

# Data validation
great-expectations==0.18.8

# Monitoring
evidently==0.7.20

# Testing
pytest==8.0.0
pytest-cov==4.1.0

# Utilities
pyarrow==15.0.0
pydantic==2.6.0
</code></pre>
<p>📌 <strong>Version Note:</strong> Exact versions are pinned to ensure reproducibility. Newer versions may work, but all examples were tested with the versions listed here.</p>
<p>Install the dependencies:</p>
<pre><code class="language-bash">pip install -r requirements.txt
</code></pre>
<p>This might take a few minutes as it installs all the packages. Once complete, we're ready to start building our project step by step.</p>
<p><strong>Checkpoint:</strong> You should have a project folder with <code>data/</code>, <code>models/</code>, <code>src/</code>, <code>tests/</code>, and <code>feature_repo/</code> directories, and an activated virtual environment with all dependencies installed. Verify by running <code>python -c "import mlflow; import feast; import fastapi; print('All imports successful!')"</code>.</p>
<p><strong>Figure 1: The Complete ML Platform We'll Build</strong></p>
<p><em>Don't worry if this looks complex. We'll build each component step by step, starting with the simplest piece and connecting them together.</em></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771392341567/4bfdd727-32fb-4f30-a63e-c94f61a9f2db.png" alt="Architecture diagram of a local end-to-end machine learning platform for fraud detection. Transaction data flows through model training, experiment tracking and model registry in MLflow, feature management in Feast, data validation with Great Expectations, prediction serving through FastAPI, monitoring with Evidently, and automated testing and deployment with Docker and GitHub Actions." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-1-build-a-simple-model-and-api-the-naive-approach"><strong>1. Build a Simple Model and API (The Naive Approach)</strong></h2>
<p>To illustrate why we need all these tools, let's start by building a <strong>naive ML system without any MLOps infrastructure</strong>. We'll train a simple model and deploy it quickly, then observe what problems arise. This "naive approach" is how most ML projects start – and understanding its limitations will motivate the solutions we implement later.</p>
<h3 id="heading-11-train-a-quick-model"><strong>1.1 Train a Quick Model</strong></h3>
<p>First, we need some data. For simplicity, we'll generate a synthetic dataset for fraud detection so that we don't rely on any external data files. The dataset will have features like:</p>
<ul>
<li><p><code>amount</code>: Transaction amount in dollars</p>
</li>
<li><p><code>hour</code>: Hour of the day (0-23) when the transaction occurred</p>
</li>
<li><p><code>day_of_week</code>: Day of the week (0=Monday, 6=Sunday)</p>
</li>
<li><p><code>merchant_category</code>: Type of merchant (grocery, restaurant, retail, online, travel)</p>
</li>
<li><p><code>is_fraud</code>: Label indicating if the transaction is fraudulent (1) or legitimate (0)</p>
</li>
</ul>
<p>We will simulate that only ~2% of transactions are fraud, which is an imbalance typical in real fraud data. This imbalance is important because it affects how we evaluate our model.</p>
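<p>Here is a quick illustration of why the imbalance matters, using made-up labels with the same 2% fraud rate. A "model" that never flags anything looks excellent on accuracy while catching no fraud at all.</p>

```python
def accuracy_and_recall(y_true, y_pred):
    """Compute accuracy and recall for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for truth, pred in pairs if truth == pred)
    true_pos = sum(1 for truth, pred in pairs if truth == 1 and pred == 1)
    actual_pos = sum(1 for truth in y_true if truth == 1)
    accuracy = correct / len(pairs)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# 1,000 hypothetical transactions, 2% of them fraudulent
y_true = [1] * 20 + [0] * 980
# A useless classifier that predicts "legitimate" every time
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.1%}, recall={recall:.1%}")
```

<p>The classifier scores 98% accuracy with 0% recall, which is why precision, recall, and F1 are reported alongside accuracy later in this section.</p>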
<p>Create <code>src/generate_data.py</code>:</p>
<pre><code class="language-python"># src/generate_data.py
"""
Generate synthetic fraud detection dataset.

This script creates realistic-looking transaction data where fraudulent
transactions have different patterns than legitimate ones:
- Fraud tends to have higher amounts
- Fraud tends to occur late at night
- Fraud is more common for online and travel merchants
"""
import pandas as pd
import numpy as np

def generate_transactions(n_samples=10000, fraud_ratio=0.02, seed=42):
    """
    Generate synthetic fraud detection dataset.
    
    Args:
        n_samples: Total number of transactions to generate
        fraud_ratio: Proportion of fraudulent transactions (default 2%)
        seed: Random seed for reproducibility
    
    Returns:
        DataFrame with transaction features and fraud labels
    
    Fraud transactions have different patterns:
    - Higher amounts (median about $245 vs $33 for legit)
    - Late night hours (0-5, 23)
    - More likely to be online or travel merchants
    """
    np.random.seed(seed)
    n_fraud = int(n_samples * fraud_ratio)
    n_legit = n_samples - n_fraud

    # Legitimate transactions: normal shopping patterns
    # - Amounts follow a log-normal distribution (most small, some large)
    # - Hours are uniformly distributed throughout the day
    # - Merchant categories weighted toward everyday shopping
    legit = pd.DataFrame({
        "amount": np.random.lognormal(mean=3.5, sigma=1.2, size=n_legit),  # median ~$33
        "hour": np.random.randint(0, 24, size=n_legit),
        "day_of_week": np.random.randint(0, 7, size=n_legit),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_legit,
            p=[0.30, 0.25, 0.25, 0.15, 0.05]  # Weighted toward everyday shopping
        ),
        "is_fraud": 0
    })
    
    # Fraudulent transactions: suspicious patterns
    # - Higher amounts (fraudsters go big)
    # - Late night hours (less scrutiny)
    # - More online and travel (easier to exploit)
    fraud = pd.DataFrame({
        "amount": np.random.lognormal(mean=5.5, sigma=1.5, size=n_fraud),  # median ~$245
        "hour": np.random.choice([0, 1, 2, 3, 4, 5, 23], size=n_fraud),  # Late night
        "day_of_week": np.random.randint(0, 7, size=n_fraud),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_fraud,
            p=[0.05, 0.05, 0.10, 0.60, 0.20]  # Weighted toward online/travel
        ),
        "is_fraud": 1
    })
    
    # Combine and shuffle
    df = pd.concat([legit, fraud], ignore_index=True)
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    
    return df

if __name__ == "__main__":
    # Generate dataset
    print("Generating synthetic fraud detection dataset...")
    df = generate_transactions(n_samples=10000, fraud_ratio=0.02)
    
    # Split into train (80%) and test (20%)
    train_df = df.sample(frac=0.8, random_state=42)
    test_df = df.drop(train_df.index)
    
    # Save to CSV files
    train_df.to_csv("data/train.csv", index=False)
    test_df.to_csv("data/test.csv", index=False)
    
    # Print summary statistics
    print(f"\nDataset generated successfully!")
    print(f"Training set: {len(train_df):,} transactions")
    print(f"Test set: {len(test_df):,} transactions")
    print(f"Overall fraud ratio: {df['is_fraud'].mean():.2%}")
    print(f"\nLegitimate transactions - Average amount: ${df[df['is_fraud']==0]['amount'].mean():.2f}")
    print(f"Fraudulent transactions - Average amount: ${df[df['is_fraud']==1]['amount'].mean():.2f}")
    print(f"\nMerchant category distribution (fraud):")
    print(df[df['is_fraud']==1]['merchant_category'].value_counts(normalize=True))
</code></pre>
<p>Run the data generation script:</p>
<pre><code class="language-bash">python src/generate_data.py
</code></pre>
<p>You should see output similar to the following (your exact amounts and category shares may differ):</p>
<pre><code class="language-plaintext">Generating synthetic fraud detection dataset...

Dataset generated successfully!
Training set: 8,000 transactions
Test set: 2,000 transactions
Overall fraud ratio: 2.00%

Legitimate transactions - Average amount: $33.45
Fraudulent transactions - Average amount: $245.67

Merchant category distribution (fraud):
online        0.60
travel        0.20
retail        0.10
restaurant    0.05
grocery       0.05
</code></pre>
<p>Now you have <code>data/train.csv</code> and <code>data/test.csv</code> with 8,000 training and 2,000 test transactions.</p>
<p><strong>Why This Matters:</strong> The synthetic data has realistic patterns — fraud is rare (2%), high-value, late-night, and concentrated in certain merchant categories. These patterns give our model something to learn.</p>
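<p>One detail worth checking for yourself: the <code>mean</code> argument of <code>np.random.lognormal</code> is the mean of the underlying normal distribution, not of the amounts themselves. For a log-normal distribution the median is <code>exp(mu)</code> and the mean is <code>exp(mu + sigma**2 / 2)</code>, so the $33 and $245 figures match the medians, and the sample averages the script prints can come out noticeably higher. The sketch below computes both with the standard library.</p>

```python
import math


def lognormal_median(mu, sigma):
    """Median of a log-normal distribution with parameters mu and sigma."""
    return math.exp(mu)


def lognormal_mean(mu, sigma):
    """Mean of a log-normal distribution with parameters mu and sigma."""
    return math.exp(mu + sigma ** 2 / 2)


# Parameters taken from generate_transactions() above
for label, mu, sigma in [("legit", 3.5, 1.2), ("fraud", 5.5, 1.5)]:
    median = lognormal_median(mu, sigma)
    mean = lognormal_mean(mu, sigma)
    print(f"{label}: median ~ ${median:.0f}, mean ~ ${mean:.0f}")
```

<p>The gap between the two comes from the long right tail: a handful of very large transactions pull the average up while leaving the typical transaction unchanged.</p>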
<p>Now, let's train a quick model. We'll use a simple <strong>Random Forest classifier</strong> from scikit-learn to predict <code>is_fraud</code>. In this naive version, we won't do much feature engineering – just label encode the categorical <code>merchant_category</code> and feed everything to the model.</p>
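<p>Label encoding simply maps each category to an integer. The sketch below shows the idea in plain Python, assigning integers in sorted order as scikit-learn's <code>LabelEncoder</code> does. The helper names <code>fit_label_mapping</code> and <code>encode</code> are made up for this illustration.</p>

```python
def fit_label_mapping(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def encode(mapping, values):
    """Apply a fitted mapping; refuse categories it has never seen."""
    unknown = sorted(set(values) - set(mapping))
    if unknown:
        raise ValueError(f"Unseen categories at inference time: {unknown}")
    return [mapping[value] for value in values]


train_categories = ["grocery", "online", "travel", "retail", "restaurant", "online"]
mapping = fit_label_mapping(train_categories)
encoded = encode(mapping, ["online", "grocery"])
print(mapping)
print(encoded)
```

<p>The mapping is learned from the training data, so the exact same mapping has to be reused at inference time. That is the reason the training script keeps the fitted encoder around instead of fitting a new one when serving.</p>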
<p>Create <code>src/train_naive.py</code>:</p>
<pre><code class="language-python"># src/train_naive.py
"""
Train a fraud detection model - NAIVE VERSION.

This script demonstrates the "quick and dirty" approach to ML:
- No experiment tracking
- No model versioning
- Just train and save to a pickle file

We'll improve on this in later sections.
"""
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score,
    confusion_matrix,
    classification_report
)

def main():
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    print(f"Training samples: {len(train_df):,}")
    print(f"Test samples: {len(test_df):,}")
    print(f"Training fraud ratio: {train_df['is_fraud'].mean():.2%}")
    
    # Encode the categorical feature
    # We need to save the encoder to use the same mapping at inference time
    print("\nEncoding categorical features...")
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    print(f"Merchant category mapping: {dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))}")
    
    # Prepare features and labels
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    # Train a Random Forest classifier
    print("\nTraining Random Forest model...")
    model = RandomForestClassifier(
        n_estimators=100,      # Number of trees
        max_depth=10,          # Maximum depth of each tree
        random_state=42,       # For reproducibility
        n_jobs=-1              # Use all CPU cores
    )
    model.fit(X_train, y_train)
    print("Training complete!")
    
    # Evaluate on test data
    print("\n" + "="*50)
    print("MODEL EVALUATION")
    print("="*50)
    
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"F1-score:  {f1_score(y_test, y_pred):.4f}")
    
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(f"  True Negatives:  {cm[0][0]:,} (correctly identified legitimate)")
    print(f"  False Positives: {cm[0][1]:,} (legitimate flagged as fraud)")
    print(f"  False Negatives: {cm[1][0]:,} (fraud missed - DANGEROUS!)")
    print(f"  True Positives:  {cm[1][1]:,} (correctly caught fraud)")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
    
    # Feature importance
    print("\nFeature Importance:")
    for name, importance in sorted(
        zip(feature_cols, model.feature_importances_),
        key=lambda x: x[1],
        reverse=True
    ):
        print(f"  {name}: {importance:.4f}")
    
    # Save the model and encoder together
    print("\nSaving model to models/model.pkl...")
    with open("models/model.pkl", "wb") as f:
        pickle.dump((model, encoder), f)
    
    print("\nModel trained and saved successfully!")
    print("\nWARNING: This naive approach has several problems:")
    print("  - No record of hyperparameters or metrics")
    print("  - No model versioning")
    print("  - No way to reproduce this exact model")
    print("  - We'll fix these issues in the following sections!")

if __name__ == "__main__":
    main()
</code></pre>
<p>Run the training script:</p>
<pre><code class="language-bash">python src/train_naive.py
</code></pre>
<p>You should see output similar to:</p>
<pre><code class="language-python">Loading data...
Training samples: 8,000
Test samples: 2,000
Training fraud ratio: 2.00%

Encoding categorical features...
Merchant category mapping: {'grocery': 0, 'online': 1, 'restaurant': 2, 'retail': 3, 'travel': 4}

Training Random Forest model...
Training complete!

==================================================
MODEL EVALUATION
==================================================

Accuracy:  0.9880
Precision: 0.7273
Recall:    0.6154
F1-score:  0.6667

Confusion Matrix:
  True Negatives:  1,952 (correctly identified legitimate)
  False Positives: 9 (legitimate flagged as fraud)
  False Negatives: 15 (fraud missed - DANGEROUS!)
  True Positives:  24 (correctly caught fraud)

Feature Importance:
  amount: 0.5423
  hour: 0.2156
  merchant_encoded: 0.1345
  day_of_week: 0.1076
</code></pre>
<p><strong>Important observation:</strong> You'll see ~98% accuracy but a lower F1-score (around 0.5-0.7). <strong>With only 2% fraud, accuracy is extremely misleading!</strong> A model that always predicts "not fraud" would achieve 98% accuracy while catching zero fraud. This is why we focus on F1-score, precision, and recall for imbalanced classification problems.</p>
<p>💡 If you're new to imbalanced classification, remember: high accuracy can be meaningless when the positive class is rare.</p>
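<p>To make this concrete, here is a minimal sketch in plain Python. It assumes a hypothetical test set of 2,000 transactions with exactly 40 fraud cases (2%) and a "model" that always answers "not fraud". The numbers are illustrative, not output from our training script.</p>
<pre><code class="language-python"># Hypothetical test set: 2,000 transactions, 2% of them fraudulent.
y_true = [1] * 40 + [0] * 1960

# A "model" that always predicts "not fraud".
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
actual_fraud = sum(y_true)

accuracy = correct / len(y_true)       # share of all predictions that are right
recall = true_positives / actual_fraud  # share of real fraud that was caught

print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 98.00%
print(f"Recall:   {recall:.2%}")    # Recall:   0.00%
</code></pre>
<p>The baseline scores 98% accuracy while catching none of the 40 fraud cases, which is why recall, precision, and F1 are the numbers to watch here.</p>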
<p>The script outputs a file <code>models/model.pkl</code> containing both the trained model and the label encoder (we need both for inference).</p>
<p><strong>Checkpoint:</strong> You should now have:</p>
<ul>
<li><p><code>data/train.csv</code> (~8,000 rows)</p>
</li>
<li><p><code>data/test.csv</code> (~2,000 rows)</p>
</li>
<li><p><code>models/model.pkl</code> (trained model + encoder)</p>
</li>
</ul>
<p>The model should show ~98% accuracy but F1 around 0.5-0.7. Verify the files exist: <code>ls -la data/ models/</code></p>
<h3 id="heading-12-serve-predictions-with-fastapi"><strong>1.2 Serve Predictions with FastAPI</strong></h3>
<p>Now that we have a model, let's deploy it as an API so that clients can get predictions. We'll use <strong>FastAPI</strong> because it's straightforward, very fast, and produces automatic interactive documentation.</p>
<p>FastAPI is known for:</p>
<ul>
<li><p><strong>Ease of use</strong>: Pythonic syntax with type hints</p>
</li>
<li><p><strong>High performance</strong>: One of the fastest Python frameworks</p>
</li>
<li><p><strong>Automatic documentation</strong>: Swagger UI out of the box</p>
</li>
<li><p><strong>Data validation</strong>: Using Pydantic models</p>
</li>
</ul>
<p>Create <code>src/serve_naive.py</code>:</p>
<pre><code class="language-python"># src/serve_naive.py
"""
Serve fraud detection model as a REST API - NAIVE VERSION.

This is a simple API that:
1. Loads the trained model at startup
2. Accepts transaction data via POST request
3. Returns fraud prediction

We'll improve this with validation, monitoring, and better
model loading in later sections.
"""
import pickle
from fastapi import FastAPI
from pydantic import BaseModel, Field
from typing import Optional

# Load the trained model and encoder at startup
# This is loaded once when the server starts, not on every request
print("Loading model...")
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)
print("Model loaded successfully!")

# Create the FastAPI application
app = FastAPI(
    title="Fraud Detection API",
    description="""
    Predict whether a credit card transaction is fraudulent.
    
    This API accepts transaction details and returns:
    - Whether the transaction is predicted to be fraud
    - The probability of fraud (0.0 to 1.0)
    
    **Note:** This is the naive version without validation or monitoring.
    """,
    version="1.0.0"
)

# Define the input schema using Pydantic
# This provides automatic validation and documentation
class Transaction(BaseModel):
    """Schema for a transaction to be evaluated for fraud."""
    amount: float = Field(
        ..., 
        description="Transaction amount in dollars",
        example=150.00
    )
    hour: int = Field(
        ..., 
        description="Hour of the day (0-23)",
        example=14
    )
    day_of_week: int = Field(
        ..., 
        description="Day of week (0=Monday, 6=Sunday)",
        example=3
    )
    merchant_category: str = Field(
        ..., 
        description="Type of merchant",
        example="online"
    )

class PredictionResponse(BaseModel):
    """Schema for the prediction response."""
    is_fraud: bool = Field(description="Whether the transaction is predicted as fraud")
    fraud_probability: float = Field(description="Probability of fraud (0.0 to 1.0)")
    
@app.post("/predict", response_model=PredictionResponse)
def predict(transaction: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Takes transaction details and returns a fraud prediction
    along with the probability score.
    """
    # Convert the request to a dictionary
    data = transaction.dict()
    
    # Encode the merchant category using the same encoder from training
    # This ensures consistency between training and serving
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        # Handle unknown merchant categories
        # In production, we'd want better handling here
        data["merchant_encoded"] = 0
    
    # Prepare features in the same order as training
    X = [[
        data["amount"],
        data["hour"],
        data["day_of_week"],
        data["merchant_encoded"]
    ]]
    
    # Get prediction and probability
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0][1]  # Probability of class 1 (fraud)
    
    return PredictionResponse(
        is_fraud=bool(prediction),
        fraud_probability=round(float(probability), 4)
    )

@app.get("/health")
def health_check():
    """
    Health check endpoint.
    
    Returns the status of the API. Useful for:
    - Load balancer health checks
    - Kubernetes liveness probes
    - Monitoring systems
    """
    return {
        "status": "healthy",
        "model_loaded": model is not None
    }

@app.get("/")
def root():
    """Root endpoint with API information."""
    return {
        "message": "Fraud Detection API",
        "version": "1.0.0",
        "docs": "/docs",
        "health": "/health"
    }
</code></pre>
<p>A few important things to note about this code:</p>
<ol>
<li><p><strong>Pydantic Models</strong>: We use <code>BaseModel</code> to define the expected input JSON schema. FastAPI automatically validates incoming requests against this schema.</p>
</li>
<li><p><strong>Type Hints</strong>: The type hints (<code>float</code>, <code>int</code>, <code>str</code>) provide both documentation and runtime validation.</p>
</li>
<li><p><strong>Feature Encoding</strong>: On each request, we encode the merchant category using the same <code>LabelEncoder</code> we saved from training. This ensures consistency between training and serving.</p>
</li>
<li><p><strong>Health Endpoint</strong>: The <code>/health</code> endpoint is standard practice for production APIs - it allows load balancers and monitoring systems to check if the service is running.</p>
</li>
</ol>
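<p>The encoding step deserves a closer look. The sketch below is a plain-Python stand-in for <code>LabelEncoder</code>, not scikit-learn's implementation: it only mimics the fact that classes are sorted and numbered from 0, which reproduces the mapping printed by the training script. The <code>encode</code> helper and its <code>fallback</code> argument are hypothetical names used to mirror the <code>except ValueError</code> branch in our API.</p>
<pre><code class="language-python"># Plain-Python stand-in for LabelEncoder: sorted unique classes get codes 0..n-1.
training_categories = ["online", "travel", "retail", "restaurant", "grocery", "online"]
classes = sorted(set(training_categories))
mapping = {name: code for code, name in enumerate(classes)}
print(mapping)
# {'grocery': 0, 'online': 1, 'restaurant': 2, 'retail': 3, 'travel': 4}

def encode(category, fallback=0):
    """Mirror the naive API: unknown categories silently become `fallback`."""
    return mapping.get(category, fallback)

print(encode("online"))            # 1
print(encode("unknown_category"))  # 0, the same code as "grocery"
</code></pre>
<p>Because the fallback code is 0, a merchant category the model has never seen is scored exactly as if it were <code>grocery</code>. The request succeeds, so nothing tells the caller that a substitution happened.</p>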
<p>To run this API, use Uvicorn (an ASGI server):</p>
<pre><code class="language-bash">uvicorn src.serve_naive:app --reload --host 0.0.0.0 --port 8000
</code></pre>
<p>The <code>--reload</code> flag enables auto-reload during development (the server restarts when you change code).</p>
<p>You should see:</p>
<pre><code class="language-python">Loading model...
Model loaded successfully!
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process
</code></pre>
<p>Now open your browser and go to <code>http://localhost:8000/docs</code>. You'll see the <strong>Swagger UI</strong> – an auto-generated interactive documentation where you can test the API directly from your browser!</p>
<p>Test the API using curl in another terminal:</p>
<pre><code class="language-bash"># Test with a legitimate-looking transaction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}'
</code></pre>
<p>Expected response (the exact probability will vary with your trained model):</p>
<pre><code class="language-python">{"is_fraud": false, "fraud_probability": 0.02}
</code></pre>
<pre><code class="language-bash"># Test with a suspicious transaction (high amount, late night, online)
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 500.0, "hour": 3, "day_of_week": 1, "merchant_category": "online"}'
</code></pre>
<p>Expected response (again, your probability will differ):</p>
<pre><code class="language-python">{"is_fraud": true, "fraud_probability": 0.78}
</code></pre>
<p><strong>We have a working model served as an API!</strong> In a real scenario, we could now integrate this API with a payment processing frontend, mobile app, or any system that needs fraud predictions.</p>
<p>But before we celebrate, let's examine this naive approach for potential pitfalls...</p>
<p><strong>Checkpoint:</strong> Your API should be running at <code>http://localhost:8000</code>. The Swagger UI at <code>/docs</code> should list all three endpoints (<code>/predict</code>, <code>/health</code>, and <code>/</code>). Test with curl or the Swagger UI to verify that predictions are returned.</p>
<h2 id="heading-2-where-the-naive-approach-breaks"><strong>2. Where the Naive Approach Breaks</strong></h2>
<p>Our quick-and-dirty ML pipeline works on the surface: it can train a model and serve predictions. However, <strong>hidden problems will emerge</strong> if we try to maintain or scale this system in production.</p>
<p>This section is critical: understanding these issues will motivate the solutions we implement in the following sections. Let's go through the problems one by one.</p>
<h3 id="heading-problem-1-no-experiment-tracking-reproducibility"><strong>Problem 1: No Experiment Tracking (Reproducibility)</strong></h3>
<p>Try this thought experiment: Run <code>train_naive.py</code> again with different hyperparameters (change <code>n_estimators</code> to 200, or <code>max_depth</code> to 15). Would you be able to <strong>exactly reproduce the previous model's results</strong> if someone asked?</p>
<p>Probably not. Currently, we have <strong>no record</strong> of:</p>
<ul>
<li><p>Which hyperparameters we used</p>
</li>
<li><p>What metrics we achieved</p>
</li>
<li><p>What version of the data we trained on</p>
</li>
<li><p>What library versions were installed</p>
</li>
<li><p>When the training happened</p>
</li>
<li><p>Who ran the training</p>
</li>
</ul>
<p>Three months from now, if your manager asks "How was this model trained? Can you reproduce the results?" – you'd be in trouble. You might have the code, but you don't know which version of the code, which parameters, or which data produced the model that's currently in production.</p>
<p><strong>Experiment tracking</strong> is the practice of logging all these details (code versions, parameters, metrics, data versions, artifacts) so experiments can be compared and replicated. Our naive approach lacks this entirely, making our results hard to trust or build upon.</p>
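<p>To see how little it takes to record these details, here is a hand-rolled sketch using only the standard library. It is not MLflow and not what we'll use later. <code>build_run_record</code> is a hypothetical helper, the metric value is copied from the sample output above, and the CSV header passed as <code>data_bytes</code> is an assumed placeholder for the real contents of <code>data/train.csv</code>.</p>
<pre><code class="language-python">import hashlib
import json
import platform
from datetime import datetime, timezone

def build_run_record(params, metrics, data_bytes):
    """Collect the details a later reader would need to reproduce a run."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "params": params,
        "metrics": metrics,
        # A content hash identifies exactly which data the model saw.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }

record = build_run_record(
    params={"n_estimators": 100, "max_depth": 10, "random_state": 42},
    metrics={"test_f1": 0.6667},
    # Placeholder bytes; in practice, read the training file in binary mode.
    data_bytes=b"amount,hour,day_of_week,merchant_category,is_fraud\n",
)
print(json.dumps(record, indent=2))
</code></pre>
<p>Even this small record answers "which parameters, which data, and when?" for a single run. A tracking tool adds storage, comparison across runs, and a UI on top of the same idea.</p>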
<h3 id="heading-problem-2-model-versioning-and-deployment-chaos"><strong>Problem 2: Model Versioning and Deployment Chaos</strong></h3>
<p>We trained one model and saved it as <code>model.pkl</code>. Now consider this scenario:</p>
<ol>
<li><p>You train a new model with different hyperparameters</p>
</li>
<li><p>You overwrite <code>model.pkl</code> with the new model</p>
</li>
<li><p>You deploy it to production</p>
</li>
<li><p>Users start complaining about more false positives</p>
</li>
<li><p>You want to roll back to the previous model</p>
</li>
<li><p><strong>Problem:</strong> The previous model was overwritten and is gone forever</p>
</li>
</ol>
<p>There's no systematic versioning. Questions you cannot answer:</p>
<ul>
<li><p>Which model version is currently in production?</p>
</li>
<li><p>What were the metrics for model v1 vs v2?</p>
</li>
<li><p>When was each model trained and by whom?</p>
</li>
<li><p>Can we instantly roll back if the new model performs worse?</p>
</li>
<li><p>What changed between versions?</p>
</li>
</ul>
<p>Without version control for models, you're flying blind. Imagine deploying code without Git – that's what we're doing with our model.</p>
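<p>The overwrite problem can be illustrated with a small sketch. It assumes a hypothetical helper, <code>save_versioned</code>, and an invented <code>model_v{N}.pkl</code> naming scheme; dictionaries stand in for trained models, and a temporary directory stands in for <code>models/</code>. It shows only the principle of never reusing a filename. It is not a substitute for a model registry.</p>
<pre><code class="language-python">import pickle
import tempfile
from pathlib import Path

def save_versioned(obj, model_dir):
    """Save obj as model_v{N}.pkl, where N is one more than the highest existing version."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    existing = [int(p.stem.split("_v")[1]) for p in model_dir.glob("model_v*.pkl")]
    version = max(existing, default=0) + 1
    path = model_dir / f"model_v{version}.pkl"
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path

with tempfile.TemporaryDirectory() as tmp:
    first = save_versioned({"max_depth": 10}, tmp)   # stand-in for model v1
    second = save_versioned({"max_depth": 15}, tmp)  # stand-in for model v2
    both_exist = first.exists() and second.exists()
    print(first.name, second.name)  # model_v1.pkl model_v2.pkl
    print(both_exist)               # True
</code></pre>
<p>With both files on disk, rolling back means loading <code>model_v1.pkl</code> again. What this sketch still lacks is everything else on the list above: metrics per version, who trained it, and which version is live.</p>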
<h3 id="heading-problem-3-no-data-validation-garbage-in-garbage-out"><strong>Problem 3: No Data Validation – Garbage In, Garbage Out</strong></h3>
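<p>Right now, our API only checks types: Pydantic rejects a request whose <code>amount</code> is not a number, but <strong>any value of the right type</strong> goes straight to the model, however implausible. Let's see what happens with bad data.</p>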
<p>Create a test script <code>src/test_bad_data.py</code>:</p>
<pre><code class="language-python"># src/test_bad_data.py
"""Test what happens when we send garbage data to the API."""
import requests

BASE_URL = "http://localhost:8000"

print("Testing API with various bad inputs...\n")

# Test 1: Negative amount
print("Test 1: Negative amount")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -500.0,        # Negative amount - impossible!
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 2: Invalid hour
print("Test 2: Hour = 25 (should be 0-23)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 25,              # Invalid hour!
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 3: Invalid day of week
print("Test 3: day_of_week = 10 (should be 0-6)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 10,       # Invalid day!
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 4: Unknown merchant category
print("Test 4: Unknown merchant category")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "unknown_category"  # Not in training data!
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 5: All bad at once
print("Test 5: Everything wrong")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -1000.0,
    "hour": 99,
    "day_of_week": 15,
    "merchant_category": "totally_fake"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

print("Observation: The API happily accepts ALL garbage and returns predictions!")
print("This is dangerous - bad data leads to bad predictions with no warning.")
</code></pre>
<p>Run it (make sure your API is still running):</p>
<pre><code class="language-bash">python src/test_bad_data.py
</code></pre>
<p>You'll see something like:</p>
<pre><code class="language-python">Testing API with various bad inputs...

Test 1: Negative amount
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.15}

Test 2: Hour = 25 (should be 0-23)
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.08}

...

Observation: The API happily accepts ALL garbage and returns predictions!
</code></pre>
<p><strong>The API accepts garbage and returns predictions with no warning!</strong> In production, this could mean:</p>
<ul>
<li><p>Incorrect predictions based on impossible data</p>
</li>
<li><p>Fraud going undetected because of malformed input</p>
</li>
<li><p>Legitimate transactions blocked based on corrupted data</p>
</li>
<li><p>No way to debug why predictions are wrong</p>
</li>
</ul>
<p>As the saying goes: <strong>"Garbage in, garbage out."</strong> But even worse – we don't even know garbage went in!</p>
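<p>As a rough illustration of the checks that are missing, here is a plain-Python sketch. <code>validate_transaction</code> and <code>KNOWN_MERCHANTS</code> are hypothetical names, and the rules are assumptions read off the field descriptions in our API schema (hour 0-23, day of week 0-6, the five merchant categories from training). It is not the validation approach we build later in the handbook.</p>
<pre><code class="language-python">KNOWN_MERCHANTS = {"grocery", "online", "restaurant", "retail", "travel"}

def validate_transaction(tx):
    """Return a list of human-readable problems; an empty list means the input looks sane."""
    problems = []
    if not tx["amount"] > 0:
        problems.append("amount must be positive")
    if tx["hour"] not in range(24):
        problems.append("hour must be between 0 and 23")
    if tx["day_of_week"] not in range(7):
        problems.append("day_of_week must be between 0 and 6")
    if tx["merchant_category"] not in KNOWN_MERCHANTS:
        problems.append("unknown merchant_category")
    return problems

good = {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
bad = {"amount": -1000.0, "hour": 99, "day_of_week": 15, "merchant_category": "totally_fake"}

print(validate_transaction(good))      # []
print(len(validate_transaction(bad)))  # 4
</code></pre>
<p>Returning every problem at once, rather than stopping at the first, gives the caller a complete error message and gives us something concrete to log.</p>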
<h3 id="heading-problem-4-model-drift-performance-decay-over-time"><strong>Problem 4: Model Drift – Performance Decay Over Time</strong></h3>
<p>Here's a scenario that happens in every production ML system:</p>
<ol>
<li><p><strong>January</strong>: You train your model on historical fraud data. It achieves 98% accuracy and 0.67 F1-score. Everyone's happy.</p>
</li>
<li><p><strong>February</strong>: The model is deployed and working well. Fraud is being caught.</p>
</li>
<li><p><strong>March</strong>: Fraudsters adapt. They start using different patterns – smaller amounts, different merchant categories, different times of day.</p>
</li>
<li><p><strong>April</strong>: Your model's accuracy has dropped from 98% to 85%. F1-score dropped from 0.67 to 0.35. Fraud is slipping through.</p>
</li>
<li><p><strong>May</strong>: A major fraud incident occurs. Investigation reveals the model has been underperforming for 2 months.</p>
</li>
</ol>
<p><strong>The problem:</strong> Nobody noticed for 2 months because there was no monitoring.</p>
<p>This phenomenon is called <strong>data drift</strong> (when input data distributions change) or <strong>concept drift</strong> (when the relationship between inputs and outputs changes). Both are inevitable in real-world systems.</p>
<p>Without monitoring:</p>
<ul>
<li><p>You don't know when performance degrades</p>
</li>
<li><p>You don't know why performance degrades</p>
</li>
<li><p>You can't take corrective action until users complain</p>
</li>
<li><p>By then, significant damage may have occurred</p>
</li>
</ul>
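<p>Even a crude check would have raised a flag earlier. The sketch below compares the mean transaction amount in a recent window against the training-time mean, measured in training-time standard deviations. All amounts are made up, <code>mean_shift</code> is a hypothetical helper, and the threshold of 2.0 is an arbitrary choice for illustration. Real drift detection looks at whole distributions and many features, not a single mean.</p>
<pre><code class="language-python">from statistics import mean, stdev

def mean_shift(reference, current):
    """How many reference standard deviations the current mean has moved."""
    return abs(mean(current) - mean(reference)) / stdev(reference)

reference_amounts = [20.0, 35.0, 50.0, 65.0, 80.0]      # made-up training-time amounts
stable_amounts = [22.0, 33.0, 52.0, 63.0, 80.0]         # made-up, similar distribution
shifted_amounts = [220.0, 235.0, 250.0, 265.0, 280.0]   # made-up, amounts have moved

THRESHOLD = 2.0  # arbitrary alert level for this sketch

for name, window in [("stable", stable_amounts), ("shifted", shifted_amounts)]:
    score = mean_shift(reference_amounts, window)
    print(name, "ALERT" if score > THRESHOLD else "ok")
# stable ok
# shifted ALERT
</code></pre>
<p>Run on a schedule against recent requests, a check like this turns "nobody noticed for 2 months" into an alert within days. It only detects changes in the inputs, though; spotting concept drift also requires fresh labels.</p>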
<h3 id="heading-problem-5-no-cicd-or-deployment-safety"><strong>Problem 5: No CI/CD or Deployment Safety</strong></h3>
<p>Our "deployment process" was literally:</p>
<ol>
<li><p>SSH into the server (or run locally)</p>
</li>
<li><p>Run <code>python src/train_naive.py</code></p>
</li>
<li><p>Copy <code>model.pkl</code> to the right place</p>
</li>
<li><p>Restart the API</p>
</li>
<li><p>Hope for the best</p>
</li>
</ol>
<p>This process has:</p>
<ul>
<li><p><strong>No automated testing</strong>: A typo could break everything</p>
</li>
<li><p><strong>No staging environment</strong>: We test directly in production</p>
</li>
<li><p><strong>No gradual rollout</strong>: 100% of traffic hits the new model immediately</p>
</li>
<li><p><strong>No rollback capability</strong>: If something breaks, we have to manually fix it</p>
</li>
<li><p><strong>No audit trail</strong>: Who deployed what and when?</p>
</li>
</ul>
<p>This is how production incidents happen. A rushed deployment at 5 PM on Friday breaks the fraud detection system, and nobody notices until Monday when fraud losses have spiked.</p>
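<p>One of the simplest safeguards an automated pipeline can run is a quality gate: refuse to ship a model whose metrics fall below agreed minimums. Here is a minimal sketch. <code>passes_quality_gate</code> and the threshold values are hypothetical; the candidate metrics are taken from the sample training output earlier, and the regressed ones from the drift scenario above.</p>
<pre><code class="language-python">def passes_quality_gate(metrics, minimums):
    """Return (ok, failures): ok is True only if every metric meets its minimum."""
    failures = [
        f"{name}: {metrics.get(name, 0.0):.4f} is below {floor:.4f}"
        for name, floor in minimums.items()
        if not metrics.get(name, 0.0) >= floor
    ]
    return (not failures, failures)

MINIMUMS = {"f1": 0.60, "recall": 0.55}  # hypothetical thresholds

candidate = {"f1": 0.6667, "recall": 0.6154}  # metrics from the sample run
regressed = {"f1": 0.35, "recall": 0.30}      # a degraded model

print(passes_quality_gate(candidate, MINIMUMS))     # (True, [])
print(passes_quality_gate(regressed, MINIMUMS)[0])  # False
</code></pre>
<p>In a CI job, a failed gate would stop the deployment step, so a weak model never reaches production just because someone was in a hurry.</p>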
<p><strong>Figure 2:</strong> Problems with the Naive Approach</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771392425864/75c51059-5ab3-4e08-b3ad-7f5e9c3e7445.png" alt="Diagram showing the weaknesses of a naive machine learning setup: manual training and deployment, no experiment tracking, no model versioning, inconsistent features between training and serving, no data validation, no drift or performance monitoring, and no CI/CD safeguards such as automated tests, rollback, or audit trail." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-summary-what-we-need-to-fix"><strong>Summary: What We Need to Fix</strong></h3>
<p>Our simple ML service is missing critical infrastructure. Here's the mapping of problems to solutions:</p>
<table>
<thead>
<tr>
<th><strong>Problem</strong></th>
<th><strong>Impact</strong></th>
<th><strong>Solution</strong></th>
<th><strong>Section</strong></th>
</tr>
</thead>
<tbody><tr>
<td>No experiment tracking</td>
<td>Can't reproduce or compare models</td>
<td>MLflow Tracking</td>
<td>3</td>
</tr>
<tr>
<td>No model versioning</td>
<td>Can't roll back or audit</td>
<td>MLflow Registry</td>
<td>3</td>
</tr>
<tr>
<td>No feature consistency</td>
<td>Training-serving skew</td>
<td>Feast Feature Store</td>
<td>4</td>
</tr>
<tr>
<td>No data validation</td>
<td>Garbage predictions</td>
<td>Great Expectations</td>
<td>5</td>
</tr>
<tr>
<td>No monitoring</td>
<td>Drift goes unnoticed</td>
<td>Evidently</td>
<td>6</td>
</tr>
<tr>
<td>No CI/CD</td>
<td>Risky deployments</td>
<td>GitHub Actions + Docker</td>
<td>7</td>
</tr>
</tbody></table>
<p><strong>The good news:</strong> We can fix each of these by incrementally adding components to our pipeline. Each tool addresses a specific problem, and together they form a robust ML platform.</p>
<p>Let's start fixing these issues, one by one.</p>
<h2 id="heading-3-add-experiment-tracking-and-model-registry-with-mlflow"><strong>3. Add Experiment Tracking and Model Registry with MLflow</strong></h2>
<p><strong>What breaks without this:</strong> You can't reproduce yesterday's results, can't compare experiments, and can't roll back when a new model fails in production.</p>
<p>Our first fix addresses <strong>Problems 1 and 2</strong>: experiment reproducibility and model versioning.</p>
<p><strong>MLflow</strong> is an open-source platform designed to manage the ML lifecycle. We'll use two of its key components:</p>
<ol>
<li><p><strong>MLflow Tracking</strong>: Log experiments (parameters, metrics, artifacts) so you can compare runs and reproduce results</p>
</li>
<li><p><strong>MLflow Model Registry</strong>: Version your models with aliases (champion, challenger) and manage the deployment lifecycle</p>
</li>
</ol>
<p><strong>Why This Matters:</strong> Without tracking, ML is guesswork. With MLflow, every run is logged with parameters, metrics, and artifacts. You can compare runs side-by-side, understand what actually improved your model, and reproduce any past experiment. The Model Registry adds governance – you know exactly which model is in production and can roll back in seconds.</p>
<h3 id="heading-31-how-to-set-up-the-mlflow-tracking-server"><strong>3.1 How to Set Up the MLflow Tracking Server</strong></h3>
<p>MLflow can log experiments to a local directory by default, but to use the full UI and model registry, it's best to run the MLflow tracking server.</p>
<p>Open a <strong>new terminal</strong> (keep it separate from your API terminal) and run:</p>
<pre><code class="language-bash"># Create a directory for MLflow data
mkdir -p mlruns

# Start the MLflow server
mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns
</code></pre>
<p>Let's break down these parameters:</p>
<ul>
<li><p><code>--host 0.0.0.0</code>: Listen on all network interfaces</p>
</li>
<li><p><code>--port 5000</code>: Run on port 5000</p>
</li>
<li><p><code>--backend-store-uri sqlite:///mlflow.db</code>: Store experiment metadata in a SQLite database (for production, you'd use PostgreSQL or MySQL)</p>
</li>
<li><p><code>--default-artifact-root ./mlruns</code>: Store model artifacts (files) in the <code>mlruns</code> directory</p>
</li>
</ul>
<p>You should see:</p>
<pre><code class="language-python">[INFO] Starting gunicorn 21.2.0
[INFO] Listening at: http://0.0.0.0:5000
</code></pre>
<p>Now open your browser and navigate to <code>http://localhost:5000</code>. You'll see the <strong>MLflow UI</strong> – it should be empty initially since we haven't logged any experiments yet.</p>
<h3 id="heading-32-how-to-log-experiments-in-code"><strong>3.2 How to Log Experiments in Code</strong></h3>
<p>Now let's modify our training script to log everything to MLflow. Create <code>src/train_mlflow.py</code>:</p>
<pre><code class="language-python"># src/train_mlflow.py
"""
Train fraud detection model with MLflow experiment tracking.

This script demonstrates proper ML experiment tracking:
- Log all hyperparameters
- Log all metrics (train and test)
- Log the trained model as an artifact
- Register the model in the Model Registry

Compare this to train_naive.py to see the difference!
"""
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    roc_auc_score
)
import pickle
from datetime import datetime

# Configure MLflow to use our tracking server
mlflow.set_tracking_uri("http://localhost:5000")

# Create or get the experiment
# All runs will be grouped under this experiment name
mlflow.set_experiment("fraud-detection")

def load_and_preprocess_data():
    """Load and preprocess the training and test data."""
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    # Encode categorical feature
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    # Prepare features
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    return X_train, y_train, X_test, y_test, encoder

def train_and_log_model(
    n_estimators: int = 100,
    max_depth: int = 10,
    min_samples_split: int = 2,
    min_samples_leaf: int = 1
):
    """
    Train a model and log everything to MLflow.
    
    Args:
        n_estimators: Number of trees in the forest
        max_depth: Maximum depth of each tree
        min_samples_split: Minimum samples required to split a node
        min_samples_leaf: Minimum samples required at a leaf node
    """
    X_train, y_train, X_test, y_test, encoder = load_and_preprocess_data()
    
    # Start an MLflow run - everything logged will be associated with this run
    with mlflow.start_run():
        # Add a descriptive run name
        run_name = f"rf_est{n_estimators}_depth{max_depth}_{datetime.now().strftime('%H%M%S')}"
        mlflow.set_tag("mlflow.runName", run_name)
        
        # Log all hyperparameters
        # These are the "knobs" we can tune
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_samples_split", min_samples_split)
        mlflow.log_param("min_samples_leaf", min_samples_leaf)
        mlflow.log_param("model_type", "RandomForestClassifier")
        
        # Log data information
        mlflow.log_param("train_samples", len(X_train))
        mlflow.log_param("test_samples", len(X_test))
        mlflow.log_param("fraud_ratio", float(y_train.mean()))
        mlflow.log_param("n_features", X_train.shape[1])
        
        # Train the model
        print(f"\nTraining model: n_estimators={n_estimators}, max_depth={max_depth}")
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X_train, y_train)
        
        # Evaluate and log metrics for BOTH train and test sets
        # This helps detect overfitting
        for dataset_name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
            y_pred = model.predict(X)
            y_prob = model.predict_proba(X)[:, 1]
            
            # Calculate all metrics
            accuracy = accuracy_score(y, y_pred)
            precision = precision_score(y, y_pred, zero_division=0)
            recall = recall_score(y, y_pred, zero_division=0)
            f1 = f1_score(y, y_pred, zero_division=0)
            roc_auc = roc_auc_score(y, y_prob)
            
            # Log metrics with dataset prefix
            mlflow.log_metric(f"{dataset_name}_accuracy", accuracy)
            mlflow.log_metric(f"{dataset_name}_precision", precision)
            mlflow.log_metric(f"{dataset_name}_recall", recall)
            mlflow.log_metric(f"{dataset_name}_f1", f1)
            mlflow.log_metric(f"{dataset_name}_roc_auc", roc_auc)
            
            print(f"  {dataset_name.upper()} - Accuracy: {accuracy:.4f}, F1: {f1:.4f}, ROC-AUC: {roc_auc:.4f}")
        
        # Log feature importance, taking the names from the training columns
        # so the labels always match the features the model was fitted on
        for feature, importance in zip(X_train.columns, model.feature_importances_):
            mlflow.log_metric(f"importance_{feature}", importance)
        
        # Log the model to MLflow AND register it in the Model Registry
        # This creates a new version of the model automatically
        print("\nRegistering model in MLflow Model Registry...")
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name="fraud-detection-model",
            input_example=X_train.iloc[:5]  # Example input for documentation
        )
        
        # Save and log the encoder as a separate artifact
        # We need this for inference
        with open("encoder.pkl", "wb") as f:
            pickle.dump(encoder, f)
        mlflow.log_artifact("encoder.pkl")
        
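        # Get the run and experiment IDs for reference
        # (the experiment ID is looked up rather than assumed to be 1)
        run_info = mlflow.active_run().info
        run_id = run_info.run_id
        print(f"\nMLflow Run ID: {run_id}")
        print(f"View this run: http://localhost:5000/#/experiments/{run_info.experiment_id}/runs/{run_id}")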
        
        return model, encoder

def run_experiment_sweep():
    """
    Run multiple experiments with different hyperparameters.
    
    This demonstrates how MLflow helps compare different configurations.
    """
    print("="*60)
    print("RUNNING HYPERPARAMETER EXPERIMENT SWEEP")
    print("="*60)
    
    # Define different configurations to try
    experiments = [
        {"n_estimators": 50, "max_depth": 5},
        {"n_estimators": 100, "max_depth": 10},
        {"n_estimators": 100, "max_depth": 15},
        {"n_estimators": 200, "max_depth": 10},
        {"n_estimators": 200, "max_depth": 20},
    ]
    
    for i, params in enumerate(experiments, 1):
        print(f"\n--- Experiment {i}/{len(experiments)} ---")
        train_and_log_model(**params)
    
    print("\n" + "="*60)
    print("EXPERIMENT SWEEP COMPLETE!")
    print("="*60)
    print("\nView all experiments at: http://localhost:5000")
    print("Compare runs to find the best hyperparameters!")

if __name__ == "__main__":
    run_experiment_sweep()
</code></pre>
<p>This script:</p>
<ol>
<li><p><strong>Connects to MLflow</strong>: <code>mlflow.set_tracking_uri("http://localhost:5000")</code></p>
</li>
<li><p><strong>Creates an experiment</strong>: <code>mlflow.set_experiment("fraud-detection")</code></p>
</li>
<li><p><strong>Logs parameters</strong>: All hyperparameters and data info</p>
</li>
<li><p><strong>Logs metrics</strong>: Accuracy, precision, recall, F1, ROC-AUC for both train and test sets</p>
</li>
<li><p><strong>Logs the model</strong>: Saves the trained model and the fitted encoder as artifacts</p>
</li>
<li><p><strong>Registers the model</strong>: Adds it to the Model Registry with automatic versioning</p>
</li>
</ol>
<p>Run the experiment sweep:</p>
<pre><code class="language-python">python src/train_mlflow.py
</code></pre>
<p>You'll see output for each experiment (abridged here to the first and last runs):</p>
<pre><code class="language-python">============================================================
RUNNING HYPERPARAMETER EXPERIMENT SWEEP
============================================================

--- Experiment 1/5 ---
Loading data...
Training model: n_estimators=50, max_depth=5
  TRAIN - Accuracy: 0.9821, F1: 0.6545, ROC-AUC: 0.9234
  TEST - Accuracy: 0.9795, F1: 0.5714, ROC-AUC: 0.8956

Registering model in MLflow Model Registry...
MLflow Run ID: abc123...

--- Experiment 5/5 ---
Training model: n_estimators=200, max_depth=20
  TRAIN - Accuracy: 0.9856, F1: 0.7123, ROC-AUC: 0.9567
  TEST - Accuracy: 0.9810, F1: 0.6667, ROC-AUC: 0.9234

============================================================
EXPERIMENT SWEEP COMPLETE!
============================================================
</code></pre>
<p>All 5 runs are now logged to MLflow with full metrics comparison available in the UI.</p>
<p>Now refresh the MLflow UI at <code>http://localhost:5000</code>. You'll see:</p>
<ol>
<li><p><strong>Experiments tab</strong>: Shows the "fraud-detection" experiment with 5 runs</p>
</li>
<li><p><strong>Each run</strong>: Shows parameters, metrics, and artifacts</p>
</li>
<li><p><strong>Compare</strong>: You can select multiple runs and compare them side-by-side</p>
</li>
<li><p><strong>Models tab</strong>: Shows "fraud-detection-model" with 5 versions</p>
</li>
</ol>
<p><strong>MLflow Tracking UI: Compare runs, metrics, and models at a glance</strong></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771396202929/c5a7d547-31b6-4783-acea-f4e9433d81ef.png" alt="Screenshot of the MLflow Tracking UI used to compare runs, metrics, and models" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">
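<p>The same comparison can be scripted instead of clicked through. The sketch below is an illustration only, not part of the tutorial's code: <code>rank_runs</code> is a hypothetical helper that sorts run records by a test metric and flags a large train/test gap. It assumes each record is a dict with <code>metrics.*</code> and <code>params.*</code> keys (the column naming of the DataFrame returned by <code>mlflow.search_runs()</code>) and that the hyperparameters were logged as <code>n_estimators</code> and <code>max_depth</code>. The two sample records reuse the numbers from the output above, and the 0.10 gap threshold is arbitrary.</p>

```python
def rank_runs(runs, metric="f1", max_gap=0.10):
    """Sort run records by test metric (best first) and flag large train/test gaps."""
    train_key = f"metrics.train_{metric}"
    test_key = f"metrics.test_{metric}"
    ranked = []
    for run in runs:
        gap = run[train_key] - run[test_key]
        ranked.append({**run, "gap": round(gap, 4), "overfit": gap > max_gap})
    return sorted(ranked, key=lambda r: r[test_key], reverse=True)


# Two records copied from the sample output above (experiments 1 and 5).
# MLflow returns parameter values as strings.
sample_runs = [
    {"params.n_estimators": "50", "params.max_depth": "5",
     "metrics.train_f1": 0.6545, "metrics.test_f1": 0.5714},
    {"params.n_estimators": "200", "params.max_depth": "20",
     "metrics.train_f1": 0.7123, "metrics.test_f1": 0.6667},
]

for run in rank_runs(sample_runs):
    print(run["params.n_estimators"], run["params.max_depth"],
          run["metrics.test_f1"], run["gap"], run["overfit"])

# With a live tracking server, the records could come from MLflow instead
# (not executed here):
#   import mlflow
#   mlflow.set_tracking_uri("http://localhost:5000")
#   df = mlflow.search_runs(experiment_names=["fraud-detection"])
#   runs = df.to_dict("records")
```

<p>Ranking on the test metric while keeping an eye on the train/test gap is one reasonable way to shortlist a run; the UI's side-by-side view remains the better place to inspect details.</p>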

<h3 id="heading-33-how-to-use-the-model-registry"><strong>3.3 How to Use the Model Registry</strong></h3>
<p>The <strong>Model Registry</strong> provides a central hub for managing model versions and deciding which version serves production traffic.</p>
<p>In the MLflow UI:</p>
<ol>
<li><p>Click the <strong>"Models"</strong> tab in the top navigation</p>
</li>
<li><p>Click <strong>"fraud-detection-model"</strong></p>
</li>
<li><p>You'll see all 5 versions listed with their metrics</p>
</li>
</ol>
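<p><strong>Model Aliases:</strong> Recent MLflow releases use <strong>aliases</strong> in place of the older fixed stages (stages are deprecated as of MLflow 2.9). If you've seen older tutorials that move models between "Staging" and "Production", aliases are the newer, more flexible replacement: an alias is a named pointer to one model version.</p>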
<ul>
<li><p><strong>@champion</strong>: The production model serving live traffic</p>
</li>
<li><p><strong>@challenger</strong>: Candidate model being tested</p>
</li>
<li><p>You can also create custom aliases such as @baseline or @canary (the name <code>latest</code> is reserved by MLflow).</p>
</li>
</ul>
<p><strong>Assign an alias:</strong></p>
<ol>
<li><p>Open MLflow UI → Models → fraud-detection-model</p>
</li>
<li><p>Click on the version you want to promote</p>
</li>
<li><p>Click <strong>"Add Alias"</strong></p>
</li>
<li><p>Enter <code>champion</code> and save</p>
</li>
</ol>
<p>Now you've assigned the <code>@champion</code> alias to your best model. Your API will load whichever version carries this alias at startup, so a rollback amounts to moving the alias to a different version and restarting the API.</p>
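<p>The alias can also be moved from code rather than from the UI. The following is a minimal sketch under a few assumptions: <code>promote</code> and <code>model_uri</code> are hypothetical helper names, and the <code>client</code> argument is expected to behave like <code>mlflow.MlflowClient</code>, of which only <code>get_model_version_by_alias()</code> and <code>set_registered_model_alias()</code> are used. The calls against a real server are left as comments because they need the tracking server from this tutorial to be running.</p>

```python
def promote(client, model_name, version, alias="champion"):
    """Point `alias` at `version` and return the version it replaced (or None).

    `client` is expected to behave like mlflow.MlflowClient.
    """
    try:
        previous = client.get_model_version_by_alias(model_name, alias).version
    except Exception:  # raised when the alias has not been assigned yet
        previous = None
    client.set_registered_model_alias(model_name, alias, version)
    return previous


def model_uri(model_name, alias="champion"):
    """Build the registry URI used to load a model by alias."""
    return f"models:/{model_name}@{alias}"


print(model_uri("fraud-detection-model"))

# Against a running tracking server (not executed here):
#   from mlflow import MlflowClient
#   client = MlflowClient(tracking_uri="http://localhost:5000")
#   old = promote(client, "fraud-detection-model", version="5")
#   # Rolling back is the same call with the previous version:
#   promote(client, "fraud-detection-model", version=old)
```

<p>Returning the previous version makes the rollback target explicit, which is handy if the promotion runs inside a script or CI job.</p>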
<p><strong>Figure 3: MLflow Model Lifecycle — From Training to Production</strong></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771396081377/da67d89f-b82d-4189-8150-ecc142ed198a.png" alt="Diagram showing the MLflow model lifecycle for a fraud detection system: a model is trained with experiment parameters, logged to MLflow tracking with metrics and artifacts, registered in the model registry as multiple versions, assigned aliases such as champion and challenger, and served in production by loading the model through the champion alias. The diagram also shows rollback by moving the alias to an earlier version and restarting the API." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-34-update-api-to-load-from-registry"><strong>3.4 Update API to Load from Registry</strong></h3>
<p>Now let's update our API to load the champion model from the MLflow Registry instead of a pickle file. Create <code>src/serve_mlflow.py</code>:</p>
<pre><code class="language-python"># src/serve_mlflow.py
"""
Serve fraud detection model from MLflow Model Registry.

This version loads the @champion model from MLflow, which means:
- Serves whichever version holds the @champion alias at startup
- Can roll back by moving the @champion alias and restarting
- No manual file copying needed
"""
import mlflow
import mlflow.sklearn
import pickle
import os
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")

print("Loading model from MLflow Model Registry...")

# Load the champion model from the registry
# This automatically gets whichever version has the @champion alias
try:
    model = mlflow.sklearn.load_model("models:/fraud-detection-model@champion")
    print("Successfully loaded champion model from MLflow!")
except Exception as e:
    print(f"Error loading from MLflow: {e}")
    print("Make sure you've assigned the @champion alias to a model in the MLflow UI")
    raise

# Load the encoder from the local file written by train_mlflow.py
# (it is also logged as an MLflow artifact; in a real system, load that versioned copy)
with open("encoder.pkl", "rb") as f:
    encoder = pickle.load(f)
print("Encoder loaded successfully!")

app = FastAPI(
    title="Fraud Detection API (MLflow)",
    description="""
    Fraud detection API that loads models from MLflow Model Registry.
    
    This version always serves the model with the @champion alias.
    To update the model:
    1. Train a new model with train_mlflow.py
    2. Compare metrics in MLflow UI
    3. Assign the @champion alias to the best version
    4. Restart this API
    
    To roll back: Move the @champion alias to a previous version in the MLflow UI, then restart this API.
    """,
    version="2.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount in dollars", example=150.00)
    hour: int = Field(..., description="Hour of the day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Monday, 6=Sunday)", example=3)
    merchant_category: str = Field(..., description="Type of merchant", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    model_source: str = "MLflow Production"

@app.post("/predict", response_model=PredictionResponse)
def predict(tx: Transaction):
    """Predict whether a transaction is fraudulent using the champion model."""
    data = tx.dict()
    
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        data["merchant_encoded"] = 0
    
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        model_source="MLflow Production"
    )

@app.get("/health")
def health():
    return {"status": "healthy", "model_source": "MLflow Registry"}

@app.get("/model-info")
def model_info():
    """Get information about the currently loaded model."""
    return {
        "registry": "MLflow",
        "model_name": "fraud-detection-model",
        "alias": "champion",
        "tracking_uri": "http://localhost:5000"
    }
</code></pre>
<p>Stop your old API (Ctrl+C) and start this new one:</p>
<pre><code class="language-python">uvicorn src.serve_mlflow:app --reload --host 0.0.0.0 --port 8000
</code></pre>
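<p>One gap in the script above is that the model comes from the registry while the encoder comes from a local <code>encoder.pkl</code>, so the two can drift apart after a rollback. A possible way to keep them in step is to fetch the encoder from the run that produced the champion version. This is a hedged sketch: <code>load_champion_encoder</code> is a hypothetical helper, and it assumes the encoder was logged with <code>mlflow.log_artifact("encoder.pkl")</code> as in <code>train_mlflow.py</code>. The client and download function are passed in so the helper can be exercised without a server.</p>

```python
import pickle


def load_champion_encoder(client, download, model_name="fraud-detection-model",
                          alias="champion", artifact="encoder.pkl"):
    """Load the encoder logged by the run behind the aliased model version.

    client   -- object with get_model_version_by_alias(), e.g. mlflow.MlflowClient
    download -- callable(run_id=..., artifact_path=...) returning a local path,
                e.g. mlflow.artifacts.download_artifacts
    Returns (encoder, version) so the API can report which version it serves.
    """
    version = client.get_model_version_by_alias(model_name, alias)
    local_path = download(run_id=version.run_id, artifact_path=artifact)
    with open(local_path, "rb") as f:
        return pickle.load(f), version.version


# Wiring it into serve_mlflow.py might look like this (not executed here):
#   import mlflow
#   from mlflow import MlflowClient
#   encoder, version = load_champion_encoder(
#       MlflowClient(), mlflow.artifacts.download_artifacts)
```

<p>The returned version number could also be exposed through the <code>/model-info</code> endpoint, which currently reports only static values.</p>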
<p>Now deploying a new model is a <strong>controlled, auditable process</strong>:</p>
<ol>
<li><p><strong>Train new model</strong> → Automatically registered as new version</p>
</li>
<li><p><strong>Compare metrics</strong> → Use the MLflow UI to compare with the current champion</p>
</li>
<li><p><strong>Set as champion</strong> → Assign the @champion alias in the MLflow UI</p>
</li>
<li><p><strong>Restart API</strong> → Loads the new champion model</p>
</li>
<li><p><strong>Roll back if needed</strong> → Move the @champion alias to the previous version and restart the API</p>
</li>
</ol>
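<p>The "compare metrics" step can be made explicit with a small promotion gate. The sketch below is one possible rule, not something MLflow provides: <code>should_promote</code> is a hypothetical function, the metric names follow the <code>test_f1</code> and <code>test_recall</code> keys logged by the training script, and every threshold and the recall values in the example are made up for illustration.</p>

```python
def should_promote(champion, challenger, metric="test_f1", min_gain=0.01,
                   guard="test_recall", max_drop=0.02):
    """Decide whether the challenger should take over the champion alias.

    Promote only if the primary metric improves by at least min_gain and the
    guard metric does not fall by more than max_drop. The defaults are
    illustrative, not recommendations.
    """
    gain = challenger[metric] - champion[metric]
    drop = champion[guard] - challenger[guard]
    return gain >= min_gain and drop <= max_drop


# F1 values from the sample output above; recall values are invented.
champion = {"test_f1": 0.5714, "test_recall": 0.50}
challenger = {"test_f1": 0.6667, "test_recall": 0.55}

print(should_promote(champion, challenger))
```

<p>Writing the rule down, even a simple one, keeps promotions consistent between team members and leaves a reviewable record of why a version became the champion.</p>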
<p><strong>Checkpoint:</strong></p>
<ul>
<li><p>MLflow UI (<code>http://localhost:5000</code>) should show the "fraud-detection" experiment with 5 runs</p>
</li>
<li><p>The "Models" tab should show "fraud-detection-model" with 5 versions</p>
</li>
<li><p>One version should have the @champion alias</p>
</li>
<li><p>The API should load and serve the @champion model</p>
</li>
</ul>
<h2 id="heading-4-ensure-feature-consistency-with-feast"><strong>4. Ensure Feature Consistency with Feast</strong></h2>
<p>⚠️ <strong>First time hearing about feature stores?</strong> Don't worry.<br>You don't need to master every Feast detail on the first read.<br>Focus on <em>why</em> feature consistency matters — you can revisit the implementation later.<br><strong>Key takeaway:</strong> Training and serving must compute features the same way, or your model silently fails.</p>
<p><strong>What breaks without this:</strong> Your model sees different feature values in production than it saw during training. Accuracy drops silently. This is called "training-serving skew" and it's one of the most common causes of ML system failures.</p>
<p>The skew appears whenever data transformations at training time differ from those at inference time, and even small discrepancies can severely degrade performance.</p>
<p><strong>Why This Matters:</strong> Imagine you're computing "average transaction amount per merchant category" as a feature. During training, you compute it using pandas in a notebook. During serving, you compute it using SQL in a different system. Small differences in how these computations handle edge cases (nulls, rounding, time windows) cause the model to see different features in production than it was trained on.</p>
<p>The result? <strong>Silent failures</strong> where accuracy drops but nothing errors out. Your model is making predictions based on features it's never seen before, and you have no idea.</p>
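<p>To make the problem concrete, here is a toy illustration in plain Python. The two functions are hypothetical stand-ins for a notebook pipeline and a serving pipeline that disagree on one edge case: how to treat a missing amount. The first skips missing values, which is what pandas' <code>mean()</code> does by default; the second treats them as zero, as a SQL query with <code>COALESCE(amount, 0)</code> would.</p>

```python
def avg_amount_training(amounts):
    """Notebook-style mean: missing values are skipped."""
    present = [a for a in amounts if a is not None]
    return sum(present) / len(present)


def avg_amount_serving(amounts):
    """Serving-style mean: missing values were replaced by zero upstream."""
    filled = [0.0 if a is None else a for a in amounts]
    return sum(filled) / len(filled)


amounts = [120.0, 80.0, None, 100.0]  # one transaction has a missing amount

train_value = avg_amount_training(amounts)  # (120 + 80 + 100) / 3
serve_value = avg_amount_serving(amounts)   # (120 + 80 + 0 + 100) / 4

print(f"training sees {train_value:.2f}, serving sees {serve_value:.2f}")
```

<p>Neither function raises an error, yet the model would be trained on 100.00 and served 75.00 for the same merchant. That is exactly the kind of silent mismatch a shared feature definition is meant to prevent.</p>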
<p>In our naive implementation, we did handle one simple case: we saved the <code>LabelEncoder</code> to ensure <code>merchant_category</code> is encoded the same way in training and serving. But imagine if we had more complex feature engineering:</p>
<ul>
<li><p>Rolling averages over time windows</p>
</li>
<li><p>User-level aggregations</p>
</li>
<li><p>Cross-feature interactions</p>
</li>
<li><p>Real-time features from streaming data</p>
</li>
</ul>
<p>Keeping all of this consistent by hand quickly becomes error-prone.</p>
<h3 id="heading-41-what-is-feast-and-why-use-it"><strong>4.1 What is Feast and Why Use It?</strong></h3>
<p>In production ML platforms, teams use a <strong>feature store</strong> to guarantee feature consistency between training and serving. <strong>Feast</strong> is one popular open-source option.</p>
<p>In this tutorial, we use Feast not because you <em>must</em>, but because it makes the training-serving contract explicit and teachable. The principles apply whether you use Feast, Tecton, Featureform, or a custom solution.</p>
<p>Feast provides:</p>
<table>
<thead>
<tr>
<th><strong>Capability</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Single source of truth</strong></td>
<td>Define features once, use everywhere</td>
</tr>
<tr>
<td><strong>Offline/online consistency</strong></td>
<td>Same features for training and serving</td>
</tr>
<tr>
<td><strong>Point-in-time correctness</strong></td>
<td>Prevents data leakage in training</td>
</tr>
<tr>
<td><strong>Low-latency serving</strong></td>
<td>Millisecond feature retrieval</td>
</tr>
<tr>
<td><strong>Feature versioning</strong></td>
<td>Track changes to feature definitions</td>
</tr>
</tbody></table>
<p><strong>How Feast works:</strong></p>
<ol>
<li><p><strong>Define features</strong> in Python code (feature definitions)</p>
</li>
<li><p><strong>Materialize features</strong> from your data sources to the online store</p>
</li>
<li><p><strong>Retrieve features</strong> through the same feature definitions for both training (offline store) and serving (online store)</p>
</li>
</ol>
<p>This ensures that training and serving read values produced by <strong>exactly the same feature definitions</strong>, rather than by two separately maintained pipelines.</p>
<h3 id="heading-42-install-and-initialize-feast"><strong>4.2 Install and Initialize Feast</strong></h3>
<p>We already installed Feast via requirements.txt. Now let's initialize a feature repository.</p>
<pre><code class="language-python"># Navigate to the feature_repo directory
cd feature_repo

# Initialize Feast (this creates template files)
feast init . --minimal

# Go back to project root
cd ..
</code></pre>
<p>This creates the basic Feast structure:</p>
<pre><code class="language-python">feature_repo/
├── feature_store.yaml    # Feast configuration
└── __init__.py
</code></pre>
<h3 id="heading-43-define-feature-definitions"><strong>4.3 Define Feature Definitions</strong></h3>
<p>First, let's create the Feast configuration file:</p>
<pre><code class="language-python"># feature_repo/feature_store.yaml
project: fraud_detection
registry: ../data/registry.db
provider: local
online_store:
  type: sqlite
  path: ../data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 3
</code></pre>
<p>This configuration:</p>
<ul>
<li><p>Names our project "fraud_detection"</p>
</li>
<li><p>Uses SQLite for the online store (for production, you'd use Redis or DynamoDB)</p>
</li>
<li><p>Uses local files for the offline store (for production, you'd use BigQuery or Snowflake)</p>
</li>
</ul>
<p>Now create the feature definitions:</p>
<pre><code class="language-python"># feature_repo/features.py
"""
Feast feature definitions for fraud detection.

This file defines:
- Entities: The keys we use to look up features (merchant_category)
- Data Sources: Where the raw feature data comes from (Parquet file)
- Feature Views: The features themselves and their schemas

The key insight: These definitions are the SINGLE SOURCE OF TRUTH.
Both training and serving use these exact definitions.
"""
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# =============================================================================
# ENTITIES
# =============================================================================
# An entity is the "key" we use to look up features.
# For merchant-level features, the entity is merchant_category.

merchant = Entity(
    name="merchant_category",
    description="Merchant category for the transaction (for example, 'online', 'grocery')",
    value_type=ValueType.STRING,
)

# =============================================================================
# DATA SOURCES
# =============================================================================
# Data sources tell Feast where to find the raw feature data.
# For local development, we use a Parquet file.
# For production, this could be BigQuery, Snowflake, S3, etc.

merchant_stats_source = FileSource(
    name="merchant_stats_source",
    path="../data/merchant_features.parquet",  # We'll create this file
    timestamp_field="event_timestamp",       # Required for point-in-time joins
)

# =============================================================================
# FEATURE VIEWS
# =============================================================================
# A Feature View defines a group of related features.
# It specifies:
# - Which entity the features are for
# - The schema (names and types of features)
# - Where the data comes from
# - How long features are valid (TTL)

merchant_stats_fv = FeatureView(
    name="merchant_stats",
    description="Aggregated statistics per merchant category",
    entities=[merchant],
    ttl=timedelta(days=7),  # Features are valid for 7 days
    schema=[
        Field(name="avg_amount", dtype=Float32, description="Average transaction amount"),
        Field(name="transaction_count", dtype=Int64, description="Number of transactions"),
        Field(name="fraud_rate", dtype=Float32, description="Historical fraud rate"),
    ],
    source=merchant_stats_source,
    online=True,  # Enable online serving (low-latency retrieval)
)
</code></pre>
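<p>The <code>timestamp_field</code> and <code>ttl</code> settings above control which feature row a training example is allowed to see. The sketch below mimics that rule in plain Python as an aid to intuition; it is a simplified model, not Feast's implementation, and <code>point_in_time_lookup</code> and the feature history are hypothetical. A row qualifies only if its timestamp is not later than the event and not older than the TTL.</p>

```python
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, event_time, ttl):
    """Return the newest feature value known at event_time, or None.

    feature_rows -- list of (timestamp, value) pairs for one entity
    """
    candidates = [
        (ts, value) for ts, value in feature_rows
        if event_time - ttl <= ts <= event_time
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


# Hypothetical history of the fraud_rate feature for one merchant category.
history = [
    (datetime(2024, 1, 1), 0.040),
    (datetime(2024, 1, 8), 0.045),
    (datetime(2024, 1, 15), 0.049),
]
ttl = timedelta(days=7)

print(point_in_time_lookup(history, datetime(2024, 1, 10), ttl))   # row from Jan 8
print(point_in_time_lookup(history, datetime(2024, 1, 30), ttl))   # nothing fresh enough
print(point_in_time_lookup(history, datetime(2023, 12, 25), ttl))  # no data yet
```

<p>The Jan 10 event never sees the Jan 15 value, which is what "preventing data leakage" means in practice, and the Jan 30 event gets nothing because the newest row is older than the 7-day TTL.</p>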
<h3 id="heading-44-materialize-features-to-online-store"><strong>4.4 Materialize Features to Online Store</strong></h3>
<p>Now we need to:</p>
<ol>
<li><p>Compute the features from our training data</p>
</li>
<li><p>Save them in a format Feast can read</p>
</li>
<li><p>Apply the Feast definitions</p>
</li>
<li><p>Materialize features to the online store</p>
</li>
</ol>
<p>Create <code>src/prepare_feast_features.py</code>:</p>
<pre><code class="language-python"># src/prepare_feast_features.py
"""
Prepare feature data for Feast.

This script:
1. Computes aggregated merchant features from training data
2. Saves them in Parquet format (Feast's offline store format)
3. Applies Feast feature definitions
4. Materializes features to the online store for low-latency serving

Run this whenever your training data changes or you want to refresh features.
"""
import pandas as pd
import numpy as np
from datetime import datetime
import subprocess
import os

def compute_merchant_features(df: pd.DataFrame) -&gt; pd.DataFrame:
    """
    Compute aggregated features by merchant category.
    
    THIS IS THE SINGLE SOURCE OF TRUTH FOR FEATURE COMPUTATION.
    
    Both training and serving will use features computed by this exact logic.
    Any change here automatically applies everywhere.
    
    Args:
        df: Transaction DataFrame with columns: amount, merchant_category, is_fraud
        
    Returns:
        DataFrame with computed features per merchant category
    """
    print("Computing merchant-level features...")
    
    # Group by merchant category and compute aggregates
    stats = df.groupby('merchant_category').agg({
        'amount': ['mean', 'count'],
        'is_fraud': 'mean'
    }).reset_index()
    
    # Flatten column names
    stats.columns = ['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']
    
    # Add timestamp for Feast (required for point-in-time correct joins)
    stats['event_timestamp'] = datetime.now()
    
    # Convert types to match Feast schema
    stats['avg_amount'] = stats['avg_amount'].astype('float32')
    stats['transaction_count'] = stats['transaction_count'].astype('int64')
    stats['fraud_rate'] = stats['fraud_rate'].astype('float32')
    
    return stats

def main():
    print("="*60)
    print("FEAST FEATURE PREPARATION")
    print("="*60)
    
    # Load training data
    print("\n1. Loading training data...")
    train_df = pd.read_csv('data/train.csv')
    print(f"   Loaded {len(train_df):,} transactions")
    
    # Compute merchant features
    print("\n2. Computing merchant features...")
    merchant_features = compute_merchant_features(train_df)
    
    print("\n   Computed features:")
    print(merchant_features.to_string(index=False))
    
    # Save as Parquet (required format for Feast file source)
    print("\n3. Saving features to Parquet...")
    os.makedirs('data', exist_ok=True)
    output_path = 'data/merchant_features.parquet'
    merchant_features.to_parquet(output_path, index=False)
    print(f"   Saved to {output_path}")
    
    # Apply Feast feature definitions
    print("\n4. Applying Feast feature definitions...")
    try:
        result = subprocess.run(
            ['feast', 'apply'],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Feature definitions applied successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error applying Feast: {e.stderr}")
        raise
    
    # Materialize features to online store
    print("\n5. Materializing features to online store...")
    try:
        result = subprocess.run(
            ['feast', 'materialize-incremental', datetime.now().isoformat()],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Features materialized successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error materializing: {e.stderr}")
        raise
    
    print("\n" + "="*60)
    print("FEAST FEATURE PREPARATION COMPLETE!")
    print("="*60)
    print("\nYou can now:")
    print("  - Retrieve features for training: get_training_features()")
    print("  - Retrieve features for serving: get_online_features()")
    print("  - List registered feature views: feast feature-views list")

if __name__ == "__main__":
    main()
</code></pre>
<p>Run the feature preparation:</p>
<pre><code class="language-python">python src/prepare_feast_features.py
</code></pre>
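<p>You should see output along these lines (condensed here for readability; the script prints the full feature table):</p>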
<pre><code class="language-python">============================================================
FEAST FEATURE PREPARATION
============================================================

1. Loading training data... 8,000 transactions
2. Computing merchant features...
   grocery: avg=$31.24, fraud_rate=0.85%
   online: avg=$98.45, fraud_rate=4.87%
   restaurant: avg=$28.12, fraud_rate=0.50%
   retail: avg=$45.67, fraud_rate=1.02%
   travel: avg=$156.23, fraud_rate=4.18%
3. Saving to data/merchant_features.parquet ✓
4. Applying Feast definitions... ✓
5. Materializing to online store... ✓

FEAST FEATURE PREPARATION COMPLETE!
</code></pre>
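<p>If the pandas <code>groupby</code> in <code>compute_merchant_features</code> feels opaque, it can help to work through the same aggregation on a handful of rows. The following is a plain-Python sketch of that computation; <code>merchant_stats</code> is a hypothetical helper and the five transactions are made up so the results can be checked by hand.</p>

```python
from collections import defaultdict


def merchant_stats(transactions):
    """Per-category mean amount, transaction count, and fraud rate."""
    groups = defaultdict(list)
    for tx in transactions:
        groups[tx["merchant_category"]].append(tx)

    stats = {}
    for category, rows in sorted(groups.items()):
        stats[category] = {
            "avg_amount": sum(r["amount"] for r in rows) / len(rows),
            "transaction_count": len(rows),
            "fraud_rate": sum(r["is_fraud"] for r in rows) / len(rows),
        }
    return stats


# Five made-up transactions, small enough to verify by hand.
toy = [
    {"merchant_category": "online", "amount": 100.0, "is_fraud": 1},
    {"merchant_category": "online", "amount": 50.0, "is_fraud": 0},
    {"merchant_category": "grocery", "amount": 30.0, "is_fraud": 0},
    {"merchant_category": "grocery", "amount": 20.0, "is_fraud": 0},
    {"merchant_category": "grocery", "amount": 40.0, "is_fraud": 0},
]

for category, row in merchant_stats(toy).items():
    print(category, row)
```

<p>Here <code>online</code> averages 75.0 over 2 transactions with a fraud rate of 0.5, and <code>grocery</code> averages 30.0 over 3 transactions with a fraud rate of 0.0. These are the same three columns that end up in <code>merchant_features.parquet</code>.</p>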
<h3 id="heading-45-retrieve-features-for-training-and-serving"><strong>4.5 Retrieve Features for Training and Serving</strong></h3>
<p>Now let's create utilities to retrieve features consistently for both training and serving:</p>
<pre><code class="language-python"># src/feast_features.py
"""
Feast feature retrieval for training and serving.

This module provides functions to retrieve features from Feast:
- get_training_features(): For offline training (historical features)
- get_online_features(): For real-time serving (low-latency)

IMPORTANT: Both functions use the SAME feature definitions,
ensuring consistency between training and serving.
"""
import pandas as pd
from feast import FeatureStore
from datetime import datetime

# Initialize Feast store (points to our feature_repo)
store = FeatureStore(repo_path="feature_repo")

def get_training_features(df: pd.DataFrame) -&gt; pd.DataFrame:
    """
    Get features for training using Feast's offline store.
    
    Feast performs a point-in-time correct join: for each row it returns the
    feature values known as of that row's event_timestamp, never later ones,
    which prevents accidentally training on future data.
    
    Args:
        df: DataFrame with at least 'merchant_category' column
        
    Returns:
        DataFrame with original columns plus Feast features
    """
    print("Retrieving training features from Feast offline store...")
    
    # Prepare entity dataframe with timestamps
    # Each row needs: entity key(s) + event_timestamp
    entity_df = df[['merchant_category']].copy()
    entity_df['event_timestamp'] = datetime.now()  # See note below
    entity_df = entity_df.drop_duplicates()
    
    # ⚠️ Simplification: For clarity, we use the current timestamp here.
    # In real systems, this would be the actual event time of each transaction.
    
    # Retrieve historical features
    # Feast handles the point-in-time join automatically
    training_data = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
    ).to_df()
    
    # Merge features back with original dataframe
    result = df.merge(
        training_data[['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']],
        on='merchant_category',
        how='left'
    )
    
    print(f"Retrieved features for {len(entity_df)} unique merchants")
    return result

def get_online_features(merchant_category: str) -&gt; dict:
    """
    Get features for real-time serving using Feast's online store.
    
    This is optimized for low-latency retrieval (milliseconds).
    Use this in your prediction API for real-time inference.
    
    Args:
        merchant_category: The merchant category to look up
        
    Returns:
        Dictionary with feature names and values
    """
    # Retrieve from online store (low-latency)
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": merchant_category}],
    ).to_dict()
    
    # Format the response
    return {
        'merchant_avg_amount': feature_vector['avg_amount'][0],
        'merchant_tx_count': feature_vector['transaction_count'][0],
        'merchant_fraud_rate': feature_vector['fraud_rate'][0],
    }

def get_online_features_batch(merchant_categories: list) -&gt; pd.DataFrame:
    """
    Get features for multiple merchants at once (batch serving).
    
    More efficient than calling get_online_features() in a loop.
    
    Args:
        merchant_categories: List of merchant categories to look up
        
    Returns:
        DataFrame with features for each merchant
    """
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": mc} for mc in merchant_categories],
    ).to_df()
    
    return feature_vector

if __name__ == "__main__":
    # Test the feature retrieval functions
    print("="*60)
    print("TESTING FEAST FEATURE RETRIEVAL")
    print("="*60)
    
    # Test offline retrieval (for training)
    print("\n1. Testing OFFLINE feature retrieval (for training)...")
    train_df = pd.read_csv('data/train.csv').head(10)
    enriched = get_training_features(train_df)
    print("\n   Sample enriched training data:")
    print(enriched[['amount', 'merchant_category', 'avg_amount', 'fraud_rate']].head())
    
    # Test online retrieval (for serving)
    print("\n2. Testing ONLINE feature retrieval (for serving)...")
    for category in ['online', 'grocery', 'travel', 'restaurant', 'retail']:
        features = get_online_features(category)
        print(f"   {category}: avg_amount=${features['merchant_avg_amount']:.2f}, "
              f"fraud_rate={features['merchant_fraud_rate']:.2%}")
    
    # Test batch retrieval
    print("\n3. Testing BATCH online retrieval...")
    batch_features = get_online_features_batch(['online', 'grocery', 'travel'])
    print(batch_features)
    
    print("\n" + "="*60)
    print("FEAST FEATURE RETRIEVAL TEST COMPLETE!")
    print("="*60)
</code></pre>
<p>Test the feature retrieval:</p>
<pre><code class="language-python">python src/feast_features.py
</code></pre>
<p>You should see output similar to this (exact values depend on your data):</p>
<pre><code class="language-python">============================================================
TESTING FEAST FEATURE RETRIEVAL
============================================================

1. Testing OFFLINE feature retrieval (for training)...
Retrieving training features from Feast offline store...
Retrieved features for 5 unique merchants

   Sample enriched training data:
   amount merchant_category  avg_amount  fraud_rate
    45.23           grocery       31.24      0.0085
   123.45            online       98.45      0.0487
    ...

2. Testing ONLINE feature retrieval (for serving)...
   online: avg_amount=$98.45, fraud_rate=4.87%
   grocery: avg_amount=$31.24, fraud_rate=0.85%
   travel: avg_amount=$156.23, fraud_rate=4.18%
   restaurant: avg_amount=$28.12, fraud_rate=0.50%
   retail: avg_amount=$45.67, fraud_rate=1.02%

3. Testing BATCH online retrieval...
  merchant_category  avg_amount  transaction_count  fraud_rate
               online       98.45               1234      0.0487
              grocery       31.24               2345      0.0085
               travel      156.23                478      0.0418
</code></pre>
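<p>One case the test above does not cover is a merchant category that was never materialized. In that situation Feast's online lookup typically returns <code>None</code> for each feature, and formatting or feeding <code>None</code> into the model would fail. A minimal sketch of one way to handle it follows; <code>fill_missing</code> and the default values are assumptions for illustration, and the right fallbacks depend on your model.</p>

```python
def fill_missing(features, defaults):
    """Replace None feature values with defaults and report which were filled."""
    filled, missing = {}, []
    for name, value in features.items():
        if value is None:
            filled[name] = defaults[name]
            missing.append(name)
        else:
            filled[name] = value
    return filled, missing


# Illustrative fallbacks; sensible values depend on the model and the data.
DEFAULTS = {
    "merchant_avg_amount": 0.0,
    "merchant_tx_count": 0,
    "merchant_fraud_rate": 0.0,
}

# Shape of what get_online_features() might return for a category that was
# never materialized (the category is hypothetical): every value is None.
unknown = {
    "merchant_avg_amount": None,
    "merchant_tx_count": None,
    "merchant_fraud_rate": None,
}

features, missing = fill_missing(unknown, DEFAULTS)
print(features)
print("filled:", missing)
```

<p>Returning the list of filled names lets the API log or count these fallbacks, so a sudden rise in unknown categories shows up instead of passing silently.</p>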
<h3 id="heading-why-feast-over-custom-code"><strong>Why Feast Over Custom Code?</strong></h3>
<table>
<thead>
<tr>
<th><strong>Aspect</strong></th>
<th><strong>Custom Code</strong></th>
<th><strong>Feast</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Consistency</strong></td>
<td>Manual effort to keep in sync</td>
<td>Automatic - same definitions everywhere</td>
</tr>
<tr>
<td><strong>Point-in-time correctness</strong></td>
<td>Must implement yourself</td>
<td>Built-in</td>
</tr>
<tr>
<td><strong>Online serving</strong></td>
<td>Must build your own cache</td>
<td>Built-in online store</td>
</tr>
<tr>
<td><strong>Feature versioning</strong></td>
<td>Not supported</td>
<td>Built-in</td>
</tr>
<tr>
<td><strong>Scalability</strong></td>
<td>Limited</td>
<td>Production-ready (BigQuery, Redis, etc.)</td>
</tr>
<tr>
<td><strong>Team collaboration</strong></td>
<td>Difficult</td>
<td>Feature registry with documentation</td>
</tr>
<tr>
<td><strong>Monitoring</strong></td>
<td>Manual</td>
<td>Registry metadata plus optional data-quality validation; drift monitoring needs a separate tool</td>
</tr>
</tbody></table>
<p>💡 <strong>Mental Model</strong>: Treat feature definitions like database schemas.<br>You wouldn't compute a column one way in your application and a different way in your reports. Features deserve the same discipline — define once, use everywhere.</p>
<p><strong>Checkpoint:</strong> After running <code>prepare_feast_features.py</code>, you should have:</p>
<ul>
<li><p><code>data/merchant_features.parquet</code> (computed features)</p>
</li>
<li><p><code>data/registry.db</code> (Feast registry)</p>
</li>
<li><p><code>data/online_store.db</code> (SQLite online store)</p>
</li>
</ul>
<p>Running <code>python src/feast_features.py</code> should successfully retrieve features for all merchant categories.</p>
<h2 id="heading-5-add-data-validation-with-great-expectations"><strong>5. Add Data Validation with Great Expectations</strong></h2>
<p><strong>What breaks without this:</strong> Your API accepts garbage input (negative amounts, invalid hours) and returns meaningless predictions. Worse, you have no idea it happened.</p>
<p>Recall that our API currently trusts input blindly. We saw how garbage data produces a prediction with no warning. <strong>Great Expectations</strong> is an open-source tool for data quality testing – defining rules (expectations) and testing data against them.</p>
<p><strong>Why This Matters:</strong> Data validation acts as a gatekeeper. Bad data is rejected <strong>before</strong> it can harm predictions. As the saying goes, "Garbage in, garbage out" – feeding unreliable data yields unreliable results. With validation, we transform this to "Garbage in, <strong>error out</strong>" – much better for debugging and reliability.</p>
<h3 id="heading-51-define-expectations"><strong>5.1 Define Expectations</strong></h3>
<p>What are reasonable expectations for our transaction data? Based on domain knowledge:</p>
<table>
<thead>
<tr>
<th><strong>Field</strong></th>
<th><strong>Expectation</strong></th>
<th><strong>Reason</strong></th>
</tr>
</thead>
<tbody><tr>
<td><code>amount</code></td>
<td>Positive (&gt; 0)</td>
<td>Negative transactions don't make sense</td>
</tr>
<tr>
<td><code>amount</code></td>
<td>At most $50,000</td>
<td>Extremely large amounts are outliers/errors</td>
</tr>
<tr>
<td><code>hour</code></td>
<td>0-23 inclusive</td>
<td>Valid hours in a day</td>
</tr>
<tr>
<td><code>day_of_week</code></td>
<td>0-6 inclusive</td>
<td>Valid days (Mon=0, Sun=6)</td>
</tr>
<tr>
<td><code>merchant_category</code></td>
<td>One of known categories</td>
<td>Must match training data</td>
</tr>
<tr>
<td>All fields</td>
<td>Not null</td>
<td>Required for prediction</td>
</tr>
</tbody></table>
<p>Create <code>src/data_validation.py</code>:</p>
<pre><code class="language-python"># src/data_validation.py
"""
Data validation for fraud detection.

This module provides functions to validate input data BEFORE making predictions.
Invalid data is rejected with clear error messages.

The key insight: It's better to reject bad input than to make garbage predictions.
"""
import pandas as pd
from typing import Dict, List, Any, Optional

# Define the valid merchant categories (must match training data!)
VALID_CATEGORIES = ["grocery", "restaurant", "retail", "online", "travel"]

def validate_transaction(data: Dict[str, Any]) -&gt; Dict[str, Any]:
    """
    Validate a single transaction for fraud prediction.
    
    Checks all business rules and data quality requirements.
    Returns a dictionary with 'valid' (bool) and 'errors' (list).
    
    Args:
        data: Dictionary with transaction fields
        
    Returns:
        {"valid": bool, "errors": list of error messages}
        
    Example:
        &gt;&gt;&gt; validate_transaction({"amount": -100, "hour": 25, "day_of_week": 3, "merchant_category": "grocery"})
        {'valid': False, 'errors': ['amount must be positive', 'hour must be between 0 and 23 (got 25)']}
    """
    errors = []
    
    # ==========================================================================
    # Amount Validation
    # ==========================================================================
    amount = data.get("amount")
    if amount is None:
        errors.append("amount is required")
    elif not isinstance(amount, (int, float)):
        errors.append(f"amount must be a number (got {type(amount).__name__})")
    elif amount &lt;= 0:
        errors.append("amount must be positive")
    elif amount &gt; 50000:
        errors.append(f"amount exceeds maximum allowed value of $50,000 (got ${amount:,.2f})")
    
    # ==========================================================================
    # Hour Validation
    # ==========================================================================
    hour = data.get("hour")
    if hour is None:
        errors.append("hour is required")
    elif not isinstance(hour, int):
        errors.append(f"hour must be an integer (got {type(hour).__name__})")
    elif not (0 &lt;= hour &lt;= 23):
        errors.append(f"hour must be between 0 and 23 (got {hour})")
    
    # ==========================================================================
    # Day of Week Validation
    # ==========================================================================
    day = data.get("day_of_week")
    if day is None:
        errors.append("day_of_week is required")
    elif not isinstance(day, int):
        errors.append(f"day_of_week must be an integer (got {type(day).__name__})")
    elif not (0 &lt;= day &lt;= 6):
        errors.append(f"day_of_week must be between 0 (Monday) and 6 (Sunday) (got {day})")
    
    # ==========================================================================
    # Merchant Category Validation
    # ==========================================================================
    category = data.get("merchant_category")
    if category is None:
        errors.append("merchant_category is required")
    elif not isinstance(category, str):
        errors.append(f"merchant_category must be a string (got {type(category).__name__})")
    elif category not in VALID_CATEGORIES:
        errors.append(
            f"merchant_category must be one of {VALID_CATEGORIES} (got '{category}')"
        )
    
    return {
        "valid": len(errors) == 0,
        "errors": errors
    }

def validate_batch(df: pd.DataFrame) -&gt; Dict[str, Any]:
    """
    Validate a batch of transactions using Great Expectations.
    
    This is useful for validating training data or batch prediction requests.
    Uses Great Expectations for more sophisticated validation.
    
    Args:
        df: DataFrame with transaction data
        
    Returns:
        Dictionary with validation results
    """
    import great_expectations as gx
    
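    # Convert to a Great Expectations dataset.
    # NOTE: from_pandas is part of the legacy Great Expectations API and is not
    # available in newer releases, so pin a version that still provides it.
    ge_df = gx.from_pandas(df)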
    
    results = []
    
    # Amount expectations
    r = ge_df.expect_column_values_to_be_between(
        'amount', min_value=0.01, max_value=50000, mostly=0.99
    )
    results.append(('amount_range', r.success, r.result))
    
    # Hour expectations
    r = ge_df.expect_column_values_to_be_between(
        'hour', min_value=0, max_value=23
    )
    results.append(('hour_range', r.success, r.result))
    
    # Day of week expectations
    r = ge_df.expect_column_values_to_be_between(
        'day_of_week', min_value=0, max_value=6
    )
    results.append(('day_range', r.success, r.result))
    
    # Merchant category expectations
    r = ge_df.expect_column_values_to_be_in_set(
        'merchant_category', VALID_CATEGORIES
    )
    results.append(('category_valid', r.success, r.result))
    
    # No nulls in critical fields
    for col in ['amount', 'hour', 'day_of_week', 'merchant_category']:
        r = ge_df.expect_column_values_to_not_be_null(col)
        results.append((f'{col}_not_null', r.success, r.result))
    
    # Summarize results
    passed = sum(1 for _, success, _ in results if success)
    total = len(results)
    
    return {
        'success': passed == total,
        'passed': passed,
        'total': total,
        'pass_rate': passed / total,
        'details': {name: {'passed': success, 'result': result} 
                   for name, success, result in results}
    }

if __name__ == "__main__":
    print("="*60)
    print("TESTING DATA VALIDATION")
    print("="*60)
    
    # Test single transaction validation
    print("\n1. Single Transaction Validation")
    print("-"*40)
    
    test_cases = [
        {
            "name": "Valid transaction",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Negative amount",
            "data": {"amount": -100.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Invalid hour",
            "data": {"amount": 50.0, "hour": 25, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Unknown merchant",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "unknown"}
        },
        {
            "name": "Everything wrong",
            "data": {"amount": -999, "hour": 99, "day_of_week": 15, "merchant_category": "fake"}
        },
    ]
    
    for tc in test_cases:
        result = validate_transaction(tc["data"])
        status = "PASS" if result["valid"] else "FAIL"
        print(f"\n{tc['name']}: {status}")
        if result["errors"]:
            for error in result["errors"]:
                print(f"  - {error}")
    
    # Test batch validation
    print("\n\n2. Batch Validation with Great Expectations")
    print("-"*40)
    
    train_df = pd.read_csv('data/train.csv')
    results = validate_batch(train_df)
    
    print(f"\nTraining data validation: {results['passed']}/{results['total']} checks passed")
    print(f"Pass rate: {results['pass_rate']:.1%}")
    
    if not results['success']:
        print("\nFailed checks:")
        for name, detail in results['details'].items():
            if not detail['passed']:
                print(f"  - {name}")
</code></pre>
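<p>One subtlety in the type checks above: in Python, <code>bool</code> is a subclass of <code>int</code>, so a check like <code>isinstance(hour, int)</code> also accepts <code>True</code> and <code>False</code>. If you call <code>validate_transaction</code> directly with a plain dictionary, a stricter check may be worth adding. The sketch below shows one way to do it. The helper name <code>is_strict_int</code> is hypothetical and is not part of the tutorial's module.</p>

```python
def is_strict_int(value):
    """True for real integers, False for bools and everything else."""
    # bool is a subclass of int, so isinstance(True, int) is True in Python
    return isinstance(value, int) and not isinstance(value, bool)

print(isinstance(True, int))   # True
print(is_strict_int(True))     # False
print(is_strict_int(14))       # True
print(is_strict_int(14.0))     # False
```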
<h3 id="heading-when-to-use-which-validation-approach"><strong>When to Use Which Validation Approach</strong></h3>
<table>
<thead>
<tr>
<th><strong>Approach</strong></th>
<th><strong>Use Case</strong></th>
<th><strong>Latency</strong></th>
<th><strong>When to Use</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Custom Python</strong> (<code>validate_transaction</code>)</td>
<td>Real-time API requests</td>
<td>&lt;1ms</td>
<td>Every prediction request</td>
</tr>
<tr>
<td><strong>Great Expectations</strong></td>
<td>Batch data quality</td>
<td>Seconds</td>
<td>Training data, periodic audits, CI/CD</td>
</tr>
</tbody></table>
<p>We use <strong>both</strong> in this tutorial because they serve different purposes:</p>
<ul>
<li><p>Custom validation is your <strong>runtime gatekeeper</strong> — fast enough for every request</p>
</li>
<li><p>Great Expectations is your <strong>batch auditor</strong> — thorough checks on datasets</p>
</li>
</ul>
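<p>If you want to check the latency claim on your own machine, a rough measurement with the standard library's <code>timeit</code> is enough. The sketch below times a simplified stand-in for <code>validate_transaction</code> (the <code>quick_validate</code> function is an assumption made so the example is self-contained). Absolute numbers will vary with hardware, so treat the result as an order-of-magnitude estimate.</p>

```python
import timeit

VALID_CATEGORIES = ["grocery", "restaurant", "retail", "online", "travel"]

def quick_validate(data):
    """Simplified stand-in for validate_transaction (range checks only)."""
    errors = []
    amount = data.get("amount")
    if amount is None or not (50000 >= amount > 0):
        errors.append("amount out of range")
    if data.get("hour") not in range(24):
        errors.append("hour out of range")
    if data.get("day_of_week") not in range(7):
        errors.append("day_of_week out of range")
    if data.get("merchant_category") not in VALID_CATEGORIES:
        errors.append("unknown merchant_category")
    return errors

sample = {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}

# Average over many calls to smooth out timer noise
runs = 10_000
total_seconds = timeit.timeit(lambda: quick_validate(sample), number=runs)
per_call_ms = total_seconds / runs * 1000
print(f"{per_call_ms:.5f} ms per call")
```

<p>In your project, you would time the real function by importing it from <code>src.data_validation</code> instead of using the stand-in.</p>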
<h3 id="heading-52-integrate-validation-into-fastapi"><strong>5.2 Integrate Validation into FastAPI</strong></h3>
<p>Now let's update our API to reject invalid input with clear error messages:</p>
<pre><code class="language-python"># src/serve_validated.py
"""
Serve fraud detection model with input validation.

This version adds data validation BEFORE making predictions:
- Invalid inputs are rejected with HTTP 400 and clear error messages
- Valid inputs are processed and predictions returned

This is much safer than the naive version which accepted garbage.
"""
import pickle
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.data_validation import validate_transaction

# Load model
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)

app = FastAPI(
    title="Fraud Detection API (Validated)",
    description="""
    Fraud detection API with input validation.
    
    All inputs are validated before prediction:
    - amount: Must be positive and at most $50,000
    - hour: Must be 0-23
    - day_of_week: Must be 0-6
    - merchant_category: Must be one of: grocery, restaurant, retail, online, travel
    
    Invalid inputs return HTTP 400 with detailed error messages.
    """,
    version="3.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount (must be positive)", example=150.00)
    hour: int = Field(..., description="Hour of day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Mon, 6=Sun)", example=3)
    merchant_category: str = Field(..., description="Merchant type", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    validation_passed: bool = True

class ValidationErrorResponse(BaseModel):
    detail: dict

@app.post("/predict", response_model=PredictionResponse, responses={400: {"model": ValidationErrorResponse}})
def predict(tx: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Input is validated before prediction. Invalid inputs return HTTP 400.
    """
    # On Pydantic v2, tx.model_dump() replaces the deprecated tx.dict()
    data = tx.dict()
    
    # VALIDATE INPUT BEFORE MAKING PREDICTION
    validation = validate_transaction(data)
    
    if not validation["valid"]:
        raise HTTPException(
            status_code=400,
            detail={
                "message": "Validation failed",
                "errors": validation["errors"],
                "input": data
            }
        )
    
    # Input is valid - make prediction
    data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        validation_passed=True
    )

@app.get("/health")
def health():
    return {"status": "healthy", "validation": "enabled"}
</code></pre>
<p>Start the validated API:</p>
<pre><code class="language-bash">uvicorn src.serve_validated:app --reload --host 0.0.0.0 --port 8000
</code></pre>
<p>Now test with bad data:</p>
<pre><code class="language-bash">curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": -500, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}'
</code></pre>
<p>Response (HTTP 400):</p>
<pre><code class="language-json">{
  "detail": {
    "message": "Validation failed",
    "errors": [
      "amount must be positive",
      "hour must be between 0 and 23 (got 25)",
      "day_of_week must be between 0 (Monday) and 6 (Sunday) (got 10)",
      "merchant_category must be one of ['grocery', 'restaurant', 'retail', 'online', 'travel'] (got 'fake')"
    ],
    "input": {"amount": -500.0, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}
  }
}
</code></pre>
<p><strong>This is a huge improvement!</strong> Instead of silently accepting garbage and returning meaningless predictions, we now:</p>
<ul>
<li><p>Reject invalid input immediately</p>
</li>
<li><p>Provide clear, actionable error messages</p>
</li>
<li><p>Return the original input for debugging</p>
</li>
<li><p>Use proper HTTP status codes (400 for client error)</p>
</li>
</ul>
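<p>On the client side, the structured payload is easy to consume. The sketch below assumes the response shape shown above, with messages nested under <code>detail["errors"]</code>. The helper name <code>extract_validation_errors</code> is hypothetical. It takes the status code and raw body as arguments, so it works with whichever HTTP client you use.</p>

```python
import json

def extract_validation_errors(status_code, body_text):
    """Return the list of validation messages from an HTTP 400 response body."""
    if status_code != 400:
        return []
    payload = json.loads(body_text)
    detail = payload.get("detail", {})
    # The validated endpoint nests its messages under detail["errors"]
    if isinstance(detail, dict):
        return list(detail.get("errors", []))
    return [str(detail)]

# Example body shaped like the validated API's 400 response
body = json.dumps({
    "detail": {
        "message": "Validation failed",
        "errors": ["amount must be positive"],
        "input": {"amount": -500.0},
    }
})

for message in extract_validation_errors(400, body):
    print(f"rejected: {message}")
```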
<p><strong>Checkpoint:</strong> Your validated API should:</p>
<ul>
<li><p>Accept valid transactions and return predictions</p>
</li>
<li><p>Reject invalid transactions with HTTP 400 and detailed error messages</p>
</li>
<li><p>Show validation errors for each invalid field</p>
</li>
</ul>
<h2 id="heading-6-monitor-model-performance-and-data-drift"><strong>6. Monitor Model Performance and Data Drift</strong></h2>
<p><strong>What breaks without this:</strong> Your model's accuracy drops from 98% to 70% over two months. Nobody notices until customers complain. By then, significant damage has occurred.</p>
<p>Even with a great model and clean input data, <strong>time can be an enemy</strong>. Model performance can decline as real-world data evolves – this is known as <strong>model drift</strong> or <strong>model decay</strong>.</p>
<p><strong>Why This Matters:</strong> In traditional software, you monitor CPU, memory, error rates, and response times. In ML, you must <strong>also</strong> monitor:</p>
<ul>
<li><p>Data quality (are inputs within expected ranges?)</p>
</li>
<li><p>Model performance (is accuracy holding up?)</p>
</li>
<li><p>Data drift (has input distribution changed?)</p>
</li>
<li><p>Prediction drift (has the distribution of predictions changed?)</p>
</li>
</ul>
<p>Without monitoring, a model can fail silently for weeks: fraud slips through, good customers get blocked, and revenue is lost.</p>
<h3 id="heading-61-the-four-pillars-of-ml-observability"><strong>6.1 The Four Pillars of ML Observability</strong></h3>
<table>
<thead>
<tr>
<th><strong>Pillar</strong></th>
<th><strong>What to Monitor</strong></th>
<th><strong>Why It Matters</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Data Quality</strong></td>
<td>Are inputs valid? Nulls? Outliers?</td>
<td>Bad data causes bad predictions</td>
</tr>
<tr>
<td><strong>Model Performance</strong></td>
<td>Accuracy, precision, recall, F1</td>
<td>Is the model still working?</td>
</tr>
<tr>
<td><strong>Data Drift</strong></td>
<td>Has input distribution changed from training?</td>
<td>Model may not generalize to new data</td>
</tr>
<tr>
<td><strong>Prediction Drift</strong></td>
<td>Has prediction distribution changed?</td>
<td>May indicate data or concept drift</td>
</tr>
</tbody></table>
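<p>For fraud detection, the Model Performance pillar needs more than accuracy. With a fraud rate of about 2%, a model that never flags fraud still scores 98% accuracy. The sketch below computes the four metrics from the table by hand to show this. The function name <code>classification_metrics</code> and the toy labels are assumptions for illustration. In practice you would likely use the equivalent functions from <code>sklearn.metrics</code>.</p>

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 1,000 transactions with a 2% fraud rate, scored by a model that never flags fraud
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000

print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.98, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

<p>This is why recall and precision on the fraud class are the numbers to watch when you monitor this model, with accuracy as a secondary signal.</p>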
<h3 id="heading-62-build-a-drift-monitor-with-evidently"><strong>6.2 Build a Drift Monitor with Evidently</strong></h3>
<p><strong>Evidently</strong> is an open-source library specifically designed for ML monitoring. It can detect drift, generate reports, and integrate with monitoring systems.</p>
<p>Create <code>src/monitoring.py</code>:</p>
<pre><code class="language-python"># src/monitoring.py
"""
Model monitoring with Evidently.

This module provides tools to:
1. Detect data drift between training and production data
2. Generate detailed HTML reports
3. Track drift over time
4. Alert when drift exceeds thresholds

In production, you would run drift checks periodically (hourly, daily)
and alert when significant drift is detected.
"""
import pandas as pd
import numpy as np
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric
)
from datetime import datetime
from typing import List, Dict, Any, Optional

class DriftMonitor:
    """
    Monitor for detecting data drift between reference (training) and current data.
    
    Implementation Note: We use two approaches here:
    1. Scipy's KS-test: a lightweight statistical check that drives
       check_drift() and the alerting logic
    2. Evidently: a full-featured library that renders the HTML report
       in generate_report()
    
    The two are independent, so if Evidently fails to generate a report,
    the KS-based drift detection still works.
    
    Usage:
        monitor = DriftMonitor(training_data)
        result = monitor.check_drift(production_data)
        if result['drift_detected']:
            alert("Drift detected!")
    """
    
    def __init__(self, reference_data: pd.DataFrame, feature_columns: Optional[List[str]] = None):
        """
        Initialize the drift monitor with reference (training) data.
        
        Args:
            reference_data: The training data to compare against
            feature_columns: Columns to monitor (default: all numeric columns)
        """
        self.reference = reference_data
        self.feature_columns = feature_columns or reference_data.select_dtypes(
            include=[np.number]
        ).columns.tolist()
        self.history: List[Dict[str, Any]] = []
        
        print(f"Drift monitor initialized with {len(self.reference):,} reference samples")
        print(f"Monitoring columns: {self.feature_columns}")
    
    def check_drift(self, current_data: pd.DataFrame, threshold: float = 0.1) -&gt; Dict[str, Any]:
        """
        Check for drift between reference and current data.
        
        Args:
            current_data: Current/production data to check
            threshold: Drift share threshold for alerting (default 10%)
            
        Returns:
            Dictionary with drift results
        """
        from scipy import stats
        
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        # Simple statistical drift detection using KS test
        drifted_columns = []
        for col in self.feature_columns:
            statistic, p_value = stats.ks_2samp(
                ref_subset[col].dropna(),
                cur_subset[col].dropna()
            )
            if p_value &lt; 0.05:  # 5% significance level
                drifted_columns.append(col)
        
        n_features = len(self.feature_columns)
        n_drifted = len(drifted_columns)
        drift_share = n_drifted / n_features if n_features &gt; 0 else 0
        
        result = {
            'timestamp': datetime.now().isoformat(),
            'drift_detected': n_drifted &gt; 0,
            'drift_share': drift_share,
            'drifted_columns': drifted_columns,
            'n_features': n_features,
            'n_drifted': n_drifted,
            'current_samples': len(current_data),
            'threshold': threshold,
            'alert': drift_share &gt; threshold
        }
        
        self.history.append(result)
        
        return result
    
    def generate_report(self, current_data: pd.DataFrame, output_path: str = "drift_report.html"):
        """
        Generate a detailed HTML drift report using Evidently.
        
        Opens in browser for visual inspection of drift patterns.
        """
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        try:
            report = Report(metrics=[DataDriftPreset()])
            report.run(reference_data=ref_subset, current_data=cur_subset)
            
            # Save a standalone HTML report to disk
            report.save_html(output_path)
            
            print(f"Drift report saved to {output_path}")
            print(f"Open this file in a browser to view detailed visualizations.")
        except Exception as e:
            print(f"Could not generate Evidently report: {e}")
            print(f"Using simplified drift detection instead.")
    
    def get_alerts(self, threshold: float = 0.1) -&gt; List[Dict[str, Any]]:
        """
        Get all alerts from history where drift exceeded threshold.
        """
        return [
            {
                'timestamp': r['timestamp'],
                'severity': 'HIGH' if r['drift_share'] &gt; 0.3 else 'MEDIUM',
                'drift_share': r['drift_share'],
                'message': f"Drift detected: {r['drift_share']:.1%} of features drifted",
                'drifted_columns': r['drifted_columns']
            }
            for r in self.history
            if r['drift_share'] &gt; threshold
        ]
    
    def summary(self) -&gt; Dict[str, Any]:
        """Get summary statistics from monitoring history."""
        if not self.history:
            return {"message": "No drift checks performed yet"}
        
        drift_shares = [r['drift_share'] for r in self.history]
        alerts = [r for r in self.history if r['alert']]
        
        return {
            'total_checks': len(self.history),
            'total_alerts': len(alerts),
            'avg_drift_share': np.mean(drift_shares),
            'max_drift_share': np.max(drift_shares),
            'first_check': self.history[0]['timestamp'],
            'last_check': self.history[-1]['timestamp']
        }


def simulate_drift_scenarios():
    """
    Demonstrate drift detection with different scenarios.
    
    This simulates what happens when production data differs from training data.
    """
    from src.generate_data import generate_transactions
    
    print("="*70)
    print("DRIFT DETECTION SIMULATION")
    print("="*70)
    
    # Load reference (training) data
    print("\n1. Loading reference data (training set)...")
    reference = pd.read_csv('data/train.csv')
    feature_cols = ['amount', 'hour', 'day_of_week']
    
    # Initialize drift monitor
    monitor = DriftMonitor(reference, feature_cols)
    
    # Scenario 1: Similar data (should show minimal drift)
    print("\n" + "-"*70)
    print("SCENARIO 1: Test data (similar distribution)")
    print("-"*70)
    test_data = pd.read_csv('data/test.csv')
    result = monitor.check_drift(test_data)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 2: Fraud spike (10% fraud instead of 2%)
    print("\n" + "-"*70)
    print("SCENARIO 2: Fraud spike (10% fraud rate instead of 2%)")
    print("-"*70)
    fraud_spike = generate_transactions(n_samples=2000, fraud_ratio=0.10, seed=101)
    result = monitor.check_drift(fraud_spike)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 3: Amount inflation (everything costs more)
    print("\n" + "-"*70)
    print("SCENARIO 3: Amount inflation (2x multiplier)")
    print("-"*70)
    inflated = test_data.copy()
    inflated['amount'] = inflated['amount'] * 2
    result = monitor.check_drift(inflated)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 4: Time shift (more late-night transactions)
    print("\n" + "-"*70)
    print("SCENARIO 4: Time shift (mostly late-night transactions)")
    print("-"*70)
    night_shift = test_data.copy()
    # Seeded generator so the scenario is reproducible across runs
    rng = np.random.default_rng(42)
    night_shift['hour'] = rng.choice([0, 1, 2, 3, 22, 23], size=len(night_shift))
    result = monitor.check_drift(night_shift)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Generate detailed report for the most drifted scenario
    print("\n" + "-"*70)
    print("GENERATING DETAILED REPORT")
    print("-"*70)
    monitor.generate_report(night_shift, "drift_report.html")
    
    # Print summary
    print("\n" + "-"*70)
    print("MONITORING SUMMARY")
    print("-"*70)
    summary = monitor.summary()
    print(f"  Total checks: {summary['total_checks']}")
    print(f"  Total alerts: {summary['total_alerts']}")
    print(f"  Average drift share: {summary['avg_drift_share']:.1%}")
    print(f"  Maximum drift share: {summary['max_drift_share']:.1%}")
    
    # Print alerts
    alerts = monitor.get_alerts()
    if alerts:
        print(f"\n  Alerts ({len(alerts)}):")
        for alert in alerts:
            print(f"    [{alert['severity']}] {alert['message']}")
    
    print("\n" + "="*70)
    print("DRIFT DETECTION SIMULATION COMPLETE")
    print("="*70)
    print("\nOpen drift_report.html in your browser to see detailed visualizations!")


if __name__ == "__main__":
    simulate_drift_scenarios()
</code></pre>
<p>Run the drift simulation:</p>
<pre><code class="language-bash">python src/monitoring.py
</code></pre>
<p>For each scenario, the output reports whether drift was detected, the share of drifted features, and whether an alert fired. Then open <code>drift_report.html</code> in your browser to inspect the per-feature drift visualizations.</p>
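<p>It helps to know what the KS test in <code>check_drift</code> measures. The two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical cumulative distribution functions of the two samples. The sketch below computes that statistic with the standard library only. It does not compute the p-value, which <code>scipy.stats.ks_2samp</code> does for you. The function name <code>ks_statistic</code> and the toy hour data are assumptions for illustration.</p>

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is enough
    for x in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap

daytime_hours = [9, 10, 11, 12, 13, 14, 15, 16]
night_hours = [0, 1, 2, 3, 22, 23, 0, 1]

print(ks_statistic(daytime_hours, daytime_hours))  # 0.0
print(ks_statistic(daytime_hours, night_hours))    # 0.75
```

<p>A statistic near 0 means the two distributions overlap closely, and a value near 1 means they barely overlap. The p-value then tells you whether a gap of that size is plausible by chance for the given sample sizes.</p>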
<h3 id="heading-63-production-monitoring-strategy"><strong>6.3 Production Monitoring Strategy</strong></h3>
<p>In a production environment, you would:</p>
<ol>
<li><p><strong>Log all predictions</strong> to a database or data warehouse</p>
</li>
<li><p><strong>Run drift checks periodically</strong> (hourly for high-traffic systems, daily for lower traffic)</p>
</li>
<li><p><strong>Set up alerts</strong> when drift exceeds thresholds (integrate with PagerDuty, Slack, etc.)</p>
</li>
<li><p><strong>Trigger retraining</strong> if drift is severe or sustained</p>
</li>
<li><p><strong>Create dashboards</strong> to track drift over time (Grafana, Datadog, etc.)</p>
</li>
</ol>
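<p>Step 4 leaves "severe or sustained" open to interpretation. One possible rule is sketched below, using records shaped like the entries in <code>DriftMonitor.history</code>. The function name <code>should_retrain</code> and the default values for <code>severe_share</code> and <code>sustained_checks</code> are illustrative assumptions, not tuned recommendations.</p>

```python
def should_retrain(history, severe_share=0.5, sustained_checks=3):
    """Decide whether drift is severe or sustained enough to retrain.

    history: list of check results shaped like DriftMonitor.history,
             each with 'drift_share' and 'alert' keys.
    severe_share and sustained_checks are illustrative defaults, not tuned values.
    """
    if not history:
        return False
    # Severe: the latest check alone crosses the high-water mark
    if history[-1]["drift_share"] >= severe_share:
        return True
    # Sustained: the last N checks all raised an alert
    recent = history[-sustained_checks:]
    return len(recent) == sustained_checks and all(r["alert"] for r in recent)

history = [
    {"drift_share": 0.33, "alert": True},
    {"drift_share": 0.33, "alert": True},
    {"drift_share": 0.33, "alert": True},
]
print(should_retrain(history))       # True (three alerts in a row)
print(should_retrain(history[:2]))   # False (not sustained yet)
```

<p>Requiring several consecutive alerts avoids retraining on a single noisy check, while the severe threshold still lets a sudden large shift trigger retraining immediately.</p>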
<p><strong>Checkpoint:</strong> Running <code>python src/monitoring.py</code> should:</p>
<ul>
<li><p>Show minimal drift for similar data (test set)</p>
</li>
<li><p>Show significant drift for modified data (fraud spike, inflation, time shift)</p>
</li>
<li><p>Generate an HTML report that you can view in your browser</p>
</li>
</ul>
<h2 id="heading-7-automate-testing-and-deployment-with-cicd"><strong>7. Automate Testing and Deployment with CI/CD</strong></h2>
<p><strong>What breaks without this:</strong> A typo in your code breaks the API. You deploy on Friday at 5 PM. Nobody notices until Monday. Fraud losses spike over the weekend.</p>
<p><strong>CI/CD</strong> (Continuous Integration/Continuous Deployment) ensures reliable, repeatable releases. As JFrog notes: <em>"A strong CI/CD pipeline enables ML teams to build robust, bug-free models more quickly and efficiently."</em></p>
<p><strong>Why This Matters:</strong> In ML, changes aren't just code – they're also data and models. CI/CD ensures that when you change training logic, data preprocessing, or hyperparameters, tests verify the change doesn't break anything before it reaches production. It's the difference between deploying with confidence and deploying with crossed fingers.</p>
<h3 id="heading-71-write-tests-for-data-and-model"><strong>7.1 Write Tests for Data and Model</strong></h3>
<p>Create <code>tests/test_data_and_model.py</code>:</p>
<pre><code class="language-python"># tests/test_data_and_model.py
"""
Tests for data quality and model performance.

These tests run in CI/CD to ensure:
1. Data meets quality requirements
2. Model meets performance thresholds
3. No regressions are introduced

Run with: pytest tests/test_data_and_model.py -v
"""
import pandas as pd
import pickle
import pytest
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

class TestDataQuality:
    """Tests for training data quality."""
    
    @pytest.fixture
    def train_data(self):
        return pd.read_csv("data/train.csv")
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_train_data_has_expected_columns(self, train_data):
        """Training data must have all required columns."""
        required_columns = {"amount", "hour", "day_of_week", "merchant_category", "is_fraud"}
        actual_columns = set(train_data.columns)
        missing = required_columns - actual_columns
        assert not missing, f"Missing columns: {missing}"
    
    def test_train_data_not_empty(self, train_data):
        """Training data must have rows."""
        assert len(train_data) &gt; 0, "Training data is empty"
        assert len(train_data) &gt;= 1000, f"Training data too small: {len(train_data)} rows"
    
    def test_no_negative_amounts(self, train_data):
        """Transaction amounts must be non-negative."""
        negative_count = (train_data["amount"] &lt; 0).sum()
        assert negative_count == 0, f"Found {negative_count} negative amounts"
    
    def test_amounts_reasonable(self, train_data):
        """Transaction amounts should be within reasonable bounds."""
        max_amount = train_data["amount"].max()
        assert max_amount &lt;= 100000, f"Max amount {max_amount} exceeds reasonable limit"
    
    def test_hours_valid(self, train_data):
        """Hours must be 0-23."""
        invalid = train_data[(train_data["hour"] &lt; 0) | (train_data["hour"] &gt; 23)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid hours"
    
    def test_days_valid(self, train_data):
        """Days of week must be 0-6."""
        invalid = train_data[(train_data["day_of_week"] &lt; 0) | (train_data["day_of_week"] &gt; 6)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid days"
    
    def test_merchant_categories_valid(self, train_data):
        """Merchant categories must be from known set."""
        valid_categories = {"grocery", "restaurant", "retail", "online", "travel"}
        actual_categories = set(train_data["merchant_category"].unique())
        invalid = actual_categories - valid_categories
        assert not invalid, f"Invalid merchant categories: {invalid}"
    
    def test_fraud_ratio_reasonable(self, train_data):
        """Fraud ratio should be realistic (between 0.1% and 50%)."""
        fraud_ratio = train_data["is_fraud"].mean()
        assert 0.001 &lt;= fraud_ratio &lt;= 0.5, f"Fraud ratio {fraud_ratio:.2%} is unrealistic"
    
    def test_no_nulls_in_critical_columns(self, train_data):
        """Critical columns must not have null values."""
        critical = ["amount", "hour", "day_of_week", "merchant_category", "is_fraud"]
        for col in critical:
            null_count = train_data[col].isnull().sum()
            assert null_count == 0, f"Column {col} has {null_count} null values"


class TestModelPerformance:
    """Tests for model performance thresholds."""
    
    @pytest.fixture
    def model_and_encoder(self):
        with open("models/model.pkl", "rb") as f:
            return pickle.load(f)
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_model_loads_successfully(self, model_and_encoder):
        """Model file must load without errors."""
        model, encoder = model_and_encoder
        assert model is not None, "Model is None"
        assert encoder is not None, "Encoder is None"
    
    def test_model_can_predict(self, model_and_encoder, test_data):
        """Model must be able to make predictions."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        predictions = model.predict(X)
        assert len(predictions) == len(X), "Prediction count mismatch"
    
    def test_accuracy_threshold(self, model_and_encoder, test_data):
        """Model accuracy must be at least 90%."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        accuracy = model.score(X, y)
        assert accuracy &gt;= 0.90, f"Accuracy {accuracy:.2%} below 90% threshold"
    
    def test_f1_threshold(self, model_and_encoder, test_data):
        """Model F1-score must be at least 0.3 (sanity check for imbalanced data)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        f1 = f1_score(y, y_pred)
        assert f1 &gt;= 0.3, f"F1-score {f1:.2f} below 0.3 threshold"
    
    def test_precision_not_zero(self, model_and_encoder, test_data):
        """Model precision must be greater than 0 (at least one fraud prediction is correct)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        precision = precision_score(y, y_pred, zero_division=0)
        assert precision &gt; 0, "Model has zero precision (no fraud prediction was correct)"
    
    def test_recall_not_zero(self, model_and_encoder, test_data):
        """Model recall must be greater than 0 (catches at least some fraud)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        recall = recall_score(y, y_pred, zero_division=0)
        assert recall &gt; 0, "Model has zero recall (misses all fraud)"
</code></pre>
<p>Create <code>tests/test_api.py</code>:</p>
<pre><code class="language-python"># tests/test_api.py
"""
Tests for the FastAPI prediction service.

These tests ensure the API:
1. Returns correct responses for valid inputs
2. Rejects invalid inputs with proper error messages
3. Health check works

Run with: pytest tests/test_api.py -v
Note: Requires the API to be running on localhost:8000
"""
import pytest
import httpx

BASE_URL = "http://localhost:8000"

class TestPredictionEndpoint:
    """Tests for the /predict endpoint."""
    
    def test_valid_prediction_returns_200(self):
        """Valid input should return HTTP 200 with prediction."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        assert "is_fraud" in data
        assert "fraud_probability" in data
        assert isinstance(data["is_fraud"], bool)
        assert 0 &lt;= data["fraud_probability"] &lt;= 1
    
    def test_high_risk_transaction(self):
        """High-risk transaction should have higher fraud probability."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 500.0,
            "hour": 3,  # Late night
            "day_of_week": 1,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        # The exact value depends on the trained model, so this only checks
        # that the response carries a valid probability.
        assert 0.0 &lt;= data["fraud_probability"] &lt;= 1.0
    
    def test_negative_amount_rejected(self):
        """Negative amount should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": -100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
        assert "errors" in response.json()["detail"]
    
    def test_invalid_hour_rejected(self):
        """Invalid hour should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 25,  # Invalid
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_invalid_merchant_rejected(self):
        """Unknown merchant category should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "unknown_category"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_missing_field_rejected(self):
        """Missing required field should be rejected."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14
            # Missing day_of_week and merchant_category
        }, timeout=10)
        
        assert response.status_code == 422  # Pydantic validation error


class TestHealthEndpoint:
    """Tests for the /health endpoint."""
    
    def test_health_returns_200(self):
        """Health endpoint should return 200."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        assert response.status_code == 200
    
    def test_health_returns_healthy_status(self):
        """Health endpoint should indicate healthy status."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        data = response.json()
        assert data["status"] == "healthy"
</code></pre>
<p>Run tests locally:</p>
<pre><code class="language-bash"># Run data and model tests (API not needed)
pytest tests/test_data_and_model.py -v

# Run API tests (requires API to be running)
pytest tests/test_api.py -v
</code></pre>
<h3 id="heading-72-github-actions-workflow"><strong>7.2 GitHub Actions Workflow</strong></h3>
<p>⚠️ <strong>Note for Production Teams</strong><br>In real ML teams, you typically don't retrain full models inside CI — it's slow and resource-intensive.<br>Here we do it to keep everything local, reproducible, and self-contained for learning.<br>Production pipelines usually separate training (scheduled jobs) from testing (CI/CD).</p>
<p>Create <code>.github/workflows/ci.yml</code>:</p>
<pre><code class="language-yaml"># .github/workflows/ci.yml
name: ML Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: 'pip'
      
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      
      - name: Generate training data
        run: python src/generate_data.py
      
      - name: Train model
        run: python src/train_naive.py
      
      - name: Run data and model tests
        run: pytest tests/test_data_and_model.py -v --tb=short
      
      - name: Build Docker image
        run: docker build -t fraud-detection-api .
      
      - name: Run container for API tests
        run: |
          docker run -d -p 8000:8000 --name test-api fraud-detection-api
          sleep 10  # Wait for API to start
          curl -f http://localhost:8000/health || exit 1
      
      - name: Run API tests
        run: pytest tests/test_api.py -v --tb=short
      
      - name: Cleanup
        if: always()
        run: docker stop test-api || true
</code></pre>
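<p>The fixed <code>sleep 10</code> in the workflow can be too short on a slow runner and wasteful on a fast one. As a minimal sketch, assuming you add a small helper script to the repository, you could poll the health endpoint until it answers. The names <code>wait_until_healthy</code> and <code>http_health_probe</code> are hypothetical and not part of the project:</p>

```python
import time
import urllib.request
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Connection errors are expected while the server is still booting.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


def http_health_probe(url: str = "http://localhost:8000/health") -> bool:
    """Return True when the health endpoint answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=2) as response:
        return response.status == 200
```

<p>A CI step could then call <code>wait_until_healthy(http_health_probe)</code> and fail the job when it returns <code>False</code>. The probe, sleep, and clock are injectable so the helper can be tested without a running container.</p>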
<h3 id="heading-73-dockerize-the-application"><strong>7.3 Dockerize the Application</strong></h3>
<p>Create <code>Dockerfile</code>:</p>
<pre><code class="language-dockerfile"># Dockerfile
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update &amp;&amp; apt-get install -y \
    curl \
    &amp;&amp; rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ src/
COPY models/ models/
COPY data/ data/

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the API
CMD ["uvicorn", "src.serve_validated:app", "--host", "0.0.0.0", "--port", "8000"]
</code></pre>
<p>Create <code>.dockerignore</code>:</p>
<pre><code class="language-plaintext"># .dockerignore
venv/
__pycache__/
*.pyc
.git/
.github/
mlruns/
*.db
*.html
.pytest_cache/
</code></pre>
<p>Build and run locally:</p>
<pre><code class="language-bash"># Build the Docker image
docker build -t fraud-detection-api .

# Run the container
docker run -p 8000:8000 fraud-detection-api

# Test it
curl http://localhost:8000/health
</code></pre>
<p><strong>Checkpoint:</strong></p>
<ul>
<li><p>All tests pass: <code>pytest tests/test_data_and_model.py -v</code></p>
</li>
<li><p>Docker image builds successfully</p>
</li>
<li><p>Container runs and responds to health checks</p>
</li>
</ul>
<h2 id="heading-8-incident-response-playbook"><strong>8. Incident Response Playbook</strong></h2>
<p>When things go wrong in production (and they will), you need a plan. This section provides playbooks for common ML incidents.</p>
<h3 id="heading-scenario-false-positive-spike"><strong>Scenario: False Positive Spike</strong></h3>
<p><strong>Symptoms:</strong> Your fraud model suddenly flags 40% of legitimate transactions as fraud, blocking customers and overwhelming your manual review team.</p>
<p><strong>Severity:</strong> HIGH - Direct customer impact</p>
<p><strong>Phase 1: Mitigation (0-5 minutes)</strong></p>
<ol>
<li><p><strong>Acknowledge the incident</strong> - Notify stakeholders that you're aware and responding</p>
</li>
<li><p><strong>Roll back to previous model</strong> - In MLflow UI, move the @champion alias to the previous model version</p>
</li>
<li><p><strong>Restart the API</strong> - <code>docker restart fraud-api</code> or redeploy</p>
</li>
<li><p><strong>Verify</strong> - Check that false positive rate has returned to normal</p>
</li>
<li><p><strong>Communicate</strong> - "Issue detected and mitigated. Investigating root cause."</p>
</li>
</ol>
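<p>If you would rather script the rollback than click through the MLflow UI, the alias move can be sketched as below. This assumes a client object that exposes <code>get_model_version_by_alias</code> and <code>set_registered_model_alias</code>, as <code>mlflow.MlflowClient</code> does in recent MLflow releases. The function name <code>rollback_champion</code> and the "previous version is current minus one" rule are illustrative assumptions, not the handbook's own code:</p>

```python
def rollback_champion(client, model_name: str, alias: str = "champion") -> int:
    """Point `alias` at the version just before the one it currently targets.

    `client` is expected to behave like mlflow.MlflowClient. Choosing
    "current version minus one" is a simplifying assumption; your team may
    prefer to pin a specific known-good version instead.
    """
    current = int(client.get_model_version_by_alias(model_name, alias).version)
    if current <= 1:
        raise RuntimeError(f"{model_name} has no earlier version to roll back to")
    previous = current - 1
    client.set_registered_model_alias(model_name, alias, str(previous))
    return previous
```

<p>With MLflow installed you would pass <code>mlflow.MlflowClient()</code> and your registered model name. The API still has to reload the model afterwards, which is why the restart step follows.</p>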
<p><strong>Phase 2: Diagnosis (5-60 minutes)</strong></p>
<ol>
<li><p><strong>Check drift report</strong> - Run <code>python src/monitoring.py</code> with recent production data</p>
</li>
<li><p><strong>Check data validation logs</strong> - Did upstream data format change?</p>
</li>
<li><p><strong>Check recent deployments</strong> - Was there a new model or code deployed recently?</p>
</li>
<li><p><strong>Compare metrics</strong> - What's different between the rolled-back and problematic model?</p>
</li>
</ol>
<p><strong>Example root causes:</strong></p>
<ul>
<li><p>Upstream system sent amounts in cents instead of dollars</p>
</li>
<li><p>New merchant category appeared that wasn't in training data</p>
</li>
<li><p>Holiday shopping patterns differed significantly from training data</p>
</li>
</ul>
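<p>The first root cause, a unit change such as cents instead of dollars, tends to show up as a large jump in the typical transaction amount. A rough check, assuming you can pull a sample of reference and recent amounts, might compare medians. The function name and the factor of 10 are hypothetical choices:</p>

```python
from statistics import median
from typing import Optional, Sequence


def detect_scale_shift(
    reference: Sequence[float],
    recent: Sequence[float],
    factor: float = 10.0,
) -> Optional[float]:
    """Return the recent/reference median ratio if it moved by `factor` or more.

    A ratio near 100 hints at dollars being sent as cents; a ratio near 0.01
    hints at the opposite. Returns None when the medians look comparable.
    """
    ref_median = median(reference)
    new_median = median(recent)
    if ref_median <= 0 or new_median <= 0:
        return None  # Ratio is not meaningful for non-positive medians.
    ratio = new_median / ref_median
    if ratio >= factor or ratio <= 1 / factor:
        return ratio
    return None


training_amounts = [20.0, 35.0, 50.0, 80.0, 120.0]
production_amounts = [amount * 100 for amount in training_amounts]
print(detect_scale_shift(training_amounts, production_amounts))
```

<p>This is a coarse heuristic rather than a replacement for the drift report, but it is cheap enough to run on every batch.</p>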
<p><strong>Phase 3: Remediation (1-24 hours)</strong></p>
<ol>
<li><p><strong>Fix the root cause</strong> - Add validation for the edge case, or update training data</p>
</li>
<li><p><strong>Retrain if needed</strong> - Include new patterns in training data</p>
</li>
<li><p><strong>Add test case</strong> - Prevent this from happening again</p>
</li>
<li><p><strong>Document</strong> - Add to runbook for future reference</p>
</li>
</ol>
<h3 id="heading-scenario-gradual-performance-decay"><strong>Scenario: Gradual Performance Decay</strong></h3>
<p><strong>Symptoms:</strong> Monitoring shows fraud recall dropping 2% per week over a month. No sudden failures, just slow degradation.</p>
<p><strong>Severity:</strong> MEDIUM - Gradual impact, time to respond</p>
<p><strong>Response:</strong></p>
<ol>
<li><p><strong>Investigate drift report</strong> - Look for gradual distribution changes</p>
<pre><code class="language-bash">python src/monitoring.py
</code></pre>
</li>
<li><p><strong>Collect recent labeled data</strong> - Get confirmed fraud cases from the past month</p>
</li>
<li><p><strong>Analyze patterns</strong> - What's different about recent fraud?</p>
<ul>
<li><p>New attack vectors?</p>
</li>
<li><p>Different time patterns?</p>
</li>
<li><p>New merchant categories?</p>
</li>
</ul>
</li>
<li><p><strong>Retrain on combined data</strong> - Include both old and new patterns</p>
<pre><code class="language-bash">python src/train_mlflow.py
</code></pre>
</li>
<li><p><strong>Deploy via canary</strong> - Route 10% of traffic to the new model first</p>
<ul>
<li><p>Monitor metrics for 1-2 days</p>
</li>
<li><p>If metrics improve, increase to 50%, then 100%</p>
</li>
<li><p>If metrics worsen, roll back</p>
</li>
</ul>
</li>
<li><p><strong>Set up recurring retraining</strong> - Schedule weekly or monthly retraining</p>
</li>
</ol>
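<p>The canary step can be approximated without extra infrastructure by hashing a stable request or customer identifier into 100 buckets. This is a sketch under the assumption that each request carries such an identifier; <code>route_model</code> and the labels <code>"canary"</code> and <code>"champion"</code> are hypothetical names:</p>

```python
import hashlib


def route_model(request_id: str, canary_percent: int = 10) -> str:
    """Deterministically send roughly `canary_percent`% of ids to the canary.

    Hashing keeps routing sticky: the same id always lands on the same model,
    which makes per-model metrics easier to compare.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "champion"


sample = [route_model(f"txn-{i}") for i in range(10_000)]
print(sample.count("canary") / len(sample))  # Close to 0.10
```

<p>Raising <code>canary_percent</code> to 50 and then 100 mirrors the rollout described above, and setting it back to 0 acts as the rollback.</p>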
<h3 id="heading-scenario-upstream-data-schema-change"><strong>Scenario: Upstream Data Schema Change</strong></h3>
<p><strong>Symptoms:</strong> API starts returning 500 errors. Logs show <code>KeyError: 'merchant_category'</code>.</p>
<p><strong>Severity:</strong> HIGH - Service is down</p>
<p><strong>Response:</strong></p>
<ol>
<li><p><strong>Check error logs</strong> - Identify the exact error</p>
<pre><code class="language-python">KeyError: 'merchant_category'
</code></pre>
</li>
<li><p><strong>Check upstream data</strong> - Did the field name change?</p>
<ul>
<li><p><code>merchant_category</code> -&gt; <code>category</code></p>
</li>
<li><p><code>amount</code> -&gt; <code>transaction_amount</code></p>
</li>
</ul>
</li>
<li><p><strong>Immediate fix</strong> - Add field name mapping</p>
<pre><code class="language-python"># Quick fix in API
if 'category' in data and 'merchant_category' not in data:
    data['merchant_category'] = data['category']
</code></pre>
</li>
<li><p><strong>Long-term fix</strong> - Add validation that catches schema changes</p>
<pre><code class="language-python">required_fields = ['amount', 'hour', 'day_of_week', 'merchant_category']
missing = [f for f in required_fields if f not in data]
if missing:
    # ValueError keeps this snippet self-contained; swap in your API's own error type
    raise ValueError(f"Missing fields: {missing}")
</code></pre>
</li>
<li><p><strong>Add integration test</strong> - Test with upstream system in CI/CD</p>
</li>
</ol>
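<p>The quick fix and the long-term check can be combined into one helper that is easy to cover with a unit test. The sketch below assumes the two renames listed above are the only ones you need to tolerate; <code>normalize_payload</code> and <code>FIELD_ALIASES</code> are hypothetical names:</p>

```python
REQUIRED_FIELDS = ("amount", "hour", "day_of_week", "merchant_category")

# Upstream name -> name the model expects (assumed renames, for illustration).
FIELD_ALIASES = {
    "category": "merchant_category",
    "transaction_amount": "amount",
}


def normalize_payload(data: dict) -> dict:
    """Map known upstream aliases to expected names, then check required fields."""
    normalized = dict(data)  # Copy so the caller's dict is left untouched.
    for upstream_name, expected_name in FIELD_ALIASES.items():
        if upstream_name in normalized and expected_name not in normalized:
            normalized[expected_name] = normalized.pop(upstream_name)
    missing = [field for field in REQUIRED_FIELDS if field not in normalized]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    return normalized


payload = {"transaction_amount": 100.0, "hour": 14, "day_of_week": 3, "category": "online"}
print(normalize_payload(payload))
```

<p>Keeping the alias map in one place means the next upstream rename is a one-line change plus a new test case.</p>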
<h2 id="heading-9-how-to-put-it-all-together"><strong>9. How to Put It All Together</strong></h2>
<p>Let's step back and appreciate what we've built. Our initial naive system has transformed into a <strong>local ML platform</strong> with production-grade components.</p>
<blockquote>
<p>💡 <strong>Mental Model</strong>: Each tool in this stack is a "catch net" for a specific failure mode:</p>
<ul>
<li><p>MLflow catches "which model is this?"</p>
</li>
<li><p>Feast catches "are features consistent?"</p>
</li>
<li><p>Great Expectations catches "is this data valid?"</p>
</li>
<li><p>Evidently catches "has the world changed?"</p>
</li>
<li><p>CI/CD catches "did we break something?"</p>
</li>
</ul>
<p>Together, they form defense-in-depth for ML systems.</p>
</blockquote>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>Tool</strong></th>
<th><strong>Problem Solved</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Experiment Tracking</strong></td>
<td>MLflow</td>
<td>Every run logged, reproducible</td>
</tr>
<tr>
<td><strong>Model Registry</strong></td>
<td>MLflow</td>
<td>Versioned models, rollback capability</td>
</tr>
<tr>
<td><strong>Feature Store</strong></td>
<td>Feast</td>
<td>Consistent features, no training-serving skew</td>
</tr>
<tr>
<td><strong>Data Validation</strong></td>
<td>Great Expectations</td>
<td>Bad data rejected with clear errors</td>
</tr>
<tr>
<td><strong>Monitoring</strong></td>
<td>Evidently</td>
<td>Drift detected before it causes problems</td>
</tr>
<tr>
<td><strong>Containerization</strong></td>
<td>Docker</td>
<td>Environment consistency everywhere</td>
</tr>
<tr>
<td><strong>CI/CD</strong></td>
<td>GitHub Actions</td>
<td>Automated testing and safe deployments</td>
</tr>
</tbody></table>
<h3 id="heading-the-complete-workflow"><strong>The Complete Workflow</strong></h3>
<p>Here's how all the pieces work together in practice:</p>
<ol>
<li><p><strong>Data arrives</strong> - New transaction data comes in from upstream systems</p>
</li>
<li><p><strong>Validation gate</strong> - Great Expectations rules check data quality. Bad data is rejected with clear error messages before it can cause harm.</p>
</li>
<li><p><strong>Feature computation</strong> - Feast computes features using the same definitions for both training and serving. No more training-serving skew.</p>
</li>
<li><p><strong>Training</strong> - When you retrain, MLflow logs all parameters, metrics, and artifacts. Every experiment is reproducible and comparable.</p>
</li>
<li><p><strong>Model registry</strong> - Trained models are automatically versioned. You can compare metrics, promote the best one by moving the @champion alias to it, and roll back if needed.</p>
</li>
<li><p><strong>Serving</strong> - FastAPI loads the @champion model from MLflow. Each request is validated, features are retrieved from Feast, and predictions are returned.</p>
</li>
<li><p><strong>Monitoring</strong> - Evidently checks for drift periodically. If input distributions change significantly, alerts are triggered.</p>
</li>
<li><p><strong>Retraining loop</strong> - When drift is detected, you retrain on new data, compare metrics, and promote if better. The cycle continues.</p>
</li>
<li><p><strong>CI/CD safety net</strong> - All code changes go through automated tests. Docker ensures environment consistency. Nothing reaches production without passing the pipeline.</p>
</li>
</ol>
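<p>The "compare metrics, and promote if better" decision in the retraining loop is worth writing down as code so it is applied the same way every time. Here is one possible rule, assuming both models were evaluated on the same held-out data. The metric names, the minimum gain, and the function name are illustrative assumptions:</p>

```python
from typing import Mapping, Sequence


def should_promote(
    champion: Mapping[str, float],
    challenger: Mapping[str, float],
    primary: str = "f1",
    min_gain: float = 0.01,
    guardrails: Sequence[str] = ("recall",),
) -> bool:
    """Promote only if the primary metric improves and no guardrail regresses."""
    if challenger[primary] - champion[primary] < min_gain:
        return False
    return all(challenger[metric] >= champion[metric] for metric in guardrails)


current = {"f1": 0.40, "recall": 0.35}
candidate = {"f1": 0.45, "recall": 0.38}
print(should_promote(current, candidate))
```

<p>Requiring a minimum gain avoids swapping models over noise, and the guardrail keeps a higher F1 from hiding a drop in fraud recall.</p>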
<h2 id="heading-10-whats-next-scale-to-production"><strong>10. What's Next: Scale to Production</strong></h2>
<p>This project runs locally, but the principles and tools extend directly to production deployments. Here's how each component scales:</p>
<h3 id="heading-scaling-feast-for-production"><strong>Scaling Feast for Production</strong></h3>
<p>We used Feast with a local SQLite online store and Parquet files as the offline store. For production:</p>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>Local</strong></th>
<th><strong>Production</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Online Store</td>
<td>SQLite</td>
<td>Redis, DynamoDB, or PostgreSQL</td>
</tr>
<tr>
<td>Offline Store</td>
<td>Parquet files</td>
<td>BigQuery, Snowflake, or Redshift</td>
</tr>
<tr>
<td>Feature Server</td>
<td>Embedded</td>
<td>Dedicated Feast serving cluster</td>
</tr>
</tbody></table>
<p>Benefits at scale:</p>
<ul>
<li><p>Sub-10ms feature retrieval</p>
</li>
<li><p>Horizontal scaling for high throughput</p>
</li>
<li><p>Feature monitoring and statistics</p>
</li>
<li><p>Point-in-time joins at petabyte scale</p>
</li>
</ul>
<h3 id="heading-scaling-mlflow-for-production"><strong>Scaling MLflow for Production</strong></h3>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>Local</strong></th>
<th><strong>Production</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Backend Store</td>
<td>SQLite</td>
<td>PostgreSQL or MySQL</td>
</tr>
<tr>
<td>Artifact Store</td>
<td>Local filesystem</td>
<td>S3, GCS, or Azure Blob</td>
</tr>
<tr>
<td>Tracking Server</td>
<td>Single instance</td>
<td>Load-balanced cluster</td>
</tr>
</tbody></table>
<h3 id="heading-kubernetes-deployment"><strong>Kubernetes Deployment</strong></h3>
<p>When you outgrow Docker Compose:</p>
<ul>
<li><p><strong>KServe or Seldon</strong> for serverless model serving with auto-scaling</p>
</li>
<li><p><strong>Horizontal Pod Autoscaler</strong> to scale based on CPU/memory/custom metrics</p>
</li>
<li><p><strong>Canary deployments</strong> to safely roll out new models (route 10% traffic first)</p>
</li>
<li><p><strong>GPU scheduling</strong> for inference-heavy models</p>
</li>
</ul>
<h3 id="heading-advanced-monitoring"><strong>Advanced Monitoring</strong></h3>
<p>Expand observability with:</p>
<ul>
<li><p><strong>Prometheus + Grafana</strong> for real-time dashboards</p>
</li>
<li><p><strong>OpenTelemetry</strong> for distributed tracing</p>
</li>
<li><p><strong>PagerDuty/Slack integration</strong> for alerts</p>
</li>
<li><p><strong>Labeled data collection</strong> for continuous model evaluation</p>
</li>
</ul>
<h3 id="heading-ab-testing-and-multi-armed-bandits"><strong>A/B Testing and Multi-Armed Bandits</strong></h3>
<p>With the model registry in place, you can:</p>
<ul>
<li><p>Serve <strong>multiple models</strong> concurrently (champion vs challengers)</p>
</li>
<li><p><strong>Route traffic</strong> dynamically based on context</p>
</li>
<li><p><strong>Collect metrics</strong> for each model variant</p>
</li>
<li><p><strong>Automatically promote</strong> the best performer</p>
</li>
</ul>
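<p>To make the bandit idea concrete, here is a minimal epsilon-greedy router. It is a sketch, not a production traffic manager: the class name, the reward definition (for example 1.0 when a prediction is later confirmed correct), and the default epsilon are all assumptions:</p>

```python
import random
from typing import Iterable, Optional


class EpsilonGreedyRouter:
    """Mostly serve the best-performing model, but keep exploring the others."""

    def __init__(self, models: Iterable[str], epsilon: float = 0.1, seed: Optional[int] = None):
        self.models = list(models)
        self.epsilon = epsilon
        self.pulls = {model: 0 for model in self.models}
        self.rewards = {model: 0.0 for model in self.models}
        self._rng = random.Random(seed)

    def mean_reward(self, model: str) -> float:
        pulls = self.pulls[model]
        return self.rewards[model] / pulls if pulls else 0.0

    def choose(self) -> str:
        # Explore with probability epsilon, otherwise exploit the current best.
        if self._rng.random() < self.epsilon:
            return self._rng.choice(self.models)
        return max(self.models, key=self.mean_reward)

    def record(self, model: str, reward: float) -> None:
        self.pulls[model] += 1
        self.rewards[model] += reward


router = EpsilonGreedyRouter(["champion", "challenger"], epsilon=0.1, seed=0)
model = router.choose()
router.record(model, reward=1.0)
```

<p>Because rewards in fraud detection often arrive days later, you would typically call <code>record</code> from a batch job once labels are confirmed rather than inside the request path.</p>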
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Congratulations on building a production-ready ML system on your local machine!</p>
<p>What we assembled here is a microcosm of real-world ML platforms:</p>
<ul>
<li><p>We started with just a model saved to a pickle file</p>
</li>
<li><p>We ended up with <strong>MLOps best practices</strong>: experiment tracking, model versioning, feature stores, data validation, monitoring, containerization, and CI/CD</p>
</li>
</ul>
<p><strong>The tools we used are production-grade:</strong></p>
<ul>
<li><p><strong>MLflow</strong> powers ML platforms at companies like Microsoft, Facebook, and Databricks</p>
</li>
<li><p><strong>Feast</strong> is used by companies like Gojek, Shopify, and Robinhood</p>
</li>
<li><p><strong>FastAPI</strong> is one of the fastest Python web frameworks</p>
</li>
<li><p><strong>Great Expectations</strong> is used at companies like GitHub and Shopify</p>
</li>
<li><p><strong>Evidently</strong> is used for monitoring ML in production at scale</p>
</li>
</ul>
<p><strong>The principles apply at any scale:</strong></p>
<ul>
<li><p>Always track experiments</p>
</li>
<li><p>Always version models</p>
</li>
<li><p>Always validate data</p>
</li>
<li><p>Always monitor for drift</p>
</li>
<li><p>Always containerize for consistency</p>
</li>
<li><p>Always automate testing</p>
</li>
</ul>
<h3 id="heading-next-steps-you-can-try"><strong>Next Steps You Can Try</strong></h3>
<ol>
<li><p><strong>Deploy to the cloud</strong> - Push your Docker container to AWS ECS, Google Cloud Run, or Azure Container Instances</p>
</li>
<li><p><strong>Add model explainability</strong> - Use SHAP or LIME to explain individual predictions</p>
</li>
<li><p><strong>Implement A/B testing</strong> - Serve multiple models and compare performance</p>
</li>
<li><p><strong>Add feature importance monitoring</strong> - Track how feature importance changes over time</p>
</li>
<li><p><strong>Set up real-time alerting</strong> - Connect Evidently to Slack or PagerDuty</p>
</li>
<li><p><strong>Implement continuous training</strong> - Automatically retrain when drift is detected</p>
</li>
<li><p><strong>Add bias and fairness monitoring</strong> - Ensure your model treats all groups fairly</p>
</li>
</ol>
<p>Remember that productionizing ML is an <strong>iterative process</strong>. There's always another layer of robustness to add, another edge case to handle, another metric to track. But with the foundation you've built here, you're well on your way to taking models from promising notebook experiments to deployed, monitored, and maintainable production applications.</p>
<p>Happy building, and may your models be accurate and your pipelines resilient!</p>
<h2 id="heading-get-the-complete-code">Get the Complete Code</h2>
<p>The entire project from this handbook is available as a public GitHub repository:</p>
<p><strong>🔗</strong> <a href="https://github.com/sandeepmb/freecodecamp-local-ml-platform"><strong>github.com/sandeepmb/freecodecamp-local-ml-platform</strong></a></p>
<p>The repository includes:</p>
<ul>
<li><p>All source code (<code>src/</code> directory)</p>
</li>
<li><p>Test files (<code>tests/</code> directory)</p>
</li>
<li><p>Feast feature definitions (<code>feature_repo/</code>)</p>
</li>
<li><p>Docker and CI/CD configuration</p>
</li>
<li><p>Ready-to-run scripts</p>
</li>
</ul>
<p><strong>Quick Start:</strong></p>
<pre><code class="language-bash">git clone https://github.com/sandeepmb/freecodecamp-local-ml-platform.git
cd freecodecamp-local-ml-platform
python -m venv venv &amp;&amp; source venv/bin/activate
pip install -r requirements.txt
python src/generate_data.py
python src/train_naive.py
</code></pre>
<hr>
<h2 id="heading-references"><strong>References</strong></h2>
<ul>
<li><p><a href="https://mlflow.org/docs/latest/">MLflow Documentation</a> - Experiment tracking and model registry</p>
</li>
<li><p><a href="https://docs.feast.dev/">Feast Documentation</a> - Feature store</p>
</li>
<li><p><a href="https://docs.feast.dev/getting-started/quickstart">Feast Quickstart</a> - Getting started with Feast</p>
</li>
<li><p><a href="https://fastapi.tiangolo.com/">FastAPI Documentation</a> - Modern Python web framework</p>
</li>
<li><p><a href="https://greatexpectations.io/">Great Expectations</a> - Data validation</p>
</li>
<li><p><a href="https://docs.evidentlyai.com/">Evidently AI Documentation</a> - ML monitoring</p>
</li>
<li><p><a href="https://jfrog.com/learn/mlops/cicd-for-machine-learning/">CI/CD for Machine Learning (JFrog)</a> - CI/CD best practices</p>
</li>
<li><p><a href="https://www.qwak.com/post/training-serving-skew-in-machine-learning">Training-Serving Skew Explained</a> - Understanding skew</p>
</li>
<li><p><a href="https://docs.docker.com/">Docker Documentation</a> - Containerization</p>
</li>
<li><p><a href="https://docs.github.com/en/actions">GitHub Actions Documentation</a> - CI/CD automation</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Ship a Production-Ready RAG App with FAISS (Guardrails, Evals, and Fallbacks) ]]>
                </title>
                <description>
                    <![CDATA[ Most LLM applications look great in a high-fidelity demo. Then they hit the hands of real users and start failing in very predictable yet damaging ways. They answer questions they should not, they bre ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-rag-app-faiss-fastapi/</link>
                <guid isPermaLink="false">69b841572ad6ae5184d54317</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ FastAPI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ RAG  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ vector database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ faiss ]]>
                    </category>
                
                    <category>
                        <![CDATA[ api ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Chidozie Managwu ]]>
                </dc:creator>
                <pubDate>Mon, 16 Mar 2026 17:43:51 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/f9da3ad9-e285-4ce1-acb7-ad119579971c.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Most LLM applications look great in a high-fidelity demo. Then they hit the hands of real users and start failing in very predictable yet damaging ways.</p>
<p>They answer questions they should not, they break when document retrieval is weak, they time out due to network latency, and nobody can tell exactly what happened because there are no logs and no tests.</p>
<p>In this tutorial, you’ll build a beginner-friendly Retrieval Augmented Generation (RAG) application designed to survive production realities. This isn’t just a script that calls an API. It’s a system featuring a FastAPI backend, a persisted FAISS vector store, and essential safety guardrails (including a retrieval gate and fallbacks).</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ol>
<li><p><a href="#heading-why-rag-alone-does-not-equal-production-ready">Why RAG Alone Does Not Equal Production-Ready</a></p>
</li>
<li><p><a href="#heading-the-architecture-you-are-building">The Architecture You Are Building</a></p>
</li>
<li><p><a href="#heading-project-setup-and-structure">Project Setup and Structure</a></p>
</li>
<li><p><a href="#heading-how-to-build-the-rag-layer-with-faiss">How to Build the RAG Layer with FAISS</a></p>
</li>
<li><p><a href="#heading-how-to-add-the-llm-call-with-structured-output">How to Add the LLM Call with Structured Output</a></p>
</li>
<li><p><a href="#heading-how-to-add-guardrails-retrieval-gate-and-fallbacks">How to Add Guardrails: Retrieval Gate and Fallbacks</a></p>
</li>
<li><p><a href="#heading-fastapi-app-creating-the-answer-endpoint">FastAPI App: Creating the /answer Endpoint</a></p>
</li>
<li><p><a href="#heading-how-to-add-beginner-friendly-evals">How to Add Beginner-Friendly Evals</a></p>
</li>
<li><p><a href="#heading-what-to-improve-next-realistic-upgrades">What to Improve Next: Realistic Upgrades</a></p>
</li>
</ol>
<h2 id="heading-why-rag-alone-does-not-equal-production-ready">Why RAG Alone Does Not Equal Production-Ready</h2>
<p>Retrieval Augmented Generation (RAG) is often hailed as the hallucination killer. By grounding the model in retrieved text, we provide it with the facts it needs to be accurate. But simply connecting a vector database to an LLM isn’t enough for a production environment.</p>
<p>Production issues usually arise from the silent failures in the system surrounding the model:</p>
<ul>
<li><p><strong>Weak retrieval:</strong> If the app retrieves irrelevant chunks of text, the model tries to bridge the gap by inventing an answer anyway. Without a designated “I do not know” path, the model is essentially forced to hallucinate.</p>
</li>
<li><p><strong>Lack of visibility:</strong> Without structured outputs and basic logging, you can’t tell if bad retrieval, a confusing prompt, or a model update caused a wrong answer.</p>
</li>
<li><p><strong>Fragility:</strong> A simple API timeout or malformed provider response becomes a user-facing outage if you don’t implement fallbacks.</p>
</li>
<li><p><strong>No regression testing:</strong> In traditional software, we have unit tests. In AI, we need evals. Without them, a small tweak to your prompt might fix one issue but break ten others without you realising it.</p>
</li>
</ul>
<p>We’ll solve each of these issues systematically in this guide.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>This tutorial is beginner-friendly, but it assumes you have a few basics in place so you can focus on building a robust RAG system instead of getting stuck on setup issues.</p>
<h3 id="heading-knowledge">Knowledge</h3>
<p>You should be comfortable with:</p>
<ul>
<li><p><strong>Python fundamentals</strong> (functions, modules, virtual environments)</p>
</li>
<li><p><strong>Basic HTTP + JSON</strong> (requests, response payloads)</p>
</li>
<li><p><strong>APIs with FastAPI</strong> (what an endpoint is and how to run a server)</p>
</li>
<li><p><strong>High-level LLM concepts</strong> (prompting, temperature, structured outputs)</p>
</li>
</ul>
<h3 id="heading-tools-accounts">Tools + Accounts</h3>
<p>You’ll need:</p>
<ul>
<li><p><strong>Python 3.10+</strong></p>
</li>
<li><p>A working <strong>OpenAI-compatible API key</strong> (OpenAI or any provider that supports the same request/response shape)</p>
</li>
<li><p>A local environment where you can run a FastAPI app (Mac/Linux/Windows)</p>
</li>
</ul>
<h3 id="heading-what-this-tutorial-covers-and-what-it-doesnt">What This Tutorial Covers (and What It Doesn’t)</h3>
<p>We’ll build a production-minded baseline:</p>
<ul>
<li><p>A <strong>FAISS-backed retriever</strong> with a persisted index + metadata</p>
</li>
<li><p>A <strong>retrieval gate</strong> to prevent “forced hallucination”</p>
</li>
<li><p><strong>Structured JSON outputs</strong> so your backend is stable</p>
</li>
<li><p><strong>Fallback behavior</strong> for timeouts and provider errors</p>
</li>
<li><p>A small <strong>eval harness</strong> to prevent regressions</p>
</li>
</ul>
<p>We won’t implement advanced upgrades such as rerankers, semantic chunking, auth, or background jobs. Those appear only as a roadmap at the end.</p>
<h2 id="heading-the-architecture-you-are-building">The Architecture You Are Building</h2>
<p>The flow of our application follows a disciplined path so every answer is grounded in evidence:</p>
<ol>
<li><p><strong>User query:</strong> The user submits a question via a FastAPI endpoint.</p>
</li>
<li><p><strong>Retrieval:</strong> The system embeds the question and retrieves the top-k most similar document chunks.</p>
</li>
<li><p><strong>The retrieval gate:</strong> We evaluate the similarity score. If the context is not relevant enough, we stop immediately and refuse the query.</p>
</li>
<li><p><strong>Augmentation and generation:</strong> If the gate passes, we send a context-augmented prompt to the LLM.</p>
</li>
<li><p><strong>Structured response:</strong> The model returns a JSON object containing the answer, sources used, and a confidence level.</p>
</li>
</ol>
<h2 id="heading-project-setup-and-structure">Project Setup and Structure</h2>
<p>To keep things organized and maintainable, we’ll use a modular structure. This allows you to swap out your LLM provider or your vector database without rewriting your entire core application.</p>
<h3 id="heading-project-structure">Project Structure</h3>
<pre><code class="language-python">.
├── app.py              # FastAPI entry point and API logic
├── rag.py              # FAISS index, persistence, and document retrieval
├── llm.py              # LLM API interface and JSON parsing
├── prompts.py          # Centralized prompt templates
├── data/               # Source .txt documents
├── index/              # Persisted FAISS index and metadata
└── evals/              # Evaluation dataset and runner script
    ├── eval_set.json
    └── run_evals.py
</code></pre>
<h3 id="heading-install-dependencies">Install Dependencies</h3>
<p>First, create a virtual environment to isolate your project:</p>
<pre><code class="language-python">python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install fastapi uvicorn faiss-cpu numpy pydantic requests python-dotenv
</code></pre>
<h3 id="heading-configure-the-environment">Configure the Environment</h3>
<p>Create a <code>.env</code> file in the root directory. We are targeting OpenAI-compatible providers:</p>
<pre><code class="language-python">OPENAI_API_KEY=your_actual_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o-mini
</code></pre>
<p>Important note on compatibility: The code below assumes an OpenAI-style API. If you use a provider that is not compatible, you must change the URL, headers (for example <code>X-API-Key</code>), and the way you extract embeddings and final message content in <code>embed_texts()</code> and <code>call_llm()</code>.</p>
<h2 id="heading-how-to-build-the-rag-layer-with-faiss">How to Build the RAG Layer with FAISS</h2>
<p>In <code>rag.py</code>, we handle the “Retriever” part of RAG. This involves turning raw text into mathematical vectors that the computer can compare.</p>
<h3 id="heading-what-is-faiss-and-what-does-it-do">What is FAISS (and What Does It Do)?</h3>
<p><strong>FAISS</strong> (Facebook AI Similarity Search) is a fast library for vector similarity search. In a RAG system, each chunk of text becomes an embedding vector (a list of floats). FAISS stores those vectors in an index so you can quickly ask:</p>
<blockquote>
<p>“Given this question embedding, which document chunks are closest to it?”</p>
</blockquote>
<p>In this tutorial, we use <code>IndexFlatIP</code> (an exact inner-product index) and normalise vectors with <code>faiss.normalize_L2(...)</code>. With normalised vectors, the inner product equals <strong>cosine similarity</strong>, giving us a stable score we can use for a retrieval gate.</p>
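<p>To see why normalisation matters, here is a minimal NumPy-only sketch (it does not call FAISS itself). The helper names <code>l2_normalize</code> and <code>cosine_similarity</code> and the toy vectors are illustrative assumptions, not part of the tutorial’s code. It checks that the inner product of unit-length vectors matches textbook cosine similarity:</p>

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length (same effect as faiss.normalize_L2, but not in place)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def cosine_similarity(a, b):
    """Textbook cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (made-up numbers, not real model output)
docs = np.array([[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]], dtype="float32")
query = np.array([[2.0, 4.0, 4.0]], dtype="float32")

# Inner product of the normalised vectors: the quantity an inner-product index scores by
scores = l2_normalize(query) @ l2_normalize(docs).T

for i, doc in enumerate(docs):
    print(i, round(float(scores[0, i]), 4), round(cosine_similarity(query[0], doc), 4))
```

<p>Both numbers printed for each document should agree, which is why a single threshold on the FAISS score can be read as a cosine-similarity cutoff.</p>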
<h3 id="heading-chunking-strategy-with-overlap">Chunking Strategy With Overlap</h3>
<p>We’ll use chunking with overlap. If we split a document at exactly 1,000 characters, we might cut a sentence in half and lose its meaning. With an overlap of, for example, 200 characters, the end of one chunk and the beginning of the next share context, so a sentence cut at one boundary is more likely to appear intact in the neighbouring chunk.</p>
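<p>A small worked example makes the overlap visible. This sketch repeats the sliding-window logic of the <code>chunk_text()</code> function from <code>rag.py</code>, but runs it on a 26-character string with a tiny <code>size</code> and <code>overlap</code> (values chosen only for illustration) so you can read the result at a glance:</p>

```python
def chunk_text(text, size=1000, overlap=200):
    """Sliding window: each chunk starts (size - overlap) characters after the previous one."""
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        chunk = text[i : i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, size=10, overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

<p>Each chunk starts with the last three characters of the previous one. Note that the final chunk can be shorter than <code>size</code>, and that overlap raises storage and embedding cost, because the shared characters are embedded twice.</p>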
<h3 id="heading-implementation-of-ragpy">Implementation of <code>rag.py</code></h3>
<pre><code class="language-python">import os
import faiss
import numpy as np
import requests
import json
from typing import List, Dict
from dotenv import load_dotenv

load_dotenv()

INDEX_PATH = "index/faiss.index"
META_PATH = "index/meta.json"

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -&gt; List[str]:
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        chunk = text[i : i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

def embed_texts(texts: List[str]) -&gt; np.ndarray:
    # Note: If your provider is not OpenAI-compatible, change this URL and headers
    url = f"{os.getenv('OPENAI_BASE_URL')}/embeddings"
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
    payload = {"input": texts, "model": "text-embedding-3-small"}

    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    # If your provider uses a different response format, change the line below
    vectors = np.array([item["embedding"] for item in resp.json()["data"]], dtype="float32")
    return vectors

def build_index() -&gt; None:
    all_chunks: List[str] = []
    metadata: List[Dict] = []

    if not os.path.exists("data"):
        os.makedirs("data")
        return

    for file in os.listdir("data"):
        if not file.endswith(".txt"):
            continue

        with open(f"data/{file}", "r", encoding="utf-8") as f:
            text = f.read()

        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        for c in chunks:
            metadata.append({"source": file, "text": c})

    if not all_chunks:
        return

    embeddings = embed_texts(all_chunks)
    faiss.normalize_L2(embeddings)

    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    os.makedirs("index", exist_ok=True)
    faiss.write_index(index, INDEX_PATH)

    with open(META_PATH, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False)

def load_index():
    if not (os.path.exists(INDEX_PATH) and os.path.exists(META_PATH)):
        raise FileNotFoundError(
            "FAISS index not found. Add .txt files to data/ and run build_index()."
        )

    index = faiss.read_index(INDEX_PATH)
    with open(META_PATH, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    return index, metadata

def retrieve(query: str, k: int = 5) -&gt; List[Dict]:
    index, metadata = load_index()

    q_emb = embed_texts([query])
    faiss.normalize_L2(q_emb)

    scores, ids = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx == -1:
            continue
        m = metadata[idx]
        results.append(
            {"score": float(score), "source": m["source"], "text": m["text"], "id": int(idx)}
        )
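    return results

# Convenience entry point: build (or rebuild) the index with `python rag.py`
if __name__ == "__main__":
    build_index()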
</code></pre>
<h2 id="heading-how-to-add-the-llm-call-with-structured-output">How to Add the LLM Call with Structured Output</h2>
<p>A major failure point in AI apps is the “chatty” nature of LLMs. If your backend expects a list of sources but the LLM returns conversational filler, your code will crash.</p>
<p>We solve this with <strong>structured output</strong>: instruct the model to return a strict JSON object, then parse it safely.</p>
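<p>“Parse it safely” can go a step beyond <code>json.loads</code>. The sketch below is one possible stricter parser, using only the standard library. The function name <code>parse_structured</code> and its exact rules are assumptions for illustration. It refuses whenever the output is not a JSON object with the expected types:</p>

```python
import json

ALLOWED_CONFIDENCE = ("low", "medium", "high")

def parse_structured(content):
    """Parse model output into the four-key shape, refusing on anything malformed."""
    fallback = {"answer": "", "refusal": True, "confidence": "low", "sources": []}
    try:
        data = json.loads(content)
    except (TypeError, ValueError):  # json.JSONDecodeError is a subclass of ValueError
        return fallback
    if not isinstance(data, dict):
        return fallback

    answer = data.get("answer")
    sources = data.get("sources", [])
    confidence = data.get("confidence", "medium")
    if not isinstance(answer, str) or not isinstance(sources, list):
        return fallback
    if confidence not in ALLOWED_CONFIDENCE:
        confidence = "low"
    return {
        "answer": answer,
        "refusal": bool(data.get("refusal", False)),
        "confidence": confidence,
        "sources": [str(s) for s in sources],
    }

print(parse_structured('{"answer": "42", "sources": ["a.txt"], "confidence": "high"}'))
print(parse_structured("Sure! Here is the answer: 42"))
```

<p>The first call returns the parsed answer. The second returns the refusal shape, because conversational filler is not valid JSON. Dropping unexpected keys also keeps the response shape predictable for whatever consumes it.</p>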
<h3 id="heading-implementation-of-llmpy">Implementation of <code>llm.py</code></h3>
<pre><code class="language-python">import json
import requests
import os
from typing import Dict, Any

def call_llm(system_prompt: str, user_prompt: str) -&gt; Dict[str, Any]:
    # Note: Change URL/Headers if using a non-OpenAI compatible provider
    url = f"{os.getenv('OPENAI_BASE_URL')}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    }

    payload = {
        "model": os.getenv("OPENAI_MODEL"),
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
        "temperature": 0,
    }

    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        content = resp.json()["choices"][0]["message"]["content"]

        parsed = json.loads(content)
        parsed.setdefault("answer", "")
        parsed.setdefault("refusal", False)
        parsed.setdefault("confidence", "medium")
        parsed.setdefault("sources", [])
        return parsed

    except (requests.Timeout, requests.ConnectionError):
        return {
            "answer": "The system is temporarily unavailable (network issue). Please try again.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "network_error",
        }
    except Exception:
        return {
            "answer": "A system error occurred while generating the answer.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "unknown_error",
        }
</code></pre>
<h2 id="heading-how-to-add-guardrails-retrieval-gate-and-fallbacks">How to Add Guardrails: Retrieval Gate and Fallbacks</h2>
<p>Guardrails are interceptors. They sit between the user and the model to prevent predictable failures.</p>
<h3 id="heading-the-retrieval-gate-how-it-works-and-how-to-add-it">The Retrieval Gate: How It Works and How to Add It</h3>
<p>In a standard RAG pipeline, the system always calls the LLM. If the user asks an irrelevant question, the retriever will still return the “closest” (but wrong) chunks.</p>
<p>The solution is the retrieval gate:</p>
<ol>
<li><p>Retrieve top-k chunks and get the <strong>top similarity score</strong></p>
</li>
<li><p>If the score is below a threshold (for example <code>0.30</code>), refuse immediately</p>
</li>
<li><p>Only call the LLM when retrieval is strong enough to ground the answer</p>
</li>
</ol>
<p>A threshold of <code>0.30</code> is a reasonable starting point when using cosine similarity on normalised vectors, but the right value depends on your embedding model and documents, so you should tune it using evals (covered later in this tutorial).</p>
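<p>One simple way to tune the threshold is to log the top score for a set of questions you have labelled as answerable or not, then compare candidate thresholds. The sketch below assumes such a labelled list exists. The scores are invented for illustration and <code>gate_accuracy</code> is a hypothetical helper, not part of the tutorial’s code:</p>

```python
def gate_accuracy(samples, threshold):
    """Fraction of samples where 'score >= threshold' matches the should_answer label."""
    correct = sum(
        1 for score, should_answer in samples if (score >= threshold) == should_answer
    )
    return correct / len(samples)

# Hypothetical (top_score, should_answer) pairs; replace with scores logged from your own app
samples = [
    (0.62, True), (0.48, True), (0.35, True), (0.31, True),
    (0.44, False), (0.28, False), (0.21, False), (0.12, False),
]

for threshold in (0.20, 0.30, 0.40):
    print(threshold, gate_accuracy(samples, threshold))
```

<p>With real data you would also look at the two error types separately: a threshold that is too low lets weak retrieval through to the LLM, while one that is too high refuses questions your documents can answer.</p>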
<h3 id="heading-fallbacks-and-why-they-matter">Fallbacks and Why They Matter</h3>
<p>Fallbacks ensure that if an API fails or times out, the user gets a helpful message instead of a crash. They also keep your API response shape consistent, which prevents frontend errors and makes logging meaningful.</p>
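<p>In this tutorial, fallbacks are implemented inside <code>call_llm()</code> so your FastAPI layer stays simple. Note that <code>retrieve()</code> also calls the embeddings endpoint, and that call is not wrapped, so an embedding failure will still surface as a server error.</p>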
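<p>If you want to retry before giving up, a generic wrapper is one option. The following is a minimal sketch, not the tutorial’s implementation: <code>call_with_fallback</code>, <code>FALLBACK</code>, and the simulated <code>flaky_provider</code> are all hypothetical names, and the wrapper assumes the wrapped function raises on failure instead of catching its own errors:</p>

```python
import time

FALLBACK = {
    "answer": "The system is temporarily unavailable. Please try again.",
    "refusal": True,
    "confidence": "low",
    "sources": [],
}

def call_with_fallback(fn, retries=2, base_delay=0.5, sleep=time.sleep):
    """Call fn(); retry with exponential backoff, then return a fixed-shape fallback."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: any failure keeps the response shape
            if attempt == retries:
                return {**FALLBACK, "error_type": type(exc).__name__}
            sleep(base_delay * (2 ** attempt))

# Simulated provider that fails twice, then succeeds (a stand-in for a real LLM call)
attempts = {"count": 0}

def flaky_provider():
    attempts["count"] += 1
    if attempts["count"] != 3:
        raise TimeoutError("simulated timeout")
    return {"answer": "ok", "refusal": False, "confidence": "high", "sources": ["doc.txt"]}

# Pass a no-op sleep so the demo does not actually wait between attempts
result = call_with_fallback(flaky_provider, sleep=lambda seconds: None)
print(result["answer"], attempts["count"])
```

<p>Retries add latency, so keep <code>retries</code> small on a user-facing endpoint and make sure the total wait stays below your client’s timeout.</p>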
<h2 id="heading-fastapi-app-creating-the-answer-endpoint">FastAPI App: Creating the /answer Endpoint</h2>
<p>The <code>app.py</code> file is the conductor. It ties retrieval, guardrails, prompting, and generation together.</p>
<h3 id="heading-implementation-of-apppy">Implementation of <code>app.py</code></h3>
<pre><code class="language-python">from fastapi import FastAPI
from pydantic import BaseModel
from rag import retrieve
from llm import call_llm
import prompts
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")

app = FastAPI(title="Production-Ready RAG")

class QueryRequest(BaseModel):
    question: str

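# Plain "def" rather than "async def": retrieve() and call_llm() make blocking
# requests calls, so FastAPI runs this handler in its threadpool instead of
# stalling the event loop.
@app.post("/answer")
def get_answer(req: QueryRequest):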
    start_time = time.time()
    question = (req.question or "").strip()

    if not question:
        return {
            "answer": "Please provide a non-empty question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
        }

    # 1) Retrieval
    results = retrieve(question, k=5)
    top_score = results[0]["score"] if results else 0.0

    logger.info("query=%r top_score=%.3f num_results=%d", question, top_score, len(results))

    # 2) Retrieval Gate (Guardrail)
    if top_score &lt; 0.30:
        return {
            "answer": "I do not have documents to answer that question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
            "retrieval": {"top_score": top_score, "k": 5},
        }

    # 3) Augment
    context_text = "\n\n".join([f"Source {r['source']}: {r['text']}" for r in results])
    user_prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

    # 4) Generation with Fallback
    response = call_llm(prompts.SYSTEM_PROMPT, user_prompt)

    # 5) Attach debug metadata
    response["latency_sec"] = round(time.time() - start_time, 2)
    response["retrieval"] = {"top_score": top_score, "k": 5}
    return response
</code></pre>
<h2 id="heading-centralized-prompt-template-promptspy">Centralized Prompt Template: prompts.py</h2>
<p>A small but important habit: keep prompts centralised so they’re versionable and easy to evaluate.</p>
<h3 id="heading-example-promptspy">Example <code>prompts.py</code></h3>
<pre><code class="language-python">SYSTEM_PROMPT = """You are a RAG assistant. Use ONLY the provided Context to answer.
If the context does not contain the answer, respond with refusal=true.

Return a valid JSON object with exactly these keys:
- answer: string
- refusal: boolean
- confidence: "low" | "medium" | "high"
- sources: array of strings (source filenames you used)

Do not include any extra keys. Do not include markdown. Do not include commentary."""
</code></pre>
<h2 id="heading-how-to-add-beginner-friendly-evals">How to Add Beginner-Friendly Evals</h2>
<p>In AI systems, outputs are probabilistic, which makes testing harder than in traditional software. Evals (evaluations) are a set of “golden questions” and “expected behaviours” you run repeatedly to detect regressions.</p>
<p>Instead of “does it output exactly this string,” you test:</p>
<ul>
<li><p>Should the app <strong>refuse</strong> when the retrieval is weak?</p>
</li>
<li><p>When it answers, does it include <strong>sources</strong>?</p>
</li>
<li><p>Is the behaviour stable across prompt tweaks and model changes?</p>
</li>
</ul>
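<p>These behavioural checks can be written as a small pure function that is independent of HTTP. The sketch below is one possible shape. <code>check_behaviour</code> is a hypothetical helper, and it is deliberately a little stricter than a type check, because it also requires at least one source on any non-refusal answer:</p>

```python
def check_behaviour(out, expect_refusal):
    """Return a list of human-readable failures; an empty list means the case passed."""
    failures = []
    got_refusal = bool(out.get("refusal", False))
    if got_refusal != expect_refusal:
        failures.append(f"expected refusal={expect_refusal}, got refusal={got_refusal}")
    if not got_refusal:
        sources = out.get("sources")
        if not isinstance(sources, list) or not sources:
            failures.append("answered without citing any sources")
        if not str(out.get("answer", "")).strip():
            failures.append("answered with an empty answer")
    return failures

good = {"answer": "A gate blocks weak retrieval.", "refusal": False, "sources": ["gate.txt"]}
bad = {"answer": "Paris", "refusal": False, "sources": []}
print(check_behaviour(good, expect_refusal=False))
print(check_behaviour(bad, expect_refusal=True))
```

<p>Returning reasons instead of a single boolean makes a failing eval easier to diagnose, since you can see whether the gate, the citation, or the answer text was the problem.</p>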
<h3 id="heading-step-1-create-evalsevalsetjson">Step 1: Create <code>evals/eval_set.json</code></h3>
<p>This should contain both positive and negative cases.</p>
<pre><code class="language-json">[
  {
    "id": "in_scope_01",
    "question": "What is a retrieval gate and why is it important?",
    "expect_refusal": false,
    "notes": "Should explain gating and relate it to hallucination prevention."
  },
  {
    "id": "out_of_scope_01",
    "question": "What is the capital of France?",
    "expect_refusal": true,
    "notes": "If the knowledge base only includes our docs, the app should refuse."
  },
  {
    "id": "edge_01",
    "question": "",
    "expect_refusal": true,
    "notes": "Empty input should not call the LLM."
  }
]
</code></pre>
<h3 id="heading-step-2-create-evalsrunevalspy">Step 2: Create <code>evals/run_evals.py</code></h3>
<p>This runner calls your API endpoint (end-to-end) and checks expected behaviours.</p>
<pre><code class="language-python">import json
import requests

API_URL = "http://127.0.0.1:8000/answer"

def run():
    with open("evals/eval_set.json", "r", encoding="utf-8") as f:
        cases = json.load(f)

    passed = 0
    failed = 0

    for case in cases:
        resp = requests.post(API_URL, json={"question": case["question"]}, timeout=60)
        resp.raise_for_status()
        out = resp.json()

        got_refusal = bool(out.get("refusal", False))
        expect_refusal = bool(case["expect_refusal"])

        ok = (got_refusal == expect_refusal)

        # Beginner-friendly: if it answers, sources should exist and be a list
        if not got_refusal:
            ok = ok and isinstance(out.get("sources"), list)

        if ok:
            passed += 1
            print(f"PASS {case['id']}")
        else:
            failed += 1
            print(f"FAIL {case['id']} expected_refusal={expect_refusal} got_refusal={got_refusal}")
            print("Output:", json.dumps(out, indent=2))

    print(f"\nDone. Passed={passed} Failed={failed}")
    if failed:
        raise SystemExit(1)

if __name__ == "__main__":
    run()
</code></pre>
<h3 id="heading-how-to-use-evals-in-practice">How to Use Evals in Practice</h3>
<p>Run your server:</p>
<pre><code class="language-python">uvicorn app:app --reload
</code></pre>
<p>In another terminal, run evals:</p>
<pre><code class="language-python">python evals/run_evals.py
</code></pre>
<p>If an eval fails, you have a concrete signal that something changed in retrieval, gating, prompting, or provider behaviour.</p>
<h2 id="heading-what-to-improve-next-realistic-upgrades">What to Improve Next: Realistic Upgrades</h2>
<p>Building a reliable RAG app is iterative. Here are realistic next steps:</p>
<ul>
<li><p><strong>Semantic chunking:</strong> Break text based on meaning instead of character count.</p>
</li>
<li><p><strong>Reranking:</strong> Use a cross-encoder reranker to reorder the top-k chunks for higher precision.</p>
</li>
<li><p><strong>Metadata filtering:</strong> Filter results by category, date, or department to reduce false positives.</p>
</li>
<li><p><strong>Better citations:</strong> Store chunk IDs and show exactly which chunk(s) the answer came from.</p>
</li>
<li><p><strong>Observability:</strong> Add request IDs, structured logs, and traces so “what happened?” is answerable.</p>
</li>
<li><p><strong>Async + background indexing:</strong> Move index building to a background job and keep the API responsive.</p>
</li>
</ul>
<h2 id="heading-final-thoughts-production-ready-is-a-set-of-habits">Final Thoughts: Production-Ready Is a Set of Habits</h2>
<p>Building an AI application that survives in the real world is about building a system that is predictable, measurable, and safe.</p>
<ul>
<li><p><strong>Retrieval quality is measurable:</strong> Use similarity scores to gate your LLM.</p>
</li>
<li><p><strong>Refusal is a feature:</strong> It is better to say “I do not know” than to lie.</p>
</li>
<li><p><strong>Fallbacks are mandatory:</strong> Design for the moment the API goes down.</p>
</li>
<li><p><strong>Evals prevent regressions:</strong> Never deploy a change without running your tests.</p>
</li>
</ul>
<h2 id="heading-about-me">About Me</h2>
<p>I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead AI Titans Network, a community for developers learning how to ship AI products.</p>
<p>My work has been recognized with the Global Tech Hero award and featured on platforms like HackerNoon.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Containerize Your MLOps Pipeline from Training to Serving ]]>
                </title>
                <description>
                    <![CDATA[ Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deplo ]]>
                </description>
                <link>https://www.freecodecamp.org/news/containerize-mlops-pipeline-from-training-to-serving/</link>
                <guid isPermaLink="false">69b33f5993256dfc5313bee2</guid>
                
                    <category>
                        <![CDATA[ Docker ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ production ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ NVIDIA ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Balajee Asish Brahmandam ]]>
                </dc:creator>
                <pubDate>Thu, 12 Mar 2026 22:34:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/156eaca3-8884-4f57-9010-9766278dbf5a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Last year, our ML team shipped a fraud detection model that worked perfectly in a Jupyter notebook. Precision was excellent. Recall numbers looked great. Everyone was excited – until we tried to deploy it.</p>
<p>The model depended on a specific version of scikit-learn that conflicted with the production Python environment. The feature engineering pipeline required a NumPy build compiled against OpenBLAS, but the deployment servers ran MKL. A preprocessing step used a system library that existed on the data scientist's MacBook but not on the Ubuntu deployment target.</p>
<p>Three weeks of debugging later, we had the model running in production. Three weeks. For a model that was technically finished.</p>
<p>That experience is what pushed me to containerize our entire MLOps pipeline end to end. Not because Docker is trendy in ML circles, but because the alternative (hand-tuning environments, writing installation scripts that break on the next OS update, praying that what worked in training works in production) was costing us more time than the actual model development.</p>
<p>In this tutorial, you'll learn how to structure training and serving containers with multi-stage builds, how to set up experiment tracking with MLflow, how to version your training data with DVC, how to configure GPU passthrough for training, and how to tie it all together into a single Compose file with profiles. This is based on a year of running containerized ML pipelines across three teams.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>Docker Engine 24+ or Docker Desktop 4.20+ with Compose v2.22.0+</p>
</li>
<li><p>For GPU training, you'll need the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">NVIDIA Container Toolkit</a> installed on the host and a compatible GPU driver. Run <code>nvidia-smi</code> to verify your GPU is visible, and <code>docker compose version</code> to check your Compose version.</p>
</li>
<li><p>Familiarity with Python, basic Docker concepts, and ML workflows (training, evaluation, serving) is assumed.</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-the-mlops-lifecycle-where-containers-fit">The MLOps Lifecycle: Where Containers Fit</a></p>
</li>
<li><p><a href="#heading-how-to-build-the-training-container">How to Build the Training Container</a></p>
<ul>
<li><p><a href="#heading-separate-training-from-serving-requirements">Separate Training from Serving Requirements</a></p>
</li>
<li><p><a href="#heading-cuda-and-driver-compatibility">CUDA and Driver Compatibility</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-set-up-experiment-tracking-with-mlflow">How to Set Up Experiment Tracking with MLflow</a></p>
</li>
<li><p><a href="#heading-how-to-version-training-data-with-dvc">How to Version Training Data with DVC</a></p>
</li>
<li><p><a href="#heading-how-to-build-the-serving-container">How to Build the Serving Container</a></p>
<ul>
<li><a href="#heading-decouple-models-from-containers">Decouple Models from Containers</a></li>
</ul>
</li>
<li><p><a href="#heading-how-to-configure-gpu-passthrough-for-training">How to Configure GPU Passthrough for Training</a></p>
</li>
<li><p><a href="#heading-how-to-tie-it-all-together-with-compose-profiles">How to Tie It All Together with Compose Profiles</a></p>
</li>
<li><p><a href="#heading-reproducibility-the-whole-point">Reproducibility: The Whole Point</a></p>
</li>
<li><p><a href="#heading-where-this-breaks-down">Where This Breaks Down</a></p>
</li>
</ul>
<h2 id="heading-the-mlops-lifecycle-where-containers-fit">The MLOps Lifecycle: Where Containers Fit</h2>
<p>If you've built a machine learning model, you know the process has a lot of stages. But if you're coming from a software engineering background (or you're a data scientist who mostly works in notebooks), it helps to see the full picture of what an MLOps pipeline looks like and where Docker fits into each stage.</p>
<p>An MLOps pipeline is a chain of interdependent stages:</p>
<ol>
<li><p><strong>Data ingestion and validation.</strong> Raw data comes in from databases, APIs, or file systems. You clean it, validate it, and store it in a format your model can use.</p>
</li>
<li><p><strong>Feature engineering.</strong> You transform raw data into features the model can learn from. This might be as simple as normalizing numbers or as complex as generating embeddings.</p>
</li>
<li><p><strong>Experiment tracking.</strong> You log every training run's configuration (hyperparameters, data version, code version) and results (accuracy, loss, evaluation metrics) so you can compare experiments and reproduce the best ones.</p>
</li>
<li><p><strong>Model training.</strong> The model learns from your features. This is the compute-heavy part that often needs GPUs.</p>
</li>
<li><p><strong>Evaluation.</strong> You measure the trained model against test data to see if it's good enough to deploy.</p>
</li>
<li><p><strong>Packaging and serving.</strong> You wrap the trained model in an API so other systems can send it data and get predictions back.</p>
</li>
<li><p><strong>Monitoring.</strong> You watch the model in production to catch problems like data drift (when the real-world data starts looking different from the training data) or performance degradation.</p>
</li>
</ol>
<p>Each stage has different computational needs. Training might require GPUs and terabytes of memory. Serving needs low latency and horizontal scaling. Feature engineering might need distributed processing tools like Spark or Dask.</p>
<p>The thing that changed our approach: you don't containerize the entire pipeline as one monolithic image. You containerize each stage independently, with shared interfaces between them.</p>
<p>Think of it like microservices applied to ML infrastructure. Each container does one thing, does it well, and communicates with the others through well-defined interfaces: model artifacts stored in a registry, metrics logged to MLflow, data versioned in object storage.</p>
<p>This gives you the flexibility to:</p>
<ul>
<li><p>Scale training on expensive GPU instances while running serving on cheaper CPU nodes</p>
</li>
<li><p>Update your feature engineering code without rebuilding your training environment</p>
</li>
<li><p>Version each stage independently in your container registry</p>
</li>
<li><p>Let data scientists and ML engineers work on training while platform engineers optimize serving</p>
</li>
</ul>
<h2 id="heading-how-to-build-the-training-container">How to Build the Training Container</h2>
<p>The training container is where most teams start, and where most teams make their first mistake.</p>
<p>The temptation is to create one massive image with every possible library, every CUDA version, every data processing tool. I've seen training images exceed 15GB. They take twenty minutes to build, ten minutes to push, and break whenever someone adds a new dependency.</p>
<p>Here's the pattern that works: use multi-stage builds to separate the build environment from the runtime environment, and use cache mounts to avoid re-downloading packages on every build.</p>
<p>If you're new to these concepts: a <strong>multi-stage build</strong> lets you use one Docker image to build your software and a different, smaller image to run it. You copy only the final artifacts from the build stage to the runtime stage, leaving behind compilers, build tools, and other things you don't need in production.</p>
<p>A <strong>cache mount</strong> tells Docker to keep a directory (like pip's download cache) between builds, so it doesn't re-download packages that haven't changed.</p>
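<p>Here's the training Dockerfile. To keep the example focused on layer ordering and cache mounts, it uses a single named stage (<code>base</code>). A separate builder stage can be added in front of it when you need compilers or build tools:</p>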
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

# System dependencies (rarely change)
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Dependencies (change occasionally)
COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

# Training code (changes frequently)
COPY src/ /app/src/
COPY configs/ /app/configs/

WORKDIR /app
ENTRYPOINT ["python", "-m", "src.train"]
</code></pre>
<p>Notice the layer ordering. Docker builds images in layers, and it caches each layer. If a layer hasn't changed, Docker reuses the cached version instead of rebuilding it. But here's the catch: if one layer changes, Docker rebuilds that layer and every layer after it.</p>
<p>That's why we put things in order of how often they change:</p>
<ol>
<li><p><strong>System packages at the top</strong> (they almost never change). Installing <code>python3.11</code> and <code>git</code> takes time, but you only do it once.</p>
</li>
<li><p><strong>Python dependencies in the middle</strong> (they change when you add or update a library). This layer rebuilds when <code>requirements-train.txt</code> changes.</p>
</li>
<li><p><strong>Your actual code at the bottom</strong> (changes on every commit). This is the layer that rebuilds most often.</p>
</li>
</ol>
<p>With this ordering, a code change only rebuilds the final layer, not the entire image. If you put <code>COPY src/</code> before <code>pip install</code>, every code change would trigger a full reinstall of all Python packages. That's the mistake I see most often in ML Dockerfiles.</p>
<p>The <code>--mount=type=cache,target=/root/.cache/pip</code> line on the <code>pip install</code> command tells Docker to persist pip's download cache between builds. When you do update requirements, pip checks the cache first and only downloads packages that are new or changed. On a project with hundreds of ML dependencies (PyTorch alone pulls in dozens of sub-packages), this saves five to ten minutes per build.</p>
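<p>To make the invalidation rule concrete, here's a small Python sketch that models it. It isn't how Docker is implemented, and the layer labels are illustrative shorthand for the instructions in the Dockerfile above. The function returns the first changed layer plus everything after it:</p>
<pre><code class="language-python">def layers_to_rebuild(layers, changed):
    """Return the layers that get rebuilt when `changed` is invalidated.

    Simplified model: the first changed layer and everything after it
    are rebuilt; everything before it comes from cache.
    """
    for index, name in enumerate(layers):
        if name in changed:
            return layers[index:]
    return []  # nothing changed, every layer is a cache hit


# Recommended order: dependencies before source code.
good_order = ["apt packages", "venv", "pip install", "COPY src", "COPY configs"]
# The anti-pattern: source code copied before dependencies are installed.
bad_order = ["apt packages", "venv", "COPY src", "pip install", "COPY configs"]

print(layers_to_rebuild(good_order, {"COPY src"}))
print(layers_to_rebuild(bad_order, {"COPY src"}))
</code></pre>
<p>With the recommended ordering, a source change rebuilds two cheap <code>COPY</code> layers. With the anti-pattern ordering, the same change drags <code>pip install</code> along with it.</p>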
<h3 id="heading-separate-training-from-serving-requirements">Separate Training from Serving Requirements</h3>
<p>Your training environment needs libraries that your serving environment does not. Training needs experiment tracking tools like MLflow, data processing libraries like pandas and polars, visualization libraries for debugging, and hyperparameter tuning frameworks. Serving needs a lightweight inference runtime, an API framework like FastAPI, health check endpoints, and minimal overhead.</p>
<p>It's a good idea to maintain separate requirements files:</p>
<pre><code class="language-plaintext"># requirements-train.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
pandas==2.2.3
polars==1.20.0
dvc[s3]==3.59.1
optuna==4.2.0
matplotlib==3.10.0

# requirements-serve.txt
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
fastapi==0.115.0
uvicorn[standard]==0.34.0
pydantic==2.10.0
</code></pre>
<p>The overlap is smaller than you'd think. <code>torch</code> and <code>scikit-learn</code> appear in both because the model needs them for inference, and <code>mlflow</code> appears in both because the serving container loads the model from the MLflow registry. Everything else in the training file is baggage that slows down serving deployments and increases the attack surface.</p>
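<p>If you want to check the overlap yourself, a few lines of Python will do it. This is a minimal sketch that only handles simple <code>name==version</code> lines like the ones above (no markers, URLs, or <code>-r</code> includes), with the two files pasted in as strings:</p>
<pre><code class="language-python">def package_names(requirements_text):
    """Extract bare package names from simple requirements-style text."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # "uvicorn[standard]==0.34.0" becomes "uvicorn"
        name = line.split("==")[0].split("[")[0]
        names.add(name.lower())
    return names


TRAIN = """
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
pandas==2.2.3
polars==1.20.0
dvc[s3]==3.59.1
optuna==4.2.0
matplotlib==3.10.0
"""

SERVE = """
torch==2.5.1
scikit-learn==1.6.1
mlflow==2.19.0
fastapi==0.115.0
uvicorn[standard]==0.34.0
pydantic==2.10.0
"""

train, serve = package_names(TRAIN), package_names(SERVE)
shared = sorted(train.intersection(serve))
train_only = sorted(train.difference(serve))
print("shared:    ", shared)
print("train-only:", train_only)
</code></pre>
<p>Running a check like this in CI is one way to notice when a training-only library sneaks into the serving file.</p>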
<h3 id="heading-cuda-and-driver-compatibility">CUDA and Driver Compatibility</h3>
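<p>One thing that will bite you if you ignore it: the CUDA runtime version inside your container must be compatible with the GPU driver version on the host. The rule is that the host driver must be at least as new as the minimum driver version NVIDIA documents for the CUDA release in the container. For example, CUDA 12.6 requires driver version 560.28+ on Linux. NVIDIA's minor-version compatibility can let somewhat older drivers run newer CUDA 12.x containers with limitations, but meeting the documented minimum is the safer default.</p>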
<p>Make sure you check your host driver version before choosing your base image:</p>
<pre><code class="language-bash"># On the host machine
nvidia-smi
# Look for "Driver Version: 560.35.03" and "CUDA Version: 12.6"

# The CUDA version shown by nvidia-smi is the maximum CUDA version
# your driver supports, not the version installed
</code></pre>
<p>If your host driver is 535.x, don't use a <code>cuda:12.6</code> base image. Use <code>cuda:12.2</code> or upgrade the driver. Mismatched versions produce errors like <code>CUDA driver version is insufficient for CUDA runtime version</code>, or the container may refuse to start at all, and both are painful to debug on unfamiliar hardware.</p>
<p>Pin your base images to specific tags (not <code>latest</code>) and document the minimum driver version in your README. When you deploy to new hardware, the driver version check should be part of your provisioning checklist.</p>
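<p>The driver check is easy to automate as part of that checklist. Here's a rough sketch that parses the <code>Driver Version</code> field from <code>nvidia-smi</code> output and compares it against a minimum. The lookup table is an assumption: it contains only the one figure quoted in this article, so extend it from NVIDIA's release notes before relying on it:</p>
<pre><code class="language-python">import operator
import re


def parse_driver_version(nvidia_smi_output):
    """Pull (major, minor) out of text like 'Driver Version: 560.35.03'."""
    match = re.search(r"Driver Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        raise ValueError("no 'Driver Version' field found")
    return int(match.group(1)), int(match.group(2))


# Minimum Linux driver per CUDA release. The single entry is the figure
# quoted in this article; add more from NVIDIA's release notes.
MIN_DRIVER = {"12.6": (560, 28)}


def driver_supports(nvidia_smi_output, cuda_version):
    """True if the host driver meets the listed minimum for `cuda_version`."""
    driver = parse_driver_version(nvidia_smi_output)
    # Tuples compare element by element, so (560, 35) passes (560, 28).
    return operator.ge(driver, MIN_DRIVER[cuda_version])


sample = "NVIDIA-SMI 560.35.03    Driver Version: 560.35.03    CUDA Version: 12.6"
print(driver_supports(sample, "12.6"))
print(driver_supports("Driver Version: 535.183.01", "12.6"))
</code></pre>
<p>This is a strict check against the documented minimum. It deliberately ignores any compatibility modes, which keeps the answer conservative.</p>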
<h2 id="heading-how-to-set-up-experiment-tracking-with-mlflow">How to Set Up Experiment Tracking with MLflow</h2>
<p>If you've ever trained a model and thought "wait, which hyperparameters gave me that good result last week?", you need experiment tracking. Without it, ML development turns into a mess of Jupyter notebooks, screenshots of metrics, and spreadsheets that nobody keeps up to date.</p>
<p><a href="https://mlflow.org/">MLflow</a> is the most widely adopted open-source tool for this. It logs three things for every training run: <strong>parameters</strong> (learning rate, batch size, number of epochs), <strong>metrics</strong> (accuracy, loss, F1 score), and <strong>artifacts</strong> (the trained model file, plots, evaluation reports). It stores all of this in a database and gives you a web UI to compare runs side by side.</p>
<p>Running MLflow as a containerized service means the tracking server is persistent and shared across your team, not running on one person's laptop:</p>
<pre><code class="language-yaml">services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: &gt;
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --artifacts-destination /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

volumes:
  mlflow-artifacts:
  postgres-data:
</code></pre>
<p>Let me break down what's happening here.</p>
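<p>The <code>mlflow</code> service runs the MLflow tracking server. It stores experiment metadata (parameters, metrics) in a Postgres database and saves artifacts (model files, plots) to a Docker volume. One caveat: the stock MLflow image may not include a Postgres driver. If the server exits with an import error for <code>psycopg2</code>, build a small image on top of it that runs <code>pip install psycopg2-binary</code>.</p>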
<p>The <code>depends_on</code> with <code>condition: service_healthy</code> tells Compose to wait until Postgres is actually ready to accept connections before starting MLflow. Without this, MLflow would crash on startup because the database isn't ready yet.</p>
<p>The <code>db</code> service runs Postgres with a health check that uses <code>pg_isready</code>, a built-in Postgres utility that checks if the database is accepting connections. The <code>start_period</code> gives Postgres 10 seconds to initialize before health checks start counting failures.</p>
<p>Your training code connects to MLflow by setting one environment variable:</p>
<pre><code class="language-python">import os
import mlflow

# This tells MLflow where to log experiments
# When running inside Docker Compose, "mlflow" resolves to the mlflow container
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow:5000"

# Example: logging a training run
with mlflow.start_run(run_name="fraud-detector-v2"):
    # Log hyperparameters
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 64)
    mlflow.log_param("epochs", 50)

    # ... train your model here ...

    # Log metrics
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.log_metric("precision", 0.93)
    mlflow.log_metric("recall", 0.89)

    # Log the trained model as an artifact
    mlflow.sklearn.log_model(model, "model")
    # Or for PyTorch: mlflow.pytorch.log_model(model, "model")
</code></pre>
<p>After the run completes, open <code>http://localhost:5000</code> in your browser. You'll see a table of all your runs with their parameters and metrics. Click any run to see details, compare it with other runs, or download the model artifact. No more "I think experiment 7 was the good one" conversations.</p>
<p>A note on the password in the YAML: for local development this is fine. For staging and production, use Docker secrets or inject the credentials from your CI environment. Don't commit real database passwords to your repo.</p>
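<p>Real training configs are usually nested, while MLflow parameters are flat key-value pairs. One way to bridge that, sketched below, is to flatten the config into dotted keys before logging. The config contents here are hypothetical, and the helper is plain Python so it runs without a tracking server:</p>
<pre><code class="language-python">def flatten_params(config, prefix=""):
    """Flatten a nested config dict into dotted keys for parameter logging."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical training config; the keys are illustrative only.
config = {
    "optimizer": {"learning_rate": 0.001, "weight_decay": 0.01},
    "data": {"batch_size": 64},
    "epochs": 50,
}

params = flatten_params(config)
print(params)
# Inside a run, the whole dict can be logged in one call:
#     mlflow.log_params(params)
</code></pre>
<p>Logging the flattened dict in one call keeps the run's parameters in sync with the config file, so you don't have to maintain a separate list of <code>log_param</code> calls.</p>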
<h2 id="heading-how-to-version-training-data-with-dvc">How to Version Training Data with DVC</h2>
<p>Models are reproducible only if you can also reproduce the data they were trained on. This is a problem Git can't solve on its own, because training datasets are often gigabytes or terabytes in size and Git isn't designed for large binary files.</p>
<p><a href="https://dvc.org/">DVC (Data Version Control)</a> fills this gap. It works like Git, but for data. Here's the concept: instead of storing your 10GB training dataset in Git, DVC stores a small text file (a <code>.dvc</code> file) that acts as a pointer to the actual data. The real data lives in cloud storage (S3, Google Cloud Storage, Azure Blob). When you check out a specific Git commit, DVC knows which version of the data goes with that commit and can pull it from remote storage.</p>
<p>The workflow on your local machine looks like this:</p>
<pre><code class="language-bash"># Initialize DVC in your project (one time)
dvc init

# Configure a default remote (one time); replace the placeholder bucket URL
dvc remote add -d storage s3://your-bucket/dvc-store

# Add your training data to DVC tracking
dvc add data/training_data.parquet
# This creates data/training_data.parquet.dvc (small pointer file)
# and adds training_data.parquet to data/.gitignore

# Push the actual data to remote storage
dvc push

# Commit the pointer file and the DVC config to Git
git add data/training_data.parquet.dvc data/.gitignore .dvc/config
git commit -m "Add training data v1"
</code></pre>
<p>Now your Git repo contains the pointer file, and the real data lives in S3. When someone else (or a container) needs the data, they run <code>dvc pull</code> and DVC downloads it from remote storage.</p>
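<p>If the pointer-file idea feels abstract, this toy version may help. It is not DVC's implementation or file format (real <code>.dvc</code> files are YAML written by <code>dvc add</code>, and the <code>.pointer.json</code> name is made up for this example). It only shows the concept: hash the data, then store the hash and size in a tiny file that Git can track:</p>
<pre><code class="language-python">import hashlib
import json
import tempfile
from pathlib import Path


def write_pointer(data_path):
    """Write a small JSON pointer file next to `data_path` and return its path.

    Conceptual only: this mimics the idea behind a .dvc file, not its format.
    """
    data_path = Path(data_path)
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    pointer = {
        "path": data_path.name,
        "sha256": digest,
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


payload = b"pretend this is 10GB of training data"
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "training_data.parquet"
    data_file.write_bytes(payload)
    pointer_file = write_pointer(data_file)
    pointer = json.loads(pointer_file.read_text())
    print(pointer)
</code></pre>
<p>The pointer stays a few hundred bytes no matter how large the data gets, which is why it can live in Git while the data lives in object storage.</p>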
<p>The training Dockerfile includes DVC, and the entrypoint pulls the correct data version before training begins:</p>
<pre><code class="language-dockerfile"># syntax=docker/dockerfile:1.4
FROM nvidia/cuda:12.6.3-runtime-ubuntu22.04 AS base

RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3-pip git curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements-train.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-train.txt

COPY src/ /app/src/
COPY configs/ /app/configs/

# DVC tracking files (these are small text files in Git)
COPY data/*.dvc /app/data/
COPY .dvc/ /app/.dvc/

WORKDIR /app
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]
</code></pre>
<p>The entrypoint script pulls the data and then starts training:</p>
<pre><code class="language-bash">#!/bin/bash
set -e

echo "Pulling training data from remote storage..."
dvc pull data/

echo "Starting training run..."
python -m src.train "$@"
</code></pre>
<p>For DVC to pull from S3, the container needs AWS credentials. You can pass them as environment variables in your Compose file or mount them from the host:</p>
<pre><code class="language-yaml">training:
  build: { context: ., dockerfile: Dockerfile.train }
  environment:
    - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
    - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    - AWS_DEFAULT_REGION=us-east-1
</code></pre>
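<p>A missing credential usually surfaces as a confusing error deep inside the data pull. If you pass credentials as environment variables, a fail-fast check at the top of your training code can make the problem obvious. This is a sketch under that assumption (it doesn't apply if you rely on instance roles or mounted credential files):</p>
<pre><code class="language-python">import os

REQUIRED_FOR_S3 = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_env(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"AWS_ACCESS_KEY_ID": "AKIA-EXAMPLE", "AWS_DEFAULT_REGION": "us-east-1"}
missing = missing_env(REQUIRED_FOR_S3, example_env)
print(missing)
</code></pre>
<p>In real use you'd call <code>missing_env(REQUIRED_FOR_S3)</code> with no second argument and exit with a clear message if the list isn't empty.</p>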
<p>Combined with MLflow's experiment logging, you get a complete provenance chain: this model was trained on this version of the data (tracked by DVC), with these parameters (logged in MLflow), producing these metrics.</p>
<p>You can reproduce any past experiment by checking out the Git commit and running the training container.</p>
<h2 id="heading-how-to-build-the-serving-container">How to Build the Serving Container</h2>
<p>"Serving" means wrapping your trained model in an API so other systems can send it data and get predictions back. For example, a fraud detection model might expose a <code>/predict</code> endpoint that accepts transaction data and returns a fraud probability.</p>
<p>The serving container has different priorities than the training container. Training optimizes for flexibility and raw compute. Serving optimizes for speed, small size, and reliability:</p>
<pre><code class="language-dockerfile">FROM python:3.11-slim AS serving

WORKDIR /app

# Install curl for healthcheck
RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends curl &amp;&amp; \
    rm -rf /var/lib/apt/lists/*

COPY requirements-serve.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements-serve.txt

COPY src/serving/ /app/src/serving/

HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.serving.app:app", "--host", "0.0.0.0"]
</code></pre>
<p>A few things to understand if you're new to this:</p>
<p><code>uvicorn</code> is a lightweight Python web server that runs <a href="https://fastapi.tiangolo.com/">FastAPI</a> applications. FastAPI is a framework for building APIs in Python. Together, they let you turn your model into a web service that responds to HTTP requests.</p>
<p><code>HEALTHCHECK</code> tells Docker to periodically check if your container is actually working, not just running. Every 30 seconds, Docker runs the <code>curl</code> command against the <code>/health</code> endpoint. If it fails three times in a row, Docker marks the container as unhealthy. This matters because your model server might be running but not ready (maybe the model file is still downloading), and you don't want to send traffic to a server that can't respond.</p>
<p><code>start-period</code> of 60 seconds is important for ML serving containers. Model loading can take time, especially for large models (loading a 2GB model from a registry takes a while). Without <code>start-period</code>, the health check would start failing immediately, count those failures toward the retry limit, and the orchestrator might kill the container before the model finishes loading. The start period gives the container grace time to initialize.</p>
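<p>Here's a simplified Python model of that behavior, which I find useful for reasoning about the numbers. It's a sketch of the rules as described above, not Docker's actual code, and it assumes probes run exactly once per interval:</p>
<pre><code class="language-python">def health_status(probes, interval=30, start_period=60, retries=3):
    """Simulate the health state after a sequence of probe results.

    `probes` holds one boolean per check, run every `interval` seconds.
    """
    grace_left = start_period // interval  # probes that fall inside the start period
    failures = 0
    started = False  # True once any probe has succeeded
    status = "starting"
    for ok in probes:
        in_grace = grace_left != 0 and not started
        grace_left = max(grace_left - 1, 0)
        if ok:
            failures, started, status = 0, True, "healthy"
        elif not in_grace:
            failures += 1  # only failures outside the start period count
            if failures == retries:
                return "unhealthy"
    return status


# A model that needs about 100 seconds to load: three failed probes, then success.
slow_start = [False, False, False, True]
print(health_status(slow_start))                  # with the 60s start period
print(health_status(slow_start, start_period=0))  # without it
</code></pre>
<p>With the start period, the slow-loading container ends up <code>healthy</code>. Without it, the same container is marked <code>unhealthy</code> on the third probe, before the model has finished loading.</p>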
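<p>Notice we're using <code>python:3.11-slim</code> here, not the NVIDIA CUDA image. Most trained models can run inference on CPU. If you need GPU inference (for example, running a large language model or doing real-time video processing), use the CUDA base image instead, but be aware that it makes the serving container much larger. Also note that the default <code>torch</code> wheel on Linux bundles CUDA libraries and is large. For CPU-only serving, installing PyTorch's CPU-only build keeps the image much smaller.</p>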
<p>If you want to skip the <code>curl</code> dependency, use Python's built-in <code>urllib</code> for the health check:</p>
<pre><code class="language-dockerfile">HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
</code></pre>
<h3 id="heading-decouple-models-from-containers">Decouple Models from Containers</h3>
<p>This is one of the most important patterns in this article, and the one beginners most often get wrong.</p>
<p>The temptation is to copy your trained model file (the <code>.pkl</code>, <code>.pt</code>, or <code>.onnx</code> file that contains the learned weights) directly into the Docker image during the build. Don't do this. When you embed model files in your Docker image, every model update requires a new image build and push. For a 2GB model, that means rebuilding the container, uploading 2GB to a registry, and redeploying, even though only the model changed and the code is identical.</p>
<p>Instead, have your serving container download the model from a model registry (like MLflow) or cloud storage (like S3) at startup. The container image stays small and generic. Model updates become a configuration change (pointing to a new model version) rather than a deployment.</p>
<p>Here's a full serving app using FastAPI with the modern lifespan pattern. If you've used Flask, FastAPI is similar but faster and with built-in request validation:</p>
<pre><code class="language-python">import os
from contextlib import asynccontextmanager

import mlflow
import pandas as pd
from fastapi import FastAPI
from fastapi.responses import JSONResponse

# MODEL_URI points to a specific model version in MLflow's registry
# Format: "models:/&lt;model-name&gt;/&lt;stage&gt;" where stage is Staging or Production
MODEL_URI = os.environ.get("MODEL_URI", "models:/fraud-detector/production")
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # This runs once when the server starts up
    global model
    print(f"Loading model from {MODEL_URI}...")
    model = mlflow.pyfunc.load_model(MODEL_URI)
    print("Model loaded successfully.")
    yield
    # This runs when the server shuts down
    print("Shutting down model server.")


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health():
    """Used by Docker HEALTHCHECK to verify the server is ready."""
    if model is None:
        # FastAPI needs an explicit response object to set a non-200 status;
        # returning a (body, status) tuple would be sent as a 200 JSON array
        return JSONResponse(status_code=503, content={"status": "loading"})
    return {"status": "healthy"}


@app.post("/predict")
async def predict(features: dict):
    """Accept features as JSON, return model prediction."""
    # Convert the input dict into a DataFrame (what most sklearn/mlflow models expect)
    df = pd.DataFrame([features])
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}
</code></pre>
<p>When a client sends a POST request to <code>/predict</code> with JSON like <code>{"amount": 500, "merchant_category": "electronics", "hour": 23}</code>, the model returns a prediction. While the model is still loading, the health check fails, because uvicorn doesn't accept requests until the lifespan startup finishes. Once the model is ready, <code>/health</code> returns 200, which is exactly what the Docker <code>HEALTHCHECK</code> checks for.</p>
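<p>To call the endpoint from Python rather than <code>curl</code>, the standard library is enough. The sketch below builds the request without sending it, so it runs even when no server is up. The base URL is an assumption that matches the port used in this article:</p>
<pre><code class="language-python">import json
import urllib.request


def build_predict_request(features, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    {"amount": 500, "merchant_category": "electronics", "hour": 23}
)
print(request.get_method(), request.full_url)

# With the serving container running, send it like this:
#     with urllib.request.urlopen(request, timeout=5) as response:
#         print(json.load(response))
</code></pre>
<p>Keeping request construction separate from sending also makes it easy to unit test the payload format without a live model server.</p>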
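<p>Promoting a new model version means updating the <code>MODEL_URI</code> environment variable and restarting the container. The MLflow model registry supports stage transitions (Staging, Production, Archived), so you can promote a model in the MLflow UI and then point your serving container at the new version. Stage-based URIs work with the MLflow version pinned here, but recent MLflow releases deprecate stages in favor of model version aliases, so check the registry docs before you build long-term tooling around them.</p>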
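<p>For zero-downtime model updates, implement a reload endpoint that swaps models without restarting. Anyone who can reach this endpoint can trigger a model load, so keep it on an internal network or behind authentication:</p>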
<pre><code class="language-python">@app.post("/admin/reload")
async def reload_model():
    global model
    model = mlflow.pyfunc.load_model(MODEL_URI)
    return {"status": "reloaded"}
</code></pre>
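<p>The same load-then-swap idea can be factored into a small holder class, which makes it testable without MLflow. This is a sketch, not part of the app above: <code>ModelHolder</code> and <code>fake_loader</code> are hypothetical names, and in the serving app the loader would be <code>mlflow.pyfunc.load_model</code>:</p>
<pre><code class="language-python">import threading


class ModelHolder:
    """Keep the current model and swap it only after a successful load."""

    def __init__(self, loader):
        self._loader = loader  # callable: model URI in, loaded model out
        self._lock = threading.Lock()
        self._model = None

    def reload(self, uri):
        new_model = self._loader(uri)  # may raise; the old model stays in place
        with self._lock:
            self._model = new_model

    def get(self):
        with self._lock:
            return self._model


# Stand-in loader for the example only.
def fake_loader(uri):
    if "missing" in uri:
        raise RuntimeError(f"cannot load {uri}")
    return f"model loaded from {uri}"


holder = ModelHolder(fake_loader)
holder.reload("models:/fraud-detector/1")
try:
    holder.reload("models:/missing/1")
except RuntimeError:
    pass  # a failed reload leaves the previous model active
print(holder.get())
</code></pre>
<p>Because the new model is fully loaded before the reference changes, requests keep hitting the old model until the swap, and a failed load never leaves the server without a model.</p>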
<h2 id="heading-how-to-configure-gpu-passthrough-for-training">How to Configure GPU Passthrough for Training</h2>
<p>By default, Docker containers can't see the GPU hardware on the host machine. "GPU passthrough" means giving a container access to the host's GPUs so that libraries like PyTorch and TensorFlow can use them for accelerated computation.</p>
<p>This requires two things on the host (the machine running Docker, not inside the container):</p>
<ol>
<li><p><strong>NVIDIA GPU drivers</strong> installed and working. Verify with <code>nvidia-smi</code>. If that command shows your GPUs, you're good.</p>
</li>
<li><p><strong>NVIDIA Container Toolkit</strong> installed. This is the bridge between Docker and the GPU drivers. Install it from the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">NVIDIA docs</a> and verify with <code>docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi</code>. If you see your GPU listed, the toolkit is working.</p>
</li>
</ol>
<p>Once the host is set up, GPU access in Docker Compose looks like this:</p>
<pre><code class="language-yaml">services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
</code></pre>
<p>The <code>deploy.resources.reservations.devices</code> block is saying: "this container needs all available NVIDIA GPUs." Inside the container, PyTorch and TensorFlow will see the GPUs and use them automatically. You can verify by adding <code>print(torch.cuda.is_available())</code> to your training script, which should print <code>True</code>.</p>
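<p>A slightly more informative check than a bare <code>print</code> is to list what PyTorch can actually see. Here's a small helper you could call at the top of a training script. It degrades to a message instead of crashing when PyTorch or a GPU is missing:</p>
<pre><code class="language-python">def gpu_summary():
    """Describe the GPUs PyTorch can see, or explain why none are visible."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible"
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return f"{len(names)} GPU(s) visible: " + ", ".join(names)


print(gpu_summary())
</code></pre>
<p>Logging this line at startup gives you a quick answer later when a run turns out to have trained on CPU by accident.</p>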
<p>If you're running Compose v2.30.0+, you can use the shorter <code>gpus</code> syntax:</p>
<pre><code class="language-yaml">services:
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    gpus: all
    volumes:
      - ./data:/app/data
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
</code></pre>
<p>For multi-GPU training with frameworks like PyTorch's DistributedDataParallel, you can assign specific GPUs using <code>device_ids</code>. This matters when running multiple training jobs at the same time:</p>
<pre><code class="language-yaml">services:
  training-job-1:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1

  training-job-2:
    build: { context: ., dockerfile: Dockerfile.train }
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["2", "3"]
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
</code></pre>
<p>Note that <code>CUDA_VISIBLE_DEVICES</code> inside the container is relative to the devices assigned by Docker, not the host GPU indices. Both containers see their GPUs as device 0 and 1, even though they're using different physical GPUs.</p>
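<p>The renumbering is easier to see written out. This toy function just illustrates the mapping described above. It assumes the container numbers its assigned GPUs from 0 in the order listed, which is a simplification of how devices are actually enumerated:</p>
<pre><code class="language-python">def container_view(host_device_ids):
    """Map host GPU ids to the indices a container sees for them.

    Assumes the assigned GPUs are renumbered from 0 in the order given.
    """
    return {host_id: index for index, host_id in enumerate(host_device_ids)}


job_1 = container_view(["0", "1"])  # training-job-1
job_2 = container_view(["2", "3"])  # training-job-2
print(job_1)
print(job_2)
</code></pre>
<p>Both jobs end up with container-side indices 0 and 1, which is why the same <code>CUDA_VISIBLE_DEVICES=0,1</code> setting works in both services.</p>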
<h2 id="heading-how-to-tie-it-all-together-with-compose-profiles">How to Tie It All Together with Compose Profiles</h2>
<p>If you're new to Compose profiles: by default, <code>docker compose up</code> starts every service defined in your <code>docker-compose.yml</code>. But you don't always want everything running. Your MLflow server and serving API should run all the time, but the training container should only launch when you're actually training a model (and it needs a GPU, which you might not have on your laptop).</p>
<p>Profiles solve this. When you add <code>profiles: ["train"]</code> to a service, that service is excluded from <code>docker compose up</code> by default. It only starts when you explicitly activate the profile, for example with <code>docker compose --profile train up</code>, or when you target the service directly with <code>docker compose run</code>. This means one file defines your entire ML infrastructure, but you control what runs and when.</p>
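<p>The selection rule is simple enough to model in a few lines of Python. This is a simplified sketch of the behavior just described, not Compose's implementation, using the four services from this article:</p>
<pre><code class="language-python">def services_to_start(services, active_profiles=()):
    """Return the service names a plain `up` would start.

    Simplified model: a service with no profiles always starts; a service
    with profiles starts only if one of them is active.
    """
    active = set(active_profiles)
    started = []
    for name, profiles in services.items():
        if not profiles or active.intersection(profiles):
            started.append(name)
    return started


# Service-to-profiles mapping for the services used in this article.
SERVICES = {"db": [], "mlflow": [], "serving": [], "training": ["train"]}

print(services_to_start(SERVICES))
print(services_to_start(SERVICES, ["train"]))
</code></pre>
<p>Without a profile, only the always-on services start. Activating <code>train</code> adds the training service to the list.</p>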
<p>Here's the complete <code>docker-compose.yml</code> that ties every piece from this article together:</p>
<pre><code class="language-yaml">services:
  # --- Always-on infrastructure ---
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mlflow
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mlflow"]
      interval: 5s
      timeout: 2s
      retries: 5
      start_period: 10s
    volumes:
      - postgres-data:/var/lib/postgresql/data

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.19.0
    command: &gt;
      mlflow server
      --backend-store-uri postgresql://mlflow:secret@db/mlflow
      --artifacts-destination /mlflow/artifacts
      --host 0.0.0.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow-artifacts:/mlflow/artifacts
    depends_on:
      db: { condition: service_healthy }

  serving:
    build: { context: ., dockerfile: Dockerfile.serve }
    ports:
      - "8000:8000"
    environment:
      - MODEL_URI=models:/fraud-detector/production
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    depends_on:
      mlflow: { condition: service_started }

  # --- Training (on-demand) ---
  training:
    build: { context: ., dockerfile: Dockerfile.train }
    profiles: ["train"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
      - ./configs:/app/configs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      mlflow: { condition: service_started }

volumes:
  postgres-data:
  mlflow-artifacts:
</code></pre>
<p>The day-to-day workflow with this file:</p>
<pre><code class="language-bash"># Step 1: Start the infrastructure (MLflow + Postgres + serving API)
# The -d flag runs everything in the background
docker compose up -d

# Step 2: Open the MLflow UI to see past experiments
open http://localhost:5000    # macOS
# xdg-open http://localhost:5000  # Linux

# Step 3: Check that the serving API is healthy
# (it needs a registered Production model; on a fresh setup, train and register one first)
curl http://localhost:8000/health
# Should return: {"status":"healthy"}

# Step 4: Run a training job (pulls data via DVC, logs to MLflow)
# This only starts the "training" service because of the profile flag
docker compose --profile train run training

# Step 5: Watch training progress in the MLflow UI at localhost:5000
# You'll see metrics updating in real time if your training code logs them

# Step 6: After training completes, promote the model in MLflow UI
# Click the model, go to "Register Model", set stage to "Production"

# Step 7: Restart the serving container to pick up the new model version
docker compose restart serving

# Step 8: Test the new model
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 500, "merchant_category": "electronics", "hour": 23}'
</code></pre>
<p>This single-file approach means a new team member can clone the repo, run <code>docker compose up -d</code>, and have the complete ML infrastructure running locally within minutes. The same containers deploy to staging and production with only environment variable changes (database credentials, model URIs, GPU allocation).</p>
<h2 id="heading-reproducibility-the-whole-point">Reproducibility: The Whole Point</h2>
<p>Everything in this article serves one goal: reproducibility. The ability to take any commit hash, build the same containers, pull the same data, and produce the same model.</p>
<p>Here are the practices that make this work:</p>
<h3 id="heading-pin-everything">Pin Everything</h3>
<p>Pin your base images to specific digests, not just tags. Pin your Python packages to exact versions with <code>pip freeze &gt; requirements.txt</code>. Use fixed random seeds in your training code and log them in MLflow.</p>
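<p>For the random seeds, a single helper that seeds everything in one place is easier to keep consistent than scattered calls. Here's a minimal sketch. It seeds Python's <code>random</code> module and, if they're installed, NumPy and PyTorch. Some GPU operations can still be nondeterministic, so treat this as a baseline rather than a guarantee:</p>
<pre><code class="language-python">import random


def set_seed(seed):
    """Seed the common sources of randomness and return the seed for logging."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
    return seed


seed = set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)
# Inside a run, record it alongside the other parameters:
#     mlflow.log_param("seed", seed)
</code></pre>
<p>Returning the seed makes it natural to log it in the same place you set it, so the value in MLflow is always the value that was used.</p>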
<h3 id="heading-log-everything">Log Everything</h3>
<p>Every training run should log the exact library versions (<code>pip freeze</code>), the Git commit hash, the DVC data version, all hyperparameters, and all evaluation metrics to MLflow. You can automate this:</p>
<pre><code class="language-python">import subprocess
import mlflow

with mlflow.start_run():
    # Log environment info automatically
    pip_freeze = subprocess.check_output(["pip", "freeze"]).decode()
    mlflow.log_text(pip_freeze, "pip_freeze.txt")

    git_hash = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.log_param("git_commit", git_hash)

    # ... rest of training ...
</code></pre>
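<p>One catch: the training image in this article doesn't copy the <code>.git</code> directory, so <code>git rev-parse</code> will fail inside the container. A common workaround is to pass the commit hash in as an environment variable and fall back to <code>git</code> only when it's available. In this sketch, <code>GIT_COMMIT</code> is a hypothetical variable name that you would set yourself at build or run time:</p>
<pre><code class="language-python">import os
import subprocess


def current_git_commit(environ=None):
    """Return the commit hash from GIT_COMMIT, else from git, else 'unknown'.

    GIT_COMMIT is a hypothetical variable name chosen for this example.
    """
    environ = os.environ if environ is None else environ
    if environ.get("GIT_COMMIT"):
        return environ["GIT_COMMIT"]
    try:
        output = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        )
        return output.decode().strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"  # no git binary or no .git directory


commit = current_git_commit({"GIT_COMMIT": "abc1234"})
print(commit)
</code></pre>
<p>Logging <code>"unknown"</code> instead of crashing keeps a long training run alive, but it's worth alerting on, since a run without a commit hash can't be traced back to its code.</p>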
<h3 id="heading-version-everything">Version Everything</h3>
<p>Git for code, DVC for data, MLflow for experiments, Docker digests for environments. The combination creates a complete provenance chain. When a stakeholder asks why a model made a particular prediction, you can trace it back to the exact code, data, and hyperparameters that produced it. For regulated industries like finance and healthcare, that traceability is a compliance requirement, not a nice-to-have.</p>
<h2 id="heading-where-this-breaks-down">Where This Breaks Down</h2>
<p>This approach works well for small-to-medium teams running on single hosts or small clusters. Here's where you'll hit walls:</p>
<p><strong>Large datasets.</strong> Don't mount multi-terabyte datasets into containers. Use object storage (S3, GCS) and stream data during training. DVC handles the versioning, but the data itself should live outside Docker entirely.</p>
<p><strong>GPU driver mismatches.</strong> Your container's CUDA version must be compatible with the host driver. Test on identical hardware and driver versions to what you'll run in production. Document the minimum driver version in your README.</p>
<p><strong>Multi-node training.</strong> When you need to distribute training across multiple machines, you'll outgrow Compose. Kubernetes with Kubeflow or KServe is the standard path for distributed training and auto-scaled serving.</p>
<p><strong>Serving at scale.</strong> A single container running uvicorn handles moderate traffic. For high-throughput inference, you'll need a load balancer, multiple replicas, and potentially a dedicated serving framework like NVIDIA Triton Inference Server or TensorFlow Serving. Compose can run multiple replicas with <code>docker compose up --scale serving=3</code> (once you remove the fixed <code>8000:8000</code> host port mapping, which only one replica can bind), but it doesn't give you the routing, health-based load balancing, or rolling updates that a real orchestrator provides.</p>
<p><strong>Secrets in production.</strong> The Compose file above uses plaintext passwords for local development. In production, use Docker secrets, HashiCorp Vault, or your cloud provider's secret manager. Never commit credentials to your repo.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Containerizing your MLOps pipeline turns fragile, environment-dependent models into reproducible, deployable artifacts. Multi-stage builds keep images lean. MLflow gives you experiment tracking and model lineage. DVC links code to data. GPU passthrough preserves training performance. A single Compose file with profiles ties the whole workflow together.</p>
<p>That fraud detection model I mentioned at the start? We eventually containerized the entire pipeline around it. The next model we shipped went from "notebook finished" to "running in production" in two days instead of three weeks. Most of that time was spent on evaluation and review, not fighting environments.</p>
<p>Even with the caveats above, containerized MLOps eliminates the most common source of ML project delays: environment mismatch between development and production. The three weeks we spent debugging that fraud detection model deployment? That doesn't happen anymore.</p>
<p>Containerization doesn't make your models better. It gets the infrastructure out of the way so you can focus on the work that does.</p>
<p>If you found this useful, you can find me writing about MLOps, containerized workflows, and production AI systems on my blog.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
