<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ product experimentation - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ product experimentation - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 25 May 2026 15:47:57 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/product-experimentation/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Product Experimentation for Collaborative AI Features: Cluster Randomization for LLM-Based Tools in Python ]]>
                </title>
                <description>
                    <![CDATA[ Every product experimentation team running causal inference on LLM-based collaborative features eventually hits the same wall: your users aren't independent. Your team ships an AI meeting summarizer t ]]>
                </description>
                <link>https://www.freecodecamp.org/news/cluster-randomization-for-llm-based-tools-in-python/</link>
                <guid isPermaLink="false">6a10ab6c1f237623ea28e372</guid>
                
                    <category>
                        <![CDATA[ product experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ causal inference ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cluster randomization ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rudrendu Paul ]]>
                </dc:creator>
                <pubDate>Fri, 22 May 2026 19:15:56 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/35d6e16c-0c87-4160-9c02-2eb0db8505d7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every product experimentation team running causal inference on LLM-based collaborative features eventually hits the same wall: your users aren't independent. Your team ships an AI meeting summarizer to half the enterprise accounts on your platform. The rollout's clean, half on and half off, and you wait for the control group's task completion to stay flat while the treated group's creeps up. Two weeks in, the control group's numbers are moving too. Not as much, but visibly. The feature's confirmed off for those accounts, and you've checked the rollout config twice. Something's still contaminating your control.</p>
<p>You know what it is before you dig into the logs. The AI meeting summaries land in shared Slack channels, the AI-drafted docs show up in shared Google Drive folders, and the AI code review suggestions appear in pull requests that both treated and control engineers read. Behavior changes for the treated users, and a slice of that behavior bleeds back into your control group through the collaboration graph.</p>
<p>This is the collaborator contamination trap. It shows up in every generative AI product that touches shared artifacts: AI meeting notes that teammates read, AI-drafted documents that coworkers edit, AI code suggestions that reviewers evaluate, AI-generated email threads that the whole team replies to. User-level randomization assumes one user's treatment assignment leaves every other user's outcome alone. In a collaborative workspace, that assumption is wrong by design, and the product experiment folds the feature's real effect together with the spillover it creates inside the control group.</p>
<p>Running a collaborative AI feature behind a user-level A/B test is a product experiment that violates the Stable Unit Treatment Value Assumption (SUTVA). The fix is cluster randomization: flip the coin at the workspace level, so entire teams are in or out together, then model the cross-workspace spillover directly.</p>
<p>This tutorial walks through the full pipeline (cluster assignment, a biased, naive user-level OLS, cluster-weighted least squares for honest standard errors, a two-exposure decomposition that identifies direct and spillover effects separately, and cluster-bootstrap confidence intervals) on a 50,000-user synthetic SaaS dataset in which the ground-truth causal effects are known. You'll estimate them, quantify uncertainty, and see where the approach silently breaks.</p>
<blockquote>
<p><strong>Companion code:</strong> every code block runs end-to-end in the companion notebook at <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/05_cluster_randomization">github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/05_cluster_randomization</a>. The notebook (<code>cluster_randomization_demo.ipynb</code>) has all outputs pre-executed, so you can read along on GitHub before running anything locally.</p>
</blockquote>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-why-user-level-ab-randomization-breaks-under-collaboration">Why user-level A/B randomization breaks under collaboration</a></p>
</li>
<li><p><a href="#heading-what-cluster-randomization-actually-does">What cluster randomization actually does</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-setting-up-the-working-example">Setting up the working example</a></p>
<ul>
<li><p><a href="#heading-step-1-build-the-cluster-assignment-and-spillover-exposure">Step 1: Build the cluster assignment and spillover exposure</a></p>
</li>
<li><p><a href="#heading-step-2-naive-user-level-ols-biased-and-overconfident">Step 2: Naive user-level OLS (biased and overconfident)</a></p>
</li>
<li><p><a href="#heading-step-3-cluster-weighted-least-squares-honest-standard-error">Step 3: Cluster-weighted least squares (honest standard error)</a></p>
</li>
<li><p><a href="#heading-step-4-two-exposure-decomposition-unbiased-direct-and-spillover">Step 4: Two-exposure decomposition (unbiased direct and spillover)</a></p>
</li>
<li><p><a href="#heading-step-5-cluster-bootstrap-confidence-intervals">Step 5: Cluster-bootstrap confidence intervals</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-when-cluster-randomization-fails">When cluster randomization fails</a></p>
</li>
<li><p><a href="#heading-what-to-do-next">What to do next</a></p>
</li>
</ul>
<h2 id="heading-why-user-level-ab-randomization-breaks-under-collaboration">Why User-Level A/B Randomization Breaks Under Collaboration</h2>
<p>The math of an A/B test is elegant because one user's treatment assignment has no bearing on another user's outcome. Flip a coin; half your users get the AI feature, and the coin flip breaks every possible confound by construction. Collaboration breaks that guarantee in three ways.</p>
<p><strong>Shared artifacts travel.</strong> The AI summary lands in a channel every teammate reads, the AI-drafted doc goes into a folder every teammate edits, and the AI code review suggestion sits on a pull request every reviewer evaluates. Control users consume those artifacts, whether or not the feature is switched on for them, and the behavioral effects of reading AI-assisted content leak into their outcomes.</p>
<p><strong>Shared workflows create interference.</strong> A treated user who relies on the AI summarizer writes shorter follow-up notes, assuming teammates have read the summary. A control user on the same team receives those shorter notes and spends less time reading them, which changes their session length. That means the treated user's assignment has shifted the control user's outcome, which is exactly what SUTVA forbids.</p>
<p><strong>Network adoption follows collaboration.</strong> Power users on treated teams experiment with the feature first, then nudge teammates in other workspaces through cross-team channels. If your treated group produces AI-assisted content that your control group reads and copies, the control group is partially treated without ever flipping a switch.</p>
<p>All three mechanisms produce the same symptom: the raw user-level comparison understates the feature's direct effect because the control group is no longer a pure counterfactual. On the synthetic dataset in this tutorial, the ground-truth direct effect is +0.80 min of session time for treated users, and the ground-truth spillover effect is +0.20 min for control users who collaborate across workspaces. A naive user-level OLS recovers +0.6723, a 16 percent underestimate of the direct effect, and reports a standard error that is roughly 19 times too small because it treats 50,000 users as independent, even though the treatment was randomized only across 50 clusters. That's not a small error. It's the kind that ships a broken feature launch decision.</p>
<h2 id="heading-what-cluster-randomization-actually-does">What Cluster Randomization Actually Does</h2>
<p>Cluster randomization flips the assignment coin at the workspace level so entire teams land in the same arm, confining most interference to where it belongs and making the residual cross-workspace leakage something you can model directly.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/e47df38e-0f83-4b51-99e2-aee7da95a82e.png" alt="e47df38e-0f83-4b51-99e2-aee7da95a82e" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><em>Figure 1(image ab: Schematic of the SUTVA violation that cluster randomization targets. Every user in a treated workspace (top row, red) sees the AI feature. Every user in a control workspace (bottom row) should see nothing, but collaborators (orange) read AI artifacts that travel through shared Slack, documents, and code reviews. Those spillover-exposed users are partially treated. Cluster randomization doesn't make interference disappear; it confines it to within workspace boundaries, leaving the remaining cross-workspace leakage as an identifiable component that a two-exposure model can estimate directly.</em></p>
<p>If a workspace is treated, every user inside it gets the feature. If it's a control workspace, nobody inside it does. Interference within a workspace is fine because all teammates share the same assignment, and the workspace-level mean captures the full treatment package. The design aims to control interference across workspaces.</p>
<p>The estimator works under a stack of assumptions, and each one has a name worth knowing because the failure modes at the end of this tutorial map directly to specific violations.</p>
<ul>
<li><p><strong>Cluster-level random assignment.</strong> Treatment is assigned at the cluster level by a genuinely random mechanism. Which workspaces land in the treated arm is independent of workspace-level potential outcomes.</p>
</li>
<li><p><strong>Partial interference.</strong> Interference happens inside clusters but not across them (<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC2600548/">Hudgens et al.</a>). A treated user in workspace A can affect her teammate in workspace A, but can't affect a user in workspace B. This is the assumption cluster randomization is built around.</p>
</li>
<li><p><strong>Cluster-level SUTVA.</strong> A workspace's treatment is a single, well-defined package. There's one version of the feature, and within-cluster heterogeneity in exposure is absorbed into the cluster-level effect.</p>
</li>
<li><p><strong>Exchangeability of clusters.</strong> Before the coin flip, the treated and control workspaces are exchangeable. Randomization achieves this by construction.</p>
</li>
<li><p><strong>Sufficient cluster count.</strong> Cluster-robust inference relies on a central limit theorem across clusters. Practitioners often use K ≥ 30 as a working floor, though the appropriate threshold depends on cluster-size heterogeneity and the choice of test statistic. Fewer clusters demand a different inference tool, such as randomization inference or a cluster wild bootstrap.</p>
</li>
</ul>
<p>Partial interference is the underlying assumption of load-bearing here. The whole point of cluster randomization is that cross-cluster spillover is smaller and slower than within-cluster spillover, so treating an entire team contains most of the interference where it's supposed to be (<a href="https://arxiv.org/abs/1305.6979">Ugander et al.</a>). When cross-cluster spillover is meaningful, a two-exposure model directly identifies and estimates that leakage.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You'll need Python 3.11 or newer, comfort with pandas and linear regression, and rough familiarity with ordinary least squares.</p>
<p>Install the packages for this tutorial:</p>
<pre><code class="language-shell">pip install numpy pandas statsmodels scipy matplotlib
</code></pre>
<p><strong>Here's what's happening:</strong> five packages cover the full pipeline. Pandas loads the data and builds the cluster assignment. NumPy handles array arithmetic and bootstrap draws. Statsmodels fits every regression: naive OLS, cluster-weighted least squares, and the two-exposure model with cluster-robust standard errors. Scipy supports the kernel density diagnostic plot, and matplotlib renders it.</p>
<p>Clone the companion repo to get the synthetic dataset:</p>
<pre><code class="language-shell">git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv
</code></pre>
<p><strong>Here's what's happening:</strong> the clone pulls the companion repo, and <code>generate_data.py</code> produces the shared 50,000-user dataset used across the series. Seed 42 keeps the data reproducible. The 50,000-user scale gives enough users per workspace (about 1,000 each) for the cluster-level inference to behave asymptotically. The output CSV lands at <code>data/synthetic_llm_logs.csv</code>.</p>
<h2 id="heading-setting-up-the-working-example">Setting up the Working Example</h2>
<p>The synthetic dataset simulates a SaaS product with 50,000 users spread across 50 workspaces. The collaborative AI feature ships at full coverage to 25 randomly selected workspaces and stays off for the other 25.</p>
<p>A control user is spillover-exposed when they collaborate across workspaces. In this tutorial, <code>opt_in_agent_mode == 1</code> serves as a behavioral proxy for that cross-workspace activity: users who actively opt into AI tooling are the ones reading teammate-authored documents, Slack threads, and pull requests where treated-workspace AI output surfaces. In a production deployment, you'd replace this proxy with an observed collaboration graph such as shared-channel membership, doc co-authorship, or reviewer overlap. Because <code>opt_in_agent_mode</code> reflects a voluntary behavioral choice with no random component, the spillover coefficient in a real experiment would absorb selection differences between opting-in and non-opting-in control users. A production spillover flag should be grounded in the observed collaboration graph; behavioral proxies introduce selection bias that the two-exposure model can't correct.</p>
<p>This tutorial constructs <code>session_minutes_obs</code> from scratch by layering known ground-truth effects onto workspace-level baselines. The CSV's <code>session_minutes</code> column is intentionally set aside. That separation lets you verify that every estimator recovers the effects baked in.</p>
<p>The ground-truth effects baked into the scenario are a +0.80-minute direct effect on treated users and a +0.20-minute spillover effect on spillover-exposed control users. Knowing both values is what lets you verify that your estimator recovers them.</p>
<h2 id="heading-step-1-build-the-cluster-assignment-and-spillover-exposure">Step 1: Build the Cluster Assignment and Spillover Exposure</h2>
<p>The first code block loads the data, assigns workspaces to treatment at the cluster level, flags spillover-exposed users, and constructs an observed outcome where the ground truth is known. The outcome starts from a workspace-level baseline so within-workspace correlation is genuine. It then adds the direct effect for treated users, the spillover effect for exposed control users, and Gaussian noise.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

DIRECT_EFFECT = 0.80
SPILLOVER_EFFECT = 0.20
DATA_SEED = 42
OUTCOME_NOISE_SD = 0.30

df = pd.read_csv("data/synthetic_llm_logs.csv")
rng = np.random.default_rng(DATA_SEED)

df["treated_workspace"] = (df["workspace_id"] &lt; 25).astype(int)
df["treated_user"] = df["treated_workspace"]
df["spillover_exposed"] = (
    (df["treated_workspace"] == 0) &amp; (df["opt_in_agent_mode"] == 1)
).astype(int)

ws_baseline = pd.DataFrame({
    "workspace_id": np.arange(50),
    "ws_baseline": rng.normal(5.0, 0.30, size=50),
})
df = df.merge(ws_baseline, on="workspace_id")
noise = rng.normal(0, OUTCOME_NOISE_SD, size=len(df))
df["session_minutes_obs"] = (
    df["ws_baseline"]
    + DIRECT_EFFECT * df["treated_user"]
    + SPILLOVER_EFFECT * df["spillover_exposed"]
    + noise
)
df["exposure"] = np.select(
    [df["treated_user"] == 1, df["spillover_exposed"] == 1],
    ["direct", "spillover"],
    default="pure_control",
)

print(f"Total users:             {len(df):,}")
print(f"Treated workspaces:      {df[df.treated_workspace == 1].workspace_id.nunique()}")
print(f"Control workspaces:      {df[df.treated_workspace == 0].workspace_id.nunique()}")
print(f"Treated users:           {df.treated_user.sum():,}")
print(f"Pure-control users:      {(df.exposure == 'pure_control').sum():,}")
print(f"Spillover-exposed users: {(df.exposure == 'spillover').sum():,}")
ws_sizes = df.groupby("workspace_id").size()
print(f"Workspace size: min={ws_sizes.min()} median={int(ws_sizes.median())} max={ws_sizes.max()}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Total users:             50,000
Treated workspaces:      25
Control workspaces:      25
Treated users:           24,937
Pure-control users:      18,319
Spillover-exposed users: 6,744
Workspace size: min=923 median=1002 max=1052
</code></pre>
<p><strong>Here's what's happening:</strong> Workspace IDs 0 through 24 become the treated cluster and 25 through 49 become the control cluster, giving you 24,937 treated users and 25,063 control users. Among the controls, 6,744 are flagged as spillover-exposed because they opted into agent mode and sit in a control workspace where they'd plausibly read treated-workspace output through cross-team channels. The remaining 18,319 are pure-control users, untouched by the feature. Workspace sizes range from 923 to 1,052 users, which is close enough to be balanced, so that cluster-weighted and unweighted estimators will behave similarly. The observed outcome <code>session_minutes_obs</code> captures the known ground truth: a treated user adds 0.80 min to their workspace baseline, a spillover-exposed user adds 0.20 min, and every user is subject to Gaussian noise with standard deviation 0.30 min.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/0bc04600-b054-405a-b3cf-d57cc8e24e4d.png" alt="0bc04600-b054-405a-b3cf-d57cc8e24e4d" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><em>Figure 2 (image above): The three exposure groups on the 50,000-user dataset. The top panel shows the observed-outcome distribution for each group, with dashed vertical lines at the group means (5.06 min pure control, 5.27 min spillover-exposed, 5.79 min treated). The spillover distribution sits between the pure-control and treated distributions, which is the contamination a naive user-level estimator would fold into the control baseline. The bottom panel translates the same groups into raw counts: 18,319 pure-control users, 6,744 spillover-exposed control users, and 24,937 treated users. Where Figure 1 schematically showed the SUTVA violation, this figure shows it at the data scale, and the three-group structure is exactly what Step 4's two-exposure model will identify.</em></p>
<h2 id="heading-step-2-naive-user-level-ols-biased-and-overconfident">Step 2: Naive User-Level OLS (Biased and Overconfident)</h2>
<p>The naive analysis ignores clustering entirely and regresses the observed outcome on each user's treatment assignment, reporting a standard error as if every user were an independent draw. Two things go wrong at once.</p>
<pre><code class="language-python">import statsmodels.formula.api as smf

naive = smf.ols("session_minutes_obs ~ treated_user", data=df).fit()
print(f"Naive estimate:  {naive.params['treated_user']:+.4f} min")
print(f"Naive SE:        {naive.bse['treated_user']:.4f}  (under-reported)")
ci = naive.conf_int().loc["treated_user"].tolist()
print(f"Naive 95% CI:    [{ci[0]:+.4f}, {ci[1]:+.4f}]")
print(f"Ground truth:    +0.80")
print(f"Bias:            {naive.params['treated_user'] - 0.80:+.4f} min")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Naive estimate:  +0.6723 min
Naive SE:        0.0034  (under-reported)
Naive 95% CI:    [+0.6656, +0.6790]
Ground truth:    +0.80
Bias:            -0.1277 min
</code></pre>
<p><strong>Here's what's happening:</strong> the point estimate lands at +0.6723, 16 percent below the ground-truth direct effect of +0.80. The bias has two components. First, spillover contamination: 6,744 control users who read treated-workspace output lie above the pure-control baseline, raising the control mean and compressing the naive treated-minus-control gap. Second, workspace baseline imbalance: with only 50 clusters, random assignment doesn't guarantee that treated and control workspace pools draw equal mean baselines. This dataset's specific seed produces a treated-pool baseline slightly below the control-pool baseline, adding additional downward pressure on the estimate. The lesson generalizes: at small K, balance checks on observable workspace characteristics before the experiment are the only defense against pre-existing between-arm differences that no standard-error correction can fix.</p>
<p>The standard error is the more alarming number. At 0.0034, it reflects variation across 50,000 users treated as independent observations, and the resulting 95% confidence interval [+0.6656, +0.6790] excludes the ground truth entirely, at roughly one-twentieth the width the design actually supports. An SE 19 times too small inflates the t-statistic by the same factor, making the naive regression's p-value appear orders of magnitude more significant than the design justifies. A stakeholder reading this report would walk away confident that the direct effect is somewhere near 0.67 min. Wrong number, wrong precision.</p>
<h2 id="heading-step-3-cluster-weighted-least-squares-honest-standard-error">Step 3: Cluster-Weighted Least Squares (Honest Standard Error)</h2>
<p>The fix for the standard error is to aggregate to 50 workspace means, then regress those means on the workspace-level treatment indicator weighted by workspace size. Inference is now based on K = 50 observations.</p>
<pre><code class="language-python">import statsmodels.api as sm

ws = (
    df.groupby("workspace_id")
    .agg(ws_mean=("session_minutes_obs", "mean"),
         ws_size=("user_id", "count"),
         treated=("treated_workspace", "max"))
    .reset_index()
)
X_ws = sm.add_constant(ws["treated"])
wls = sm.WLS(ws["ws_mean"], X_ws, weights=ws["ws_size"]).fit()
wls_ci = wls.conf_int().loc["treated"].tolist()
print(f"WLS cluster-mean contrast: {wls.params['treated']:+.4f} min")
print(f"WLS SE:          {wls.bse['treated']:.4f}  (based on K=50 clusters)")
print(f"WLS 95% CI:      [{wls_ci[0]:+.4f}, {wls_ci[1]:+.4f}]")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">WLS cluster-mean contrast: +0.6723 min
WLS SE:                    0.0652  (based on K=50 clusters)
WLS 95% CI:                [+0.5412, +0.8035]
</code></pre>
<p><strong>Here's what's happening:</strong> the cluster-mean contrast is identical to the naive estimate at +0.6723, because weighted workspace means are a different aggregation of the same user-level data. What changed is the standard error. At 0.0652, it's roughly 19 times larger than the naive 0.0034 and reflects genuine variation across 50 cluster means (statsmodels WLS uses t(48) critical values in place of z=1.96, which is why the CI bounds differ slightly from a hand calculation with z). The 95% confidence interval expands to [+0.5412, +0.8035], which barely covers the ground truth. WLS has fixed the inference problem, so the standard error now reflects the actual design, but it hasn't fixed the identification problem. Control workspace means still includes spillover-exposed users, so this estimate is a contaminated contrast you can't interpret as a clean ATE. The next step separates the two.</p>
<h2 id="heading-step-4-two-exposure-decomposition-unbiased-direct-and-spillover">Step 4: Two-Exposure Decomposition (Unbiased Direct and Spillover)</h2>
<p>The two-exposure model treats each user's exposure as a three-category variable (direct, spillover, or pure control) and regresses the outcome on the two non-baseline categories (<a href="https://projecteuclid.org/journals/annals-of-applied-statistics/volume-11/issue-4/Estimating-average-causal-effects-under-general-interference-with-application-to/10.1214/16-AOAS1005.full">Aronow et al.</a>). Pure control is the omitted reference, so both coefficients are directly interpretable: one is the direct effect of the feature, the other is the spillover effect on control users who collaborate across workspaces.</p>
<pre><code class="language-python">df["is_direct"] = (df["exposure"] == "direct").astype(int)
df["is_spillover"] = (df["exposure"] == "spillover").astype(int)
two_exp = smf.ols(
    "session_minutes_obs ~ is_direct + is_spillover",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["workspace_id"]})
direct = two_exp.params["is_direct"]
spillover = two_exp.params["is_spillover"]
direct_ci = two_exp.conf_int().loc["is_direct"].tolist()
spillover_ci = two_exp.conf_int().loc["is_spillover"].tolist()
print(f"Direct effect:     {direct:+.4f} min  (ground truth = +0.80)")
print(f"  SE:              {two_exp.bse['is_direct']:.4f}")
print(f"  95% CI:          [{direct_ci[0]:+.4f}, {direct_ci[1]:+.4f}]")
print(f"Spillover effect:  {spillover:+.4f} min  (ground truth = +0.20)")
print(f"  SE:              {two_exp.bse['is_spillover']:.4f}")
print(f"  95% CI:          [{spillover_ci[0]:+.4f}, {spillover_ci[1]:+.4f}]")
spillover_share = (df["exposure"] == "spillover").mean()
projected = direct + spillover_share * spillover
print(f"Spillover share of all users: {spillover_share:.4f}")
print(f"Projected total under full rollout: {projected:+.4f} min")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Direct effect:     +0.7284 min  (ground truth = +0.80)
  SE:              0.0647
  95% CI:          [+0.6016, +0.8552]
Spillover effect:  +0.2083 min  (ground truth = +0.20)
  SE:              0.0038
  95% CI:          [+0.2008, +0.2158]
Spillover share of all users: 0.1349
Projected total under full rollout: +0.7565 min
</code></pre>
<p><strong>Here's what's happening:</strong> fitting on the three-category exposure with cluster-robust standard errors keyed to <code>workspace_id</code> yields two clean coefficients. The direct effect is +0.7284, with a 95% CI of [+0.6016, +0.8552], which includes the ground-truth value of +0.80. The spillover effect is +0.2083, with a 95% CI of [+0.2008, +0.2158], which tightly covers the ground-truth +0.20. The spillover SE (0.0038) looks small for cluster-robust inference because the simulated spillover effect is uniform across all 25 control clusters; in real data with heterogeneous spillover intensity, you'll see the cluster-robust SE grow meaningfully larger. The projected total of +0.7565 min accounts for the spillover effect, based on the fraction of users expected to be spillover-exposed at a given deployment scale (0.1349 in this dataset). In a production deployment, you'd replace that fraction with whatever share your collaboration graph predicts will be spillover-exposed under your rollout plan. The projection is a design parameter in your rollout, so state the assumed share explicitly when you report the number.</p>
<h2 id="heading-step-5-cluster-bootstrap-confidence-intervals">Step 5: Cluster-Bootstrap Confidence Intervals</h2>
<p>The cluster bootstrap resamples entire workspaces to test whether Step 4's analytic confidence intervals hold without assuming the central limit theorem has fully kicked in at K = 50. Analytic standard errors for a cluster design work well when K is large, and workspaces are roughly equal in size; the bootstrap confirms this holds in practice for your actual data. Resampling individual users would undercount variance because users in the same workspace share the cluster assignment and the workspace-level baseline; the cluster bootstrap preserves that correlation structure.</p>
<pre><code class="language-python">def naive_point(d):
    return smf.ols(
        "session_minutes_obs ~ treated_user", data=d
    ).fit().params["treated_user"]

def wls_point(d):
    w = (d.groupby("workspace_id").agg(
            ws_mean=("session_minutes_obs", "mean"),
            ws_size=("user_id", "count"),
            treated=("treated_workspace", "max")).reset_index())
    X = sm.add_constant(w["treated"])
    return sm.WLS(w["ws_mean"], X, weights=w["ws_size"]).fit().params["treated"]

def two_exp_point(d):
    fit = smf.ols(
        "session_minutes_obs ~ is_direct + is_spillover", data=d
    ).fit(cov_type="cluster", cov_kwds={"groups": d["workspace_id"]})
    return fit.params["is_direct"], fit.params["is_spillover"]

rng_boot = np.random.default_rng(7)
ws_ids = df["workspace_id"].unique()
k = len(ws_ids)
reps = {"naive": [], "cluster_wls": [], "direct": [], "spillover": []}
for _ in range(500):
    draw = rng_boot.choice(ws_ids, size=k, replace=True)
    sample = pd.concat(
        [df[df["workspace_id"] == wid] for wid in draw],
        ignore_index=True,
    )
    reps["naive"].append(naive_point(sample))
    reps["cluster_wls"].append(wls_point(sample))
    d_b, s_b = two_exp_point(sample)
    reps["direct"].append(d_b)
    reps["spillover"].append(s_b)

for key, truth in [("naive", 0.80), ("cluster_wls", 0.80),
                   ("direct", 0.80), ("spillover", 0.20)]:
    arr = np.array(reps[key])
    lo, hi = np.percentile(arr, [2.5, 97.5])
    covers = "covers" if lo &lt;= truth &lt;= hi else "misses"
    print(f"{key:&lt;13} 95% CI: [{lo:+.4f}, {hi:+.4f}]   ({covers} {truth:+.2f})")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">naive         95% CI: [+0.5386, +0.7966]   (misses +0.80)
cluster_wls   95% CI: [+0.5386, +0.7966]   (misses +0.80)
direct        95% CI: [+0.5931, +0.8519]   (covers +0.80)
spillover     95% CI: [+0.2008, +0.2164]   (covers +0.20)
</code></pre>
<p><strong>Here's what's happening:</strong> drawing 50 workspaces with replacement and refitting each estimator 500 times gives you a bootstrap distribution for every point estimate. The naive OLS and cluster WLS estimators produce identical bootstrap intervals because they share the same point estimate under workspace-level resampling, and both intervals exclude the ground-truth +0.80 because both are biased by the two sources identified in Step 2 (spillover contamination and the workspace baseline imbalance). The direct-effect interval from the two-exposure model is [0.5931, 0.8519], which includes 0.80. The spillover interval is [+0.2008, +0.2164], which tightly covers +0.20. The cluster bootstrap confirms what the analytic cluster-robust standard errors in Step 4 already showed: inference holds up without relying on asymptotic approximations at K = 50. Running this takes about one minute on a laptop.</p>
<h2 id="heading-when-cluster-randomization-fails">When Cluster Randomization Fails</h2>
<p>Cluster randomization solves the SUTVA problem when its assumptions hold, and it produces biased estimates that look clean when they don't. Three failure modes map to a named identification assumption; a fourth addresses estimator efficiency when cluster sizes are unequal.</p>
<p><strong>Too few clusters (violates sufficient cluster count).</strong> Cluster-robust standard errors rely on a central limit theorem across clusters, and practitioners often use K ≥ 30 as a working floor, though the appropriate threshold depends on heterogeneity in cluster sizes and the choice of test statistic (<a href="https://doi.org/10.1002/jae.2600">MacKinnon &amp; Webb, 2017</a>). A collaborative AI feature rolled out to four customer accounts doesn't clear that bar. Cluster-robust standard errors with K = 4 are anticonservative, and the resulting confidence intervals are too narrow. When K is small, randomization inference or a cluster wild bootstrap gives you valid p-values.</p>
<p><strong>Cluster boundary does not contain the interference graph (violates partial interference).</strong> Cluster randomization assumes interference is confined within workspaces. If your users collaborate heavily across workspaces through Slack Connect channels, external shared documents, or customer community forums, partial interference is a fiction, and spillover bleeds across every cluster boundary. The two-exposure model can absorb modest cross-cluster leakage because the spillover coefficient captures whatever spillover your exposure flag measures. When leakage is structural, you need the observed collaboration graph and a graph-cluster randomization design that builds clusters from the collaboration structure itself (<a href="https://arxiv.org/abs/1305.6979">Ugander et al.</a>).</p>
<p><strong>Heterogeneous cluster sizes that bias the aggregation (estimator efficiency).</strong> Equal-weighted cluster means treat a 50-user workspace the same as a 5,000-user workspace, which is a poor efficiency trade when the variance of a workspace's mean depends on the number of users in it. The fix is weighted least squares by workspace size, or a mixed-effects model with workspace random intercepts. This is an efficiency concern with no bearing on identification, and that distinction matters: the point estimate stays consistent under either weighting choice.</p>
<p><strong>Post-hoc cluster construction (violates exchangeability).</strong> Building cluster assignments after observing outcomes is the cleanest way to turn a valid design into p-hacking. You've got to define and commit your clusters before the randomization, ideally in a pre-registered analysis plan. Any post-hoc adjustment to cluster boundaries (dropping a workspace with extreme outcomes, merging small workspaces into a composite, redefining spillover exposure after inspecting the data) reintroduces selection bias that no standard-error correction can fix.</p>
<p>Two additional threats deserve attention in real deployments.</p>
<p><strong>Cluster-level SUTVA fails under partial feature adoption.</strong> The cluster-level SUTVA assumption requires that a workspace's treatment is a single, well-defined package. That breaks down when a feature rolls out at different adoption rates within a single workspace, or when multiple feature versions coexist (advanced for power users, basic for casual users). In that case, the cluster-level "treatment" conflates multiple effects, and the estimand is no longer interpretable.</p>
<p><strong>Workspace-level confounders when randomization isn't mechanical.</strong> In enterprise deployments, workspace selection into the treated arm is often not fully random. Beta programs attract tech-forward accounts; customer success teams influence which clients get early access. When exchangeability is violated before the coin flip, cluster-robust standard errors cannot correct for pre-existing systematic differences between the treated and control workspace pools. A balance check on observable workspace characteristics (size, industry, baseline engagement) and regression adjustment at the cluster level are the standard remedies.</p>
<p>These failure modes stay invisible in your regression coefficients. They surface later, in the gap between the offline estimate and the production rollout. Cluster counts, collaboration graph audits, and a written pre-registration are your only real defenses.</p>
<h2 id="heading-what-to-do-next">What To Do Next</h2>
<p>Cluster randomization is the right tool when collaboration within a workspace creates spillover effects that break user-level SUTVA, and when your clusters are natural and observable (workspaces, teams, accounts, physical stores). If the interference you care about spans geographic markets or occurs over time inside a two-sided marketplace where drivers and riders clear as a whole, switchback experiments that randomize time slots fit better. If your treatment is assigned at the individual level but you suspect unobserved cross-user confounders, an instrumental variable analysis with a design-based instrument provides a cleaner identification strategy. When interference is known and complex, graph-cluster randomization with Horvitz-Thompson weighted exposure estimators gives you unbiased effect estimates without forcing every cluster boundary to contain every interference path.</p>
<p>The companion notebook for this tutorial lives at <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/05_cluster_randomization">github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/05_cluster_randomization</a>. Clone the repo, generate the synthetic dataset, and run <code>cluster_randomization_demo.ipynb</code> (or <code>cluster_randomization_demo.py</code>) to reproduce every code block, every number, and every figure from this tutorial.</p>
<p>When a collaborative AI feature ships to teams who share their work, the user-level A/B estimate is almost always wrong. Cluster randomization plus a two-exposure model gives you the direct effect and the spillover effect separately, and the cluster bootstrap gives you an interval you can defend when a stakeholder asks how much of the lift comes from the feature and how much comes from teammates talking to each other.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Product Experimentation with Synthetic Control: Causal Inference for Global LLM Rollouts in Python ]]>
                </title>
                <description>
                    <![CDATA[ Every product experimentation team doing causal inference on LLM-based features eventually hits the same wall: when the provider ships a new model version, there's no holdout. Your infrastructure team ]]>
                </description>
                <link>https://www.freecodecamp.org/news/product-experimentation-with-synthetic-control-causal-inference-for-global-llm-rollouts-in-python/</link>
                <guid isPermaLink="false">6a02b2a8937b84f7790d481e</guid>
                
                    <category>
                        <![CDATA[ product experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ causal inference ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ synthetic-control ]]>
                    </category>
                
                    <category>
                        <![CDATA[ generative ai ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rudrendu Paul ]]>
                </dc:creator>
                <pubDate>Tue, 12 May 2026 04:55:04 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/06d252e7-e613-46c7-b5ce-c5daa14cec21.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every product experimentation team doing causal inference on LLM-based features eventually hits the same wall: when the provider ships a new model version, there's no holdout.</p>
<p>Your infrastructure team upgrades every workspace from Claude 4.5 to Claude 4.6 overnight. All 50 production workspaces get the new model at the same time. A week later, task completion climbs across the board. The head of product calls it a win.</p>
<p>But you know something's off. No holdout group ran 4.5 through the upgrade week. The naïve before/after picks up whatever else changed that week alongside the model: a new onboarding flow, a seasonal uptick, a high-profile customer onboarding.</p>
<p>This is the Global Rollout Problem. It appears whenever a team ships a model upgrade to the entire user base simultaneously. For product teams running generative AI features, it's one of the most common measurement traps in the stack. Staged rollouts buy you a control group, global rollouts eliminate it.</p>
<p>In 2026, global model upgrades are the norm: every API provider pushes new versions, and every team using Claude, GPT, or Gemini has experienced the sudden jump from one version to the next with no opt-out.</p>
<p>Synthetic control is the tool that data scientists use when the control group is missing. You build a weighted combination of untreated units (other workspaces or regions that weren't upgraded at the same time) whose pre-upgrade behavior matches that of the treated unit. Compare the treated unit to its synthetic twin after the upgrade, and the gap is the causal estimate, conditional on three identification assumptions that we'll name explicitly.</p>
<p>In this tutorial, you'll build a synthetic control from scratch in Python using <code>scipy.optimize</code>, apply it to a 50,000-user synthetic SaaS dataset, and validate with a placebo permutation test, leave-one-out donor sensitivity, and a cluster bootstrap 95% confidence interval.</p>
<p><strong>Companion code:</strong> every code block runs end-to-end in the companion notebook at <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/04_synthetic_control">github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/04_synthetic_control</a>. The notebook (<code>synthetic_control_demo.ipynb</code>) has all outputs pre-executed, so you can read along on GitHub before running anything locally.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-why-global-rollouts-break-naive-measurement">Why Global Rollouts Break Naïve Measurement</a></p>
</li>
<li><p><a href="#heading-what-synthetic-control-actually-does">What Synthetic Control Actually Does</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-setting-up-the-working-example">Setting Up the Working Example</a></p>
</li>
<li><p><a href="#heading-step-1-fit-donor-weights-with-slsqp">Step 1: Fit Donor Weights with SLSQP</a></p>
</li>
<li><p><a href="#heading-step-2-plot-treated-vs-synthetic-control-trajectories">Step 2: Plot Treated vs Synthetic Control Trajectories</a></p>
</li>
<li><p><a href="#heading-step-3-in-space-placebo-permutation-test">Step 3: In-Space Placebo Permutation Test</a></p>
</li>
<li><p><a href="#heading-step-4-leave-one-out-donor-sensitivity">Step 4: Leave-One-Out Donor Sensitivity</a></p>
</li>
<li><p><a href="#heading-step-5-cluster-bootstrap-95-confidence-intervals">Step 5: Cluster Bootstrap 95% Confidence Intervals</a></p>
</li>
<li><p><a href="#heading-when-synthetic-control-fails">When Synthetic Control Fails</a></p>
</li>
<li><p><a href="#heading-what-to-do-next">What to Do Next</a></p>
</li>
</ul>
<h2 id="heading-why-global-rollouts-break-naive-measurement">Why Global Rollouts Break Naïve Measurement</h2>
<p>The math of an A/B test is elegant because of one assumption: treatment assignment is independent of everything else. Flip a coin: half your workspaces get Claude 4.6, and half stay on 4.5. The coin flip breaks every possible confound. The global rollout world has no coin.</p>
<p>Three mechanisms make the naive before/after misleading.</p>
<ol>
<li><p><strong>Co-occurring product changes:</strong> Shipping a model upgrade rarely happens in isolation. The same week, the onboarding team ships a redesigned tutorial, the pricing team runs a promotion, or customer success reaches out to enterprise accounts about the new capabilities. Your before/after picks up the sum.</p>
</li>
<li><p><strong>Seasonal and market drift:</strong> Weekly usage patterns, monthly billing cycles, and quarterly procurement cycles all move outcome metrics. A 3 pp lift in week 20 looks like the model upgrade, but in fact, users returned from spring break.</p>
</li>
<li><p><strong>Peer-company dynamics:</strong> A competitor releases a buggy update, and your users migrate over for a week. Your task completion rate spikes because the new users had easier queries, with zero contribution from the model itself.</p>
</li>
</ol>
<p>All three produce the same symptom: a raw before/after that folds the upgrade's causal effect together with the causal effect of every other week-20 event.</p>
<p>In this tutorial's dataset, the naïve gap is +0.0515, nearly equal to the ground-truth +0.05. That coincidence is the scariest failure mode: the naive number sometimes lands correctly by accident, and without a counterfactual, you can't tell luck from truth.</p>
<h2 id="heading-what-synthetic-control-actually-does">What Synthetic Control Actually Does</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/d06bde67-30dd-4bc4-b019-5189ac5424a7.png" alt="d06bde67-30dd-4bc4-b019-5189ac5424a7" style="display:block;margin:0 auto" width="1517" height="887" loading="lazy">

<p><em>Figure 1 (above): Schematic of the synthetic control construction. The gray curves are donor workspaces that remain on the old model. The dashed navy curve is the weighted combination of donors that best tracks the treated unit (red) during the pre-treatment window marked by the blue bracket below the x-axis.</em></p>
<p><em>After the treatment date (week 20, dotted vertical line), the weights stay frozen, and the dashed curve projects forward as the counterfactual, while the treated unit moves upward. The gap between the two curves in the post-treatment window is the causal-effect estimate.</em></p>
<p><em>The key design choice the figure illustrates is that weights are fit once, using only pre-treatment data, and never refit using post-treatment data.</em></p>
<p>Synthetic control finds a weighted combination of untreated units whose outcome trajectory closely matches the treated unit's in the pre-treatment period. Once the weights are fixed, you project the synthetic unit's trajectory forward into the post-treatment period and read off the gap between the two lines.</p>
<p>In your AI product context: if wave-2 workspaces didn't get the model upgrade at the same time as wave-1 workspaces, each wave-2 workspace is a candidate donor. The optimizer finds the combination of wave-2 workspaces whose weighted pre-upgrade trajectory best matches wave 1's. After week 20 (when wave 1 was upgraded), the gap between wave 1 and its synthetic twin is the causal-effect estimate, provided that the following three identification assumptions hold.</p>
<p>These identification assumptions work together.</p>
<ul>
<li><p>First, <strong>pre-period fit</strong> (the convex-hull condition): the treated unit's pre-treatment trajectory must lie inside the convex hull of the donor trajectories, which is what the non-negativity and sum-to-1 constraints enforce.</p>
</li>
<li><p>Second, <strong>no interference for donors</strong> (SUTVA for the donor pool): the treatment on the treated unit must not affect the donors. Shared API rate-limit pools or users migrating between workspaces both break this.</p>
</li>
<li><p>Third, <strong>stable donor composition</strong>: the donors must not experience structural breaks unrelated to the treatment during the post-period. Violate any one, and the gap is biased even when the pre-period fit looks perfect. The failure modes section walks through each.</p>
</li>
</ul>
<p>One geometric note: with T₀ pre-treatment periods and J donors, pre-period overfitting becomes serious when J approaches T₀. This tutorial runs with T₀ = 20 and J = 25, which sits in the danger zone. The LOO sensitivity step later is the right diagnostic for whether the fit reflects genuine comparability or overfitting.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You'll need Python 3.11 or newer, comfort with pandas and numpy, and familiarity with basic constrained optimization.</p>
<p>Install the packages for this tutorial:</p>
<pre><code class="language-shell">pip install numpy pandas scipy matplotlib
</code></pre>
<p><strong>Here's what's happening:</strong> four packages cover the full pipeline. Pandas loads the user-level log, NumPy handles panel arithmetic, SciPy provides the SLSQP solver to enforce the convex-combination constraint on the donor weights, and matplotlib renders the trajectory plot and the placebo distribution.</p>
<p>Clone the companion repo to get the synthetic dataset:</p>
<pre><code class="language-shell">git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv
</code></pre>
<p><strong>Here's what's happening:</strong> the clone pulls the companion repo, and <code>generate_data.py</code> produces the shared synthetic dataset used across the series. Seed 42 keeps the dataset reproducible, and 50,000 users give a clean signal for the estimator in this tutorial. The output CSV lands at <code>data/synthetic_llm_logs.csv</code>.</p>
<h2 id="heading-setting-up-the-working-example">Setting Up the Working Example</h2>
<p>The synthetic dataset simulates a SaaS product with 50,000 users spread across 50 workspaces. Workspaces 0 through 24 are in wave 1, which received the model upgrade at week 20. Workspaces 25 through 49 are in wave 2, which stayed on the old model through week 29.</p>
<p>The ground-truth causal effect baked into the data generator is a +5 percentage-point increase in task completion for wave-1 users in the post-treatment period. You know the truth, so you can check what the synthetic control recovers.</p>
<p>Load the data and aggregate to a workspace-by-week panel:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")

PRE = 20         # weeks 0-19 are pre-treatment
WINDOW = 30      # analysis window weeks 0-29

df_window = df[df.signup_week &lt; WINDOW].copy()

panel = (
    df_window.groupby(["workspace_id", "signup_week"])
    ["task_completed"].mean().reset_index()
)
panel.columns = ["workspace_id", "week", "task_completed"]

pivot = panel.pivot(
    index="week", columns="workspace_id", values="task_completed"
)
pivot = pivot.interpolate(method="linear", axis=0).ffill().bfill()

ws_wave = df.groupby("workspace_id").wave.first()
wave1_ws = sorted(ws_wave[ws_wave == 1].index.tolist())
wave2_ws = sorted(ws_wave[ws_wave == 2].index.tolist())

treated_series = pivot[wave1_ws].mean(axis=1).values
donor_matrix = pivot[wave2_ws].values

print(f"Treated series shape: {treated_series.shape}")
print(f"Donor matrix shape:   {donor_matrix.shape}")
print(f"Users per workspace-week: ~{len(df_window) / (50 * WINDOW):.1f}")
print(f"Pre-period treated mean  (weeks 0-19):  {treated_series[:PRE].mean():.4f}")
print(f"Post-period treated mean (weeks 20-29): {treated_series[PRE:].mean():.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Treated series shape: (30,)
Donor matrix shape:   (30, 25)
Users per workspace-week: ~19.2
Pre-period treated mean  (weeks 0-19):  0.5927
Post-period treated mean (weeks 20-29): 0.6421
</code></pre>
<p><strong>Here's what's happening:</strong> you restrict to the 30-week window, aggregate user rows to a workspace-by-week panel, and reshape so rows are weeks and columns are workspaces. Interpolation fills any missing cells (each cell averages about 19 users). The treated series is the mean across all 25 wave-1 workspaces, pooling roughly 480 users per week to smooth cell-level noise.</p>
<p>The donor matrix keeps each wave-2 workspace as a separate column: 25 time series, each covering weeks 0 through 29. The pre-period treated mean of 0.5927 and the post-period mean of 0.6421 yield a raw before/after gap of +5.15 pp, which coincidentally sits near the ground-truth +5 pp and is contaminated by everything else that moved in weeks 20 through 29.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/9b5d9711-9632-41ec-9c38-5ad531ca676f.png" alt="9b5d9711-9632-41ec-9c38-5ad531ca676f" style="display:block;margin:0 auto" width="1454" height="1027" loading="lazy">

<p><em>Figure 2: The diagnostic on the real 50,000-user dataset. Top panel: wave 1's trajectory in red and the fitted synthetic control in navy dashed, with pre-period RMSE of 3.74 pp and a post-treatment gap averaging +8.29 pp. Bottom panel: the placebo distribution built by re-fitting the synthetic control with each of the 25 donor workspaces standing in as the placebo treated unit. The observed gap lies outside the full placebo range, which drives the pseudo p-value in Step 3.</em></p>
<p><em>Where Figure 1 schematically showed the method, this figure shows that it produces a pre-period fit tight enough to make the post-period gap interpretable and a placebo distribution that discriminates the observed effect from noise.</em></p>
<h2 id="heading-step-1-fit-donor-weights-with-slsqp">Step 1: Fit Donor Weights with SLSQP</h2>
<p>The synthetic control weight vector <code>w</code> is the solution to a constrained optimization problem: minimize the pre-period mean squared error between the treated series and the weighted combination of donor series, subject to each weight being in [0, 1] and all weights summing to 1. The non-negativity and sum-to-1 constraints together define a convex combination, which is what prevents extrapolation beyond the support of the donor pool.</p>
<pre><code class="language-python">from scipy.optimize import minimize

n_donors = len(wave2_ws)
Y_pre = treated_series[:PRE]
D_pre = donor_matrix[:PRE, :]

def objective(w):
    return np.mean((Y_pre - D_pre @ w) ** 2)

w0 = np.ones(n_donors) / n_donors
bounds = [(0, 1)] * n_donors
constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1}]

result = minimize(
    objective, w0, method="SLSQP", bounds=bounds,
    constraints=constraints,
    options={"ftol": 1e-12, "maxiter": 5000},
)
w_opt = result.x

pre_mse = float(np.mean((Y_pre - D_pre @ w_opt) ** 2))
pre_rmse = float(np.sqrt(pre_mse))
nz = int((w_opt &gt; 0.001).sum())

print(f"Optimization converged: {result.success}")
print(f"Non-zero donor weights (|w| &gt; 0.001): {nz}")
print(f"Pre-period MSE:  {pre_mse:.6f}")
print(f"Pre-period RMSE: {pre_rmse:.4f}  "
      f"({pre_rmse * 100:.2f} percentage points)")

synth_full = donor_matrix @ w_opt
gap = float((treated_series[PRE:] - synth_full[PRE:]).mean())
print(f"\nObserved post-period gap: {gap:+.4f}  (ground truth = +0.0500)")

nz_pairs = sorted(
    [(ws, w_opt[i]) for i, ws in enumerate(wave2_ws) if w_opt[i] &gt; 0.001],
    key=lambda x: -x[1]
)
print("\nTop 5 donor weights:")
for ws_id, weight in nz_pairs[:5]:
    print(f"  workspace {ws_id}: w = {weight:.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Optimization converged: True
Non-zero donor weights (|w| &gt; 0.001): 12
Pre-period MSE:  0.001400
Pre-period RMSE: 0.0374  (3.74 percentage points)

Observed post-period gap: +0.0829  (ground truth = +0.0500)

Top 5 donor weights:
  workspace 35: w = 0.2016
  workspace 40: w = 0.1900
  workspace 25: w = 0.1638
  workspace 32: w = 0.0872
  workspace 36: w = 0.0784
</code></pre>
<p><strong>Here's what's happening:</strong> the <code>objective</code> function computes the mean squared error between the treated pre-period series and the dot product of the donor matrix with the weight vector.</p>
<p>SLSQP handles the non-negativity bounds and the sum-to-1 equality constraint simultaneously. The <code>w &gt; 0.001</code> threshold classifies 12 donors as non-zero. SLSQP doesn't guarantee exact zeros at inactive constraints, so the threshold is a display convention. Pre-period RMSE of 3.74 pp measures how closely the weighted donors tracked the treated unit before the upgrade. The observed post-period gap of +0.0829 is the headline estimate, which overshoots the ground-truth +5 pp, as Step 5 quantifies with a confidence interval.</p>
<p>The weights are fixed at the end of the pre-period and never re-estimated using post-treatment data. Any divergence after week 20 reflects movement the optimizer had no opportunity to fit.</p>
<h2 id="heading-step-2-plot-treated-vs-synthetic-control-trajectories">Step 2: Plot Treated vs Synthetic Control Trajectories</h2>
<p>The primary visual diagnostic for synthetic control is the trajectory overlay: plot both series together, mark the treatment date, and confirm that the synthetic control tracks the treated unit in the pre-period and that a gap opens in the post-period.</p>
<p>A tight pre-period fit is the visible signal that the identification condition holds. A ragged fit means the treated unit is outside the convex hull of the donors, and the whole exercise is suspect.</p>
<pre><code class="language-python">import matplotlib.pyplot as plt

weeks = np.arange(WINDOW)

fig, ax = plt.subplots(figsize=(9, 4.5))
ax.plot(weeks, treated_series, marker="o", linewidth=1.8,
        color="#C44E52", label="Wave 1 (treated)")
ax.plot(weeks, synth_full, marker="s", linestyle="--",
        linewidth=1.8, color="#4C72B0", label="Synthetic control")
ax.axvline(PRE, color="#555555", linestyle=":", linewidth=1.4,
           label="Model upgrade (week 20)")
ax.set_xlabel("Signup week")
ax.set_ylabel("Mean task completion rate")
ax.set_title("Treated unit vs synthetic control")
ax.legend(frameon=False)
plt.tight_layout()
plt.show()

post_gap = treated_series[PRE:] - synth_full[PRE:]
print("Post-period weekly gaps (treated minus synthetic):")
for wk, g in zip(range(PRE, WINDOW), post_gap):
    print(f"  week {wk}: {g:+.4f}")
print(f"\nMean gap: {post_gap.mean():+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Post-period weekly gaps (treated minus synthetic):
  week 20: +0.0398
  week 21: +0.1663
  week 22: +0.1019
  week 23: +0.1535
  week 24: +0.1071
  week 25: +0.1047
  week 26: +0.0424
  week 27: +0.0326
  week 28: +0.0327
  week 29: +0.0479

Mean gap: +0.0829
</code></pre>
<p><strong>Here's what's happening:</strong> the two lines track each other in the pre-period, confirming the fit assumption. After week 20, the treated series moves above the synthetic control, and the weekly gaps are all positive with a mean of +8.29 pp.</p>
<p>The spread across weeks (from +3.26 pp to +16.63 pp) is how much week-to-week noise the estimator absorbs. A single bad week could swing the mean by a percentage point, which is why the placebo and LOO steps that follow matter more than any single point estimate.</p>
<h2 id="heading-step-3-in-space-placebo-permutation-test">Step 3: In-Space Placebo Permutation Test</h2>
<p>You can't run a standard t-test on a single treated unit. The synthetic control has one treated observation (wave 1) and 25 donor observations, which is not a setup for which any conventional p-value applies.</p>
<p>The standard validation is the in-space placebo permutation test. Treat each donor in turn as if it were the "treated" unit, re-fit the synthetic control using the remaining 24 donors as its placebo pool, record the placebo post-period gap, and compare the observed gap to the distribution of placebos.</p>
<pre><code class="language-python">placebo_gaps = []

for j in range(n_donors):
    placebo_treated = donor_matrix[:, j]
    placebo_pool = np.delete(donor_matrix, j, axis=1)
    n_p = placebo_pool.shape[1]

    def obj_p(w):
        return np.mean((placebo_treated[:PRE] - placebo_pool[:PRE] @ w) ** 2)

    res_p = minimize(
        obj_p, np.ones(n_p) / n_p, method="SLSQP",
        bounds=[(0, 1)] * n_p,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
        options={"ftol": 1e-12, "maxiter": 5000},
    )
    synth_p = placebo_pool @ res_p.x
    placebo_gaps.append((placebo_treated[PRE:] - synth_p[PRE:]).mean())

placebo_gaps = np.array(placebo_gaps)
observed_gap = gap

rank = int((np.abs(placebo_gaps) &gt;= abs(observed_gap)).sum())
pseudo_p = (rank + 1) / (len(placebo_gaps) + 1)

print(f"Observed gap:      {observed_gap:+.4f}")
print(f"Placebo mean gap:  {placebo_gaps.mean():+.4f}")
print(f"Placebo std gap:   {placebo_gaps.std():.4f}")
print(f"Placebo gap range: [{placebo_gaps.min():+.4f}, "
      f"{placebo_gaps.max():+.4f}]")
print(f"|placebo| &gt;= |observed|: {rank} of {len(placebo_gaps)}")
print(f"Pseudo p-value: {pseudo_p:.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Observed gap:      +0.0829
Placebo mean gap:  -0.0008
Placebo std gap:   0.0380
Placebo gap range: [-0.0748, +0.0707]
|placebo| &gt;= |observed|: 0 of 25
Pseudo p-value: 0.0385
</code></pre>
<p><strong>Here's what's happening:</strong> the loop iterates over all 25 wave-2 workspaces. For each one, you remove it from the donor pool, treat it as a placebo-treated unit, and re-run the SLSQP optimization. After 25 placebo runs, you count how many placebo gaps meet or exceed the observed gap in absolute value and apply the conservative (count + 1) / (N + 1) correction.</p>
<p>None of the 25 placebos produced a gap as extreme as the observed +0.0829, yielding a pseudo-p-value of 0.0385. That rejects the null of no effect at the 5% level. The placebo distribution centers near zero (mean -0.0008, std 3.80 pp), which is the noise floor to compare the observed gap against.</p>
<p>The correct statistical statement is: the observed gap is more extreme than any placebo drawn from untreated donors at the 5% level. The permutation test's power depends on the donor pool size: with 25 donors, the smallest possible pseudo-p is 1/26 = 0.0385, so you can't get a smaller p-value with this donor count. A wider placebo distribution or a smaller observed gap would rank the observation inside the placebo bulk and push the pseudo p above any useful threshold.</p>
<h2 id="heading-step-4-leave-one-out-donor-sensitivity">Step 4: Leave-One-Out Donor Sensitivity</h2>
<p>A tight point estimate can still be fragile if it hangs on a single donor. The leave-one-out (LOO) sensitivity check drops each non-zero-weight donor in turn, refits the synthetic control on the remaining donors, and records the new gap.</p>
<p>Abadie (2021) recommends this as the first-line robustness check. If removing any single donor swings the gap by a large amount, you don't have a synthetic control&nbsp;– you have a single-donor comparison dressed up with extra weight.</p>
<pre><code class="language-python">def fit_and_gap(treated, donors, pre=PRE):
    n = donors.shape[1]
    def obj(w):
        return np.mean((treated[:pre] - donors[:pre] @ w) ** 2)
    res = minimize(
        obj, np.ones(n) / n, method="SLSQP",
        bounds=[(0, 1)] * n,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
        options={"ftol": 1e-12, "maxiter": 5000},
    )
    synth = donors @ res.x
    return float((treated[pre:] - synth[pre:]).mean())


nz_idx = np.where(w_opt &gt; 0.001)[0]
loo_rows = []
for j in nz_idx:
    kept = np.delete(donor_matrix, j, axis=1)
    gap_new = fit_and_gap(treated_series, kept)
    loo_rows.append({
        "dropped_workspace": int(wave2_ws[j]),
        "dropped_weight": float(w_opt[j]),
        "new_gap": gap_new,
    })
loo_df = pd.DataFrame(loo_rows).sort_values("dropped_weight", ascending=False)
print(loo_df.round(4).to_string(index=False))
print(f"\nLOO gap range: [{loo_df.new_gap.min():+.4f}, "
      f"{loo_df.new_gap.max():+.4f}]")
print(f"Original gap:  {gap:+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python"> dropped_workspace  dropped_weight  new_gap
                35          0.2016   0.0945
                40          0.1900   0.0756
                25          0.1638   0.0932
                32          0.0872   0.0868
                36          0.0784   0.0739
                31          0.0718   0.0858
                29          0.0648   0.0782
                26          0.0439   0.0786
                27          0.0364   0.0867
                46          0.0350   0.0794
                39          0.0192   0.0848
                42          0.0078   0.0839

LOO gap range: [+0.0739, +0.0945]
Original gap:  +0.0829
</code></pre>
<p><strong>Here's what's happening:</strong> the loop drops one non-zero-weight donor at a time and refits. All 12 LOO estimates stay positive, with the range [+7.39 pp, +9.45 pp] straddling the original +8.29 pp by about a percentage point in either direction.</p>
<p>No single donor drives the result. Even dropping workspace 35 (the largest weight at 0.2016) only shifts the gap to +9.45 pp because the optimizer redistributes weight across remaining donors.</p>
<p>That redistribution is the point of convex-combination weighting: many near-equivalent donor mixtures produce similar counterfactuals.</p>
<h2 id="heading-step-5-cluster-bootstrap-95-confidence-intervals">Step 5: Cluster Bootstrap 95% Confidence Intervals</h2>
<p>Point estimates are only half the story. A stakeholder asking "how sure are you" wants an interval. The classical non-parametric bootstrap doesn't apply cleanly to synthetic control on a single treated unit, because resampling the one treated time series with replacement destroys the time-ordering that the estimator depends on.</p>
<p>A valid substitute is the user-level cluster bootstrap: resample users with replacement, rebuild the workspace-by-week panel from the resampled user log, re-fit the donor weights on the pre-period, and record the post-period gap.</p>
<p>Repeat 500 times. The 2.5th and 97.5th percentiles of the resulting distribution are the 95% CI.</p>
<pre><code class="language-python">def build_panel(df_inner):
    dfw = df_inner[df_inner.signup_week &lt; WINDOW].copy()
    panel = (dfw.groupby(["workspace_id", "signup_week"])
             ["task_completed"].mean().reset_index())
    panel.columns = ["workspace_id", "week", "task_completed"]
    piv = panel.pivot(index="week", columns="workspace_id",
                      values="task_completed")
    piv = piv.interpolate(method="linear", axis=0).ffill().bfill()
    ws_wave_b = df_inner.groupby("workspace_id").wave.first()
    w1 = sorted(ws_wave_b[ws_wave_b == 1].index.tolist())
    w2 = sorted(ws_wave_b[ws_wave_b == 2].index.tolist())
    return piv[w1].mean(axis=1).values, piv[w2].values


rng = np.random.default_rng(7)
n = len(df)
n_reps = 500
gaps_boot = np.empty(n_reps)
for i in range(n_reps):
    sample = df.iloc[rng.integers(0, n, size=n)]
    t_b, d_b = build_panel(sample)
    gaps_boot[i] = fit_and_gap(t_b, d_b)

lo = float(np.percentile(gaps_boot, 2.5))
hi = float(np.percentile(gaps_boot, 97.5))
print(f"Post-period gap 95% CI: [{lo:+.4f}, {hi:+.4f}]")
print(f"Observed point estimate: {gap:+.4f}")
print(f"Ground truth +0.0500 inside CI: "
      f"{'YES' if lo &lt;= 0.05 &lt;= hi else 'NO'}")
print(f"Zero inside CI: {'YES' if lo &lt;= 0 &lt;= hi else 'NO'}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Post-period gap 95% CI: [+0.0511, +0.1215]
Observed point estimate: +0.0829
Ground truth +0.0500 inside CI: NO
Zero inside CI: NO
</code></pre>
<p><strong>Here's what's happening:</strong> you resample the user log 500 times, rebuild the panel from each resample, re-fit the weights on the pre-period, and take the 2.5th and 97.5th percentiles of the 500 resulting gaps. The 95% CI is [+5.11 pp, +12.15 pp]. It excludes zero with room to spare, so the effect is statistically meaningful.</p>
<p>The lower bound sits just above the +5 pp ground truth: a finite-sample upward bias typical of synthetic control on small donor panels, where each donor workspace (about 19 users per week) carries more noise than the 25-workspace treated average.</p>
<p>Placebo, LOO, and bootstrap together confirm a real positive effect. The point-estimate bias is the tradeoff for using single-workspace donors.</p>
<p>For a stakeholder report, cite the interval alongside the point estimate and note the bias direction so the team reads the number with the right calibration.</p>
<h2 id="heading-when-synthetic-control-fails">When Synthetic Control Fails</h2>
<p>Synthetic control is a precise tool with narrow failure modes. The four most common map directly to the three identification assumptions.</p>
<h3 id="heading-1-donor-pool-contamination-violates-no-interference">1. Donor Pool Contamination (Violates No Interference)</h3>
<p>If the upgrade shipped to wave 1 spills over to wave 2 (shared API rate-limit pools, shared prompt caches, users migrating between workspaces), the donors are contaminated, and the gap understates the true effect.</p>
<p>The defense is institutional: audit what changed for donor units around the treatment date, explicitly including model-level channels like shared routing, shared caching, and shared monitoring.</p>
<h3 id="heading-2-fundamentally-different-units-violates-pre-period-fit">2. Fundamentally Different Units (Violates Pre-period Fit)</h3>
<p>The convex-hull condition states that the treated unit must lie within the donors' support. If the treated unit is structurally different (for example, enterprise customers where every donor is an SMB), no weighting scheme yields a credible counterfactual, regardless of how tight the pre-period fit appears.</p>
<p>Check the weights: if the optimizer assigns 80 percent to a single donor, that donor is doing the entire job, and you should ask whether it's truly comparable.</p>
<h3 id="heading-3-post-treatment-shocks-to-donors-violate-stable-donor-composition">3. Post-Treatment Shocks to Donors (Violate Stable Donor Composition)</h3>
<p>The synthetic control projects donor behavior forward from pre-period weights. If a key donor experiences a major shock after treatment (a customer churn, an outage, a competitor release), its post-treatment trajectory is no longer a clean counterfactual. Inspect the time series of high-weight donors for unusual post-treatment patterns.</p>
<h3 id="heading-4-overfitting-risk-when-j-approaches-t-degrades-pre-period-fit-in-practice">4. Overfitting Risk When J Approaches T₀ (Degrades Pre-period Fit in Practice)</h3>
<p>The optimizer can fit the pre-period solely to noise when J ≥ T₀, creating the illusion of comparability. This tutorial runs at T₀/J = 20/25 = 0.8, in the danger zone. The LOO sensitivity check is the practical defense: if the gap holds up across donor drops, the fit reflects genuine comparability.</p>
<p>These failure modes stay invisible in your point estimate. They surface as a synthetic control that looks well-fit on paper and produces a gap that doesn't hold up when treatment rolls out to the next wave. Placebo test, LOO sensitivity, and bootstrap together are your defense.</p>
<h2 id="heading-what-to-do-next">What to Do Next</h2>
<p>Synthetic control is the right tool when your feature ships globally and there's a pool of untreated units resembling the treated unit.</p>
<p>If treated and donor units operate at different scales, <strong>augmented synthetic control</strong> adds a bias-correction term from a linear outcome model. If you have many treated units with staggered adoption, <strong>generalized synthetic control</strong> (the <code>gsynth</code> R package) extends the framework.</p>
<p>For production Python work, <code>pysyncon</code> implements the full Abadie-Diamond-Hainmueller estimator with predictor-weighting via a V-matrix outer loop and adds in-time placebo tests (assigning the treatment to a pre-period date and checking for a spurious gap) that this tutorial doesn't cover. The from-scratch implementation here shows that the mechanics <code>pysyncon</code> is what you ship to a reviewer.</p>
<p>The companion notebook for this tutorial lives at <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/04_synthetic_control">github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/04_synthetic_control</a>. Clone the repo, generate the synthetic dataset, and run <code>synthetic_control_demo.ipynb</code> (or <code>synthetic_control_demo.py</code>) to reproduce every code block, every number, and every figure from this tutorial.</p>
<p>When a model upgrade ships to every user at once, the naive before/after is usually the wrong number. Synthetic control builds "users like yours who didn't get the upgrade" from the data you already have, locks in the weights before the treatment week, and gives you a placebo distribution plus a bootstrap interval you can defend when a stakeholder asks how confident you are.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Product Experimentation with Regression Discontinuity: How an LLM Confidence Threshold Creates a Natural Experiment in Python ]]>
                </title>
                <description>
                    <![CDATA[ Causal inference for LLM-based features starts with one question editors ask before they ship anything: Did the change actually move the metric, or did the metric just move? Let's say that your team b ]]>
                </description>
                <link>https://www.freecodecamp.org/news/gen-ai-product-experimentation-with-regression-discontinuity-design/</link>
                <guid isPermaLink="false">69fe0255f239332df4da1c33</guid>
                
                    <category>
                        <![CDATA[ product experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ causal inference ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ regression-discontinuity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ experimentation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rudrendu Paul ]]>
                </dc:creator>
                <pubDate>Fri, 08 May 2026 15:33:41 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/a6b7e375-8638-4b98-824e-bb94c60e9e57.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Causal inference for LLM-based features starts with one question editors ask before they ship anything: Did the change actually move the metric, or did the metric just move?</p>
<p>Let's say that your team built a routing layer that splits incoming queries between two models: queries with a confidence score below 0.85 go to a premium model, and those above 0.85 go to a cheaper distilled model. The premium model costs 5x as much as the cheaper one.</p>
<p>Your boss wants the answer that ends the debate: Is the premium model worth it for the queries it sees?</p>
<p>You can't run a clean A/B test, because routing is deterministic: a query at confidence 0.84 always gets premium, a query at 0.86 always gets cheap, and you can't randomize the assignment.</p>
<p>You also can't trust a naïve comparison of premium-routed users against cheap-routed users. Premium handles the harder queries by design (that's the reason you built the gate), so the two groups differ in query difficulty before either model touches them.</p>
<p>The threshold itself is your free experiment. Right at 0.85, the assignment flips, but the queries on either side of that boundary are essentially identical. A query at confidence 0.849 isn't meaningfully different from a query at 0.851. Any differences in outcomes between the two narrow groups stem solely from the routing decision. That's what regression discontinuity design (RDD) reads.</p>
<p>In this tutorial, you'll use Python to estimate the causal effect of premium routing on task completion using sharp RDD with local linear regression. You'll sweep bandwidths to test estimate stability, run a manipulation diagnostic, check robustness with a quadratic specification, and bootstrap 95% confidence intervals around every point estimate.</p>
<p>The LLM telemetry is a 50,000-user synthetic dataset with the ground-truth premium-routing effect baked in at +6 percentage points, so you can verify that RDD recovers it.</p>
<p><strong>Companion code:</strong> every code block runs end-to-end <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/03_rdd_confidence_threshold">in the companion notebook</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-why-threshold-routing-is-a-natural-experiment">Why Threshold Routing is a Natural Experiment</a></p>
</li>
<li><p><a href="#heading-what-regression-discontinuity-actually-does">What Regression Discontinuity Actually Does</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-setting-up-the-working-example">Setting Up the Working Example</a></p>
</li>
<li><p><a href="#heading-step-1-a-sharp-rdd-with-local-linear-regression">Step 1: A Sharp RDD with Local Linear Regression</a></p>
</li>
<li><p><a href="#heading-step-2-try-different-bandwidths">Step 2: Try Different Bandwidths</a></p>
</li>
<li><p><a href="#heading-step-3-checking-for-manipulation-at-the-threshold">Step 3: Checking for Manipulation at the Threshold</a></p>
</li>
<li><p><a href="#heading-step-4-quadratic-specification-as-a-robustness-check">Step 4: Quadratic Specification as a Robustness Check</a></p>
</li>
<li><p><a href="#heading-step-5-bootstrap-confidence-intervals">Step 5: Bootstrap Confidence Intervals</a></p>
</li>
<li><p><a href="#heading-when-regression-discontinuity-fails">When Regression Discontinuity Fails</a></p>
</li>
<li><p><a href="#heading-what-to-do-next">What to Do Next</a></p>
</li>
</ul>
<h2 id="heading-why-threshold-routing-is-a-natural-experiment">Why Threshold Routing is a Natural Experiment</h2>
<p>The product reason this routing rule exists is to help your team spend the premium model budget where it earns its keep. Low-confidence queries are the harder ones, which is where a stronger model has the most upside. High-confidence queries already look easy enough for the cheap model to handle.</p>
<p>You'll see this routing direction across confidence-score gates for Q&amp;A assistants, query-complexity gates in multi-model gateways like OpenRouter, safety-score gates in content moderation, and latency-budget gates that re-route when the cheap model would exceed a p99 latency budget.</p>
<p>The mechanism is the same in every case: a continuous score, a threshold, and a deterministic routing rule.</p>
<p>What makes this setup useful for causal inference is that users don't pick which model they get. A query lands, the system computes confidence, and the routing layer decides. Right at the threshold, the user's experience flips from premium to cheap based on a difference too small to be meaningful.</p>
<p>Again, a query at 0.849 confidence isn't shipping a different problem to the model than a query at 0.851. Anything that differs in outcomes between those two groups is the routing decision speaking. The underlying query is the same.</p>
<p>That local randomness is the experiment RDD reads from. You don't need a randomized control group, you don't need a propensity score. And you don't need an instrument, you need a sharp threshold that nobody can game.</p>
<h2 id="heading-what-regression-discontinuity-actually-does">What Regression Discontinuity Actually Does</h2>
<p>The jump at the threshold is the causal effect, which is the number a product team can act on. RDD reads it by fitting two separate regression lines to the outcome: one for users just below the threshold and one for users just above. The vertical difference between those two fitted lines at the cutoff is the local average treatment effect at that point.</p>
<p>Graphically, picture task completion on the y-axis and query confidence on the x-axis. Completion generally trends with confidence (easier queries complete more often). At exactly 0.85, though, users below the cutoff get premium routing, and users above get cheap.</p>
<p>If premium routing helps, you'd see a sharp upward jump in task completion just below 0.85, then disappear just above. Approached from left to right with confidence rising, the visual reads as a downward step at 0.85, because you're moving from the premium-treated zone into the cheap-treated zone.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/f772c04b-5642-472c-8182-695183027294.png" alt="f772c04b-5642-472c-8182-695183027294" style="display:block;margin:0 auto" width="1517" height="857" loading="lazy">

<p><em>Figure 1. Conceptual schematic. Two outcome trajectories, one for premium-routed queries (confidence below 0.85) and one for cheap-routed queries (confidence above 0.85), meet at the threshold but don't match. The vertical gap between their endpoints at 0.85 is the local causal effect of premium routing.</em></p>
<p>That gap is identified under two named assumptions:</p>
<ol>
<li><p><strong>No manipulation of the running variable:</strong> Users (or your system) can't precisely nudge a query's confidence score across the cutoff. If anyone can game their score to land just below 0.85 and grab premium routing, the cutoff is no longer drawn at random, and RDD breaks.</p>
</li>
<li><p><strong>Continuity of potential outcomes at the cutoff:</strong> Every other factor that affects task completion (query type, user expertise, workspace tenure, time of day) varies smoothly across 0.85. Only the routing assignment changes discontinuously at exactly the threshold. If a second product rule fires at 0.85 (a different logging level, a separate UI treatment, a retry policy), RDD will attribute that rule's effect to the routing decision.</p>
</li>
</ol>
<p>These are the two assumptions you check before you trust the estimate. Step 3 below tests the first one. The second is a structural property of your system that you have to know cold.</p>
<p>Two practical choices shape every RDD: the <strong>bandwidth</strong> (how close to the cutoff to restrict the analysis) and the <strong>functional form</strong> (linear, quadratic, or local polynomial).</p>
<p>Narrow bandwidths cut potential bias by staying close to the local-randomization zone, but they shrink the sample. Linear specifications are stable, though they assume the underlying relationship can be approximated by a straight line on each side.</p>
<p>You'll try both linear and quadratic specifications at multiple bandwidths to see whether the answer holds.</p>
<p>The article uses sharp RDD throughout, since assignment is a deterministic function of confidence (below 0.85 always premium, above 0.85 always cheap). When the threshold is probabilistic and compliance is partial, the design is a fuzzy RDD, which requires an instrumental variables framework that you can implement using the <code>rdrobust</code> Python package.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You need Python 3.11 or newer, comfort with pandas and statsmodels, and rough familiarity with linear regression and interaction terms.</p>
<p>Install the packages used in this tutorial:</p>
<pre><code class="language-shell">pip install numpy pandas statsmodels matplotlib scipy
</code></pre>
<p><strong>Here's what's happening:</strong> four standard scientific Python libraries plus matplotlib for the diagnostic visualization. Nothing exotic.</p>
<p>Clone the companion repo and generate the synthetic dataset:</p>
<pre><code class="language-shell">git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv
</code></pre>
<p><strong>Here's what's happening:</strong> the data generator draws 50,000 users with a <code>query_confidence</code> score from a Beta(5,2) distribution, applies the routing rule (<code>routed_to_premium = query_confidence &lt; 0.85</code>), and bakes a +6-percentage-point premium routing effect into <code>task_completed</code>. Same seed, same dataset, every time.</p>
<h2 id="heading-setting-up-the-working-example">Setting Up the Working Example</h2>
<p>The dataset simulates a SaaS product that routes queries between a premium and a cheap model based on confidence score. The threshold is 0.85, and the ground-truth causal effect of premium routing is +6 percentage points on task completion. You know the truth, so you can check whether RDD recovers it.</p>
<p>Load the data and look at the routing breakdown:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data/synthetic_llm_logs.csv")
print(f"Loaded {len(df):,} rows, {df.shape[1]} columns")

print("\nRouting breakdown:")
counts = df.routed_to_premium.value_counts().to_dict()
print(f"  Premium-routed (confidence &lt; 0.85):  {counts.get(1, 0):,}")
print(f"  Cheap-routed   (confidence &gt;= 0.85): {counts.get(0, 0):,}")

print("\nQuery confidence distribution:")
print(df.query_confidence.describe().round(3))
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Loaded 50,000 rows, 16 columns

Routing breakdown:
  Premium-routed (confidence &lt; 0.85):  38,874
  Cheap-routed   (confidence &gt;= 0.85): 11,126

Query confidence distribution:
count    50000.000
mean         0.715
std          0.159
min          0.078
25%          0.611
50%          0.736
75%          0.838
max          0.998
</code></pre>
<p><strong>Here's what's happening:</strong> about 78% of queries land below the 0.85 cutoff and get premium routing. The Beta(5,2) distribution is skewed toward the upper end, with a median of 0.736, and most of its mass still sits below 0.85. The remaining 22% are queries that the model already feels confident about, and they go to the cheap model.</p>
<p>Before any regression, look at the naïve comparison every product team is tempted to run:</p>
<pre><code class="language-python">naive = (
    df[df.routed_to_premium == 1].task_completed.mean()
    - df[df.routed_to_premium == 0].task_completed.mean()
)
print(f"Naive premium-vs-cheap effect: {naive:+.4f}  (ground truth = +0.06)")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Naive premium-vs-cheap effect: +0.0632  (ground truth = +0.06)
</code></pre>
<p><strong>Here's what's happening:</strong> the naive estimate sits at +0.0632, which is suspiciously close to the truth. That's a coincidence of this specific synthetic dataset, where the only confounder of premium vs. cheap is <code>query_confidence</code> itself, and the outcome doesn't depend on confidence except through routing.</p>
<p>In production, you almost never get this lucky. User expertise, prompt phrasing, time of day, and a dozen unobserved query traits all correlate with confidence and with completion.</p>
<p>A naïve comparison in a real system can be off by 50% or more in either direction. RDD gives you identification that doesn't depend on the absence of hidden confounders.</p>
<h3 id="heading-step-1-a-sharp-rdd-with-local-linear-regression">Step 1: A Sharp RDD with Local Linear Regression</h3>
<p>The basic sharp RDD estimator is a local linear regression. Restrict to users whose confidence sits within a bandwidth of the cutoff, fit separate linear slopes on each side, and read off the jump at 0.85.</p>
<pre><code class="language-python">cutoff = 0.85
bw = 0.10

near = df[(df.query_confidence &gt; cutoff - bw)
          &amp; (df.query_confidence &lt; cutoff + bw)].copy()
near["below_cutoff"] = (near.query_confidence &lt; cutoff).astype(int)
near["rc"] = near.query_confidence - cutoff

rdd_model = smf.ols(
    "task_completed ~ below_cutoff + rc + below_cutoff:rc",
    data=near,
).fit(cov_type="HC3")

effect = rdd_model.params["below_cutoff"]
print(f"RDD effect at cutoff (LATE): {effect:+.4f}")
print(f"Std error (HC3):             {rdd_model.bse['below_cutoff']:.4f}")
print(f"p-value:                     {rdd_model.pvalues['below_cutoff']:.4f}")
print(f"N users in [0.75, 0.95):     {len(near):,}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">RDD effect at cutoff (LATE): +0.0548
Std error (HC3):             0.0131
p-value:                     0.0000
N users in [0.75, 0.95):     21,689
</code></pre>
<p><strong>Here's what's happening:</strong> the model fits separate intercepts and slopes on each side of 0.85 (<code>below_cutoff</code> is the side indicator, <code>rc</code> is confidence centered at the cutoff). The coefficient on <code>below_cutoff</code> reads off the vertical jump at the threshold, which is the local average treatment effect (LATE) for queries with confidence near 0.85. You get +0.0548, within sampling noise of the +0.06 ground truth.</p>
<p>Three notes on the specification. First, <code>task_completed</code> is binary, so this is a linear probability model. For RDD with a binary outcome at the cutoff, the linear probability model is standard practice because local linearity is the identifying assumption either way. Logit at the cutoff is an alternative if you need bounded predictions globally.</p>
<p>Second, the standard errors are used <code>cov_type="HC3"</code> to relax the homoskedasticity assumption, which is almost always wrong for binary outcomes.</p>
<p>Third, the dataset has one query per user with no within-user clustering, so cluster-robust standard errors aren't needed here. In a setting with multiple queries per user, you'd cluster on <code>user_id</code>.</p>
<p>The next diagnostic to look at is the confidence distribution near the cutoff. Figure 2 shows what 50,000 queries look like in the bandwidth window:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/9ecb8a4c-6eac-4732-95ae-2a5981917f54.png" alt="9ecb8a4c-6eac-4732-95ae-2a5981917f54" style="display:block;margin:0 auto" width="1483" height="1005" loading="lazy">

<p><em>Figure 2. Real distribution from the 50,000-user synthetic dataset. Unlike the schematic in Figure 1, this shows the actual query density by confidence score, with the routing threshold annotated. The bottom panel counts how many queries land in each 2-percentage-point bin near the cutoff (2,461 / 2,481 / 2,335 / 2,229 / 2,048 across the 0.80–0.90 range). The roughly uniform spread is the visual signal that no manipulation is concentrating users on one side of the threshold.</em></p>
<h3 id="heading-step-2-try-different-bandwidths">Step 2: Try Different Bandwidths</h3>
<p>Bandwidth choice matters. Too narrow and you have too few observations, so the confidence interval blows up. Too wide and you're extrapolating into regions where the linear specification is no longer a reasonable local approximation.</p>
<p>The honest move is to try multiple bandwidths and report whether the estimate holds.</p>
<pre><code class="language-python">results = []
for bw in [0.05, 0.10, 0.15, 0.20]:
    sub = df[(df.query_confidence &gt; cutoff - bw)
             &amp; (df.query_confidence &lt; cutoff + bw)].copy()
    sub["below_cutoff"] = (sub.query_confidence &lt; cutoff).astype(int)
    sub["rc"] = sub.query_confidence - cutoff

    m = smf.ols(
        "task_completed ~ below_cutoff + rc + below_cutoff:rc",
        data=sub,
    ).fit(cov_type="HC3")

    results.append({
        "bandwidth": bw,
        "n": len(sub),
        "effect": m.params["below_cutoff"],
        "se": m.bse["below_cutoff"],
        "p": m.pvalues["below_cutoff"],
    })

print(pd.DataFrame(results).round(4).to_string(index=False))
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python"> bandwidth      n  effect     se       p
      0.05  11554  0.0635  0.0183  0.0005
      0.10  21689  0.0548  0.0131  0.0000
      0.15  29137  0.0618  0.0112  0.0000
      0.20  34074  0.0614  0.0107  0.0000
</code></pre>
<p><strong>Here's what's happening:</strong> four bandwidths from ±0.05 to ±0.20 around the cutoff, refitting the same RDD specification at each. The estimates range from +0.0548 to +0.0635, all in the same neighborhood as the +0.06 ground truth, with standard errors that shrink as the bandwidth widens and grow as it narrows. Every p-value is well below 0.05. Whether the estimates are "stable" depends on the confidence intervals around them, which Step 5 produces with the bootstrap.</p>
<h3 id="heading-step-3-checking-for-manipulation-at-the-threshold">Step 3: Checking for Manipulation at the Threshold</h3>
<p>RDD is valid only if users can't precisely manipulate the running variable around the cutoff. If your users (or your system) can nudge confidence scores just below 0.85 to force premium routing, you get a density spike at the cutoff, and the RDD estimate is contaminated.</p>
<p>The standard diagnostic is the McCrary density test, which checks whether the distribution of the running variable has a sharp jump at the cutoff. The simple version: bin the data tightly around 0.85 and check whether the counts on the two sides are similar.</p>
<pre><code class="language-python">print("User counts in 2-percentage-point bins around 0.85:")
for lo in [0.80, 0.82, 0.84, 0.86, 0.88]:
    hi = lo + 0.02
    cnt = ((df.query_confidence &gt;= lo) &amp; (df.query_confidence &lt; hi)).sum()
    print(f"  [{lo:.2f}, {hi:.2f}):  n = {cnt:,}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">User counts in 2-percentage-point bins around 0.85:
  [0.80, 0.82):  n = 2,461
  [0.82, 0.84):  n = 2,481
  [0.84, 0.86):  n = 2,335
  [0.86, 0.88):  n = 2,229
  [0.88, 0.90):  n = 2,048
</code></pre>
<p><strong>Here's what's happening:</strong> counts trend gently downward across the bandwidth because Beta(5,2) places more mass at higher confidence levels, and the density tapers as it approaches 1.0. There's no spike or dip at the 0.84–0.86 bin that straddles the cutoff. The 433-user spread across all five bins is consistent with smooth tapering of the underlying density.</p>
<p>That's the pattern you want when manipulation is absent. For a more rigorous test, the <a href="https://github.com/rdpackages/rddensity"><code>rddensity</code></a> Python package implements the formal McCrary procedure with bias-corrected standard errors.</p>
<p>What manipulation looks like when it's real: a spike in users at confidences just barely below 0.85 (they're being nudged into premium routing) and a dip just above. If you see that pattern, the RDD estimate overstates the causal effect because the users right below 0.85 differ in motivation from those right above. They cared enough to manipulate the score, and they'd have shown different outcomes even under random routing.</p>
<h3 id="heading-step-4-quadratic-specification-as-a-robustness-check">Step 4: Quadratic Specification as a Robustness Check</h3>
<p>If the true relationship between confidence and task completion isn't exactly linear, a local linear RDD can mistake the curvature for a jump. The standard robustness check allows quadratic terms on both sides of the cutoff and tests whether the estimate holds.</p>
<pre><code class="language-python">near = df[(df.query_confidence &gt; cutoff - 0.10)
         &amp; (df.query_confidence &lt; cutoff + 0.10)].copy()
near["below_cutoff"] = (near.query_confidence &lt; cutoff).astype(int)
near["rc"] = near.query_confidence - cutoff
near["rc2"] = near.rc ** 2

rdd_quad = smf.ols(
    "task_completed ~ below_cutoff + rc + below_cutoff:rc"
    " + rc2 + below_cutoff:rc2",
    data=near,
).fit(cov_type="HC3")

print(f"Linear RDD    (bw=0.10):  effect = +0.0548, p &lt; 0.0001")
print(f"Quadratic RDD (bw=0.10):  effect = "
      f"{rdd_quad.params['below_cutoff']:+.4f}, "
      f"p = {rdd_quad.pvalues['below_cutoff']:.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Linear RDD    (bw=0.10):  effect = +0.0548, p &lt; 0.0001
Quadratic RDD (bw=0.10):  effect = +0.0569, p = 0.0036
</code></pre>
<p><strong>Here's what's happening:</strong> the quadratic specification adds squared terms and interactions with the cutoff indicator, allowing the relationship to curve differently on each side. The <code>below_cutoff</code> coefficient still captures the jump at the threshold, now under a more flexible specification.</p>
<p>The two estimates differ by 0.0022, both close to the +0.06 ground truth, and both are significant at p &lt; 0.01. The answer doesn't change when you let the model bend.</p>
<p>When linear and quadratic specifications disagree noticeably, you have a real signal. With small samples (a few thousand at narrow bandwidths), the quadratic version can lose power because four extra parameters need data to be identified.</p>
<p>The standard move is to widen the bandwidth and re-run both specifications. If they still disagree at wider bandwidths, the linear approximation is wrong, and you should report both numbers.</p>
<h3 id="heading-step-5-bootstrap-confidence-intervals">Step 5: Bootstrap Confidence Intervals</h3>
<p>Every point estimate in this article is a single number from a finite sample. The bootstrap quantifies how much that number would move under resampling, which is what a confidence interval describes.</p>
<pre><code class="language-python">def bootstrap_ci(df, cutoff, bw, quadratic=False, n_reps=500, seed=7):
    rng = np.random.default_rng(seed)
    near = df[(df.query_confidence &gt; cutoff - bw)
              &amp; (df.query_confidence &lt; cutoff + bw)].copy()
    near["below_cutoff"] = (near.query_confidence &lt; cutoff).astype(int)
    near["rc"] = near.query_confidence - cutoff
    if quadratic:
        near["rc2"] = near.rc ** 2
        formula = ("task_completed ~ below_cutoff + rc + below_cutoff:rc"
                   " + rc2 + below_cutoff:rc2")
    else:
        formula = "task_completed ~ below_cutoff + rc + below_cutoff:rc"

    n = len(near)
    estimates = np.empty(n_reps)
    for i in range(n_reps):
        sample = near.iloc[rng.integers(0, n, size=n)]
        m = smf.ols(formula, data=sample).fit()
        estimates[i] = m.params["below_cutoff"]
    return (np.percentile(estimates, 2.5), np.percentile(estimates, 97.5))


print("Linear RDD (bw=0.10):")
lo, hi = bootstrap_ci(df, cutoff, bw=0.10)
print(f"  effect = +0.0548   95% CI: [{lo:+.4f}, {hi:+.4f}]")

print("\nBandwidth sensitivity:")
for bw, eff in [(0.05, 0.0635), (0.10, 0.0548), (0.15, 0.0618), (0.20, 0.0614)]:
    lo, hi = bootstrap_ci(df, cutoff, bw=bw)
    print(f"  bw = {bw:.2f}   effect = {eff:+.4f}   "
          f"95% CI: [{lo:+.4f}, {hi:+.4f}]")

print("\nQuadratic RDD (bw=0.10):")
lo, hi = bootstrap_ci(df, cutoff, bw=0.10, quadratic=True)
print(f"  effect = +0.0569   95% CI: [{lo:+.4f}, {hi:+.4f}]")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Linear RDD (bw=0.10):
  effect = +0.0548   95% CI: [+0.0278, +0.0817]

Bandwidth sensitivity:
  bw = 0.05   effect = +0.0635   95% CI: [+0.0244, +0.0986]
  bw = 0.10   effect = +0.0548   95% CI: [+0.0278, +0.0817]
  bw = 0.15   effect = +0.0618   95% CI: [+0.0381, +0.0823]
  bw = 0.20   effect = +0.0614   95% CI: [+0.0420, +0.0808]

Quadratic RDD (bw=0.10):
  effect = +0.0569   95% CI: [+0.0205, +0.0959]
</code></pre>
<p><strong>Here's what's happening:</strong> the bootstrap resamples the bandwidth-restricted data with replacement 500 times, refits the RDD on each replicate, and collects the <code>below_cutoff</code> coefficient. The 2.5th and 97.5th percentiles of those 500 estimates form the 95% interval. Every interval covers the +0.06 ground truth, every interval excludes zero, and the bandwidth sweep produces overlapping intervals.</p>
<p>That's quantitative stability, verified by resampling across the full bandwidth range. Intervals widen as the bandwidth shrinks and narrow as it grows. The quadratic interval is wider than the linear one because the four extra parameters absorb degrees of freedom.</p>
<p>One thing the intervals do NOT do on this dataset: exclude the naive +0.0632 estimate. That's because the data generator doesn't bake in confounding by query confidence. The only difference between the premium and cheap groups in expectations is the +6pp routing effect itself, so the naïve comparison is close to the truth.</p>
<p>Real systems are messier. In a production setting where unobserved query traits affect both the routing assignment and task completion, the naïve estimate would diverge from the RDD estimate, and the bootstrap intervals would tell you which one to trust.</p>
<h2 id="heading-when-regression-discontinuity-fails">When Regression Discontinuity Fails</h2>
<p>RDD looks clean, but several specific failure modes can destroy the identification. Each one maps to a violation of one of the two named assumptions.</p>
<p><strong>Users manipulate the running variable</strong> (violates assumption 1). The whole setup depends on users (or any upstream service) being unable to precisely control which side of the cutoff they land on. Any system that reveals the cutoff and gives users a way to influence their score (a retry mechanism, a prompt engineering workaround, a confidence-inflating trick) breaks RDD.</p>
<p>Run the density check in Step 3 every time. If you find manipulation, switch to a fuzzy RDD that treats the threshold as probabilistic, or abandon the approach.</p>
<p><strong>Other policies fire at the same cutoff</strong> (violates assumption 2). If your product has additional rules that activate at 0.85 (a separate UI treatment, a different logging level, a different retry policy), RDD can't separate the routing effect from those other policy effects. Audit the full rule book for anything that shares the threshold.</p>
<p><strong>The threshold has noise or overrides</strong> (violates assumption 1, in the structural sense). Maybe routing isn't strictly deterministic at 0.85&nbsp;– it may have random jitter, or a second rule may override the main rule in some cases.</p>
<p>If assignment to the premium model isn't a deterministic function of <code>query_confidence</code>, you have a fuzzy RDD, which requires an instrumental variables framework. The <code>rdrobust</code> package handles both sharp and fuzzy designs.</p>
<p><strong>Curvature masquerading as a jump</strong> (breaks the linear approximation that supports identification at the cutoff). Sharp RDD assumes linearity is a reasonable local approximation. When the underlying outcome-confidence relationship is strongly curved, the linear specification can mistake the bend for a jump.</p>
<p>Step 4's quadratic robustness check is the standard diagnostic. If linear and quadratic disagree, widen the bandwidth and re-run both.</p>
<p><strong>Extrapolation bias</strong> (a continuity issue, reframed). RDD estimates are strictly local to the cutoff. The +0.06 effect at 0.85 tells you nothing about what premium routing would do for queries with confidence 0.30 or 0.99.</p>
<p>If you want a global average effect, you need a different technique: propensity methods, regression with confounder adjustment, or an actual experiment.</p>
<h2 id="heading-what-to-do-next">What to Do Next</h2>
<p>RDD is the right tool when your AI feature is gated by a continuous score and a sharp threshold.</p>
<p>If your feature is gated by a user-controlled toggle, propensity score methods are a better fit. If it's gated by a staged rollout across workspaces, difference-in-differences handles it. If it's gated by rules you can't observe directly but that have a random component, instrumental variables is the right choice.</p>
<p>For production RDD analyses, use the <a href="https://github.com/rdpackages/rdrobust"><code>rdrobust</code></a> Python package. It gives you optimal bandwidth selection (Calonico, Cattaneo, and Titiunik 2014), bias-corrected standard errors, and a built-in plotting utility. The companion <a href="https://github.com/rdpackages/rddensity"><code>rddensity</code></a> package implements the McCrary density test you saw informally in Step 3.</p>
<p>The from-scratch version in this tutorial shows the mechanics. The rd-packages stack is what you ship to a reviewer.</p>
<p>One thing the LATE doesn't do: tell you the effect for users far from the cutoff. If a +0.06 LATE at 0.85 is enough to keep premium routing in the pipeline, you're done. If you need to know what premium would do for the easy queries you're currently sending to cheap (or the hardest queries near the floor), the next step is a small randomized rollout in those zones, scored against the RDD estimate as a calibration check. Don't generalize the LATE without evidence.</p>
<p>The companion notebook for this tutorial <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/03_rdd_confidence_threshold">lives here on GitHub</a>. Clone the repo, generate the synthetic dataset, and run <code>rdd_demo.ipynb</code> to reproduce every code block from this tutorial.</p>
<p>Threshold routing is one of the most common patterns in production LLM systems, and every confidence-gated routing decision in your stack is a potential RDD. Run the analysis.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Product Experimentation with Propensity Scores: Causal Inference for LLM-Based Features in Python ]]>
                </title>
                <description>
                    <![CDATA[ Every product experimentation team running causal inference on LLM-based features eventually hits the same wall: when users click "Try our AI assistant," the volunteers aren't a random sample. Your pr ]]>
                </description>
                <link>https://www.freecodecamp.org/news/product-experimentation-with-propensity-scores-causal-inference-for-llm-based-features-in-python/</link>
                <guid isPermaLink="false">69f3df46909e64ad07425413</guid>
                
                    <category>
                        <![CDATA[ product experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ causal inference ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ propensity-score-matching ]]>
                    </category>
                
                    <category>
                        <![CDATA[ experimentation ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rudrendu Paul ]]>
                </dc:creator>
                <pubDate>Thu, 30 Apr 2026 23:01:26 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/6a8936be-7f43-4977-9baf-6021dc892b2d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Every product experimentation team running causal inference on LLM-based features eventually hits the same wall: when users click "Try our AI assistant," the volunteers aren't a random sample.</p>
<p>Your product shipped a new agent mode last quarter. Users have to tap the "Try agent mode" toggle to enable it. The dashboard numbers look stunning: agent-mode users complete 21 percentage points more tasks than non-users. The CPO calls it the best feature launch of the year.</p>
<p>But you know something's off. Heavy-engagement users opt into new features constantly, while light users ignore toggles entirely. That 21-point gap measures the agent's effect combined with the pre-existing gap between power users and the rest of your base.</p>
<p>This is the Opt-In Trap. It shows up in every generative AI product that ships features behind a user-controlled toggle: "Try our AI assistant," "Enable smart replies," "Turn on code suggestions." Users who click to opt in differ systematically from those who scroll past. Any naïve comparison between the two groups collapses the feature's causal effect into whatever made those users opt in in the first place.</p>
<p>Running an AI feature behind a toggle is a product experiment. The hypothesis: the feature improves outcomes for users who adopt it.</p>
<p>Unlike an A/B test, where the coin flip creates two otherwise-identical populations, the toggle creates two populations that differ before they even make a choice. That pre-existing difference is the measurement problem, and a t-test on dashboard numbers can't fix it.</p>
<p>Propensity score methods are statistical tools that data scientists use to separate adoption bias from the feature's actual effect. They reweight (or rematch) your comparison so that opted-in and non-opted-in groups look comparable on observable characteristics, approximating what a randomized experiment would have given you.</p>
<p>This tutorial walks through the full pipeline (propensity estimation, inverse-probability weighting, nearest-neighbor matching, balance diagnostics, and bootstrap confidence intervals) on a 50,000-user synthetic SaaS dataset where the ground-truth causal effect is known. You'll estimate it, quantify uncertainty, and see where the approach silently breaks.</p>
<p><strong>Companion code:</strong> every code block runs end-to-end in the companion notebook at <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/02_propensity_opt_in">github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/02_propensity_opt_in</a>. The notebook (<code>psm_demo.ipynb</code>) has all outputs pre-executed, so you can read along on GitHub before running anything locally.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-why-opt-in-features-break-naive-comparisons">Why Opt-in Features Break Naïve Comparisons</a></p>
</li>
<li><p><a href="#heading-what-propensity-scores-actually-do">What Propensity Scores Actually Do</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-setting-up-the-working-example">Setting Up the Working Example</a></p>
</li>
<li><p><a href="#heading-step-1-estimate-the-propensity-score">Step 1: Estimate the Propensity Score</a></p>
</li>
<li><p><a href="#heading-step-2-inverse-probability-weighting">Step 2: Inverse-Probability Weighting</a></p>
</li>
<li><p><a href="#heading-step-3-nearest-neighbor-matching">Step 3: Nearest-Neighbor Matching</a></p>
</li>
<li><p><a href="#heading-step-4-check-covariate-balance">Step 4: Check Covariate Balance</a></p>
</li>
<li><p><a href="#heading-step-5-bootstrap-confidence-intervals">Step 5: Bootstrap Confidence Intervals</a></p>
</li>
<li><p><a href="#heading-when-propensity-score-methods-fail">When Propensity Score Methods Fail</a></p>
</li>
<li><p><a href="#heading-what-to-do-next">What to Do Next</a></p>
</li>
</ul>
<h2 id="heading-why-opt-in-features-break-naive-comparisons">Why Opt-in Features Break Naïve Comparisons</h2>
<p>The math of an A/B test is elegant because of one assumption: treatment is assigned independent of everything else. Flip a coin: half your users get agent mode, and the coin flip breaks every possible confound by construction. The opt-in world has no coin.</p>
<p>Three mechanisms make opt-in comparisons misleading.</p>
<h4 id="heading-1-selection-on-engagement">1. Selection on engagement</h4>
<p>Power users click everything. If your heavy-engagement cohort opts into agent mode at 65 percent and your light-engagement cohort opts in at 12 percent, you've stacked the opt-in group with users who were going to complete more tasks anyway.</p>
<p>That compositional imbalance accounts for most of the observed lift on its own, before the agent does any work.</p>
<h4 id="heading-2-selection-on-intent">2. Selection on intent</h4>
<p>Users who opt into a new feature often have a specific use case in mind. A developer who clicks "Try code suggestions" already has code to write. That user would have shown higher task completion even with the control UI.</p>
<h4 id="heading-3-selection-on-risk-tolerance">3. Selection on risk tolerance</h4>
<p>Early adopters tolerate rough edges. A user who clicks "Try beta" and sees slow latency sticks around, but a risk-averse user bounces.</p>
<p>Your opt-in group is enriched for people willing to put up with bad experiences, which affects every downstream metric you might measure.</p>
<p>All three produce the same symptom: a raw comparison of opted-in users against everyone else that can overstate the feature's causal effect by 2x or more, depending on how concentrated opt-in is among your heaviest users.</p>
<p>On the synthetic dataset in this tutorial, the naïve comparison inflates a true +8pp effect to +21pp, a 2.6x overshoot. Propensity score methods exist to correct this.</p>
<h2 id="heading-what-propensity-scores-actually-do">What Propensity Scores Actually Do</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/df8f4e49-98f3-4cd2-b4a8-f9b49d18f60a.png" alt="Schematic propensity score distributions for two hypothetical groups" style="display:block;margin:0 auto" width="1469" height="822" loading="lazy">

<p><em>Figure 1: Schematic propensity score distributions for two hypothetical groups. The opted-in group (red) skews toward higher propensities, while the non-opted-in group (blue) skews lower.</em></p>
<p>In the above figure, the bracketed strip below the x-axis splits the score range into three zones: a control-heavy region at low propensities where few treated users exist, a region of common support in the middle where both groups are well represented, and a treatment-heavy region at high propensities where few controls exist. Propensity score methods operate within the common-support region by reweighting or rematching so that the two groups appear balanced on observables. The extremes are either trimmed out or handled with caution.</p>
<p>The propensity score is the probability that a user opts in given their observable characteristics. Estimate this probability well, and you can use it to reweight your sample so that opted-in and non-opted-in users look similar on observables, just as they would have if opt-in had been randomized.</p>
<p>Two practical strategies use the propensity score:</p>
<ul>
<li><p><strong>Inverse-probability weighting (IPW)</strong> assigns each user a weight equal to the inverse of their probability of receiving the treatment they actually received. Opted-in users get weighted by 1/P(opt-in). Non-opted-in users get weighted by 1/P(no opt-in). After weighting, the two groups are balanced on observables, and the weighted difference in outcomes approximates the average treatment effect.</p>
</li>
<li><p><strong>Matching</strong> pairs each opted-in user with one or more non-opted-in users who have similar propensity scores. The average outcome difference between matched pairs estimates the average treatment effect on the treated (ATT): what opt-in users actually gained by opting in.</p>
</li>
</ul>
<p>Both methods rest on three identification assumptions working together.</p>
<ol>
<li><p>First, <strong>unconfoundedness</strong>: every observable variable that drives opt-in and affects the outcome is in your propensity model.</p>
</li>
<li><p>Second, <strong>overlap</strong> (also called positivity): every user has some nonzero probability of opting in and some nonzero probability of staying out.</p>
</li>
<li><p>Third, <strong>no interference</strong>: one user's opt-in decision does not affect another user's outcome (the stable-unit-treatment-value assumption, or SUTVA.</p>
</li>
</ol>
<p>Violate any one of these and the estimate is biased even when the other two hold. The failure modes at the end of this tutorial walk through each one.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You'll need Python 3.11 or newer, comfort with pandas and scikit-learn, and rough familiarity with logistic regression.</p>
<p>Install the packages for this tutorial:</p>
<pre><code class="language-shell">pip install numpy pandas scikit-learn matplotlib
</code></pre>
<p><strong>Here's what's happening:</strong> four packages cover the full pipeline. Pandas loads the data, NumPy handles weights and array arithmetic, scikit-learn fits the propensity model and runs nearest-neighbor matching, and matplotlib renders the overlap diagnostic.</p>
<p>Clone the companion repo to get the synthetic dataset:</p>
<pre><code class="language-shell">git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv
</code></pre>
<p><strong>Here's what's happening:</strong> the clone pulls the companion repo, and <code>generate_data.py</code> produces the shared synthetic dataset used across the series. Seed 42 keeps the dataset reproducible, and 50,000 users give clean signal for every estimator in this tutorial. The output CSV lands at <code>data/synthetic_llm_logs.csv</code>.</p>
<h2 id="heading-setting-up-the-working-example">Setting Up the Working Example</h2>
<p>The synthetic dataset simulates a SaaS product where users can opt into an agent mode that uses a more expensive model. With fifty thousand users, opt-in rates differ sharply by engagement tier: heavy users opt in at 65 percent, medium users at 35 percent, and light users at 12 percent.</p>
<p>The ground-truth causal effect baked into the data generator is +8 percentage points on task completion for users who opted in. The naive comparison inflates this to around +21 percentage points because selection bias stacks the opted-in group with your most engaged users.</p>
<p>Knowing the ground truth is what lets you verify that your propensity score method recovers it.</p>
<p>Load the data and see the selection problem:</p>
<pre><code class="language-python">import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")

print(df.groupby("engagement_tier").opt_in_agent_mode.mean().round(3))

naive_effect = (
    df[df.opt_in_agent_mode == 1].task_completed.mean()
    - df[df.opt_in_agent_mode == 0].task_completed.mean()
)
print(f"\nNaive opt-in effect: {naive_effect:+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">engagement_tier
heavy     0.647
light     0.120
medium    0.353
Name: opt_in_agent_mode, dtype: float64

Naive opt-in effect: +0.2106
</code></pre>
<p><strong>Here's what's happening:</strong> you load 50,000 rows, group by engagement tier, and print the opt-in rate inside each group. Heavy users opt in far more than light users, which is the selection-on-engagement pattern baked into the data. The naïve effect lands at +0.2106 (21 percentage points), nearly three times the ground truth of +0.08. That gap is exactly what propensity score methods have to remove.</p>
<h2 id="heading-step-1-estimate-the-propensity-score">Step 1: Estimate the Propensity Score</h2>
<p>The propensity score is the output of a model that predicts opt-in from observable characteristics. Logistic regression is the right starting point because it's interpretable and fast, but watch the balance diagnostics in Step 4: if any weighted SMD stays above 0.1, the logistic model is missing an interaction, and gradient boosting is the next move.</p>
<p>For this dataset, the relevant observables are engagement tier and query confidence. In a real product, you'd include every variable you think drives opt-in: device type, tenure, plan tier, and historical usage patterns.</p>
<pre><code class="language-python">from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X = pd.get_dummies(
    df[["engagement_tier", "query_confidence"]],
    drop_first=True
).astype(float)
y_treat = df.opt_in_agent_mode

ps_model = LogisticRegression(max_iter=1000).fit(X, y_treat)
df["propensity"] = ps_model.predict_proba(X)[:, 1]

# Basic sanity checks
print(df.groupby("engagement_tier").propensity.mean().round(3))
print(
    f"\nPropensity range (treated):  "
    f"{df[df.opt_in_agent_mode == 1].propensity.min():.3f} - "
    f"{df[df.opt_in_agent_mode == 1].propensity.max():.3f}"
)
print(
    f"Propensity range (control):  "
    f"{df[df.opt_in_agent_mode == 0].propensity.min():.3f} - "
    f"{df[df.opt_in_agent_mode == 0].propensity.max():.3f}"
)
print(f"Propensity model AUC: {roc_auc_score(y_treat, df.propensity):.3f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">engagement_tier
heavy     0.646
light     0.120
medium    0.353
Name: propensity, dtype: float64

Propensity range (treated):  0.114 - 0.675
Propensity range (control):  0.114 - 0.673
Propensity model AUC: 0.744
</code></pre>
<p><strong>Here's what's happening:</strong> you encode the engagement tier as dummy variables, keep query confidence continuous, and fit a logistic regression model. The predicted probability from the model is each user's propensity score.</p>
<p>Scikit-learn <code>LogisticRegression</code> applies L2 regularization by default (<code>C=1.0</code>), which shrinks propensities slightly toward 0.5. For production use, you can set <code>penalty=None</code> if you want an unregularized fit.</p>
<p>Mean propensity inside each engagement tier recovers the true opt-in rate for that tier almost exactly, so the model is calibrated. The AUC of 0.744 confirms the model discriminates between opt-ins and non-opt-ins well above chance (0.5).</p>
<p>And the propensity ranges overlap between treated and control groups (both span roughly 0.11 to 0.67), which is the visual overlap condition.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/0ad957a6-1d24-4332-b033-aae6e91c4162.png" alt="wo views of the same positivity check on the real 50,000-user synthetic dataset." style="display:block;margin:0 auto" width="1283" height="942" loading="lazy">

<p><em>Figure 2: Two views of the same positivity check on the real 50,000-user synthetic dataset.</em></p>
<p>In the figure above, the top panel plots smooth kernel density curves of the fitted propensity scores for each group. The three peaks align with the three engagement tiers (light at p ≈ 0.12, medium at p ≈ 0.35, heavy at p ≈ 0.65), as expected, because the opt-in rate is tier-driven. The bottom panel translates that same distribution into raw counts per tier: every tier contains thousands of both opted-in and non-opted-in users, which is exactly what positivity requires.</p>
<p>Where Figure 1 schematically illustrated the idea, this figure shows that it holds for the data, so the weighting and matching that follow will have real counterfactuals to work with.</p>
<h2 id="heading-step-2-inverse-probability-weighting">Step 2: Inverse-Probability Weighting</h2>
<p>IPW assigns each user a weight inversely proportional to their propensity. An opted-in user with a 0.12 propensity is rare (a light user who still opted in despite low engagement) and carries information about 1 / 0.12 ≈ 8 similar users in the population. A control user with a 0.12 propensity is the expected case for light users who stayed out, so they're common and get a weight of 1 / (1 - 0.12) ≈ 1.14.</p>
<pre><code class="language-python">import numpy as np

# ATE weights: 1/P(treat) for treated, 1/P(no treat) for control
df["ipw"] = np.where(
    df.opt_in_agent_mode == 1,
    1 / df.propensity,
    1 / (1 - df.propensity)
)

t = df[df.opt_in_agent_mode == 1]
c = df[df.opt_in_agent_mode == 0]
ate_ipw = (
    (t.task_completed * t.ipw).sum() / t.ipw.sum()
    - (c.task_completed * c.ipw).sum() / c.ipw.sum()
)
print(f"IPW average treatment effect (ATE): {ate_ipw:+.4f}")

# ATT: what opt-in users actually gained
df["ipw_att"] = np.where(
    df.opt_in_agent_mode == 1,
    1,
    df.propensity / (1 - df.propensity)
)
t = df[df.opt_in_agent_mode == 1]   # re-slice now that ipw_att is in df
c = df[df.opt_in_agent_mode == 0]
treated_mean = t.task_completed.mean()
control_w_mean = (c.task_completed * c.ipw_att).sum() / c.ipw_att.sum()
att_ipw = treated_mean - control_w_mean
print(f"IPW average treatment effect on treated (ATT): {att_ipw:+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">IPW average treatment effect (ATE): +0.0851
IPW average treatment effect on treated (ATT): +0.0770
</code></pre>
<p><strong>Here's what's happening:</strong> first, you compute ATE weights for every user and take the weighted difference in task completion between opted-in and non-opted-in groups. Then you compute ATT weights, which reweight only the control group to match the treated group's covariate distribution, and compute the average treatment effect on the treated.</p>
<p>ATE answers the population question: what's the effect on a random user who might or might not have opted in anyway? ATT answers the user question: What did opt-in users actually gain? On this dataset, ATE lands at +0.0851 and ATT at +0.0770, both close to the ground-truth +0.08 and a massive improvement over the naive +0.2106.</p>
<p>The distinction matters in practice. Deciding whether to roll the feature out to users who haven't opted in calls for ATE. Reporting on the value opt-in users captured calls for ATT.</p>
<h2 id="heading-step-3-nearest-neighbor-matching">Step 3: Nearest-Neighbor Matching</h2>
<p>Matching takes a different approach: pair each opted-in user with the non-opted-in user whose propensity score is closest, then take the average outcome difference across matched pairs. The result estimates ATT.</p>
<pre><code class="language-python">from sklearn.neighbors import NearestNeighbors

treated_ps = df[df.opt_in_agent_mode == 1][["propensity"]].values
control_ps = df[df.opt_in_agent_mode == 0][["propensity"]].values

nn = NearestNeighbors(n_neighbors=1).fit(control_ps)
_, idx = nn.kneighbors(treated_ps)

treated_outcomes = df[df.opt_in_agent_mode == 1].task_completed.values
matched_control_outcomes = (
    df[df.opt_in_agent_mode == 0].task_completed.values[idx.flatten()]
)

att_match = (treated_outcomes - matched_control_outcomes).mean()
print(f"1-NN matching ATT: {att_match:+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">1-NN matching ATT: +0.0752
</code></pre>
<p><strong>Here's what's happening:</strong> you extract propensity scores for each group, fit a nearest-neighbor index on the control group, and find the single closest control user for every treated user.</p>
<p>The <code>NearestNeighbors</code> index allows the same control user to be selected as the match for multiple treated users, so this is a matching-with-replacement case.</p>
<p>You pull the outcomes for each treated user and their matched control, take the difference per pair, and average across pairs. The result estimates what opt-in users gained compared to very similar users who did not opt in.</p>
<p>The +0.0752 result lands close to the ground truth of +0.08 but slightly below IPW ATT, typical of 1-NN matching because a single nearest neighbor is a high-variance estimator.</p>
<p>Two variants are worth knowing. Matching with replacement (what you just ran) allows a single control user to serve as a match for multiple treated users, reducing bias when good matches are scarce but inflating variance.</p>
<p>Matching without replacement assigns each control user to at most one treated user, which keeps variance lower but forces poor-quality pairings when the treated group dwarfs the available controls.</p>
<p>For most production analyses, k-nearest-neighbor matching with k = 3-5 and replacement is a sensible default.</p>
<h2 id="heading-step-4-check-covariate-balance">Step 4: Check Covariate Balance</h2>
<p>Propensity score methods work only if they actually balance the covariates between groups. You need to verify that they did, because if the balance fails, your estimate is wrong.</p>
<p>The standard diagnostic is the standardized mean difference (SMD) for each covariate. SMD compares the treated group mean to the control group mean, divided by the pooled standard deviation.</p>
<p>Before weighting, SMDs tell you how imbalanced the raw groups are. After weighting, they should be small (|SMD| &lt; 0.1 is the conventional cutoff).</p>
<pre><code class="language-python">def smd(treated_vals, control_vals, treated_w=None, control_w=None):
    """Standardized mean difference, optionally with weights."""
    if treated_w is None:
        treated_w = np.ones(len(treated_vals))
    if control_w is None:
        control_w = np.ones(len(control_vals))
    t_mean = np.average(treated_vals, weights=treated_w)
    c_mean = np.average(control_vals, weights=control_w)
    pooled_std = np.sqrt((treated_vals.var() + control_vals.var()) / 2)
    return (t_mean - c_mean) / pooled_std

engagement_heavy = (df.engagement_tier == "heavy").astype(float).values
qc = df.query_confidence.values
tr = (df.opt_in_agent_mode == 1).values

covariates = {
    "engagement_tier_heavy": engagement_heavy,
    "query_confidence": qc,
}

print(f"{'Covariate':&lt;30} {'Raw SMD':&gt;10} {'Weighted SMD':&gt;15}")
for name, vals in covariates.items():
    smd_raw = smd(vals[tr], vals[~tr])
    smd_weighted = smd(
        vals[tr], vals[~tr],
        treated_w=df[tr].ipw.values,
        control_w=df[~tr].ipw.values,
    )
    print(f"{name:&lt;30} {smd_raw:&gt;+10.3f} {smd_weighted:&gt;+15.3f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">Covariate                         Raw SMD    Weighted SMD
engagement_tier_heavy              +0.742          +0.002
query_confidence                   -0.032          -0.003
</code></pre>
<p><strong>Here's what's happening:</strong> the helper computes the standardized mean difference for any covariate, with optional IPW weights.</p>
<p>You then print raw and weighted SMDs for each covariate. The raw SMD on <code>engagement_tier_heavy</code> is +0.742 (heavy users opt in far more than everyone else), and the weighted SMD drops to +0.002, a clean pass. Query confidence was already close to balanced on the raw data, and weighting keeps it that way. If any weighted SMD came back above 0.1 in absolute value, your propensity model would be missing something; the fix is usually richer features or interaction terms in the logistic regression.</p>
<p>Visually, Figure 2 above confirmed what the SMDs now confirm numerically: the overlap condition holds, and balance is achievable.</p>
<h2 id="heading-step-5-bootstrap-confidence-intervals">Step 5: Bootstrap Confidence Intervals</h2>
<p>Point estimates are only half the story. Any estimate you report to a product team needs an interval that tells them whether +0.08 is distinguishable from +0.03 or from +0.12. Analytic standard errors for IPW and matching are tricky because of the estimated propensity score, so the simplest and most honest move is the non-parametric bootstrap.</p>
<pre><code class="language-python">def estimate_all(sample):
    """Return (ATE_IPW, ATT_IPW, ATT_match) on a bootstrap sample."""
    s = sample.copy()
    X_s = pd.get_dummies(
        s[["engagement_tier", "query_confidence"]], drop_first=True
    ).astype(float)
    ps = LogisticRegression(max_iter=1000).fit(X_s, s.opt_in_agent_mode)
    s["p"] = ps.predict_proba(X_s)[:, 1]

    s["w_ate"] = np.where(
        s.opt_in_agent_mode == 1, 1 / s.p, 1 / (1 - s.p)
    )
    s["w_att"] = np.where(
        s.opt_in_agent_mode == 1, 1, s.p / (1 - s.p)
    )
    t, c = s[s.opt_in_agent_mode == 1], s[s.opt_in_agent_mode == 0]

    ate = (
        (t.task_completed * t.w_ate).sum() / t.w_ate.sum()
        - (c.task_completed * c.w_ate).sum() / c.w_ate.sum()
    )
    att = t.task_completed.mean() - (
        (c.task_completed * c.w_att).sum() / c.w_att.sum()
    )
    nn_b = NearestNeighbors(n_neighbors=1).fit(c[["p"]].values)
    _, idx_b = nn_b.kneighbors(t[["p"]].values)
    match = (
        t.task_completed.values
        - c.task_completed.values[idx_b.flatten()]
    ).mean()
    return ate, att, match

rng = np.random.default_rng(7)
n_reps = 500
results = np.zeros((n_reps, 3))
for i in range(n_reps):
    boot = df.iloc[rng.integers(0, len(df), size=len(df))]
    results[i] = estimate_all(boot)

for name, col in zip(["IPW ATE", "IPW ATT", "1-NN ATT"], range(3)):
    lo, hi = np.percentile(results[:, col], [2.5, 97.5])
    print(f"{name:&lt;10} 95% CI: [{lo:+.4f}, {hi:+.4f}]")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-python">IPW ATE    95% CI: [+0.0745, +0.0954]
IPW ATT    95% CI: [+0.0687, +0.0865]
1-NN ATT   95% CI: [+0.0659, +0.0940]
</code></pre>
<p><strong>Here's what's happening:</strong> you resample the dataset with replacement 500 times, refit the propensity model, and recompute each estimator on each resample, and take the 2.5th and 97.5th percentiles of the bootstrap distribution as the 95% confidence interval. All three intervals cover the ground-truth +0.08 and exclude the naive +0.21 by a wide margin.</p>
<p>The IPW ATT interval is the tightest because ATT reweights only the control group. The 1-NN matching interval is the widest because single-neighbor matching discards control users outside the matched set.</p>
<p>Running this once takes about 90 seconds on a laptop. For a stakeholder report, anchor the headline to the point estimate and cite the interval so the team sees the uncertainty alongside the number.</p>
<h2 id="heading-when-propensity-score-methods-fail">When Propensity Score Methods Fail</h2>
<p>Propensity scores make opt-in comparisons rigorous when their assumptions hold. They produce biased estimates that look clean when those assumptions fail.</p>
<p>Four common failure modes map to the three identification assumptions from earlier.</p>
<h3 id="heading-1-unmeasured-confounders-violate-unconfoundedness">1. Unmeasured Confounders (Violate Unconfoundedness)</h3>
<p>If something drives both opt-in and your outcome but isn't in your propensity model, IPW and matching produce biased estimates. This is the most common failure in practice.</p>
<p>An example: users who opt into agent mode are also the users who follow your engineering blog and read release notes. If blog-reading behavior raises task completion independently of the feature, missing that signal attributes the effect to agent mode, inflating your estimate.</p>
<p>The only real defense is domain knowledge about what drives opt-in, richer feature engineering in your propensity model, and formal sensitivity tools (Rosenbaum bounds, E-values) that quantify how strong an unmeasured confounder would have to be to overturn the result.</p>
<h3 id="heading-2-positivity-overlap-failures-violates-overlap">2. Positivity (Overlap) Failures (Violates Overlap)</h3>
<p>If some users have near-zero probability of opting in (or near-one), you've got no comparable counterfactual for them. I</p>
<p>PW creates extreme weights (1 / 0.001 = 1,000) that let a single outlier dominate the estimate. So matching is forced into poor-quality pairings.</p>
<p>Check propensity histograms and trim propensities outside [0.05, 0.95] before weighting if extreme values exist.</p>
<h3 id="heading-3-misspecified-propensity-models-degrade-unconfoundedness-in-practice">3. Misspecified Propensity Models (Degrade Unconfoundedness in Practice)</h3>
<p>A linear logistic regression can't capture nonlinear relationships. If opt-in depends on the interaction between engagement tier and query confidence (power users with complex queries opt in, while light users pass), a main-effects model misses that and produces poor balance.</p>
<p>Use flexible models (for example, gradient boosting on the propensity score or regression adjustment on top of weighting) and always check the balance after weighting. Poor balance after weighting is the primary signal of misspecification.</p>
<h3 id="heading-4-spillovers-between-users-violates-sutva">4. Spillovers Between Users (Violates SUTVA)</h3>
<p>Propensity score methods assume your users are independent. If one user opting into agent mode affects another user's task completion (for example, teammates adopting the feature together in shared workspaces), your estimated effect includes the spillover.</p>
<p>This violates the stable-unit-treatment-value-assumption, and handling it cleanly requires a different toolkit: either cluster randomization for features adopted at the workspace level or network-aware experimental designs for user-level spillovers.</p>
<p>These failure modes stay invisible in your regression coefficients. They surface as estimates that look good on paper but don't hold up when the feature rolls out to a broader audience.</p>
<p>Run balance diagnostics, check overlap plots, and document what you might have missed: those are your only real defenses.</p>
<h2 id="heading-what-to-do-next">What to Do Next</h2>
<p>Propensity score methods are the right tool when your feature ships behind an opt-in toggle and you've got rich covariates to model selection with.</p>
<p>If opt-in follows a crisp rule (a threshold on query complexity, a paid-tier gate), regression discontinuity fits better. If you suspect unobserved confounders and have an external randomization source (randomized rollout noise, rate-limit-triggered routing), instrumental variables will do better.</p>
<p>To guard your estimate against propensity misspecification, doubly robust estimators combine propensity weighting with regression adjustment and stay consistent if at least one of the two component models is correctly specified.</p>
<p>The companion notebook for this tutorial <a href="http://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/02_propensity_opt_in">lives here</a>. Clone the repo, generate the synthetic dataset, and run <code>psm_demo.ipynb</code> (or <code>psm_demo.py</code>) to reproduce every code block, every number, and every figure from this tutorial.</p>
<p>When an AI feature ships behind a toggle, the naïve opt-in comparison is usually the wrong number. Propensity score methods give you "users comparable to those who clicked this" as your counterfactual, and the bootstrap gives you an interval you can defend when a stakeholder asks how sure you are.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Product Experimentation for AI Rollouts: Why A/B Testing Breaks and How Difference-in-Differences in Python Fixes It ]]>
                </title>
                <description>
                    <![CDATA[ Your team shipped an LLM-based summaries feature to wave 1 workspaces at week 20 and now the post-launch doc is due. You need a causal effect number, a specific estimate you can defend to a statistici ]]>
                </description>
                <link>https://www.freecodecamp.org/news/why-ab-testing-breaks-in-ai-rollouts-and-how-to-fix-it/</link>
                <guid isPermaLink="false">69e94caed5f8830e7dae1569</guid>
                
                    <category>
                        <![CDATA[ product experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ experimentation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ causal inference ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rudrendu Paul ]]>
                </dc:creator>
                <pubDate>Wed, 22 Apr 2026 22:33:18 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/ed63a287-c756-4dfd-a270-3c5f5ee0c1d0.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Your team shipped an LLM-based summaries feature to wave 1 workspaces at week 20 and now the post-launch doc is due. You need a causal effect number, a specific estimate you can defend to a statistician.</p>
<p>The problem is that wave 2 workspaces are still waiting, a product-wide onboarding redesign shipped the same Tuesday, and week 20 also coincided with a quarterly engagement bump. Any comparison between the two groups after week 20 mixes the feature's causal effect with the redesign, the seasonality, and whatever selection criteria determined which workspaces landed in wave 1 in the first place.</p>
<p>This is how most enterprise SaaS teams ship AI features in 2026: one workspace at a time, in waves, on a rollout calendar. Randomization doesn't happen, and because randomization doesn't happen, A/B testing can't give you a clean causal effect. The result is a number on a dashboard that everyone argues over.</p>
<p>Call this the <strong>Rollout Calendar Trap</strong>: you have real data, a real experiment structure, and a completely invalid comparison. For data scientists shipping AI features in waves, it's the primary source of bad causal claims downstream.</p>
<p>Product experimentation for generative AI features follows this exact pattern: the hypothesis is that the AI feature causes higher engagement, and the wave structure is supposed to test it.</p>
<p>The wave calendar replaced the coin flip, and that substitution breaks the math. A simple A/B comparison assumes randomized assignment that the rollout never produced, so the measurement tool fails even when the experiment design is sound.</p>
<p>Difference-in-differences is the causal inference method that fixes this. It subtracts the time trend by comparing how outcomes shift across time periods for each group, giving you a defensible causal estimate even without randomization.</p>
<p>In this tutorial you'll use it to measure the true causal effect of an AI feature rolled out across enterprise workspaces, with working Python code against a synthetic SaaS product dataset.</p>
<p>By the end you'll know how to run a DiD estimate, how to test its parallel-trends assumption, and what to do when that assumption fails.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-why-ab-testing-breaks-for-staged-rollouts">Why A/B Testing Breaks for Staged Rollouts</a></p>
</li>
<li><p><a href="#heading-what-difference-in-differences-does">What Difference-in-Differences Does</a></p>
</li>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-setting-up-the-working-example">Setting Up the Working Example</a></p>
</li>
<li><p><a href="#heading-step-1-a-simple-2x2-did">Step 1: A Simple 2x2 DiD</a></p>
</li>
<li><p><a href="#heading-step-2-regression-did-with-fixed-effects">Step 2: Regression DiD with Fixed Effects</a></p>
</li>
<li><p><a href="#heading-step-3-checking-the-parallel-trends-assumption">Step 3: Checking the Parallel-Trends Assumption</a></p>
</li>
<li><p><a href="#heading-when-difference-in-differences-fails">When Difference-in-Differences Fails</a></p>
</li>
<li><p><a href="#heading-what-to-do-next">What to Do Next</a></p>
</li>
</ul>
<h2 id="heading-why-ab-testing-breaks-for-staged-rollouts">Why A/B Testing Breaks for Staged Rollouts</h2>
<p>Random assignment is the engine that makes A/B testing a valid causal method. When you flip a coin to decide which user gets the feature, the treatment and control groups end up with identical distributions of every <strong>confounder</strong> (any variable that affects both who gets treatment and what outcome you measure). Any difference in outcomes after assignment is the causal effect of the treatment. Full stop.</p>
<p>A staged rollout across enterprise workspaces breaks that engine in three ways:</p>
<h4 id="heading-1-the-wave-assignment-isnt-random">1. The wave assignment isn't random.</h4>
<p>Product teams choose wave 1 workspaces for various reasons: they have the most engaged admins, the largest seat counts, or the best relationship with customer success. Those reasons correlate directly with your outcome. Wave 1 workspaces were going to show higher engagement anyway, feature or no feature.</p>
<h4 id="heading-2-the-calendar-introduces-a-time-trend">2. The calendar introduces a time trend</h4>
<p>Between week 20 (wave 1 launch) and week 30 (wave 2 launch), your product gets better, your onboarding improves, your sales team lands bigger customers. Any naïve "engagement after week 20 minus engagement before week 20" comparison picks up all of that along with the feature's effect.</p>
<h4 id="heading-3-adoption-inside-treated-workspaces-is-itself-selective">3. Adoption inside treated workspaces is itself selective</h4>
<p>Even inside a workspace that received the feature, not every user turns it on. Power users do, and less engaged users often wait months. Comparing users who used the feature against users who didn't introduces <strong>selection bias</strong>, where the groups differ systematically before you even measure the outcome, on top of the non-random workspace assignment.</p>
<p>A/B testing assumes none of these three problems exist. Staged rollouts guarantee all three. The naïve comparison gives you a number, and that number measures engagement theater.</p>
<h2 id="heading-what-difference-in-differences-does">What Difference-in-Differences Does</h2>
<p>Difference-in-differences compares the <em>change</em> in outcomes over time between a treated group and a control group. Subtracting one change from the other cancels any shared time trend (product improvements, seasonality, onboarding changes) because both groups experience it equally, leaving you with just the treatment effect.</p>
<p>Here's a concrete example. Imagine tracking quarterly revenue for coffee shops in two neighborhoods. One neighborhood gets a new competitor in Q3, the other doesn't.</p>
<p>Both neighborhoods experience the same underlying market trends, a local economic upturn, and holiday seasonality. DiD isolates the competitor's impact by subtracting whatever revenue shift happened in both neighborhoods.</p>
<p>Your staged rollout sets up the exact same structure: wave 1 workspaces are the neighborhood with the new entrant, wave 2 is the comparison.</p>
<p>The math formalizes this as a 2x2 table, where rows are groups (treated, control), columns are time periods (pre, post), and each cell holds the mean outcome for that group in that period:</p>
<ul>
<li><p><strong>A</strong> = mean task completion for wave 1 users <em>before</em> week 20 (coffee shops: Q2 revenue, neighborhood with incoming competitor)</p>
</li>
<li><p><strong>B</strong> = mean task completion for wave 1 users <em>after</em> week 20 (coffee shops: Q3 revenue, same neighborhood)</p>
</li>
<li><p><strong>C</strong> = mean task completion for wave 2 users before week 20 (coffee shops: Q2 revenue, the untouched neighborhood)</p>
</li>
<li><p><strong>D</strong> = mean task completion for wave 2 users after week 20 (coffee shops: Q3 revenue, same)</p>
</li>
</ul>
<pre><code class="language-text">                         Pre     Post
Treated (wave 1):         A       B
Control (wave 2):         C       D

Naive post-period gap:   B - D     (contaminated by group differences)
Naive treated change:    B - A     (contaminated by time trend)
DiD:                 (B - A) - (D - C)   ← the causal effect
</code></pre>
<p><code>B - A</code> is wave 1's change, but it includes both the treatment effect and whatever time trend moved everyone. <code>D - C</code> is wave 2's change over the same window, same time trend, no treatment. Subtracting one from the other leaves only the treatment effect.</p>
<p>The <strong>counterfactual</strong> is what wave 1 would have looked like without the treatment. DiD constructs it by saying: wave 1's counterfactual trajectory = wave 1's pre-period level, carried forward with wave 2's post-period trend. The gap between the actual wave 1 trajectory and that counterfactual is the DiD estimate.</p>
<img src="https://raw.githubusercontent.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/main/images/article-1/did_parallel_trends.png" alt="Causal inference with difference-in-differences: parallel trends and treatment effect" style="display:block;margin:0 auto" width="1485" height="807" loading="lazy">

<p><em>Figure 1: Causal inference with difference-in-differences. Blue solid: Wave 1 actual trajectory. Orange dashed: Wave 2 (control, untreated during this window). Blue dotted: the counterfactual, where Wave 1 would have gone based on Wave 2's post-period trend. The green arrow is the DiD estimate: the gap between the actual Wave 1 trajectory and the counterfactual in the post-treatment period. A, B, C, D correspond to the four cells in the table above.</em></p>
<p>Before week 20, wave 1 and wave 2 track each other closely. That's the parallel-trends requirement at work. At week 20, wave 1 pulls ahead of both wave 2 and its own counterfactual (the dotted line). That post-treatment divergence is the DiD estimate.</p>
<p>The DiD estimate handles two types of bias at once. Permanent differences between treated and control groups (wave 1 workspaces were always more engaged) cancel out because DiD focuses on <em>changes</em> in outcomes across time periods. Time trends that affect both groups (product improvements, market seasonality) cancel out because both groups experience them.</p>
<p>DiD asks one thing in return: parallel pre-treatment trends. The treated and control groups have to be moving in the same direction at the same rate before treatment starts. When that holds, you can extrapolate the shared trend forward and attribute any post-treatment divergence to the treatment. If the trends were already diverging before treatment, DiD is biased, and no amount of clever regression fixes it.</p>
<p>Parallel trends is the assumption you'll test in step 3.</p>
<h3 id="heading-companion-notebook">Companion Notebook</h3>
<p>All the code in this tutorial, including the synthetic dataset, the DiD regression, the parallel-trends plot, and the placebo pre-trend test, lives in a single executable Jupyter notebook in the GitHub repo for this series on product experimentation and causal inference for GenAI and LLM applications.</p>
<p>You can clone it, run <code>generate_data.py</code> once, and every output in this article reproduces exactly: <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/01_did_staged_rollouts">github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm</a></p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You'll need Python 3.11 or newer and comfort with pandas and basic regression. You can follow along without prior causal inference experience, as the article defines confounders and selection bias inline when they first appear. You'll encounter clustered standard errors and fixed effects in step 2. The article explains what they do and why they matter, but it doesn't derive them from scratch.</p>
<p>Install the packages for this tutorial:</p>
<pre><code class="language-bash">pip install numpy pandas statsmodels linearmodels matplotlib
</code></pre>
<p>Clone the companion repo to get the synthetic dataset:</p>
<pre><code class="language-bash">git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv
</code></pre>
<h2 id="heading-setting-up-the-working-example">Setting Up the Working Example</h2>
<p>The dataset simulates a SaaS product with an AI summaries feature launched in two waves: wave 1 workspaces get it at week 20, wave 2 at week 30, with 50,000 users total, each with one row of <a href="https://www.freecodecamp.org/news/how-to-use-opentelemetry/">telemetry</a>.</p>
<p>The data generator bakes in a +5 percentage point causal effect on task completion for users in their workspace's post-treatment period. You know the truth upfront, so you can check whether your DiD estimator actually recovers it.</p>
<p>Load the data and inspect the structure:</p>
<pre><code class="language-python">import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")
print(df.shape)
print(df[["wave", "signup_week", "workspace_id", "task_completed"]].head())
print("\nWave sizes:", df.wave.value_counts().to_dict())
print("Treatment weeks per wave:",
      df.groupby("wave").treatment_week.first().to_dict())
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">(50000, 16)
   wave  signup_week  workspace_id  task_completed
0     2           10            36               0
1     2           51            44               1
2     2            2            28               1
3     1           15            20               1
4     1           29             0               1
Wave sizes: {2: 25063, 1: 24937}
Treatment weeks per wave: {1: 20, 2: 30}
</code></pre>
<p>Here's what's happening: you load 50,000 rows, one per user. Wave 1 has about 24,937 users across 25 workspaces; wave 2 has about 25,063 users across 25 different workspaces. The <code>treatment_week</code> column records when each user's workspace got the AI summaries feature (week 20 for wave 1, week 30 for wave 2). The <code>task_completed</code> column is your outcome: did the AI successfully complete the user's task.</p>
<p>One important detail: <code>signup_week</code> in this dataset records which calendar week a user first joined the product, and we're using it as a time index to assign users to pre- or post-treatment cohorts.</p>
<p>A user who signed up in week 22 joined after the feature launched, so their experience is "post-treatment." A user who signed up in week 14 joined before the launch, so their experience is "pre-treatment."</p>
<p>This works here because each user has one row of telemetry tied to their initial product experience. In a panel dataset with multiple observations per user across time, you'd use an observation timestamp column tied to when each row was recorded.</p>
<p>To keep the analysis clean, restrict to users who signed up before the wave 2 launch (<code>signup_week &lt; 30</code>). Wave 2 then works as a proper control group, since it hasn't been treated yet, while wave 1 has been treated for 10 weeks.</p>
<pre><code class="language-python">analysis = df[df.signup_week &lt; 30].copy()
analysis["post"] = (analysis.signup_week &gt;= 20).astype(int)
analysis["treated"] = (analysis.wave == 1).astype(int)

print(analysis.groupby(["treated", "post"])
              .agg(n=("user_id", "count"),
                   mean_completion=("task_completed", "mean"))
              .round(3))
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">                 n  mean_completion
treated post
0       0     9590            0.556
        1     4878            0.555
1       0     9633            0.592
        1     4738            0.643
</code></pre>
<p>Here's what's happening: you filter the data to the analysis window (weeks 0 to 29) and create two indicator variables. <code>post</code> is 1 for users in the post-week-20 period, 0 otherwise. <code>treated</code> is 1 for wave 1 users, 0 for wave 2. The groupby shows the four cells of the DiD 2x2 table: (treated=0, post=0), (treated=0, post=1), (treated=1, post=0), (treated=1, post=1). Those four means are everything you need for a first-pass DiD estimate.</p>
<h2 id="heading-step-1-a-simple-2x2-did">Step 1: A Simple 2x2 DiD</h2>
<p>Start with the cleanest version. Compute the four cell means by hand, then take the difference of differences:</p>
<pre><code class="language-python">cells = analysis.groupby(["treated", "post"]).task_completed.mean()

wave2_pre  = cells.loc[(0, 0)]   # control, pre
wave2_post = cells.loc[(0, 1)]   # control, post
wave1_pre  = cells.loc[(1, 0)]   # treated, pre
wave1_post = cells.loc[(1, 1)]   # treated, post

did_effect = (wave1_post - wave1_pre) - (wave2_post - wave2_pre)
print(f"Wave 1 change: {wave1_post - wave1_pre:+.4f}")
print(f"Wave 2 change: {wave2_post - wave2_pre:+.4f}")
print(f"DiD effect:    {did_effect:+.4f}")
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Wave 1 change: +0.0515
Wave 2 change: -0.0013
DiD effect:    +0.0527  (ground truth = +0.05)
</code></pre>
<p>Here's what's happening: you pull the four cell means, compute wave 1's change in task completion from pre to post, compute wave 2's change over the same calendar window (wave 2 hasn't been treated yet), and take the difference. The DiD estimate is the piece of wave 1's change that can't be explained by whatever time trend also moved wave 2.</p>
<p>On this dataset the simple 2x2 estimate lands at +0.053, which is very close to the true +0.05. But you can't take this number to a product review. You have no standard errors, which means you can't say whether +0.053 is a real signal or within sampling noise. You have no covariate adjustment, so if wave 1 happened to have more heavy users in this cohort, some of that +0.053 could be engagement-tier composition. And you have no way to handle the workspace-level correlation in your data. Step 2 fixes all three.</p>
<h2 id="heading-step-2-regression-did-with-fixed-effects">Step 2: Regression DiD with Fixed Effects</h2>
<p>The regression formulation of DiD produces the same point estimate as the 2x2 table when there are no covariates. But it also buys you three things:</p>
<ul>
<li><p><strong>Standard errors and p-values</strong> computed correctly</p>
</li>
<li><p><strong>Covariate adjustment</strong> to reduce variance and sharpen your estimate</p>
</li>
<li><p><strong>Cluster-robust errors</strong> that handle correlation within workspaces, which a staged rollout always has</p>
</li>
</ul>
<p>The regression is: <code>outcome ~ treated + post + treated:post + controls</code>. The coefficient on the <code>treated:post</code> interaction is your DiD estimate.</p>
<pre><code class="language-python">import statsmodels.formula.api as smf

did_model = smf.ols(
    "task_completed ~ treated * post + C(engagement_tier)",
    data=analysis
).fit(
    cov_type="cluster",
    cov_kwds={"groups": analysis.workspace_id}
)

print(did_model.summary().tables[1])
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">================================================================================================
                                   coef    std err          z      P&gt;|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                        0.8301      0.007    126.538      0.000       0.817       0.843
C(engagement_tier)[T.light]     -0.4027      0.006    -63.168      0.000      -0.415      -0.390
C(engagement_tier)[T.medium]    -0.1766      0.007    -25.931      0.000      -0.190      -0.163
treated                          0.0367      0.005      6.885      0.000       0.026       0.047
post                            -0.0056      0.008     -0.684      0.494      -0.022       0.011
treated:post                     0.0541      0.011      4.981      0.000       0.033       0.075
================================================================================================
</code></pre>
<p>Here's what's happening: you fit an ordinary least squares regression of task completion on the <code>treated</code> indicator, the <code>post</code> indicator, their interaction, and a categorical control for engagement tier.</p>
<p>The <code>treated:post</code> coefficient is the DiD estimate. Users in the same workspace share common shocks, making their outcomes correlated. Grouping by <code>workspace_id</code> corrects for that.</p>
<p>On this dataset the <code>treated:post</code> coefficient comes out at +0.054 with a clustered p-value of &lt;0.001. The ground truth is +0.050. At 0.4 percentage points from the true effect, with a standard error that accounts for workspace-level correlation, that's a number you can put in a product review.</p>
<p>A few practical notes on this regression:</p>
<ul>
<li><p><strong>Controls should be time-invariant</strong> (engagement tier, signup cohort). Time-varying controls that are themselves affected by treatment will bias the estimate.</p>
</li>
<li><p><strong>Only the interaction has a causal interpretation.</strong> The intercept and level terms describe baseline differences between groups, nothing more.</p>
</li>
<li><p><strong>Clustered errors are mandatory.</strong> Skip clustering and your standard errors are 3 to 10x too small, test statistics are artificially inflated, and results look far more significant than they are.</p>
</li>
</ul>
<h2 id="heading-step-3-checking-the-parallel-trends-assumption">Step 3: Checking the Parallel-Trends Assumption</h2>
<p>DiD is only valid if wave 1 and wave 2 were moving in the same direction at the same rate <em>before</em> treatment started. You check this by plotting (or tabulating) weekly means for the two waves across the pre-treatment window.</p>
<pre><code class="language-python">import matplotlib.pyplot as plt
import numpy as np

df_plot = df[df.signup_week &lt; 30].copy()
weekly = (df_plot.groupby(["signup_week", "wave"])
             .task_completed.mean()
             .reset_index()
             .pivot(index="signup_week", columns="wave", values="task_completed"))

# 3-week rolling average to smooth week-to-week sampling noise
smoothed = weekly.rolling(3, center=True, min_periods=2).mean()

TREATMENT_WEEK = 20
pre_idx = smoothed.index[smoothed.index &lt; TREATMENT_WEEK]
post_idx = smoothed.index[smoothed.index &gt;= TREATMENT_WEEK]

# DiD counterfactual: wave 1 pre-period mean + wave 2's post-period change
wave1_pre_mean = smoothed.loc[pre_idx, 1].mean()
wave2_pre_mean = smoothed.loc[pre_idx, 2].mean()
counterfactual = wave1_pre_mean + (smoothed.loc[post_idx, 2].values - wave2_pre_mean)

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.axvspan(-0.5, TREATMENT_WEEK, alpha=0.04, color="#94A3B8", zorder=0)
ax.axvspan(TREATMENT_WEEK, 29.5, alpha=0.06, color="#3B82F6", zorder=0)
ax.plot(smoothed.index, smoothed[2], "s--", color="#F59E0B", linewidth=2,
        markersize=4, label="Wave 2 — control (untreated during this window)", zorder=3)
ax.plot(smoothed.index, smoothed[1], "o-", color="#2563EB", linewidth=2.2,
        markersize=4, label="Wave 1 — treated (AI feature on at week 20)", zorder=4)
ax.plot(post_idx, counterfactual, ":", color="#2563EB", linewidth=2.2,
        label="Wave 1 counterfactual (projected without treatment)", zorder=4)
ax.axvline(TREATMENT_WEEK, color="#DC2626", linestyle="--", linewidth=1.8,
           label="AI feature launched (week 20)")

ax.text(9.5, 0.508, "Pre-treatment period\n(parallel trends required)",
        fontsize=9, ha="center", color="#64748B", style="italic")
ax.text(24, 0.508, "Post-treatment",
        fontsize=9, ha="center", color="#64748B", style="italic")
ax.set_xlabel("Week", fontsize=11)
ax.set_ylabel("Mean task completion rate", fontsize=11)
ax.set_title("Figure 2: Data-Driven Parallel-Trends Check\n(3-week rolling average, 50k users)",
             fontsize=12, fontweight="bold", pad=14)
ax.legend(loc="upper left", fontsize=9, framealpha=0.92)
ax.set_xlim(-0.5, 29.5)
ax.set_ylim(0.50, 0.72)
ax.grid(True, alpha=0.18, linestyle=":")
ax.tick_params(labelsize=10)
plt.tight_layout()
plt.savefig("parallel_trends.png", dpi=150, bbox_inches="tight")
print("Saved parallel_trends.png")
</code></pre>
<p><strong>Expected output (Figure 2, data-driven verification):</strong></p>
<pre><code class="language-text">Saved parallel_trends.png
</code></pre>
<img src="https://raw.githubusercontent.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/main/images/article-1/parallel_trends.png" alt="Parallel trends visual check, data-driven verification" style="display:block;margin:0 auto" width="1486" height="804" loading="lazy">

<p><em>Figure 2 is the data-driven parallel-trends check from your actual dataset, plotted as a 3-week rolling average to smooth week-to-week sampling noise. Both waves track each other closely before week 20, and small wiggles in the pre-period affect both groups at the same time, which is exactly what parallel trends looks like. After week 20, wave 1 separates cleanly above the dotted counterfactual line. The gap between the solid blue line and the dotted line in the post-treatment window is the DiD estimate playing out in your actual data.</em></p>
<p>Here's what's happening: you group by signup week and wave, compute the mean task completion rate per cell, pivot so each wave is a column, and plot the two time series together.</p>
<p>A vertical dashed line marks week 20 when wave 1 got treatment. In the pre-treatment window (weeks 0 to 19) the two series should track each other closely. After week 20, wave 1 should pull ahead of wave 2 by roughly the treatment effect.</p>
<p>To put a number on it, run a placebo regression on the pre-treatment period only. Regress the outcome on a linear time trend interacted with the treated indicator. If the interaction coefficient is near zero and insignificant, the two groups were moving in parallel before treatment:</p>
<pre><code class="language-python">pre_only = analysis[analysis.post == 0].copy()
pre_only["weeks_since_start"] = pre_only.signup_week - 10  # center

placebo_model = smf.ols(
    "task_completed ~ treated * weeks_since_start + C(engagement_tier)",
    data=pre_only
).fit(
    cov_type="cluster",
    cov_kwds={"groups": pre_only.workspace_id}
)

print("Pre-trend slope difference:",
      placebo_model.params["treated:weeks_since_start"])
print("p-value:",
      placebo_model.pvalues["treated:weeks_since_start"])
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="language-text">Pre-trend slope difference: -0.00095...
p-value: 0.4435...
</code></pre>
<p>Here's what's happening: you restrict to pre-treatment observations, fit a regression that lets wave 1 and wave 2 follow different linear trends in the pre-period, and read off the interaction coefficient.</p>
<p>A coefficient close to zero with p &gt; 0.05 means the two waves were moving in parallel before treatment. If that coefficient is large and statistically significant, the parallel-trends assumption is broken: your DiD estimate is absorbing whatever differential trend separated the groups before week 20.</p>
<p>If the placebo test fails, stop and rethink. Your options: restrict to a narrower pre-window where trends were parallel, find a better control group, or switch to synthetic control, which builds a weighted counterfactual from multiple untreated units.</p>
<p>On this synthetic dataset the placebo test passes: the pre-trend slope difference is -0.00095 with p = 0.44, so the parallel-trends assumption holds and the +0.054 estimate from step 2 is trustworthy.</p>
<h2 id="heading-when-difference-in-differences-fails">When Difference-in-Differences Fails</h2>
<p>DiD is a precise accounting method, and every precise method has specific failure modes worth knowing before you trust its output. Here are four common ones:</p>
<h3 id="heading-1-non-parallel-pre-trends">1. Non-parallel Pre-trends</h3>
<p>When the treated and control groups were already diverging before treatment started, DiD mistakes that pre-existing drift for a treatment effect.</p>
<p>The placebo test in step 3 is your guard. Run it every time. If it fails, you have three options:</p>
<ol>
<li><p>Restrict the analysis to a shorter pre-window where trends were parallel and re-run the placebo</p>
</li>
<li><p>Find a better control group whose pre-trend matches the treated group</p>
</li>
<li><p>Switch to synthetic control, which builds a weighted counterfactual from multiple untreated units and picks the weights to match the treated group's pre-treatment trajectory</p>
</li>
</ol>
<h3 id="heading-2-staggered-adoption">2. Staggered Adoption</h3>
<p>A staged rollout with three or more waves demands a different approach than a clean 2x2. Wave 1 gets treated at week 20, wave 2 at week 30, wave 3 at week 40. Once wave 2 is treated, it's no longer a valid control for wave 1 comparisons that span weeks 30 and beyond. Earlier treated units start acting as controls for later ones, which contaminates the estimate.</p>
<p>That's the Goodman-Bacon decomposition problem, and the standard two-way fixed effects estimator from step 2 will silently absorb it. The Callaway-Sant'Anna estimator (see <a href="https://www.sciencedirect.com/science/article/abs/pii/S0304407620303948">their 2021 paper</a>) fixes this by averaging only the clean 2x2 comparisons and discarding the contaminated ones. The <code>differences</code> package in Python implements it.</p>
<h3 id="heading-3-time-varying-confounders-that-hit-only-the-treated-group">3. Time-varying Confounders that Hit Only the Treated Group</h3>
<p>If your marketing team runs a targeted campaign in wave 1 workspaces during week 22, you've got a treatment-specific shock DiD can't net out.</p>
<p>Parallel trends certifies the pre-treatment period, but the post-treatment window remains your responsibility to audit.</p>
<p>Check every product or marketing event inside the analysis window. If you find one, the only options are to redesign the study, restrict the analysis to the window before the shock, or model the shock explicitly as a second treatment variable.</p>
<h3 id="heading-4-anticipation-effects">4. Anticipation Effects</h3>
<p>If wave 1 customers knew in week 18 that the feature was coming in week 20, some will have started behaving differently before treatment technically started: signing up more, pre-configuring settings, contacting support. That contaminates the "pre" period. The tell is a bump or dip in wave 1 in the weeks immediately before week 20 on the event-study plot.</p>
<p>The fix is to push the pre-period cutoff back. Treat week 18 as the "treatment" start for purposes of the analysis, which removes the anticipation window from your pre-period baseline.</p>
<p>Each of these failure modes has a diagnostic and a specific remedy. Naming them in your analysis builds credibility with skeptical reviewers. DiD is a careful accounting identity – it produces reliable estimates exactly as long as its inputs are clean.</p>
<h2 id="heading-what-to-do-next">What to Do Next</h2>
<p>The regression DiD above is the right tool for a two-wave rollout. If your rollout has three or more waves, switch to the Callaway-Sant'Anna estimator. If your rollout crosses a treatment threshold you set deliberately (confidence scores, query complexity), look into regression discontinuity. If you want to compare a single treated unit against a constructed counterfactual, synthetic control is the right choice.</p>
<p>The <a href="http://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm">companion notebook for this tutorial is here</a>. Clone the repo, generate the synthetic dataset with <code>generate_data.py</code>, and open <code>did_demo.ipynb</code> to reproduce every code block with pre-saved outputs.</p>
<p>If you ship AI features in waves, your rollout calendar is already a DiD study. The only question is whether you run the analysis.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
