experimentation - freeCodeCamp.org

Product Experimentation with Regression Discontinuity: How an LLM Confidence Threshold Creates a Natural Experiment in Python

Rudrendu Paul — Fri, 08 May 2026 15:33:41 +0000

Causal inference for LLM-based features starts with one question editors ask before they ship anything: Did the change actually move the metric, or did the metric just move?

Let's say that your team built a routing layer that splits incoming queries between two models: queries with a confidence score below 0.85 go to a premium model, and those above 0.85 go to a cheaper distilled model. The premium model costs 5x as much as the cheaper one.

Your boss wants the answer that ends the debate: Is the premium model worth it for the queries it sees?

You can't run a clean A/B test, because routing is deterministic: a query at confidence 0.84 always gets premium, a query at 0.86 always gets cheap, and you can't randomize the assignment.

You also can't trust a naïve comparison of premium-routed users against cheap-routed users. Premium handles the harder queries by design (that's the reason you built the gate), so the two groups differ in query difficulty before either model touches them.

The threshold itself is your free experiment. Right at 0.85, the assignment flips, but the queries on either side of that boundary are essentially identical. A query at confidence 0.849 isn't meaningfully different from a query at 0.851. Any differences in outcomes between the two narrow groups stem solely from the routing decision. That's what regression discontinuity design (RDD) reads.

In this tutorial, you'll use Python to estimate the causal effect of premium routing on task completion using sharp RDD with local linear regression. You'll sweep bandwidths to test estimate stability, run a manipulation diagnostic, check robustness with a quadratic specification, and bootstrap 95% confidence intervals around every point estimate.

The LLM telemetry is a 50,000-user synthetic dataset with the ground-truth premium-routing effect baked in at +6 percentage points, so you can verify that RDD recovers it.

Companion code: every code block runs end-to-end in the companion notebook.

Why Threshold Routing is a Natural Experiment
What Regression Discontinuity Actually Does
Prerequisites
Setting Up the Working Example
Step 1: A Sharp RDD with Local Linear Regression
Step 2: Try Different Bandwidths
Step 3: Checking for Manipulation at the Threshold
Step 4: Quadratic Specification as a Robustness Check
Step 5: Bootstrap Confidence Intervals
When Regression Discontinuity Fails
What to Do Next

Why Threshold Routing is a Natural Experiment

The product reason this routing rule exists is to help your team spend the premium model budget where it earns its keep. Low-confidence queries are the harder ones, which is where a stronger model has the most upside. High-confidence queries already look easy enough for the cheap model to handle.

You'll see this routing direction across confidence-score gates for Q&A assistants, query-complexity gates in multi-model gateways like OpenRouter, safety-score gates in content moderation, and latency-budget gates that re-route when the cheap model would exceed a p99 latency budget.

The mechanism is the same in every case: a continuous score, a threshold, and a deterministic routing rule.

What makes this setup useful for causal inference is that users don't pick which model they get. A query lands, the system computes confidence, and the routing layer decides. Right at the threshold, the user's experience flips from premium to cheap based on a difference too small to be meaningful.

Again, a query at 0.849 confidence isn't shipping a different problem to the model than a query at 0.851. Anything that differs in outcomes between those two groups is the routing decision speaking. The underlying query is the same.

That local randomness is the experiment RDD reads from. You don't need a randomized control group, you don't need a propensity score. And you don't need an instrument, you need a sharp threshold that nobody can game.

What Regression Discontinuity Actually Does

The jump at the threshold is the causal effect, which is the number a product team can act on. RDD reads it by fitting two separate regression lines to the outcome: one for users just below the threshold and one for users just above. The vertical difference between those two fitted lines at the cutoff is the local average treatment effect at that point.

Graphically, picture task completion on the y-axis and query confidence on the x-axis. Completion generally trends with confidence (easier queries complete more often). At exactly 0.85, though, users below the cutoff get premium routing, and users above get cheap.

If premium routing helps, you'd see a sharp upward jump in task completion just below 0.85, then disappear just above. Approached from left to right with confidence rising, the visual reads as a downward step at 0.85, because you're moving from the premium-treated zone into the cheap-treated zone.

Figure 1. Conceptual schematic. Two outcome trajectories, one for premium-routed queries (confidence below 0.85) and one for cheap-routed queries (confidence above 0.85), meet at the threshold but don't match. The vertical gap between their endpoints at 0.85 is the local causal effect of premium routing.

That gap is identified under two named assumptions:

No manipulation of the running variable: Users (or your system) can't precisely nudge a query's confidence score across the cutoff. If anyone can game their score to land just below 0.85 and grab premium routing, the cutoff is no longer drawn at random, and RDD breaks.
Continuity of potential outcomes at the cutoff: Every other factor that affects task completion (query type, user expertise, workspace tenure, time of day) varies smoothly across 0.85. Only the routing assignment changes discontinuously at exactly the threshold. If a second product rule fires at 0.85 (a different logging level, a separate UI treatment, a retry policy), RDD will attribute that rule's effect to the routing decision.

These are the two assumptions you check before you trust the estimate. Step 3 below tests the first one. The second is a structural property of your system that you have to know cold.

Two practical choices shape every RDD: the bandwidth (how close to the cutoff to restrict the analysis) and the functional form (linear, quadratic, or local polynomial).

Narrow bandwidths cut potential bias by staying close to the local-randomization zone, but they shrink the sample. Linear specifications are stable, though they assume the underlying relationship can be approximated by a straight line on each side.

You'll try both linear and quadratic specifications at multiple bandwidths to see whether the answer holds.

The article uses sharp RDD throughout, since assignment is a deterministic function of confidence (below 0.85 always premium, above 0.85 always cheap). When the threshold is probabilistic and compliance is partial, the design is a fuzzy RDD, which requires an instrumental variables framework that you can implement using the rdrobust Python package.

Prerequisites

You need Python 3.11 or newer, comfort with pandas and statsmodels, and rough familiarity with linear regression and interaction terms.

Install the packages used in this tutorial:

pip install numpy pandas statsmodels matplotlib scipy

Here's what's happening: four standard scientific Python libraries plus matplotlib for the diagnostic visualization. Nothing exotic.

Clone the companion repo and generate the synthetic dataset:

git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv

Here's what's happening: the data generator draws 50,000 users with a query_confidence score from a Beta(5,2) distribution, applies the routing rule (routed_to_premium = query_confidence < 0.85), and bakes a +6-percentage-point premium routing effect into task_completed. Same seed, same dataset, every time.

Setting Up the Working Example

The dataset simulates a SaaS product that routes queries between a premium and a cheap model based on confidence score. The threshold is 0.85, and the ground-truth causal effect of premium routing is +6 percentage points on task completion. You know the truth, so you can check whether RDD recovers it.

Load the data and look at the routing breakdown:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data/synthetic_llm_logs.csv")
print(f"Loaded {len(df):,} rows, {df.shape[1]} columns")

print("\nRouting breakdown:")
counts = df.routed_to_premium.value_counts().to_dict()
print(f"  Premium-routed (confidence < 0.85):  {counts.get(1, 0):,}")
print(f"  Cheap-routed   (confidence >= 0.85): {counts.get(0, 0):,}")

print("\nQuery confidence distribution:")
print(df.query_confidence.describe().round(3))

Expected output:

Loaded 50,000 rows, 16 columns

Routing breakdown:
  Premium-routed (confidence < 0.85):  38,874
  Cheap-routed   (confidence >= 0.85): 11,126

Query confidence distribution:
count    50000.000
mean         0.715
std          0.159
min          0.078
25%          0.611
50%          0.736
75%          0.838
max          0.998

Here's what's happening: about 78% of queries land below the 0.85 cutoff and get premium routing. The Beta(5,2) distribution is skewed toward the upper end, with a median of 0.736, and most of its mass still sits below 0.85. The remaining 22% are queries that the model already feels confident about, and they go to the cheap model.

Before any regression, look at the naïve comparison every product team is tempted to run:

naive = (
    df[df.routed_to_premium == 1].task_completed.mean()
    - df[df.routed_to_premium == 0].task_completed.mean()
)
print(f"Naive premium-vs-cheap effect: {naive:+.4f}  (ground truth = +0.06)")

Expected output:

Naive premium-vs-cheap effect: +0.0632  (ground truth = +0.06)

Here's what's happening: the naive estimate sits at +0.0632, which is suspiciously close to the truth. That's a coincidence of this specific synthetic dataset, where the only confounder of premium vs. cheap is query_confidence itself, and the outcome doesn't depend on confidence except through routing.

In production, you almost never get this lucky. User expertise, prompt phrasing, time of day, and a dozen unobserved query traits all correlate with confidence and with completion.

A naïve comparison in a real system can be off by 50% or more in either direction. RDD gives you identification that doesn't depend on the absence of hidden confounders.

Step 1: A Sharp RDD with Local Linear Regression

The basic sharp RDD estimator is a local linear regression. Restrict to users whose confidence sits within a bandwidth of the cutoff, fit separate linear slopes on each side, and read off the jump at 0.85.

cutoff = 0.85
bw = 0.10

near = df[(df.query_confidence > cutoff - bw)
          & (df.query_confidence < cutoff + bw)].copy()
near["below_cutoff"] = (near.query_confidence < cutoff).astype(int)
near["rc"] = near.query_confidence - cutoff

rdd_model = smf.ols(
    "task_completed ~ below_cutoff + rc + below_cutoff:rc",
    data=near,
).fit(cov_type="HC3")

effect = rdd_model.params["below_cutoff"]
print(f"RDD effect at cutoff (LATE): {effect:+.4f}")
print(f"Std error (HC3):             {rdd_model.bse['below_cutoff']:.4f}")
print(f"p-value:                     {rdd_model.pvalues['below_cutoff']:.4f}")
print(f"N users in [0.75, 0.95):     {len(near):,}")

Expected output:

RDD effect at cutoff (LATE): +0.0548
Std error (HC3):             0.0131
p-value:                     0.0000
N users in [0.75, 0.95):     21,689

Here's what's happening: the model fits separate intercepts and slopes on each side of 0.85 (below_cutoff is the side indicator, rc is confidence centered at the cutoff). The coefficient on below_cutoff reads off the vertical jump at the threshold, which is the local average treatment effect (LATE) for queries with confidence near 0.85. You get +0.0548, within sampling noise of the +0.06 ground truth.

Three notes on the specification. First, task_completed is binary, so this is a linear probability model. For RDD with a binary outcome at the cutoff, the linear probability model is standard practice because local linearity is the identifying assumption either way. Logit at the cutoff is an alternative if you need bounded predictions globally.

Second, the standard errors are used cov_type="HC3" to relax the homoskedasticity assumption, which is almost always wrong for binary outcomes.

Third, the dataset has one query per user with no within-user clustering, so cluster-robust standard errors aren't needed here. In a setting with multiple queries per user, you'd cluster on user_id.

The next diagnostic to look at is the confidence distribution near the cutoff. Figure 2 shows what 50,000 queries look like in the bandwidth window:

Figure 2. Real distribution from the 50,000-user synthetic dataset. Unlike the schematic in Figure 1, this shows the actual query density by confidence score, with the routing threshold annotated. The bottom panel counts how many queries land in each 2-percentage-point bin near the cutoff (2,461 / 2,481 / 2,335 / 2,229 / 2,048 across the 0.80–0.90 range). The roughly uniform spread is the visual signal that no manipulation is concentrating users on one side of the threshold.

Step 2: Try Different Bandwidths

Bandwidth choice matters. Too narrow and you have too few observations, so the confidence interval blows up. Too wide and you're extrapolating into regions where the linear specification is no longer a reasonable local approximation.

The honest move is to try multiple bandwidths and report whether the estimate holds.

results = []
for bw in [0.05, 0.10, 0.15, 0.20]:
    sub = df[(df.query_confidence > cutoff - bw)
             & (df.query_confidence < cutoff + bw)].copy()
    sub["below_cutoff"] = (sub.query_confidence < cutoff).astype(int)
    sub["rc"] = sub.query_confidence - cutoff

    m = smf.ols(
        "task_completed ~ below_cutoff + rc + below_cutoff:rc",
        data=sub,
    ).fit(cov_type="HC3")

    results.append({
        "bandwidth": bw,
        "n": len(sub),
        "effect": m.params["below_cutoff"],
        "se": m.bse["below_cutoff"],
        "p": m.pvalues["below_cutoff"],
    })

print(pd.DataFrame(results).round(4).to_string(index=False))

Expected output:

 bandwidth      n  effect     se       p
      0.05  11554  0.0635  0.0183  0.0005
      0.10  21689  0.0548  0.0131  0.0000
      0.15  29137  0.0618  0.0112  0.0000
      0.20  34074  0.0614  0.0107  0.0000

Here's what's happening: four bandwidths from ±0.05 to ±0.20 around the cutoff, refitting the same RDD specification at each. The estimates range from +0.0548 to +0.0635, all in the same neighborhood as the +0.06 ground truth, with standard errors that shrink as the bandwidth widens and grow as it narrows. Every p-value is well below 0.05. Whether the estimates are "stable" depends on the confidence intervals around them, which Step 5 produces with the bootstrap.

Step 3: Checking for Manipulation at the Threshold

RDD is valid only if users can't precisely manipulate the running variable around the cutoff. If your users (or your system) can nudge confidence scores just below 0.85 to force premium routing, you get a density spike at the cutoff, and the RDD estimate is contaminated.

The standard diagnostic is the McCrary density test, which checks whether the distribution of the running variable has a sharp jump at the cutoff. The simple version: bin the data tightly around 0.85 and check whether the counts on the two sides are similar.

print("User counts in 2-percentage-point bins around 0.85:")
for lo in [0.80, 0.82, 0.84, 0.86, 0.88]:
    hi = lo + 0.02
    cnt = ((df.query_confidence >= lo) & (df.query_confidence < hi)).sum()
    print(f"  [{lo:.2f}, {hi:.2f}):  n = {cnt:,}")

Expected output:

User counts in 2-percentage-point bins around 0.85:
  [0.80, 0.82):  n = 2,461
  [0.82, 0.84):  n = 2,481
  [0.84, 0.86):  n = 2,335
  [0.86, 0.88):  n = 2,229
  [0.88, 0.90):  n = 2,048

Here's what's happening: counts trend gently downward across the bandwidth because Beta(5,2) places more mass at higher confidence levels, and the density tapers as it approaches 1.0. There's no spike or dip at the 0.84–0.86 bin that straddles the cutoff. The 433-user spread across all five bins is consistent with smooth tapering of the underlying density.

That's the pattern you want when manipulation is absent. For a more rigorous test, the rddensity Python package implements the formal McCrary procedure with bias-corrected standard errors.

What manipulation looks like when it's real: a spike in users at confidences just barely below 0.85 (they're being nudged into premium routing) and a dip just above. If you see that pattern, the RDD estimate overstates the causal effect because the users right below 0.85 differ in motivation from those right above. They cared enough to manipulate the score, and they'd have shown different outcomes even under random routing.

Step 4: Quadratic Specification as a Robustness Check

If the true relationship between confidence and task completion isn't exactly linear, a local linear RDD can mistake the curvature for a jump. The standard robustness check allows quadratic terms on both sides of the cutoff and tests whether the estimate holds.

near = df[(df.query_confidence > cutoff - 0.10)
         & (df.query_confidence < cutoff + 0.10)].copy()
near["below_cutoff"] = (near.query_confidence < cutoff).astype(int)
near["rc"] = near.query_confidence - cutoff
near["rc2"] = near.rc ** 2

rdd_quad = smf.ols(
    "task_completed ~ below_cutoff + rc + below_cutoff:rc"
    " + rc2 + below_cutoff:rc2",
    data=near,
).fit(cov_type="HC3")

print(f"Linear RDD    (bw=0.10):  effect = +0.0548, p < 0.0001")
print(f"Quadratic RDD (bw=0.10):  effect = "
      f"{rdd_quad.params['below_cutoff']:+.4f}, "
      f"p = {rdd_quad.pvalues['below_cutoff']:.4f}")

Expected output:

Linear RDD    (bw=0.10):  effect = +0.0548, p < 0.0001
Quadratic RDD (bw=0.10):  effect = +0.0569, p = 0.0036

Here's what's happening: the quadratic specification adds squared terms and interactions with the cutoff indicator, allowing the relationship to curve differently on each side. The below_cutoff coefficient still captures the jump at the threshold, now under a more flexible specification.

The two estimates differ by 0.0022, both close to the +0.06 ground truth, and both are significant at p < 0.01. The answer doesn't change when you let the model bend.

When linear and quadratic specifications disagree noticeably, you have a real signal. With small samples (a few thousand at narrow bandwidths), the quadratic version can lose power because four extra parameters need data to be identified.

The standard move is to widen the bandwidth and re-run both specifications. If they still disagree at wider bandwidths, the linear approximation is wrong, and you should report both numbers.

Step 5: Bootstrap Confidence Intervals

Every point estimate in this article is a single number from a finite sample. The bootstrap quantifies how much that number would move under resampling, which is what a confidence interval describes.

def bootstrap_ci(df, cutoff, bw, quadratic=False, n_reps=500, seed=7):
    rng = np.random.default_rng(seed)
    near = df[(df.query_confidence > cutoff - bw)
              & (df.query_confidence < cutoff + bw)].copy()
    near["below_cutoff"] = (near.query_confidence < cutoff).astype(int)
    near["rc"] = near.query_confidence - cutoff
    if quadratic:
        near["rc2"] = near.rc ** 2
        formula = ("task_completed ~ below_cutoff + rc + below_cutoff:rc"
                   " + rc2 + below_cutoff:rc2")
    else:
        formula = "task_completed ~ below_cutoff + rc + below_cutoff:rc"

    n = len(near)
    estimates = np.empty(n_reps)
    for i in range(n_reps):
        sample = near.iloc[rng.integers(0, n, size=n)]
        m = smf.ols(formula, data=sample).fit()
        estimates[i] = m.params["below_cutoff"]
    return (np.percentile(estimates, 2.5), np.percentile(estimates, 97.5))


print("Linear RDD (bw=0.10):")
lo, hi = bootstrap_ci(df, cutoff, bw=0.10)
print(f"  effect = +0.0548   95% CI: [{lo:+.4f}, {hi:+.4f}]")

print("\nBandwidth sensitivity:")
for bw, eff in [(0.05, 0.0635), (0.10, 0.0548), (0.15, 0.0618), (0.20, 0.0614)]:
    lo, hi = bootstrap_ci(df, cutoff, bw=bw)
    print(f"  bw = {bw:.2f}   effect = {eff:+.4f}   "
          f"95% CI: [{lo:+.4f}, {hi:+.4f}]")

print("\nQuadratic RDD (bw=0.10):")
lo, hi = bootstrap_ci(df, cutoff, bw=0.10, quadratic=True)
print(f"  effect = +0.0569   95% CI: [{lo:+.4f}, {hi:+.4f}]")

Expected output:

Linear RDD (bw=0.10):
  effect = +0.0548   95% CI: [+0.0278, +0.0817]

Bandwidth sensitivity:
  bw = 0.05   effect = +0.0635   95% CI: [+0.0244, +0.0986]
  bw = 0.10   effect = +0.0548   95% CI: [+0.0278, +0.0817]
  bw = 0.15   effect = +0.0618   95% CI: [+0.0381, +0.0823]
  bw = 0.20   effect = +0.0614   95% CI: [+0.0420, +0.0808]

Quadratic RDD (bw=0.10):
  effect = +0.0569   95% CI: [+0.0205, +0.0959]

Here's what's happening: the bootstrap resamples the bandwidth-restricted data with replacement 500 times, refits the RDD on each replicate, and collects the below_cutoff coefficient. The 2.5th and 97.5th percentiles of those 500 estimates form the 95% interval. Every interval covers the +0.06 ground truth, every interval excludes zero, and the bandwidth sweep produces overlapping intervals.

That's quantitative stability, verified by resampling across the full bandwidth range. Intervals widen as the bandwidth shrinks and narrow as it grows. The quadratic interval is wider than the linear one because the four extra parameters absorb degrees of freedom.

One thing the intervals do NOT do on this dataset: exclude the naive +0.0632 estimate. That's because the data generator doesn't bake in confounding by query confidence. The only difference between the premium and cheap groups in expectations is the +6pp routing effect itself, so the naïve comparison is close to the truth.

Real systems are messier. In a production setting where unobserved query traits affect both the routing assignment and task completion, the naïve estimate would diverge from the RDD estimate, and the bootstrap intervals would tell you which one to trust.

When Regression Discontinuity Fails

RDD looks clean, but several specific failure modes can destroy the identification. Each one maps to a violation of one of the two named assumptions.

Users manipulate the running variable (violates assumption 1). The whole setup depends on users (or any upstream service) being unable to precisely control which side of the cutoff they land on. Any system that reveals the cutoff and gives users a way to influence their score (a retry mechanism, a prompt engineering workaround, a confidence-inflating trick) breaks RDD.

Run the density check in Step 3 every time. If you find manipulation, switch to a fuzzy RDD that treats the threshold as probabilistic, or abandon the approach.

Other policies fire at the same cutoff (violates assumption 2). If your product has additional rules that activate at 0.85 (a separate UI treatment, a different logging level, a different retry policy), RDD can't separate the routing effect from those other policy effects. Audit the full rule book for anything that shares the threshold.

The threshold has noise or overrides (violates assumption 1, in the structural sense). Maybe routing isn't strictly deterministic at 0.85 – it may have random jitter, or a second rule may override the main rule in some cases.

If assignment to the premium model isn't a deterministic function of query_confidence, you have a fuzzy RDD, which requires an instrumental variables framework. The rdrobust package handles both sharp and fuzzy designs.

Curvature masquerading as a jump (breaks the linear approximation that supports identification at the cutoff). Sharp RDD assumes linearity is a reasonable local approximation. When the underlying outcome-confidence relationship is strongly curved, the linear specification can mistake the bend for a jump.

Step 4's quadratic robustness check is the standard diagnostic. If linear and quadratic disagree, widen the bandwidth and re-run both.

Extrapolation bias (a continuity issue, reframed). RDD estimates are strictly local to the cutoff. The +0.06 effect at 0.85 tells you nothing about what premium routing would do for queries with confidence 0.30 or 0.99.

If you want a global average effect, you need a different technique: propensity methods, regression with confounder adjustment, or an actual experiment.

What to Do Next

RDD is the right tool when your AI feature is gated by a continuous score and a sharp threshold.

If your feature is gated by a user-controlled toggle, propensity score methods are a better fit. If it's gated by a staged rollout across workspaces, difference-in-differences handles it. If it's gated by rules you can't observe directly but that have a random component, instrumental variables is the right choice.

For production RDD analyses, use the rdrobust Python package. It gives you optimal bandwidth selection (Calonico, Cattaneo, and Titiunik 2014), bias-corrected standard errors, and a built-in plotting utility. The companion rddensity package implements the McCrary density test you saw informally in Step 3.

The from-scratch version in this tutorial shows the mechanics. The rd-packages stack is what you ship to a reviewer.

One thing the LATE doesn't do: tell you the effect for users far from the cutoff. If a +0.06 LATE at 0.85 is enough to keep premium routing in the pipeline, you're done. If you need to know what premium would do for the easy queries you're currently sending to cheap (or the hardest queries near the floor), the next step is a small randomized rollout in those zones, scored against the RDD estimate as a calibration check. Don't generalize the LATE without evidence.

The companion notebook for this tutorial lives here on GitHub. Clone the repo, generate the synthetic dataset, and run rdd_demo.ipynb to reproduce every code block from this tutorial.

Threshold routing is one of the most common patterns in production LLM systems, and every confidence-gated routing decision in your stack is a potential RDD. Run the analysis.

Product Experimentation with Propensity Scores: Causal Inference for LLM-Based Features in Python

Rudrendu Paul — Thu, 30 Apr 2026 23:01:26 +0000

Every product experimentation team running causal inference on LLM-based features eventually hits the same wall: when users click "Try our AI assistant," the volunteers aren't a random sample.

Your product shipped a new agent mode last quarter. Users have to tap the "Try agent mode" toggle to enable it. The dashboard numbers look stunning: agent-mode users complete 21 percentage points more tasks than non-users. The CPO calls it the best feature launch of the year.

But you know something's off. Heavy-engagement users opt into new features constantly, while light users ignore toggles entirely. That 21-point gap measures the agent's effect combined with the pre-existing gap between power users and the rest of your base.

This is the Opt-In Trap. It shows up in every generative AI product that ships features behind a user-controlled toggle: "Try our AI assistant," "Enable smart replies," "Turn on code suggestions." Users who click to opt in differ systematically from those who scroll past. Any naïve comparison between the two groups collapses the feature's causal effect into whatever made those users opt in in the first place.

Running an AI feature behind a toggle is a product experiment. The hypothesis: the feature improves outcomes for users who adopt it.

Unlike an A/B test, where the coin flip creates two otherwise-identical populations, the toggle creates two populations that differ before they even make a choice. That pre-existing difference is the measurement problem, and a t-test on dashboard numbers can't fix it.

Propensity score methods are statistical tools that data scientists use to separate adoption bias from the feature's actual effect. They reweight (or rematch) your comparison so that opted-in and non-opted-in groups look comparable on observable characteristics, approximating what a randomized experiment would have given you.

This tutorial walks through the full pipeline (propensity estimation, inverse-probability weighting, nearest-neighbor matching, balance diagnostics, and bootstrap confidence intervals) on a 50,000-user synthetic SaaS dataset where the ground-truth causal effect is known. You'll estimate it, quantify uncertainty, and see where the approach silently breaks.

Companion code: every code block runs end-to-end in the companion notebook at github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/02_propensity_opt_in. The notebook (psm_demo.ipynb) has all outputs pre-executed, so you can read along on GitHub before running anything locally.

Why Opt-in Features Break Naïve Comparisons
What Propensity Scores Actually Do
Prerequisites
Setting Up the Working Example
Step 1: Estimate the Propensity Score
Step 2: Inverse-Probability Weighting
Step 3: Nearest-Neighbor Matching
Step 4: Check Covariate Balance
Step 5: Bootstrap Confidence Intervals
When Propensity Score Methods Fail
What to Do Next

Why Opt-in Features Break Naïve Comparisons

The math of an A/B test is elegant because of one assumption: treatment is assigned independent of everything else. Flip a coin: half your users get agent mode, and the coin flip breaks every possible confound by construction. The opt-in world has no coin.

Three mechanisms make opt-in comparisons misleading.

1. Selection on engagement

Power users click everything. If your heavy-engagement cohort opts into agent mode at 65 percent and your light-engagement cohort opts in at 12 percent, you've stacked the opt-in group with users who were going to complete more tasks anyway.

That compositional imbalance accounts for most of the observed lift on its own, before the agent does any work.

2. Selection on intent

Users who opt into a new feature often have a specific use case in mind. A developer who clicks "Try code suggestions" already has code to write. That user would have shown higher task completion even with the control UI.

3. Selection on risk tolerance

Early adopters tolerate rough edges. A user who clicks "Try beta" and sees slow latency sticks around, but a risk-averse user bounces.

Your opt-in group is enriched for people willing to put up with bad experiences, which affects every downstream metric you might measure.

All three produce the same symptom: a raw comparison of opted-in users against everyone else that can overstate the feature's causal effect by 2x or more, depending on how concentrated opt-in is among your heaviest users.

On the synthetic dataset in this tutorial, the naïve comparison inflates a true +8pp effect to +21pp, a 2.6x overshoot. Propensity score methods exist to correct this.

What Propensity Scores Actually Do

Figure 1: Schematic propensity score distributions for two hypothetical groups. The opted-in group (red) skews toward higher propensities, while the non-opted-in group (blue) skews lower.

In the above figure, the bracketed strip below the x-axis splits the score range into three zones: a control-heavy region at low propensities where few treated users exist, a region of common support in the middle where both groups are well represented, and a treatment-heavy region at high propensities where few controls exist. Propensity score methods operate within the common-support region by reweighting or rematching so that the two groups appear balanced on observables. The extremes are either trimmed out or handled with caution.

The propensity score is the probability that a user opts in given their observable characteristics. Estimate this probability well, and you can use it to reweight your sample so that opted-in and non-opted-in users look similar on observables, just as they would have if opt-in had been randomized.

Two practical strategies use the propensity score:

Inverse-probability weighting (IPW) assigns each user a weight equal to the inverse of their probability of receiving the treatment they actually received. Opted-in users get weighted by 1/P(opt-in). Non-opted-in users get weighted by 1/P(no opt-in). After weighting, the two groups are balanced on observables, and the weighted difference in outcomes approximates the average treatment effect.
Matching pairs each opted-in user with one or more non-opted-in users who have similar propensity scores. The average outcome difference between matched pairs estimates the average treatment effect on the treated (ATT): what opt-in users actually gained by opting in.

Both methods rest on three identification assumptions working together.

First, unconfoundedness: every observable variable that drives opt-in and affects the outcome is in your propensity model.
Second, overlap (also called positivity): every user has some nonzero probability of opting in and some nonzero probability of staying out.
Third, no interference: one user's opt-in decision does not affect another user's outcome (the stable-unit-treatment-value assumption, or SUTVA.

Violate any one of these and the estimate is biased even when the other two hold. The failure modes at the end of this tutorial walk through each one.

Prerequisites

You'll need Python 3.11 or newer, comfort with pandas and scikit-learn, and rough familiarity with logistic regression.

Install the packages for this tutorial:

pip install numpy pandas scikit-learn matplotlib

Here's what's happening: four packages cover the full pipeline. Pandas loads the data, NumPy handles weights and array arithmetic, scikit-learn fits the propensity model and runs nearest-neighbor matching, and matplotlib renders the overlap diagnostic.

Clone the companion repo to get the synthetic dataset:

git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv

Here's what's happening: the clone pulls the companion repo, and generate_data.py produces the shared synthetic dataset used across the series. Seed 42 keeps the dataset reproducible, and 50,000 users give clean signal for every estimator in this tutorial. The output CSV lands at data/synthetic_llm_logs.csv.

Setting Up the Working Example

The synthetic dataset simulates a SaaS product where users can opt into an agent mode that uses a more expensive model. With fifty thousand users, opt-in rates differ sharply by engagement tier: heavy users opt in at 65 percent, medium users at 35 percent, and light users at 12 percent.

The ground-truth causal effect baked into the data generator is +8 percentage points on task completion for users who opted in. The naive comparison inflates this to around +21 percentage points because selection bias stacks the opted-in group with your most engaged users.

Knowing the ground truth is what lets you verify that your propensity score method recovers it.

Load the data and see the selection problem:

import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")

print(df.groupby("engagement_tier").opt_in_agent_mode.mean().round(3))

naive_effect = (
    df[df.opt_in_agent_mode == 1].task_completed.mean()
    - df[df.opt_in_agent_mode == 0].task_completed.mean()
)
print(f"\nNaive opt-in effect: {naive_effect:+.4f}")

Expected output:

engagement_tier
heavy     0.647
light     0.120
medium    0.353
Name: opt_in_agent_mode, dtype: float64

Naive opt-in effect: +0.2106

Here's what's happening: you load 50,000 rows, group by engagement tier, and print the opt-in rate inside each group. Heavy users opt in far more than light users, which is the selection-on-engagement pattern baked into the data. The naïve effect lands at +0.2106 (21 percentage points), nearly three times the ground truth of +0.08. That gap is exactly what propensity score methods have to remove.

Step 1: Estimate the Propensity Score

The propensity score is the output of a model that predicts opt-in from observable characteristics. Logistic regression is the right starting point because it's interpretable and fast, but watch the balance diagnostics in Step 4: if any weighted SMD stays above 0.1, the logistic model is missing an interaction, and gradient boosting is the next move.

For this dataset, the relevant observables are engagement tier and query confidence. In a real product, you'd include every variable you think drives opt-in: device type, tenure, plan tier, and historical usage patterns.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X = pd.get_dummies(
    df[["engagement_tier", "query_confidence"]],
    drop_first=True
).astype(float)
y_treat = df.opt_in_agent_mode

ps_model = LogisticRegression(max_iter=1000).fit(X, y_treat)
df["propensity"] = ps_model.predict_proba(X)[:, 1]

# Basic sanity checks
print(df.groupby("engagement_tier").propensity.mean().round(3))
print(
    f"\nPropensity range (treated):  "
    f"{df[df.opt_in_agent_mode == 1].propensity.min():.3f} - "
    f"{df[df.opt_in_agent_mode == 1].propensity.max():.3f}"
)
print(
    f"Propensity range (control):  "
    f"{df[df.opt_in_agent_mode == 0].propensity.min():.3f} - "
    f"{df[df.opt_in_agent_mode == 0].propensity.max():.3f}"
)
print(f"Propensity model AUC: {roc_auc_score(y_treat, df.propensity):.3f}")

Expected output:

engagement_tier
heavy     0.646
light     0.120
medium    0.353
Name: propensity, dtype: float64

Propensity range (treated):  0.114 - 0.675
Propensity range (control):  0.114 - 0.673
Propensity model AUC: 0.744

Here's what's happening: you encode the engagement tier as dummy variables, keep query confidence continuous, and fit a logistic regression model. The predicted probability from the model is each user's propensity score.

Scikit-learn LogisticRegression applies L2 regularization by default (C=1.0), which shrinks propensities slightly toward 0.5. For production use, you can set penalty=None if you want an unregularized fit.

Mean propensity inside each engagement tier recovers the true opt-in rate for that tier almost exactly, so the model is calibrated. The AUC of 0.744 confirms the model discriminates between opt-ins and non-opt-ins well above chance (0.5).

And the propensity ranges overlap between treated and control groups (both span roughly 0.11 to 0.67), which is the visual overlap condition.

Figure 2: Two views of the same positivity check on the real 50,000-user synthetic dataset.

In the figure above, the top panel plots smooth kernel density curves of the fitted propensity scores for each group. The three peaks align with the three engagement tiers (light at p ≈ 0.12, medium at p ≈ 0.35, heavy at p ≈ 0.65), as expected, because the opt-in rate is tier-driven. The bottom panel translates that same distribution into raw counts per tier: every tier contains thousands of both opted-in and non-opted-in users, which is exactly what positivity requires.

Where Figure 1 schematically illustrated the idea, this figure shows that it holds for the data, so the weighting and matching that follow will have real counterfactuals to work with.

Step 2: Inverse-Probability Weighting

IPW assigns each user a weight inversely proportional to their propensity. An opted-in user with a 0.12 propensity is rare (a light user who still opted in despite low engagement) and carries information about 1 / 0.12 ≈ 8 similar users in the population. A control user with a 0.12 propensity is the expected case for light users who stayed out, so they're common and get a weight of 1 / (1 - 0.12) ≈ 1.14.

import numpy as np

# ATE weights: 1/P(treat) for treated, 1/P(no treat) for control
df["ipw"] = np.where(
    df.opt_in_agent_mode == 1,
    1 / df.propensity,
    1 / (1 - df.propensity)
)

t = df[df.opt_in_agent_mode == 1]
c = df[df.opt_in_agent_mode == 0]
ate_ipw = (
    (t.task_completed * t.ipw).sum() / t.ipw.sum()
    - (c.task_completed * c.ipw).sum() / c.ipw.sum()
)
print(f"IPW average treatment effect (ATE): {ate_ipw:+.4f}")

# ATT: what opt-in users actually gained
df["ipw_att"] = np.where(
    df.opt_in_agent_mode == 1,
    1,
    df.propensity / (1 - df.propensity)
)
t = df[df.opt_in_agent_mode == 1]   # re-slice now that ipw_att is in df
c = df[df.opt_in_agent_mode == 0]
treated_mean = t.task_completed.mean()
control_w_mean = (c.task_completed * c.ipw_att).sum() / c.ipw_att.sum()
att_ipw = treated_mean - control_w_mean
print(f"IPW average treatment effect on treated (ATT): {att_ipw:+.4f}")

Expected output:

IPW average treatment effect (ATE): +0.0851
IPW average treatment effect on treated (ATT): +0.0770

Here's what's happening: first, you compute ATE weights for every user and take the weighted difference in task completion between opted-in and non-opted-in groups. Then you compute ATT weights, which reweight only the control group to match the treated group's covariate distribution, and compute the average treatment effect on the treated.

ATE answers the population question: what's the effect on a random user who might or might not have opted in anyway? ATT answers the user question: What did opt-in users actually gain? On this dataset, ATE lands at +0.0851 and ATT at +0.0770, both close to the ground-truth +0.08 and a massive improvement over the naive +0.2106.

The distinction matters in practice. Deciding whether to roll the feature out to users who haven't opted in calls for ATE. Reporting on the value opt-in users captured calls for ATT.

Step 3: Nearest-Neighbor Matching

Matching takes a different approach: pair each opted-in user with the non-opted-in user whose propensity score is closest, then take the average outcome difference across matched pairs. The result estimates ATT.

from sklearn.neighbors import NearestNeighbors

treated_ps = df[df.opt_in_agent_mode == 1][["propensity"]].values
control_ps = df[df.opt_in_agent_mode == 0][["propensity"]].values

nn = NearestNeighbors(n_neighbors=1).fit(control_ps)
_, idx = nn.kneighbors(treated_ps)

treated_outcomes = df[df.opt_in_agent_mode == 1].task_completed.values
matched_control_outcomes = (
    df[df.opt_in_agent_mode == 0].task_completed.values[idx.flatten()]
)

att_match = (treated_outcomes - matched_control_outcomes).mean()
print(f"1-NN matching ATT: {att_match:+.4f}")

Expected output:

1-NN matching ATT: +0.0752

Here's what's happening: you extract propensity scores for each group, fit a nearest-neighbor index on the control group, and find the single closest control user for every treated user.

The NearestNeighbors index allows the same control user to be selected as the match for multiple treated users, so this is a matching-with-replacement case.

You pull the outcomes for each treated user and their matched control, take the difference per pair, and average across pairs. The result estimates what opt-in users gained compared to very similar users who did not opt in.

The +0.0752 result lands close to the ground truth of +0.08 but slightly below IPW ATT, typical of 1-NN matching because a single nearest neighbor is a high-variance estimator.

Two variants are worth knowing. Matching with replacement (what you just ran) allows a single control user to serve as a match for multiple treated users, reducing bias when good matches are scarce but inflating variance.

Matching without replacement assigns each control user to at most one treated user, which keeps variance lower but forces poor-quality pairings when the treated group dwarfs the available controls.

For most production analyses, k-nearest-neighbor matching with k = 3-5 and replacement is a sensible default.

Step 4: Check Covariate Balance

Propensity score methods work only if they actually balance the covariates between groups. You need to verify that they did, because if the balance fails, your estimate is wrong.

The standard diagnostic is the standardized mean difference (SMD) for each covariate. SMD compares the treated group mean to the control group mean, divided by the pooled standard deviation.

Before weighting, SMDs tell you how imbalanced the raw groups are. After weighting, they should be small (|SMD| < 0.1 is the conventional cutoff).

def smd(treated_vals, control_vals, treated_w=None, control_w=None):
    """Standardized mean difference, optionally with weights."""
    if treated_w is None:
        treated_w = np.ones(len(treated_vals))
    if control_w is None:
        control_w = np.ones(len(control_vals))
    t_mean = np.average(treated_vals, weights=treated_w)
    c_mean = np.average(control_vals, weights=control_w)
    pooled_std = np.sqrt((treated_vals.var() + control_vals.var()) / 2)
    return (t_mean - c_mean) / pooled_std

engagement_heavy = (df.engagement_tier == "heavy").astype(float).values
qc = df.query_confidence.values
tr = (df.opt_in_agent_mode == 1).values

covariates = {
    "engagement_tier_heavy": engagement_heavy,
    "query_confidence": qc,
}

print(f"{'Covariate':<30} {'Raw SMD':>10} {'Weighted SMD':>15}")
for name, vals in covariates.items():
    smd_raw = smd(vals[tr], vals[~tr])
    smd_weighted = smd(
        vals[tr], vals[~tr],
        treated_w=df[tr].ipw.values,
        control_w=df[~tr].ipw.values,
    )
    print(f"{name:<30} {smd_raw:>+10.3f} {smd_weighted:>+15.3f}")

Expected output:

Covariate                         Raw SMD    Weighted SMD
engagement_tier_heavy              +0.742          +0.002
query_confidence                   -0.032          -0.003

Here's what's happening: the helper computes the standardized mean difference for any covariate, with optional IPW weights.

You then print raw and weighted SMDs for each covariate. The raw SMD on engagement_tier_heavy is +0.742 (heavy users opt in far more than everyone else), and the weighted SMD drops to +0.002, a clean pass. Query confidence was already close to balanced on the raw data, and weighting keeps it that way. If any weighted SMD came back above 0.1 in absolute value, your propensity model would be missing something; the fix is usually richer features or interaction terms in the logistic regression.

Visually, Figure 2 above confirmed what the SMDs now confirm numerically: the overlap condition holds, and balance is achievable.

Step 5: Bootstrap Confidence Intervals

Point estimates are only half the story. Any estimate you report to a product team needs an interval that tells them whether +0.08 is distinguishable from +0.03 or from +0.12. Analytic standard errors for IPW and matching are tricky because of the estimated propensity score, so the simplest and most honest move is the non-parametric bootstrap.

def estimate_all(sample):
    """Return (ATE_IPW, ATT_IPW, ATT_match) on a bootstrap sample."""
    s = sample.copy()
    X_s = pd.get_dummies(
        s[["engagement_tier", "query_confidence"]], drop_first=True
    ).astype(float)
    ps = LogisticRegression(max_iter=1000).fit(X_s, s.opt_in_agent_mode)
    s["p"] = ps.predict_proba(X_s)[:, 1]

    s["w_ate"] = np.where(
        s.opt_in_agent_mode == 1, 1 / s.p, 1 / (1 - s.p)
    )
    s["w_att"] = np.where(
        s.opt_in_agent_mode == 1, 1, s.p / (1 - s.p)
    )
    t, c = s[s.opt_in_agent_mode == 1], s[s.opt_in_agent_mode == 0]

    ate = (
        (t.task_completed * t.w_ate).sum() / t.w_ate.sum()
        - (c.task_completed * c.w_ate).sum() / c.w_ate.sum()
    )
    att = t.task_completed.mean() - (
        (c.task_completed * c.w_att).sum() / c.w_att.sum()
    )
    nn_b = NearestNeighbors(n_neighbors=1).fit(c[["p"]].values)
    _, idx_b = nn_b.kneighbors(t[["p"]].values)
    match = (
        t.task_completed.values
        - c.task_completed.values[idx_b.flatten()]
    ).mean()
    return ate, att, match

rng = np.random.default_rng(7)
n_reps = 500
results = np.zeros((n_reps, 3))
for i in range(n_reps):
    boot = df.iloc[rng.integers(0, len(df), size=len(df))]
    results[i] = estimate_all(boot)

for name, col in zip(["IPW ATE", "IPW ATT", "1-NN ATT"], range(3)):
    lo, hi = np.percentile(results[:, col], [2.5, 97.5])
    print(f"{name:<10} 95% CI: [{lo:+.4f}, {hi:+.4f}]")

Expected output:

IPW ATE    95% CI: [+0.0745, +0.0954]
IPW ATT    95% CI: [+0.0687, +0.0865]
1-NN ATT   95% CI: [+0.0659, +0.0940]

Here's what's happening: you resample the dataset with replacement 500 times, refit the propensity model, and recompute each estimator on each resample, and take the 2.5th and 97.5th percentiles of the bootstrap distribution as the 95% confidence interval. All three intervals cover the ground-truth +0.08 and exclude the naive +0.21 by a wide margin.

The IPW ATT interval is the tightest because ATT reweights only the control group. The 1-NN matching interval is the widest because single-neighbor matching discards control users outside the matched set.

Running this once takes about 90 seconds on a laptop. For a stakeholder report, anchor the headline to the point estimate and cite the interval so the team sees the uncertainty alongside the number.

When Propensity Score Methods Fail

Propensity scores make opt-in comparisons rigorous when their assumptions hold. They produce biased estimates that look clean when those assumptions fail.

Four common failure modes map to the three identification assumptions from earlier.

1. Unmeasured Confounders (Violate Unconfoundedness)

If something drives both opt-in and your outcome but isn't in your propensity model, IPW and matching produce biased estimates. This is the most common failure in practice.

An example: users who opt into agent mode are also the users who follow your engineering blog and read release notes. If blog-reading behavior raises task completion independently of the feature, missing that signal attributes the effect to agent mode, inflating your estimate.

The only real defense is domain knowledge about what drives opt-in, richer feature engineering in your propensity model, and formal sensitivity tools (Rosenbaum bounds, E-values) that quantify how strong an unmeasured confounder would have to be to overturn the result.

2. Positivity (Overlap) Failures (Violates Overlap)

If some users have near-zero probability of opting in (or near-one), you've got no comparable counterfactual for them. I

PW creates extreme weights (1 / 0.001 = 1,000) that let a single outlier dominate the estimate. So matching is forced into poor-quality pairings.

Check propensity histograms and trim propensities outside [0.05, 0.95] before weighting if extreme values exist.

3. Misspecified Propensity Models (Degrade Unconfoundedness in Practice)

A linear logistic regression can't capture nonlinear relationships. If opt-in depends on the interaction between engagement tier and query confidence (power users with complex queries opt in, while light users pass), a main-effects model misses that and produces poor balance.

Use flexible models (for example, gradient boosting on the propensity score or regression adjustment on top of weighting) and always check the balance after weighting. Poor balance after weighting is the primary signal of misspecification.

4. Spillovers Between Users (Violates SUTVA)

Propensity score methods assume your users are independent. If one user opting into agent mode affects another user's task completion (for example, teammates adopting the feature together in shared workspaces), your estimated effect includes the spillover.

This violates the stable-unit-treatment-value-assumption, and handling it cleanly requires a different toolkit: either cluster randomization for features adopted at the workspace level or network-aware experimental designs for user-level spillovers.

These failure modes stay invisible in your regression coefficients. They surface as estimates that look good on paper but don't hold up when the feature rolls out to a broader audience.

Run balance diagnostics, check overlap plots, and document what you might have missed: those are your only real defenses.

What to Do Next

Propensity score methods are the right tool when your feature ships behind an opt-in toggle and you've got rich covariates to model selection with.

If opt-in follows a crisp rule (a threshold on query complexity, a paid-tier gate), regression discontinuity fits better. If you suspect unobserved confounders and have an external randomization source (randomized rollout noise, rate-limit-triggered routing), instrumental variables will do better.

To guard your estimate against propensity misspecification, doubly robust estimators combine propensity weighting with regression adjustment and stay consistent if at least one of the two component models is correctly specified.

The companion notebook for this tutorial lives here. Clone the repo, generate the synthetic dataset, and run psm_demo.ipynb (or psm_demo.py) to reproduce every code block, every number, and every figure from this tutorial.

When an AI feature ships behind a toggle, the naïve opt-in comparison is usually the wrong number. Propensity score methods give you "users comparable to those who clicked this" as your counterfactual, and the bootstrap gives you an interval you can defend when a stakeholder asks how sure you are.

Product Experimentation for AI Rollouts: Why A/B Testing Breaks and How Difference-in-Differences in Python Fixes It

Rudrendu Paul — Wed, 22 Apr 2026 22:33:18 +0000

Your team shipped an LLM-based summaries feature to wave 1 workspaces at week 20 and now the post-launch doc is due. You need a causal effect number, a specific estimate you can defend to a statistician.

The problem is that wave 2 workspaces are still waiting, a product-wide onboarding redesign shipped the same Tuesday, and week 20 also coincided with a quarterly engagement bump. Any comparison between the two groups after week 20 mixes the feature's causal effect with the redesign, the seasonality, and whatever selection criteria determined which workspaces landed in wave 1 in the first place.

This is how most enterprise SaaS teams ship AI features in 2026: one workspace at a time, in waves, on a rollout calendar. Randomization doesn't happen, and because randomization doesn't happen, A/B testing can't give you a clean causal effect. The result is a number on a dashboard that everyone argues over.

Call this the Rollout Calendar Trap: you have real data, a real experiment structure, and a completely invalid comparison. For data scientists shipping AI features in waves, it's the primary source of bad causal claims downstream.

Product experimentation for generative AI features follows this exact pattern: the hypothesis is that the AI feature causes higher engagement, and the wave structure is supposed to test it.

The wave calendar replaced the coin flip, and that substitution breaks the math. A simple A/B comparison assumes randomized assignment that the rollout never produced, so the measurement tool fails even when the experiment design is sound.

Difference-in-differences is the causal inference method that fixes this. It subtracts the time trend by comparing how outcomes shift across time periods for each group, giving you a defensible causal estimate even without randomization.

In this tutorial you'll use it to measure the true causal effect of an AI feature rolled out across enterprise workspaces, with working Python code against a synthetic SaaS product dataset.

By the end you'll know how to run a DiD estimate, how to test its parallel-trends assumption, and what to do when that assumption fails.

Why A/B Testing Breaks for Staged Rollouts
What Difference-in-Differences Does
Prerequisites
Setting Up the Working Example
Step 1: A Simple 2x2 DiD
Step 2: Regression DiD with Fixed Effects
Step 3: Checking the Parallel-Trends Assumption
When Difference-in-Differences Fails
What to Do Next

Why A/B Testing Breaks for Staged Rollouts

Random assignment is the engine that makes A/B testing a valid causal method. When you flip a coin to decide which user gets the feature, the treatment and control groups end up with identical distributions of every confounder (any variable that affects both who gets treatment and what outcome you measure). Any difference in outcomes after assignment is the causal effect of the treatment. Full stop.

A staged rollout across enterprise workspaces breaks that engine in three ways:

1. The wave assignment isn't random.

Product teams choose wave 1 workspaces for various reasons: they have the most engaged admins, the largest seat counts, or the best relationship with customer success. Those reasons correlate directly with your outcome. Wave 1 workspaces were going to show higher engagement anyway, feature or no feature.

2. The calendar introduces a time trend

Between week 20 (wave 1 launch) and week 30 (wave 2 launch), your product gets better, your onboarding improves, your sales team lands bigger customers. Any naïve "engagement after week 20 minus engagement before week 20" comparison picks up all of that along with the feature's effect.

3. Adoption inside treated workspaces is itself selective

Even inside a workspace that received the feature, not every user turns it on. Power users do, and less engaged users often wait months. Comparing users who used the feature against users who didn't introduces selection bias, where the groups differ systematically before you even measure the outcome, on top of the non-random workspace assignment.

A/B testing assumes none of these three problems exist. Staged rollouts guarantee all three. The naïve comparison gives you a number, and that number measures engagement theater.

What Difference-in-Differences Does

Difference-in-differences compares the change in outcomes over time between a treated group and a control group. Subtracting one change from the other cancels any shared time trend (product improvements, seasonality, onboarding changes) because both groups experience it equally, leaving you with just the treatment effect.

Here's a concrete example. Imagine tracking quarterly revenue for coffee shops in two neighborhoods. One neighborhood gets a new competitor in Q3, the other doesn't.

Both neighborhoods experience the same underlying market trends, a local economic upturn, and holiday seasonality. DiD isolates the competitor's impact by subtracting whatever revenue shift happened in both neighborhoods.

Your staged rollout sets up the exact same structure: wave 1 workspaces are the neighborhood with the new entrant, wave 2 is the comparison.

The math formalizes this as a 2x2 table, where rows are groups (treated, control), columns are time periods (pre, post), and each cell holds the mean outcome for that group in that period:

A = mean task completion for wave 1 users before week 20 (coffee shops: Q2 revenue, neighborhood with incoming competitor)
B = mean task completion for wave 1 users after week 20 (coffee shops: Q3 revenue, same neighborhood)
C = mean task completion for wave 2 users before week 20 (coffee shops: Q2 revenue, the untouched neighborhood)
D = mean task completion for wave 2 users after week 20 (coffee shops: Q3 revenue, same)

                         Pre     Post
Treated (wave 1):         A       B
Control (wave 2):         C       D

Naive post-period gap:   B - D     (contaminated by group differences)
Naive treated change:    B - A     (contaminated by time trend)
DiD:                 (B - A) - (D - C)   ← the causal effect

B - A is wave 1's change, but it includes both the treatment effect and whatever time trend moved everyone. D - C is wave 2's change over the same window, same time trend, no treatment. Subtracting one from the other leaves only the treatment effect.

The counterfactual is what wave 1 would have looked like without the treatment. DiD constructs it by saying: wave 1's counterfactual trajectory = wave 1's pre-period level, carried forward with wave 2's post-period trend. The gap between the actual wave 1 trajectory and that counterfactual is the DiD estimate.

Figure 1: Causal inference with difference-in-differences. Blue solid: Wave 1 actual trajectory. Orange dashed: Wave 2 (control, untreated during this window). Blue dotted: the counterfactual, where Wave 1 would have gone based on Wave 2's post-period trend. The green arrow is the DiD estimate: the gap between the actual Wave 1 trajectory and the counterfactual in the post-treatment period. A, B, C, D correspond to the four cells in the table above.

Before week 20, wave 1 and wave 2 track each other closely. That's the parallel-trends requirement at work. At week 20, wave 1 pulls ahead of both wave 2 and its own counterfactual (the dotted line). That post-treatment divergence is the DiD estimate.

The DiD estimate handles two types of bias at once. Permanent differences between treated and control groups (wave 1 workspaces were always more engaged) cancel out because DiD focuses on changes in outcomes across time periods. Time trends that affect both groups (product improvements, market seasonality) cancel out because both groups experience them.

DiD asks one thing in return: parallel pre-treatment trends. The treated and control groups have to be moving in the same direction at the same rate before treatment starts. When that holds, you can extrapolate the shared trend forward and attribute any post-treatment divergence to the treatment. If the trends were already diverging before treatment, DiD is biased, and no amount of clever regression fixes it.

Parallel trends is the assumption you'll test in step 3.

Companion Notebook

All the code in this tutorial, including the synthetic dataset, the DiD regression, the parallel-trends plot, and the placebo pre-trend test, lives in a single executable Jupyter notebook in the GitHub repo for this series on product experimentation and causal inference for GenAI and LLM applications.

You can clone it, run generate_data.py once, and every output in this article reproduces exactly: github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm

Prerequisites

You'll need Python 3.11 or newer and comfort with pandas and basic regression. You can follow along without prior causal inference experience, as the article defines confounders and selection bias inline when they first appear. You'll encounter clustered standard errors and fixed effects in step 2. The article explains what they do and why they matter, but it doesn't derive them from scratch.

Install the packages for this tutorial:

pip install numpy pandas statsmodels linearmodels matplotlib

Clone the companion repo to get the synthetic dataset:

git clone https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm.git
cd product-experimentation-causal-inference-genai-llm
python data/generate_data.py --seed 42 --n-users 50000 --out data/synthetic_llm_logs.csv

Setting Up the Working Example

The dataset simulates a SaaS product with an AI summaries feature launched in two waves: wave 1 workspaces get it at week 20, wave 2 at week 30, with 50,000 users total, each with one row of telemetry.

The data generator bakes in a +5 percentage point causal effect on task completion for users in their workspace's post-treatment period. You know the truth upfront, so you can check whether your DiD estimator actually recovers it.

Load the data and inspect the structure:

import pandas as pd

df = pd.read_csv("data/synthetic_llm_logs.csv")
print(df.shape)
print(df[["wave", "signup_week", "workspace_id", "task_completed"]].head())
print("\nWave sizes:", df.wave.value_counts().to_dict())
print("Treatment weeks per wave:",
      df.groupby("wave").treatment_week.first().to_dict())

Expected output:

(50000, 16)
   wave  signup_week  workspace_id  task_completed
0     2           10            36               0
1     2           51            44               1
2     2            2            28               1
3     1           15            20               1
4     1           29             0               1
Wave sizes: {2: 25063, 1: 24937}
Treatment weeks per wave: {1: 20, 2: 30}

Here's what's happening: you load 50,000 rows, one per user. Wave 1 has about 24,937 users across 25 workspaces; wave 2 has about 25,063 users across 25 different workspaces. The treatment_week column records when each user's workspace got the AI summaries feature (week 20 for wave 1, week 30 for wave 2). The task_completed column is your outcome: did the AI successfully complete the user's task.

One important detail: signup_week in this dataset records which calendar week a user first joined the product, and we're using it as a time index to assign users to pre- or post-treatment cohorts.

A user who signed up in week 22 joined after the feature launched, so their experience is "post-treatment." A user who signed up in week 14 joined before the launch, so their experience is "pre-treatment."

This works here because each user has one row of telemetry tied to their initial product experience. In a panel dataset with multiple observations per user across time, you'd use an observation timestamp column tied to when each row was recorded.

To keep the analysis clean, restrict to users who signed up before the wave 2 launch (signup_week < 30). Wave 2 then works as a proper control group, since it hasn't been treated yet, while wave 1 has been treated for 10 weeks.

analysis = df[df.signup_week < 30].copy()
analysis["post"] = (analysis.signup_week >= 20).astype(int)
analysis["treated"] = (analysis.wave == 1).astype(int)

print(analysis.groupby(["treated", "post"])
              .agg(n=("user_id", "count"),
                   mean_completion=("task_completed", "mean"))
              .round(3))

Expected output:

                 n  mean_completion
treated post
0       0     9590            0.556
        1     4878            0.555
1       0     9633            0.592
        1     4738            0.643

Here's what's happening: you filter the data to the analysis window (weeks 0 to 29) and create two indicator variables. post is 1 for users in the post-week-20 period, 0 otherwise. treated is 1 for wave 1 users, 0 for wave 2. The groupby shows the four cells of the DiD 2x2 table: (treated=0, post=0), (treated=0, post=1), (treated=1, post=0), (treated=1, post=1). Those four means are everything you need for a first-pass DiD estimate.

Step 1: A Simple 2x2 DiD

Start with the cleanest version. Compute the four cell means by hand, then take the difference of differences:

cells = analysis.groupby(["treated", "post"]).task_completed.mean()

wave2_pre  = cells.loc[(0, 0)]   # control, pre
wave2_post = cells.loc[(0, 1)]   # control, post
wave1_pre  = cells.loc[(1, 0)]   # treated, pre
wave1_post = cells.loc[(1, 1)]   # treated, post

did_effect = (wave1_post - wave1_pre) - (wave2_post - wave2_pre)
print(f"Wave 1 change: {wave1_post - wave1_pre:+.4f}")
print(f"Wave 2 change: {wave2_post - wave2_pre:+.4f}")
print(f"DiD effect:    {did_effect:+.4f}")

Expected output:

Wave 1 change: +0.0515
Wave 2 change: -0.0013
DiD effect:    +0.0527  (ground truth = +0.05)

Here's what's happening: you pull the four cell means, compute wave 1's change in task completion from pre to post, compute wave 2's change over the same calendar window (wave 2 hasn't been treated yet), and take the difference. The DiD estimate is the piece of wave 1's change that can't be explained by whatever time trend also moved wave 2.

On this dataset the simple 2x2 estimate lands at +0.053, which is very close to the true +0.05. But you can't take this number to a product review. You have no standard errors, which means you can't say whether +0.053 is a real signal or within sampling noise. You have no covariate adjustment, so if wave 1 happened to have more heavy users in this cohort, some of that +0.053 could be engagement-tier composition. And you have no way to handle the workspace-level correlation in your data. Step 2 fixes all three.

Step 2: Regression DiD with Fixed Effects

The regression formulation of DiD produces the same point estimate as the 2x2 table when there are no covariates. But it also buys you three things:

Standard errors and p-values computed correctly
Covariate adjustment to reduce variance and sharpen your estimate
Cluster-robust errors that handle correlation within workspaces, which a staged rollout always has

The regression is: outcome ~ treated + post + treated:post + controls. The coefficient on the treated:post interaction is your DiD estimate.

import statsmodels.formula.api as smf

did_model = smf.ols(
    "task_completed ~ treated * post + C(engagement_tier)",
    data=analysis
).fit(
    cov_type="cluster",
    cov_kwds={"groups": analysis.workspace_id}
)

print(did_model.summary().tables[1])

Expected output:

================================================================================================
                                   coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                        0.8301      0.007    126.538      0.000       0.817       0.843
C(engagement_tier)[T.light]     -0.4027      0.006    -63.168      0.000      -0.415      -0.390
C(engagement_tier)[T.medium]    -0.1766      0.007    -25.931      0.000      -0.190      -0.163
treated                          0.0367      0.005      6.885      0.000       0.026       0.047
post                            -0.0056      0.008     -0.684      0.494      -0.022       0.011
treated:post                     0.0541      0.011      4.981      0.000       0.033       0.075
================================================================================================

Here's what's happening: you fit an ordinary least squares regression of task completion on the treated indicator, the post indicator, their interaction, and a categorical control for engagement tier.

The treated:post coefficient is the DiD estimate. Users in the same workspace share common shocks, making their outcomes correlated. Grouping by workspace_id corrects for that.

On this dataset the treated:post coefficient comes out at +0.054 with a clustered p-value of <0.001. The ground truth is +0.050. At 0.4 percentage points from the true effect, with a standard error that accounts for workspace-level correlation, that's a number you can put in a product review.

A few practical notes on this regression:

Controls should be time-invariant (engagement tier, signup cohort). Time-varying controls that are themselves affected by treatment will bias the estimate.
Only the interaction has a causal interpretation. The intercept and level terms describe baseline differences between groups, nothing more.
Clustered errors are mandatory. Skip clustering and your standard errors are 3 to 10x too small, test statistics are artificially inflated, and results look far more significant than they are.

Step 3: Checking the Parallel-Trends Assumption

DiD is only valid if wave 1 and wave 2 were moving in the same direction at the same rate before treatment started. You check this by plotting (or tabulating) weekly means for the two waves across the pre-treatment window.

import matplotlib.pyplot as plt
import numpy as np

df_plot = df[df.signup_week < 30].copy()
weekly = (df_plot.groupby(["signup_week", "wave"])
             .task_completed.mean()
             .reset_index()
             .pivot(index="signup_week", columns="wave", values="task_completed"))

# 3-week rolling average to smooth week-to-week sampling noise
smoothed = weekly.rolling(3, center=True, min_periods=2).mean()

TREATMENT_WEEK = 20
pre_idx = smoothed.index[smoothed.index < TREATMENT_WEEK]
post_idx = smoothed.index[smoothed.index >= TREATMENT_WEEK]

# DiD counterfactual: wave 1 pre-period mean + wave 2's post-period change
wave1_pre_mean = smoothed.loc[pre_idx, 1].mean()
wave2_pre_mean = smoothed.loc[pre_idx, 2].mean()
counterfactual = wave1_pre_mean + (smoothed.loc[post_idx, 2].values - wave2_pre_mean)

fig, ax = plt.subplots(figsize=(10, 5.5))
ax.axvspan(-0.5, TREATMENT_WEEK, alpha=0.04, color="#94A3B8", zorder=0)
ax.axvspan(TREATMENT_WEEK, 29.5, alpha=0.06, color="#3B82F6", zorder=0)
ax.plot(smoothed.index, smoothed[2], "s--", color="#F59E0B", linewidth=2,
        markersize=4, label="Wave 2 — control (untreated during this window)", zorder=3)
ax.plot(smoothed.index, smoothed[1], "o-", color="#2563EB", linewidth=2.2,
        markersize=4, label="Wave 1 — treated (AI feature on at week 20)", zorder=4)
ax.plot(post_idx, counterfactual, ":", color="#2563EB", linewidth=2.2,
        label="Wave 1 counterfactual (projected without treatment)", zorder=4)
ax.axvline(TREATMENT_WEEK, color="#DC2626", linestyle="--", linewidth=1.8,
           label="AI feature launched (week 20)")

ax.text(9.5, 0.508, "Pre-treatment period\n(parallel trends required)",
        fontsize=9, ha="center", color="#64748B", style="italic")
ax.text(24, 0.508, "Post-treatment",
        fontsize=9, ha="center", color="#64748B", style="italic")
ax.set_xlabel("Week", fontsize=11)
ax.set_ylabel("Mean task completion rate", fontsize=11)
ax.set_title("Figure 2: Data-Driven Parallel-Trends Check\n(3-week rolling average, 50k users)",
             fontsize=12, fontweight="bold", pad=14)
ax.legend(loc="upper left", fontsize=9, framealpha=0.92)
ax.set_xlim(-0.5, 29.5)
ax.set_ylim(0.50, 0.72)
ax.grid(True, alpha=0.18, linestyle=":")
ax.tick_params(labelsize=10)
plt.tight_layout()
plt.savefig("parallel_trends.png", dpi=150, bbox_inches="tight")
print("Saved parallel_trends.png")

Expected output (Figure 2, data-driven verification):

Saved parallel_trends.png

Figure 2 is the data-driven parallel-trends check from your actual dataset, plotted as a 3-week rolling average to smooth week-to-week sampling noise. Both waves track each other closely before week 20, and small wiggles in the pre-period affect both groups at the same time, which is exactly what parallel trends looks like. After week 20, wave 1 separates cleanly above the dotted counterfactual line. The gap between the solid blue line and the dotted line in the post-treatment window is the DiD estimate playing out in your actual data.

Here's what's happening: you group by signup week and wave, compute the mean task completion rate per cell, pivot so each wave is a column, and plot the two time series together.

A vertical dashed line marks week 20 when wave 1 got treatment. In the pre-treatment window (weeks 0 to 19) the two series should track each other closely. After week 20, wave 1 should pull ahead of wave 2 by roughly the treatment effect.

To put a number on it, run a placebo regression on the pre-treatment period only. Regress the outcome on a linear time trend interacted with the treated indicator. If the interaction coefficient is near zero and insignificant, the two groups were moving in parallel before treatment:

pre_only = analysis[analysis.post == 0].copy()
pre_only["weeks_since_start"] = pre_only.signup_week - 10  # center

placebo_model = smf.ols(
    "task_completed ~ treated * weeks_since_start + C(engagement_tier)",
    data=pre_only
).fit(
    cov_type="cluster",
    cov_kwds={"groups": pre_only.workspace_id}
)

print("Pre-trend slope difference:",
      placebo_model.params["treated:weeks_since_start"])
print("p-value:",
      placebo_model.pvalues["treated:weeks_since_start"])

Expected output:

Pre-trend slope difference: -0.00095...
p-value: 0.4435...

Here's what's happening: you restrict to pre-treatment observations, fit a regression that lets wave 1 and wave 2 follow different linear trends in the pre-period, and read off the interaction coefficient.

A coefficient close to zero with p > 0.05 means the two waves were moving in parallel before treatment. If that coefficient is large and statistically significant, the parallel-trends assumption is broken: your DiD estimate is absorbing whatever differential trend separated the groups before week 20.

If the placebo test fails, stop and rethink. Your options: restrict to a narrower pre-window where trends were parallel, find a better control group, or switch to synthetic control, which builds a weighted counterfactual from multiple untreated units.

On this synthetic dataset the placebo test passes: the pre-trend slope difference is -0.00095 with p = 0.44, so the parallel-trends assumption holds and the +0.054 estimate from step 2 is trustworthy.

When Difference-in-Differences Fails

DiD is a precise accounting method, and every precise method has specific failure modes worth knowing before you trust its output. Here are four common ones:

1. Non-parallel Pre-trends

When the treated and control groups were already diverging before treatment started, DiD mistakes that pre-existing drift for a treatment effect.

The placebo test in step 3 is your guard. Run it every time. If it fails, you have three options:

Restrict the analysis to a shorter pre-window where trends were parallel and re-run the placebo
Find a better control group whose pre-trend matches the treated group
Switch to synthetic control, which builds a weighted counterfactual from multiple untreated units and picks the weights to match the treated group's pre-treatment trajectory

2. Staggered Adoption

A staged rollout with three or more waves demands a different approach than a clean 2x2. Wave 1 gets treated at week 20, wave 2 at week 30, wave 3 at week 40. Once wave 2 is treated, it's no longer a valid control for wave 1 comparisons that span weeks 30 and beyond. Earlier treated units start acting as controls for later ones, which contaminates the estimate.

That's the Goodman-Bacon decomposition problem, and the standard two-way fixed effects estimator from step 2 will silently absorb it. The Callaway-Sant'Anna estimator (see their 2021 paper) fixes this by averaging only the clean 2x2 comparisons and discarding the contaminated ones. The differences package in Python implements it.

3. Time-varying Confounders that Hit Only the Treated Group

If your marketing team runs a targeted campaign in wave 1 workspaces during week 22, you've got a treatment-specific shock DiD can't net out.

Parallel trends certifies the pre-treatment period, but the post-treatment window remains your responsibility to audit.

Check every product or marketing event inside the analysis window. If you find one, the only options are to redesign the study, restrict the analysis to the window before the shock, or model the shock explicitly as a second treatment variable.

4. Anticipation Effects

If wave 1 customers knew in week 18 that the feature was coming in week 20, some will have started behaving differently before treatment technically started: signing up more, pre-configuring settings, contacting support. That contaminates the "pre" period. The tell is a bump or dip in wave 1 in the weeks immediately before week 20 on the event-study plot.

The fix is to push the pre-period cutoff back. Treat week 18 as the "treatment" start for purposes of the analysis, which removes the anticipation window from your pre-period baseline.

Each of these failure modes has a diagnostic and a specific remedy. Naming them in your analysis builds credibility with skeptical reviewers. DiD is a careful accounting identity – it produces reliable estimates exactly as long as its inputs are clean.

What to Do Next

The regression DiD above is the right tool for a two-wave rollout. If your rollout has three or more waves, switch to the Callaway-Sant'Anna estimator. If your rollout crosses a treatment threshold you set deliberately (confidence scores, query complexity), look into regression discontinuity. If you want to compare a single treated unit against a constructed counterfactual, synthetic control is the right choice.

The companion notebook for this tutorial is here. Clone the repo, generate the synthetic dataset with generate_data.py, and open did_demo.ipynb to reproduce every code block with pre-saved outputs.

If you ship AI features in waves, your rollout calendar is already a DiD study. The only question is whether you run the analysis.

experimentation - freeCodeCamp.org

Product Experimentation with Regression Discontinuity: How an LLM Confidence Threshold Creates a Natural Experiment in Python

Table of Contents

Why Threshold Routing is a Natural Experiment

What Regression Discontinuity Actually Does

Prerequisites

Setting Up the Working Example

Step 1: A Sharp RDD with Local Linear Regression

Step 2: Try Different Bandwidths

Step 3: Checking for Manipulation at the Threshold

Step 4: Quadratic Specification as a Robustness Check

Step 5: Bootstrap Confidence Intervals

When Regression Discontinuity Fails

What to Do Next

Product Experimentation with Propensity Scores: Causal Inference for LLM-Based Features in Python

Table of Contents

Why Opt-in Features Break Naïve Comparisons

1. Selection on engagement

2. Selection on intent

3. Selection on risk tolerance

What Propensity Scores Actually Do

Prerequisites

Setting Up the Working Example

Step 1: Estimate the Propensity Score

Step 2: Inverse-Probability Weighting

Step 3: Nearest-Neighbor Matching

Step 4: Check Covariate Balance

Step 5: Bootstrap Confidence Intervals

When Propensity Score Methods Fail

1. Unmeasured Confounders (Violate Unconfoundedness)

2. Positivity (Overlap) Failures (Violates Overlap)

3. Misspecified Propensity Models (Degrade Unconfoundedness in Practice)

4. Spillovers Between Users (Violates SUTVA)

What to Do Next

Product Experimentation for AI Rollouts: Why A/B Testing Breaks and How Difference-in-Differences in Python Fixes It

Table of Contents

Why A/B Testing Breaks for Staged Rollouts

1. The wave assignment isn't random.

2. The calendar introduces a time trend

3. Adoption inside treated workspaces is itself selective

What Difference-in-Differences Does

Companion Notebook

Prerequisites

Setting Up the Working Example

Step 1: A Simple 2x2 DiD

Step 2: Regression DiD with Fixed Effects

Step 3: Checking the Parallel-Trends Assumption

When Difference-in-Differences Fails

1. Non-parallel Pre-trends

2. Staggered Adoption

3. Time-varying Confounders that Hit Only the Treated Group

4. Anticipation Effects

What to Do Next