Data Science - freeCodeCamp.org

How to Clean Time Series Data in Python

Bala Priya C — Mon, 18 May 2026 09:01:50 +0000

Real-world time series data is rarely clean. Sensors drop out, systems clock-drift, pipelines duplicate records, and manual data entry introduces mistakes. By the time a dataset reaches your notebook, it has passed through collection, transmission, and storage, each step a potential source of corruption.

Cleaning time series data is harder than cleaning tabular data because time is a structural constraint. You can't shuffle rows or impute a missing value with a column mean without pulling future data into a past observation. Every cleaning decision has to respect temporal ordering, or it breaks the integrity of everything built on top of it.

This guide walks through the full cleaning pipeline in Python: from raw data arrival to a dataset ready for feature engineering or modelling. We'll cover missing value detection and imputation, outlier identification and treatment, duplicate handling, frequency alignment, noise smoothing, and schema validation, applied to sample sensor data throughout.

You can get the Colab notebook from GitHub and follow along.

Prerequisites

To follow along with this guide, you should be:

Comfortable working with Python and pandas DataFrames
Familiar with time-indexed data
Aware of what feature engineering and machine learning modelling involve at a high level

We'll use pandas and numpy for data manipulation, scipy for signal smoothing and statistical tests, scikit-learn for anomaly detection, and statsmodels for seasonal decomposition. Install them before running any code in this guide:

pip install pandas numpy scipy scikit-learn statsmodels

How to Audit Your Time Series Before Cleaning It
How to Reindex to a Canonical Frequency
How to Handle Missing Values
How to Detect and Handle Outliers
How to Remove Duplicates
Frequency Alignment and Resampling
Smoothing Noise
- Exponential Weighted Moving Average
- Savitzky-Golay Filter
Schema and Sanity Validation
The Complete Cleaning Checklist

How to Audit Your Time Series Before Cleaning It

The first rule of data cleaning is: look before you cut. Before imputing, smoothing, or dropping anything, you need a complete picture of what's wrong and where.

A good audit covers the following:

The time index: Is it regular? Are there gaps?
Missing value distribution: Are missing values random or clustered?
Value range: Are there obvious spikes, dips, or sensor failures?
Duplicate timestamps

Let's spin up a sample dataset (with some of the above problems):

import numpy as np
import pandas as pd

# Simulate one week of smart grid voltage readings (hourly)
# with realistic problems injected
periods = 168
index = pd.date_range("2024-06-01", periods=periods, freq="h")

voltage = (
    230.0
    + 3.5 * np.sin(2 * np.pi * np.arange(periods) / 24)
    + np.random.normal(0, 1.2, periods)
)

# Inject problems
voltage[14:17] = np.nan          # sensor dropout: 3 consecutive missing
voltage[42] = np.nan             # isolated missing
voltage[78] = 312.4              # spike outlier
voltage[101:104] = np.nan        # another dropout
voltage[130] = 187.2             # dip outlier

series = pd.Series(voltage, index=index, name="voltage_v")

# --- Audit ---
print("=== TIME SERIES AUDIT ===")
print(f"Period:        {series.index.min()} → {series.index.max()}")
print(f"Observations:  {len(series)}")
print(f"Expected freq: {pd.infer_freq(series.index)}")
print(f"\nMissing values: {series.isna().sum()} ({series.isna().mean()*100:.1f}%)")
print(f"Value range:    [{series.min():.2f}, {series.max():.2f}]")
print(f"Mean ± Std:     {series.mean():.2f} ± {series.std():.2f}")

# Identify consecutive missing runs
missing_mask = series.isna()
missing_runs = []
run_start = None
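for i, (ts, is_missing) in enumerate(missing_mask.items()):
    if is_missing and run_start is None:
        run_start = ts
    elif not is_missing and run_start is not None:
        missing_runs.append((run_start, missing_mask.index[i - 1]))
        run_start = None

# Close a run that extends to the final timestamp
if run_start is not None:
    missing_runs.append((run_start, missing_mask.index[-1]))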

print(f"\nMissing runs ({len(missing_runs)} total):")
for start, end in missing_runs:
    print(f"  {start} → {end}")

Output:

=== TIME SERIES AUDIT ===
Period:        2024-06-01 00:00:00 → 2024-06-07 23:00:00
Observations:  168
Expected freq: h

Missing values: 7 (4.2%)
Value range:    [187.20, 312.40]
Mean ± Std:     230.22 ± 7.81

Missing runs (3 total):
  2024-06-01 14:00:00 → 2024-06-01 16:00:00
  2024-06-02 18:00:00 → 2024-06-02 18:00:00
  2024-06-05 05:00:00 → 2024-06-05 07:00:00

This audit gives you a map of the damage before you start cleaning. The key task is distinguishing isolated missing values, which can be imputed from local context, from long missing runs, which may need a different strategy or a flag for downstream consumers.
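If you'd rather not loop over every timestamp, the same run detection can be sketched in a vectorized way. The helper name `missing_runs_table` and the toy series are illustrative assumptions, not part of the audit above. It also reports each run's length, which is what you need to separate isolated gaps from long dropouts.

```python
import numpy as np
import pandas as pd

def missing_runs_table(series: pd.Series) -> pd.DataFrame:
    """Return one row per run of consecutive NaNs: start, end, length."""
    is_missing = series.isna()
    # A new run id starts every time the missing/valid state flips.
    run_id = is_missing.ne(is_missing.shift(fill_value=False)).cumsum()
    mask = is_missing.to_numpy()
    # Timestamps of the missing positions, grouped by the run they belong to.
    missing_ts = pd.Series(series.index[mask])
    runs = missing_ts.groupby(run_id.to_numpy()[mask]).agg(["min", "max", "count"])
    runs.columns = ["start", "end", "length"]
    return runs.reset_index(drop=True)

# Toy hourly series: a 3-step gap, an isolated gap, and a gap at the very end.
idx = pd.date_range("2024-06-01", periods=10, freq="h")
demo = pd.Series(
    [1.0, np.nan, np.nan, np.nan, 5.0, np.nan, 7.0, 8.0, np.nan, np.nan],
    index=idx,
)

runs = missing_runs_table(demo)
print(runs)
```

Because the grouping does not wait for the next valid reading, a run that reaches the last timestamp is reported too.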

How to Reindex to a Canonical Frequency

Before imputing missing values, you need to confirm your time index is actually regular. A common problem in ingested time series is that missing timestamps are simply absent rather than represented as NaN rows — which means a .fillna() call will never find them.

# Simulate a sensor feed with missing timestamps (not just missing values)
irregular_index = index.delete([14, 15, 16, 42, 101, 102, 103])
irregular_series = series.dropna().reindex(irregular_index)

print(f"Original length:   {len(series)}")
print(f"Irregular length:  {len(irregular_series)}")
print(f"Inferred freq:     {pd.infer_freq(irregular_series.index)}")  # None = irregular

# Reindex to the full canonical hourly grid
canonical_index = pd.date_range(
    start=irregular_series.index.min(),
    end=irregular_series.index.max(),
    freq="h"
)

reindexed = irregular_series.reindex(canonical_index)

print(f"\nAfter reindex:")
print(f"Length:         {len(reindexed)}")
print(f"Missing values: {reindexed.isna().sum()}")
print(f"Inferred freq:  {pd.infer_freq(reindexed.index)}")

Output:

Original length:   168
Irregular length:  161
Inferred freq:     None

After reindex:
Length:         168
Missing values: 7
Inferred freq:  h

pd.infer_freq returning None is your signal that the index has gaps. After reindexing to the canonical grid, missing timestamps become explicit NaN rows, and now your imputation logic can find them.
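Before reindexing, it can help to list exactly which timestamps never arrived. Here is a minimal sketch on a made-up 8-hour feed (the variable names are assumptions for illustration):

```python
import pandas as pd

full = pd.date_range("2024-06-01", periods=8, freq="h")
observed = full.delete([2, 3, 6])   # timestamps that never arrived

# The canonical grid runs from the first to the last observed timestamp.
expected = pd.date_range(observed.min(), observed.max(), freq="h")
absent = expected.difference(observed)

print(f"Expected {len(expected)} timestamps, observed {len(observed)}")
print("Absent timestamps:")
print(absent)
```

One caveat: a grid built from the observed minimum and maximum can't reveal readings missing before the first or after the last observation. If you know the intended collection window, build the grid from that instead.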

How to Handle Missing Values

Not all missing values should be handled the same way. A single isolated missing reading in a smooth signal is best filled with interpolation. A 3-hour sensor dropout in a volatile signal, however, might be better flagged than fabricated. Strategy should match both gap length and signal behavior.
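One way to make "match the strategy to the gap length" concrete is to interpolate only short gaps and leave long ones as NaN with a flag. The sketch below is an illustration under assumptions: the helper `fill_short_gaps`, the `max_gap` parameter, and the output column names are hypothetical, and the right cutoff depends on your signal.

```python
import numpy as np
import pandas as pd

def fill_short_gaps(series: pd.Series, max_gap: int) -> pd.DataFrame:
    """Interpolate NaN runs of at most `max_gap` steps; leave longer runs as NaN."""
    is_missing = series.isna()
    # Label each run of identical missing/valid state with its own id.
    run_id = is_missing.ne(is_missing.shift(fill_value=False)).cumsum()
    # Length of the NaN run each position belongs to (0 for valid positions).
    run_length = is_missing.astype(int).groupby(run_id).transform("sum")
    short_gap = is_missing & (run_length <= max_gap)

    # Interpolate every interior gap, then keep the result only inside short gaps.
    interpolated = series.interpolate(method="time", limit_area="inside")
    value = series.where(~short_gap, interpolated)
    return pd.DataFrame({
        "value": value,
        "imputed": is_missing & value.notna(),
        "needs_review": value.isna(),
    })

idx = pd.date_range("2024-06-01", periods=12, freq="h")
demo = pd.Series(
    [0.0, 1.0, np.nan, 3.0, 4.0, np.nan, np.nan, np.nan, np.nan, 9.0, 10.0, 11.0],
    index=idx,
)

result = fill_short_gaps(demo, max_gap=2)
print(result)
```

The isolated gap is filled, while the 4-step dropout stays NaN and is marked for review, so downstream consumers can decide what to do with it.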

Forward Fill — For Step-Function Signals

Forward fill is appropriate when the variable holds its last known value until something changes it — a machine state, a setpoint, a categorical flag.

# Equipment operating mode — a step signal
mode_data = pd.Series(
    ["running", "running", np.nan, np.nan, "idle", "idle", np.nan, "running"],
    index=pd.date_range("2024-06-01", periods=8, freq="h"),
    name="operating_mode"
)

filled_mode = mode_data.ffill()
print(pd.DataFrame({"original": mode_data, "ffill": filled_mode}))

Output:

                    original    ffill
2024-06-01 00:00:00  running  running
2024-06-01 01:00:00  running  running
2024-06-01 02:00:00      NaN  running
2024-06-01 03:00:00      NaN  running
2024-06-01 04:00:00     idle     idle
2024-06-01 05:00:00     idle     idle
2024-06-01 06:00:00      NaN     idle
2024-06-01 07:00:00  running  running

Time-Weighted Interpolation — For Continuous Signals

For continuous sensor readings, linear interpolation weighted by time handles irregular gaps correctly because it doesn't assume equal spacing.

# Fill the voltage series using time-based interpolation
voltage_clean = reindexed.interpolate(method="time")

# Compare original vs filled around the first gap
gap_window = voltage_clean["2024-06-01 12:00":"2024-06-01 18:00"]
original_window = reindexed["2024-06-01 12:00":"2024-06-01 18:00"]

comparison = pd.DataFrame({
    "original":     original_window,
    "interpolated": gap_window.round(3),
    "was_missing":  original_window.isna(),
})
print(comparison)

Output:

                       original  interpolated  was_missing
2024-06-01 12:00:00  230.290355       230.290        False
2024-06-01 13:00:00  226.798197       226.798        False
2024-06-01 14:00:00         NaN       226.848         True
2024-06-01 15:00:00         NaN       226.897         True
2024-06-01 16:00:00         NaN       226.947         True
2024-06-01 17:00:00  226.996356       226.996        False
2024-06-01 18:00:00  225.410371       225.410        False
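On a regular hourly grid, method="time" and plain linear interpolation give the same answer. The difference only appears when the spacing is uneven, as in this small made-up example:

```python
import numpy as np
import pandas as pd

# Readings at 00:00 and 04:00, with a missing value at 01:00 (uneven spacing).
idx = pd.to_datetime(["2024-06-01 00:00", "2024-06-01 01:00", "2024-06-01 04:00"])
s = pd.Series([10.0, np.nan, 18.0], index=idx)

by_position = s.interpolate(method="linear")  # treats rows as equally spaced
by_time = s.interpolate(method="time")        # weights by the elapsed time

print(pd.DataFrame({"linear": by_position, "time": by_time}))
```

Linear interpolation places the missing value halfway between its neighbours (14.0), while time interpolation recognises that 01:00 is only a quarter of the way from 00:00 to 04:00 and returns 12.0.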

Seasonal Decomposition Imputation — For Long Gaps

For gaps longer than a few steps in a seasonal signal, interpolating across the gap ignores the seasonal pattern. A better approach is to decompose the series, impute each component separately, then reconstruct.

from statsmodels.tsa.seasonal import seasonal_decompose

# Use a longer series for decomposition (needs enough periods)
long_voltage = pd.Series(
    230.0
    + 3.5 * np.sin(2 * np.pi * np.arange(336) / 24)
    + np.random.normal(0, 1.0, 336),
    index=pd.date_range("2024-06-01", periods=336, freq="h")
)

# Inject a 6-hour gap
long_voltage.iloc[100:106] = np.nan

# Interpolate first to give decompose a complete series to work with
temp_filled = long_voltage.interpolate(method="time")
decomp = seasonal_decompose(temp_filled, model="additive", period=24)

# Reconstruct: trend + seasonal + zero residual for missing positions
reconstructed = long_voltage.copy()
missing_idx = long_voltage[long_voltage.isna()].index
reconstructed[missing_idx] = (
    decomp.trend[missing_idx].ffill()
    + decomp.seasonal[missing_idx]
)

print(f"Missing before: {long_voltage.isna().sum()}")
print(f"Missing after:  {reconstructed.isna().sum()}")
print("\nFilled values at gap:")
print(reconstructed[missing_idx].round(3))

Output:

Missing before: 6
Missing after:  0

The six filled values are printed after these lines; their exact numbers depend on the random noise in the simulated series.

The seasonal decomposition imputation respects the time-of-day pattern. The filled values aren't a straight line across the gap but follow the expected daily curve.

How to Detect and Handle Outliers

Outliers in time series are trickier than in tabular data because context matters. For example, an unusually high or low voltage might be a sensor spike or a genuine grid event. You need methods that use temporal context, not just global statistics.

Z-Score with Rolling Window

A global Z-score misses local anomalies in non-stationary series. A rolling Z-score flags values that are unusual relative to their local neighbourhood.

Note: A non-stationary series is a time series whose statistical properties—such as mean, variance, or trend—change over time instead of remaining constant.

window = 24  # 24-hour rolling window

roll_mean = voltage_clean.rolling(window, center=True, min_periods=1).mean()
roll_std  = voltage_clean.rolling(window, center=True, min_periods=1).std()

rolling_z = (voltage_clean - roll_mean) / roll_std

threshold = 3.0
outliers_z = rolling_z[rolling_z.abs() > threshold]

print(f"Rolling Z-score outliers detected: {len(outliers_z)}")
print(outliers_z.round(3))

Output:

Rolling Z-score outliers detected: 2
2024-06-04 06:00:00    4.646
2024-06-06 10:00:00   -4.484
Name: voltage_v, dtype: float64

Z-score outlier detection works best for approximately Gaussian (normal) distributions because it assumes the data is centered around a mean with symmetric spread measured by standard deviation.
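When the Gaussian assumption is doubtful, one common variation (not used in this guide's pipeline) is a rolling robust Z-score built on the median and the median absolute deviation (MAD). The sketch below is an assumption-laden illustration: the 0.6745 factor rescales the MAD so it is comparable to a standard deviation under normality, and 3.5 is a commonly used cutoff rather than a rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2024-06-01", periods=168, freq="h")
values = 230 + 3.5 * np.sin(2 * np.pi * np.arange(168) / 24) + rng.normal(0, 1.2, 168)
values[78] = 312.4    # spike outlier
values[130] = 187.2   # dip outlier
s = pd.Series(values, index=idx, name="voltage_v")

window = 24
roll_median = s.rolling(window, center=True, min_periods=1).median()
abs_dev = (s - roll_median).abs()
# Rolling MAD: the median of the absolute deviations in each window.
roll_mad = abs_dev.rolling(window, center=True, min_periods=1).median()

robust_z = 0.6745 * (s - roll_median) / roll_mad
flagged = robust_z[robust_z.abs() > 3.5]

print(flagged.round(2))
```

Because the median and MAD barely move when a single extreme value enters the window, the outlier doesn't inflate its own baseline the way it inflates a rolling standard deviation.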

IQR-Based Outlier Detection

The interquartile range (IQR) method is more robust for non-Gaussian distributions because quartiles are barely affected by extreme values. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), representing the spread of the middle 50% of the data. Note that the bounds below are computed over the whole series, so this is a global check rather than a rolling one.

Q1 = voltage_clean.quantile(0.25)
Q3 = voltage_clean.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = voltage_clean[
    (voltage_clean < lower_bound) | (voltage_clean > upper_bound)
]

print(f"IQR bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"Outliers detected: {len(outliers_iqr)}")
print(outliers_iqr.round(2))

Output:

IQR bounds: [220.16, 239.46]
Outliers detected: 2
2024-06-04 06:00:00    312.4
2024-06-06 10:00:00    187.2
Name: voltage_v, dtype: float64

Isolation Forest — For Multivariate Outlier Detection

When you have multiple sensors, an isolated reading on one channel might look normal, but its combination with readings from other channels reveals the anomaly. Isolation Forest handles this naturally.

# Build a multi-sensor DataFrame
np.random.seed(42)
n = 200

sensor_df = pd.DataFrame({
    "voltage_v":    230 + 3 * np.sin(2 * np.pi * np.arange(n) / 24) + np.random.normal(0, 1, n),
    "current_a":    15  + 0.8 * np.sin(2 * np.pi * np.arange(n) / 24) + np.random.normal(0, 0.3, n),
    "frequency_hz": 50  + np.random.normal(0, 0.05, n),
}, index=pd.date_range("2024-06-01", periods=n, freq="h"))

# Inject a multivariate anomaly — voltage drops, current spikes together
sensor_df.iloc[88, 0] = 194.2   # voltage dip
sensor_df.iloc[88, 1] = 28.7    # current surge (consistent with fault)

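from sklearn.ensemble import IsolationForest

clf = IsolationForest(contamination=0.02, random_state=42)
# fit_predict returns labels, not scores: -1 for anomalies, 1 for normal rows
sensor_df["anomaly_score"] = clf.fit_predict(sensor_df[["voltage_v", "current_a", "frequency_hz"]])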

anomalies = sensor_df[sensor_df["anomaly_score"] == -1]
print(f"Anomalies detected: {len(anomalies)}")
print(anomalies[["voltage_v", "current_a", "frequency_hz"]].round(2))

Output:

Anomalies detected: 4
                     voltage_v  current_a  frequency_hz
2024-06-02 07:00:00     234.75      15.84         49.90
2024-06-04 06:00:00     233.09      15.82         50.15
2024-06-04 16:00:00     194.20      28.70         50.08
2024-06-06 05:00:00     235.09      15.41         49.91

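Only one of the four flagged rows (2024-06-04 16:00) is the injected fault. The other three are ordinary readings, flagged because contamination=0.02 tells the model to label roughly 2% of rows (4 of 200) as anomalies regardless of how unusual they are. In practice you'd follow up anomaly scores with domain-specific threshold rules.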
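To apply your own thresholds, you need continuous scores rather than the -1/1 labels. Here is a minimal sketch using IsolationForest's decision_function, where lower values mean more abnormal and negative values fall on the anomaly side of the fitted threshold. The data is simulated in the same spirit as the example above, but with a different random generator, so the numbers won't match it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
n = 200
hours = np.arange(n)

X = pd.DataFrame({
    "voltage_v":    230 + 3 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, n),
    "current_a":    15 + 0.8 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.3, n),
    "frequency_hz": 50 + rng.normal(0, 0.05, n),
}, index=pd.date_range("2024-06-01", periods=n, freq="h"))

# Inject one joint fault: voltage dips while current surges.
X.iloc[88, 0] = 194.2
X.iloc[88, 1] = 28.7

clf = IsolationForest(contamination=0.02, random_state=42).fit(X)

# Continuous score per row; sort so the most abnormal rows come first.
scores = pd.Series(clf.decision_function(X), index=X.index, name="score")
ranked = scores.sort_values()

print(ranked.head(5).round(3))
```

Ranking by score makes it easier to see the distance between a genuine fault and the borderline rows that the contamination setting pushes over the line.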

Outlier Treatment

Once outliers are identified, you can handle them in several ways:

Cap them using Winsorization by limiting extreme values to a threshold.
Replace them with interpolated or estimated values.
Flag them so the model can handle them appropriately.

# Winsorize: cap at the IQR bounds
voltage_winsorized = voltage_clean.clip(lower=lower_bound, upper=upper_bound)

# Replace outliers with time-interpolated values
voltage_outlier_fixed = voltage_clean.copy()
voltage_outlier_fixed[outliers_iqr.index] = np.nan
voltage_outlier_fixed = voltage_outlier_fixed.interpolate(method="time")

print("Outlier treatment comparison:")
for ts in outliers_iqr.index:
    print(f"\n  {ts}")
    print(f"    Original:     {voltage_clean[ts]:.2f}")
    print(f"    Winsorized:   {voltage_winsorized[ts]:.2f}")
    print(f"    Interpolated: {voltage_outlier_fixed[ts]:.2f}")

Output:

Outlier treatment comparison:

  2024-06-04 06:00:00
    Original:     312.40
    Winsorized:   239.46
    Interpolated: 232.01

  2024-06-06 10:00:00
    Original:     187.20
    Winsorized:   220.16
    Interpolated: 231.43

Winsorization preserves the point but clips it to a plausible range — useful when you want to retain the information that something anomalous happened. Interpolation treats the outlier as if it were missing — better when you believe the reading is simply wrong.
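The third option, flagging, can be combined with either treatment: keep the raw value, store the cleaned value next to it, and record which rows were touched. A minimal sketch on made-up readings (the column names are assumptions):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-06-01", periods=6, freq="h")
voltage = pd.Series([229.8, 230.4, 312.4, 230.9, 231.2, 230.1], index=idx, name="voltage_v")

q1, q3 = voltage.quantile(0.25), voltage.quantile(0.75)
iqr = q3 - q1
is_outlier = (voltage < q1 - 1.5 * iqr) | (voltage > q3 + 1.5 * iqr)

cleaned = pd.DataFrame({
    "voltage_raw":   voltage,                                               # untouched original
    "voltage_clean": voltage.mask(is_outlier).interpolate(method="time"),   # outliers re-estimated
    "is_outlier":    is_outlier,                                            # flag for the model
})

print(cleaned)
```

Keeping the flag as its own column lets a downstream model learn from the fact that an anomaly occurred, even though the cleaned value no longer shows it.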

How to Remove Duplicates

Duplicate timestamps are common when data pipelines retry on failure. Unlike tabular duplicates, time series duplicates aren't always identical: a retry might deliver a slightly different reading for the same timestamp.

# Inject duplicate timestamps with slightly different values (retry scenario)
dup_index = index.tolist()
dup_values = voltage_clean.tolist()

# Insert each duplicate right after its original row, as a retry would arrive.
# Work from the higher position down so the earlier position does not shift.
dup_index.insert(56, index[55])                       # retry duplicate
dup_values.insert(56, voltage_clean.iloc[55] + 0.7)   # slightly different value
dup_index.insert(21, index[20])                       # exact duplicate timestamp
dup_values.insert(21, voltage_clean.iloc[20])

dup_series = pd.Series(dup_values, index=pd.DatetimeIndex(dup_index), name="voltage_v")

print(f"Length with duplicates: {len(dup_series)}")
print(f"Duplicate timestamps:   {dup_series.index.duplicated().sum()}")

# Strategy 1: keep first (original reading)
dedup_first = dup_series[~dup_series.index.duplicated(keep="first")]

# Strategy 2: keep mean (average across retries)
dedup_mean = dup_series.groupby(level=0).mean()

print(f"\nAfter dedup (keep first): {len(dedup_first)}")
print(f"After dedup (mean):       {len(dedup_mean)}")

# Show the retry duplicate
ts_retry = index[55]
print(f"\nRetry duplicate at {ts_retry}:")
print(f"  Values:      {dup_series[ts_retry].values.round(3)}")
print(f"  Keep first:  {dedup_first[ts_retry]:.3f}")
print(f"  Mean:        {dedup_mean[ts_retry]:.3f}")

Output:

Length with duplicates: 170
Duplicate timestamps:   2

After dedup (keep first): 168
After dedup (mean):       168

Retry duplicate at 2024-06-03 07:00:00:
  Values:      [234.498 235.198]
  Keep first:  234.498
  Mean:        234.848

For most sensor pipelines, keep-first is the right default; the first delivery is the original reading. Mean makes sense when retries come from independent sensors measuring the same quantity.

Frequency Alignment and Resampling

Real pipelines often mix data at different frequencies. For example, you may need a 1-minute meter reading merged with an hourly weather feed. Before joining them, you need to align frequencies explicitly.

# 1-minute power draw readings: 42 kW base load, +18 kW between 08:00 and 18:59
minute_index = pd.date_range("2024-06-01", periods=1440, freq="min")
power_1min = pd.Series(
    42 + 18 * minute_index.hour.isin(range(8, 19)).astype(int)
    + np.random.normal(0, 2, 1440),
    index=minute_index,
    name="power_kw"
)

# Downsample to hourly: mean is appropriate for power (average over the hour)
power_hourly_mean = power_1min.resample("h").mean().round(2)

# Downsample to hourly: max (peak demand within the hour)
power_hourly_max = power_1min.resample("h").max().round(2)

# Downsample to hourly: sum of 1-minute kW readings / 60 = energy in kWh
energy_hourly_kwh = (power_1min.resample("h").sum() / 60).round(3)

comparison = pd.DataFrame({
    "mean_kw":    power_hourly_mean,
    "peak_kw":    power_hourly_max,
    "energy_kwh": energy_hourly_kwh,
}).iloc[7:13]

print(comparison)

Output:

                     mean_kw  peak_kw  energy_kwh
2024-06-01 07:00:00    42.13    46.28      42.133
2024-06-01 08:00:00    60.56    64.81      60.557
2024-06-01 09:00:00    59.91    64.88      59.912
2024-06-01 10:00:00    60.07    65.16      60.066
2024-06-01 11:00:00    60.08    64.99      60.083
2024-06-01 12:00:00    59.72    63.65      59.724

Which aggregation you choose matters enormously for downstream use. Mean power is right for load profiling. Peak power is right for capacity planning. Sum (converted to kWh) is right for billing. The right answer depends on the domain question, not on the tooling.
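Once the aggregation is chosen, the join itself is short. The sketch below shows both directions for a 1-minute meter and an hourly weather feed, using made-up data and assumed series names. Downsampling the meter is usually the safer default. Forward-filling the weather onto the minute grid assumes the hourly reading holds until the next one arrives.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

minute_idx = pd.date_range("2024-06-01", periods=180, freq="min")
power_1min = pd.Series(50 + rng.normal(0, 2, 180), index=minute_idx, name="power_kw")

hour_idx = pd.date_range("2024-06-01", periods=3, freq="h")
weather_hourly = pd.Series([18.5, 19.2, 21.0], index=hour_idx, name="temp_c")

# Option 1: downsample the meter to hourly, then join on the shared hourly index.
hourly = pd.concat([power_1min.resample("h").mean(), weather_hourly], axis=1)

# Option 2: upsample the weather to the minute grid by carrying each reading forward.
minutely = pd.concat(
    [power_1min, weather_hourly.reindex(minute_idx, method="ffill")],
    axis=1,
)

print(hourly.round(2))
print(minutely.iloc[58:62].round(2))
```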

Smoothing Noise

Raw sensor data often contains high-frequency noise that obscures the underlying signal. Smoothing before feature engineering prevents the model from fitting to noise, but over-smoothing destroys real variation.

Exponential Weighted Moving Average

Exponential Weighted Moving Average or EWMA gives more weight to recent observations and adapts quickly to level changes. This is better than a simple moving average for non-stationary signals.

# Noisy temperature sensor (°C)
temp_noisy = pd.Series(
    3.5
    + 1.2 * np.sin(2 * np.pi * np.arange(168) / 24)
    + np.random.normal(0, 0.8, 168),  # high noise
    index=pd.date_range("2024-06-01", periods=168, freq="h"),
    name="temperature_c"
)

temp_ewma = temp_noisy.ewm(span=6, adjust=False).mean()
temp_sma  = temp_noisy.rolling(window=6, center=True).mean()

comparison = pd.DataFrame({
    "raw":  temp_noisy,
    "ewma": temp_ewma.round(3),
    "sma":  temp_sma.round(3),
}).iloc[22:30]

print(comparison)

Output:

                          raw   ewma    sma
2024-06-01 22:00:00  3.212372  2.843  3.035
2024-06-01 23:00:00  3.106840  2.918  3.176
2024-06-02 00:00:00  3.712290  3.145  3.011
2024-06-02 01:00:00  3.344376  3.202  3.294
2024-06-02 02:00:00  2.148946  2.901  3.705
2024-06-02 03:00:00  4.241105  3.284  4.087
2024-06-02 04:00:00  5.677429  3.968  4.381
2024-06-02 05:00:00  5.400083  4.377  4.765
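To see what span means, it helps to write the EWMA recursion by hand. With adjust=False, pandas starts from the first observation and then applies y_t = alpha * x_t + (1 - alpha) * y_(t-1), where alpha = 2 / (span + 1). The short check below uses made-up numbers:

```python
import numpy as np
import pandas as pd

x = pd.Series([3.0, 4.0, 2.0, 6.0, 5.0])
span = 6
alpha = 2 / (span + 1)   # smoothing factor implied by the span

# Manual recursion: start at the first value, then blend each new reading in.
manual = [x.iloc[0]]
for value in x.iloc[1:]:
    manual.append(alpha * value + (1 - alpha) * manual[-1])
manual = pd.Series(manual, index=x.index)

pandas_ewma = x.ewm(span=span, adjust=False).mean()

print(pd.DataFrame({"manual": manual, "pandas": pandas_ewma}))
```

A larger span means a smaller alpha, so each new reading moves the smoothed value less.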

Savitzky-Golay Filter

For signals where you need to preserve peak shapes — not just smooth them away — the Savitzky-Golay filter fits a polynomial over a sliding window and is better at maintaining the height of genuine spikes.

from scipy.signal import savgol_filter

temp_savgol = pd.Series(
    savgol_filter(temp_noisy.values, window_length=11, polyorder=2),
    index=temp_noisy.index,
    name="temp_savgol"
).round(3)

print(pd.DataFrame({
    "raw":    temp_noisy,
    "savgol": temp_savgol,
}).iloc[22:30])

Output:

                          raw  savgol
2024-06-01 22:00:00  3.212372   2.960
2024-06-01 23:00:00  3.106840   2.944
2024-06-02 00:00:00  3.712290   3.114
2024-06-02 01:00:00  3.344376   3.379
2024-06-02 02:00:00  2.148946   3.809
2024-06-02 03:00:00  4.241105   4.288
2024-06-02 04:00:00  5.677429   4.749
2024-06-02 05:00:00  5.400083   5.138
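The peak-preserving behaviour is easier to see on a clean, noise-free bump than on the noisy series above. This sketch compares a centered moving average with a Savitzky-Golay filter of the same window on a synthetic peak of height 1.0 (the bump shape and its width are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

t = np.arange(101)
# Narrow Gaussian bump centered at t=50 with a true height of 1.0.
peak = np.exp(-((t - 50) ** 2) / (2 * 3.0 ** 2))

window = 11
sma = pd.Series(peak).rolling(window, center=True).mean()
savgol = savgol_filter(peak, window_length=window, polyorder=2)

print(f"True peak height:     {peak.max():.3f}")
print(f"Moving average peak:  {sma.max():.3f}")
print(f"Savitzky-Golay peak:  {savgol.max():.3f}")
```

The moving average flattens the bump noticeably, while the local quadratic fit keeps most of its height. With a window much wider than the feature, both filters will still distort it, so window_length should be chosen relative to the narrowest feature you care about.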

Schema and Sanity Validation

Cleaning without validation is incomplete. You need automated checks that run every time new data arrives — catching problems before they silently corrupt downstream models.

def validate_time_series(series: pd.Series, config: dict) -> dict:
    """
    Run schema and sanity checks on a time series.
    Returns a report dict with pass/fail per check.
    """
    report = {}

    # Frequency check
    inferred = pd.infer_freq(series.index)
    report["freq_regular"] = inferred == config["expected_freq"]

    # Missing value threshold
    missing_rate = series.isna().mean()
    report["missing_below_threshold"] = missing_rate <= config["max_missing_rate"]
    report["missing_rate"] = round(missing_rate, 4)

    # Value range check
    in_range = series.dropna().between(config["min_value"], config["max_value"])
    report["values_in_range"] = in_range.all()
    report["out_of_range_count"] = (~in_range).sum()

    # Duplicate timestamps
    report["no_duplicates"] = not series.index.duplicated().any()

    # Monotonic index
    report["index_monotonic"] = series.index.is_monotonic_increasing

    return report


config = {
    "expected_freq":    "h",   # pandas >= 2.2 reports hourly data as "h", not "H"
    "max_missing_rate": 0.05,
    "min_value":        210.0,
    "max_value":        250.0,
}

report = validate_time_series(voltage_outlier_fixed, config)

print("=== VALIDATION REPORT ===")
for check, result in report.items():
    if check in ("missing_rate", "out_of_range_count"):
        print(f"  {check}: {result}")
    else:
        status = "✓ PASS" if result else "✗ FAIL"
        print(f"  {status}  {check}")

Output:

=== VALIDATION REPORT ===
  ✓ PASS  freq_regular
  ✓ PASS  missing_below_threshold
  missing_rate: 0.0
  ✓ PASS  values_in_range
  out_of_range_count: 0
  ✓ PASS  no_duplicates
  ✓ PASS  index_monotonic

This validator is the kind of function you wrap around every data ingestion step in a production pipeline. Run it before cleaning to know what's broken, and after cleaning to confirm everything passed.
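To turn a report like this into a gate, a pipeline step can refuse to continue when any boolean check fails. The helpers below are a hypothetical sketch (the names `failed_checks` and `require_valid` are assumptions), and the example report is hand-written to mirror the structure shown above.

```python
import numpy as np

def failed_checks(report: dict) -> list:
    """Names of boolean checks that did not pass; numeric diagnostics are ignored."""
    return [
        name for name, result in report.items()
        if isinstance(result, (bool, np.bool_)) and not result
    ]

def require_valid(report: dict) -> None:
    """Raise if any check failed, so the pipeline stops before bad data spreads."""
    failures = failed_checks(report)
    if failures:
        raise ValueError(f"Time series failed validation: {', '.join(failures)}")

# Hand-written example report with one failing check.
example_report = {
    "freq_regular": True,
    "missing_below_threshold": True,
    "missing_rate": 0.0,
    "values_in_range": False,
    "out_of_range_count": 2,
    "no_duplicates": True,
    "index_monotonic": True,
}

failures = failed_checks(example_report)
try:
    require_valid(example_report)
except ValueError as err:
    print(err)
```

Whether a failure should stop the pipeline or only log a warning is a design choice. Raising is the stricter option and suits batch jobs where bad data must not reach a model.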

The Complete Cleaning Checklist

Here's the full sequence to run on any incoming time series dataset:

Step	Technique	When to Use
Audit	Index check, missing map, value range	Always — before anything else
Reindex	`reindex` to canonical frequency	When timestamps are absent rather than NaN
Missing: short gaps	Time interpolation	Continuous signals, gaps ≤ 3 steps
Missing: step signals	Forward fill	Categorical or setpoint data
Missing: long gaps	Seasonal decomposition impute	Seasonal signals, gaps > 6 steps
Outliers: univariate	Rolling Z-score or IQR	Single sensor, local anomalies
Outliers: multivariate	Isolation Forest	Multiple correlated sensors
Outlier treatment	Winsorize or interpolate	Depending on whether event is real
Duplicates	Keep first or group mean	Pipeline retry duplicates
Resampling	`.resample()` with correct aggregation	Frequency alignment before joins
Smoothing	EWMA or Savitzky-Golay	Noisy sensors before feature engineering
Validation	Schema + sanity checks	After cleaning, and on every new batch

Wrapping Up

The order matters. Reindex before imputing. Impute before smoothing. Validate after everything. Skipping steps or doing them out of order compounds errors in ways that are very difficult to trace back once you're looking at model predictions.
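As a rough illustration of that ordering, here is a compact sketch that chains the first few steps. The function name `clean_series` and its parameters are assumptions, and outlier handling, smoothing, and validation are left out to keep it short.

```python
import numpy as np
import pandas as pd

def clean_series(raw: pd.Series, freq: str = "h", max_gap: int = 3) -> pd.Series:
    """Sort, deduplicate, reindex, then impute, in that order."""
    # 1. Sort by time with a stable sort so duplicates keep their arrival order.
    s = raw.sort_index(kind="mergesort")
    # 2. Drop duplicate timestamps, keeping the first delivery.
    s = s[~s.index.duplicated(keep="first")]
    # 3. Reindex to the canonical grid so absent timestamps become NaN rows.
    grid = pd.date_range(s.index.min(), s.index.max(), freq=freq)
    s = s.reindex(grid)
    # 4. Impute. Note: `limit` fills the first `max_gap` steps of a longer gap
    #    too, so longer gaps end up partially filled rather than untouched.
    return s.interpolate(method="time", limit=max_gap, limit_area="inside")

# Made-up feed: out of order, one retry duplicate at 01:00, and 03:00 absent.
stamps = pd.to_datetime([
    "2024-06-01 02:00", "2024-06-01 00:00", "2024-06-01 01:00",
    "2024-06-01 01:00", "2024-06-01 05:00", "2024-06-01 04:00",
])
raw = pd.Series([3.0, 1.0, 2.0, 2.5, 6.0, 5.0], index=stamps, name="reading")

cleaned = clean_series(raw)
print(cleaned)
```

Running the steps in a different order changes the result: imputing before reindexing would never see the absent 03:00 row, and deduplicating before sorting could keep the wrong copy.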

Time series cleaning isn't glamorous work, but a model trained on clean data and thoughtfully engineered features will almost always outperform a more sophisticated model trained on data that wasn't cleaned properly. Getting this pipeline right is the highest-leverage thing you can do before you try running even the simplest algorithm on your time series data.

Data Science Insights: Why the Mean Lies When Handling Messy Retail Data

Rakshath Naik — Tue, 05 May 2026 16:59:17 +0000

In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on.

Let's take the case of a retail shop. If we're looking at the average order value to understand customer spending, we'd load the data, run the code, and get a result of $20 per order.

Done.

Except something looks odd.

When we take a closer look, we see that most customers are buying items worth $8 - $15. So where's $20 coming from?

In this case, the problem isn't the data, it's the average. This is a classic textbook trap: the mean behaves perfectly on tidy textbook examples, but real-world data isn't that well behaved.

Some customers buy in bulk (very large orders), some return orders (negative quantities), and a few anomalies distort the entire picture.

In this article, we'll use the Online Retail Dataset to answer a simple but tricky question: What does “average” really mean in the real world?

Prerequisites
The Dataset
Mean: The Sensitive Giant
Median: The Robust Middle
Beyond Averages: Understanding Spread with Quartiles
Applying IQR to Our Dataset
Final Comparison and Insights
Conclusion
Connect with me

Prerequisites

To follow along here, you'll need:

Basic Python knowledge: Understanding of variables and functions.

The Pandas library: Familiarity with loading data and basic DataFrame operations.

A development environment: Access to a tool like Jupyter Notebook, VS Code, or Google Colab.

A Dataset: For this analysis, I used the Online Retail Dataset, which is available for download here.

The Dataset

We'll work with the Online Retail Dataset, a real-world transactional dataset containing purchase records from a UK-based online retail store.

Source: UCI Machine Learning Repository
Collected by: UK-based online retail company (2010–2011)
Size: 541,909 transactions
Features: 8 attributes (InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country)
Ownership: Public dataset hosted by UCI
License: Open for research and educational use

Mean: The Sensitive Giant

In statistics and data analysis, the terms "average" and "arithmetic mean" are often used interchangeably. Here we want the mean total price per transaction, which for the Online Retail Dataset is:

$$\text{Average Order Value} = \frac{\text{Sum of all TotalPrice values}}{\text{Number of transactions}}$$

In our dataset, the mean is calculated by summing all transaction values (including bulk purchases and returns) and dividing by the total number of transactions. Every value, no matter how large or how negative, directly influences the final average.

import pandas as pd

# Load the dataset (reading .xlsx files requires the openpyxl package)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
df = pd.read_excel(url, engine='openpyxl')

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate the Mean (Average Order Value)
mean_value = df['TotalPrice'].mean()
print(f"Average Order Value (Mean): {mean_value:.2f}")

The results are as follows:

Average Order Value (Mean): 20.40

At first glance, the result looks reasonable: every transaction contributes equally. But that's exactly where the problem lies. A handful of extremely high or low transactions can shift the mean away from what customers in the typical range actually spend.

Take a look at the graph for the mean below.

The graph shows the mean Total Price for the Online Retail Dataset. We get a mean of 20.40. (Image by Author)

The graph shows a right-skewed distribution in which the calculated mean of 20.40 is misleading. The tallest bar shows that the majority of transactions lie in the $8 - $15 range, but the red line is dragged to the right by the long tail of high-value bulk orders.

In this scenario, the average price is well above what a typical customer actually spends because it's highly sensitive to outliers – and in reality, the bulk of the data lives in the lower price range.

In simple words, the mean is being pulled to the right by a few extreme values, especially those in the 200–300 range, which are noticeable in the graph.
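A toy example makes this sensitivity concrete. The numbers below are made up for illustration and are not taken from the Online Retail Dataset: seven typical orders in the $8 - $15 range, then the same orders plus a single $300 bulk purchase.

```python
from statistics import mean

typical_orders = [8.0, 9.5, 10.0, 11.0, 12.5, 14.0, 15.0]
with_bulk_order = typical_orders + [300.0]   # one bulk purchase added

mean_typical = mean(typical_orders)
mean_with_bulk = mean(with_bulk_order)

print(f"Mean of typical orders:   {mean_typical:.2f}")
print(f"Mean with one bulk order: {mean_with_bulk:.2f}")
```

A single extra order moves the mean from about 11.43 to 47.50, a value that none of the eight customers actually spent anywhere near.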

Median: The Robust Middle

When the mean is distorted by extreme values, we need a metric that remains unaffected by such outliers. This is where the median comes into play.

Median is defined as the middle value after sorting the data.

In our dataset, we sort all the transactions and pick the middle one.

The formula for calculating the median is:

$$\text{Median} = \begin{cases} X_{\left[ \frac{n+1}{2} \right]} & \text{if } n \text{ is odd} \\ \frac{X_{\left[ \frac{n}{2} \right]} + X_{\left[ \frac{n}{2} + 1 \right]}}{2} & \text{if } n \text{ is even} \end{cases}$$

Unlike the mean, the median doesn't depend on extreme values, and it cares only about the position of the data, not the magnitude.
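The formula translates almost line for line into code. This is a small sketch on made-up order values (the function name `median_by_formula` is an assumption), checked against Python's built-in statistics.median. Note that the formula counts positions from 1, while Python lists count from 0.

```python
import statistics

def median_by_formula(values):
    """Median via the odd/even formula, using 1-based positions shifted to 0-based."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]            # position (n+1)/2
    return (x[n // 2 - 1] + x[n // 2]) / 2    # average of positions n/2 and n/2 + 1

typical_orders = [8.0, 9.5, 10.0, 11.0, 12.5, 14.0, 15.0]   # n = 7 (odd)
with_bulk_order = typical_orders + [300.0]                  # n = 8 (even)

median_typical = median_by_formula(typical_orders)
median_with_bulk = median_by_formula(with_bulk_order)

print(f"Median of typical orders:   {median_typical:.2f}")
print(f"Median with one bulk order: {median_with_bulk:.2f}")
```

Adding a $300 order moves the median only from 11.00 to 11.75, because the extreme value changes which position is in the middle but not how large the middle values are.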

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate only the Median
median_value = df['TotalPrice'].median()
print(f"Typical Order Value (Median): {median_value:.2f}")

The results are as follows:

Typical Order Value (Median): 11.10

Now you'll notice that the result lies in the $8 - $15 range, where most of the transactions lie.

The graph shows the median Total Price, which gives a more representative value for typical customer transactions. (Image by Author)

In the previous graph, the mean was pulled to the right by large orders, but the median just asks what the middle customer spends. So even if someone spends $300 or some transactions are negative, the median stays stable.

In the above figure the median graph accurately highlights the range where most of the customers lie.

Beyond Averages: Understanding Spread with Quartiles

So far, we've studied the median, but knowing the center is not enough.

To truly understand customer spending, we need to understand how the data is spread, and this is where quartiles come into play.

Quartiles divide the dataset into the following parts:

Q1 (25th percentile): 25% of transactions are below this.
Q2 (50th percentile): Median
Q3 (75th percentile): 75% of transactions are below this.

The distance between Q3 and Q1 is the Interquartile Range (IQR):

$$IQR = Q_3 - Q_1$$

The IQR: Detecting Outliers

The IQR measures the spread of the middle 50%.

If the IQR is small, then the data is concentrated. If it's large, the data is spread out. The IQR also helps us identify outliers mathematically.

Outlier Rule:

Lower Bound = Q1 - 1.5 * IQR
Upper Bound = Q3 + 1.5 * IQR

A Simple Example to Understand IQR

Consider the following transaction values:

$$\left[ 5, 8, 10, 12, 15, 18, 20 \right]$$

Step 1: Find the Median (Q2):

The middle value is:

$$Q_2 = 12$$

Step 2: Find Q1 (Lower Quartile):

The lower half is [5, 8, 10]. The median of the lower half is:

$$Q_1 = 8$$

Step 3: Find Q3 (Upper Quartile):

The upper half is [15, 18, 20]. The median of the upper half is:

$$Q_3 = 18$$

Step 4: Calculate IQR:

$$IQR = Q_3 - Q_1 = 18 - 8 = 10$$

Step 5: Find Outlier Bounds:

$$\begin{aligned} \text{Lower Bound} &= Q_1 - 1.5 \times IQR = 8 - 15 = -7 \\ \text{Upper Bound} &= Q_3 + 1.5 \times IQR = 18 + 15 = 33 \end{aligned}$$

Any value below -7 or above 33 is an outlier (but in this demo problem, no outliers exist).
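The five steps above can be checked with a short sketch using only Python's standard library. For this list, `statistics.quantiles` with `method="exclusive"` matches the hand calculation; the variable names are just illustrative.

```python
from statistics import quantiles

values = [5, 8, 10, 12, 15, 18, 20]

# For this list, method="exclusive" gives the same quartiles as the hand calculation
q1, q2, q3 = quantiles(values, n=4, method="exclusive")
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_bound or v > upper_bound]

print(q1, q2, q3)                # 8.0 12.0 18.0
print(iqr)                       # 10.0
print(lower_bound, upper_bound)  # -7.0 33.0
print(outliers)                  # []
```

Keep in mind that pandas' `quantile` interpolates linearly by default, so on a small sample like this one it can return slightly different quartiles than the hand method.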

Applying IQR to Our Dataset

In our retail dataset, instead of neat values, we have bulk values and even negative returns.

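# 1. Calculate IQR and Bounds
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# 2. Count the transactions that fall outside the bounds
outliers = df[(df['TotalPrice'] < lower_bound) | (df['TotalPrice'] > upper_bound)]

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
print(f"Number of Outliers: {len(outliers)}")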

When we calculate IQR for our dataset, we get:

Lower Bound: -18.75
Upper Bound: 42.45
Number of Outliers: 33180

The graph demonstrates outliers, which are any values falling outside the range of -18.75 to 42.45. (Image by Author)

As the graph shows, the values outside the range -18.75 to 42.45 are considered outliers. These values will be removed.

Revisiting the Mean After Removing Outliers

Using the IQR method, we've removed extreme transactions that fell outside the typical spending range.

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Original Mean
mean_value = df['TotalPrice'].mean()
print(f"Original Mean: {mean_value:.2f}")

# IQR Calculation
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")

# Remove Outliers
df_no_outliers = df[(df['TotalPrice'] >= lower_bound) & (df['TotalPrice'] <= upper_bound)]

# New Mean after removing outliers
new_mean = df_no_outliers['TotalPrice'].mean()
print(f"Mean after removing outliers: {new_mean:.2f}")

After recomputing, we get:

Original Mean: 20.40
Lower Bound: -18.75
Upper Bound: 42.45
Mean after removing outliers: 11.63

Removing outliers significantly shifts the mean toward the region where most transactions occur. We now have a much more representative mean of 11.63, as opposed to the mean of 20.40 that the outliers had pulled to the right.

Final Comparison and Insights

Looking at the results from all the graphs, we get a complete understanding of the dataset. The original mean was 20.40, which is significantly higher than the value of most transactions that actually occurred. The mean was pulled upward by a few high-value transactions and distorted by the outliers.

The median, on the other hand, was 11.10, which lies within the range where most transactions are concentrated. This shows that the median is a much better representation of what a typical customer spends, as it's not affected by extreme values.

After removing the outliers using the IQR, the mean dropped to 11.63, bringing it very close to the median. This confirms that the earlier mean was not inherently wrong, but was simply influenced by extreme values in the data. Once those values were handled, the mean became a much more reliable measure of central tendency.

Conclusion

The results show that the mean can be misleading when data contains outliers. In our dataset, the original mean of 20.40 overstated customer spending, while the median (11.10) gave a more realistic picture. After removing outliers, the mean shifted to 11.63, aligning closely with the median.

This highlights a key lesson: The mean isn't wrong, but it must be used with an understanding of the data.

Choosing the right measure of average depends on the dataset, and in messy real-world scenarios, the median or a cleaned mean often tells the true story.

Connect with me

If you want to dive deeper, you can visit: Mean vs Median vs Mode: Understanding Central Tendency in Data Analysis.

The Data Quality Handbook: Data Errors, the Developer's Role, and Validation Layers Explained.

Great John — Tue, 14 Apr 2026 20:29:40 +0000

In August 2012, Knight Capital, a major trading firm in the United States, deployed faulty trading software to its production system. The system ran with incorrect configuration data, which triggered millions of unintended stock trades.

The company lost about $440 million in just 45 minutes. Knight Capital nearly collapsed and had to be rescued by investors. It was later acquired by another firm.

When Target expanded into Canada, the company relied on a new supply chain system that contained incorrect product and inventory data. Product information in the database was incomplete and inaccurate. Prices, sizes, and product descriptions were entered incorrectly.

Inventory systems reported items in stock that were actually unavailable. Customers found empty shelves in stores despite the system showing stock. The company lost over $2 billion in the Canadian market. Target eventually shut down all Canadian stores in 2015.

One employee put it this way: “Even though we had a great supply chain system on paper, we didn’t have accurate data. Bad data leads to bad decisions.”

Another famous example of data-related engineering failures involves the Mars Climate Orbiter spacecraft. One engineering team used metric units (newtons). Another team used imperial units (pounds-force). The system failed to convert the data correctly. The spacecraft entered Mars' atmosphere at the wrong altitude. The mission failed and the spacecraft was destroyed. The loss was about $125 million.

In this article, we'll delve deep into what data quality truly means, the types of data errors that silently break systems, the developer’s responsibility in preventing them, and the validation layers that work together to keep bad data out of production.

What We'll Cover:

Prerequisites
The Importance of Data Quality
- How Does Bad Data Happen in the First Place?
- The Cost of Bad Data
Types of Data Errors
What Makes Good Data?
Data Validation Layers
Testing Strategies to Protect Data Quality
Conclusion

Prerequisites

A basic understanding of what data is
A basic understanding of data structures
An understanding of what an API is
An understanding of what a database is and what it does

The Importance of Data Quality

As you can see from just these few examples, the quality of the data you're working with really matters.

Gartner reports that organisations attribute around $15 million in annual losses to poor‑quality data. The same research also shows that nearly 60% of companies have no clear idea what bad data actually costs them, largely because they don’t track or measure data‑quality issues at all.

A 2016 study by IBM is even more eye-popping. IBM found that poor data quality strips $3.1 trillion from the U.S. economy annually due to lower productivity, system outages, and higher maintenance costs.

Bad data is, and will continue to be, the kryptonite of any organisation. This is even more concerning as more organisations now depend on data for strategy execution than ever before.

When data is wrong, incomplete, duplicated, or inconsistent, the consequences ripple outward: incorrect dashboards mislead teams, misled teams make poor decisions, and those decisions turn into faulty strategy and policy.

Eventually, the organisation pays the price, financially, operationally, and reputationally. And while money can be recovered, reputation rarely bounces back so easily.

How Does Bad Data Happen in the First Place?

Form fields are usually the first place where data enters an application, so they’re often where bad data begins. This is why the developer’s role is so critical.

Many of the most damaging data errors don’t originate from malicious users or complex edge cases – they come from simple oversights that the system should never have allowed in the first place.

But it's equally important to recognise that data quality issues often originate before the data ever reaches an application. Upstream processes — how data is collected, measured, recorded, or pre‑validated — can introduce inaccuracies long before the system receives it.

For example, a nurse might weigh a patient using an uncalibrated mechanical scale, record the incorrect value on a paper form, and later have that value transcribed into the hospital system. By the time the data enters the application, the error is already embedded.

This means that maintaining data quality requires attention both to upstream data collection practices and to the system-level validation that developers control.

When the UI, backend, or API layer permits invalid, incomplete, inconsistent, or logically impossible data to enter the pipeline, the organisation inherits a long‑term liability. Even small choices — such as allowing empty fields, ignoring duplicates, or failing to enforce validation rules — can introduce errors that may only surface months later in reports or dashboards, leading to confusion and inaccurate insights.

The Cost of Bad Data

Data quality can also be impacted at any stage of the data pipeline: before ingestion, in production, or even during analysis.

If bad data is caught in the UI, it's almost free, if we're thinking in terms of cost. If it's caught at the API layer, that's still pretty cheap. If it's caught in the database, the cost is moderate. And if it's caught in a report or ML model months later, that's expensive, and sometimes irreversible.

A key principle in modern data management is: the cheapest and safest place to catch bad data is at the source, and that is before ingestion. The well-known 1-10-100 Rule, introduced by George Labovitz and Yu Sang Chang in 1992, clearly illustrates this idea.

According to the rule, it costs about $1 to validate data at the point of entry, $10 to correct it after it has entered the system, and $100 per record if the error goes unnoticed and causes problems further down the line.
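To see how quickly those numbers diverge, here's a small arithmetic sketch. The per-record costs come from the rule as stated above, while the batch of 500 bad records and the `total_cost` helper are made up for illustration.

```python
# Per-record costs taken from the 1-10-100 rule described above
COST_PER_RECORD = {
    "validated at entry": 1,
    "corrected after entry": 10,
    "left to fail downstream": 100,
}


def total_cost(bad_records: int) -> dict[str, int]:
    """Cost of handling the same bad records at each stage."""
    return {stage: bad_records * cost for stage, cost in COST_PER_RECORD.items()}


# 500 bad records is a made-up figure for illustration
for stage, cost in total_cost(500).items():
    print(f"{stage}: ${cost:,}")
```

The same 500 records cost $500 to stop at entry, $5,000 to correct afterwards, and $50,000 if they slip through unnoticed.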

As the saying goes, an ounce of prevention is worth a pound of cure – and this is especially true when it comes to maintaining high-quality data.

To support this point, I’ve categorised the types of errors and oversights that developers can and should prevent before they ever reach the database, analytics layer, or reporting systems.

Types of Data Errors

Required Field Errors

If you build a registration form that lets a user submit it with important fields left empty (like first name, last name, email address, phone number, date of birth, or address), you're directly letting incomplete data enter the system.

I remember a scenario from my time as a data analyst where I was analysing a dataset containing different types of alarms triggered across several buildings. These alarms fell into categories such as aquarium alarms, intruder alarms, fire alarms, and maintenance alarms.

The purpose of the analysis was simple: identify which buildings had the highest frequency of alarms so that maintenance, resources, or investigations could be allocated appropriately.

Whenever an alarm went off, the security team recorded it using a software system. By the end of each month, we could view the cumulative alarms and generate insights.

But I encountered a major data quality issue. The security officers often selected the alarm category but failed to submit the building where the alarm occurred — and the system allowed this incomplete record to be saved into the database.

Every alarm had to occur in a specific building. Yet during analysis, I would see entries like “20 fire alarms” with no building information attached. Since I couldn’t determine where these alarms happened, the data became unusable. I had no choice but to delete those records because they provided no actionable value.

This is a classic example of poor data validation. If the developer had implemented proper constraints, the system would never allow an alarm to be submitted without a building name.

Required fields should be enforced at the UI and backend levels to prevent missing data from entering the system in the first place. These gaps lead to missing or unusable data in the database, often forcing teams to delete or manually repair records later.

To prevent these errors, you can use required‑field validation, disable the submit button until all mandatory fields are completed, and visually highlight missing fields with inline error messages.

Here's a practical code example of some bad code (no required checks):

From the above code snippet, the core problem is that the form doesn't enforce required input. Neither HTML‑level validation (using the required attribute) nor JavaScript‑based checks are implemented. This omission allows users to submit the form without providing necessary information, making the form unreliable for collecting valid and complete user data.

From a usability and data quality perspective, this is problematic. Forms are typically designed to collect meaningful and complete information, and fields such as “Full name” and “Email” are usually essential. Without marking these inputs as required or validating them programmatically, we risk receiving blank or invalid submissions, which can compromise the quality of stored data and any processes that depend on it.

Here's an example of a better version (UI prevents empty submission):

In this revised version of the code, the addition of the required attribute to both the name and email input elements ensures that the browser won't allow the form to be submitted unless these fields are filled. This is an important step toward maintaining data completeness and improving the overall reliability of the form.

Also, by checking e.target.checkValidity(), we now ensure that the form is evaluated before submission proceeds.

Another positive aspect is the conditional use of e.preventDefault(). When the form is invalid, the default submission behavior is stopped, preventing incomplete or incorrect data from being sent.
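The same rule should also hold once the data leaves the browser. As a rough server-side sketch in Python (not the form code discussed above), a required-field check could look like this. `REQUIRED_FIELDS`, `missing_required_fields`, `submit`, and the sample payloads are all hypothetical names chosen for this example.

```python
REQUIRED_FIELDS = ("name", "email")  # hypothetical field names


def missing_required_fields(form: dict) -> list[str]:
    """Return the required fields that are absent or blank."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = form.get(field)
        # Treat None, "" and whitespace-only strings as missing
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def submit(form: dict) -> bool:
    """Accept the form only when every required field is filled in."""
    missing = missing_required_fields(form)
    if missing:
        print(f"Rejected, missing: {', '.join(missing)}")
        return False
    print("Accepted")
    return True


submit({"name": "Ada", "email": ""})                 # Rejected, missing: email
submit({"name": "Ada", "email": "ada@example.com"})  # Accepted
```

Returning the list of missing fields, rather than a bare true/false, makes it easy to show the user exactly which inputs still need attention.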

Format Validation Errors

If you have a form that allows a user to enter an email without an @ symbol, an email without a domain, a phone number containing letters, or a postcode/ZIP code in the wrong format, you're allowing invalid data to enter the system.

The same applies when you allow a user to submit an impossible date (32/15/2025) or a credit card number with the wrong length.

These issues will cause the data analyst to spend more time cleaning the data, if it's even cleanable. And such incorrect inputs create unreliable data that breaks downstream processes and increases cleanup costs.

To prevent these types of errors, you can use regex validation, input masks, and field‑type restrictions (for example, numeric‑only fields for phone numbers) to enforce correct formats before submission.

Here's a bad example of allowing format validation errors:

This code doesn't perform any checks on the format or structure of the phone number. The function simply retrieves whatever value exists – whether valid, invalid, or blank – and logs it to the console without any condition.

Here's the fixed version:

This version fixes the earlier mistake by introducing a clear validation rule. Before the system accepts the phone number, it checks whether the input contains only digits. The regular expression ^\d+$ ensures that the value is made up entirely of numbers, with no letters or symbols allowed. If the user enters anything invalid, the function stops and displays an error message instead of saving bad data.

This approach prevents the format error that occurred in the previous example. Instead of blindly trusting whatever the user types, the code now enforces a rule that matches the expected format of a phone number. This is what a responsible developer should do: verify the input before using it.
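Here's how the digits-only rule might look in Python, as a minimal sketch rather than a full phone-number validator. In Python, `\d` also matches non-ASCII digits and `$` tolerates a trailing newline, so this version uses `[0-9]+` with `fullmatch` instead of a literal `^\d+$`. The `is_valid_phone` name and the sample inputs are assumptions.

```python
import re

# [0-9]+ with fullmatch: ASCII digits only, nothing before or after
PHONE_PATTERN = re.compile(r"[0-9]+")


def is_valid_phone(value: str) -> bool:
    """True when the value consists only of ASCII digits."""
    return PHONE_PATTERN.fullmatch(value) is not None


for candidate in ["08012345678", "0801-234-5678", "phone123", ""]:
    print(repr(candidate), is_valid_phone(candidate))
```

Only the first candidate passes. A production rule would usually also check length and country format, which this sketch leaves out.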

Range and Limit Errors

Allowing users to enter values outside acceptable limits (such as negative ages, quantities below zero, discounts above 100%, or measurements far beyond realistic ranges) enables the ingestion of data that violates business rules. These errors distort analytics, break calculations, and create operational inconsistencies.

To mitigate these errors, you can apply min/max constraints, sliders, steppers, and numeric boundaries to ensure values fall within valid ranges.

Here's a bad example of allowing range and limit errors:

As seen above, the input field for age is created without any limits or constraints. The browser allows the user to type any number, including values that make no sense, such as negative ages, extremely large ages, or decimals. The JavaScript function simply reads the value and logs it without checking whether the age is realistic.

Here's a better version:

Now in this version, the inclusion of the min="0" and max="120" attributes sets clear boundaries for acceptable input values. This ensures that only realistic age values within a defined range are allowed, preventing invalid entries such as negative numbers or excessively large ages.

The JavaScript function further enhances this validation by using the checkValidity() method. This method checks whether the input satisfies all defined constraints, including the required condition and the specified numeric range. If the input doesn't meet these conditions, the function prevents further execution and displays an alert message, informing the user that the entered age must fall within the allowed range.
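The same boundary check can be sketched in Python for code that receives the value after the browser. The 0 to 120 limits mirror the `min` and `max` attributes described above, and `parse_age` is a hypothetical helper.

```python
MIN_AGE, MAX_AGE = 0, 120  # same limits as the min/max attributes above


def parse_age(raw: str) -> int:
    """Convert raw input to an age, raising ValueError when it is invalid."""
    age = int(raw.strip())  # raises ValueError for "", "abc" or "12.5"
    if not MIN_AGE <= age <= MAX_AGE:
        raise ValueError(f"Age must be between {MIN_AGE} and {MAX_AGE}, got {age}")
    return age


for raw in ["34", "-5", "400", "12.5"]:
    try:
        print(raw, "->", parse_age(raw))
    except ValueError as exc:
        print(raw, "-> rejected:", exc)
```

Raising an error instead of silently clamping the value keeps the bad input visible, so the caller can decide how to report it.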

Logical Consistency Errors

If you allow a user to select an end date before the start date, choose a checkout date earlier than check‑in at a hotel, or enter a delivery date before the order date, this will result in logically impossible data. The same applies when you allow a user to enter a graduation year earlier than their admission to a program, or submit working hours that exceed 24 hours in a day.

You can mitigate this by implementing cross‑field validation, business‑rule checks, and conditional logic that ensures related fields remain consistent.

Here's a bad example of a logical consistency error:

In the code above, the core issue is the complete absence of validation. Although the inputs use type="date", which provides a structured way for users to select dates, the code doesn't enforce that either field is required. This means the user can leave one or both date fields empty, and the save() function will still run and log the values. As a result, the system may end up processing incomplete or meaningless data.

Beyond missing required checks, the code also fails to validate the logical relationship between the two dates. In any scenario involving a start date and an end date, it's expected that the start date shouldn't occur after the end date. But this code performs no such comparison.

This means that the user can select a start date that's later than the end date, and the system will accept it without warning. This leads to inconsistent or impossible data being recorded.

Also, the function simply logs the values without providing any feedback to the user. There's no mechanism to alert the user when a field is empty or when the dates are logically incorrect. This reduces usability and makes it difficult for users to understand or correct their mistakes.

Here's the fixed version:

In this improved version, first, both date fields now include the required attribute, ensuring that the user can't leave either field empty without triggering validation.

Second, we've added a logical validation check to ensure that the relationship between the two dates is correct. After retrieving the values, the function converts them into Date objects and compares them to verify that the end date doesn't occur before the start date. If this condition is violated, the function stops execution and displays an alert informing the user of the error.

This prevents inconsistent or impossible date ranges from being accepted.
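A cross-field check like this is worth repeating wherever the dates are received. Here's a minimal Python sketch, assuming the dates arrive as `YYYY-MM-DD` strings (the format that `type="date"` inputs submit). `validate_date_range` is a hypothetical helper.

```python
from datetime import date


def validate_date_range(start_raw: str, end_raw: str) -> tuple[date, date]:
    """Parse two ISO dates (YYYY-MM-DD) and check that start <= end."""
    if not start_raw or not end_raw:
        raise ValueError("Both start and end dates are required")
    # fromisoformat raises ValueError for impossible dates such as 2025-02-30
    start = date.fromisoformat(start_raw)
    end = date.fromisoformat(end_raw)
    if end < start:
        raise ValueError("End date cannot be before start date")
    return start, end


print(validate_date_range("2025-03-01", "2025-03-10"))
try:
    validate_date_range("2025-03-10", "2025-03-01")
except ValueError as exc:
    print("Rejected:", exc)
```

This sketch allows the start and end to fall on the same day. Whether that should be valid is a business rule you'd need to decide for your own case.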

Duplicate and Data Integrity Errors

When you let a user submit an email that's already registered, choose a username that's already taken, or enter a duplicate employee ID or student number, this results in identity conflicts and duplicate records. Problems also arise when you allow users to upload unsupported file types, oversized files, or corrupted images.

Security risks can emerge when users are able to enter HTML/script tags (XSS), SQL‑injection patterns, or disallowed special characters. These issues compromise data quality, system integrity, and security.

You can prevent these types of issues by using uniqueness checks, file‑type and size validation, and input sanitization to block duplicates, invalid uploads, and malicious inputs.

Here's an example of a duplicate error:

This code blindly pushes every email into the savedEmails array without checking whether the email already exists. Because there is no duplicate detection, the user can enter the same email multiple times.

Here is the fixed version:

In this improved version of the code, we've implemented proper validation steps to prevent duplicate email entries. Before saving the email, the function checks whether the value already exists in the savedEmails array using the includes() method. If the email is found, the function stops execution and displays an alert informing the user that the email has already been saved. This ensures that each email is stored only once, maintaining the uniqueness and integrity of the data.
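As a Python sketch of the same idea, a set makes the "already saved?" check cheap. The lowercasing and trimming step is an extra assumption added here (it isn't part of the approach described above), and `save_email` is a hypothetical name.

```python
saved_emails: set[str] = set()


def save_email(raw: str) -> bool:
    """Store the email once; return False when it is already saved."""
    email = raw.strip().lower()  # assumption: case and padding are insignificant
    if email in saved_emails:
        print(f"{email} has already been saved")
        return False
    saved_emails.add(email)
    return True


print(save_email("ada@example.com"))    # True
print(save_email(" Ada@Example.com "))  # False, same address after normalising
```

Without the normalising step, the second call would be stored as a separate entry, which is how many real-world duplicates slip in.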

Relational Errors (Reference Integrity)

If you let a user select a city that doesn’t belong to the chosen country, a product ID that no longer exists, a retired SKU, or a shipping method unavailable in the selected region, this can result in broken references.

The same applies when users can select a manager from a different department or choose a fully booked time slot, or when roles and permissions aren't set correctly. These errors break relationships between tables and corrupt downstream joins and reports.

Here, you can use dependent dropdowns, real‑time lookups, and foreign‑key validation to help ensure that users can only select valid, existing, and compatible options.

Here's a bad example of a relational error:

From the above, the mistake in this code is that we've treated country and city as completely independent fields, even though one is supposed to depend on the other. By presenting all cities regardless of the selected country, the interface allows users to create combinations that make no sense — such as choosing “United Kingdom” with “New York” or “United States” with “Manchester.”

Also, because the save() function performs no validation and simply logs whatever the user selects, the system ends up accepting and storing relationships that should never exist. This breaks the logical link between the two fields and leads to invalid, inconsistent data that can corrupt downstream joins and reports.

Here's the fixed, production-ready version:

This improved code turns the country–city form into a controlled, relationship‑aware flow instead of two loose dropdowns.

When the user selects a country, the loadCities() function runs. It first clears the city dropdown and, if no country is selected, keeps the city field disabled so the user can't choose a city on its own.

Once a valid country is chosen, the city dropdown is enabled and populated only with the cities that belong to that specific country, using the citiesByCountry mapping. Also, the city values are normalised (lowercased and stripped of spaces) so they’re consistent and safe to compare.

When the user clicks “Save,” the save() function checks that both a country and a city have been selected. If either is missing, it shows an alert and stops. It then rebuilds the list of valid city values for the chosen country and verifies that the selected city is actually in that list.
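The final membership check is the part that protects the data, so here's a minimal Python sketch of it. The lookup table mirrors the idea of the `citiesByCountry` mapping, but its contents, along with the `normalise` and `is_valid_selection` helpers, are assumptions for illustration.

```python
# Hypothetical lookup data mirroring the citiesByCountry mapping described above
CITIES_BY_COUNTRY = {
    "United Kingdom": ["London", "Manchester"],
    "United States": ["New York", "Chicago"],
}


def normalise(city: str) -> str:
    """Lowercase and strip spaces so city values compare consistently."""
    return city.lower().replace(" ", "")


def is_valid_selection(country: str, city: str) -> bool:
    """True only when both are present and the city belongs to the country."""
    if not country or not city:
        return False
    valid_cities = {normalise(c) for c in CITIES_BY_COUNTRY.get(country, [])}
    return normalise(city) in valid_cities


print(is_valid_selection("United Kingdom", "Manchester"))  # True
print(is_valid_selection("United Kingdom", "New York"))    # False
print(is_valid_selection("United States", ""))             # False
```

In a real system the mapping would typically come from the database, so the check stays in sync with the reference data.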

Structural Errors (Dropdowns, Radio Buttons, Enums)

If users can type a country as “U.S.A”, “USA”, “United States”, or “us”, enter gender as “male”, “Male”, “M”, or “man”, or type a department as “Engineering”, “Eng”, or “engineer”, this can result in inconsistent categorical data.

The same applies to currencies typed as “usd”, “USD”, “US Dollars”, product categories spelled differently, status values like “active”, “Active”, “ACT”, “enabled”, or boolean values like “yes”, “Yes”, “Y”, “1”.

These inconsistencies make analytics, grouping, and reporting unreliable, and the analyst will spend time cleaning and standardizing these values.

You should replace free‑text fields with dropdowns, radio buttons, and enums to enforce standardized categorical values.

Bad example of a structural error:



The problem with this code is that it pretends to save a country value without doing any real validation or enforcing any rules, which makes the form unreliable and prone to bad data.

The form uses a plain text input for “country,” meaning the user can type anything they want — misspellings, random characters, invalid countries, or even leave it blank. Because the input isn’t marked as required and the JavaScript doesn’t check whether the field contains a meaningful value, the form will happily “save” an empty string or nonsense text.

The submit handler prevents the default form submission but does nothing beyond logging whatever the user typed, so the system accepts invalid, incomplete, or malformed data without question. In short, the code collects input but doesn't validate it, doesn't enforce correctness, and doesn't protect the system from bad or unusable values.

Here's the fixed version:



The biggest improvement is that we're no longer relying on a free‑text field for the country. By switching to a dropdown, the form now limits the user to a controlled set of valid options. This prevents misspellings, random text, or invalid country names from ever entering the system.
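The dropdown has a natural counterpart in code: an enum. Here's a minimal Python sketch of a closed set of country values. The `Country` members and the `parse_country` helper are illustrative assumptions, not a complete country list.

```python
from enum import Enum


class Country(Enum):
    """Closed set of allowed values; the members here are illustrative."""
    US = "United States"
    GB = "United Kingdom"
    NG = "Nigeria"


def parse_country(code: str) -> Country:
    """Look up a country by code; raises KeyError for anything not in the enum."""
    return Country[code]


print(parse_country("US").value)  # United States
try:
    parse_country("U.S.A")
except KeyError:
    print("Rejected: not one of", [c.name for c in Country])
```

Because every stored value has to be a member of the enum, variants like “U.S.A” or “us” can't reach the database in the first place.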

These are the main types of data errors you might come across in your work. Now that we've discussed what causes them and some key fixes/preventative measures you can take, let's move on to data quality itself.

What Makes Good Data?

So what, in fact, is data quality? IBM defines it as the degree of accuracy, consistency, completeness, reliability, and relevance of the data collected, stored, and used within an organization or a specific context.

Let's look at each of these features of quality data a bit more closely to understand what they entail.

Completeness:

Completeness measures how much of the required data is actually present. When large portions of fields are missing, the dataset stops representing reality and any analysis built on it becomes unreliable.

An example would be a sign‑up form that stores users, but half of them are missing an email address. If you run an analysis on “email engagement,” your results will be skewed because a big chunk of users can’t even receive emails. This means that this data is incomplete.

Uniqueness:

Uniqueness checks whether each real‑world entity appears only once in the dataset. Duplicate records inflate counts, break joins, and distort metrics.

An example would be a customer table containing two rows for the same person with the same customer ID. When calculating “active customers,” the system counts them twice, inflating revenue projections.
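Completeness and uniqueness are both easy to measure. The following sketch uses a handful of made-up user records to compute the share of users with an email and the customer IDs that appear more than once.

```python
from collections import Counter

# Made-up user records for illustration
users = [
    {"customer_id": 1, "email": "ada@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": ""},
    {"customer_id": 3, "email": "grace@example.com"},
]

# Completeness: share of records with a non-empty email
with_email = sum(1 for u in users if u["email"])
completeness = with_email / len(users)

# Uniqueness: customer IDs that appear more than once
id_counts = Counter(u["customer_id"] for u in users)
duplicate_ids = [cid for cid, count in id_counts.items() if count > 1]

print(f"Email completeness: {completeness:.0%}")   # 50%
print(f"Duplicate customer IDs: {duplicate_ids}")  # [3]
```

Tracking numbers like these over time is one simple way to notice when a data source starts to degrade.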

Validity:

Validity evaluates whether data follows the expected format, type, or business rules. This includes correct data types, allowed ranges, and patterns defined by the system.

An example would be a field meant to store dates that contains values like “32/99/2025” or “tomorrow.” These invalid entries break downstream ETL jobs that expect a proper date format.

Timeliness:

Timeliness reflects whether data is available when it’s needed. Even accurate data becomes useless if it arrives too late for the process that depends on it. For example, after a customer places an order, the system should generate an order ID instantly.

Accuracy:

Accuracy measures how closely data matches the real‑world truth. When multiple systems report the same metric, one must be designated as the authoritative source to avoid conflicting values.

Consistency:

Consistency checks whether data aligns across different datasets or within related fields. If two systems describe the same concept, their values shouldn't contradict each other.

For example, a company’s HR system reports 50 employees in Engineering, but the payroll system lists only 42. Since both describe the same group, the mismatch signals a data quality issue.

Fitness for Purpose:

Fitness for purpose assesses whether the data is suitable for the specific business task at hand. Even complete, accurate, and timely data may be unhelpful if it doesn’t answer the intended question.

A dataset of website clicks might be perfect for analysing user engagement, for example, but it’s useless for forecasting revenue because it contains no purchase or pricing information.

Data Validation Layers

Now that we've highlighted the characteristics that ensure quality data, it's important to discuss the layers of data validation.

There are five layers you'll need to check to enforce data quality.

Frontend Layer — “Protect the User, Not the System”

Frontend validation plays an important role in enhancing the user experience – but it doesn't provide real protection for a system.

Since frontend logic operates within the user’s environment, we can't trust it as a mechanism for enforcing data quality. Any code executed in the browser is ultimately under the user’s control, meaning it can be disabled, modified, intercepted, or bypassed entirely.

For instance, a user can simply open browser developer tools, remove validation rules, and submit invalid or malicious data without restriction.

Frontend validation also can't reliably enforce complex business rules. Constraints such as ensuring that a discounted price is lower than the original price, validating that a start date precedes an end date, preventing stock levels from becoming negative, or confirming that a product belongs to a valid category within the database require deeper system-level checks.

What the frontend typically validates is required fields, email format, password strength, address fields, and payment input format.

So frontend validation doesn't guarantee data quality or security, as it can be bypassed through API tools (like Postman), disabled JavaScript, malicious bots, and third-party integrations.

Because of this, it's best to treat the front-end as a usability layer, not a trust layer.

Backend Validation — “The Real Gatekeeper”

You can only guarantee true data quality and system integrity at the backend and database layers.

The backend is responsible for enforcing request validation, implementing business logic, and managing authentication and authorization.

If validation fails here, invalid data is rejected before it can propagate. Without this layer, data corruption begins at ingestion.

For example:

$request->validate([
   'name' => 'required|string|max:255',
   'price' => 'required|numeric|min:0',
   'stock' => 'required|integer|min:0',
   'category_id' => 'required|exists:categories,id',
]);

The code snippet above demonstrates how you can use request validation in Laravel to ensure that incoming data meets specific requirements before it's processed or stored in the database. This is an essential practice in web development, as it helps maintain data integrity, prevents errors, and enhances application security.

In this example, we're using the $request->validate() method to define a set of validation rules for four input fields: name, price, stock, and category_id. Each field is assigned a series of constraints that the incoming data must satisfy.

The name field is marked as required, meaning it must be included in the request and can't be empty. It must also be a string, ensuring that only textual data is accepted, and it's limited to a maximum length of 255 characters using max:255. This prevents excessively long inputs that could potentially cause issues in the database or user interface.

Similarly, the price field is required and must be numeric, allowing only numbers such as integers or decimal values. The rule min:0 ensures that the price can't be negative, which is logically consistent for most product pricing scenarios.

The stock field is also required and must be an integer, meaning it can only accept whole numbers. This is appropriate for counting physical items. Like the price field, it includes a min:0 rule to prevent negative stock values, which would not make sense in an inventory system.

Finally, the category_id field is validated to ensure it is both present and valid. The required rule ensures that a category is selected, while the exists:categories,id rule checks that the provided value corresponds to an existing id in the categories database table. This prevents invalid or non-existent category references, thereby preserving relational integrity within the database.

This layer validates null values, data types and formats, allowed ranges, and referential integrity (exists).

Database Layer — “Protect the Data at Rest”

Validation at the application level is insufficient on its own. You'll also need to enforce database-level constraints like NOT NULL constraints, UNIQUE constraints (email, SKU, order number), foreign keys (orders.user_id → users.id), and check constraints (for example, price >= 0).

This layer is critical because application bugs may bypass validation, background jobs and imports may skip controllers, and malicious actors may attempt direct access.

The database layer acts as the final line of defense, ensuring structural integrity regardless of application failures. Database constraints are the last hard stop: they enforce correctness even when code is bypassed.
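To see these constraints act as a hard stop, here is a minimal sketch using Python's built-in sqlite3 module rather than Laravel and MySQL. The table and column names (categories, products, sku) are hypothetical, and try_insert is a helper written only for this illustration. Each rejected row corresponds to one of the constraint types listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: every rule lives in the database, not in application code
conn.executescript("""
CREATE TABLE categories (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE products (
    id          INTEGER PRIMARY KEY,
    sku         TEXT    NOT NULL UNIQUE,
    price       REAL    NOT NULL CHECK (price >= 0),
    stock       INTEGER NOT NULL CHECK (stock >= 0),
    category_id INTEGER NOT NULL REFERENCES categories(id)
);
INSERT INTO categories (id, name) VALUES (1, 'Books');
INSERT INTO products (sku, price, stock, category_id) VALUES ('BK-1', 9.99, 5, 1);
""")


def try_insert(sku, price, stock, category_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO products (sku, price, stock, category_id) "
                "VALUES (?, ?, ?, ?)",
                (sku, price, stock, category_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid row": try_insert("BK-2", 4.50, 3, 1),
    "duplicate sku": try_insert("BK-1", 4.50, 3, 1),       # UNIQUE
    "negative price": try_insert("BK-3", -1.0, 3, 1),      # CHECK
    "missing sku": try_insert(None, 4.50, 3, 1),           # NOT NULL
    "unknown category": try_insert("BK-4", 4.50, 3, 999),  # FOREIGN KEY
}
print(results)
```

Only the first insert should succeed. The other four are rejected by the database itself, which is the behaviour you want when a background job or import skips the application's validation.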

Service Layer / Business Logic — “Validate Real-World Rules”

This layer enforces domain-specific logic that can't be captured by simple validation rules. The service layer is where the application stops asking “Is this data shaped correctly?” and starts asking “Is this allowed to happen in the real world?”.

These rules go beyond request validation and database constraints: they reflect business truth, not structural correctness.

Example:

if ($product->stock < $quantity) {
   throw new OutOfStockException();
}

This prevents overselling and ensures the system reflects physical reality.

if ($cartTotal !== $calculatedTotal) {
   throw new PriceMismatchException();
}

This protects revenue and prevents tampering.

In this layer, you enforce real‑world business rules by ensuring inventory correctness, recalculating totals, applying discount logic, and checking user‑specific limits.
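The same two checks can be sketched outside Laravel. The Python version below is an illustration under stated assumptions, not the article's implementation: check_order, OutOfStockError, and PriceMismatchError are hypothetical names, and prices are held as Decimal values so that the total comparison isn't affected by floating-point rounding.

```python
from decimal import Decimal


class OutOfStockError(Exception):
    """Raised when an order asks for more units than are in stock."""


class PriceMismatchError(Exception):
    """Raised when the client-supplied total differs from the server's."""


def check_order(items, client_total, stock, prices):
    """Validate an order against stock levels and a recalculated total.

    items:        list of (product_id, quantity) pairs
    client_total: total submitted by the client, as a string such as "25.00"
    stock:        dict mapping product_id -> units available
    prices:       dict mapping product_id -> Decimal unit price
    """
    calculated = Decimal("0")
    for product_id, quantity in items:
        if stock[product_id] < quantity:
            raise OutOfStockError(f"product {product_id} is short on stock")
        calculated += prices[product_id] * quantity

    # Never trust the total sent by the client: compare it with our own.
    if Decimal(client_total) != calculated:
        raise PriceMismatchError(
            f"client sent {client_total}, server computed {calculated}"
        )
    return calculated


# Hypothetical sample data
stock = {1: 5, 2: 0}
prices = {1: Decimal("12.50"), 2: Decimal("3.00")}
total = check_order([(1, 2)], "25.00", stock, prices)
print(total)
```

The key design choice is that the server recomputes the total from its own price list and treats the client's figure only as something to verify.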

Jobs / Queues / Data Ingestion — “Validate External Data”

When importing or processing external data (for example, supplier feeds), validation must occur before processing. Check that each batch conforms to the expected schema, that the required columns are present, that the data types are correct, and that the JSON structure is valid. You should also detect duplicate batches before they are loaded.

This is because external data sources are a major source of data quality issues. Without validation here, corrupted data can silently enter the system at scale.
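As a rough sketch of what such an ingestion gate could look like, the Python function below checks a JSON batch before anything is processed. The column names (sku, price, stock) and the validate_batch helper are hypothetical, and duplicate detection is done here by hashing the raw payload, which is only one of several reasonable approaches.

```python
import hashlib
import json

# Hypothetical schema for a supplier feed: column name -> accepted Python types
REQUIRED_COLUMNS = {"sku": str, "price": (int, float), "stock": int}


def validate_batch(raw_json, seen_digests):
    """Return a list of problems found in the batch (empty list = accepted).

    seen_digests is a set of SHA-256 digests of batches already accepted.
    """
    digest = hashlib.sha256(raw_json.encode("utf-8")).hexdigest()
    if digest in seen_digests:
        return ["duplicate batch"]

    try:
        rows = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(rows, list):
        return ["expected a JSON array of rows"]

    problems = []
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append(f"row {i}: not an object")
            continue
        for column, expected in REQUIRED_COLUMNS.items():
            if column not in row:
                problems.append(f"row {i}: missing column '{column}'")
            # bool is a subclass of int, so reject it explicitly
            elif isinstance(row[column], bool) or not isinstance(row[column], expected):
                problems.append(f"row {i}: '{column}' has the wrong type")

    if not problems:
        seen_digests.add(digest)  # remember only batches that passed
    return problems


seen = set()
good = '[{"sku": "A1", "price": 9.5, "stock": 3}]'
bad = '[{"sku": "A2", "price": "cheap"}]'

first = validate_batch(good, seen)
second = validate_batch(good, seen)
third = validate_batch(bad, seen)
print(first, second, third, sep="\n")
```

A batch is recorded as seen only after it passes, so a corrected resubmission of a rejected file is not mistaken for a duplicate.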

Now that we've discussed the layers of a modern application stack, it should be clear that data quality isn't something you “check once” at the UI.

It must be enforced repeatedly, at multiple depths of the system. Each layer catches a different class of defects, and together they form a defensive wall that prevents bad data from ever reaching storage, analytics, or downstream consumers.

Testing Strategies to Protect Data Quality

To wrap up, here are the three foundational testing strategies every developer should apply to protect data quality.

Unit Testing

Unit tests are the first line of defense in data quality. In this context, a “unit” refers to a single column, a single transformation, or a single validation rule.

The purpose is straightforward: verify that the smallest building blocks of your data logic behave exactly as intended. This matters because if these low‑level rules are not tested and validated, incorrect or inconsistent data will flow into the database and contaminate everything built on top of it.

By isolating each rule or transformation, you can guarantee that schema constraints, field‑level assumptions, and low‑level logic remain correct before data ever flows into larger pipelines or business processes.

Typical questions answered at this layer include:

Does this column allow nulls?
Does this regex correctly strip whitespace from email strings?
Does this transformation produce the expected output for a single row?

This is where you can verify that the data contract is sound. If a column must be non‑null, unique, or follow a specific pattern, the unit test enforces it. When these rules fail here, they fail cheaply – before they can corrupt a table or mislead a dashboard.

To make this concrete, here’s what a unit test looks like in a real codebase. Even though this example comes from Laravel, the testing principle is identical to data‑quality unit tests: one rule, one expectation, isolated from everything else.

Example: Testing a Discount Calculation Rule

Imagine your e‑commerce shop has this rule:

If a product costs more than £100, apply a 10% discount.
Otherwise, apply no discount.

Let's say this is your discount logic:

<?php

class DiscountService
{
    public function calculate($price)
    {
        if ($price > 100) {
            return $price * 0.10; // 10% discount
        }

        return 0;
    }
}

The unit test for this logic will be:

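<?php

use PHPUnit\Framework\TestCase;

class DiscountServiceTest extends TestCase
{
    /** @test */
    public function it_applies_discount_when_price_is_above_100()
    {
        $service = new DiscountService();

        $discount = $service->calculate(200);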

        $this->assertEquals(20, $discount);
    }

    /** @test */
    public function it_applies_no_discount_when_price_is_100_or_below()
    {
        $service = new DiscountService();

        $discount = $service->calculate(100);

        $this->assertEquals(0, $discount);
    }
}

The DiscountService contains a simple rule: if a price is greater than 100, a 10% discount is applied. Otherwise, no discount is applied. The unit test verifies this rule in isolation, without involving controllers, databases, or HTTP requests. By testing the service directly, the developer ensures that the core calculation behaves exactly as intended.

The first test checks the positive case — a price of 200 should produce a discount of 20. The second test checks the boundary condition — a price of 100 should produce no discount. Together, these tests confirm both sides of the rule and protect against regressions if the logic changes in the future.

Because this is a Laravel example, note that Laravel tests let you verify both your logic (unit tests) and your full application behaviour (feature tests). You can run them with php artisan test, which runs the suite in the testing environment, so your real database stays untouched as long as a separate test database is configured (for example, in phpunit.xml).

Integration Testing: The Flow & Lineage Check

While unit tests validate the correctness of individual rules, integration tests validate the movement of data across components. Integration testing verifies that multiple layers work together as a single data flow.

In this example, the controller receives an order, calls the discount service, applies the transformation, and persists the result to the database. That interaction across layers is what elevates this from a unit test to an integration test. This is where you test the real‑world flow:

Controller → Service → Repository → MySQL
Check if MySQL migrations run correctly
Check foreign keys enforce relationships
Check to ensure services interact with the database as expected
Check to ensure models and repositories behave consistently

Integration tests reveal issues that only appear when components interact: incorrect joins, broken migrations, mismatched field names, or subtle type mismatches that unit tests cannot detect.

This is the layer where you catch the bugs that would otherwise silently corrupt data lineage.

Here's an example:

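<?php

use App\Models\Order;
use Tests\TestCase;

class ApplyDiscountTest extends TestCase
{
    /** @test */
    public function it_applies_the_discount_and_saves_the_totals()
    {
        $order = Order::factory()->create(['subtotal' => 150]);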

        $response = $this->postJson("/orders/{$order->id}/apply-discount");

        $response->assertStatus(200);

        $this->assertDatabaseHas('orders', [
            'id' => $order->id,
            'grand_total' => 135, // 150 - 10% discount
            'discount_total' => 15
        ]);
    }
}

This represents a full flow rather than a single rule:

Controller → Service
Service → Calculation
Controller → Database write
Database → Final state

This test begins by creating an order using an Eloquent factory. It immediately steps beyond the boundaries of a unit test, since it interacts with the database and relies on Laravel’s model layer to persist real data.

From there, the test sends an actual HTTP POST request to the /orders/{id}/apply-discount endpoint, which means it's not calling a method directly, but instead it's traveling through Laravel’s routing layer, invoking the controller responsible for handling the request, and triggering whatever business logic is responsible for calculating and applying the discount.

This movement through multiple layers (routing, controller, service logic, and model persistence) is precisely what defines integration testing: the goal is to verify that these components work together correctly as a system.

Once the request is processed, the test asserts that the response returns a successful status code, which confirms that the HTTP layer behaved as expected.

But the most important part comes afterward, when the test checks the database to ensure that the correct grand_total and discount_total were saved. This final assertion proves that the discount logic was executed, the model was updated, and the changes were successfully written to the database.

In other words, the test isn't merely checking whether a calculation is correct. It's also checking whether the entire pipeline – from receiving the request to updating the database – functions as a coherent whole.

Functional Testing: The Business Rule Check

Functional tests validate the entire user experience, from the moment a request enters the system to the moment a response is returned. This includes:

HTTP requests
Controller logic
Validation rules
Service operations
Database writes
Redirects or rendered views

This is where you test the business rules that govern real‑world behaviour:

“A student can't register for two exams at the same time.”

“A cart can't have negative quantities.”

“A user can't update their profile without a valid email.”

Functional tests ensure that the system behaves correctly from the perspective of the user and the business, not just the code.

Here's an example of a functional test:

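<?php

use App\Models\Product;
use Tests\TestCase;

class CartUpdateTest extends TestCase
{
    /** @test */
    public function it_rejects_negative_quantities_and_keeps_the_cart_unchanged()
    {
        // Arrange: create a real product
        $product = Product::factory()->create(['price' => 40]);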

        // Simulate existing cart
        $this->withSession([
            'cart' => [
                $product->id => ['quantity' => 2]
            ]
        ]);

        // Act: user tries to update quantity to a negative number
        $response = $this->post('/cart/update', [
            'product_id' => $product->id,
            'quantity' => -5
        ]);

        // Assert: system rejects invalid business behaviour
        $response->assertStatus(302); // redirect back with errors
        $response->assertSessionHasErrors(['quantity']);

        // Assert: cart remains unchanged (business rule preserved)
        $this->assertEquals(2, session('cart')[$product->id]['quantity']);
    }
}

The test begins by creating a realistic environment in which a user interacts with a shopping cart. This is essential for understanding the behaviour the system is meant to enforce.

First, it generates a real product in the database using a factory, giving the product a price so that it resembles an item a customer might genuinely add to their cart.

Once the product exists, the test manually seeds the session with a cart containing that product and a quantity of two. This simulates a user who has already added the item to their cart in a previous interaction, and it establishes the baseline state the system must preserve if the user attempts an invalid update.

With the environment prepared, the test then imitates a user action by sending a POST request to the /cart/update endpoint. Instead of calling a method directly, it uses Laravel’s HTTP layer to reproduce the exact behaviour of a browser submitting a form. The request includes the product ID and a deliberately invalid quantity of negative five.

This is the heart of the scenario: the user is attempting something that violates the business rules of the application, and the test is designed to confirm that the system responds appropriately.

Now, when the request is processed, the test expects the application to reject the input, redirect the user back, and attach validation errors to the session. The assertion that the response has a 302 status code and contains validation errors confirms that the validation layer is functioning correctly and that the controller is enforcing the rule that quantities can't be negative.

The final part of the test is where the business rule is truly verified. After the failed update attempt, the test inspects the session to ensure that the cart remains unchanged. This is crucial because rejecting invalid input is only half of the requirement: the system must also protect the integrity of the existing cart data.

Functional tests answer questions like:

Does the system prevent invalid real‑world behaviour?
Does the user get the correct feedback?
Does the data remain consistent after the request?
Does the final output match the business expectation?

Conclusion

Data quality is never the result of a single check or a single team. It emerges from a disciplined, layered approach where each testing level catches a different category of defects.

Unit tests safeguard the smallest rules, integration tests validate the flow of data across components, and functional tests enforce the business logic that governs real‑world behaviour.

When these layers operate together, bad data has nowhere to hide. When they don’t, even a minor oversight can slip through the cracks and escalate into a costly downstream failure.

So as you can see, your role in data quality is fundamentally proactive, not reactive. By designing systems with validation, integrity, and monitoring in mind, you ensure that data flowing through the pipeline is accurate, timely, complete, unique, and fit for purpose – supporting reliable analytics, reporting, and intelligent systems.

How to Build an End-to-End ML Platform Locally: From Experiment Tracking to CI/CD

Sandeep Bharadwaj Mannapur — Tue, 17 Mar 2026 20:33:56 +0000

Machine learning projects don’t end at training a model in a Jupyter notebook. The hard part is the “last mile”: turning that notebook model into something you can run reliably, update safely, and trust over time.

Most ML systems fail in production for boring (and painful) reasons: the training code and the serving code drift apart, input data changes shape, a “small” preprocessing tweak breaks predictions, or the model silently degrades because real-world behavior shifts. None of these problems is solved by a better algorithm. They’re solved by engineering: repeatable pipelines, validation, versioning, monitoring, and automated checks.

In this hands-on handbook, you’ll build a complete mini ML platform on your local machine: an end-to-end project that takes a model from training to deployment with the core “last mile” infrastructure in place. We’ll use a fraud detection example (predicting fraudulent transactions), but the same workflow works for churn prediction or any binary classification problem. Everything runs locally (no cloud required), and every step is copy-paste runnable so you can follow along and verify outputs as you go.

By the end, you'll have a complete ML pipeline running on your machine – from training the model to serving predictions, with the infrastructure to test, monitor, and iterate with confidence. Let's dive in!

📦 Get the Complete Code
All code from this handbook is available in a ready-to-run repository:
Repository: https://github.com/sandeepmb/freecodecamp-local-ml-platform
Clone it and follow along, or use it as a reference implementation.

Project Overview and Setup
Build a Simple Model and API (The Naive Approach)
- Train a Quick Model
- Serve Predictions with FastAPI
Where the Naive Approach Breaks
Add Experiment Tracking and Model Registry with MLflow
Ensure Feature Consistency with Feast
Add Data Validation with Great Expectations
- Define Expectations
- Integrate Validation into FastAPI
Monitor Model Performance and Data Drift
Automate Testing and Deployment with CI/CD
Incident Response Playbook
How to Put It All Together
What’s Next: Scale to Production
Conclusion
References

Project Overview and Setup

Before we jump into coding, let's set the stage. Our use case is credit card fraud detection – a binary classification problem where we predict whether a transaction is fraudulent (is_fraud = 1) or legitimate (is_fraud = 0). This is a common ML task and a good proxy for production ML challenges because fraud patterns can change over time (allowing us to discuss model drift), and bad input data (for example, malformed transaction info) can cause serious issues if not handled properly.

Tech Stack

We will use Python-based tools that are popular in MLOps but still beginner-friendly:

Tool	Purpose	Why We Chose It
MLflow	Experiment tracking and model registry	Open-source, widely adopted, great UI
Feast	Feature store for consistent feature serving	Production-grade, runs locally, same API for offline/online
FastAPI	High-performance web framework for serving predictions	Fast, automatic docs, modern Python
Great Expectations	Data validation framework	Declarative expectations, great reports
Evidently	Monitoring for data drift and model decay	Beautiful reports, easy to integrate
Docker	Containerization for environment consistency	Industry standard, works everywhere
GitHub Actions	CI/CD automation	Free for public repos, tight GitHub integration

Let me explain each tool briefly:

MLflow is an open-source platform designed to manage the ML lifecycle. It provides experiment tracking (logging parameters, metrics, and artifacts), a model registry (versioning models with aliases), and model serving capabilities. We'll use it to ensure our experiments are reproducible and our models are versioned.

Feast (Feature Store) is an open-source feature store that helps manage and serve features consistently between training and inference. This prevents a common problem called "training-serving skew" where the features used in production differ slightly from those used in training, causing silent accuracy degradation.

FastAPI is a modern, fast web framework for building APIs with Python. It's known for being easy to use, efficient, and producing automatic interactive documentation. We'll use it to serve our model predictions.

Great Expectations is an open-source tool for data quality testing. It allows us to define "expectations" on data (like "amount should be positive" or "hour should be between 0 and 23") and test incoming data against them.

Evidently is an open-source library for monitoring data and model performance over time. It can detect data drift (when input distributions change) and model decay (when accuracy drops).

Docker ensures the same environment and dependencies in development and deployment, avoiding the classic "works on my machine" problem.

GitHub Actions provides CI/CD automation. An efficient CI/CD pipeline helps integrate and deploy changes faster and with fewer errors.

💡 Mental Model: Think of this as building a "safety net" around your ML model. Each tool we add catches a different failure mode, like defensive driving for machine learning.

Prerequisites

You'll need:

Python 3.9+ installed on your machine
Docker Desktop installed and running
GitHub account (if you want to try the CI/CD pipeline)
Basic familiarity with Python and ML concepts (what training and prediction mean)

You don't need MLOps or Kubernetes experience. Everything will be done locally with just Python and Docker – no cloud and no Kubernetes needed.

Project Structure

Let's set up a basic project structure on your local machine. Open your terminal and run:

# Create project directory and subfolders
mkdir ml-platform-tutorial && cd ml-platform-tutorial
mkdir -p data models src tests feature_repo

# Set up a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

Your project structure should look like this:

ml-platform-tutorial/
├── data/              # Training and test datasets
├── models/            # Saved model files
├── src/               # Source code
├── tests/             # Test files
├── feature_repo/      # Feast feature repository
├── venv/              # Virtual environment
└── requirements.txt   # Dependencies

Next, create a requirements.txt with all the necessary libraries:

# requirements.txt

# Core ML libraries
pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0

# Experiment tracking and model registry
mlflow==2.10.0

# Feature store
feast==0.36.0

# API framework
fastapi==0.109.0
uvicorn==0.27.0
httpx==0.26.0

# Data validation
great-expectations==0.18.8

# Monitoring
evidently==0.7.20

# Testing
pytest==8.0.0
pytest-cov==4.1.0

# Utilities
pyarrow==15.0.0
pydantic==2.6.0

📌 Version Note: Exact versions are pinned to ensure reproducibility. Newer versions may work, but all examples were tested with the versions listed here.

Install the dependencies:

pip install -r requirements.txt

This might take a few minutes as it installs all the packages. Once complete, we're ready to start building our project step by step.

Checkpoint: You should have a project folder with data/, models/, src/, tests/, and feature_repo/ directories, and an activated virtual environment with all dependencies installed. Verify by running python -c "import mlflow; import feast; import fastapi; print('All imports successful!')".
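If you also want to confirm that the installed versions match the pins, a small sketch along these lines may help. It uses only the standard library's importlib.metadata. The check_pins helper and the PINNED dictionary are illustrative names, and only a few of the pinned packages from requirements.txt are listed.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pins from requirements.txt (extend as needed)
PINNED = {
    "pandas": "2.2.0",
    "mlflow": "2.10.0",
    "feast": "0.36.0",
    "fastapi": "0.109.0",
}


def check_pins(pinned):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, ok) in check_pins(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name:10s} pinned={PINNED[name]:10s} installed={installed} {status}")
```

A mismatch doesn't necessarily mean the examples will fail, but it's the first thing to check if your output differs from what's shown here.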

Figure 1: The Complete ML Platform We'll Build

Don't worry if this looks complex. We'll build each component step by step, starting with the simplest piece and connecting them together.

1. Build a Simple Model and API (The Naive Approach)

To illustrate why we need all these tools, let's start by building a naive ML system without any MLOps infrastructure. We'll train a simple model and deploy it quickly, then observe what problems arise. This "naive approach" is how most ML projects start – and understanding its limitations will motivate the solutions we implement later.

1.1 Train a Quick Model

First, we need some data. For simplicity, we'll generate a synthetic dataset for fraud detection so that we don't rely on any external data files. The dataset will have features like:

amount: Transaction amount in dollars
hour: Hour of the day (0-23) when the transaction occurred
day_of_week: Day of the week (0=Monday, 6=Sunday)
merchant_category: Type of merchant (grocery, restaurant, retail, online, travel)
is_fraud: Label indicating if the transaction is fraudulent (1) or legitimate (0)

We will simulate that only ~2% of transactions are fraud, which is an imbalance typical in real fraud data. This imbalance is important because it affects how we evaluate our model.
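The ranges in the feature list above can be written down as explicit checks. The sketch below is plain Python, not the Great Expectations API, and validate_transaction is a hypothetical helper. It assumes amounts must be strictly positive and that a transaction arrives as a dictionary.

```python
ALLOWED_CATEGORIES = {"grocery", "restaurant", "retail", "online", "travel"}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, int) and not isinstance(value, bool)


def validate_transaction(tx):
    """Return a list of rule violations for one transaction (empty if valid)."""
    errors = []

    amount = tx.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        errors.append("amount must be a positive number")

    if not _is_int(tx.get("hour")) or not 0 <= tx["hour"] <= 23:
        errors.append("hour must be an integer from 0 to 23")

    if not _is_int(tx.get("day_of_week")) or not 0 <= tx["day_of_week"] <= 6:
        errors.append("day_of_week must be an integer from 0 (Monday) to 6 (Sunday)")

    if tx.get("merchant_category") not in ALLOWED_CATEGORIES:
        errors.append("merchant_category is not a known category")

    if tx.get("is_fraud") not in (0, 1):
        errors.append("is_fraud must be 0 or 1")

    return errors


ok_tx = {"amount": 42.5, "hour": 13, "day_of_week": 2,
         "merchant_category": "grocery", "is_fraud": 0}
bad_tx = {"amount": -5, "hour": 24, "day_of_week": 2,
          "merchant_category": "grocery", "is_fraud": 0}

print(validate_transaction(ok_tx))
print(validate_transaction(bad_tx))
```

Writing the rules down this early makes it easier to spot malformed rows later, whichever validation tool ends up enforcing them.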

Create src/generate_data.py:

# src/generate_data.py
"""
Generate synthetic fraud detection dataset.

This script creates realistic-looking transaction data where fraudulent
transactions have different patterns than legitimate ones:
- Fraud tends to have higher amounts
- Fraud tends to occur late at night
- Fraud is more common for online and travel merchants
"""
import pandas as pd
import numpy as np

def generate_transactions(n_samples=10000, fraud_ratio=0.02, seed=42):
    """
    Generate synthetic fraud detection dataset.
    
    Args:
        n_samples: Total number of transactions to generate
        fraud_ratio: Proportion of fraudulent transactions (default 2%)
        seed: Random seed for reproducibility
    
    Returns:
        DataFrame with transaction features and fraud labels
    
    Fraud transactions have different patterns:
    - Higher amounts (median ~$245 vs ~$33 for legit)
    - Late night hours (0-5, 23)
    - More likely to be online or travel merchants
    """
    np.random.seed(seed)
    n_fraud = int(n_samples * fraud_ratio)
    n_legit = n_samples - n_fraud

    # Legitimate transactions: normal shopping patterns
    # - Amounts follow a log-normal distribution (most small, some large)
    # - Hours are uniformly distributed throughout the day
    # - Merchant categories weighted toward everyday shopping
    legit = pd.DataFrame({
        "amount": np.random.lognormal(mean=3.5, sigma=1.2, size=n_legit),  # median ~$33 (exp(3.5))
        "hour": np.random.randint(0, 24, size=n_legit),
        "day_of_week": np.random.randint(0, 7, size=n_legit),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_legit,
            p=[0.30, 0.25, 0.25, 0.15, 0.05]  # Weighted toward everyday shopping
        ),
        "is_fraud": 0
    })
    
    # Fraudulent transactions: suspicious patterns
    # - Higher amounts (fraudsters go big)
    # - Late night hours (less scrutiny)
    # - More online and travel (easier to exploit)
    fraud = pd.DataFrame({
        "amount": np.random.lognormal(mean=5.5, sigma=1.5, size=n_fraud),  # median ~$245 (exp(5.5))
        "hour": np.random.choice([0, 1, 2, 3, 4, 5, 23], size=n_fraud),  # Late night
        "day_of_week": np.random.randint(0, 7, size=n_fraud),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_fraud,
            p=[0.05, 0.05, 0.10, 0.60, 0.20]  # Weighted toward online/travel
        ),
        "is_fraud": 1
    })
    
    # Combine and shuffle
    df = pd.concat([legit, fraud], ignore_index=True)
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    
    return df

if __name__ == "__main__":
    # Generate dataset
    print("Generating synthetic fraud detection dataset...")
    df = generate_transactions(n_samples=10000, fraud_ratio=0.02)
    
    # Split into train (80%) and test (20%)
    train_df = df.sample(frac=0.8, random_state=42)
    test_df = df.drop(train_df.index)
    
    # Save to CSV files
    train_df.to_csv("data/train.csv", index=False)
    test_df.to_csv("data/test.csv", index=False)
    
    # Print summary statistics
    print(f"\nDataset generated successfully!")
    print(f"Training set: {len(train_df):,} transactions")
    print(f"Test set: {len(test_df):,} transactions")
    print(f"Overall fraud ratio: {df['is_fraud'].mean():.2%}")
    print(f"\nLegitimate transactions - Average amount: ${df[df['is_fraud']==0]['amount'].mean():.2f}")
    print(f"Fraudulent transactions - Average amount: ${df[df['is_fraud']==1]['amount'].mean():.2f}")
    print(f"\nMerchant category distribution (fraud):")
    print(df[df['is_fraud']==1]['merchant_category'].value_counts(normalize=True))

Run the data generation script:

python src/generate_data.py

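You should see output like the following. The exact averages depend on the random draw, and because the amounts are log-normal, the printed means will typically come out well above the medians of about $33 and $245: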

Generating synthetic fraud detection dataset...

Dataset generated successfully!
Training set: 8,000 transactions
Test set: 2,000 transactions
Overall fraud ratio: 2.00%

Legitimate transactions - Average amount: $33.45
Fraudulent transactions - Average amount: $245.67

Merchant category distribution (fraud):
online        0.60
travel        0.20
retail        0.10
restaurant    0.05
grocery       0.05

Now you have data/train.csv and data/test.csv with ~8000 training and ~2000 testing transactions.

Why This Matters: The synthetic data has realistic patterns — fraud is rare (2%), high-value, late-night, and concentrated in certain merchant categories. These patterns give our model something to learn.
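One detail worth checking when reading the amount statistics: in np.random.lognormal, the mean and sigma arguments describe the underlying normal distribution, not the generated amounts. For a log-normal distribution the median is exp(mu) and the mean is exp(mu + sigma^2 / 2). The short sketch below works this out with the standard library for the two parameter pairs used in generate_transactions. The helper names are illustrative.

```python
import math


def lognormal_median(mu, sigma):
    """Median of a log-normal distribution (sigma does not affect it)."""
    return math.exp(mu)


def lognormal_mean(mu, sigma):
    """Theoretical mean of a log-normal distribution."""
    return math.exp(mu + sigma ** 2 / 2)


# (mu, sigma) pairs taken from generate_transactions
for label, mu, sigma in [("legit", 3.5, 1.2), ("fraud", 5.5, 1.5)]:
    print(
        f"{label}: median ~ ${lognormal_median(mu, sigma):.0f}, "
        f"theoretical mean ~ ${lognormal_mean(mu, sigma):.0f}"
    )
```

So the ~$33 and ~$245 figures are the medians of the two distributions. The sample means printed by the script are pulled upward by the long right tail, and the fraud mean in particular can vary a lot between runs because it is computed from only about 200 transactions.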

Now, let's train a quick model. We'll use a simple Random Forest classifier from scikit-learn to predict is_fraud. In this naive version, we won't do much feature engineering – just label encode the categorical merchant_category and feed everything to the model.

Create src/train_naive.py:

# src/train_naive.py
"""
Train a fraud detection model - NAIVE VERSION.

This script demonstrates the "quick and dirty" approach to ML:
- No experiment tracking
- No model versioning
- Just train and save to a pickle file

We'll improve on this in later sections.
"""
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score,
    confusion_matrix,
    classification_report
)

def main():
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    print(f"Training samples: {len(train_df):,}")
    print(f"Test samples: {len(test_df):,}")
    print(f"Training fraud ratio: {train_df['is_fraud'].mean():.2%}")
    
    # Encode the categorical feature
    # We need to save the encoder to use the same mapping at inference time
    print("\nEncoding categorical features...")
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    print(f"Merchant category mapping: {dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))}")
    
    # Prepare features and labels
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    # Train a Random Forest classifier
    print("\nTraining Random Forest model...")
    model = RandomForestClassifier(
        n_estimators=100,      # Number of trees
        max_depth=10,          # Maximum depth of each tree
        random_state=42,       # For reproducibility
        n_jobs=-1              # Use all CPU cores
    )
    model.fit(X_train, y_train)
    print("Training complete!")
    
    # Evaluate on test data
    print("\n" + "="*50)
    print("MODEL EVALUATION")
    print("="*50)
    
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"F1-score:  {f1_score(y_test, y_pred):.4f}")
    
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(f"  True Negatives:  {cm[0][0]:,} (correctly identified legitimate)")
    print(f"  False Positives: {cm[0][1]:,} (legitimate flagged as fraud)")
    print(f"  False Negatives: {cm[1][0]:,} (fraud missed - DANGEROUS!)")
    print(f"  True Positives:  {cm[1][1]:,} (correctly caught fraud)")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
    
    # Feature importance
    print("\nFeature Importance:")
    for name, importance in sorted(
        zip(feature_cols, model.feature_importances_),
        key=lambda x: x[1],
        reverse=True
    ):
        print(f"  {name}: {importance:.4f}")
    
    # Save the model and encoder together
    print("\nSaving model to models/model.pkl...")
    with open("models/model.pkl", "wb") as f:
        pickle.dump((model, encoder), f)
    
    print("\nModel trained and saved successfully!")
    print("\nWARNING: This naive approach has several problems:")
    print("  - No record of hyperparameters or metrics")
    print("  - No model versioning")
    print("  - No way to reproduce this exact model")
    print("  - We'll fix these issues in the following sections!")

if __name__ == "__main__":
    main()

Run the training script:

python src/train_naive.py

You should see output similar to:

Loading data...
Training samples: 8,000
Test samples: 2,000
Training fraud ratio: 2.00%

Encoding categorical features...
Merchant category mapping: {'grocery': 0, 'online': 1, 'restaurant': 2, 'retail': 3, 'travel': 4}

Training Random Forest model...
Training complete!

==================================================
MODEL EVALUATION
==================================================

Accuracy:  0.9820
Precision: 0.7273
Recall:    0.6154
F1-score:  0.6667

Confusion Matrix:
  True Negatives:  1,956 (correctly identified legitimate)
  False Positives: 4 (legitimate flagged as fraud)
  False Negatives: 32 (fraud missed - DANGEROUS!)
  True Positives:  8 (correctly caught fraud)

Feature Importance:
  amount: 0.5423
  hour: 0.2156
  merchant_encoded: 0.1345
  day_of_week: 0.1076

Important observation: You'll see ~98% accuracy but a lower F1-score (around 0.5-0.7). With only 2% fraud, accuracy is extremely misleading! A model that always predicts "not fraud" would achieve 98% accuracy while catching zero fraud. This is why we focus on F1-score, precision, and recall for imbalanced classification problems.

💡 If you're new to imbalanced classification, remember: high accuracy can be meaningless when the positive class is rare.
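To make the arithmetic concrete, here is a minimal sketch that scores a hypothetical "always predict legitimate" baseline on a test set shaped like ours (2,000 rows, 2% fraud). The `metrics_from_counts` helper is illustrative and not part of the tutorial's scripts.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when the model never predicts fraud
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed test set: 2,000 transactions with 2% fraud -> 40 fraud, 1,960 legitimate.
# A model that always answers "legitimate" has no true or false positives.
baseline = metrics_from_counts(tp=0, fp=0, fn=40, tn=1960)
print(baseline)
```

The baseline reaches 98% accuracy while its recall and F1 are both zero, which is why accuracy alone tells you very little here.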

The script outputs a file models/model.pkl containing both the trained model and the label encoder (we need both for inference).

Checkpoint: You should now have:

data/train.csv (~8,000 rows)
data/test.csv (~2,000 rows)
models/model.pkl (trained model + encoder)

The model should show ~98% accuracy but F1 around 0.5-0.7. Verify the files exist: ls -la data/ models/

1.2 Serve Predictions with FastAPI

Now that we have a model, let's deploy it as an API so that clients can get predictions. We'll use FastAPI because it's straightforward, very fast, and produces automatic interactive documentation.

FastAPI is known for:

Ease of use: Pythonic syntax with type hints
High performance: One of the fastest Python frameworks
Automatic documentation: Swagger UI out of the box
Data validation: Using Pydantic models

Create src/serve_naive.py:

# src/serve_naive.py
"""
Serve fraud detection model as a REST API - NAIVE VERSION.

This is a simple API that:
1. Loads the trained model at startup
2. Accepts transaction data via POST request
3. Returns fraud prediction

We'll improve this with validation, monitoring, and better
model loading in later sections.
"""
import pickle
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Load the trained model and encoder at startup
# This is loaded once when the server starts, not on every request
print("Loading model...")
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)
print("Model loaded successfully!")

# Create the FastAPI application
app = FastAPI(
    title="Fraud Detection API",
    description="""
    Predict whether a credit card transaction is fraudulent.
    
    This API accepts transaction details and returns:
    - Whether the transaction is predicted to be fraud
    - The probability of fraud (0.0 to 1.0)
    
    **Note:** This is the naive version without validation or monitoring.
    """,
    version="1.0.0"
)

# Define the input schema using Pydantic
# This provides automatic validation and documentation
class Transaction(BaseModel):
    """Schema for a transaction to be evaluated for fraud."""
    amount: float = Field(
        ..., 
        description="Transaction amount in dollars",
        example=150.00
    )
    hour: int = Field(
        ..., 
        description="Hour of the day (0-23)",
        example=14
    )
    day_of_week: int = Field(
        ..., 
        description="Day of week (0=Monday, 6=Sunday)",
        example=3
    )
    merchant_category: str = Field(
        ..., 
        description="Type of merchant",
        example="online"
    )

class PredictionResponse(BaseModel):
    """Schema for the prediction response."""
    is_fraud: bool = Field(description="Whether the transaction is predicted as fraud")
    fraud_probability: float = Field(description="Probability of fraud (0.0 to 1.0)")
    
@app.post("/predict", response_model=PredictionResponse)
def predict(transaction: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Takes transaction details and returns a fraud prediction
    along with the probability score.
    """
    # Convert the request to a dictionary
    data = transaction.dict()
    
    # Encode the merchant category using the same encoder from training
    # This ensures consistency between training and serving
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        # Handle unknown merchant categories
        # In production, we'd want better handling here
        data["merchant_encoded"] = 0
    
    # Prepare features in the same order as training
    X = [[
        data["amount"],
        data["hour"],
        data["day_of_week"],
        data["merchant_encoded"]
    ]]
    
    # Get prediction and probability
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0][1]  # Probability of class 1 (fraud)
    
    return PredictionResponse(
        is_fraud=bool(prediction),
        fraud_probability=round(float(probability), 4)
    )

@app.get("/health")
def health_check():
    """
    Health check endpoint.
    
    Returns the status of the API. Useful for:
    - Load balancer health checks
    - Kubernetes liveness probes
    - Monitoring systems
    """
    return {
        "status": "healthy",
        "model_loaded": model is not None
    }

@app.get("/")
def root():
    """Root endpoint with API information."""
    return {
        "message": "Fraud Detection API",
        "version": "1.0.0",
        "docs": "/docs",
        "health": "/health"
    }

A few important things to note about this code:

Pydantic Models: We use BaseModel to define the expected input JSON schema. FastAPI automatically validates incoming requests against this schema, but only for field presence and type. Nothing checks yet whether the values are plausible.
Type Hints: The type hints (float, int, str) provide both documentation and runtime type validation.
Feature Encoding: On each request, we encode the merchant category using the same LabelEncoder we saved from training. This ensures consistency between training and serving. Unknown categories silently fall back to code 0, which the training output maps to grocery.
Health Endpoint: The /health endpoint is standard practice for production APIs - it allows load balancers and monitoring systems to check if the service is running.

To run this API, use Uvicorn (an ASGI server):

uvicorn src.serve_naive:app --reload --host 0.0.0.0 --port 8000

The --reload flag enables auto-reload during development (the server restarts when you change code).

You should see:

Loading model...
Model loaded successfully!
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process

Now open your browser and go to http://localhost:8000/docs. You'll see the Swagger UI, auto-generated interactive documentation where you can test the API directly from your browser!

Test the API using curl in another terminal:

# Test with a legitimate-looking transaction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}'

Expected response:

{"is_fraud": false, "fraud_probability": 0.02}

# Test with a suspicious transaction (high amount, late night, online)
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 500.0, "hour": 3, "day_of_week": 1, "merchant_category": "online"}'

Expected response:

{"is_fraud": true, "fraud_probability": 0.78}

We have a working model served as an API! In a real scenario, we could now integrate this API with a payment processing frontend, mobile app, or any system that needs fraud predictions.

But before we celebrate, let's examine this naive approach for potential pitfalls...

Checkpoint: Your API should be running at http://localhost:8000. The Swagger UI at /docs should list the /predict and /health endpoints, along with the root endpoint (/). Test with curl or the Swagger UI to verify predictions are returned.

2. Where the Naive Approach Breaks

Our quick-and-dirty ML pipeline works on the surface: it can train a model and serve predictions. However, hidden problems will emerge if we try to maintain or scale this system in production.

This section is critical: understanding these issues will motivate the solutions we implement in the following sections. Let's go through the problems one by one.

Problem 1: No Experiment Tracking (Reproducibility)

Try this thought experiment: Run train_naive.py again with different hyperparameters (change n_estimators to 200, or max_depth to 15). Would you be able to exactly reproduce the previous model's results if someone asked?

Probably not. Currently, we have no record of:

Which hyperparameters we used
What metrics we achieved
What version of the data we trained on
What library versions were installed
When the training happened
Who ran the training

Three months from now, if your manager asks "How was this model trained? Can you reproduce the results?" – you'd be in trouble. You might have the code, but you don't know which version of the code, which parameters, or which data produced the model that's currently in production.

Experiment tracking is the practice of logging all these details (code versions, parameters, metrics, data versions, artifacts) so experiments can be compared and replicated. Our naive approach lacks this entirely, making our results hard to trust or build upon.
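As a rough illustration of what "logging these details" means, the sketch below assembles a run record by hand using only the standard library. It is not how MLflow stores runs; the `build_run_record` helper and its field names are assumptions made for this example, and the parameter and metric values are the ones from the naive training run.

```python
import json
import os
import platform
from datetime import datetime, timezone


def build_run_record(params: dict, metrics: dict, data_path: str) -> dict:
    """Collect the facts needed to explain how a model was produced."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when it ran
        "user": os.environ.get("USER", "unknown"),            # who ran it
        "python_version": platform.python_version(),          # environment
        "params": params,                                     # hyperparameters
        "metrics": metrics,                                   # results
        "data_path": data_path,                               # which data
    }


record = build_run_record(
    params={"n_estimators": 100, "max_depth": 10, "random_state": 42},
    metrics={"test_f1": 0.6667},
    data_path="data/train.csv",
)
print(json.dumps(record, indent=2))
```

Keeping records like this by hand gets tedious quickly, and nothing enforces that every run writes one. That gap is what a tracking tool fills.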

Problem 2: Model Versioning and Deployment Chaos

We trained one model and saved it as model.pkl. Now consider this scenario:

You train a new model with different hyperparameters
You overwrite model.pkl with the new model
You deploy it to production
Users start complaining about more false positives
You want to roll back to the previous model
Problem: The previous model was overwritten and is gone forever

There's no systematic versioning. Questions you cannot answer:

Which model version is currently in production?
What were the metrics for model v1 vs v2?
When was each model trained and by whom?
Can we instantly roll back if the new model performs worse?
What changed between versions?

Without version control for models, you're flying blind. Imagine deploying code without Git – that's what we're doing with our model.

Problem 3: No Data Validation – Garbage In, Garbage Out

Right now, our API will accept any input that has the right field types and try to make a prediction. Let's see what happens with bad data.

Create a test script src/test_bad_data.py:

# src/test_bad_data.py
"""Test what happens when we send garbage data to the API."""
import requests

BASE_URL = "http://localhost:8000"

print("Testing API with various bad inputs...\n")

# Test 1: Negative amount
print("Test 1: Negative amount")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -500.0,        # Negative amount - impossible!
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 2: Invalid hour
print("Test 2: Hour = 25 (should be 0-23)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 25,              # Invalid hour!
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 3: Invalid day of week
print("Test 3: day_of_week = 10 (should be 0-6)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 10,       # Invalid day!
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 4: Unknown merchant category
print("Test 4: Unknown merchant category")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "unknown_category"  # Not in training data!
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 5: All bad at once
print("Test 5: Everything wrong")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -1000.0,
    "hour": 99,
    "day_of_week": 15,
    "merchant_category": "totally_fake"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

print("Observation: The API happily accepts ALL garbage and returns predictions!")
print("This is dangerous - bad data leads to bad predictions with no warning.")

Run it (make sure your API is still running):

python src/test_bad_data.py

You'll see something like:

Testing API with various bad inputs...

Test 1: Negative amount
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.15}

Test 2: Hour = 25 (should be 0-23)
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.08}

...

Observation: The API happily accepts ALL garbage and returns predictions!

The API accepts garbage and returns predictions with no warning! In production, this could mean:

Incorrect predictions based on impossible data
Fraud going undetected because of malformed input
Legitimate transactions blocked based on corrupted data
No way to debug why predictions are wrong

As the saying goes: "Garbage in, garbage out." But even worse – we don't even know garbage went in!
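For a sense of what is missing, here is a minimal sketch of the range checks that would have caught every test case above. It is plain Python for illustration only, not the validation approach used later in this guide; `find_input_problems` is a hypothetical helper, and the category set is taken from the merchant mapping printed during training.

```python
# Categories seen in training (from the "Merchant category mapping" output)
VALID_MERCHANT_CATEGORIES = {"grocery", "online", "restaurant", "retail", "travel"}


def find_input_problems(tx: dict) -> list[str]:
    """Return a human-readable list of problems; an empty list means the input looks sane."""
    problems = []
    if tx["amount"] < 0:
        problems.append("amount must not be negative")
    if not 0 <= tx["hour"] <= 23:
        problems.append("hour must be between 0 and 23")
    if not 0 <= tx["day_of_week"] <= 6:
        problems.append("day_of_week must be between 0 and 6")
    if tx["merchant_category"] not in VALID_MERCHANT_CATEGORIES:
        problems.append("merchant_category was not seen in training")
    return problems


good_tx = {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
bad_tx = {"amount": -1000.0, "hour": 99, "day_of_week": 15, "merchant_category": "totally_fake"}

print(find_input_problems(good_tx))
print(find_input_problems(bad_tx))
```

The first call returns an empty list, while the "everything wrong" payload from Test 5 trips all four checks.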

Problem 4: Model Drift – Performance Decay Over Time

Here's a scenario that happens in every production ML system:

January: You train your model on historical fraud data. It achieves 98% accuracy and 0.67 F1-score. Everyone's happy.
February: The model is deployed and working well. Fraud is being caught.
March: Fraudsters adapt. They start using different patterns – smaller amounts, different merchant categories, different times of day.
April: Your model's accuracy has dropped from 98% to 85%. F1-score dropped from 0.67 to 0.35. Fraud is slipping through.
May: A major fraud incident occurs. Investigation reveals the model has been underperforming for 2 months.

The problem: Nobody noticed for 2 months because there was no monitoring.

This phenomenon is called data drift (when input data distributions change) or concept drift (when the relationship between inputs and outputs changes). Both are inevitable in real-world systems.

Without monitoring:

You don't know when performance degrades
You don't know why performance degrades
You can't take corrective action until users complain
By then, significant damage may have occurred
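One common way to put a number on data drift is the Population Stability Index (PSI), which compares how a feature is distributed in a reference window and in a recent window. The sketch below is a simplified, standard-library version written for illustration. The bin count, the 1e-6 floor, the sample data, and the 0.2 alert threshold (a widely quoted rule of thumb) are all assumptions, and it is not the monitoring tool used later in this guide.

```python
import math


def population_stability_index(reference, current, n_bins=10):
    """PSI between two non-empty samples, using equal-width bins from the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp values outside the reference range
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical transaction amounts: training-time data vs. a later window
# in which fraudsters have moved to much smaller amounts.
reference_amounts = [float(x) for x in range(100)]
recent_amounts = [x * 0.3 for x in reference_amounts]

psi = population_stability_index(reference_amounts, recent_amounts)
print(f"PSI: {psi:.2f}", "-> investigate" if psi > 0.2 else "-> stable")
```

Comparing a sample with itself gives a PSI of zero, and the shifted window scores far above the threshold. A scheduled check like this is the kind of signal that would have surfaced the March change long before May.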

Problem 5: No CI/CD or Deployment Safety

Our "deployment process" was literally:

SSH into the server (or run locally)
Run python src/train_naive.py
Copy model.pkl to the right place
Restart the API
Hope for the best

There's:

No automated testing: A typo could break everything
No staging environment: We test directly in production
No gradual rollout: 100% of traffic hits the new model immediately
No rollback capability: If something breaks, we have to manually fix it
No audit trail: Who deployed what and when?

This is how production incidents happen. A rushed deployment at 5 PM on Friday breaks the fraud detection system, and nobody notices until Monday when fraud losses have spiked.
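One building block of a safer process is an automated quality gate that runs before any model is promoted. The sketch below shows the idea in plain Python. The `deployment_blockers` helper, the metric dictionaries, and both thresholds are assumptions chosen for illustration, not values this guide prescribes.

```python
def deployment_blockers(
    candidate: dict,
    champion: dict,
    min_f1: float = 0.5,
    max_recall_drop: float = 0.05,
) -> list[str]:
    """Return the reasons a candidate model should not be deployed (empty list = OK)."""
    blockers = []
    if candidate["f1"] < min_f1:
        blockers.append(f"F1 {candidate['f1']:.2f} is below the minimum {min_f1:.2f}")
    if champion["recall"] - candidate["recall"] > max_recall_drop:
        blockers.append("recall dropped too far compared with the current model")
    return blockers


# Metrics of the current model (taken from the naive training output above)
champion = {"f1": 0.6667, "recall": 0.6154}

weak_candidate = {"f1": 0.35, "recall": 0.30}
good_candidate = {"f1": 0.70, "recall": 0.62}

print(deployment_blockers(weak_candidate, champion))
print(deployment_blockers(good_candidate, champion))
```

In a CI pipeline, a non-empty list would fail the build, so a degraded model never reaches production by accident.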

Figure 2: Problems with the Naive Approach

Summary: What We Need to Fix

Our simple ML service is missing critical infrastructure. Here's the mapping of problems to solutions:

Problem	Impact	Solution	Section
No experiment tracking	Can't reproduce or compare models	MLflow Tracking	3
No model versioning	Can't roll back or audit	MLflow Registry	3
No feature consistency	Training-serving skew	Feast Feature Store	4
No data validation	Garbage predictions	Great Expectations	5
No monitoring	Drift goes unnoticed	Evidently	6
No CI/CD	Risky deployments	GitHub Actions + Docker	7

The good news: We can fix each of these by incrementally adding components to our pipeline. Each tool addresses a specific problem, and together they form a robust ML platform.

Let's start fixing these issues, one by one.

3. Add Experiment Tracking and Model Registry with MLflow

What breaks without this: You can't reproduce yesterday's results, can't compare experiments, and can't roll back when a new model fails in production.

Our first fix addresses Problems 1 and 2: experiment reproducibility and model versioning.

MLflow is an open-source platform designed to manage the ML lifecycle. We'll use two of its key components:

MLflow Tracking: Log experiments (parameters, metrics, artifacts) so you can compare runs and reproduce results
MLflow Model Registry: Version your models with aliases (champion, challenger) and manage the deployment lifecycle

Why This Matters: Without tracking, ML is guesswork. With MLflow, every run is logged with parameters, metrics, and artifacts. You can compare runs side-by-side, understand what actually improved your model, and reproduce any past experiment. The Model Registry adds governance – you know exactly which model is in production and can roll back in seconds.

3.1 How to Set Up the MLflow Tracking Server

MLflow can log experiments to a local directory by default, but to use the full UI and model registry, it's best to run the MLflow tracking server.

Open a new terminal (keep it separate from your API terminal) and run:

# Create a directory for MLflow data
mkdir -p mlruns

# Start the MLflow server
mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns

Let's break down these parameters:

--host 0.0.0.0: Listen on all network interfaces
--port 5000: Run on port 5000
--backend-store-uri sqlite:///mlflow.db: Store experiment metadata in a SQLite database (for production, you'd use PostgreSQL or MySQL)
--default-artifact-root ./mlruns: Store model artifacts (files) in the mlruns directory

You should see:

[INFO] Starting gunicorn 21.2.0
[INFO] Listening at: http://0.0.0.0:5000

Now open your browser and navigate to http://localhost:5000. You'll see the MLflow UI – it should be empty initially since we haven't logged any experiments yet.

3.2 How to Log Experiments in Code

Now let's modify our training script to log everything to MLflow. Create src/train_mlflow.py:

# src/train_mlflow.py
"""
Train fraud detection model with MLflow experiment tracking.

This script demonstrates proper ML experiment tracking:
- Log all hyperparameters
- Log all metrics (train and test)
- Log the trained model as an artifact
- Register the model in the Model Registry

Compare this to train_naive.py to see the difference!
"""
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    roc_auc_score
)
import pickle
from datetime import datetime

# Configure MLflow to use our tracking server
mlflow.set_tracking_uri("http://localhost:5000")

# Create or get the experiment
# All runs will be grouped under this experiment name
mlflow.set_experiment("fraud-detection")

def load_and_preprocess_data():
    """Load and preprocess the training and test data."""
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    # Encode categorical feature
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    # Prepare features
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    return X_train, y_train, X_test, y_test, encoder

def train_and_log_model(
    n_estimators: int = 100,
    max_depth: int = 10,
    min_samples_split: int = 2,
    min_samples_leaf: int = 1
):
    """
    Train a model and log everything to MLflow.
    
    Args:
        n_estimators: Number of trees in the forest
        max_depth: Maximum depth of each tree
        min_samples_split: Minimum samples required to split a node
        min_samples_leaf: Minimum samples required at a leaf node
    """
    X_train, y_train, X_test, y_test, encoder = load_and_preprocess_data()
    
    # Start an MLflow run - everything logged will be associated with this run
    with mlflow.start_run():
        # Add a descriptive run name
        run_name = f"rf_est{n_estimators}_depth{max_depth}_{datetime.now().strftime('%H%M%S')}"
        mlflow.set_tag("mlflow.runName", run_name)
        
        # Log all hyperparameters
        # These are the "knobs" we can tune
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_samples_split", min_samples_split)
        mlflow.log_param("min_samples_leaf", min_samples_leaf)
        mlflow.log_param("model_type", "RandomForestClassifier")
        
        # Log data information
        mlflow.log_param("train_samples", len(X_train))
        mlflow.log_param("test_samples", len(X_test))
        mlflow.log_param("fraud_ratio", float(y_train.mean()))
        mlflow.log_param("n_features", X_train.shape[1])
        
        # Train the model
        print(f"\nTraining model: n_estimators={n_estimators}, max_depth={max_depth}")
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X_train, y_train)
        
        # Evaluate and log metrics for BOTH train and test sets
        # This helps detect overfitting
        for dataset_name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
            y_pred = model.predict(X)
            y_prob = model.predict_proba(X)[:, 1]
            
            # Calculate all metrics
            accuracy = accuracy_score(y, y_pred)
            precision = precision_score(y, y_pred, zero_division=0)
            recall = recall_score(y, y_pred, zero_division=0)
            f1 = f1_score(y, y_pred, zero_division=0)
            roc_auc = roc_auc_score(y, y_prob)
            
            # Log metrics with dataset prefix
            mlflow.log_metric(f"{dataset_name}_accuracy", accuracy)
            mlflow.log_metric(f"{dataset_name}_precision", precision)
            mlflow.log_metric(f"{dataset_name}_recall", recall)
            mlflow.log_metric(f"{dataset_name}_f1", f1)
            mlflow.log_metric(f"{dataset_name}_roc_auc", roc_auc)
            
            print(f"  {dataset_name.upper()} - Accuracy: {accuracy:.4f}, F1: {f1:.4f}, ROC-AUC: {roc_auc:.4f}")
        
        # Log feature importance
        for feature, importance in zip(
            ["amount", "hour", "day_of_week", "merchant_encoded"],
            model.feature_importances_
        ):
            mlflow.log_metric(f"importance_{feature}", importance)
        
        # Log the model to MLflow AND register it in the Model Registry
        # This creates a new version of the model automatically
        print("\nRegistering model in MLflow Model Registry...")
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name="fraud-detection-model",
            input_example=X_train.iloc[:5]  # Example input for documentation
        )
        
        # Save and log the encoder as a separate artifact
        # We need this for inference
        with open("encoder.pkl", "wb") as f:
            pickle.dump(encoder, f)
        mlflow.log_artifact("encoder.pkl")
        
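        # Get the run and experiment IDs for reference
        # (the experiment ID is read from the run instead of being hard-coded)
        run = mlflow.active_run()
        run_id = run.info.run_id
        print(f"\nMLflow Run ID: {run_id}")
        print(f"View this run: http://localhost:5000/#/experiments/{run.info.experiment_id}/runs/{run_id}")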
        
        return model, encoder

def run_experiment_sweep():
    """
    Run multiple experiments with different hyperparameters.
    
    This demonstrates how MLflow helps compare different configurations.
    """
    print("="*60)
    print("RUNNING HYPERPARAMETER EXPERIMENT SWEEP")
    print("="*60)
    
    # Define different configurations to try
    experiments = [
        {"n_estimators": 50, "max_depth": 5},
        {"n_estimators": 100, "max_depth": 10},
        {"n_estimators": 100, "max_depth": 15},
        {"n_estimators": 200, "max_depth": 10},
        {"n_estimators": 200, "max_depth": 20},
    ]
    
    for i, params in enumerate(experiments, 1):
        print(f"\n--- Experiment {i}/{len(experiments)} ---")
        train_and_log_model(**params)
    
    print("\n" + "="*60)
    print("EXPERIMENT SWEEP COMPLETE!")
    print("="*60)
    print("\nView all experiments at: http://localhost:5000")
    print("Compare runs to find the best hyperparameters!")

if __name__ == "__main__":
    run_experiment_sweep()

This script:

Connects to MLflow: mlflow.set_tracking_uri("http://localhost:5000")
Creates an experiment: mlflow.set_experiment("fraud-detection")
Logs parameters: All hyperparameters and data info
Logs metrics: Accuracy, precision, recall, F1, ROC-AUC for both train and test sets
Logs the model: Saves the trained model as an artifact
Registers the model: Adds it to the Model Registry with automatic versioning

Run the experiment sweep:

python src/train_mlflow.py

You'll see output for each experiment:

============================================================
RUNNING HYPERPARAMETER EXPERIMENT SWEEP
============================================================

--- Experiment 1/5 ---
Loading data...
Training model: n_estimators=50, max_depth=5
  TRAIN - Accuracy: 0.9821, F1: 0.6545, ROC-AUC: 0.9234
  TEST - Accuracy: 0.9795, F1: 0.5714, ROC-AUC: 0.8956

Registering model in MLflow Model Registry...
MLflow Run ID: abc123...

--- Experiment 5/5 ---
Training model: n_estimators=200, max_depth=20
  TRAIN - Accuracy: 0.9856, F1: 0.7123, ROC-AUC: 0.9567
  TEST - Accuracy: 0.9810, F1: 0.6667, ROC-AUC: 0.9234

============================================================
EXPERIMENT SWEEP COMPLETE!
============================================================

All 5 runs are now logged to MLflow with full metrics comparison available in the UI.

Now refresh the MLflow UI at http://localhost:5000. You'll see:

Experiments tab: Shows the "fraud-detection" experiment with 5 runs
Each run: Shows parameters, metrics, and artifacts
Compare: You can select multiple runs and compare them side-by-side
Models tab: Shows "fraud-detection-model" with 5 versions

MLflow Tracking UI: Compare runs, metrics, and models at a glance

3.3 How to Use the Model Registry

The Model Registry provides a central hub for managing model versions and their lifecycle.

In the MLflow UI:

Click the "Models" tab in the top navigation
Click "fraud-detection-model"
You'll see all 5 versions listed with their metrics

Model Aliases: MLflow now uses aliases instead of stages. If you've seen older tutorials using "Staging" and "Production" stages, aliases are the newer, more flexible approach.

@champion: The production model serving live traffic
@challenger: Candidate model being tested
You can create custom aliases like @baseline, @latest and so on.

Assign an alias:

Open MLflow UI → Models → fraud-detection-model
Click on the version you want to promote
Click "Add Alias"
Enter champion and save

Now you've assigned the @champion alias to your best model. Your API will load whichever version has this alias at startup, so a rollback amounts to moving the alias to a different version and restarting the API.
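If you'd rather script the alias assignment than click through the UI, the MLflow client exposes the same operation. The sketch below assumes a recent MLflow 2.x release with alias support and the tracking server from section 3.1 running. The `promote_to_champion` wrapper is a hypothetical helper, and the version number you pass is whichever one you picked after comparing runs.

```python
def promote_to_champion(
    version: str,
    model_name: str = "fraud-detection-model",
    tracking_uri: str = "http://localhost:5000",
) -> str:
    """Point the @champion alias at the given model version and return the aliased version."""
    # Imported inside the function so this module can be loaded without a running server
    from mlflow import MlflowClient

    client = MlflowClient(tracking_uri=tracking_uri)
    client.set_registered_model_alias(model_name, "champion", version)

    # Read the alias back to confirm which version now holds it
    return client.get_model_version_by_alias(model_name, "champion").version


# Example (requires the tracking server and an existing version "2"):
# promote_to_champion("2")
```

Rolling back is the same call with the previous version number.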

Figure 3: MLflow Model Lifecycle — From Training to Production

3.4 Update API to Load from Registry

Now let's update our API to load the champion model from the MLflow Registry instead of a pickle file. Create src/serve_mlflow.py:

# src/serve_mlflow.py
"""
Serve fraud detection model from MLflow Model Registry.

This version loads the @champion model from MLflow, which means:
- Serves whichever version holds the @champion alias at startup
- Can roll back by changing the @champion alias
- No manual file copying needed
"""
import mlflow
import mlflow.sklearn
import pickle
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")

print("Loading model from MLflow Model Registry...")

# Load the champion model from the registry
# This automatically gets whichever version has the @champion alias
try:
    model = mlflow.sklearn.load_model("models:/fraud-detection-model@champion")
    print("Successfully loaded champion model from MLflow!")
except Exception as e:
    print(f"Error loading from MLflow: {e}")
    print("Make sure you've assigned the @champion alias to a model in the MLflow UI")
    raise

# Load the encoder from the local encoder.pkl written by train_mlflow.py
# The same file is logged as a run artifact; a real system would fetch it from MLflow
with open("encoder.pkl", "rb") as f:
    encoder = pickle.load(f)
print("Encoder loaded successfully!")

app = FastAPI(
    title="Fraud Detection API (MLflow)",
    description="""
    Fraud detection API that loads models from MLflow Model Registry.
    
    This version always serves the model with the @champion alias.
    To update the model:
    1. Train a new model with train_mlflow.py
    2. Compare metrics in MLflow UI
    3. Assign the @champion alias to the best model
    4. Restart this API
    
    To roll back: Move the @champion alias to a previous version in MLflow UI, then restart this API.
    """,
    version="2.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount in dollars", example=150.00)
    hour: int = Field(..., description="Hour of the day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Monday, 6=Sunday)", example=3)
    merchant_category: str = Field(..., description="Type of merchant", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    model_source: str = "MLflow Production"

@app.post("/predict", response_model=PredictionResponse)
def predict(tx: Transaction):
    """Predict whether a transaction is fraudulent using the champion model."""
    data = tx.dict()
    
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        data["merchant_encoded"] = 0
    
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        model_source="MLflow Production"
    )

@app.get("/health")
def health():
    return {"status": "healthy", "model_source": "MLflow Registry"}

@app.get("/model-info")
def model_info():
    """Get information about the currently loaded model."""
    return {
        "registry": "MLflow",
        "model_name": "fraud-detection-model",
        "alias": "champion",
        "tracking_uri": "http://localhost:5000"
    }

Stop your old API (Ctrl+C) and start this new one:

uvicorn src.serve_mlflow:app --reload --host 0.0.0.0 --port 8000

Now deploying a new model is a controlled, auditable process:

Train new model → Automatically registered as a new version
Compare metrics → Use the MLflow UI to compare it with the current @champion version
Set as champion → Assign the @champion alias in the MLflow UI
Restart API → Loads the new @champion model
Roll back if needed → Move the @champion alias to a previous version

Checkpoint:

MLflow UI (http://localhost:5000) should show the "fraud-detection" experiment with 5 runs
The "Models" tab should show "fraud-detection-model" with 5 versions
One version should have the @champion alias
The API should load and serve the @champion model

4. Ensure Feature Consistency with Feast

⚠️ First time hearing about feature stores? Don't worry.
You don't need to master every Feast detail on the first read.
Focus on why feature consistency matters — you can revisit the implementation later.
Key takeaway: Training and serving must compute features the same way, or your model silently fails.

What breaks without this: Your model sees different feature values in production than it saw during training. Accuracy drops silently. This is called "training-serving skew" and it's one of the most common causes of ML system failures.

Training-serving skew arises whenever the data transformations applied at training time differ from those applied at inference time. Even small discrepancies can severely degrade performance.

Why This Matters: Imagine you're computing "average transaction amount per merchant category" as a feature. During training, you compute it using pandas in a notebook. During serving, you compute it using SQL in a different system. Small differences in how these computations handle edge cases (nulls, rounding, time windows) cause the model to see different features in production than it was trained on.

The result? Silent failures where accuracy drops but nothing errors out. Your model is making predictions based on features it's never seen before, and you have no idea.
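
To make the failure mode concrete, here is a minimal sketch of skew caused by null handling alone. The amounts and the two helper functions are made up for illustration: they stand in for a notebook-style computation that skips missing values and a separately written serving-side computation that treats them as zero.

```python
from statistics import mean

# Hypothetical transaction amounts for one merchant category.
# None marks a missing value.
amounts = [20.0, None, 40.0, 60.0]


def training_avg(values):
    """Training-side logic: skip missing values (pandas' default for .mean())."""
    return mean([v for v in values if v is not None])


def serving_avg(values):
    """Serving-side logic written separately: treat missing values as 0."""
    return mean([0.0 if v is None else v for v in values])


train_feature = training_avg(amounts)
serve_feature = serving_avg(amounts)

print(f"training sees avg_amount={train_feature:.2f}")
print(f"serving sees  avg_amount={serve_feature:.2f}")
print(f"skew: {train_feature - serve_feature:.2f}")
```

Neither function raises an error, yet the model would be fed a different value for the same feature depending on which path produced it.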

In our naive implementation, we did handle one simple case: we saved the LabelEncoder to ensure merchant_category is encoded the same way in training and serving. But imagine if we had more complex feature engineering:

Rolling averages over time windows
User-level aggregations
Cross-feature interactions
Real-time features from streaming data

Keeping all of these consistent by hand quickly becomes impractical.

4.1 What is Feast and Why Use It?

In production ML platforms, teams use a feature store to guarantee feature consistency between training and serving. Feast is one popular open-source option.

In this tutorial, we use Feast not because you must, but because it makes the training-serving contract explicit and teachable. The principles apply whether you use Feast, Tecton, Featureform, or a custom solution.

Feast provides:

Capability	Description
Single source of truth	Define features once, use everywhere
Offline/online consistency	Same features for training and serving
Point-in-time correctness	Prevents data leakage in training
Low-latency serving	Millisecond feature retrieval
Feature versioning	Track changes to feature definitions

How Feast works:

Define features in Python code (feature definitions)
Materialize features from your data sources to the online store
Retrieve features using the same API for both training (offline) and serving (online)

This ensures that training and serving read feature values from the same definitions and the same source data, instead of recomputing them in two separately maintained code paths.
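
Point-in-time correctness is probably the least intuitive capability in the table above, so here is a rough sketch of the idea in plain Python. It is not Feast's implementation: the `history` list and the `feature_as_of` helper are hypothetical, and TTL handling is left out.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one merchant category,
# sorted by time: (timestamp the value became known, fraud_rate).
history = [
    (datetime(2026, 1, 1), 0.010),
    (datetime(2026, 2, 1), 0.020),
    (datetime(2026, 3, 1), 0.050),
]


def feature_as_of(history, event_time):
    """Return the latest value known at event_time, or None if nothing was known yet."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)  # number of entries with ts <= event_time
    return history[idx - 1][1] if idx > 0 else None


# A transaction from mid-February must only see the February value,
# even though a newer March value exists in the data today.
print(feature_as_of(history, datetime(2026, 2, 15)))
```

Joining every training row against the most recent value instead would leak the March fraud rate into a February example, which is the data leakage the table refers to.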

4.2 Install and Initialize Feast

We already installed Feast via requirements.txt. Now let's initialize a feature repository.

# Navigate to the feature_repo directory
cd feature_repo

# Initialize Feast (this creates template files)
feast init . --minimal

# Go back to project root
cd ..

This creates the basic Feast structure:

feature_repo/
├── feature_store.yaml    # Feast configuration
└── __init__.py

4.3 Define Feature Definitions

First, let's create the Feast configuration file:

# feature_repo/feature_store.yaml
project: fraud_detection
registry: ../data/registry.db
provider: local
online_store:
  type: sqlite
  path: ../data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 3

This configuration:

Names our project "fraud_detection"
Uses SQLite for the online store (for production, you'd use Redis or DynamoDB)
Uses local files for the offline store (for production, you'd use BigQuery or Snowflake)

Now create the feature definitions:

# feature_repo/features.py
"""
Feast feature definitions for fraud detection.

This file defines:
- Entities: The keys we use to look up features (merchant_category)
- Data Sources: Where the raw feature data comes from (Parquet file)
- Feature Views: The features themselves and their schemas

The key insight: These definitions are the SINGLE SOURCE OF TRUTH.
Both training and serving use these exact definitions.
"""
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# =============================================================================
# ENTITIES
# =============================================================================
# An entity is the "key" we use to look up features.
# For merchant-level features, the entity is merchant_category.

merchant = Entity(
    name="merchant_category",
    description="Merchant category for the transaction (for example, 'online', 'grocery')",
    value_type=ValueType.STRING,
)

# =============================================================================
# DATA SOURCES
# =============================================================================
# Data sources tell Feast where to find the raw feature data.
# For local development, we use a Parquet file.
# For production, this could be BigQuery, Snowflake, S3, etc.

merchant_stats_source = FileSource(
    name="merchant_stats_source",
    path="../data/merchant_features.parquet",  # We'll create this file
    timestamp_field="event_timestamp",       # Required for point-in-time joins
)

# =============================================================================
# FEATURE VIEWS
# =============================================================================
# A Feature View defines a group of related features.
# It specifies:
# - Which entity the features are for
# - The schema (names and types of features)
# - Where the data comes from
# - How long features are valid (TTL)

merchant_stats_fv = FeatureView(
    name="merchant_stats",
    description="Aggregated statistics per merchant category",
    entities=[merchant],
    ttl=timedelta(days=7),  # Features are valid for 7 days
    schema=[
        Field(name="avg_amount", dtype=Float32, description="Average transaction amount"),
        Field(name="transaction_count", dtype=Int64, description="Number of transactions"),
        Field(name="fraud_rate", dtype=Float32, description="Historical fraud rate"),
    ],
    source=merchant_stats_source,
    online=True,  # Enable online serving (low-latency retrieval)
)
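
As a rough mental model for the `ttl` setting, assuming Feast's documented behavior for historical retrieval: a feature row is only joined to a lookup if its timestamp is no later than the lookup time and no older than the TTL. The `is_within_ttl` helper below is hypothetical and only illustrates that window. It is not part of Feast.

```python
from datetime import datetime, timedelta

TTL = timedelta(days=7)  # same value as in merchant_stats_fv


def is_within_ttl(feature_ts: datetime, lookup_ts: datetime, ttl: timedelta = TTL) -> bool:
    """True if the feature row is not from the future and not older than ttl."""
    age = lookup_ts - feature_ts
    return timedelta(0) <= age <= ttl


feature_written = datetime(2026, 1, 1)
print(is_within_ttl(feature_written, datetime(2026, 1, 5)))   # 4 days old
print(is_within_ttl(feature_written, datetime(2026, 1, 10)))  # 9 days old
```

With a 7-day TTL, features that are not refreshed at least weekly would start coming back empty, so the TTL and the refresh schedule need to be chosen together.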

4.4 Materialize Features to Online Store

Now we need to:

Compute the features from our training data
Save them in a format Feast can read
Apply the Feast definitions
Materialize features to the online store

Create src/prepare_feast_features.py:

# src/prepare_feast_features.py
"""
Prepare feature data for Feast.

This script:
1. Computes aggregated merchant features from training data
2. Saves them in Parquet format (Feast's offline store format)
3. Applies Feast feature definitions
4. Materializes features to the online store for low-latency serving

Run this whenever your training data changes or you want to refresh features.
"""
import pandas as pd
import numpy as np
from datetime import datetime
import subprocess
import os

def compute_merchant_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute aggregated features by merchant category.
    
    THIS IS THE SINGLE SOURCE OF TRUTH FOR FEATURE COMPUTATION.
    
    Both training and serving will use features computed by this exact logic.
    Any change here automatically applies everywhere.
    
    Args:
        df: Transaction DataFrame with columns: amount, merchant_category, is_fraud
        
    Returns:
        DataFrame with computed features per merchant category
    """
    print("Computing merchant-level features...")
    
    # Group by merchant category and compute aggregates
    stats = df.groupby('merchant_category').agg({
        'amount': ['mean', 'count'],
        'is_fraud': 'mean'
    }).reset_index()
    
    # Flatten column names
    stats.columns = ['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']
    
    # Add timestamp for Feast (required for point-in-time correct joins)
    stats['event_timestamp'] = datetime.now()
    
    # Convert types to match Feast schema
    stats['avg_amount'] = stats['avg_amount'].astype('float32')
    stats['transaction_count'] = stats['transaction_count'].astype('int64')
    stats['fraud_rate'] = stats['fraud_rate'].astype('float32')
    
    return stats

def main():
    print("="*60)
    print("FEAST FEATURE PREPARATION")
    print("="*60)
    
    # Load training data
    print("\n1. Loading training data...")
    train_df = pd.read_csv('data/train.csv')
    print(f"   Loaded {len(train_df):,} transactions")
    
    # Compute merchant features
    print("\n2. Computing merchant features...")
    merchant_features = compute_merchant_features(train_df)
    
    print("\n   Computed features:")
    print(merchant_features.to_string(index=False))
    
    # Save as Parquet (required format for Feast file source)
    print("\n3. Saving features to Parquet...")
    os.makedirs('data', exist_ok=True)
    output_path = 'data/merchant_features.parquet'
    merchant_features.to_parquet(output_path, index=False)
    print(f"   Saved to {output_path}")
    
    # Apply Feast feature definitions
    print("\n4. Applying Feast feature definitions...")
    try:
        result = subprocess.run(
            ['feast', 'apply'],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Feature definitions applied successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error applying Feast: {e.stderr}")
        raise
    
    # Materialize features to online store
    print("\n5. Materializing features to online store...")
    try:
        result = subprocess.run(
            ['feast', 'materialize-incremental', datetime.now().isoformat()],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Features materialized successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error materializing: {e.stderr}")
        raise
    
    print("\n" + "="*60)
    print("FEAST FEATURE PREPARATION COMPLETE!")
    print("="*60)
    print("\nYou can now:")
    print("  - Retrieve features for training: get_training_features()")
    print("  - Retrieve features for serving: get_online_features()")
    print("  - View feature stats: feast feature-views list")

if __name__ == "__main__":
    main()

Run the feature preparation:

python src/prepare_feast_features.py

You should see output along these lines (condensed here):

============================================================
FEAST FEATURE PREPARATION
============================================================

1. Loading training data... 8,000 transactions
2. Computing merchant features...
   grocery: avg=$31.24, fraud_rate=0.85%
   online: avg=$98.45, fraud_rate=4.87%
   restaurant: avg=$28.12, fraud_rate=0.50%
   retail: avg=$45.67, fraud_rate=1.02%
   travel: avg=$156.23, fraud_rate=4.18%
3. Saving to data/merchant_features.parquet ✓
4. Applying Feast definitions... ✓
5. Materializing to online store... ✓

FEAST FEATURE PREPARATION COMPLETE!
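
One detail worth double-checking in the script above is its use of `datetime.now()`, which returns a naive local time. Feast's documentation passes a UTC timestamp to `materialize-incremental`, so on a machine that is not set to UTC the local time could be read as a different instant. A small sketch of producing an explicit UTC timestamp follows. The `utc_end_timestamp` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utc_end_timestamp(now=None):
    """Format a moment in time as a UTC string like 2026-05-18T07:01:50."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


# Example: 09:01:50 in a UTC+2 zone is 07:01:50 UTC.
local = datetime(2026, 5, 18, 9, 1, 50, tzinfo=timezone(timedelta(hours=2)))
print(utc_end_timestamp(local))
```

The returned string could replace `datetime.now().isoformat()` in the `feast materialize-incremental` call, and `datetime.now(timezone.utc)` could do the same for the `event_timestamp` column.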

4.5 Retrieve Features for Training and Serving

Now let's create utilities to retrieve features consistently for both training and serving:

# src/feast_features.py
"""
Feast feature retrieval for training and serving.

This module provides functions to retrieve features from Feast:
- get_training_features(): For offline training (historical features)
- get_online_features(): For real-time serving (low-latency)

IMPORTANT: Both functions use the SAME feature definitions,
ensuring consistency between training and serving.
"""
import pandas as pd
from feast import FeatureStore
from datetime import datetime

# Initialize Feast store (points to our feature_repo)
store = FeatureStore(repo_path="feature_repo")

def get_training_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Get features for training using Feast's offline store.
    
    Uses point-in-time correct joins to prevent data leakage.
    This means features are looked up as of the time each transaction occurred,
    not as of "now" - preventing you from accidentally using future data.
    
    Args:
        df: DataFrame with at least 'merchant_category' column
        
    Returns:
        DataFrame with original columns plus Feast features
    """
    print("Retrieving training features from Feast offline store...")
    
    # Prepare entity dataframe with timestamps
    # Each row needs: entity key(s) + event_timestamp
    entity_df = df[['merchant_category']].copy()
    entity_df['event_timestamp'] = datetime.now()  # See note below
    entity_df = entity_df.drop_duplicates()
    
    # ⚠️ Simplification: For clarity, we use the current timestamp here.
    # In real systems, this would be the actual event time of each transaction.
    
    # Retrieve historical features
    # Feast handles the point-in-time join automatically
    training_data = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
    ).to_df()
    
    # Merge features back with original dataframe
    result = df.merge(
        training_data[['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']],
        on='merchant_category',
        how='left'
    )
    
    print(f"Retrieved features for {len(entity_df)} unique merchants")
    return result

def get_online_features(merchant_category: str) -> dict:
    """
    Get features for real-time serving using Feast's online store.
    
    This is optimized for low-latency retrieval (milliseconds).
    Use this in your prediction API for real-time inference.
    
    Args:
        merchant_category: The merchant category to look up
        
    Returns:
        Dictionary with feature names and values
    """
    # Retrieve from online store (low-latency)
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": merchant_category}],
    ).to_dict()
    
    # Format the response
    return {
        'merchant_avg_amount': feature_vector['avg_amount'][0],
        'merchant_tx_count': feature_vector['transaction_count'][0],
        'merchant_fraud_rate': feature_vector['fraud_rate'][0],
    }

def get_online_features_batch(merchant_categories: list) -> pd.DataFrame:
    """
    Get features for multiple merchants at once (batch serving).
    
    More efficient than calling get_online_features() in a loop.
    
    Args:
        merchant_categories: List of merchant categories to look up
        
    Returns:
        DataFrame with features for each merchant
    """
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": mc} for mc in merchant_categories],
    ).to_df()
    
    return feature_vector

if __name__ == "__main__":
    # Test the feature retrieval functions
    print("="*60)
    print("TESTING FEAST FEATURE RETRIEVAL")
    print("="*60)
    
    # Test offline retrieval (for training)
    print("\n1. Testing OFFLINE feature retrieval (for training)...")
    train_df = pd.read_csv('data/train.csv').head(10)
    enriched = get_training_features(train_df)
    print("\n   Sample enriched training data:")
    print(enriched[['amount', 'merchant_category', 'avg_amount', 'fraud_rate']].head())
    
    # Test online retrieval (for serving)
    print("\n2. Testing ONLINE feature retrieval (for serving)...")
    for category in ['online', 'grocery', 'travel', 'restaurant', 'retail']:
        features = get_online_features(category)
        print(f"   {category}: avg_amount=${features['merchant_avg_amount']:.2f}, "
              f"fraud_rate={features['merchant_fraud_rate']:.2%}")
    
    # Test batch retrieval
    print("\n3. Testing BATCH online retrieval...")
    batch_features = get_online_features_batch(['online', 'grocery', 'travel'])
    print(batch_features)
    
    print("\n" + "="*60)
    print("FEAST FEATURE RETRIEVAL TEST COMPLETE!")
    print("="*60)

Test the feature retrieval:

python src/feast_features.py

You should see:

============================================================
TESTING FEAST FEATURE RETRIEVAL
============================================================

1. Testing OFFLINE feature retrieval (for training)...
Retrieving training features from Feast offline store...
Retrieved features for 5 unique merchants

   Sample enriched training data:
   amount merchant_category  avg_amount  fraud_rate
    45.23           grocery       31.24      0.0085
   123.45            online       98.45      0.0487
    ...

2. Testing ONLINE feature retrieval (for serving)...
   online: avg_amount=$98.45, fraud_rate=4.87%
   grocery: avg_amount=$31.24, fraud_rate=0.85%
   travel: avg_amount=$156.23, fraud_rate=4.18%
   restaurant: avg_amount=$28.12, fraud_rate=0.50%
   retail: avg_amount=$45.67, fraud_rate=1.02%

3. Testing BATCH online retrieval...
  merchant_category  avg_amount  transaction_count  fraud_rate
               online       98.45               1234      0.0487
              grocery       31.24               2345      0.0085
               travel      156.23                478      0.0418

Why Feast Over Custom Code?

Aspect	Custom Code	Feast
Consistency	Manual effort to keep in sync	Automatic - same definitions everywhere
Point-in-time correctness	Must implement yourself	Built-in
Online serving	Must build your own cache	Built-in online store
Feature versioning	Not supported	Built-in
Scalability	Limited	Production-ready (BigQuery, Redis, etc.)
Team collaboration	Difficult	Feature registry with documentation
Monitoring	Manual	Built-in feature statistics

💡 Mental Model: Treat feature definitions like database schemas.
You wouldn't compute a column one way in your application and a different way in your reports. Features deserve the same discipline — define once, use everywhere.

Checkpoint: After running prepare_feast_features.py, you should have:

data/merchant_features.parquet (computed features)
data/registry.db (Feast registry)
data/online_store.db (SQLite online store)

Running python src/feast_features.py should successfully retrieve features for all merchant categories.

5. Add Data Validation with Great Expectations

What breaks without this: Your API accepts garbage input (negative amounts, invalid hours) and returns meaningless predictions. Worse, you have no idea it happened.

Recall that our API currently trusts input blindly. We saw how garbage data produces a prediction with no warning. Great Expectations is an open-source tool for data quality testing – defining rules (expectations) and testing data against them.

Why This Matters: Data validation acts as a gatekeeper. Bad data is rejected before it can harm predictions. As the saying goes, "Garbage in, garbage out" – feeding unreliable data yields unreliable results. With validation, we transform this to "Garbage in, error out" – much better for debugging and reliability.

5.1 Define Expectations

What are reasonable expectations for our transaction data? Based on domain knowledge:

Field	Expectation	Reason
`amount`	Positive (> 0)	Negative transactions don't make sense
`amount`	Below $50,000	Extremely large amounts are outliers/errors
`hour`	0-23 inclusive	Valid hours in a day
`day_of_week`	0-6 inclusive	Valid days (Mon=0, Sun=6)
`merchant_category`	One of known categories	Must match training data
All fields	Not null	Required for prediction

Create src/data_validation.py:

# src/data_validation.py
"""
Data validation for fraud detection.

This module provides functions to validate input data BEFORE making predictions.
Invalid data is rejected with clear error messages.

The key insight: It's better to reject bad input than to make garbage predictions.
"""
import pandas as pd
from typing import Any, Dict

# Define the valid merchant categories (must match training data!)
VALID_CATEGORIES = ["grocery", "restaurant", "retail", "online", "travel"]

def validate_transaction(data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Validate a single transaction for fraud prediction.
    
    Checks all business rules and data quality requirements.
    Returns a dictionary with 'valid' (bool) and 'errors' (list).
    
    Args:
        data: Dictionary with transaction fields
        
    Returns:
        {"valid": bool, "errors": list of error messages}
        
    Example:
        >>> validate_transaction({"amount": -100, "hour": 25, ...})
        {"valid": False, "errors": ["amount must be positive", "hour must be 0-23"]}
    """
    errors = []
    
    # ==========================================================================
    # Amount Validation
    # ==========================================================================
    amount = data.get("amount")
    if amount is None:
        errors.append("amount is required")
    elif not isinstance(amount, (int, float)):
        errors.append(f"amount must be a number (got {type(amount).__name__})")
    elif amount <= 0:
        errors.append("amount must be positive")
    elif amount > 50000:
        errors.append(f"amount exceeds maximum allowed value of $50,000 (got ${amount:,.2f})")
    
    # ==========================================================================
    # Hour Validation
    # ==========================================================================
    hour = data.get("hour")
    if hour is None:
        errors.append("hour is required")
    elif not isinstance(hour, int):
        errors.append(f"hour must be an integer (got {type(hour).__name__})")
    elif not (0 <= hour <= 23):
        errors.append(f"hour must be between 0 and 23 (got {hour})")
    
    # ==========================================================================
    # Day of Week Validation
    # ==========================================================================
    day = data.get("day_of_week")
    if day is None:
        errors.append("day_of_week is required")
    elif not isinstance(day, int):
        errors.append(f"day_of_week must be an integer (got {type(day).__name__})")
    elif not (0 <= day <= 6):
        errors.append(f"day_of_week must be between 0 (Monday) and 6 (Sunday) (got {day})")
    
    # ==========================================================================
    # Merchant Category Validation
    # ==========================================================================
    category = data.get("merchant_category")
    if category is None:
        errors.append("merchant_category is required")
    elif not isinstance(category, str):
        errors.append(f"merchant_category must be a string (got {type(category).__name__})")
    elif category not in VALID_CATEGORIES:
        errors.append(
            f"merchant_category must be one of {VALID_CATEGORIES} (got '{category}')"
        )
    
    return {
        "valid": len(errors) == 0,
        "errors": errors
    }

def validate_batch(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Validate a batch of transactions using Great Expectations.
    
    This is useful for validating training data or batch prediction requests.
    Uses Great Expectations for more sophisticated validation.
    
    Args:
        df: DataFrame with transaction data
        
    Returns:
        Dictionary with validation results
    """
    import great_expectations as gx
    
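    # Convert to a Great Expectations dataset.
    # Note: from_pandas() belongs to the legacy Dataset API. Newer Great
    # Expectations releases replaced it, so this call may need an older version.
    ge_df = gx.from_pandas(df)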
    
    results = []
    
    # Amount expectations
    r = ge_df.expect_column_values_to_be_between(
        'amount', min_value=0.01, max_value=50000, mostly=0.99
    )
    results.append(('amount_range', r.success, r.result))
    
    # Hour expectations
    r = ge_df.expect_column_values_to_be_between(
        'hour', min_value=0, max_value=23
    )
    results.append(('hour_range', r.success, r.result))
    
    # Day of week expectations
    r = ge_df.expect_column_values_to_be_between(
        'day_of_week', min_value=0, max_value=6
    )
    results.append(('day_range', r.success, r.result))
    
    # Merchant category expectations
    r = ge_df.expect_column_values_to_be_in_set(
        'merchant_category', VALID_CATEGORIES
    )
    results.append(('category_valid', r.success, r.result))
    
    # No nulls in critical fields
    for col in ['amount', 'hour', 'day_of_week', 'merchant_category']:
        r = ge_df.expect_column_values_to_not_be_null(col)
        results.append((f'{col}_not_null', r.success, r.result))
    
    # Summarize results
    passed = sum(1 for _, success, _ in results if success)
    total = len(results)
    
    return {
        'success': passed == total,
        'passed': passed,
        'total': total,
        'pass_rate': passed / total,
        'details': {name: {'passed': success, 'result': result} 
                   for name, success, result in results}
    }

if __name__ == "__main__":
    print("="*60)
    print("TESTING DATA VALIDATION")
    print("="*60)
    
    # Test single transaction validation
    print("\n1. Single Transaction Validation")
    print("-"*40)
    
    test_cases = [
        {
            "name": "Valid transaction",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Negative amount",
            "data": {"amount": -100.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Invalid hour",
            "data": {"amount": 50.0, "hour": 25, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Unknown merchant",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "unknown"}
        },
        {
            "name": "Everything wrong",
            "data": {"amount": -999, "hour": 99, "day_of_week": 15, "merchant_category": "fake"}
        },
    ]
    
    for tc in test_cases:
        result = validate_transaction(tc["data"])
        status = "PASS" if result["valid"] else "FAIL"
        print(f"\n{tc['name']}: {status}")
        if result["errors"]:
            for error in result["errors"]:
                print(f"  - {error}")
    
    # Test batch validation
    print("\n\n2. Batch Validation with Great Expectations")
    print("-"*40)
    
    train_df = pd.read_csv('data/train.csv')
    results = validate_batch(train_df)
    
    print(f"\nTraining data validation: {results['passed']}/{results['total']} checks passed")
    print(f"Pass rate: {results['pass_rate']:.1%}")
    
    if not results['success']:
        print("\nFailed checks:")
        for name, detail in results['details'].items():
            if not detail['passed']:
                print(f"  - {name}")

When to Use Which Validation Approach

Approach	Use Case	Latency	When to Use
Custom Python (`validate_transaction`)	Real-time API requests	<1ms	Every prediction request
Great Expectations	Batch data quality	Seconds	Training data, periodic audits, CI/CD

We use both in this tutorial because they serve different purposes:

Custom validation is your runtime gatekeeper — fast enough for every request
Great Expectations is your batch auditor — thorough checks on datasets
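
Two edge cases are easy to miss in a hand-written validator like `validate_transaction`: Python treats `bool` as a subclass of `int`, and `NaN` fails every comparison, so it slips past both the `<= 0` and `> 50000` checks. A stricter amount check might look like the following sketch (the `is_valid_amount` helper and its `max_amount` parameter are hypothetical):

```python
import math


def is_valid_amount(amount, max_amount=50_000):
    """Accept only finite, positive numbers up to max_amount; reject bools."""
    # bool is a subclass of int, so it must be excluded explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return False
    # NaN and infinity are floats but are not meaningful amounts.
    if not math.isfinite(amount):
        return False
    return 0 < amount <= max_amount


for candidate in [50.0, True, float("nan"), float("inf"), -1, 50_000, 50_000.01]:
    print(repr(candidate), is_valid_amount(candidate))
```

The same `isinstance(value, bool)` guard would apply to the `hour` and `day_of_week` checks.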

5.2 Integrate Validation into FastAPI

Now let's update our API to reject invalid input with clear error messages:

# src/serve_validated.py
"""
Serve fraud detection model with input validation.

This version adds data validation BEFORE making predictions:
- Invalid inputs are rejected with HTTP 400 and clear error messages
- Valid inputs are processed and predictions returned

This is much safer than the naive version which accepted garbage.
"""
import pickle
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.data_validation import validate_transaction

# Load model
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)

app = FastAPI(
    title="Fraud Detection API (Validated)",
    description="""
    Fraud detection API with input validation.
    
    All inputs are validated before prediction:
    - amount: Must be positive and below $50,000
    - hour: Must be 0-23
    - day_of_week: Must be 0-6
    - merchant_category: Must be one of: grocery, restaurant, retail, online, travel
    
    Invalid inputs return HTTP 400 with detailed error messages.
    """,
    version="3.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount (must be positive)", example=150.00)
    hour: int = Field(..., description="Hour of day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Mon, 6=Sun)", example=3)
    merchant_category: str = Field(..., description="Merchant type", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    validation_passed: bool = True

class ValidationErrorResponse(BaseModel):
    detail: dict

@app.post("/predict", response_model=PredictionResponse, responses={400: {"model": ValidationErrorResponse}})
def predict(tx: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Input is validated before prediction. Invalid inputs return HTTP 400.
    """
    data = tx.dict()
    
    # VALIDATE INPUT BEFORE MAKING PREDICTION
    validation = validate_transaction(data)
    
    if not validation["valid"]:
        raise HTTPException(
            status_code=400,
            detail={
                "message": "Validation failed",
                "errors": validation["errors"],
                "input": data
            }
        )
    
    # Input is valid - make prediction
    data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        validation_passed=True
    )

@app.get("/health")
def health():
    return {"status": "healthy", "validation": "enabled"}

Start the validated API:

uvicorn src.serve_validated:app --reload --host 0.0.0.0 --port 8000

Now test with bad data:

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": -500, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}'

Response (HTTP 400):

{
  "detail": {
    "message": "Validation failed",
    "errors": [
      "amount must be positive",
      "hour must be between 0 and 23 (got 25)",
      "day_of_week must be between 0 (Monday) and 6 (Sunday) (got 10)",
      "merchant_category must be one of ['grocery', 'restaurant', 'retail', 'online', 'travel'] (got 'fake')"
    ],
    "input": {"amount": -500.0, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}
  }
}

This is a huge improvement! Instead of silently accepting garbage and returning meaningless predictions, we now:

Reject invalid input immediately
Provide clear, actionable error messages
Return the original input for debugging
Use proper HTTP status codes (400 for client error)

Checkpoint: Your validated API should:

Accept valid transactions and return predictions
Reject invalid transactions with HTTP 400 and detailed error messages
Show validation errors for each invalid field

6. Monitor Model Performance and Data Drift

What breaks without this: Your model's accuracy drops from 98% to 70% over two months. Nobody notices until customers complain. By then, significant damage has occurred.

Even with a great model and clean input data, time can be an enemy. Model performance can decline as real-world data evolves – this is known as model drift or model decay.

Why This Matters: In traditional software, you monitor CPU, memory, error rates, and response times. In ML, you must also monitor:

Data quality (are inputs within expected ranges?)
Model performance (is accuracy holding up?)
Data drift (has input distribution changed?)
Prediction drift (has the distribution of predictions changed?)

Without monitoring, a model can fail silently for weeks before anyone notices: fraud slips through, good customers get blocked, and revenue is lost.

6.1 The Four Pillars of ML Observability

Pillar	What to Monitor	Why It Matters
Data Quality	Are inputs valid? Nulls? Outliers?	Bad data causes bad predictions
Model Performance	Accuracy, precision, recall, F1	Is the model still working?
Data Drift	Has input distribution changed from training?	Model may not generalize to new data
Prediction Drift	Has prediction distribution changed?	May indicate data or concept drift
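
To make the prediction-drift row concrete, here is a minimal sketch that compares the share of fraud predictions in a reference window against a recent window using a two-proportion z-test. This is a swapped-in technique for illustration, not something the tutorial's code does: the function name, the window sizes, and the 0.05 cutoff are all assumptions. The 2% and 10% rates mirror the fraud-spike scenario used later in this section.

```python
import math


def prediction_rate_shift(ref_pos: int, ref_n: int, cur_pos: int, cur_n: int) -> dict:
    """Two-proportion z-test on the share of positive (fraud) predictions.

    Hypothetical helper: ref_* describe a reference window, cur_* the current one.
    """
    p_ref = ref_pos / ref_n
    p_cur = cur_pos / cur_n
    # Pooled rate under the null hypothesis that both windows share one rate
    pooled = (ref_pos + cur_pos) / (ref_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ref_n + 1 / cur_n))
    if se == 0:
        # Both windows are all-positive or all-negative: no measurable shift
        return {"p_ref": p_ref, "p_cur": p_cur, "z": 0.0, "p_value": 1.0}
    z = (p_cur - p_ref) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"p_ref": p_ref, "p_cur": p_cur, "z": z, "p_value": p_value}


# 2% flagged in 10,000 reference predictions vs 10% flagged in 2,000 recent ones
result = prediction_rate_shift(ref_pos=200, ref_n=10_000, cur_pos=200, cur_n=2_000)
print(f"z = {result['z']:.1f}, drifted = {result['p_value'] < 0.05}")
```

A check like this needs only the logged predictions, not ground-truth labels, which is why prediction drift is often the earliest signal available.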

6.2 Build a Drift Monitor with Evidently

Evidently is an open-source library specifically designed for ML monitoring. It can detect drift, generate reports, and integrate with monitoring systems.

Create src/monitoring.py:

# src/monitoring.py
"""
Model monitoring with Evidently.

This module provides tools to:
1. Detect data drift between training and production data
2. Generate detailed HTML reports
3. Track drift over time
4. Alert when drift exceeds thresholds

In production, you would run drift checks periodically (hourly, daily)
and alert when significant drift is detected.
"""
import pandas as pd
import numpy as np
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
from datetime import datetime
from typing import List, Dict, Any, Optional

class DriftMonitor:
    """
    Monitor for detecting data drift between reference (training) and current data.
    
    Implementation note: this class combines two tools:
    1. SciPy's two-sample KS test: a lightweight statistical check that powers
       check_drift() and the alerts, and works without Evidently
    2. Evidently: a full-featured library that generate_report() uses for HTML reports
    
    Keeping the KS test separate is defensive coding: if Evidently fails to
    generate a report, drift detection still works.
    
    Usage:
        monitor = DriftMonitor(training_data)
        result = monitor.check_drift(production_data)
        if result['drift_detected']:
            alert("Drift detected!")
    """
    
    def __init__(self, reference_data: pd.DataFrame, feature_columns: Optional[List[str]] = None):
        """
        Initialize the drift monitor with reference (training) data.
        
        Args:
            reference_data: The training data to compare against
            feature_columns: Columns to monitor (default: all numeric columns)
        """
        self.reference = reference_data
        self.feature_columns = feature_columns or reference_data.select_dtypes(
            include=[np.number]
        ).columns.tolist()
        self.history: List[Dict[str, Any]] = []
        
        print(f"Drift monitor initialized with {len(self.reference):,} reference samples")
        print(f"Monitoring columns: {self.feature_columns}")
    
    def check_drift(self, current_data: pd.DataFrame, threshold: float = 0.1) -> Dict[str, Any]:
        """
        Check for drift between reference and current data.
        
        Args:
            current_data: Current/production data to check
            threshold: Drift share threshold for alerting (default 10%)
            
        Returns:
            Dictionary with drift results
        """
        from scipy import stats
        
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        # Simple statistical drift detection using KS test
        drifted_columns = []
        for col in self.feature_columns:
            statistic, p_value = stats.ks_2samp(
                ref_subset[col].dropna(),
                cur_subset[col].dropna()
            )
            if p_value < 0.05:  # 5% significance level
                drifted_columns.append(col)
        
        n_features = len(self.feature_columns)
        n_drifted = len(drifted_columns)
        drift_share = n_drifted / n_features if n_features > 0 else 0
        
        result = {
            'timestamp': datetime.now().isoformat(),
            'drift_detected': n_drifted > 0,
            'drift_share': drift_share,
            'drifted_columns': drifted_columns,
            'n_features': n_features,
            'n_drifted': n_drifted,
            'current_samples': len(current_data),
            'threshold': threshold,
            'alert': drift_share > threshold
        }
        
        self.history.append(result)
        
        return result
    
    def generate_report(self, current_data: pd.DataFrame, output_path: str = "drift_report.html"):
        """
        Generate a detailed HTML drift report using Evidently.
        
        Opens in browser for visual inspection of drift patterns.
        """
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        try:
            report = Report(metrics=[DataDriftPreset()])
            report.run(reference_data=ref_subset, current_data=cur_subset)
            
            # Save the report as a standalone HTML file
            report.save_html(output_path)
            
            print(f"Drift report saved to {output_path}")
            print("Open this file in a browser to view detailed visualizations.")
        except Exception as e:
            print(f"Could not generate Evidently report: {e}")
            print("The KS-test results from check_drift() are still available.")
    
    def get_alerts(self, threshold: float = 0.1) -> List[Dict[str, Any]]:
        """
        Get all alerts from history where drift exceeded threshold.
        """
        return [
            {
                'timestamp': r['timestamp'],
                'severity': 'HIGH' if r['drift_share'] > 0.3 else 'MEDIUM',
                'drift_share': r['drift_share'],
                'message': f"Drift detected: {r['drift_share']:.1%} of features drifted",
                'drifted_columns': r['drifted_columns']
            }
            for r in self.history
            if r['drift_share'] > threshold
        ]
    
    def summary(self) -> Dict[str, Any]:
        """Get summary statistics from monitoring history."""
        if not self.history:
            return {"message": "No drift checks performed yet"}
        
        drift_shares = [r['drift_share'] for r in self.history]
        alerts = [r for r in self.history if r['alert']]
        
        return {
            'total_checks': len(self.history),
            'total_alerts': len(alerts),
            'avg_drift_share': np.mean(drift_shares),
            'max_drift_share': np.max(drift_shares),
            'first_check': self.history[0]['timestamp'],
            'last_check': self.history[-1]['timestamp']
        }


def simulate_drift_scenarios():
    """
    Demonstrate drift detection with different scenarios.
    
    This simulates what happens when production data differs from training data.
    """
    from src.generate_data import generate_transactions
    
    print("="*70)
    print("DRIFT DETECTION SIMULATION")
    print("="*70)
    
    # Load reference (training) data
    print("\n1. Loading reference data (training set)...")
    reference = pd.read_csv('data/train.csv')
    feature_cols = ['amount', 'hour', 'day_of_week']
    
    # Initialize drift monitor
    monitor = DriftMonitor(reference, feature_cols)
    
    # Scenario 1: Similar data (should show minimal drift)
    print("\n" + "-"*70)
    print("SCENARIO 1: Test data (similar distribution)")
    print("-"*70)
    test_data = pd.read_csv('data/test.csv')
    result = monitor.check_drift(test_data)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 2: Fraud spike (10% fraud instead of 2%)
    print("\n" + "-"*70)
    print("SCENARIO 2: Fraud spike (10% fraud rate instead of 2%)")
    print("-"*70)
    fraud_spike = generate_transactions(n_samples=2000, fraud_ratio=0.10, seed=101)
    result = monitor.check_drift(fraud_spike)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 3: Amount inflation (everything costs more)
    print("\n" + "-"*70)
    print("SCENARIO 3: Amount inflation (2x multiplier)")
    print("-"*70)
    inflated = test_data.copy()
    inflated['amount'] = inflated['amount'] * 2
    result = monitor.check_drift(inflated)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 4: Time shift (more late-night transactions)
    print("\n" + "-"*70)
    print("SCENARIO 4: Time shift (mostly late-night transactions)")
    print("-"*70)
    night_shift = test_data.copy()
    # Fixed seed so the simulated shift is reproducible between runs
    night_shift['hour'] = np.random.default_rng(42).choice([0, 1, 2, 3, 22, 23], size=len(night_shift))
    result = monitor.check_drift(night_shift)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Generate detailed report for the most drifted scenario
    print("\n" + "-"*70)
    print("GENERATING DETAILED REPORT")
    print("-"*70)
    monitor.generate_report(night_shift, "drift_report.html")
    
    # Print summary
    print("\n" + "-"*70)
    print("MONITORING SUMMARY")
    print("-"*70)
    summary = monitor.summary()
    print(f"  Total checks: {summary['total_checks']}")
    print(f"  Total alerts: {summary['total_alerts']}")
    print(f"  Average drift share: {summary['avg_drift_share']:.1%}")
    print(f"  Maximum drift share: {summary['max_drift_share']:.1%}")
    
    # Print alerts
    alerts = monitor.get_alerts()
    if alerts:
        print(f"\n  Alerts ({len(alerts)}):")
        for alert in alerts:
            print(f"    [{alert['severity']}] {alert['message']}")
    
    print("\n" + "="*70)
    print("DRIFT DETECTION SIMULATION COMPLETE")
    print("="*70)
    print("\nOpen drift_report.html in your browser to see detailed visualizations!")


if __name__ == "__main__":
    simulate_drift_scenarios()

Run the drift simulation:

python src/monitoring.py

You'll see output showing how drift detection works in different scenarios. Then open drift_report.html in your browser to see beautiful visualizations of the drift patterns.
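
One consequence of the default thresholds is worth spelling out. The sketch below replays the arithmetic from check_drift and get_alerts: the 0.1 alert threshold and the 0.3 severity cutoff come from the class above, while the helper name is made up for illustration. With only three monitored features, a single drifted column already gives a drift share of 1/3, so every detected drift raises an alert and every alert is rated HIGH.

```python
from typing import List, Optional, Tuple


def alert_levels(
    n_features: int, threshold: float = 0.1, high_cutoff: float = 0.3
) -> List[Tuple[int, float, bool, Optional[str]]]:
    """List (n_drifted, drift_share, alert, severity) for every possible outcome."""
    rows = []
    for n_drifted in range(n_features + 1):
        share = n_drifted / n_features
        alert = share > threshold  # same comparison as check_drift
        # Same severity rule as get_alerts; None when no alert fires
        severity = ("HIGH" if share > high_cutoff else "MEDIUM") if alert else None
        rows.append((n_drifted, round(share, 3), alert, severity))
    return rows


for row in alert_levels(3):
    print(row)
```

The MEDIUM band only becomes reachable with more features. With 20 monitored columns, for example, 3 drifted columns give a 15% drift share, which is above the alert threshold but below the HIGH cutoff.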

6.3 Production Monitoring Strategy

In a production environment, you would:

Log all predictions to a database or data warehouse
Run drift checks periodically (hourly for high-traffic systems, daily for lower traffic)
Set up alerts when drift exceeds thresholds (integrate with PagerDuty, Slack, etc.)
Trigger retraining if drift is severe or sustained
Create dashboards to track drift over time (Grafana, Datadog, etc.)
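
As a rough illustration of the "severe or sustained" rule in the list above, the following sketch decides whether to trigger retraining from a history of drift-check results shaped like the dictionaries DriftMonitor.check_drift returns. The function name and both cutoffs (50% drift share for "severe", three consecutive alerts for "sustained") are assumptions chosen for the example, not recommendations from the tutorial.

```python
from typing import Any, Dict, List


def should_retrain(
    history: List[Dict[str, Any]],
    severe_share: float = 0.5,
    sustained_checks: int = 3,
) -> bool:
    """Decide whether drift is severe or sustained enough to retrain.

    `history` is a list of check_drift-style results, oldest first, each with
    'drift_share' and 'alert' keys. Both cutoffs are illustrative assumptions.
    """
    if not history:
        return False
    # Severe: the latest check shows at least `severe_share` of features drifted
    if history[-1]["drift_share"] >= severe_share:
        return True
    # Sustained: the last `sustained_checks` checks all raised an alert
    recent = history[-sustained_checks:]
    return len(recent) == sustained_checks and all(r["alert"] for r in recent)


one_off = [{"drift_share": 0.2, "alert": True}]
sustained = one_off * 3
print(should_retrain(one_off), should_retrain(sustained))
```

Requiring several consecutive alerts avoids retraining on a single noisy window, while the severe branch still reacts immediately to a large shift.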

Checkpoint: Running python src/monitoring.py should:

Show minimal drift for similar data (test set)
Show significant drift for modified data (fraud spike, inflation, time shift)
Generate an HTML report that you can view in your browser

7. Automate Testing and Deployment with CI/CD

What breaks without this: A typo in your code breaks the API. You deploy on Friday at 5 PM. Nobody notices until Monday. Fraud losses spike over the weekend.

CI/CD (Continuous Integration/Continuous Deployment) ensures reliable, repeatable releases. As JFrog notes: "A strong CI/CD pipeline enables ML teams to build robust, bug-free models more quickly and efficiently."

Why This Matters: In ML, changes aren't just code – they're also data and models. CI/CD ensures that when you change training logic, data preprocessing, or hyperparameters, tests verify the change doesn't break anything before it reaches production. It's the difference between deploying with confidence and deploying with crossed fingers.

7.1 Write Tests for Data and Model

Create tests/test_data_and_model.py:

# tests/test_data_and_model.py
"""
Tests for data quality and model performance.

These tests run in CI/CD to ensure:
1. Data meets quality requirements
2. Model meets performance thresholds
3. No regressions are introduced

Run with: pytest tests/test_data_and_model.py -v
"""
import pandas as pd
import pickle
import pytest
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

class TestDataQuality:
    """Tests for training data quality."""
    
    @pytest.fixture
    def train_data(self):
        return pd.read_csv("data/train.csv")
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_train_data_has_expected_columns(self, train_data):
        """Training data must have all required columns."""
        required_columns = {"amount", "hour", "day_of_week", "merchant_category", "is_fraud"}
        actual_columns = set(train_data.columns)
        missing = required_columns - actual_columns
        assert not missing, f"Missing columns: {missing}"
    
    def test_train_data_not_empty(self, train_data):
        """Training data must have rows."""
        assert len(train_data) > 0, "Training data is empty"
        assert len(train_data) >= 1000, f"Training data too small: {len(train_data)} rows"
    
    def test_no_negative_amounts(self, train_data):
        """Transaction amounts must be non-negative."""
        negative_count = (train_data["amount"] < 0).sum()
        assert negative_count == 0, f"Found {negative_count} negative amounts"
    
    def test_amounts_reasonable(self, train_data):
        """Transaction amounts should be within reasonable bounds."""
        max_amount = train_data["amount"].max()
        assert max_amount <= 100000, f"Max amount {max_amount} exceeds reasonable limit"
    
    def test_hours_valid(self, train_data):
        """Hours must be 0-23."""
        invalid = train_data[(train_data["hour"] < 0) | (train_data["hour"] > 23)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid hours"
    
    def test_days_valid(self, train_data):
        """Days of week must be 0-6."""
        invalid = train_data[(train_data["day_of_week"] < 0) | (train_data["day_of_week"] > 6)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid days"
    
    def test_merchant_categories_valid(self, train_data):
        """Merchant categories must be from known set."""
        valid_categories = {"grocery", "restaurant", "retail", "online", "travel"}
        actual_categories = set(train_data["merchant_category"].unique())
        invalid = actual_categories - valid_categories
        assert not invalid, f"Invalid merchant categories: {invalid}"
    
    def test_fraud_ratio_reasonable(self, train_data):
        """Fraud ratio should be realistic (between 0.1% and 50%)."""
        fraud_ratio = train_data["is_fraud"].mean()
        assert 0.001 <= fraud_ratio <= 0.5, f"Fraud ratio {fraud_ratio:.2%} is unrealistic"
    
    def test_no_nulls_in_critical_columns(self, train_data):
        """Critical columns must not have null values."""
        critical = ["amount", "hour", "day_of_week", "merchant_category", "is_fraud"]
        for col in critical:
            null_count = train_data[col].isnull().sum()
            assert null_count == 0, f"Column {col} has {null_count} null values"


class TestModelPerformance:
    """Tests for model performance thresholds."""
    
    @pytest.fixture
    def model_and_encoder(self):
        with open("models/model.pkl", "rb") as f:
            return pickle.load(f)
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_model_loads_successfully(self, model_and_encoder):
        """Model file must load without errors."""
        model, encoder = model_and_encoder
        assert model is not None, "Model is None"
        assert encoder is not None, "Encoder is None"
    
    def test_model_can_predict(self, model_and_encoder, test_data):
        """Model must be able to make predictions."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        predictions = model.predict(X)
        assert len(predictions) == len(X), "Prediction count mismatch"
    
    def test_accuracy_threshold(self, model_and_encoder, test_data):
        """Model accuracy must be at least 90%."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        accuracy = model.score(X, y)
        assert accuracy >= 0.90, f"Accuracy {accuracy:.2%} below 90% threshold"
    
    def test_f1_threshold(self, model_and_encoder, test_data):
        """Model F1-score must be at least 0.3 (sanity check for imbalanced data)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        f1 = f1_score(y, y_pred)
        assert f1 >= 0.3, f"F1-score {f1:.2f} below 0.3 threshold"
    
    def test_precision_not_zero(self, model_and_encoder, test_data):
        """Model precision must be greater than 0 (at least one fraud prediction is correct)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        precision = precision_score(y, y_pred, zero_division=0)
        assert precision > 0, "Model has zero precision (no fraud prediction was correct)"
    
    def test_recall_not_zero(self, model_and_encoder, test_data):
        """Model recall must be greater than 0 (catches at least some fraud)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        recall = recall_score(y, y_pred, zero_division=0)
        assert recall > 0, "Model has zero recall (misses all fraud)"
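
The F1, precision, and recall tests above matter because the accuracy threshold alone is easy to pass on imbalanced data. The short worked example below uses plain Python (no scikit-learn) and an assumed 2% fraud rate, matching the baseline rate mentioned in the drift scenarios. It shows that a model that never predicts fraud still clears the 90% accuracy bar.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Define the ratios as 0.0 when their denominator is zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 1,000 transactions with 2% fraud, scored by a "model" that always says "not fraud"
y_true = [1] * 20 + [0] * 980
always_legit = [0] * 1000
metrics = binary_metrics(y_true, always_legit)
print(metrics)
```

Accuracy comes out at 98% while recall and F1 are both zero, so test_accuracy_threshold would pass and only the other three tests would catch this useless model.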

Create tests/test_api.py:

# tests/test_api.py
"""
Tests for the FastAPI prediction service.

These tests ensure the API:
1. Returns correct responses for valid inputs
2. Rejects invalid inputs with proper error messages
3. Health check works

Run with: pytest tests/test_api.py -v
Note: Requires the API to be running on localhost:8000
"""
import pytest
import httpx

BASE_URL = "http://localhost:8000"

class TestPredictionEndpoint:
    """Tests for the /predict endpoint."""
    
    def test_valid_prediction_returns_200(self):
        """Valid input should return HTTP 200 with prediction."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        assert "is_fraud" in data
        assert "fraud_probability" in data
        assert isinstance(data["is_fraud"], bool)
        assert 0 <= data["fraud_probability"] <= 1
    
    def test_high_risk_transaction(self):
        """High-risk transaction should return a well-formed fraud probability."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 500.0,
            "hour": 3,  # Late night
            "day_of_week": 1,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        # We don't assert a minimum value because the trained model may vary;
        # we only check that the probability is in the valid range.
        assert 0.0 <= data["fraud_probability"] <= 1.0
    
    def test_negative_amount_rejected(self):
        """Negative amount should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": -100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
        assert "errors" in response.json()["detail"]
    
    def test_invalid_hour_rejected(self):
        """Invalid hour should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 25,  # Invalid
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_invalid_merchant_rejected(self):
        """Unknown merchant category should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "unknown_category"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_missing_field_rejected(self):
        """Missing required field should be rejected."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14
            # Missing day_of_week and merchant_category
        }, timeout=10)
        
        assert response.status_code == 422  # Pydantic validation error


class TestHealthEndpoint:
    """Tests for the /health endpoint."""
    
    def test_health_returns_200(self):
        """Health endpoint should return 200."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        assert response.status_code == 200
    
    def test_health_returns_healthy_status(self):
        """Health endpoint should indicate healthy status."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        data = response.json()
        assert data["status"] == "healthy"

Run tests locally:

# Run data and model tests (API not needed)
pytest tests/test_data_and_model.py -v

# Run API tests (requires API to be running)
pytest tests/test_api.py -v

7.2 GitHub Actions Workflow

⚠️ Note for Production Teams
In real ML teams, you typically don't retrain full models inside CI — it's slow and resource-intensive.
Here we do it to keep everything local, reproducible, and self-contained for learning.
Production pipelines usually separate training (scheduled jobs) from testing (CI/CD).

Create .github/workflows/ci.yml:

# .github/workflows/ci.yml
name: ML Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: 'pip'
      
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      
      - name: Generate training data
        run: python src/generate_data.py
      
      - name: Train model
        run: python src/train_naive.py
      
      - name: Run data and model tests
        run: pytest tests/test_data_and_model.py -v --tb=short
      
      - name: Build Docker image
        run: docker build -t fraud-detection-api .
      
      - name: Run container for API tests
        run: |
          docker run -d -p 8000:8000 --name test-api fraud-detection-api
          sleep 10  # Wait for API to start
          curl -f http://localhost:8000/health || exit 1
      
      - name: Run API tests
        run: pytest tests/test_api.py -v --tb=short
      
      - name: Cleanup
        if: always()
        run: docker stop test-api || true
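
The fixed sleep 10 in the workflow above can be flaky: too short on a slow runner, wasted time on a fast one. One alternative is to poll the health endpoint until it answers. The sketch below is a hypothetical helper, not part of the original project, and it uses only the standard library. The probe is passed in as a callable so the retry logic can be exercised without a running server.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused/reset or an error status: not ready yet
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In CI, a small script could call wait_until_healthy(lambda: http_ok("http://localhost:8000/health")) and exit with a non-zero status when it returns False, replacing the sleep and curl pair.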

7.3 Dockerize the Application

Create Dockerfile:

# Dockerfile
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ src/
COPY models/ models/
COPY data/ data/

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the API
CMD ["uvicorn", "src.serve_validated:app", "--host", "0.0.0.0", "--port", "8000"]

Create .dockerignore:

# .dockerignore
venv/
__pycache__/
*.pyc
.git/
.github/
mlruns/
*.db
*.html
.pytest_cache/

Build and run locally:

# Build the Docker image
docker build -t fraud-detection-api .

# Run the container
docker run -p 8000:8000 fraud-detection-api

# Test it
curl http://localhost:8000/health

Checkpoint:

All tests pass: pytest tests/test_data_and_model.py -v
Docker image builds successfully
Container runs and responds to health checks

8. Incident Response Playbook

When things go wrong in production (and they will), you need a plan. This section provides playbooks for common ML incidents.

Scenario: False Positive Spike

Symptoms: Your fraud model suddenly flags 40% of legitimate transactions as fraud, blocking customers and overwhelming your manual review team.

Severity: HIGH - Direct customer impact

Phase 1: Mitigation (0-5 minutes)

Acknowledge the incident - Notify stakeholders that you're aware and responding
Roll back to previous model - In MLflow UI, move the @champion alias to the previous model version
Restart the API - docker restart fraud-api or redeploy
Verify - Check that false positive rate has returned to normal
Communicate - "Issue detected and mitigated. Investigating root cause."
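
The rollback step can also be scripted instead of done in the MLflow UI. The sketch below assumes a client object that behaves like mlflow.MlflowClient and uses its get_model_version_by_alias and set_registered_model_alias methods. Treating "previous model" as "current version minus one" is an assumption, and the helper name is hypothetical.

```python
def rollback_alias(client, model_name: str, alias: str = "champion") -> int:
    """Point `alias` at the version just before the one it currently targets.

    `client` is expected to behave like mlflow.MlflowClient. Assuming that
    "previous" means "current version minus one" is a simplification; a real
    registry may need to skip deleted or never-promoted versions.
    """
    current = int(client.get_model_version_by_alias(model_name, alias).version)
    if current <= 1:
        raise ValueError(
            f"{model_name}@{alias} is at version {current}; nothing to roll back to"
        )
    previous = current - 1
    # MLflow stores model versions as strings
    client.set_registered_model_alias(model_name, alias, str(previous))
    return previous
```

With a real registry you would pass mlflow.MlflowClient() as client and the name under which the fraud model was registered earlier in the project. The API still needs a restart or reload afterwards so it picks up the re-pointed alias.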

Phase 2: Diagnosis (5-60 minutes)

Check drift report - Run python src/monitoring.py with recent production data
Check data validation logs - Did upstream data format change?
Check recent deployments - Was there a new model or code deployed recently?
Compare metrics - What's different between the rolled-back and problematic model?

Example root causes:

Upstream system sent amounts in cents instead of dollars
New merchant category appeared that wasn't in training data
Holiday shopping patterns differed significantly from training data
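
The first root cause above, amounts arriving in cents instead of dollars, leaves a recognizable fingerprint: the median shifts by a clean factor such as 100. The following sketch checks for that. It is a heuristic written for this example; the function name, the candidate factors, and the 25% tolerance are all assumptions.

```python
from statistics import median
from typing import Optional, Sequence


def scale_shift_factor(
    reference: Sequence[float],
    current: Sequence[float],
    candidates: Sequence[float] = (100.0, 0.01, 1000.0, 0.001),
    tolerance: float = 0.25,
) -> Optional[float]:
    """Return the unit-conversion factor that explains a median shift, if any."""
    ref_med, cur_med = median(reference), median(current)
    if ref_med <= 0 or cur_med <= 0:
        return None  # ratio is undefined or meaningless
    ratio = cur_med / ref_med
    for factor in candidates:
        # Accept the factor if the observed ratio is within `tolerance` of it
        if abs(ratio / factor - 1) <= tolerance:
            return factor
    return None


training_amounts = [12.5, 40.0, 75.0, 20.0, 55.0]
production_amounts = [a * 100 for a in training_amounts]  # simulated cents
print(scale_shift_factor(training_amounts, production_amounts))
```

A generic drift test only reports that the amount distribution changed. A targeted check like this one points directly at a unit mix-up, which shortens the diagnosis phase.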

Phase 3: Remediation (1-24 hours)

Fix the root cause - Add validation for the edge case, or update training data
Retrain if needed - Include new patterns in training data
Add test case - Prevent this from happening again
Document - Add to runbook for future reference

Scenario: Gradual Performance Decay

Symptoms: Monitoring shows fraud recall dropping 2% per week over a month. No sudden failures, just slow degradation.

Severity: MEDIUM - Gradual impact, time to respond

Response:

Investigate drift report - Look for gradual distribution changes
```
python src/monitoring.py
```
Collect recent labeled data - Get confirmed fraud cases from the past month
Analyze patterns - What's different about recent fraud?
- New attack vectors?
- Different time patterns?
- New merchant categories?
Retrain on combined data - Include both old and new patterns
```
python src/train_mlflow.py
```
Deploy via canary - Route 10% of traffic to the new model first
- Monitor metrics for 1-2 days
- If metrics improve, increase to 50%, then 100%
- If metrics worsen, roll back
Set up recurring retraining - Schedule weekly or monthly retraining
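
The canary step in the list above needs a way to send a stable 10% of traffic to the new model. One common approach is to hash a request or customer identifier into 100 buckets, so the same ID always lands on the same model. This is a minimal sketch under that assumption; the function name and the ID format are hypothetical.

```python
import hashlib


def use_canary(request_id: str, canary_percent: int = 10) -> bool:
    """Deterministically route about `canary_percent`% of IDs to the canary model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 100)
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


ids = [f"tx-{i}" for i in range(10_000)]
canary_share = sum(use_canary(i) for i in ids) / len(ids)
print(f"Share routed to canary: {canary_share:.1%}")
```

Raising canary_percent to 50 and then 100 keeps every ID that was already on the canary there, which makes before-and-after comparisons cleaner than random per-request routing.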

Scenario: Upstream Data Schema Change

Symptoms: API starts returning 500 errors. Logs show KeyError: 'merchant_category'.

Severity: HIGH - Service is down

Response:

Check error logs - Identify the exact error
```
KeyError: 'merchant_category'
```
Check upstream data - Did the field name change?
- merchant_category -> category
- amount -> transaction_amount

Immediate fix - Add field name mapping

# Quick fix in API
if 'category' in data and 'merchant_category' not in data:
    data['merchant_category'] = data['category']

Long-term fix - Add validation that catches schema changes

# Assumes `from fastapi import HTTPException`, as in the API module
required_fields = ['amount', 'hour', 'day_of_week', 'merchant_category']
missing = [f for f in required_fields if f not in data]
if missing:
    raise HTTPException(status_code=400, detail=f"Missing fields: {missing}")

Add integration test - Test with upstream system in CI/CD
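
The quick fix and the long-term fix can be combined into one small, testable function. The sketch below is one possible shape for it. The alias table uses the two renames listed above, and the function and constant names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Upstream name -> name the model expects (the two renames listed above)
FIELD_ALIASES = {"category": "merchant_category", "transaction_amount": "amount"}
REQUIRED_FIELDS = ("amount", "hour", "day_of_week", "merchant_category")


def normalize_payload(data: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Map known renamed fields back, then report any required fields still missing."""
    normalized = dict(data)  # never mutate the caller's payload
    for upstream, expected in FIELD_ALIASES.items():
        if upstream in normalized and expected not in normalized:
            normalized[expected] = normalized.pop(upstream)
    missing = [f for f in REQUIRED_FIELDS if f not in normalized]
    return normalized, missing


payload = {"transaction_amount": 42.0, "hour": 14, "day_of_week": 3, "category": "online"}
normalized, missing = normalize_payload(payload)
print(normalized, missing)
```

Because the function is pure, the schema-change scenarios can be covered by ordinary unit tests. The API then only has to turn a non-empty missing list into an error response.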

9. How to Put It All Together

Let's step back and appreciate what we've built. Our initial naive system has transformed into a local ML platform with production-grade components.

💡 Mental Model: Each tool in this stack is a "catch net" for a specific failure mode:

MLflow catches "which model is this?"

Feast catches "are features consistent?"

Great Expectations catches "is this data valid?"

Evidently catches "has the world changed?"

CI/CD catches "did we break something?"

Together, they form defense-in-depth for ML systems.

Component	Tool	Problem Solved
Experiment Tracking	MLflow	Every run logged, reproducible
Model Registry	MLflow	Versioned models, rollback capability
Feature Store	Feast	Consistent features, no training-serving skew
Data Validation	Great Expectations	Bad data rejected with clear errors
Monitoring	Evidently	Drift detected before it causes problems
Containerization	Docker	Environment consistency everywhere
CI/CD	GitHub Actions	Automated testing and safe deployments

The Complete Workflow

Here's how all the pieces work together in practice:

Data arrives - New transaction data comes in from upstream systems
Validation gate - Great Expectations rules check data quality. Bad data is rejected with clear error messages before it can cause harm.
Feature computation - Feast computes features using the same definitions for both training and serving. No more training-serving skew.
Training - When you retrain, MLflow logs all parameters, metrics, and artifacts. Every experiment is reproducible and comparable.
Model registry - Trained models are automatically versioned. You can compare metrics, promote the best to Production, and roll back if needed.
Serving - FastAPI loads the @champion model from MLflow. Each request is validated, features are retrieved from Feast, and predictions are returned.
Monitoring - Evidently checks for drift periodically. If input distributions change significantly, alerts are triggered.
Retraining loop - When drift is detected, you retrain on new data, compare metrics, and promote if better. The cycle continues.
CI/CD safety net - All code changes go through automated tests. Docker ensures environment consistency. Nothing reaches production without passing the pipeline.

10. What's Next: Scale to Production

This project runs locally, but the principles and tools extend directly to production deployments. Here's how each component scales:

Scaling Feast for Production

We used Feast with local SQLite stores. For production:

Component	Local	Production
Online Store	SQLite	Redis, DynamoDB, or PostgreSQL
Offline Store	Parquet files	BigQuery, Snowflake, or Redshift
Feature Server	Embedded	Dedicated Feast serving cluster

Benefits at scale:

Sub-10ms feature retrieval
Horizontal scaling for high throughput
Feature monitoring and statistics
Point-in-time joins at petabyte scale

Scaling MLflow for Production

Component	Local	Production
Backend Store	SQLite	PostgreSQL or MySQL
Artifact Store	Local filesystem	S3, GCS, or Azure Blob
Tracking Server	Single instance	Load-balanced cluster

Kubernetes Deployment

When you outgrow Docker Compose:

KServe or Seldon for serverless model serving with auto-scaling
Horizontal Pod Autoscaler to scale based on CPU/memory/custom metrics
Canary deployments to safely roll out new models (route 10% traffic first)
GPU scheduling for inference-heavy models

Advanced Monitoring

Expand observability with:

Prometheus + Grafana for real-time dashboards
OpenTelemetry for distributed tracing
PagerDuty/Slack integration for alerts
Labeled data collection for continuous model evaluation

A/B Testing and Multi-Armed Bandits

Use the model registry to:

Serve multiple models concurrently (champion vs challengers)
Route traffic dynamically based on context
Collect metrics for each model variant
Automatically promote the best performer
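
The routing idea in the list above can be sketched with an epsilon-greedy bandit. This is a minimal standard-library illustration, not code from the project repository: the `EpsilonGreedyRouter` class, the variant names, and the simulated success rates are all hypothetical, and in a real deployment the reward would come from whatever outcome metric you log per prediction.

```python
import random


class EpsilonGreedyRouter:
    """Route requests between model variants and track a running mean reward per variant."""

    def __init__(self, variants, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {v: 0 for v in variants}
        self.mean_reward = {v: 0.0 for v in variants}

    def choose(self):
        # Explore: with probability epsilon, send the request to a random variant
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))
        # Exploit: otherwise use the variant with the best observed mean reward
        return max(self.mean_reward, key=self.mean_reward.get)

    def record(self, variant, reward):
        # Incremental update of the running mean for this variant
        self.counts[variant] += 1
        n = self.counts[variant]
        self.mean_reward[variant] += (reward - self.mean_reward[variant]) / n

    def best(self, min_samples=100):
        # Only variants with enough traffic are candidates for promotion
        eligible = [v for v, n in self.counts.items() if n >= min_samples]
        return max(eligible, key=self.mean_reward.get) if eligible else None


# Simulated traffic: the success rates below are made up for illustration
router = EpsilonGreedyRouter(["champion", "challenger"], epsilon=0.1, seed=0)
true_rate = {"champion": 0.80, "challenger": 0.90}
sim = random.Random(1)

for _ in range(5000):
    variant = router.choose()
    reward = 1.0 if sim.random() < true_rate[variant] else 0.0
    router.record(variant, reward)

print(router.counts)
print(router.best(min_samples=100))
```

The `min_samples` guard is there so that a variant is never promoted on the strength of a handful of lucky requests. A fixed 90/10 split (as in a canary rollout) is the simpler alternative when you want a predictable traffic share rather than adaptive routing.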

Conclusion

Congratulations on building a production-ready ML system on your local machine!

What we assembled here is a microcosm of real-world ML platforms:

We started with just a model saved to a pickle file
We ended up with MLOps best practices: experiment tracking, model versioning, feature stores, data validation, monitoring, containerization, and CI/CD

The tools we used are production-grade:

MLflow powers ML platforms at companies like Microsoft, Facebook, and Databricks
Feast is used by companies like Gojek, Shopify, and Robinhood
FastAPI is one of the fastest Python web frameworks
Great Expectations is used at companies like GitHub and Shopify
Evidently is used for monitoring ML in production at scale

The principles apply at any scale:

Always track experiments
Always version models
Always validate data
Always monitor for drift
Always containerize for consistency
Always automate testing

Next Steps You Can Try

Deploy to the cloud - Push your Docker container to AWS ECS, Google Cloud Run, or Azure Container Instances
Add model explainability - Use SHAP or LIME to explain individual predictions
Implement A/B testing - Serve multiple models and compare performance
Add feature importance monitoring - Track how feature importance changes over time
Set up real-time alerting - Connect Evidently to Slack or PagerDuty
Implement continuous training - Automatically retrain when drift is detected
Add bias and fairness monitoring - Ensure your model treats all groups fairly

Remember that productionizing ML is an iterative process. There's always another layer of robustness to add, another edge case to handle, another metric to track. But with the foundation you've built here, you're well on your way to taking models from promising notebook experiments to deployed, monitored, and maintainable production applications.

Happy building, and may your models be accurate and your pipelines resilient!

Get the Complete Code

The entire project from this handbook is available as a public GitHub repository:

🔗 github.com/sandeepmb/freecodecamp-local-ml-platform

The repository includes:

All source code (src/ directory)
Test files (tests/ directory)
Feast feature definitions (feature_repo/)
Docker and CI/CD configuration
Ready-to-run scripts

Quick Start:

git clone https://github.com/sandeepmb/freecodecamp-local-ml-platform.git
cd freecodecamp-local-ml-platform
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python src/generate_data.py
python src/train_naive.py

A Comprehensive Guide to Financial Storytelling using Data Visualization

Nikhil Adithyan — Wed, 11 Mar 2026 20:21:36 +0000

In any analysis project, raw tables of numbers often don’t tell the full story. Visualisations simplify complexity by transforming data into shapes that our brains can quickly understand, emphasising trends, outliers, and regime shifts that might be overlooked in raw data.

This is especially vital in finance and trading, where clear visuals can uncover risks, opportunities, and patterns, directly affecting decisions on position sizing, timing, and confidence.

Today, we'll use FMP APIs to interpret earnings data: extracting announcements, surprises, and price reactions across almost 1,000 stocks to identify actionable patterns in post‑earnings movements.

Here’s exactly what we’ll build:

Sector heatmap: Maps strongest 3/10-day post-earnings reactions by sector/market-cap buckets.
EPS scatter: Tests if earnings beats drive returns (sector-colored, with regression).
Return violins: Shows 3-day post-earnings volatility/skew by sector and market-cap.
Mega-tech time series: Tracks AAPL/MSFT/NVDA post-earnings patterns over time.
Monthly seasonality: Reveals calendar edges in post-earnings returns/surprises.
Regime cross-section: Tests sector robustness across bull/bear/sideways markets.

What we'll cover:

Prerequisites
Data Extraction
Storytelling with Charts and Visuals
What Did We Get Out of All This Storyline?
Final Thoughts

Prerequisites

To follow along, you should be comfortable with Python and basic data manipulation in pandas.

This is a code-first guide. I’ll focus on the workflow and the story the charts reveal, and I won’t explain every line of Python. You should be comfortable reading pandas code, loops, and basic plotting logic so you can follow along without needing a step-by-step breakdown of each block.

You’ll need:

Python 3.10+
A Financial Modeling Prep (FMP) API key
pandas, numpy, matplotlib, seaborn, scipy installed
Enough local compute and patience to run API loops across a large stock universe

Data Extraction

In the first part of this article, we need to collect all the data required for our visualisation exercise. Using FMP’s Stock Screener API, we will retrieve NASDAQ stocks. The first API call will return 1,000 stocks.

import requests
import pandas as pd
import numpy as np
import json
from datetime import datetime, timedelta
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

token = 'YOUR FMP TOKEN'

url = f'https://financialmodelingprep.com/stable/company-screener'
querystring = {"apikey":token,"country":"US", "exchange": "NASDAQ", "isActiveTrading": True, "isEtf": False, "isFund": False}
resp = requests.get(url, querystring).json()

df_universe = pd.DataFrame(resp)
df_universe = df_universe[df_universe['exchangeShortName'] == 'NASDAQ']
df_universe

This will give us 1,000 stocks! Next, we'll bin the market capitalisation to gain a better understanding of the results later on, and we will keep only four columns that are necessary: the symbol, name, market cap, and sector.

bins = [0,
        250_000_000,    # 250M
        2_000_000_000,  # 2B
        10_000_000_000, # 10B
        200_000_000_000,# 200B
        float("inf")]

labels = ["Micro", "Small", "Mid", "Large", "Mega"]

df_universe["marketCap"] = pd.cut(df_universe["marketCap"], bins=bins, labels=labels, right=False)
df_universe = df_universe[['symbol', 'companyName', 'marketCap', 'sector']]
df_universe

Now it is time to retrieve the earnings using FMP’s Earnings Report API. We'll loop through each symbol and collect all the earnings the endpoint provides to us.

symbols = df_universe['symbol'].to_list()

all_dfs = []

for symbol in symbols:
    url = f"https://financialmodelingprep.com/stable/earnings?symbol={symbol}"
    params = {"apikey": token}
    resp = requests.get(url, params=params)

    if resp.status_code != 200:
        print(f"Error for {symbol}: {resp.status_code} - {resp.text}")
        continue

    data = resp.json()
    if not data:
        print(f"No data for {symbol}")
        continue

    df_symbol = pd.DataFrame(data)
    df_symbol["symbol"] = symbol
    all_dfs.append(df_symbol)

# Single DataFrame with all earnings
df_earnings = pd.concat(all_dfs, ignore_index=True)
df_earnings = df_earnings.dropna(subset=['epsActual', 'epsEstimated', 'revenueActual','revenueEstimated'])
df_earnings

Now we'll calculate the surprise, both for earnings and revenue in percentage terms, so we can later compare apples with apples! We'll keep everything from 2010 onwards.

# Treat a zero estimate as missing so the division cannot produce inf
eps_est = df_earnings["epsEstimated"].replace(0, np.nan)
rev_est = df_earnings["revenueEstimated"].replace(0, np.nan)

df_earnings["eps_surprise"] = ((df_earnings["epsActual"] - eps_est) /
                               eps_est.abs() * 100).round(2)

df_earnings["revenue_surprise"] = ((df_earnings["revenueActual"] - rev_est) /
                                   rev_est.abs() * 100).round(2)

# .copy() avoids a SettingWithCopyWarning when the date column is reassigned below
df_earnings = df_earnings[['symbol', 'date', 'eps_surprise', 'revenue_surprise']].copy()

df_earnings["date"] = pd.to_datetime(df_earnings["date"])
df_earnings = df_earnings[df_earnings["date"] > "2009-12-31"]
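
To make the surprise formula concrete, here is a small worked sketch on made-up numbers. The `surprise_pct` helper is hypothetical and only mirrors the arithmetic above for a single pair of values. It shows why the denominator uses the absolute value of the estimate: when the estimate is negative, dividing by the raw estimate would flip the sign and turn a beat into a miss. Returning NaN for a zero estimate is an assumption of this sketch.

```python
import math


def surprise_pct(actual, estimate):
    """Percentage surprise relative to the size of the estimate."""
    # Assumption: a zero or missing estimate has no meaningful percentage surprise
    if estimate == 0 or math.isnan(estimate):
        return float("nan")
    return round((actual - estimate) / abs(estimate) * 100, 2)


# Positive estimate, small beat
print(surprise_pct(1.10, 1.00))    # 10.0

# Negative estimate, smaller loss than expected: a beat, so the sign is positive.
# Dividing by the raw estimate (-0.50) instead would give -50.0.
print(surprise_pct(-0.25, -0.50))  # 50.0

# Expected loss, actual profit
print(surprise_pct(0.10, -0.10))   # 200.0

# Zero estimate: undefined, reported as NaN
print(surprise_pct(0.05, 0.0))
```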

Lastly, as a final step in gathering the data needed for visualization, using FMP’s Historical Index Full Chart API, we'll loop through the stocks in our dataframe, retrieve the historical daily prices, and calculate the return of the stock 3 and 10 trading days before and after the earnings announcement.

unique_symbols = df_earnings["symbol"].unique()

price_results = []

print(f"Processing {len(unique_symbols)} symbols...")

for symbol in unique_symbols:
    # Fetch full historical prices
    url = f"https://financialmodelingprep.com/stable/historical-price-eod/full"
    params = {"apikey":token, "symbol":symbol, "from":'2009-10-01'}
    resp = requests.get(url, params=params)

    if resp.status_code != 200:
        print(f"Error for {symbol}: {resp.status_code}")
        continue

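    data = resp.json()
    if not data:
        # An empty response has no "date" column, so skip it before building the frame
        print(f"No price data for {symbol}")
        continue

    hist_df = pd.DataFrame(data)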
    hist_df["date"] = pd.to_datetime(hist_df["date"])
    hist_df = hist_df.sort_values("date").reset_index(drop=True)

    # Get matching earnings rows
    earnings_symbol = df_earnings[df_earnings["symbol"] == symbol].copy()

    for _, row in earnings_symbol.iterrows():
        earn_date = pd.to_datetime(row["date"]).date()

        # === 3-DAY WINDOWS ===
        pre3_mask = (hist_df["date"].dt.date < earn_date) & \
                    (hist_df["date"].dt.date >= earn_date - timedelta(days=10))
        pre3 = hist_df[pre3_mask].tail(3)

        post3_mask = (hist_df["date"].dt.date > earn_date) & \
                     (hist_df["date"].dt.date <= earn_date + timedelta(days=10))
        post3 = hist_df[post3_mask].head(3)

        pre3_start = pre3["close"].iloc[0] if len(pre3) >= 3 else None
        pre3_end = pre3["close"].iloc[-1] if len(pre3) >= 1 else None
        post3_end = post3["close"].iloc[-1] if len(post3) >= 3 else None

        pct_pre_3d = ((pre3_end - pre3_start) / pre3_start * 100) if pre3_start and pre3_end else None
        pct_post_3d = ((post3_end - pre3_end) / pre3_end * 100) if pre3_end and post3_end else None

        # === 10-DAY WINDOWS ===
        pre10_mask = (hist_df["date"].dt.date < earn_date) & \
                     (hist_df["date"].dt.date >= earn_date - timedelta(days=20))
        pre10 = hist_df[pre10_mask].tail(10)

        post10_mask = (hist_df["date"].dt.date > earn_date) & \
                      (hist_df["date"].dt.date <= earn_date + timedelta(days=20))
        post10 = hist_df[post10_mask].head(10)

        pre10_start = pre10["close"].iloc[0] if len(pre10) >= 10 else None
        pre10_end = pre10["close"].iloc[-1] if len(pre10) >= 1 else None
        post10_end = post10["close"].iloc[-1] if len(post10) >= 10 else None

        pct_pre_10d = ((pre10_end - pre10_start) / pre10_start * 100) if pre10_start and pre10_end else None
        pct_post_10d = ((post10_end - pre10_end) / pre10_end * 100) if pre10_end and post10_end else None

        # Compare against None explicitly: a return of exactly 0.0 is valid data
        # and should not be discarded as if it were missing
        price_results.append({
            "symbol": symbol,
            "earn_date": earn_date,
            "month": earn_date.month,
            "pct_pre_3d": round(pct_pre_3d, 2) if pct_pre_3d is not None else None,
            "pct_post_3d": round(pct_post_3d, 2) if pct_post_3d is not None else None,
            "pct_pre_10d": round(pct_pre_10d, 2) if pct_pre_10d is not None else None,
            "pct_post_10d": round(pct_post_10d, 2) if pct_post_10d is not None else None,
            "eps_surprise": row["eps_surprise"],
            "revenue_surprise": row["revenue_surprise"]
        })



df_earnings = pd.DataFrame(price_results)
df_earnings.dropna(inplace=True)
df_earnings = df_universe.merge(df_earnings, on="symbol")
df_earnings

As you can see, at the end of the code, we have also merged the initial dataset, so all the information, such as name, marketCap, and sector, is now in a single dataset.
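
If you want to sanity-check the window logic without calling the API, the sketch below reproduces the same definition on a toy price series: the pre-return runs from the first to the last of the n sessions before the event, and the post-return runs from the last pre-event close to the n-th close after the event. The `window_returns` function and the toy data are hypothetical. Unlike the loop above, this version does not cap the search at 10 or 20 calendar days; it simply counts trading rows.

```python
import pandas as pd


def window_returns(prices, event_date, n):
    """Return (pct_pre, pct_post) around event_date using n trading rows on each side.

    prices: DataFrame with a datetime 'date' column and a 'close' column.
    Returns (None, None) when there are not enough rows on either side.
    """
    prices = prices.sort_values("date").reset_index(drop=True)
    event = pd.Timestamp(event_date)

    # Position of the first row on/after the event, and of the first row strictly after it
    first_on_or_after = int(prices["date"].searchsorted(event, side="left"))
    first_after = int(prices["date"].searchsorted(event, side="right"))

    pre_start_idx = first_on_or_after - n   # first of the n sessions before the event
    pre_end_idx = first_on_or_after - 1     # last close before the event
    post_end_idx = first_after + n - 1      # n-th close after the event

    if pre_start_idx < 0 or post_end_idx >= len(prices):
        return None, None

    close = prices["close"]
    pre_start = close.iloc[pre_start_idx]
    pre_end = close.iloc[pre_end_idx]
    post_end = close.iloc[post_end_idx]

    pct_pre = (pre_end - pre_start) / pre_start * 100
    pct_post = (post_end - pre_end) / pre_end * 100
    return pct_pre, pct_post


# Toy series: 20 weekdays, close rising by 1 per session (made-up data)
toy = pd.DataFrame({
    "date": pd.bdate_range("2024-01-01", periods=20),
    "close": [100.0 + i for i in range(20)],
})

pre3, post3 = window_returns(toy, "2024-01-10", n=3)
print(round(pre3, 2), round(post3, 2))
```

Looking up positions with `searchsorted` avoids rebuilding boolean masks for every earnings row, which matters once the loop covers hundreds of symbols.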

Storytelling with Charts and Visuals

Sector Heatmap

First, we'll present the Sector Heatmap of average 3-day post-earnings returns segmented by sector and market-cap category. This basic visualisation highlights areas with the most significant reactions, enabling traders to swiftly identify high-alpha sectors and market caps for earnings strategies.

# Aggregate: average post-earnings returns and EPS surprise
agg = (
    df_earnings
    .dropna(subset=['pct_post_3d', 'pct_post_10d', 'eps_surprise', 'marketCap', 'sector'])
    .groupby(['sector', 'marketCap'])
    .agg(
        avg_post3d=('pct_post_3d', 'mean'),
        avg_post10d=('pct_post_10d', 'mean'),
        avg_eps_surprise=('eps_surprise', 'mean')
    )
    .reset_index()
)

# Heatmap: average 3-day post-earnings return
heatmap_3d = agg.pivot(index='sector', columns='marketCap', values='avg_post3d')

plt.figure(figsize=(12, 8))
sns.heatmap(
    heatmap_3d,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    center=0,
    linewidths=0.5,
    linecolor='grey'
)
plt.title('Average 3-Day Post-Earnings Return by Sector and Market-Cap Bucket')
plt.xlabel('Market-cap bucket')
plt.ylabel('Sector')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Consumer Cyclical and Materials are performing really well, with small and mid caps seeing positive reactions over 1.1%. Real Estate is also doing great, jumping up to +4.0% in mid caps. Energy and Financials are holding steady, staying close to zero. Technology, on the other hand, is showing more muted gains, under 1.1%, indicating there might be limited immediate upside from the big tech earnings.
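
One caveat when reading a heatmap of averages: a cell built from a handful of events can look dramatic purely by chance. A possible safeguard, sketched below, is to blank out cells with too few observations before plotting. The `mean_return_grid` helper and the `min_obs` threshold are assumptions made for illustration, not part of the original analysis, and the toy frame is made-up data.

```python
import numpy as np
import pandas as pd


def mean_return_grid(df, value_col="pct_post_3d", min_obs=30):
    """Sector x market-cap grid of mean returns, with thin cells set to NaN."""
    grouped = (
        df.groupby(["sector", "marketCap"], observed=True)[value_col]
          .agg(["mean", "count"])
          .reset_index()
    )
    # Hide averages that rest on fewer than min_obs events
    grouped.loc[grouped["count"] < min_obs, "mean"] = np.nan
    return grouped.pivot(index="sector", columns="marketCap", values="mean")


# Toy data: one well-populated cell and one cell with a single event
toy = pd.DataFrame({
    "sector": ["A", "A", "A", "A"],
    "marketCap": ["Small", "Small", "Small", "Mid"],
    "pct_post_3d": [1.0, 2.0, 3.0, 10.0],
})

grid = mean_return_grid(toy, min_obs=2)
print(grid)
```

Passing the resulting grid to `sns.heatmap` leaves the NaN cells blank, so sparse buckets no longer compete visually with well-supported ones.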

Building on the 3‑day heatmap, we'll now look at the Sector Heatmap for average 10‑day post‑earnings returns by sector and market‑cap category. This extends the timeframe to capture momentum persistence, revealing which sectors maintain or reverse short‑term reactions.

# Heatmap: average 10-day post-earnings return
heatmap_10d = agg.pivot(index='sector', columns='marketCap', values='avg_post10d')

plt.figure(figsize=(12, 8))
sns.heatmap(
    heatmap_10d,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    center=0,
    linewidths=0.5,
    linecolor='grey'
)
plt.title('Average 10-Day Post-Earnings Return by Sector and Market-Cap Bucket')
plt.xlabel('Market-cap bucket')
plt.ylabel('Sector')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Consumer Cyclical stands out with peaks at 3.2% (mega caps), and Industrials and Health Care show consistent gains in mid and large caps around 1.1%. Real Estate has eased after its 3-day surge. Technology has seen a small boost in mega caps (+1.8%) but remains less active overall compared to cyclicals.

Mega‑Cap Tech Time Series

Extending the heatmaps, we’ll now look at a Mega-Cap Tech time series. It tracks 10-day post-earnings returns over time for AAPL, MSFT, NVDA, and a few other mega-cap tech names.

A bubble chart works well here because it encodes more than one thing at once. The x-axis is the earnings date, the y-axis is the 10-day post-earnings return, the bubble size scales with the absolute EPS surprise magnitude, and the color shows whether the surprise was a beat or a miss. This makes it easy to spot outlier quarters and see whether big surprises consistently lead to bigger post-earnings moves.

# Define mega-cap tech tickers (top ones from data: AAPL, MSFT, NVDA, AMZN, GOOG/GOOGL, META)
tech_tickers = ['AAPL', 'MSFT', 'NVDA', 'AMZN', 'GOOG', 'GOOGL', 'META']

# Filter data for mega-cap tech
df_tech = (
    df_earnings[df_earnings['symbol'].isin(tech_tickers)]
    .dropna(subset=['earn_date', 'pct_post_10d', 'eps_surprise'])
    .sort_values('earn_date')
    .assign(
        earn_date=lambda x: pd.to_datetime(x['earn_date'])
    )
)

# Create time-series plot: pct_post_10d vs earn_date, sized/color by eps_surprise
plt.figure(figsize=(14, 8))

# Scatter plot
scatter = plt.scatter(
    df_tech['earn_date'],
    df_tech['pct_post_10d'],
    s=np.abs(df_tech['eps_surprise']) * 50 + 20,  # Size by abs(eps_surprise)
    c=df_tech['eps_surprise'],
    cmap='RdYlBu_r',
    alpha=0.7,
    edgecolors='black',
    linewidth=0.5
)

plt.colorbar(scatter, label='EPS Surprise (%)')
plt.xlabel('Earnings Date')
plt.ylabel('10-Day Post-Earnings Return (%)')
plt.title('Mega-Cap Tech: 10-Day Post-Earnings Returns vs Time\n(Point size/color by EPS Surprise)')
plt.grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(pd.to_numeric(df_tech['earn_date']), df_tech['pct_post_10d'], 1)
p = np.poly1d(z)
plt.plot(df_tech['earn_date'], p(pd.to_numeric(df_tech['earn_date'])), "r--", alpha=0.8, linewidth=2, label=f'Trend: {z[0]:.3f}x + {z[1]:.1f}')

plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

That large red bubble around 2018 is most likely AAPL’s report for calendar Q4 2018 (announced in January 2019 and reported as Apple’s fiscal Q1 2019), and it stands out because:

Large size = massive EPS surprise magnitude (Apple cut guidance dramatically, ~10% miss)
Red colour = negative surprise
Low Y position = poor 10‑day return (~-10% range visible)

This was Apple’s infamous “iPhone demand warning” that triggered the January 2019 market panic. It is a good example of how one outlier event can pull the whole trend line downward in your visualisation.
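
To see how much a single point can move a least-squares trend line, here is a small synthetic sketch. It compares the `np.polyfit` slope with a Theil-Sen slope (the median of all pairwise slopes), which is a different, more outlier-resistant estimator than the one used in the chart. The data and the `theil_sen_slope` helper are made up for illustration. The x values stand in for "time since the first earnings date" expressed in small units, which also keeps the fit better conditioned than raw nanosecond timestamps.

```python
import numpy as np


def ols_slope(x, y):
    """Least-squares slope, the same kind of fit np.polyfit(x, y, 1) draws."""
    return np.polyfit(x, y, 1)[0]


def theil_sen_slope(x, y):
    """Median of the slopes between every pair of points."""
    i, j = np.triu_indices(len(x), k=1)
    dx = x[j] - x[i]
    keep = dx != 0  # ignore pairs that share the same x
    return np.median((y[j] - y[i])[keep] / dx[keep])


# Synthetic series: a clean upward trend of 0.5 per step
x = np.arange(20, dtype=float)
y = 2 + 0.5 * x

# Same series with one extreme value at the end
y_outlier = y.copy()
y_outlier[-1] -= 30

print("OLS slope, clean:        ", round(ols_slope(x, y), 3))
print("OLS slope, with outlier: ", round(ols_slope(x, y_outlier), 3))
print("Theil-Sen, with outlier: ", round(theil_sen_slope(x, y_outlier), 3))
```

On this toy series the least-squares slope drops sharply once the outlier is added, while the median-based slope stays at the underlying trend. Drawing both lines on the bubble chart is one way to show readers how much of the "trend" is a single quarter.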

EPS Surprise Scatter Plot

After identifying major tech trends, let's now look at the EPS Surprise Scatter plots. This plot checks a simple hypothesis. Do earnings beats lead to positive returns, and do misses lead to negative returns? We plot EPS surprise on the x-axis and post-earnings returns on the y-axis, then add a regression line to show the average relationship.

# Prepare data: drop NaNs and convert earn_date if needed (not used here)
df_plot = (
    df_earnings
    .dropna(subset=['eps_surprise', 'pct_post_3d', 'pct_post_10d', 'sector'])
    .copy()
)

# 1. Scatter: EPS Surprise vs 3-Day Post-Return, colored by sector
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.scatterplot(
    data=df_plot,
    x='eps_surprise',
    y='pct_post_3d',
    hue='sector',
    alpha=0.6,
    s=40
)

# Regression line (overall)
slope, intercept, r_value, p_value, std_err = stats.linregress(df_plot['eps_surprise'], df_plot['pct_post_3d'])
line = slope * df_plot['eps_surprise'] + intercept
plt.plot(df_plot['eps_surprise'], line, 'red', linestyle='--', linewidth=2,
         label=f'y = {slope:.3f}x + {intercept:.2f}\nR²={r_value**2:.3f}')
plt.xlabel('EPS Surprise (%)')
plt.ylabel('3-Day Post-Earnings Return (%)')
plt.title('EPS Surprise vs 3-Day Post-Return by Sector')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)

# 2. Scatter: EPS Surprise vs 10-Day Post-Return, colored by sector
plt.subplot(1, 2, 2)
sns.scatterplot(
    data=df_plot,
    x='eps_surprise',
    y='pct_post_10d',
    hue='sector',
    alpha=0.6,
    s=40
)

# Regression line (overall)
slope10, intercept10, r_value10, p_value10, std_err10 = stats.linregress(df_plot['eps_surprise'], df_plot['pct_post_10d'])
line10 = slope10 * df_plot['eps_surprise'] + intercept10
plt.plot(df_plot['eps_surprise'], line10, 'red', linestyle='--', linewidth=2,
         label=f'y = {slope10:.3f}x + {intercept10:.2f}\nR²={r_value10**2:.3f}')
plt.xlabel('EPS Surprise (%)')
plt.ylabel('10-Day Post-Earnings Return (%)')
plt.title('EPS Surprise vs 10-Day Post-Return by Sector')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Optional: Summary table of correlations by sector
corr_3d = df_plot.groupby('sector')[['eps_surprise', 'pct_post_3d']].corr().unstack().xs('pct_post_3d', level=1, axis=1)['eps_surprise']
corr_10d = df_plot.groupby('sector')[['eps_surprise', 'pct_post_10d']].corr().unstack().xs('pct_post_10d', level=1, axis=1)['eps_surprise']

corr_df = pd.DataFrame({
    'Corr_EPS_3Day': corr_3d.round(3),
    'Corr_EPS_10Day': corr_10d.round(3)
}).sort_values('Corr_EPS_10Day', ascending=False)

print(corr_df)

The red dashed trend line illustrates the typical relationship: for every 1% EPS beat, stocks tend to gain about 0.05–0.1% over 3 to 10 days. The gentle slope suggests that while surprises can give a little boost, they don’t guarantee large moves.

You’ll notice that Consumer Cyclical dots mainly cluster in the upper right (beats leading to gains), and Real Estate shows a steeper increase. The wide spread around the line indicates that other factors often influence stock movements beyond surprises.
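
Because a few extreme surprises can dominate a Pearson correlation, it can help to check the per-sector relationship on ranks as well. The sketch below computes a rank (Spearman-style) correlation per sector by ranking both columns and correlating the ranks. The `rank_corr_by_sector` helper and the toy frame are hypothetical and only illustrate the idea. The column names match the ones used above.

```python
import pandas as pd


def rank_corr_by_sector(df, x="eps_surprise", y="pct_post_3d"):
    """Correlation of ranks between x and y, computed separately for each sector."""
    return (
        df.groupby("sector")[[x, y]]
          .apply(lambda g: g[x].rank().corr(g[y].rank()))
          .rename(f"rank_corr_{y}")
          .sort_values(ascending=False)
    )


# Toy data: sector A moves with the surprise, sector B moves against it
toy = pd.DataFrame({
    "sector": ["A"] * 4 + ["B"] * 4,
    "eps_surprise": [1, 2, 3, 400, 1, 2, 3, 4],
    "pct_post_3d": [2.0, 4.0, 6.0, 8.0, 8.0, 6.0, 4.0, 2.0],
})

rank_corr = rank_corr_by_sector(toy)
print(rank_corr)
```

The 400% surprise in sector A would stretch a straight-line fit, but it is simply the largest rank here, so the ordering relationship is still reported cleanly.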

Return Distribution Violins

Heatmaps show averages, but averages can hide risk. Violin plots show the full distribution of returns, including how wide the outcomes are and whether the tails are heavy. Here we plot 3-day post-earnings return distributions by sector and by market-cap bucket.

# Prepare data
df_plot = (
    df_earnings
    .dropna(subset=['pct_post_3d', 'sector', 'marketCap'])
    .copy()
)

# 1. Violin plot: 3-day post-returns by sector
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
sns.violinplot(
    data=df_plot,
    x='sector',
    y='pct_post_3d',
    inner='quartile',
    palette='Set2'
)
plt.title('Distribution of 3-Day Post-Earnings Returns by Sector (Violin)')
plt.xlabel('Sector')
plt.ylabel('3-Day Post-Earnings Return (%)')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3)

# 2. Violin plot: 3-day post-returns by market-cap group
plt.subplot(1, 2, 2)
sns.violinplot(
    data=df_plot,
    x='marketCap',
    y='pct_post_3d',
    inner='quartile',
    palette='Set3'
)
plt.title('Distribution of 3-Day Post-Earnings Returns by Market-Cap (Violin)')
plt.xlabel('Market-cap bucket')
plt.ylabel('3-Day Post-Earnings Return (%)')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


# Summary statistics table
summary = df_plot.groupby(['sector', 'marketCap'])['pct_post_3d'].agg(['mean', 'median', 'std', 'count']).round(2)
print("Summary Statistics: Mean/Median/Std/Count of 3-Day Returns by Sector & Market-Cap")
print(summary)

All violins concentrate near zero with modest variations (±5%), indicating that post-earnings reactions are generally noisy and lack a clear direction. Markets efficiently incorporate expectations, resulting in little predictable advantage. Consumer Cyclical and Materials sectors display slightly more frequent upside surprises, while small caps exhibit the greatest variability, reflecting higher risk and occasional gains. Not every visualization reveals alpha; this one honestly illustrates the difficulty involved.
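
The shape information a violin shows can also be put into numbers. The sketch below adds skew and the 5th and 95th percentiles to the usual mean and standard deviation, so that tail risk is visible in a table as well as in the plot. The `tail_summary` helper, the choice of percentiles, and the toy data are assumptions for illustration.

```python
import pandas as pd


def tail_summary(df, group_col, value_col="pct_post_3d"):
    """Per-group centre, spread, skew, and 5%/95% quantiles of a return column."""
    grouped = df.groupby(group_col, observed=True)[value_col]
    summary = grouped.agg(["count", "mean", "median", "std", "skew"])

    # Quantiles come back with a (group, quantile) index; spread them into columns
    quantiles = grouped.quantile([0.05, 0.95]).unstack()
    quantiles.columns = ["p05", "p95"]

    return summary.join(quantiles)


# Toy data: a symmetric group and a group with one large loss
toy = pd.DataFrame({
    "sector": ["A"] * 5 + ["B"] * 5,
    "pct_post_3d": [0.0, 1.0, 2.0, 3.0, 4.0, -20.0, 1.0, 1.0, 2.0, 2.0],
})

summary = tail_summary(toy, "sector")
print(summary.round(2))
```

Two groups can share a similar median while having very different 5th percentiles, which is the risk an average-only heatmap cannot show.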

Monthly Seasonality

After observing narrow return distributions near zero, let's now look at Monthly Seasonality in four panels: average 3‑day and 10‑day post‑earnings returns, EPS surprises, and event counts by month. This reveals calendar effects (systematic seasonal biases) that can influence the timing of entries despite noisy individual responses.

# 1. Ensure earn_date is datetime
df_month = (
    df_earnings
    .dropna(subset=['earn_date', 'pct_post_3d', 'pct_post_10d', 'eps_surprise'])
    .copy()
)

df_month['earn_date'] = pd.to_datetime(df_month['earn_date'])

# 2. Derive month number and name
df_month['month_num'] = df_month['earn_date'].dt.month
df_month['month_name'] = df_month['earn_date'].dt.strftime('%b')

# 3. Aggregate averages by month
monthly_agg = (
    df_month
    .groupby('month_num')
    .agg(
        pct_post_3d_mean=('pct_post_3d', 'mean'),
        pct_post_10d_mean=('pct_post_10d', 'mean'),
        eps_surprise_mean=('eps_surprise', 'mean'),
        n_obs=('earn_date', 'count')
    )
    .reset_index()
    .sort_values('month_num')
)

# Keep a stable month order and names
month_order = monthly_agg['month_num'].tolist()
month_labels = df_month.drop_duplicates('month_num').set_index('month_num')['month_name'].reindex(month_order)

monthly_agg['month_name'] = month_labels.values

# 4. Plot bar charts
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Monthly Seasonality of Post-Earnings Returns and EPS Surprise', fontsize=16)

# Avg 3-day return
axes[0, 0].bar(monthly_agg['month_name'], monthly_agg['pct_post_3d_mean'], color='skyblue')
axes[0, 0].set_title('Avg 3-Day Post-Earnings Return by Month')
axes[0, 0].set_ylabel('Return (%)')
axes[0, 0].grid(alpha=0.3)

# Avg 10-day return
axes[0, 1].bar(monthly_agg['month_name'], monthly_agg['pct_post_10d_mean'], color='lightgreen')
axes[0, 1].set_title('Avg 10-Day Post-Earnings Return by Month')
axes[0, 1].set_ylabel('Return (%)')
axes[0, 1].grid(alpha=0.3)

# Avg EPS surprise
axes[1, 0].bar(monthly_agg['month_name'], monthly_agg['eps_surprise_mean'], color='salmon')
axes[1, 0].set_title('Avg EPS Surprise by Month')
axes[1, 0].set_ylabel('EPS Surprise')
axes[1, 0].grid(alpha=0.3)

# Number of observations
axes[1, 1].bar(monthly_agg['month_name'], monthly_agg['n_obs'], color='gold')
axes[1, 1].set_title('Number of Earnings Events by Month')
axes[1, 1].set_ylabel('Count')
axes[1, 1].grid(alpha=0.3)

for ax in axes.ravel():
    ax.set_xlabel('Month')
    ax.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

Jan/Oct tend to have the best 3‑day returns, about 0.8%, while May/Jul usually see weaker results. The 10‑day trends show a similar but gentler pattern, with February and August reaching peaks. EPS surprises are slightly negative in January and May, possibly due to tough comparisons, and there are fewer events in July, August, and December because of holidays. While there’s a hint of seasonality, its impact is quite small, around 0.5%.
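
With monthly differences this small, it is worth asking whether they are distinguishable from noise. A rough way to check is to put a standard error next to each monthly mean. The sketch below does that on made-up data; the `monthly_mean_with_error` helper is hypothetical, and the t-statistic is only indicative because earnings events within the same month are not independent of each other.

```python
import pandas as pd


def monthly_mean_with_error(df, value_col="pct_post_3d", date_col="earn_date"):
    """Mean return per calendar month with its standard error and a rough t-statistic."""
    month = pd.to_datetime(df[date_col]).dt.month
    out = df.groupby(month)[value_col].agg(["mean", "sem", "count"])
    # Mean divided by its standard error: values near zero suggest noise
    out["t_stat"] = out["mean"] / out["sem"]
    out.index.name = "month"
    return out


# Toy data: three January events and two July events
toy = pd.DataFrame({
    "earn_date": ["2022-01-20", "2023-01-10", "2024-01-05", "2023-07-02", "2024-07-01"],
    "pct_post_3d": [3.0, 2.0, 1.0, 1.0, -1.0],
})

monthly = monthly_mean_with_error(toy)
print(monthly.round(3))
```

A month whose mean is within about two standard errors of zero is hard to call a seasonal edge, whatever the bar chart suggests.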

Regime Cross-Section

Finally, after subtle monthly patterns, we'll look at the Regime Cross‑Section: sector 10‑day post‑earnings returns by market regime (heatmap at the top, bars below). This stress‑tests earlier findings (do patterns persist across bull, bear, and COVID eras?), revealing rotation opportunities and regime dependence.

# Prepare data with year extraction
df_regimes = (
    df_earnings
    .dropna(subset=['earn_date', 'pct_post_10d', 'sector'])
    .copy()
)

df_regimes['earn_date'] = pd.to_datetime(df_regimes['earn_date'])
df_regimes['year'] = df_regimes['earn_date'].dt.year

# Define market regimes (adjust years based on your data/market history)
# Example: Bull (2023-2025), Bear/Transition (2022), COVID (2020-2021), etc.
def assign_regime(year):
    if year >= 2023:
        return 'Bull (2023+)'
    elif year == 2022:
        return 'Bear (2022)'
    elif 2020 <= year <= 2021:
        return 'COVID Recovery'
    elif 2018 <= year <= 2019:
        return 'Pre-COVID'
    else:
        return 'Earlier'

df_regimes['market_regime'] = df_regimes['year'].apply(assign_regime)

# 1. Aggregate: average 10-day returns by sector and regime/year
agg_data = (
    df_regimes
    .groupby(['sector', 'market_regime'])['pct_post_10d']
    .agg(['mean', 'count'])
    .reset_index()
    .query('count >= 5')  # Filter low-sample regimes
)

# 2. Visualization: Heatmap first (quick overview)
plt.figure(figsize=(12, 8))

plt.subplot(2, 1, 1)
pivot_heatmap = agg_data.pivot(index='sector', columns='market_regime', values='mean')
sns.heatmap(pivot_heatmap, annot=True, fmt='.2f', cmap='RdYlGn', center=0, linewidths=0.5)
plt.title('Average 10-Day Post-Earnings Returns: Sector x Market Regime Heatmap')

# 3. Bar charts: by regime (bars grouped by sector)
plt.subplot(2, 1, 2)
regime_order = agg_data.groupby('market_regime')['mean'].mean().sort_values(ascending=False).index
sns.barplot(data=agg_data, x='market_regime', y='mean', hue='sector',
            palette='Set2', order=regime_order)
plt.title('Average 10-Day Returns by Market Regime (Colored by Sector)')
plt.ylabel('10-Day Post-Return (%)')
plt.xlabel('Market Regime')
plt.xticks(rotation=45, ha='right')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# 4. Summary tables
print("Average Returns by Sector x Market Regime (min 5 obs):")
print(agg_data.pivot(index='sector', columns='market_regime', values='mean').round(2))

# 5. Ranking: Best/worst performing sectors by regime
print("\nTop/Bottom Sectors by Regime:")
for regime in regime_order:
    regime_data = agg_data[agg_data['market_regime'] == regime].sort_values('mean', ascending=False)
    print(f"\n{regime}:")
    print(regime_data[['sector', 'mean', 'count']].round(2).head(3))

Consumer Cyclical does well during Bull (2023+) and COVID Recovery (~1.5–2%), but it’s less favorable in Bear 2022. Utilities turned negative before COVID. The bottom bars show the COVID era led overall gains (~1%), with Basic Materials and Industrials being the strongest. The recent Bull remains positive but less so. Sector leadership shifts depending on the market regime: there are no consistent winners.

What Did We Get Out of All This Storyline?

Guiding you through six interconnected visualizations, we’ve turned 15 years of earnings data into a clear and engaging story.

Each chart responds to a specific question, yet together, they paint a bigger picture: earnings surprises influence markets, but not in the same way everywhere. Some sectors, periods, and regimes often provide consistent advantages, while others don’t.

Here’s what the data shows us:

No definitive alpha here, but specific opportunities are present: Markets are mostly efficient (returns hover near zero with weak surprise correlations), yet Consumer Cyclicals and Materials consistently show upside potential across different timeframes and market sizes. Timing your sector choice is important.
Timing windows alter the story: 3-day reactions benefit Real Estate mid-caps (+4%), while 10-day reactions shift leadership to Consumer Cyclical mega-caps (+3.2%). Don’t assume all earnings reactions occur at the same pace.
Mega-tech hype isn’t eternal: The bubble chart shows AAPL/MSFT/NVDA delivered strong returns from 2020–2022, but the falling trend since then indicates waning market enthusiasm. Don’t chase yesterday’s overhyped stocks.
Calendar patterns reward patience: January and October deliver slightly stronger post-earnings returns (~0.8%), while July and August tend to have lower liquidity. Combine seasonal timing with sector choices for additional gains.
Market regimes change winners: Consumer Cyclicals did well during the COVID recovery and the bull run (2023+) but lagged in the 2022 bear market, while Industrials peaked during the recovery. There are no universal “best performers,” only the best performers for now. Adjust to the regime.
The actionable setup: Small to mid-cap cyclical longs in January during bull markets combine all these signals for maximum conviction, where sector timing, seasonality, and regime alignment converge.

Final Thoughts

This exercise shows why visualization is important in finance: raw tables of returns and surprises wouldn’t reveal these patterns.

Heatmaps instantly highlighted sector winners.
Scatter plots demonstrated the weak surprise‑return connection.
Bubble charts narrated the mega‑tech story over time.
Violins unveiled the harsh truth that markets are noisy.
Cross‑sectional regime analysis reminded us that yesterday’s approach doesn’t ensure tomorrow’s returns.

The effort to interpret this data pays off: you shift from passive observation to active pattern recognition. You see not just what occurred, but where and when it happened. In trading and analysis, understanding the shape of complexity often surpasses having a perfect formula.

Visual storytelling turns data into intuition. And intuition, based on evidence, outperforms guesswork every time.

How to Build a Spam Email Detector with Python and Naive Bayes Classifier

Maku Gideon — Tue, 10 Mar 2026 23:27:52 +0000

Ever wondered how Gmail knows that an email promising you $10 million is spam? Or how it catches those "You've won a free iPhone!" messages before they reach your inbox?

In this tutorial, you'll build your own spam email classifier from scratch using the Naive Bayes algorithm. By the end, you'll have a working model that achieves over 97% accuracy—and you'll understand exactly how it works under the hood.

This project was inspired by the Python Machine Learning Workbook for Beginners by AI Publishing, which offers excellent hands-on ML projects for those starting their journey. (Note: I have no affiliation with the authors — I simply found it a useful resource.)

Prerequisites
Why Naive Bayes for Spam Detection?
How to Set Up Your Environment
How to Load and Explore the Dataset
How to Visualize the Data Distribution
How to Analyze Word Patterns with Word Clouds
How to Preprocess the Text Data
How to Convert Text to Numerical Features
How to Train the Naive Bayes Classifier
How to Evaluate Model Performance
How to Test on Individual Emails
Key Takeaways

What You'll Learn

How email spam filters actually work
The intuition behind the Naive Bayes algorithm
Text preprocessing techniques for machine learning
How to evaluate classification models
Building a complete spam detection pipeline in Python

Prerequisites

You should have basic familiarity with Python and some understanding of fundamental machine learning concepts. Don't worry if you're still learning—I'll explain everything as we go.

Why Naive Bayes for Spam Detection?

Before we dive into code, let's understand why Naive Bayes is particularly well-suited for this task.

Imagine you receive an email containing words like "free," "winner," "click here," and "limited time offer." Your brain immediately flags this as suspicious. The Naive Bayes algorithm does something similar—it calculates the probability that an email is spam based on the words it contains.

The algorithm is called "naive" because it makes a simplifying assumption: it treats each word as independent of every other word. In reality, word combinations matter (think "free trial" vs. "free money"), but this simplification works remarkably well in practice.

Why Choose Naive Bayes?

Speed: It trains incredibly fast, even on large datasets
Efficiency: Requires minimal training data to produce reliable results
Simplicity: Easy to implement and interpret
Performance: Despite its simplicity, it often outperforms more complex algorithms for text classification

Limitations to keep in mind:

The independence assumption means it can't capture relationships between words
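If a word never appeared alongside a given class in training, a plain frequency estimate gives it zero probability for that class, which zeroes out the whole product. Additive (Laplace) smoothing fixes this, and scikit-learn's MultinomialNB applies it by default (alpha=1.0)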
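To make the probability calculation and the zero-probability problem concrete, here is a minimal sketch of the Naive Bayes arithmetic on a made-up four-email corpus. The training emails, the helper functions, and the `alpha` parameter name are illustrative assumptions, not part of the tutorial's dataset or of scikit-learn's internals. Setting `alpha=1` applies Laplace (add-one) smoothing.

```python
import math
from collections import Counter

# Hypothetical toy training data: (text, label) with 1 = spam, 0 = legitimate
train = [
    ("free money now", 1),
    ("win free prize", 1),
    ("meeting at noon", 0),
    ("project meeting notes", 0),
]

# Count word occurrences per class and collect the vocabulary
word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1
vocab = set(word_counts[0]) | set(word_counts[1])


def word_prob(word, label, alpha):
    """P(word | class) with additive smoothing (alpha=0 means no smoothing)."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + alpha) / (total + alpha * len(vocab))


def log_score(text, label, alpha=1.0):
    """log P(class) plus the sum of log P(word | class); unknown words are skipped."""
    score = math.log(class_counts[label] / len(train))
    for word in text.split():
        if word in vocab:
            score += math.log(word_prob(word, label, alpha))
    return score


def predict(text, alpha=1.0):
    """Return the class (0 or 1) with the higher log score."""
    return max((0, 1), key=lambda label: log_score(text, label, alpha))


# "meeting" never appears in the spam emails, so its unsmoothed estimate is zero
print(word_prob("meeting", 1, alpha=0))  # 0.0
print(word_prob("meeting", 1, alpha=1))  # 0.0625, small but non-zero
print(predict("free prize meeting"))     # 1 (spam)
```

Working in log space avoids multiplying many tiny probabilities together, and smoothing keeps a single unseen word from vetoing an otherwise obvious classification.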

Now let's build our spam detector.

How to Set Up Your Environment

First, install the required libraries. Open your terminal or run this in a Jupyter notebook cell:


%pip install nltk wordcloud numpy pandas seaborn matplotlib scikit-learn

Here's a quick summary of what each library does:

re: Python's built-in regular expression module, used for cleaning text with pattern matching (nothing to install)
nltk: supplies the English stop word list
wordcloud: for visualizing which words appear most frequently
numpy and pandas: for data loading and manipulation
seaborn and matplotlib: for charts and visualizations
scikit-learn: provides the Naive Bayes classifier, vectorizer, and evaluation tools

Once installation is complete, import everything at the top of your script or notebook. Grouping all imports at the top is a Python best practice — it makes dependencies easy to spot at a glance.

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Natural language processing
import nltk
import re
from nltk.corpus import stopwords

# Machine learning
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Word cloud visualization
from wordcloud import WordCloud

How to Load and Explore the Dataset

We'll use a dataset of labeled emails. You can download it from Kaggle or use any similar email dataset with text and spam columns.

Use pandas' read_csv() function to load the dataset from a CSV file into a DataFrame — a table-like structure that makes it easy to inspect and manipulate data. The head() method then displays the first 5 rows so you can confirm the data loaded correctly and understand its structure.

message_dataset = pd.read_csv('emails.csv')
message_dataset.head()

Output:

	text	spam
0	Subject: naturally irresistible your corporate...	1
1	Subject: the stock trading gunslinger fanny i...	1
2	Subject: unbelievable new homes made easy im ...	1
3	Subject: 4 color printing special request add...	1
4	Subject: do not have money , get software cds ...	1

Next, call shape on the DataFrame to check its dimensions — this returns a tuple of (rows, columns) and is a quick way to confirm you loaded the full dataset without truncation.

# Get the dimensions of our dataset (rows, columns)
message_dataset.shape

Output:

(5728, 2)

The dataset contains 5,728 emails with two columns: text (the email content) and spam (1 for spam, 0 for legitimate emails).

How to Visualize the Data Distribution

Before training any model, it's crucial to understand your data. Let's see how spam and legitimate emails are distributed.

value_counts() tallies how many emails belong to each class (spam vs. legitimate). Chaining .plot(kind="pie") on the result converts those counts directly into a pie chart. The autopct="%1.0f%%" argument tells matplotlib to label each slice with its percentage, rounded to the nearest whole number.

plt.rcParams["figure.figsize"] = [8, 10]
message_dataset.spam.value_counts().plot(kind="pie", autopct="%1.0f%%")

Output:

You'll see that approximately 24% of emails in the dataset are spam, while 76% are legitimate. This is a moderately imbalanced dataset, which we'll keep in mind when evaluating our model.

How to Analyze Word Patterns with Word Clouds

Word clouds provide an intuitive visualization of the most frequent words in a text corpus. Words that appear more often are rendered larger. Let's create separate word clouds for spam and legitimate emails to identify distinguishing patterns.

First, we need to remove stop words — common words like "the," "is," and "at" that appear everywhere and carry no meaningful signal for classification. NLTK's stopwords.words("english") returns a pre-built list of these words. The apply() method runs a function across every row in the column, and the lambda inside it splits each email into individual words, filters out any stop words, then rejoins the remaining words into a clean string.

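# Download the stop word list once (no-op if it is already present)
nltk.download("stopwords", quiet=True)
stop = set(stopwords.words("english"))

message_dataset["text_without_sw"] = message_dataset["text"].apply(
    # Join with a space so the remaining words stay separate
    lambda x: " ".join([item for item in x.split() if item.lower() not in stop])
)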

Now let's visualize the spam emails. We filter the DataFrame to rows where spam == 1, join all that text into a single large string, and pass it to WordCloud().generate(). The imshow() function renders the resulting image, and axis("off") hides the x/y axes since they're not meaningful for an image display.

message_dataset_spam = message_dataset[message_dataset["spam"] == 1]

plt.rcParams["figure.figsize"] = [8, 10]
text = ' '.join(message_dataset_spam['text_without_sw'])
wordcloud2 = WordCloud().generate(text)

plt.imshow(wordcloud2)
plt.axis("off")
plt.show()

Output:

Now do the same for legitimate emails by filtering to rows where spam == 0:

message_dataset_ham = message_dataset[message_dataset["spam"] == 0]

plt.rcParams["figure.figsize"] = [8, 10]
text = ' '.join(message_dataset_ham['text_without_sw'])
wordcloud2 = WordCloud().generate(text)

plt.imshow(wordcloud2)
plt.axis("off")
plt.show()

Output:

Key observations:

Spam emails frequently contain promotional language: "free," "money," "offer," "click," "please"
Legitimate emails contain more conversational and work-related terms: "company," "time," "thanks"

You'll also notice the word "enron" appearing prominently in the legitimate emails cloud. This is because the non-spam emails in this dataset are drawn from the publicly available Enron email corpus, a large collection of real internal emails from Enron Corporation that was made public during the federal investigation into the company's 2001 collapse. It has since become one of the most widely used benchmark datasets in NLP research, which is why "enron" shows up so frequently as a word in legitimate email content.

These patterns give us confidence that word-based classification will work well.

How to Preprocess the Text Data

Raw text needs cleaning before machine learning algorithms can process it effectively. Let's first separate our features from our labels. In ML terminology, X holds the inputs (the email text we use to make predictions) and y holds the target labels (1 for spam, 0 for legitimate).

X = message_dataset["text"]
y = message_dataset["spam"]

Now we'll define a function to clean the text. The re.sub() function from Python's built-in re module performs pattern-based substitution using regular expressions. We call it three times in sequence:

re.sub('[^a-zA-Z]', ' ', doc) — replaces anything that isn't a letter (numbers, punctuation, symbols) with a space. This strips noise that doesn't help with classification.
re.sub(r'\s+[a-zA-Z]\s+', ' ', document) — removes isolated single characters (like "I" or "a" left behind after removing punctuation) by matching any single letter surrounded by whitespace.
re.sub(r'\s+', ' ', document) — collapses multiple consecutive spaces into a single space, tidying up any extra gaps created by the previous two steps.

def clean_text(doc):
    document = re.sub('[^a-zA-Z]', ' ', doc)
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    document = re.sub(r'\s+', ' ', document)
    return document
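As a quick sanity check, the sketch below runs the same three substitutions on a single made-up email (the sample string is an assumption, not a row from the dataset) so you can see what survives the cleaning.

```python
import re


def clean_text(doc):
    # Same three substitutions as the function above
    document = re.sub('[^a-zA-Z]', ' ', doc)             # keep letters only
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)  # drop isolated single letters
    document = re.sub(r'\s+', ' ', document)             # collapse repeated whitespace
    return document


# Made-up example email, not taken from the dataset
sample = "Subject: You've WON $1,000,000!!! Claim it now."
cleaned = clean_text(sample)
print(cleaned.split())
# ['Subject', 'You', 've', 'WON', 'Claim', 'it', 'now']
```

Notice that the function keeps the original letter case and splits contractions such as "You've" into two tokens. The case difference doesn't matter later on, because TfidfVectorizer lowercases text by default.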

Apply this cleaning function to every email in the dataset. We first convert the pandas Series to a plain Python list using list(), then loop through each email, clean it, and collect the results in X_sentences.

# Create an empty list to store cleaned emails
X_sentences = []

# Convert the pandas Series to a list for iteration
emails = list(X)

# Clean each email and add it to our list
for email in emails:
    X_sentences.append(clean_text(email))

How to Convert Text to Numerical Features

Machine learning algorithms work with numbers, not text. We need to transform our cleaned text into a numerical representation.

TF-IDF (Term Frequency-Inverse Document Frequency) is a great choice for this. It assigns each word a score that reflects how important it is to a particular document relative to the entire dataset. A word that appears often in one email but rarely across all emails gets a high score — meaning it's distinctive and likely meaningful. Common words that appear everywhere get a lower score.

TfidfVectorizer from scikit-learn handles this transformation. The parameters we set control what gets included:

max_features=2500 — only keeps the 2,500 most frequent words, discarding rare ones that don't generalize well
min_df=5 — ignores words that appear in fewer than 5 emails (too rare to be useful)
max_df=0.7 — ignores words that appear in more than 70% of all emails (too common to be distinctive)
stop_words=stopwords.words('english') — removes common English words like "the" and "is"

fit_transform() does two things in one step: it learns the vocabulary from our text (fit), then converts each email into a numerical vector based on that vocabulary (transform). The result is a sparse matrix, which stores only non-zero values for efficiency. Calling .toarray() converts it into a regular dense NumPy array, which is easier to inspect. This step is optional for our model, since MultinomialNB also accepts sparse input directly, and skipping it saves memory on larger datasets.

vectorizer = TfidfVectorizer(
    max_features=2500,
    min_df=5,
    max_df=0.7,
    stop_words=stopwords.words('english')
)

X = vectorizer.fit_transform(X_sentences).toarray()

Each email is now represented as a vector of 2,500 numbers, where each number is the TF-IDF score for a specific word.
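To see where those scores come from, here is a small hand-rolled sketch of the TF-IDF calculation on a made-up three-document corpus. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which mirrors scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`). The corpus and the `tfidf_vector` helper are illustrative assumptions, and the sketch skips the vectorizer's tokenisation and filtering options.

```python
import math
from collections import Counter

# Made-up mini corpus standing in for the cleaned emails
docs = [
    "free money free offer",
    "meeting schedule today",
    "free meeting offer",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))


def tfidf_vector(tokens):
    """Term count times smoothed IDF, scaled to unit (L2) length."""
    tf = Counter(tokens)
    raw = [tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


vectors = [tfidf_vector(doc) for doc in tokenized]

# Scores for the first document, one per vocabulary word
for word, score in zip(vocab, vectors[0]):
    print(f"{word:10s} {score:.3f}")
```

In the first document, "money" and "offer" both occur once, but "money" scores higher because it appears in fewer documents overall. Words that are absent from the document score zero.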

How to Train the Naive Bayes Classifier

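Now comes the exciting part: training our model! First, split the data into training and test sets using train_test_split(). This function randomly shuffles and divides both X and y simultaneously, keeping labels aligned with their corresponding emails. Setting test_size=0.20 reserves 20% of the data for testing. Setting random_state=42 seeds the random number generator so you get the same split every time you run the code, making your results reproducible. One caveat: the vectorizer above was fitted on the full dataset before this split, so the test emails already influenced the vocabulary and IDF weights. For a stricter evaluation, split first and fit the vectorizer on the training portion only.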

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42
)
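If you're curious what a seeded, aligned split looks like under the hood, the sketch below reproduces the idea with the standard library only. The `simple_split` helper and the toy data are assumptions for illustration. This is not how scikit-learn implements train_test_split() internally, and it won't produce the same rows as random_state=42 does there.

```python
import math
import random


def simple_split(features, labels, test_size=0.20, seed=42):
    """Shuffle row indices with a seeded RNG, then slice features and labels together."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)          # same seed, same order
    n_test = math.ceil(len(indices) * test_size)  # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [features[i] for i in train_idx],
        [features[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Toy data: the label of "email i" is i % 2, so alignment is easy to verify
emails = [f"email {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_split(emails, labels)
print(len(X_tr), len(X_te))                                        # 8 2
print(simple_split(emails, labels) == simple_split(emails, labels))  # True
print(math.ceil(5728 * 0.20))                                      # 1146
```

Because the same shuffled indices are used for both lists, every email keeps its own label. Rounding the test share up is consistent with the 1,146 test emails that show up in the evaluation output later on.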

Now train the Multinomial Naive Bayes classifier. MultinomialNB is designed for count-like features such as word counts. TF-IDF scores are fractional rather than true counts, but the scikit-learn documentation notes that they also work in practice. Calling fit(X_train, y_train) trains the model by estimating how likely each word is to appear in spam versus legitimate emails across the training set. Those probability tables are what the model uses later to classify new emails.


spam_detector = MultinomialNB()
spam_detector.fit(X_train, y_train)

That's it! Naive Bayes trains remarkably fast: on a dataset of this size, fitting typically takes well under a second.

How to Evaluate Model Performance

Let's see how well our spam detector performs on emails it has never seen before. The predict() method takes the test set features and returns a predicted label (0 or 1) for each email, based on the probability tables the model learned during training.


y_pred = spam_detector.predict(X_test)

Now evaluate the predictions using three different tools from scikit-learn's metrics module:

confusion_matrix() — produces a 2×2 grid comparing actual vs. predicted labels, showing exactly where the model gets things right and wrong
classification_report() — prints precision, recall, and F1-score for each class, giving a more complete picture than accuracy alone
accuracy_score() — returns the overall percentage of correct predictions


print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

Output:

[[849   7]
 [ 18 272]]

              precision    recall  f1-score   support

           0       0.98      0.99      0.99       856
           1       0.97      0.94      0.96       290

    accuracy                           0.98      1146
   macro avg       0.98      0.96      0.97      1146
weighted avg       0.98      0.98      0.98      1146

0.9781849912739965

Our model achieves 97.82% accuracy! Let's break down what the confusion matrix tells us:

849: Legitimate emails correctly identified as legitimate (True Negatives)
7: Legitimate emails incorrectly marked as spam (False Positives)
18: Spam emails that slipped through as legitimate (False Negatives)
272: Spam emails correctly caught (True Positives)

The classification report shows:

For legitimate emails (class 0): 98% precision, 99% recall
For spam emails (class 1): 97% precision, 94% recall
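Every figure in the report can be reproduced from the four confusion matrix counts. The short sketch below does that arithmetic directly, using the counts from the output above. The variable names are just for illustration.

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 849, 7, 18, 272

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Spam (class 1): how many flagged emails were spam, and how much spam was caught
precision_spam = tp / (tp + fp)
recall_spam = tp / (tp + fn)
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

# Legitimate (class 0): the same ratios from the other class's point of view
precision_ham = tn / (tn + fn)
recall_ham = tn / (tn + fp)

print(f"accuracy        {accuracy:.4f}")        # 0.9782
print(f"spam precision  {precision_spam:.2f}")  # 0.97
print(f"spam recall     {recall_spam:.2f}")     # 0.94
print(f"spam F1         {f1_spam:.2f}")         # 0.96
print(f"ham precision   {precision_ham:.2f}")   # 0.98
print(f"ham recall      {recall_ham:.2f}")      # 0.99
```

The lower spam recall reflects the 18 spam emails that slipped through, while the 7 false positives are what keep spam precision below 100%.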

These numbers are impressive, especially considering the simplicity of our approach.

How to Test on Individual Emails

Let's verify our model works by testing it on a specific email. We'll first print the cleaned text at index 56 and its actual label to see what we're working with. Then we'll ask the model to predict it.


print(X_sentences[56])
print(y[56])

Output:

Subject localized software all languages available hello we would like to offer localized software versions german french spanish uk and many others aii iisted software is available for immediate downioad no need to wait week for cd deiivery just few exampies norton lnternet security pro windows xp professionai with sp fuil version corei draw graphics suite dreamweaver mx homesite inciudinq macromedia studio mx just browse our site and find any software you need in your native ianguaqe best reqards kayieen 
1

This is clearly a spam email trying to sell pirated software. The actual label is 1 (spam). Now pass this single email through the same pipeline — first transforming it into a TF-IDF vector using the already-fitted vectorizer, then calling predict() on the result. It's important to use the same vectorizer that was fitted on the training data, so the word-to-index mapping is consistent.


print(spam_detector.predict(vectorizer.transform([X_sentences[56]])))

Output:

[1]

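The model correctly identifies this promotional email as spam. Keep in mind that index 56 comes from the full dataset, so this email may have been part of the training split. Treat this as a check that the pipeline works end to end rather than as evidence of performance on unseen mail.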

Key Takeaways

Naive Bayes is powerful for text classification despite its simplifying assumptions. For spam detection, it achieves excellent accuracy with minimal computational cost.
Text preprocessing matters. Removing noise (special characters, numbers, extra spaces) helps the algorithm focus on meaningful patterns.
TF-IDF captures word importance effectively. It gives higher weight to distinctive words that help differentiate spam from legitimate emails.
Always evaluate with multiple metrics. Accuracy alone can be misleading, especially with imbalanced datasets. Precision, recall, and F1-score give a complete picture.
Start simple. Before reaching for complex deep learning models, try classical algorithms like Naive Bayes. They're interpretable, fast, and often surprisingly effective.

Next Steps

Want to improve this spam detector further? Here are some ideas:

Experiment with different vectorizers: Try CountVectorizer or word embeddings (Word2Vec, GloVe)
Handle class imbalance: Use techniques like SMOTE or adjust class weights
Feature engineering: Add features like email length, number of links, or sender domain
Try other algorithms: Compare with SVM, Random Forest, or gradient boosting
Deploy the model: Build a simple API using Flask or FastAPI

Conclusion

You've built a spam email classifier that achieves over 97% accuracy using the Naive Bayes algorithm. Along the way, you learned about text preprocessing, feature extraction with TF-IDF, and model evaluation techniques.

The beauty of this approach is its simplicity. With just a few dozen lines of code, you've created something that actually works—and now you understand the principles behind commercial spam filters.

Feel free to experiment with the code, try different parameters, and see how the results change. That's the best way to deepen your understanding.

References

Python Machine Learning Workbook for Beginners: 10 Machine Learning Projects Explained from Scratch by AI Publishing

How to Create Boxplots and Model Data in R Using ggplot2

Tiffany Mojo Omondi — Thu, 15 Jan 2026 18:48:32 +0000

In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn about statistical modelling using linear regression and logistic regression in R.

By the end of this article, you should understand how to create boxplots in R, why they matter, and how they fit into a real-world analytics workflow.

Prerequisites
How to Set Up Your R Environment
How to Load and Inspect the Data
How to Clean and Prepare the Data
How to Use Boxplots
How to Create Boxplots with ggplot2
How to Perform Exploratory Data Analysis
How to Build Linear Regression Models
How to Build Logistic Regression Models
Why Visualization Comes Before Modeling
Conclusion

Prerequisites

Before you begin, you should be comfortable with the following:

Basic R syntax (variables, functions, data frames).
Installing and loading R packages.
Understanding what rows and columns represent in a dataset.
Very basic statistics (mean, median, distributions).

How to Set Up Your R Environment

Start by installing and loading the packages you will need.

install.packages(c("tidyverse", "ggplot2"))
library(tidyverse)
library(ggplot2)

tidyverse provides tools for data manipulation and visualization. ggplot2, which ships as part of the tidyverse, is the visualization engine you will use for boxplots. Loading the libraries makes their functions available for use.

How to Load and Inspect the Data

First, download the HR Analytics dataset by Saad Haroon from Kaggle.

Assuming the downloaded dataset is saved as "C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv", load it into R using that path (adjust the path to match where the file lives on your machine).

You can view a sample of the dataset by running the head function. To view the structure of the dataset, you can run the str function.

hr <- read.csv("C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv")
head(hr)
str(hr)

The read.csv function imports the dataset into R. The head function shows the first six rows so you can preview the data. The str function reveals data types, helping you spot categorical versus numeric variables early.

Remember that understanding your data structure early prevents errors later when plotting or modeling. Once you run the head function, you should see the following in your console:

From the head function, you can see:

Structure

Each row represents one employee.
Each column represents a feature/variable about the employee.

Key Columns & Meaning

EmpID → Employee identifier
Age → Age in years
AgeGroup → Age category (for example, 18-25)
Attrition → Whether the employee left or not (Yes/No)
BusinessTravel → Travel frequency (Travel_Rarely, Travel_Frequently, Non-Travel)
Department → Employee department
DistanceFromHome → Distance from home to office (km)
Education / EducationField → Level and field of education
EmployeeCount → Usually 1 per employee (redundant)
Gender → Male / Female
JobRole / JobSatisfaction → Job title and satisfaction level
MonthlyIncome / SalarySlab → Salary amount and category
YearsAtCompany / YearsInCurrentRole → Experience metrics
OverTime → Works overtime (Yes/No)
Other features: PerformanceRating, TrainingTimesLastYear, WorkLifeBalance, StockOptionLevel, and so on.

Data Types

Numeric → Age, DistanceFromHome, MonthlyIncome, YearsAtCompany
Categorical / Character → Attrition, Gender, Department, JobRole

Observations

The dataset is tabular, like a spreadsheet.
There are multiple categorical columns.
There are multiple numeric columns.
Some columns seem redundant or constant and don’t provide useful information because every row has the same value (for example, EmployeeCount).

From the str function, you can gather that:

The dataset contains 1,480 observations and 38 variables. Each row represents one employee, and each column represents a feature about that employee.

Each column has a name, data type, and example values. For instance, Age and DistanceFromHome are numeric (int), with values like 28 or 12. EmpID and Department are character strings (chr), with examples like Research & Development or Sales. Other features include JobRole (Analyst, Manager) and Attrition (Yes/No).

The dataset contains mixed data types. Some columns are numeric, such as MonthlyIncome or YearsAtCompany. Some are character or categorical, like Gender (Male/Female) and BusinessTravel (Travel_Rarely, Travel_Frequently). A few columns are redundant or constant. For example, EmployeeCount has the same value of 1 for all rows and does not provide useful information.

How to Clean and Prepare the Data

Before visualization, you must clean your data. In order to find out what you need to clean you can investigate the data.

Run the summary function to view the statistics of the dataset. You also need to run the is.na function to identify missing values to be removed.

summary(hr)
colSums(is.na(hr))

The summary function gives quick statistics and flags suspicious values. The is.na function checks for missing data. Boxplots make extreme values highly visible, so it helps to know in advance which values are genuine and which are data errors.

After running the summary function, the following will appear in your console:

This shows the basic statistics of each column. After running the is.na function, the following will also appear in your console:

From this output, you can see that only YearsWithCurrManager has missing values: 57 employees don’t have a value for this column.

You can drop this whole column along with the other redundant columns we saw earlier on. You can do this with the code below.

hr <- hr %>% select(-c(EmployeeCount, Over18, StandardHours, YearsWithCurrManager))

To verify if the columns are gone, use this code:

colnames(hr)

Now we need to convert important categorical variables to factors. Doing this tells R that a column holds a fixed set of categories (for example, ‘Yes’ and ‘No’ for Attrition) rather than free text.

hr$Attrition <- as.factor(hr$Attrition)
hr$JobRole <- as.factor(hr$JobRole)
hr$Department <- as.factor(hr$Department)

This also ensures ggplot2 treats them correctly when grouping.

How to Use Boxplots

A boxplot displays key features of a dataset. The median is shown by the line in the middle of the box. The box itself spans the interquartile range (IQR), from the first quartile to the third. By default in ggplot2, the whiskers extend to the most extreme values that lie within 1.5 × IQR of the box, and any points beyond them appear individually as outliers.

Boxplots are mostly useful when you want to compare distributions across groups, such as income by job role or age by attrition status.
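The numbers behind a boxplot can be worked out by hand. The sketch below is in Python rather than R, uses made-up values instead of the HR data, and assumes the common 1.5 × IQR whisker convention with linearly interpolated quartiles. Treat it as an illustration of the idea rather than a reproduction of ggplot2's internals.

```python
import statistics

# Made-up values with one obvious extreme point
values = [2, 3, 4, 5, 6, 7, 8, 30]

# Quartiles by linear interpolation between the sorted data points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything outside them is drawn as an individual outlier point
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [v for v in values if lower_fence <= v <= upper_fence]
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("box:", q1, median, q3)                   # box: 3.75 5.5 7.25
print("whiskers:", min(inside), max(inside))    # whiskers: 2 8
print("outliers:", outliers)                    # outliers: [30]
```

The whiskers stop at the last real data points inside the fences (2 and 8), not at the fences themselves, which is why the value 30 shows up as a separate dot.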

Let’s start with a simple boxplot of monthly income.

ggplot(hr, aes(y = MonthlyIncome)) +
  geom_boxplot(fill = "blue") +
  labs(
    title = "Distribution of Monthly Income",
    y = "Monthly Income")

The aes function tells ggplot what variable to plot. geom_boxplot draws the boxplot. The labs function labels parts of the plot drawn, that is the x axis, y axis, and the title.

How to Create Boxplots with ggplot2

Now let’s compare income across job roles.

ggplot(hr, aes(x = JobRole, y = MonthlyIncome)) +
  geom_boxplot(fill = "lightblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Monthly Income by Job Role",
    x = "Job Role",
    y = "Monthly Income")

The x aesthetic lists all the job roles. The labels are rotated to improve readability. This visualization quickly reveals income differences across roles.

How to Perform Exploratory Data Analysis (EDA)

Exploratory data analysis involves using visual methods to ask questions and gain a deeper understanding of the data.

We can use the example of Years at company by department.

ggplot(hr, aes(x = Department, y = YearsAtCompany)) +
  geom_boxplot(fill = "darkblue") +
  labs(
    title = "Years at Company by Department",
    y = "Years at Company")

How to Build Linear Regression Models

To see how linear regression works, model MonthlyIncome using YearsAtCompany with the commands below.

The first command creates the model, while the second displays its summary.

hr_lm <- lm(MonthlyIncome ~ YearsAtCompany, data = hr)
summary(hr_lm)

Linear regression estimates how income changes with tenure. This works when the variables are numeric.

After running the code, your console should show you this output:

Call:
lm(formula = MonthlyIncome ~ YearsAtCompany, data = hr)

Residuals:
   Min     1Q Median     3Q    Max 
 -9506  -2488  -1186   1403  15483 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3734.47     159.41   23.43   <2e-16 ***
YearsAtCompany   395.25      17.14   23.07   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4032 on 1478 degrees of freedom
Multiple R-squared:  0.2647,    Adjusted R-squared:  0.2642 
F-statistic:   532 on 1 and 1478 DF,  p-value: < 2.2e-16

Let’s interpret this model.

If an employee has 0 years at the company, their predicted monthly income is $3734.47. This comes from the intercept.

For each year an employee spends at the company, their monthly income is predicted to increase by $395.25.

Both coefficients have p-values < 2e-16, which means they are highly significant. This is strong evidence that years at the company is associated with income, although a regression on observational data like this cannot show that tenure causes higher pay.

The model’s R-squared is 0.2647. This means about 26% of the variation in monthly income is explained by the years an employee spends at the company. This is low, so other factors like role, department, or education likely affect income too.

The model’s F-statistic is 532, with a p-value < 2.2e-16. This means the model is statistically significant overall.

In general, the longer an employee stays at a company, the more they earn: predicted monthly income rises by roughly $395 for each additional year. But years at the company alone explain only about a quarter of the variation in income. You need to consider other variables for better predictions.
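To turn the fitted coefficients into concrete predictions, you can plug them into the straight line equation. The sketch below does this in Python rather than R, using the intercept and slope from the output above. The `predicted_income` helper is an illustrative name, not part of the tutorial's code.

```python
# Coefficients copied from the lm() summary above
intercept = 3734.47
slope = 395.25


def predicted_income(years_at_company):
    """Predicted monthly income from the fitted line: intercept + slope * years."""
    return intercept + slope * years_at_company


for years in (0, 5, 10):
    print(years, round(predicted_income(years), 2))
# 0 3734.47
# 5 5710.72
# 10 7686.97
```

These are averages, not guarantees. The residual standard error of 4032 in the output means individual incomes typically sit about $4,000 away from the predicted value.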

How to Build Logistic Regression Models

Next, you’ll predict attrition with logistic regression. The first command fits the model, while the second displays its summary.

hr_glm <- glm(
  Attrition ~ MonthlyIncome + YearsAtCompany,
  data = hr,
  family = binomial)

summary(hr_glm)

Your console should show this as an output when you run both commands.

Call:
glm(formula = Attrition ~ MonthlyIncome + YearsAtCompany, family = binomial, 
    data = hr)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.094e-01  1.375e-01  -5.886 3.96e-09 ***
MonthlyIncome  -9.449e-05  2.302e-05  -4.104 4.05e-05 ***
YearsAtCompany -5.047e-02  1.792e-02  -2.817  0.00485 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1305.4  on 1479  degrees of freedom
Residual deviance: 1252.5  on 1477  degrees of freedom
AIC: 1258.5

Number of Fisher Scoring iterations: 5

Logistic regression is used for binary outcomes, that is, yes or no. It estimates probability.

Let’s interpret this logistic regression model. The model predicts whether an employee is likely to leave the company (Attrition) based on their Monthly Income and Years at Company.

The intercept is -0.809. This is the baseline log-odds of leaving when their income and years at the company are zero.

The employees’ Monthly Income has a coefficient of -0.0000945. This means that as their income increases, their chance of leaving decreases slightly. An increase in income makes them less likely to quit.

The employees’ Years at Company have a coefficient of -0.0505. This shows that the longer they stay, the less likely they are to leave. Each additional year reduces their attrition probability.

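All coefficients are statistically significant (p < 0.01), so both Monthly Income and Years at Company are reliably associated with an employee’s likelihood of staying. Significance alone does not tell you how large each effect is.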

The model’s residual deviance is 1252.5, lower than the null deviance of 1305.4. This means the model explains some of the variation in attrition.

The key takeaway is that if an employee earns more and stays longer at the company, they are less likely to leave. These factors matter, but other elements also influence attrition.
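To make the coefficients more concrete, the sketch below converts them into odds ratios and a predicted probability. It is written in Python rather than R purely for illustration. The coefficient values are copied from the summary output above, while the function name and the example employee (monthly income of 5,000 and 5 years at the company) are hypothetical.

```python
import math

# Coefficients copied from the summary(hr_glm) output above.
INTERCEPT = -8.094e-01
B_INCOME = -9.449e-05
B_YEARS = -5.047e-02


def attrition_probability(monthly_income, years_at_company):
    """Turn the model's log-odds into a probability with the logistic function."""
    log_odds = INTERCEPT + B_INCOME * monthly_income + B_YEARS * years_at_company
    return 1 / (1 + math.exp(-log_odds))


# exp(coefficient) is the odds ratio for a one-unit increase in that predictor.
odds_ratio_per_year = math.exp(B_YEARS)
odds_ratio_per_1000_income = math.exp(B_INCOME * 1000)

# Hypothetical employee, used only to illustrate the calculation.
p = attrition_probability(monthly_income=5000, years_at_company=5)

print(f"Odds ratio per extra year: {odds_ratio_per_year:.3f}")
print(f"Odds ratio per extra 1,000 of income: {odds_ratio_per_1000_income:.3f}")
print(f"Predicted attrition probability: {p:.3f}")
```

An odds ratio below 1 means the odds of leaving shrink as the predictor grows, which matches the negative signs of both coefficients.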

Why Visualization Comes Before Modeling

Boxplots help you to:

Detect outliers: Boxplots highlight extreme values that interfere with model results.
Compare groups: Boxplots allow quick comparison of distributions across different categories.
Form hypotheses: Visual patterns assist in identifying relationships worth testing in a model.
Validate modeling assumptions: Boxplots help check distribution shape and variance before modeling.

Modeling without visualization often leads to misinterpretation or false confidence.

Conclusion

In this tutorial, you learned how to load and clean data and why boxplots matter. You also learned how to use ggplot2 to compare distributions, perform exploratory data analysis (EDA), build linear and logistic regression models, and link visualization insights to modeling results.

How Neural Networks Work – Explained Using the Straight Line Equation y = ax + b

Samyukta Hegde — Thu, 08 Jan 2026 00:02:44 +0000

Did you know that every data scientist who builds a complex neural network starts with a fundamental question, “How does the output change when the input changes?”

A straight line equation y = ax+b answers it in the simplest way possible. y can increase, decrease, or stay the same when x changes.

On the other hand, a deep neural network tries to answer it in a flexible way. This is possible because multiple layers of straight line calculations are stacked one over another, along with non-linear adjustments that help the network adapt and produce the desired result.

Since a straight line is the essence of neural networks, I think it’s time we try to understand the subtle details of y = ax+b, which I refer to as the magical equation. We’ll also go through the basics of linear regression and classification, which should help you understand the progression of a simple straight line to a complex deep neural network.

Prerequisites
y=ax+b
Linear Regression
Linear Classification
Comparison
Key Additions to Help Build Deep Neural Networks
Modelling a Deep Neural Network
Final Thoughts

Prerequisites

A basic understanding of linear algebra, particularly y=ax+b.
General idea about linear regression and classification.
Familiarity with the concept of deep neural networks.

y=ax+b

A straight line simply means that output changes steadily as input changes. There are no surprises (that is, no non-linearity). Let’s analyze it properly.

y => Output variable
x => Input variable
a => Amount by which y changes when x changes (slope)
b => Value of y when x is 0 (y intercept)

We can take an example and model it in the same form to understand it better.

Ms. Poly is a math teacher who wants to formulate a study plan for her students to excel in an upcoming final exam. For simplicity, she creates a rule of thumb using only one factor: the number of hours studied per week. It has a direct impact on the marks scored by a student.

Before beginning, she makes certain assumptions:

Every student is capable of scoring at least 30 without studying.
For every hour a student studies, an additional 3 marks can be scored.

She then comes up with the following equation based on her ideas: y = 3x+30

y => Marks scored.
x => Number of hours studied.
a=3 => Increase in marks for every hour studied
b=30 => Minimum marks

In the above graph, she plots the points based on the results of the equation. As expected, it is a straight line. If she needs the marks scored for 9 hours of study, she can get it by just substituting x=9 in y=3x+30. Note that the data (x and y) are just based on her hunch and aren’t real.
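The same substitution can be sketched in a few lines of Python. The function name `predicted_marks` is just an illustrative choice; the slope and intercept are the ones from Ms. Poly's assumptions.

```python
def predicted_marks(hours, slope=3, intercept=30):
    """Ms. Poly's rule of thumb: y = a*x + b with a=3 and b=30."""
    return slope * hours + intercept


for hours in (0, 1, 5, 9):
    print(hours, "hours ->", predicted_marks(hours), "marks")
```

For 9 hours of study this gives 3 * 9 + 30 = 57 marks, and for 0 hours it returns the intercept, 30.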

But Ms. Poly wants to guide her students on how to prepare for the final exam based on actual data. So she conducts a pop quiz and grades it. In order to formulate a study plan, she interviews her students and collects information on how many hours they study math per week. She creates a table with two columns: number of hours studied (x) per week and marks scored (y). She tries her old formula y=3x+30, but it doesn’t seem to work. Thus, she doesn’t have any sensible equation describing the relation between x and y.

Let’s assume that a new student who hasn’t attended any exam (no y available) joins the class the next day, and Ms. Poly only knows the number of hours dedicated per week (x). How can she answer the question below?

If the new student studies for a certain number of hours (x), what can be the marks scored (y) in the exam?

It’s impossible unless there’s an equation defining the sample data. So, her task is to find one that fits the given points. This process is called curve fitting or regression.

Linear Regression

The core idea of linear regression is to find a straight line that captures the trend of the existing data so that it can be used to make predictions for new input data. Now, let’s dive straight into the example to understand the concept better.

Ms. Poly is determined to arrive at a solution. She plots the collected data on a graph to get a better picture.

She has absolutely no idea how x and y are related. So, she must figure out a formula, by trial and error, that roughly fits the points. She has to start with an intuitive guess, try to improve it in the subsequent steps and then arrive at the best possible solution.

Trial 1: Ms. Poly begins with her previous straight line equation.

y = 3x+30

She substitutes different values of x and plots it alongside the collected input data. This way she can get a clear picture of the differences in her assumption and reality.

Trial 2: She observes that the line needs a little more slope. This simply means that, in reality, more marks are being scored for every additional hour of study. By changing it from 3 to 4, the equation becomes:

y = 4x+30

The following graph depicts the new line alongside the sample data:

Trial 3: It looks better but she feels there is a need to shift the whole line upwards. This means that higher marks are being scored even if a student doesn’t dedicate any time for math in a week. She decides to retain the previous slope but changes the starting marks by 10, thus arriving at:

y = 4x+40

This particular line covers most of the points and can be considered the best possible solution.

Now, if she wishes to estimate the marks scored by the new student who studied for 3.5 hours, she plugs the value into the formula and calculates the answer: y = 4*(3.5) + 40 = 54

We saw how Ms. Poly arrived at a straight line equation to predict the output for an unknown input. Now she can chalk out a study plan for her class based on the equation.

Here, an expression is formulated to ascertain the change in output when the input changes. It looks like Ms. Poly is thinking like a data scientist. She has in fact modelled a very simple neural network for regression. The equation y=4x+40 can be considered as the only neuron (processing unit) within it. She’s adjusted the parameters a (weight) and b (bias) to arrive at the final formula which covers most of the points (thus minimizing the loss).
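One way to make "minimizing the loss" concrete is to score each of the three trial lines against the data. The sketch below uses mean squared error as the loss, which is an assumption on my part since the article does not name a specific loss. The data points are also hypothetical, because the original table is not reproduced here.

```python
# Hypothetical quiz data (hours studied per week, marks scored).
# The article's actual table is not reproduced here, so these values are made up.
hours = [1, 2, 3, 4, 5, 6]
marks = [45, 47, 52, 57, 59, 65]


def mean_squared_error(slope, intercept):
    """Average squared gap between the line's predictions and the real marks."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(hours, marks)]
    return sum(errors) / len(errors)


trials = {
    "Trial 1: y = 3x + 30": (3, 30),
    "Trial 2: y = 4x + 30": (4, 30),
    "Trial 3: y = 4x + 40": (4, 40),
}

losses = {name: mean_squared_error(a, b) for name, (a, b) in trials.items()}
for name, loss in losses.items():
    print(f"{name} -> loss {loss:.2f}")

best_trial = min(losses, key=losses.get)
print("Best fit so far:", best_trial)
```

On this made-up data the loss drops with each trial, and the third line fits best, mirroring the way Ms. Poly improved her guess step by step.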

Here’s a breakdown of the y = 4x+40 equation:

y => Marks scored.
x => Number of hours studied.
a=4 => Increase in marks for every hour studied
b=40 => Minimum marks

At present, it is a rudimentary neural network which has no layering or non-linearity.

Now let’s shift our attention to a completely different scenario. Ms. Poly, being a teacher, wants to ensure that all her students pass the exam. Suppose she’s no longer interested in predicting the exact marks scored and just wants to know:

If a student studies for a certain number of hours (x), will the student pass/fail(y) the exam?

This leads her to the process of classification.

Linear Classification

The linear classification process uses a simple straight line to divide the data into categories or classes. The line acts as a boundary so that the classes fall on either side of it. First, Ms. Poly defines the boundary condition for pass and fail.

If marks scored>=50, pass

If marks scored<50, fail

According to the data table, x=3 corresponds to y=52 (boundary condition). Therefore she considers x=3 as the classification line.

x=3 seems to segregate the points into the categories properly. She tries to confirm it by substituting another value. Thus, if a student studied for 9 hours, the score would lie towards the right side of x=3. So, they’d pass as per the classification equation.

Again, she’s arrived at an expression to ascertain the change in output when the input changes. But here, she has modelled a basic neural network for classification. The equation x=3 is the only neuron within it. It can be considered to be having two parts as explained below.

Pre-Activation Part: This portion of the neuron computes an intermediate value which is helpful in further processing. She’s figured out the parameters a (weight) and b (bias) to arrive at the following formula: z = x-3
```
 z => Intermediate Value.
 x => Number of hours studied.
 a=1 => Influence of the number of hours studied on the marks scored
 b=-3 => Minimum number of hours to study to pass the exam = 3
```
Activation Part: This portion triggers the neuron to make decisions based on a threshold value. The following equation segregates the points into two classes.
```
 y = 1 (Pass) if z>=0
 y = 0 (Fail) if z<0
```

This is a very plain neural network which has no layering and non-linearity but has pre-activation and activation parts inside a neuron.
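The two parts of this single neuron can be sketched as two small Python functions. The function names are illustrative; the weight, bias, and threshold come from the formulas above.

```python
def pre_activation(hours, weight=1, bias=-3):
    """Linear part of the neuron: z = a*x + b, here z = x - 3."""
    return weight * hours + bias


def step_activation(z):
    """Threshold part of the neuron: 1 (pass) if z >= 0, otherwise 0 (fail)."""
    return 1 if z >= 0 else 0


def predict_pass(hours):
    """Chain the two parts: compute z, then turn it into a pass/fail decision."""
    return step_activation(pre_activation(hours))


for hours in (2, 3, 9):
    label = "Pass" if predict_pass(hours) == 1 else "Fail"
    print(f"{hours} hours -> z = {pre_activation(hours)} -> {label}")
```

A student who studies exactly 3 hours sits on the boundary (z = 0) and is classified as a pass, consistent with the z>=0 rule.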

Comparison

We looked at the examples of both linear regression and classification used by Ms. Poly. Regression helps in predicting a value while Classification helps in decision making. Let’s draw a small table to summarize the differences.

Upon careful observation, we notice that both answer the question of how a change in input affects the output, but at a slightly higher level of complexity than a plain straight line, because in both regression and classification we have to figure out the equation parameters by trial and error.

Here, since the requirements are simple, Ms. Poly just uses a straight line to solve both. A simple linear equation can handle only one steady trend. But in real life, problems that need solving are far more challenging and unpredictable. Some examples are:

Image Classification: An output label is produced based on the input images.

Text Translation: An English sentence can be given as an input to be translated to say, Spanish.

Chatbots: A text prompt is typed in by a user and a meaningful and relevant output is generated.

She would probably need a deep neural network if both the data and the task were complex. That raises another question: How does one build a deep neural network?

We will explore it further by extending the same example to a more realistic version.

Key Additions to Help Build Deep Neural Networks

In the above sections, we noted that Ms. Poly was interested in predicting the exam results of a student using just one factor - number of hours studied. However, in practice, is that one factor sufficient in determining the marks scored or whether the student passes the exam?

No. It’s not enough. She needs to take into account a lot of aspects like:

Number of hours studied
Number of hours of sleep/rest
Burnout due to over-studying
Difficulty level of topics in math
Pattern of the exam, and so on.

None of the above act independently, nor do they have a simple linear relation with the marks scored. So, she has to solve this problem by stacking the contributing factors one above the other in layers and also adding the element of non-linearity. Let’s take a look at each in detail.

Layering

Burnout leads to lower score whereas good sleep increases score. But burnout can be reduced if the student is well rested. So, the impact on the final score when these two factors interact should be taken into account. This is possible only when the system solves it in layers. The first layer can deal with how they independently influence the score, the next layer can explore the interaction between them.

Non-Linearity

If the number of hours studied increases, the score might increase but when burnout overpowers the effect of study hours, the score reduces. The combined effect results in a non-linear graph. There is a rise and then dip in the score based on number of hours studied. It’s evident that the relationship is not straightforward as in a straight line. That’s where it becomes necessary to add non-linearity in the calculations. It helps the system to respond differently according to the conditions, allowing for flexibility in dealing with real world data and conditions.

Thus, Ms. Poly would have to extend the idea of linear regression/classification by including layering and non-linearity to build a fully functional neural network to help build a practical study plan.

Modelling a Deep Neural Network

Ms. Poly should start the work on modelling a deep neural network by following the steps mentioned below:

Step #1 - Define the Problem Clearly

The following factors should be considered before she begins the process of modelling:

What are the input features?
What are the output features?
What type of problem is it (regression/classification)?

Step #2 - Define the Input Layer

The input features form the first layer. There is no computation in this stage. They are represented as:

x1: Number of hours studied
x2: Number of hours of sleep/rest
x3: Burnout due to over-studying
x4: Difficulty level of topics in Maths
x5: Pattern of the exam

Step #3 - Define the First Hidden Layer

This step consists of two parts:

Apply Linear Transformation: The actual learning begins here. A straight line equation is used to understand the combined effect of the inputs. The general formula is z=Wx+b.

z: Intermediate value or Pre-activation
W: Weight matrix which consists of values corresponding to the impact of each input feature
x: Matrix consisting of input features, [x1, x2, x3, x4, x5]
b: Bias which represents the initial assumptions of the teacher (when x=0)

It looks similar to a linear regression/classification equation. At first W and b are initialized to random values. Then in the subsequent steps, they are adjusted like it was done in earlier examples. We can consider the following combinations assuming we have two neurons in this layer:

Neuron 1: It can focus on study hours, burnout, and rest, with other features contributing less significantly.

Neuron 2: It can emphasize more on the difficulty level of the topic and the exam type compared to other inputs.

It’s important to note that this layer doesn’t capture interactions between the features. Each neuron only adds up the independent contributions of the inputs as a weighted sum. We don’t yet know how one input feature influences another. For example, we know sleep increases the score and burnout reduces it, but what we don’t know at this stage is whether sleep reduces burnout, which in turn can influence the final score.

Add Non-Linearity: This step, also called activation, helps in capturing the complexities in different combinations of the features. Less study results in low marks, and too much burnout also results in low marks. It means there is a curve in the score graph which can’t be represented by a linear equation. The activation function is applied to the intermediate value and can be expressed as:

a = g(z)

a: Activation output
g: Activation function
z: Intermediate value or Pre-activation

For example: ReLU is an activation function which outputs z only if z is positive, else 0.

a = ReLU(z) = max(0, z)

We can see that it has no steady slope and is a non-linear activation function. It suits this scenario because it lets a value pass through to the next layer only if the combined effect of the features is greater than 0. Neuron 1 will let its output go to the next layer only if the intermediate value (z) that results from study hours, burnout, and rest is large enough to influence the final decision; otherwise it’s ignored. There are multiple non-linear activation functions to choose from.
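Here is a minimal sketch of the first hidden layer, combining the linear transformation z = Wx + b with ReLU. All numbers are hypothetical: the inputs are assumed to be scaled to a 0 to 1 range, and the weights are hand-picked so that neuron 1 leans on study hours, rest, and burnout while neuron 2 leans on difficulty and exam pattern. A real network would learn these values during training.

```python
def relu(z):
    """ReLU passes positive values through and replaces the rest with 0."""
    return max(0.0, z)


def dense_layer(inputs, weights, biases):
    """Compute a = ReLU(W x + b) for every neuron in one layer."""
    activations = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        activations.append(relu(z))
    return activations


# Hypothetical, already-scaled inputs:
# [study hours, sleep, burnout, difficulty, exam pattern]
x = [0.8, 0.6, 0.3, 0.5, 0.4]

# Hypothetical weights: one row per neuron, one column per input feature.
W = [
    [0.9, 0.5, -0.7, 0.1, 0.1],    # neuron 1: study hours, rest, burnout
    [0.1, 0.1, -0.1, -0.8, -0.6],  # neuron 2: difficulty and exam pattern
]
b = [0.1, 0.2]

hidden = dense_layer(x, W, b)
print(hidden)
```

With these made-up numbers, neuron 2 ends up with a negative pre-activation, so ReLU outputs 0 for it and only neuron 1 passes a signal to the next layer.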

Step #4 - Stack Layers One Above the Other

This step helps in learning the mutual interactions between the inferences learned from the first hidden layer. The network attempts to understand the intricate details of the influencing factors and build a stable system. It is here that details such as whether sleep reduces burnout are figured out. Every layer applies linear and non-linear transformations to its input, which consists of the values obtained from the previous layer. In the same way, multiple layers can be stacked one over the other. In this example, for representation, we have taken two hidden layers with two neurons each. The number of layers and neurons can vary based on requirements.

Step #5 - Define the Output Feature(s)

This is the final stage in a deep neural network. Ms. Poly can decide what she wants for output: predict the marks scored by a student or predict if the student passes/fails the exam. If she wants the final marks scored, she just has to apply a linear transformation in the neuron in the final layer to produce the output. If she wants pass/fail status, she has to apply both linear and non-linear transformations to achieve the desired results.
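Putting steps 2 to 5 together, a forward pass through two hidden layers of two neurons each might look like the sketch below. Every weight and bias here is hypothetical, and using a sigmoid as the output non-linearity for pass/fail is my assumption, since the article does not name a specific function.

```python
import math


def layer(inputs, weights, biases, activation):
    """One layer: a linear transformation followed by an activation function."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


# Hypothetical parameters; a trained network would learn these values.
W1, b1 = [[0.9, 0.5, -0.7, 0.1, 0.1], [0.1, 0.1, -0.1, -0.8, -0.6]], [0.1, 0.2]
W2, b2 = [[0.6, -0.4], [0.3, 0.8]], [0.0, 0.1]
W_out, b_out = [[1.2, 0.7]], [-0.5]


def forward(x, output_activation):
    h1 = layer(x, W1, b1, relu)   # first hidden layer (2 neurons)
    h2 = layer(h1, W2, b2, relu)  # second hidden layer (2 neurons)
    return layer(h2, W_out, b_out, output_activation)[0]


x = [0.8, 0.6, 0.3, 0.5, 0.4]  # hypothetical, already-scaled input features

score = forward(x, identity)            # regression: linear output only
pass_probability = forward(x, sigmoid)  # classification: linear + non-linear
print("Regression output:", round(score, 3))
print("Pass" if pass_probability >= 0.5 else "Fail", round(pass_probability, 3))
```

The only difference between the two tasks is the activation applied in the output neuron. The regression output is on whatever scale the (hypothetical) training targets used, so it would need rescaling to read as marks.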

The diagram below shows an abstract representation of the deep neural network.

The next steps are:

Training the model: The network is trained in the following way:

Random weights and biases are assigned to the linear transformation portions of the network.
Then the network makes a prediction which is compared with the expected result.
If there are gaps between the actual result and the predicted result, corrections are made in weights and biases (this step is similar to what was done in linear regression and classification).
The steps above are repeated until the results improve.
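The training steps above can be sketched for the simplest possible network, a single neuron y = ax + b. The correction rule used here is gradient descent on the mean squared error, which is one common choice and not something the article prescribes. The data, learning rate, and number of steps are all hypothetical.

```python
import random

# Hypothetical data (hours studied, marks scored), made up for illustration.
hours = [1, 2, 3, 4, 5, 6]
marks = [45, 47, 52, 57, 59, 65]

random.seed(0)
a = random.uniform(-1, 1)  # weight (slope), randomly initialised
b = random.uniform(-1, 1)  # bias (intercept), randomly initialised
learning_rate = 0.02
n = len(hours)


def mse(slope, intercept):
    """Mean squared error of the line y = slope*x + intercept on the data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(hours, marks)) / n


initial_loss = mse(a, b)

for step in range(5000):
    # 1. Predict and compare with the expected result.
    errors = [a * x + b - y for x, y in zip(hours, marks)]
    # 2. Gradients of the mean squared error with respect to a and b.
    grad_a = 2 * sum(e * x for e, x in zip(errors, hours)) / n
    grad_b = 2 * sum(errors) / n
    # 3. Nudge the parameters in the direction that reduces the error.
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

final_loss = mse(a, b)
print(f"Learned line: y = {a:.2f}x + {b:.2f}")
print(f"Loss went from {initial_loss:.1f} to {final_loss:.2f}")
```

On this made-up data the learned line lands close to y = 4x + 40, the same line Ms. Poly reached by manual trial and error.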

Using the model: After the model has been trained, it is capable of yielding results for new input values.

Final Thoughts

In this article, we began with the basics of a straight line equation. Then we gradually navigated through slightly more elaborate concepts like linear regression and classification. They laid the groundwork for delving into the seemingly mysterious deep neural networks. But they are in fact built by stacking layers of linear transformations and non-linear activations, which help understand sophisticated real world patterns.

Despite all the complexities and layers, we can see that the straight line remains the foundation upon which neural networks are built. As we saw earlier, the equation that a deep neural network begins with is our magical equation: y = ax+b.

Common Pitfalls to Avoid When Analyzing and Modeling Data

Oyedele Tioluwani — Tue, 14 Oct 2025 13:48:34 +0000

Working with data at any level, whether as an analyst, engineer, scientist, or decision-maker, involves going through a range of challenges. Even experienced teams can run into issues that quietly affect the quality of their work. A mislabeled column, an unclear definition, or a data leak that slips by unnoticed can all lead to results that do not hold up when it matters most.

Reliable analysis depends on how data is handled throughout the process. From collection and preparation to modeling and interpretation, each step carries its own risks. Many of the most persistent problems come not from technical gaps, but from missing checks or assumptions that go unspoken.

This guide highlights some of the most common pitfalls in data analysis and shows where they tend to appear. Along the way, it covers:

Biased or unclear inputs that cause trouble early on
Validation mistakes that distort model performance
Misinterpretation of results that leads to the wrong conclusions
Workflow gaps that slow teams down or create confusion
Practical steps you can take to catch and correct these issues

Data Collection Pitfalls
Data Preparation Pitfalls
Modeling and Validation Pitfalls
Interpretation and Communication Pitfalls
Organizational and Workflow Pitfalls
Conclusion

Data Collection Pitfalls

A lot of data issues begin before any modeling takes place. The way data is collected helps shape what your analysis can reveal. Once the inputs are biased or inconsistent, even solid techniques may lead to unreliable results.

One common issue is the bias in data sources. When a large portion of the data comes from digital channels like websites or apps, it creates an imbalance. For instance, if a model is trained only on web traffic, it could miss users who engage through offline means, like in-person visits or phone support. This then results in blind spots that limit how well the model performs once deployed.

Inconsistent definitions across systems also pose a major challenge. A simple label like “customer” could represent various things - it could refer to an active user in one database, a prospect in another, or even a past buyer elsewhere. Without shared definitions, one can end up using the same terms to mean very different things, and this leads to confusion and misaligned metrics.

A third issue is the lack of metadata or data provenance. Without clear records of where the data came from or how it has changed over time, it becomes harder to trace issues, explain outputs, or reproduce results.

The way out:

Combine data from multiple sources to build a more complete and representative picture
Use stratified sampling to reduce bias where possible
Set up regular audits to catch data drift or gaps early
Maintain a shared data dictionary and align terms across teams
Track data lineage with tools like dbt, Apache Atlas, or OpenMetadata

Getting data collection right sets a strong foundation for analysis and helps prevent issues down the line.

Data Preparation Pitfalls

Once the data has been collected, the next step involves cleaning and shaping it for use. This is another delicate stage where data analysts often run into issues. Some choices that seem helpful at first can create problems later, especially when they aren’t documented or tested properly.

Silent Data Leakage

Data leakage occurs when a model learns from information that it would not have access to at prediction time. Let’s say for example, you’re building a model in January to predict whether a customer will make a purchase in February. If your dataset includes transactions from February, and you use that to calculate a feature like “days since last purchase”, then your model is learning from data it wouldn’t realistically have at prediction time.
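A small sketch of the "days since last purchase" example shows the difference between a leaky and a leak-free feature. The transaction dates, the cutoff, and the function name are hypothetical.

```python
from datetime import date

# Hypothetical transaction log for one customer.
transactions = [date(2024, 12, 20), date(2025, 1, 10), date(2025, 2, 14)]

# The prediction is made at the end of January, so nothing after this date is known.
cutoff = date(2025, 1, 31)


def days_since_last_purchase(purchase_dates, as_of):
    """Use only purchases on or before `as_of` to avoid leaking future data."""
    known = [d for d in purchase_dates if d <= as_of]
    if not known:
        return None  # no purchase history available at prediction time
    return (as_of - max(known)).days


# Leaky version: takes the latest purchase overall, including February.
leaky = (cutoff - max(transactions)).days
# Leak-free version: ignores anything after the cutoff.
safe = days_since_last_purchase(transactions, cutoff)

print("Leaky feature:", leaky)
print("Leak-free feature:", safe)
```

The leaky value comes out negative because it is measured against a purchase that has not happened yet at prediction time, which is a useful red flag to look for when auditing features.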

Improper Handling of Missing Values

Many analysts treat missing values as nothing more than gaps to be filled. In certain cases, though, the fact that data is missing can be just as meaningful as the value itself. In a customer churn dataset, some users might have blank entries for recent activities because they have already stopped engaging with the product. Filling those gaps with averages or zeros without context could make the model treat them the same as users who simply haven’t generated enough data yet, which can be misleading.

Over-aggressive Outlier Removal

It’s tempting to remove extreme values to simplify modeling, but outliers often represent rare but important events. In fraud detection, for instance, the anomalies are the very signals the models need to learn from. Discarding them automatically based on z-scores or quantiles may improve short-term accuracy while weakening long-term reliability.

The way out

To avoid data leakage, create training and test splits before engineering features. Make use of chronological splits when modeling time-based behavior, and regularly audit feature logic.
For missing values, go through the missingness patterns first. Use indicator variables where necessary, and treat the missingness as a signal, rather than just a defect.
With outliers, analyze their sources before removing them. If they turn out to be genuine observations, try using robust models that can handle skewed data, or flag them for downstream use instead of deleting them.
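As one illustration of treating missingness as a signal, the sketch below adds an indicator variable before filling the gap. The records and field names are hypothetical, and mean imputation is used only as a simple placeholder. In a real pipeline the fill value would be computed on the training split only.

```python
# Hypothetical churn records; None marks a missing "logins in the last 30 days" value.
customers = [
    {"id": 1, "recent_logins": 12},
    {"id": 2, "recent_logins": None},
    {"id": 3, "recent_logins": 4},
    {"id": 4, "recent_logins": None},
]

observed = [c["recent_logins"] for c in customers if c["recent_logins"] is not None]
fill_value = sum(observed) / len(observed)  # mean of the observed values

prepared = []
for c in customers:
    is_missing = c["recent_logins"] is None
    prepared.append({
        "id": c["id"],
        # Keep the fact that the value was missing as its own feature.
        "recent_logins_missing": int(is_missing),
        "recent_logins": fill_value if is_missing else c["recent_logins"],
    })

for row in prepared:
    print(row)
```

The model can now tell an imputed value apart from a real one, so users who have gone quiet are not silently merged with average users.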

Getting this stage right protects your models from brittle and unstable behavior.

Modeling and Validation Pitfalls

A common thought in this field is that models are only as reliable as the assumptions built into them. Mistakes at this phase are often reflected late, sometimes after the models have been deployed, making them harder to catch and more expensive to fix.

Overfitting Through Hyperparameter Tuning

Trying to make a model perfect with the training data can lead to patterns that don’t hold up in practice. When one tests hundreds of hyperparameter combinations without proper checks, the model often ends up learning noise rather than signals in the data, thereby resulting in excellent scores during cross-validation but weak performance in production. For instance, a churn model might show an excellent performance during development, but once it is deployed to a new region with a slight difference in customer behavior, it then starts to miss the mark.

Validation Leakage

Leakage can occur when the validation process accidentally gives the model access to target-related information. One common case is target encoding, where features like average purchase per customer group are calculated on the full dataset rather than only on the training set. This can lead to inflated validation scores and a false sense of confidence.
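A leak-free version of this kind of encoding learns the group averages from the training rows and only looks them up for validation rows. The sketch below uses hypothetical data and a hypothetical helper name, and falls back to the overall training mean for groups it has not seen.

```python
from collections import defaultdict

# Hypothetical rows: (customer_group, purchase_amount).
train_rows = [("A", 10.0), ("A", 30.0), ("B", 50.0), ("B", 70.0)]
validation_rows = [("A", 500.0), ("B", 5.0), ("C", 40.0)]


def fit_group_means(rows):
    """Learn the per-group mean from the training rows only."""
    totals, counts = defaultdict(float), defaultdict(int)
    for group, amount in rows:
        totals[group] += amount
        counts[group] += 1
    group_means = {g: totals[g] / counts[g] for g in totals}
    overall_mean = sum(totals.values()) / sum(counts.values())
    return group_means, overall_mean


group_means, overall_mean = fit_group_means(train_rows)

# Encode validation rows with training statistics; unseen groups fall back
# to the overall training mean. Validation amounts are never used here.
encoded_validation = [group_means.get(g, overall_mean) for g, _ in validation_rows]
print(encoded_validation)
```

Note that the unusual validation amounts (500.0 and 5.0) have no influence on the encoded features, which is exactly the point.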

Ignoring Data Drift and Concept Drift

Data changes over time, and so do the basic relationships that models rely on. A model trained on behavior from eight months ago may not reflect current realities. Imagine a fraud detection model built before a major policy shift or change of product; the possibility that the model may fail to catch new fraud patterns that arise afterwards is extremely high.

The Way Out

Use nested cross-validation (a technique that separates hyperparameter tuning from final evaluation by using two loops of cross-validation) to avoid overfitting during the model selection. After this, you can then compare results against simple baselines to keep complexity in check.
Treat feature engineering as part of the pipeline and apply it within each training fold to avoid leakage. For time-sensitive data, validate progressively to reflect real-world use.
Check for drift using techniques like the Kolmogorov-Smirnov test or the Population Stability Index, and link alerts to retraining processes so models can evolve with data.
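As a rough illustration of the drift check, here is a minimal Population Stability Index calculation in plain Python. The bin edges and sample values are hypothetical, and the small floor applied to empty bins is a common workaround rather than part of the definition.

```python
import math


def population_stability_index(expected, actual, bin_edges):
    """PSI = sum((actual% - expected%) * ln(actual% / expected%)) over bins."""

    def bin_shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            index = sum(v > edge for edge in bin_edges)  # which bin v falls in
            counts[index] += 1
        # A small floor avoids division by zero or log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected_shares = bin_shares(expected)
    actual_shares = bin_shares(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_shares, actual_shares)
    )


# Hypothetical feature values at training time and in production.
training_values = [10, 12, 15, 18, 20, 22, 25, 28, 30, 35]
same_distribution = list(training_values)
shifted_values = [v + 15 for v in training_values]

edges = [15, 25]  # hypothetical bin edges chosen from the training data
psi_same = population_stability_index(training_values, same_distribution, edges)
psi_shifted = population_stability_index(training_values, shifted_values, edges)
print("No drift:", psi_same)
print("Shifted:", psi_shifted)
```

A PSI of 0 means the two distributions fall into the bins in identical proportions, and the value grows as the production data moves away from the training data. The threshold at which to trigger an alert or retraining is a judgment call for each team.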

These steps go a long way in keeping your models solid in production and ready for whatever the data throws at them.

Interpretation and Communication Pitfalls

Clear, responsible communication is just as important as accurate modeling. But it is very easy to slip into habits that make results look more certain, more compelling, or more reliable than they really are. These missteps can lead teams to act on insights that don’t hold up.

Overconfidence in Statistical Significance

Testing lots of variables without making adjustments can make weak signals look important. Imagine you run a dozen A/B tests and pick the one with a p-value below 0.05. Without correcting for multiple comparisons, there’s a good chance that result is just noise.
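A quick calculation shows how large that chance is. The sketch assumes all twelve tests are independent and that none of the changes has a real effect, which is a simplification.

```python
alpha = 0.05
num_tests = 12  # "a dozen A/B tests"

# If every test is truly null and the tests are independent, the chance that
# at least one of them dips below alpha purely by chance is:
family_wise_error = 1 - (1 - alpha) ** num_tests
print(f"Chance of at least one false positive: {family_wise_error:.2f}")
```

Under those assumptions the probability of at least one spurious "winner" is roughly 46%, far above the 5% that a single test suggests.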

Ignoring Practical Significance

A result can be statistically significant but still meaningless when viewed in context. For example, a 0.1% lift in clickthrough rate may be technically real but not worth the cost of rolling out a change across the product.

Model Explainability Missteps

When explanation tools are used without context, they can confuse rather than clarify. Showing a ranked list of SHAP values might look impressive, but if the stakeholders don’t understand what the features mean or how they interact, the takeaway is lost.

The Way Out

Be cautious with statistical significance. If you’re running several tests, apply corrections for multiple comparisons (Bonferroni or Benjamini-Hochberg methods, for instance) and avoid selectively reporting only the findings that look significant and ignoring those that don’t.
Look beyond what is statistically true and ask whether it is practically useful. A small, significant change might not be worth acting on at the end of the day.
When using explainability tools like SHAP or LIME, don’t assume the outputs speak for themselves. Add plain-language summaries, relevant examples, and business contexts to make them actionable. It is better to explain less with clarity than more with confusion.
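To show what the two corrections mentioned above do in practice, here is a minimal sketch of Bonferroni and Benjamini-Hochberg written from their standard definitions. The p-values are hypothetical, and in real work a tested library implementation is usually preferable to hand-rolled code.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for i in order[:largest_rank]:
        rejected[i] = True
    return rejected


# Hypothetical p-values from a dozen A/B tests.
p_values = [0.001, 0.008, 0.012, 0.041, 0.049, 0.09,
            0.15, 0.22, 0.34, 0.51, 0.68, 0.93]

naive = [p < 0.05 for p in p_values]
print("Uncorrected:", sum(naive))
print("Bonferroni:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))
```

Bonferroni controls the chance of any false positive and is the stricter of the two, while Benjamini-Hochberg controls the expected share of false discoveries and typically keeps more findings.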

These habits make your results easier to trust, interpret, and apply, which is ultimately the point of the work.

Organizational and Workflow Pitfalls

Analytics is most effective when it is collaborative and responsive. Gaps in team structure or feedback processes can slow progress and limit the value of your work.

Teams working in isolation are a frequent issue. When analysts, engineers, and business stakeholders do not share tools or goals, efforts get duplicated and insights become fragmented. For example, one team might define active users based on weekly logins, while another uses monthly engagements, resulting in mismatched reports.

Lack of feedback from deployed models is another pitfall. If no one tracks what happens after predictions are made, teams miss the opportunity to refine and improve their processes. If a loan approval model is deployed with no follow-up on repayment behavior, it becomes difficult to tell whether the model is supporting sound lending decisions or increasing default risk.

The way out

Encourage collaboration by forming cross-functional teams and coordinating around shared planning cycles. Align on definitions early and rely on centralized dashboards to ensure that everyone is working from the same source of truth.
Create feedback loops and make them a standard part of your workflow. Track real-world outcomes, and schedule regular post-deployment reviews to understand what is working and what is not.
Include end users alongside data teams and treat their input as essential to improving the system.

Taking these actions helps analytics stay practical, consistent, and responsive to real needs.

Conclusion

Each stage of the data workflow benefits from clarity, structure, and shared understanding. The table below summarizes a key pitfall from each category, together with the way out, to help teams build more reliable models and deliver results that hold up in real-world settings.

Category	Pitfall	Consequences	Recommended Approach
Data collection	Unreliable sources	Skewed insights	Validate source quality and apply consistent standards
Data preparation	Silent data leakage	Inflated model performance without real-world value	Use proper data splits and audit derived features
Modeling & validation	Overfitting through hyperparameter tuning	Strong validation results that don’t translate to reality	Use nested cross-validation (a structure where tuning happens inside training folds) and keep simple baselines for comparison
Interpretation & communication	Overconfidence in statistical significance	Misleading conclusions from small or selective effects	Adjust for multiple comparisons and report confidence intervals alongside p-values
Organizational & workflow	Fragmented teams	Redundant work and inconsistent metrics	Encourage collaboration with shared planning, dashboards, and definitions

Strong analytic practice is built over time. Keeping these pitfalls in view helps teams stay consistent, improve delivery, and create results that stay useful across projects and contexts.

How to Forecast Time Series Data with Python Darts

Adejumo Ridwan Suleiman — Mon, 06 Oct 2025 18:37:01 +0000

When analyzing time series data, your main objective is to consider the period during which the data is collected and how your variable of interest changes over time.

There are various libraries for time series forecasting in Python, and Darts is one of them. Unlike other forecasting libraries, Darts is a high-level forecasting library with algorithms to handle various time series data, regardless of the kind of trend they portray.

This tutorial will walk you through how you can forecast time series data using Python Darts. This will help you make meaningful insights whenever you come across time series data such as stock prices, weather measurements, and so on.

What is Python Darts?

Python Darts is an open-source library for time series analysis and forecasting. It has various models ranging from statistical time series models like ARIMA and SARIMA to machine learning and deep learning models like Prophet and LSTM.

It also has tools for imputing missing values in time series data, and can handle univariate, multivariate, and hierarchical time series.

Prerequisites

Before we proceed, you will need to have the following:

Python 3.9+ installed.
Jupyter Notebook, Google Colab, or Positron to run your code.
Download the Netflix stock data.
Have the following libraries installed:
- darts for time series analysis
- pandas for data wrangling
- matplotlib for data visualization.

How to Set Up Dependencies

Load the following libraries.

import matplotlib.pyplot as plt
import pandas as pd
import darts
from darts import TimeSeries
from darts.models import ARIMA
from darts.models import RegressionModel
from lightgbm import LGBMRegressor
from darts.models import RNNModel
from darts.metrics import mape
import itertools

Understanding the Dataset

The Netflix stock data contains historical daily prices of Netflix stock from 2002 to date.

Load the data and have a preview of it.

netflix = pd.read_csv("/kaggle/input/netflix-stock-data-live-and-latest/Netflix_stock_history.csv")
netflix['Date'] = pd.to_datetime(netflix['Date'], utc=True).dt.tz_convert(None)
netflix.head()

To forecast a time series data, we need a Date column, which we already have, and then the variable of interest. We have several variables, but for this tutorial, we will focus on the Close variable of Netflix stocks.

Let’s visualize the data to see how Netflix closing price performed over the years.

netflix.plot(x='Date', y='Close', figsize=(10,5))
plt.show()

From the chart above, you can see that Netflix stock grew roughly exponentially in recent years. This means that the data is non-stationary: its statistical properties, such as the mean and variance, change over time rather than staying constant.

There are a lot of random fluctuations in the data, which might make it difficult to forecast. Such data usually requires advanced models to handle the various fluctuations or noise present in the data.

How to Prepare the Data for Darts

Before preparing the data for Darts, you need to take note of a few things.

First of all, if you look at the data preview from earlier, you'll notice that it is recorded daily, but only for days when the market was open. To get a continuous daily series, we need to fill in the missing dates.

Copy and paste this code into your notebook.

start = netflix['Date'].min()
end = netflix['Date'].max()

netflix = (
    netflix.set_index('Date')
           .reindex(pd.date_range(start=start, end=end, freq='D'))
           .ffill()
           .reset_index()
           .rename(columns={'index': 'Date'})
)
netflix.head()

The code above ensures the netflix dataset has a continuous daily time series by filling in missing dates.

First, it finds the earliest start and latest end dates in the data, then creates a full daily date range between them.

By setting the Date column as the index and using .reindex() method, it inserts rows for any missing dates, which initially contain NaN.

The .ffill() method (forward fill) replaces these gaps by carrying forward the last known value, which is common for stock data when markets are closed, such as weekends.

Finally, the index is reset, and the column is renamed back to Date, producing a clean, continuous dataset ready for time series analysis.
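To make the forward-fill step concrete, here is a minimal pure-Python sketch of the same idea, without pandas. The prices and the helper name forward_fill_daily are made up for illustration and are not part of the Netflix dataset or of pandas.

```python
from datetime import date, timedelta

# Hypothetical closing prices with a weekend gap (Jan 4 and Jan 5 are missing).
observed = {
    date(2025, 1, 2): 100.0,
    date(2025, 1, 3): 102.5,
    date(2025, 1, 6): 101.0,
}

def forward_fill_daily(prices):
    """Return one entry per calendar day, carrying the last known value forward."""
    start, end = min(prices), max(prices)
    filled = {}
    last_value = None
    day = start
    while day <= end:
        if day in prices:
            last_value = prices[day]   # a real observation replaces the carried value
        filled[day] = last_value       # missing days reuse the last observation
        day += timedelta(days=1)
    return filled

filled = forward_fill_daily(observed)
for day, price in filled.items():
    print(day.isoformat(), price)
```

The two weekend days receive Friday's value of 102.5. Keep in mind that forward-filled days are copies rather than real observations, so the series contains flat stretches that a model may pick up on.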

Next, we need to convert the data to a Darts TimeSeries object to make it usable by the Darts library.

series = TimeSeries.from_dataframe(
    netflix,
    time_col='Date',
    value_cols='Close',
)

The code above converts the netflix DataFrame into a Darts TimeSeries object, which is optimized for time series modeling and forecasting.

It takes the Date column (time_col='Date') as the timeline and the Close column (value_cols='Close') as the target values to forecast.

The resulting series object is now structured for use with Darts’ advanced forecasting models like ARIMA, Prophet, RNNs, and other time series algorithms.

Just like you would with any other machine learning model, you need to split your data into a training set and a validation set.

train, val = series.split_before(0.8)

How to Build a Forecasting Model

When building a forecasting model, you have the privilege of trying various models and picking the best-performing one.

The Darts library has various algorithms for time series analysis, from popular statistical algorithms like the Auto Regressive Integrated Moving Average (ARIMA) and Moving Average (MA) models, to machine learning and deep learning algorithms like Prophet and Long Short Term Memory (LSTM).

Note, I will only demonstrate how these algorithms work - it’s not necessary that we get accurate model metrics. But with further feature engineering, hyperparameter tuning, and cross-validation, you can get good results on your own.

Classical Model

The classical approach uses statistical time series models such as ARIMA. ARIMA is made up of the following components:

AR (AutoRegressive): Predict future values by looking at previous ones.
I (Integrated): Remove trends by focusing on changes instead of raw values.
MA (Moving Average): Learn from the errors of past predictions to improve accuracy.
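The "Integrated" component is the easiest one to see in code. The sketch below applies differencing by hand to a made-up series with a steady trend. The function name difference is an illustrative helper, not something Darts exposes under that name.

```python
def difference(values, order=1):
    """Apply first differences `order` times (the 'I' step in ARIMA)."""
    for _ in range(order):
        # Each pass replaces the series with the change between consecutive points.
        values = [b - a for a, b in zip(values, values[1:])]
    return values

trend = [10, 12, 14, 16, 18, 20]     # made-up series with a steady upward trend

print(difference(trend))              # [2, 2, 2, 2, 2]
print(difference(trend, order=2))     # [0, 0, 0, 0]
```

After one round of differencing the trend is gone and only the constant step size remains. The order argument here plays the role of the d parameter in ARIMA.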

Run the code below in your notebook to fit an ARIMA model.

arima_model = ARIMA()
arima_model.fit(train)
arima_forecast = arima_model.predict(len(val))

To visualize the forecast by the model, call the .plot() method on the forecast object.

series.plot(label='actual')
arima_forecast.plot(label='forecast')
plt.legend()

You can improve the model by adding some additional parameters to the ARIMA() class. You can read more about that in the Darts documentation.

Machine Learning Models

Classical models like ARIMA are linear, so they struggle with non-linear patterns. Machine learning models fill this gap. We’ll use the LightGBM model as an example.

LightGBM is a gradient boosting model that builds decision trees sequentially. Each new tree corrects the errors of the previous trees.

Although it was not designed to handle time series, with some feature engineering such as lags, rolling statistics, and seasonal indicators, you can make it learn patterns from time series data.

Run this code on your notebook to fit a LightGBM model on the Netflix data.

lgbm = LGBMRegressor()
lgbm_model = RegressionModel(lags=12, model=lgbm)
lgbm_model.fit(train)
lgbm_forecast = lgbm_model.predict(len(val))

In the code above, the lags argument is set to 12, which means the model uses the closing prices from the previous 12 days as input features when predicting the next value.
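To picture what lag features look like, here is a small hand-rolled sketch that turns a series into rows of previous values plus a target. Darts builds a comparable table internally, but make_lagged_rows is a hypothetical helper written for illustration, not Darts' implementation, and the prices are made up.

```python
def make_lagged_rows(values, lags):
    """Build (features, target) pairs where features are the `lags` previous values."""
    rows = []
    for t in range(lags, len(values)):
        features = values[t - lags:t]   # oldest value first, most recent last
        target = values[t]              # the value the model learns to predict
        rows.append((features, target))
    return rows

prices = [10, 11, 12, 13, 14, 15]       # made-up closing prices

for features, target in make_lagged_rows(prices, lags=3):
    print(features, "->", target)
```

With lags=3, the first usable row is the fourth observation, because the first three points have no complete history. The same logic with lags=12 means the first 12 days of the training set serve only as inputs.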

Let’s have a view of the forecast by running the following code.

series.plot(label='actual')
lgbm_forecast.plot(label='forecast')
plt.legend()

You can read more about tuning the LightGBM model from the Darts documentation to improve the above model.

How to Forecast with Deep Learning models

You can go for deep learning models designed for time series, such as LSTM, a kind of Recurrent Neural Network (RNN) designed to capture long-term dependencies in sequential data.

Run the following code to build the LSTM model.

lstm_model = RNNModel(model='LSTM', input_chunk_length=12, output_chunk_length=6, n_epochs=100)
lstm_model.fit(train)
lstm_forecast = lstm_model.predict(len(val))

Now let’s visualize the forecast and see what we have.

series.plot(label='actual')
lstm_forecast.plot(label='forecast')
plt.legend()

You can look up the Darts documentation to improve the model and check out other deep learning models also.

Model Evaluation

Now that you have three models, you need to select the best one among them using the Mean Absolute Percentage Error (MAPE).

It expresses the average absolute error as a percentage of the actual values, and the closer your value is to 0, the better your model.
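As a sanity check on what the metric measures, here is a minimal hand-written version of the MAPE formula. The name mape_manual and the numbers are illustrative. This sketch assumes that no actual value is zero and is not the Darts implementation.

```python
def mape_manual(actual, forecast):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    # Absolute error of each point, relative to the actual value.
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)

actual = [100, 200, 400]      # made-up true values
forecast = [110, 180, 400]    # made-up predictions: off by 10%, 10%, and 0%

print(round(mape_manual(actual, forecast), 2))   # 6.67
```

A MAPE of 6.67 means the forecasts are off by about 6.67% of the actual value on average.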

Run the following to print the MAPE of each respective model.

arima_error = mape(val, arima_forecast)
print("MAPE:", arima_error)
lgbm_error = mape(val, lgbm_forecast)
print("MAPE:", lgbm_error)
lstm_error = mape(val, lstm_forecast)
print("MAPE:", lstm_error)

> MAPE: 38.33262525601514
> MAPE: 39.00241495209449
> MAPE: 38.82910057097827

The model with the lowest MAPE is the ARIMA model with approximately 38.33, which means it’s our best-performing model.

Backtesting

Darts has a feature called backtesting that allows you to evaluate your models based on historical data, using a rolling forecast.

Backtesting is like a time machine for forecasting. It simulates how your model would have performed in the past by repeatedly training it on historical data up to a certain point, making a prediction for the next step, then moving forward, and repeating the process.

This rolling evaluation simulates how the model would behave in real-world conditions, where future data is unknown, helping you measure its consistency and reliability over time, instead of just testing it once on a single validation set.
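The rolling procedure described above can be sketched as a generator of train/forecast windows. This is a simplified illustration with an expanding training window. The function rolling_origin_splits and its parameter names are assumptions made for this sketch and are not part of Darts.

```python
def rolling_origin_splits(n_points, start, horizon, stride=1):
    """Yield (train_end, test_indices) pairs for an expanding-window backtest.

    train_end is exclusive: the model is fit on points [0, train_end).
    """
    train_end = start
    # Stop as soon as a full forecast window no longer fits in the data.
    while train_end + horizon <= n_points:
        yield train_end, list(range(train_end, train_end + horizon))
        train_end += stride

for train_end, test_idx in rolling_origin_splits(n_points=10, start=6, horizon=2):
    print(f"train on 0..{train_end - 1}, forecast {test_idx}")
```

With 10 points, a start of 6, and a horizon of 2, three windows fit. If the horizon is as long as all the data remaining after the start, only a single window fits, so the backtest reduces to one forecast.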

Since the ARIMA model is currently our best-performing model, run the code below to implement backtesting.


# Perform backtesting on the training + validation series
backtest_series = train.concatenate(val)

# Backtest
backtest_forecast = arima_model.historical_forecasts(
    series=backtest_series,
    start=0.8,          # fraction of the series to start forecasting from
    forecast_horizon=len(val),
    stride=1,           # step size of rolling forecast
    retrain=True,       # retrain the model at each step
    verbose=True
)

# Compute metrics
error = mape(backtest_series[-len(val):], backtest_forecast)
print(f"MAPE: {error:.2f}%")

> historical forecasts: 100%|██████████| 1/1 [00:02<00:00,  2.69s/it]MAPE: 47.27%

In the code above,

The start argument defines where to start backtesting, which in this case is the last 20% of the series.
The forecast_horizon is how many steps ahead to forecast at each point.
The stride is the number of time steps between two consecutive forecasts.
The retrain=True refits the model at each step for realistic evaluation.

You can see that the MAPE, after backtesting, is higher because backtesting is more realistic, and it is more difficult to achieve a lower MAPE.

On your own, you can try to replicate backtesting for the other models.

Hyper Parameter Tuning

The ARIMA model has three hyperparameters:

p which is the AR order
d which is the differencing order
q which is the MA order

You can use either grid or random search to tune your ARIMA model in Darts.

# Define possible values
p_values = range(0, 4)
d_values = range(0, 3)
q_values = range(0, 4)

best_mape = float('inf')
best_params = None

for p, d, q in itertools.product(p_values, d_values, q_values):
    try:
        arima_model = ARIMA(p=p, d=d, q=q)
        arima_model.fit(train)
        arima_forecast = arima_model.predict(len(val))
        arima_error = mape(val, arima_forecast)
        if arima_error < best_mape:
            best_mape = arima_error
            best_params = (p, d, q)
    except Exception as e:
        # Some combinations may fail
        continue

print(f"Best ARIMA params: p={best_params[0]}, d={best_params[1]}, q={best_params[2]} with MAPE={best_mape:.2f}%")

> Best ARIMA params: p=2, d=0, q=3 with MAPE=35.95%

In the above code, you define a range of possible values for the p, d, and q components, iterate over every combination of those values, and keep the combination with the lowest MAPE.

Note that each model has its specific parameter you would have to tune, and you will need to check the Darts documentation for the hyperparameters of other models.

Real-World Use Cases

Forecasting time series data has a lot of real-world applications, some of which are:

Stock price prediction: Like the dataset used in this tutorial, forecasting is used in finance for stock price prediction, allowing investors to manage risk.
Demand forecasting for inventory: As a store owner, you can forecast product demands based on past sales of a product. This lets you know products that are in high demand.
Energy consumption prediction: Governments, industries, and consumers can plan and manage energy production, distribution, and consumption efficiently, based on data from past usage. This helps to avoid blackouts and wastage, enabling them to prepare ahead.

Best Practices

Always visualize residuals: Residuals are the difference between forecasted values and actual values. You must visualize them to detect outliers and unusual events.
Perform proper backtesting: Backtesting lets you see a more realistic model, subjected to various changes that can occur in real life. When you backtest all your models, you end up getting a model that performs well when forecasting.
Avoid data leakage: Do not train your models on validation sets to avoid bias, and always use cross-validation where necessary.
Use domain knowledge for feature engineering: Ensure you understand the data you are working with. This comes in handy in feature engineering, when you want to come up with new features to help your forecasting model, especially in multivariate time series forecasting.

Conclusion

This tutorial is more like an overview, especially if you are new to time series, but you can build a lot just from what you have learned.

You already have an idea of what time series and forecasting are, and how you can use the Darts Python library to achieve that.

You also learned of various models for forecasting time series data, and how you can apply techniques such as backtesting and hyperparameter tuning to achieve better results.

Another interesting thing with Darts is its ability to handle hierarchical time series. Here, data is structured at aggregated levels.

Darts is one of the most powerful time series libraries in Python and has a lot of models to handle various cases. You can proceed to explore models such as Transformers and also multi-series forecasting, which are used for special use cases.

If you are interested in more data science and statistics articles, don’t forget to check out my blog.

Graph Algorithms in Python: BFS, DFS, and Beyond

Oyedele Tioluwani — Wed, 03 Sep 2025 16:25:04 +0000

Have you ever wondered how Google Maps finds the fastest route or how Netflix recommends what to watch? Graph algorithms are behind these decisions.

Graphs, made up of nodes (points) and edges (connections), are one of the most powerful data structures in computer science. They help model relationships efficiently, from social networks to transportation systems.

In this guide, we will explore two core traversal techniques: Breadth-First Search (BFS) and Depth-First Search (DFS). Moving on from there, we will cover advanced algorithms like Dijkstra’s, A*, Kruskal’s, Prim’s, and Bellman-Ford.

Understanding Graphs in Python

A graph consists of nodes (vertices) and edges (relationships).

For example, in a social network, people are nodes and friendships are edges. Or in a roadmap, cities are nodes and roads are edges.

There are a few different types of graphs:

Directed: edges have direction (one-way streets, task scheduling).
Undirected: edges go both ways (mutual friendships).
Weighted: edges have values (distances, costs).
Unweighted: edges are equal (basic subway routes).

Now that you know what graphs are, let’s look at the different ways they can be represented in Python.

Ways to Represent Graphs in Python

Before diving into traversal and pathfinding, it’s important to know how graphs can be represented. Different problems call for different representations.

Adjacency Matrix

An adjacency matrix is a 2D array where each cell (i, j) shows whether there is an edge from node i to node j.

In an unweighted graph, 0 means no edge, and 1 means an edge exists.
In a weighted graph, the cell holds the edge weight.

This makes it very quick to check if two nodes are directly connected (constant-time lookup), but it uses more memory for large graphs.

graph = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0]
]

Here, the matrix shows a fully connected graph of 3 nodes. For example, graph[0][1] = 1 means there is an edge from node 0 to node 1.

Adjacency List

An adjacency list represents each node along with the list of nodes it connects to.

This is usually more efficient for sparse graphs (where not every node is connected to every other node). It saves memory because only actual edges are stored instead of an entire grid.

graph = {
    'A': ['B','C'],
    'B': ['A','C'],
    'C': ['A','B']
}

Here, node A connects to B and C, and so on. Checking connections takes a little longer than with a matrix, but for large, sparse graphs, it’s the better option.
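The two representations hold the same information, so you can convert between them. Here is a minimal sketch that turns an adjacency list into an adjacency matrix. The function name list_to_matrix and the small example graph are made up for illustration.

```python
def list_to_matrix(adj_list):
    """Convert an adjacency list (dict of lists) to an adjacency matrix."""
    nodes = sorted(adj_list)                            # fix a row/column order
    index = {node: i for i, node in enumerate(nodes)}   # node label -> matrix position
    matrix = [[0] * len(nodes) for _ in nodes]
    for node, neighbors in adj_list.items():
        for neighbor in neighbors:
            matrix[index[node]][index[neighbor]] = 1
    return nodes, matrix

adj_list = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B']}    # a simple path A - B - C
nodes, matrix = list_to_matrix(adj_list)

print(nodes)
for row in matrix:
    print(row)
```

The matrix always needs one cell per pair of nodes, while the list only stores the edges that exist. That difference is what makes the list cheaper for sparse graphs.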

Using NetworkX

When working on real-world applications, writing your own adjacency lists and matrices can get tedious. That’s where NetworkX comes in, a Python library that simplifies graph creation and analysis.

With just a few lines of code, you can build graphs, visualize them, and run advanced algorithms without reinventing the wheel.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from([('A','B'), ('A','C'), ('B','C')])
nx.draw(G, with_labels=True)
plt.show()

This builds a triangle-shaped graph with nodes A, B, and C. NetworkX also lets you easily run algorithms like shortest paths or spanning trees without manually coding them.

Now that we’ve seen different ways to represent graphs, let’s move on to traversal methods, starting with Breadth-First Search (BFS).

Breadth-First Search (BFS)

The basic idea behind BFS is to explore a graph one layer at a time. It looks at all the neighbors of a starting node before moving on to the next level. A queue is used to keep track of what comes next.

BFS is particularly useful for:

Finding the shortest path in unweighted graphs
Detecting connected components
Crawling web pages

Here’s an example:

from collections import deque

def bfs(graph, start):
    visited = {start}
    queue = deque([start])

    while queue:
        node = queue.popleft()
        print(node, end=" ")
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)


graph = {
    'A': ['B','C'],
    'B': ['A','D','E'],
    'C': ['A','F'],
    'D': ['B'],
    'E': ['B','F'],
    'F': ['C','E']
}

bfs(graph, 'A')

Here’s what’s going on in this code:

graph is a dict where each node maps to a list of neighbors.
deque is used as a FIFO queue so we visit nodes level-by-level.
visited keeps track of nodes we’ve already processed so we don’t loop forever on cycles.
In the loop, we pop a node, print it, then for each unvisited neighbor, we mark it visited and enqueue it.

And here’s the output:

A B C D E F
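BFS was listed above as a way to find shortest paths in unweighted graphs. A minimal sketch of that use case records each node's parent during the traversal and walks back from the goal. The function name bfs_shortest_path is an illustrative choice, and the graph is the same one used in the example above.

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Return one shortest path (fewest edges) from start to goal, or None."""
    parent = {start: None}          # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None                     # goal is not reachable from start

graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D', 'E'],
    'C': ['A', 'F'],
    'D': ['B'],
    'E': ['B', 'F'],
    'F': ['C', 'E']
}

print(bfs_shortest_path(graph, 'A', 'F'))   # ['A', 'C', 'F']
```

Because BFS reaches nodes in order of distance from the start, the first time it reaches the goal is along a path with the fewest edges.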

Now that we have seen how BFS works, let’s turn to its counterpart: Depth-First Search (DFS).

Depth-First Search (DFS)

DFS works differently from BFS. Instead of moving level by level, it follows one path as far as it can go before backtracking. Think of it as diving deep down a trail, then returning to explore the others.

We can implement DFS in two ways:

Recursive DFS, which uses the function call stack
Iterative DFS, which uses an explicit stack

DFS is especially useful for:

Cycle detection
Maze solving and puzzles
Topological sorting

Here’s an example of recursive DFS:

def dfs_recursive(graph, node, visited=None):
    if visited is None:
        visited = set()
    if node not in visited:
        print(node, end=" ")
        visited.add(node)
        for neighbor in graph[node]:
            dfs_recursive(graph, neighbor, visited)

graph = {
    'A': ['B','C'],
    'B': ['A','D','E'],
    'C': ['A','F'],
    'D': ['B'],
    'E': ['B','F'],
    'F': ['C','E']
}

dfs_recursive(graph, 'A')

visited is a set that tracks nodes already processed so you don’t loop forever on cycles.
On each call, if node hasn’t been seen, it’s printed, marked visited, then the function recurses into each neighbor.

Traversal order:

A B D E F C

Explanation: DFS visits B after A, goes deeper into D, then backtracks to B and explores E and then F. From F it reaches C, the last unvisited node.

And here’s an example of iterative DFS:

def dfs_iterative(graph, start):
    visited = set()
    stack = [start]

    while stack:
        node = stack.pop()
        if node not in visited:
            print(node, end=" ")
            visited.add(node)
            stack.extend(reversed(graph[node]))

dfs_iterative(graph, 'A')

visited tracks nodes you’ve already processed so you don’t loop on cycles.
stack is LIFO (last in, first out) – you pop() the top node, process it, then push its neighbors.
reversed(graph[node]) pushes neighbors in reverse so they’re visited in the original left-to-right order (mimicking the usual recursive DFS).

Here’s the output:

A B D E F C
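Topological sorting was listed above as one of the uses of DFS. Here is a minimal sketch, assuming the input is a directed graph with no cycles (this version does not detect cycles). The task names and the function name topological_sort are made up for illustration.

```python
def topological_sort(graph):
    """Order the nodes of a directed acyclic graph so every edge points forward."""
    visited, order = set(), []

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for neighbor in graph[node]:
            visit(neighbor)
        order.append(node)      # post-order: added only after all its descendants

    for node in graph:
        visit(node)
    return order[::-1]          # reverse post-order is a valid topological order

# Hypothetical build steps: an edge X -> Y means "X must happen before Y".
tasks = {
    'install': ['configure'],
    'configure': ['test'],
    'download': ['install'],
    'test': [],
}

print(topological_sort(tasks))   # ['download', 'install', 'configure', 'test']
```

A node is appended only after everything that depends on it has been appended, so reversing the list puts every prerequisite before the steps that need it.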

With BFS and DFS explained, we can now move on to algorithms that solve more complex problems, starting with Dijkstra’s shortest path algorithm.

Dijkstra’s Algorithm

Dijkstra’s algorithm is built on a simple rule: always visit the node with the smallest known distance first. By repeating this, it uncovers the shortest path from a starting node to all others in a weighted graph that doesn’t have negative edges.

import heapq

def dijkstra(graph, start):
    heap = [(0, start)]
    shortest_path = {node: float('inf') for node in graph}
    shortest_path[start] = 0

    while heap:
        cost, node = heapq.heappop(heap)
        for neighbor, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < shortest_path[neighbor]:
                shortest_path[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return shortest_path

graph = {
    'A': [('B',1), ('C',4)],
    'B': [('A',1), ('C',2), ('D',5)],
    'C': [('A',4), ('B',2), ('D',1)],
    'D': [('B',5), ('C',1)]
}

print(dijkstra(graph, 'A'))

Here’s what’s going on in this code:

graph is an adjacency list: each node maps to a list of (neighbor, weight) pairs.
shortest_path stores the current best-known distance to each node (∞ initially, 0 for start).
heap (priority queue) holds frontier nodes as (cost, node), always popping the smallest cost first.
For each popped node, it relaxes its edges: for each (neighbor, weight), compute new_cost. If new_cost beats shortest_path[neighbor], update it and push the neighbor with that cost.

And here’s the output:

{'A': 0, 'B': 1, 'C': 3, 'D': 4}
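The function above returns distances only. A common variation, sketched here under the same graph format, also records each node's predecessor so the route itself can be rebuilt, and skips heap entries that have become outdated. The names dijkstra_with_path and prev are illustrative.

```python
import heapq

def dijkstra_with_path(graph, start, goal):
    """Return (path, cost) for the cheapest route from start to goal."""
    dist = {node: float('inf') for node in graph}
    dist[start] = 0
    prev = {}                              # best-known predecessor of each node
    heap = [(0, start)]

    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist[node]:
            continue                       # stale entry: a cheaper route was found later
        if node == goal:
            break                          # the goal's distance is final once popped
        for neighbor, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < dist[neighbor]:
                dist[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))

    if dist[goal] == float('inf'):
        return None, float('inf')          # goal is unreachable

    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]

graph = {
    'A': [('B', 1), ('C', 4)],
    'B': [('A', 1), ('C', 2), ('D', 5)],
    'C': [('A', 4), ('B', 2), ('D', 1)],
    'D': [('B', 5), ('C', 1)]
}

print(dijkstra_with_path(graph, 'A', 'D'))   # (['A', 'B', 'C', 'D'], 4)
```

The stale-entry check matters on larger graphs, where the same node can sit in the heap several times with different costs.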

Moving on, let’s look at an extension of this algorithm: A* Search.

A* Search

A* works like Dijkstra’s but adds a heuristic function that estimates how close a node is to the goal. This makes it more efficient by guiding the search in the right direction.

import heapq

def heuristic(node, goal):
    heuristics = {'A': 4, 'B': 2, 'C': 1, 'D': 0}
    return heuristics.get(node, 0)

def a_star(graph, start, goal):
    g_costs = {node: float('inf') for node in graph}
    g_costs[start] = 0
    came_from = {}

    heap = [(heuristic(start, goal), start)]

    while heap:
        f, node = heapq.heappop(heap)

        if f > g_costs[node] + heuristic(node, goal):
            continue

        if node == goal:
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1], g_costs[path[0]]

        for neighbor, weight in graph[node]:
            new_g = g_costs[node] + weight
            if new_g < g_costs[neighbor]:
                g_costs[neighbor] = new_g
                came_from[neighbor] = node
                heapq.heappush(heap, (new_g + heuristic(neighbor, goal), neighbor))

    return None, float('inf')

graph = {
    'A': [('B',1), ('C',4)],
    'B': [('A',1), ('C',2), ('D',5)],
    'C': [('A',4), ('B',2), ('D',1)],
    'D': []
}

print(a_star(graph, 'A', 'D'))

This one’s a little more complex, so here’s what’s going on:

graph: adjacency list – each node maps to [(neighbor, weight), ...].
heuristic(node, goal): returns an estimate h(node) (lower is better). It’s passed goal but in this demo uses a fixed dict.
g_costs: best known cost from start to each node (∞ initially, 0 for start).
heap: min-heap of (priority, node) where priority = g + h.
came_from: backpointers to reconstruct the path once we pop the goal.

Then in the main loop:

We pop the node with smallest priority.
If the popped entry is stale (its priority is higher than g_costs[node] + heuristic(node, goal) because a cheaper route was found after it was pushed), we skip it.
If it’s the goal, we backtrack via came_from to build the path and return it with g_costs[goal].
Otherwise, we relax the edges: for each (neighbor, weight), compute new_g = g_costs[node] + weight. If new_g improves g_costs[neighbor], update it, set came_from[neighbor] = node, and push (new_g + heuristic(neighbor, goal), neighbor).

Output:

(['A', 'B', 'C', 'D'], 4)
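A* is only guaranteed to return the cheapest path when the heuristic never overestimates the true remaining cost (such a heuristic is called admissible). The sketch below checks that property for the fixed heuristic values used in the demo, by comparing each estimate with the true cheapest cost computed by a plain Dijkstra search. The helper name cheapest_cost is illustrative.

```python
import heapq

def cheapest_cost(graph, start, goal):
    """True cheapest cost from start to goal (Dijkstra), or inf if unreachable."""
    dist = {node: float('inf') for node in graph}
    dist[start] = 0
    heap = [(0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist[node]:
            continue                                  # skip outdated heap entries
        for neighbor, weight in graph[node]:
            if cost + weight < dist[neighbor]:
                dist[neighbor] = cost + weight
                heapq.heappush(heap, (cost + weight, neighbor))
    return dist[goal]

graph = {
    'A': [('B', 1), ('C', 4)],
    'B': [('A', 1), ('C', 2), ('D', 5)],
    'C': [('A', 4), ('B', 2), ('D', 1)],
    'D': []
}
h = {'A': 4, 'B': 2, 'C': 1, 'D': 0}                  # heuristic values from the demo

for node in graph:
    true_cost = cheapest_cost(graph, node, 'D')
    print(node, "estimate:", h[node], "true:", true_cost, "ok:", h[node] <= true_cost)
```

Every estimate is at or below the true cost, so the heuristic in the demo is admissible. In practice you rarely know the true costs in advance, which is why heuristics such as straight-line distance are popular: they cannot overestimate by construction.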

Next up, let’s move from shortest paths to spanning trees. This is where Kruskal’s algorithm comes in.

Kruskal’s Algorithm

Kruskal’s algorithm builds a Minimum Spanning Tree (MST) by sorting all edges from smallest to largest and adding them one at a time, as long as they don’t create a cycle. This makes it a greedy algorithm as it always picks the cheapest option available at each step.

The implementation uses a Disjoint Set (Union-Find) data structure to efficiently check whether adding an edge would create a cycle. Each node starts in its own set, and as edges are added, sets are merged.

class DisjointSet:
    def __init__(self, nodes):
        self.parent = {node: node for node in nodes}
        self.rank = {node: 0 for node in nodes}
    def find(self, node):
        if self.parent[node] != node:
            self.parent[node] = self.find(self.parent[node])
        return self.parent[node]
    def union(self, node1, node2):
        r1, r2 = self.find(node1), self.find(node2)
        if r1 != r2:
            if self.rank[r1] > self.rank[r2]:
                self.parent[r2] = r1
            else:
                self.parent[r1] = r2
                if self.rank[r1] == self.rank[r2]:
                    self.rank[r2] += 1

def kruskal(graph):
    edges = sorted(graph, key=lambda x: x[2])
    mst, ds = [], DisjointSet({u for e in graph for u in e[:2]})
    for u,v,w in edges:
        if ds.find(u) != ds.find(v):
            ds.union(u,v)
            mst.append((u,v,w))
    return mst

graph = [('A','B',1), ('A','C',4), ('B','C',2), ('B','D',5), ('C','D',1)]
print(kruskal(graph))

Output:

[('A', 'B', 1), ('C', 'D', 1), ('B', 'C', 2)]

Here, the MST includes the smallest edges that connect all nodes without forming cycles. Now that we have seen Kruskal’s, we can move further to analyze another algorithm.
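To see which edges get skipped and why, here is a compact trace version of the same idea. It uses a simplified union-find without ranks, so it is a sketch for illustration rather than a replacement for the class above. The name kruskal_trace is made up.

```python
def kruskal_trace(edges):
    """Return (accepted, rejected) edges when building an MST with Kruskal's rule."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)             # unseen nodes start as their own root
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps the trees shallow
            x = parent[x]
        return x

    accepted, rejected = [], []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            rejected.append((u, v, w))      # both ends already connected: would form a cycle
        else:
            parent[root_u] = root_v         # merge the two components
            accepted.append((u, v, w))
    return accepted, rejected

edges = [('A', 'B', 1), ('A', 'C', 4), ('B', 'C', 2), ('B', 'D', 5), ('C', 'D', 1)]
accepted, rejected = kruskal_trace(edges)

print("accepted:", accepted)
print("rejected:", rejected)
print("total weight:", sum(w for _, _, w in accepted))   # total weight: 4
```

The edges A-C and B-D are rejected because, by the time they are considered, their endpoints are already connected through cheaper edges.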

Prim’s Algorithm

Prim’s algorithm also finds an MST, but it grows the tree step by step. It starts with one node and repeatedly adds the smallest edge that connects the current tree to a new node. Think of it as expanding a connected “island” until all nodes are included.

This implementation uses a priority queue (heapq) to always select the smallest available edge efficiently.

import heapq

def prim(graph, start):
    mst, visited = [], {start}
    edges = [(w, start, n) for n,w in graph[start]]
    heapq.heapify(edges)

    while edges:
        w,u,v = heapq.heappop(edges)
        if v not in visited:
            visited.add(v)
            mst.append((u,v,w))
            for n,w in graph[v]:
                if n not in visited:
                    heapq.heappush(edges, (w,v,n))
    return mst

graph = {
    'A':[('B',1),('C',4)],
    'B':[('A',1),('C',2),('D',5)],
    'C':[('A',4),('B',2),('D',1)],
    'D':[('B',5),('C',1)]
}
print(prim(graph,'A'))

Output:

[('A', 'B', 1), ('B', 'C', 2), ('C', 'D', 1)]

Notice how the algorithm gradually expands from node A, always picking the lowest-weight edge that connects a new node.

Let’s now look at an algorithm that can handle graphs with negative edges: Bellman-Ford.

Bellman-Ford Algorithm

Bellman-Ford is a shortest path algorithm that can handle negative edge weights, unlike Dijkstra’s. It works by relaxing all edges repeatedly: if the current path to a node can be improved by going through another node, it updates the distance. After V-1 iterations (where V is the number of vertices), all shortest paths are guaranteed to be found.

This makes it slightly slower than Dijkstra’s but more versatile. It can also detect negative weight cycles by checking for further improvements after the main loop.

def bellman_ford(graph, start):
    dist = {node: float('inf') for node in graph}
    dist[start] = 0
    for _ in range(len(graph)-1):
        for u in graph:
            for v,w in graph[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
    return dist

graph = {
    'A':[('B',4),('C',2)],
    'B':[('C',-1),('D',2)],
    'C':[('D',3)],
    'D':[]
}
print(bellman_ford(graph,'A'))

Output:

{'A': 0, 'B': 4, 'C': 2, 'D': 5}

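Here, the shortest path to each node is found, even though there’s a negative edge (B → C with weight -1). Note that the implementation above stops after the V-1 relaxation passes, so it does not check for negative cycles. To detect one, run a further pass over all edges: if any distance can still be improved, the graph contains a negative cycle reachable from the start node.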
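The negative-cycle check described above (looking for further improvements after the main loop) can be sketched as follows. The function name bellman_ford_checked, the early-exit flag, and the small cyclic graph are additions made for this illustration.

```python
def bellman_ford_checked(graph, start):
    """Bellman-Ford that raises ValueError on a reachable negative-weight cycle."""
    dist = {node: float('inf') for node in graph}
    dist[start] = 0

    for _ in range(len(graph) - 1):
        updated = False
        for u in graph:
            for v, w in graph[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    updated = True
        if not updated:
            break                     # nothing changed, so the distances are final

    # Extra pass: any remaining improvement means a negative cycle exists.
    for u in graph:
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                raise ValueError("negative-weight cycle reachable from start")
    return dist

graph = {
    'A': [('B', 4), ('C', 2)],
    'B': [('C', -1), ('D', 2)],
    'C': [('D', 3)],
    'D': []
}
print(bellman_ford_checked(graph, 'A'))   # {'A': 0, 'B': 4, 'C': 2, 'D': 5}

# Made-up graph whose cycle A -> B -> C -> A has total weight 1 - 2 - 1 = -2.
cyclic = {'A': [('B', 1)], 'B': [('C', -2)], 'C': [('A', -1)]}
try:
    bellman_ford_checked(cyclic, 'A')
except ValueError as error:
    print("error:", error)
```

Raising an exception is one design choice among several. Returning a flag alongside the distances would work too, depending on how the caller wants to handle a graph with no well-defined shortest paths.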

With the main algorithms explained, let’s move on to some practical tips for making these implementations more efficient in Python.

Optimizing Graph Algorithms in Python

When graphs get bigger, little tweaks in how you write your code can make a big difference. Here are a few simple but powerful tricks to keep things running smoothly.

1. Use deque for BFS
If you use a regular Python list as a queue, popping items from the front takes longer the bigger the list gets. With collections.deque, you get instant (O(1)) pops from both ends. It’s basically built for this kind of job.

from collections import deque

queue = deque([start])  # fast pops and appends

2. Go Iterative with DFS
Recursive DFS looks neat, but Python doesn’t like going too deep – you’ll hit a recursion limit if your graph is very large. The fix? Write DFS in an iterative style with a stack. Same idea, no recursion errors.

def dfs_iterative(graph, start):
    visited, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in visited:
            visited.add(node)
            stack.extend(graph[node])
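To see why this matters, consider a graph that is one long chain. A recursive DFS would need one stack frame per node, which is far more than Python's default recursion limit allows. The sketch below runs an iterative version on such a chain. The function name dfs_iterative_count and the chain graph are made up for illustration.

```python
import sys

def dfs_iterative_count(graph, start):
    """Iterative DFS that returns how many nodes are reachable from start."""
    visited, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in visited:
            visited.add(node)
            stack.extend(graph[node])
    return len(visited)

# A chain 0 -> 1 -> 2 -> ... -> 99999, much deeper than the recursion limit.
n = 100_000
chain = {i: [i + 1] for i in range(n - 1)}
chain[n - 1] = []

print("recursion limit:", sys.getrecursionlimit())
print("nodes visited:", dfs_iterative_count(chain, 0))
```

The explicit stack lives on the heap as an ordinary list, so its depth is limited only by available memory.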

3. Let NetworkX Do the Heavy Lifting
For practice and learning, writing your own graph code is great. But if you’re working on a real-world problem – say analyzing a social network or planning routes – the NetworkX library saves tons of time. It comes with optimized versions of almost every common graph algorithm plus nice visualization tools.

import networkx as nx

G = nx.Graph()
G.add_edges_from([('A','B'), ('A','C'), ('B','D'), ('C','D')])

print(nx.shortest_path(G, source='A', target='D'))

Output:

['A', 'B', 'D']

Instead of worrying about queues and stacks, you can let NetworkX handle the details and focus on what the results mean.

Key Takeaways

An adjacency matrix is fast for lookups but is memory-heavy.
An adjacency list is space-efficient for sparse graphs.
NetworkX makes graph analysis much easier for real-world projects.
BFS explores layer by layer, DFS explores deeply before backtracking.
Dijkstra’s and A* handle shortest paths.
Kruskal’s and Prim’s build spanning trees.
Bellman-Ford works with negative weights.

Conclusion

Graphs are everywhere, from maps to social networks, and the algorithms you have seen here are the building blocks for working with them. Whether it is finding paths, building spanning trees, or handling tricky weights, these tools open up a wide range of problems you can solve.

Keep experimenting and try out libraries like NetworkX when you are ready to take on bigger projects.

A Beginner Developer's Guide to Scrum

Aditya Vikram Kashyap — Wed, 23 Jul 2025 19:48:05 +0000

Let me guess: you’re learning to code…alone.

You’ve been grinding through tutorials. You've built a portfolio site, maybe deployed a few projects on GitHub. And now you're trying to land a job or join a team.

Then the interviews start.

Suddenly, people ask:

"Are you familiar with Agile?"
"Have you worked in a Scrum environment?"
"What’s your experience with sprints?"

Cue the imposter syndrome. Because no one teaches this stuff in JavaScript 101.

This guide is for you.

I’ll help make the Scrum process – a very common way developers work together – make actual sense. I’ll walk you through the basics, but also tell you what developers actually do, how standups feel when you're new, and what’s expected of you when you’re no longer coding in a vacuum.

Let’s break it down.

Here’s what we’ll cover:

What Even Is Scrum?
The Three Roles in Scrum (and Who Does What)
The Scrum Rhythm: What a Sprint Actually Looks Like
Who attends the Ceremonies:
Standups: Where You Talk Like a Human, Not a Robot
Sprint Planning
What’s a User Story and Why Does It Sound Like a Children’s Book?
What Counts as “Done”? Definition of Done and Why It’s Important
Demos, Retros, and Saying the Hard Things
Tools You Might Encounter
If You’re Preparing for a Job, Here’s What You Can Do
Final Thoughts

What Even Is Scrum?

Scrum is not a tool. It’s not software. It’s not some elite thing only PMs care about.

It’s a lightweight framework that helps software teams build things incrementally, together, in short focused cycles called sprints.

Scrum is used by everyone from FAANG teams to indie dev shops because it helps:

Keep teams aligned
Deliver working software fast
Course-correct often
Spot problems early (before they go nuclear)

It’s the opposite of the old-school “build for a year and pray it works” model.

The Three Roles in Scrum (and Who Does What)

Scrum officially defines three roles. Here's what that means in practice:

1. Product Owner (PO)

Think: Vision-holder. They decide what the team builds and why. A product owner:

Writes user stories (think of these as feature requests written from a user’s point of view)
Prioritizes the work
Clarifies what success looks like
Says “yes” or “not yet” to features

2. Scrum Master (SM)

Think: Air-traffic controller meets therapist. They make sure the process works. They are master facilitators, for example between the developers and the PO. A Scrum Master:

Facilitates meetings
Removes blockers (“Your AWS access is stuck? I’ll escalate it.”)
Coaches the team on Scrum practices
Doesn’t manage people – manages flow

3. Developers (YOU!)

Think: Builders. You write code, test it, ship it, fix it, and improve it. You also:

Break down stories into tasks
Pick up work from the team board (like Jira or Trello)
Communicate progress
Demo what you’ve built at the end of the sprint

You might also work with designers, testers, or DevOps folks – but within Scrum, you’re all “developers” building a product together.

The Scrum Rhythm: What a Sprint Actually Looks Like

Image Source: https://www.invensislearning.com/blog/what-are-scrum-ceremonies/

Understanding the Scrum Cycle

So, what does it actually look like when a team uses Scrum to build software?

Let’s walk through a full sprint – not just the buzzwords, but what really happens when a group of humans tries to plan, build, review, and improve together. Think of this as your backstage pass to the rhythm of modern teamwork.

📦 Step 1: Build and Refine the Product Backlog

Before any coding starts, the team needs to agree on what they might build – not just this week, but in the near future too.

That’s where the Product Backlog comes in. This is a big, running list of everything the product might need – features, bug fixes, improvements, ideas, and maybe a few wild dreams. It’s like the wishlist for the product, but more organized (ideally).

The Product Owner is responsible for maintaining and prioritizing this list. They decide what’s most important to work on based on customer needs, business goals, and feedback.

But the PO doesn’t do this in isolation. Enter the Backlog Refinement meeting.

In these sessions, the Scrum Team – that’s the PO, the Scrum Master (SM), and the Developers – come together to:

Review the most important upcoming items
Clarify any vague or confusing parts of each task
Break big items down into smaller, buildable pieces called user stories
Estimate effort (how much time or complexity is involved for each story)

This meeting makes sure the team isn’t caught off guard in the sprint – that they understand the work ahead and can actually start sprinting when the time comes.

🧭 Step 2: Sprint Planning – What Are We Building This Time?

Now that we’ve got a solid backlog, it’s time to pick what to build right now.

At the start of each sprint (which typically lasts 1 to 4 weeks), the team holds a Sprint Planning meeting. This meeting sets the stage for the entire sprint – it’s like the huddle before the big game.

In Sprint Planning, the team:

Reviews the top items from the backlog
Discusses what can realistically be completed based on their availability and capacity
Chooses a handful of these stories to commit to
Defines a Sprint Goal – a simple statement that captures the purpose of this sprint

For example, the Sprint Goal might be:
🎯 “Allow users to reset their passwords.”

Every user story chosen should contribute to that goal. The collection of these stories becomes the Sprint Backlog – basically, the to-do list for the sprint.

So when we say:

“The team selects an ordered list of user stories to comprise the Sprint Backlog for the next sprint, which will be achievable to satisfy the Sprint Goal...”

We’re really just saying:
👉 “Pick a realistic number of important tasks that, if completed, will help us hit our target for the sprint.”

Not too vague. Not too ambitious. Just achievable and focused.

☀️ Step 3: Daily Standups – Stay in Sync

Now the sprint is underway! But how does everyone stay aligned and avoid working in silos?

That’s where the Daily Standup comes in. Every day – usually in the morning – the team has a quick check-in (about 15 minutes) where each person answers three questions:

What did I do yesterday?
What am I working on today?
Is anything blocking me? (that is, am I stuck?)

Example:

“Yesterday I set up the login API integration. Today I’ll work on the UI validation. I’m blocked on getting access to the staging database — may need help.”

These standups keep the team in sync and surface blockers early so they can be addressed quickly. They’re not about micromanaging or showing off. They’re about visibility and support.

📉 What’s a Sprint Burndown Chart?

You might hear your team mention a “burndown chart.” No, this isn’t about things going down in flames (hopefully).

A Sprint Burndown Chart is a graph that shows how much work is left in the sprint – day by day.

The y-axis is the amount of work remaining (often measured in story points or tasks)
The x-axis is time (the days of the sprint, from the first to the last)

The line should ideally trend downward as work gets completed – hence “burning down.” If it flattens out or slopes up, that’s a red flag that the team might be stuck, behind schedule, or not updating the board.

Think of it as a visual heartbeat of the sprint. You can learn more via a practical example in this video.

🖥️ Step 4: Sprint Review – Show What You’ve Built

At the end of the sprint, the team holds a Sprint Review (also called a “demo”). This is where you show what was actually built during the sprint.

The Developers demo working features – live, not just screenshots
The Product Owner reviews whether the Sprint Goal was achieved
Stakeholders may ask questions, give feedback, or suggest tweaks

This meeting isn’t just for show – it’s a feedback loop. It helps the team validate that what they built is useful, usable, and meets expectations. If changes are needed, those get added to the backlog for future sprints.

🔍 Step 5: Sprint Retrospective – Look Back to Move Forward

Once the review is done, the team shifts focus from what they built to how they worked together.

Enter the Sprint Retrospective – a meeting to reflect on the process, not the product.

The team discusses:

✅ What went well
❌ What didn’t go so well
🔁 What could be improved next time

This isn’t about pointing fingers. It’s about learning, adapting, and continuously improving how the team collaborates.

The Scrum Master often facilitates this meeting and helps turn feedback into action items for the next sprint. For example:

“We underestimated testing time. Next sprint, let’s budget for QA earlier.”

The best teams take retros seriously. Why? Because even if your code is perfect, your process needs tuning too – and small process changes often lead to big gains.

♻️ Scrum Is a Loop

Here’s the rhythm:

Plan the sprint
Check in daily
Build and demo the product
Reflect and improve

Then do it all over again – with slightly better coordination and slightly more trust each time.

It’s not about being fast. It’s about being intentional, consistent, and collaborative.

Example Sprint

Let’s say, for example, that your team does 4-week sprints. (Keep in mind that Sprints can differ by team, nature of product, release cycles, and so on.)

Here’s the rough beat:

Week	What Happens (Sprint Ceremonies)	Your Role
1	Sprint Planning	Help estimate effort, pick what to build
1-4	Daily Stand ups (15 mins)	Share what you’re doing & any blockers
1-3	Development Time	Code, test, commit, fix, push, repeat
3.5-4	Sprint Review	Demo what you built
4	Sprint Retrospective	Reflect on how the sprint went as a team

Scrum works in loops. Every 1 to 4 weeks (depending on your cadence and sprint cycle), your team should have working, demo-able software to show for it – even if it’s small.

And no, it’s not about “speed.” It’s about consistency, communication, and collaboration.

Who attends the Ceremonies:

Ceremony	Who Attends	Why They’re There
Sprint Planning	Product Owner (PO), Scrum Master (SM), Development Team	To define what will be delivered and how the work will be accomplished
Daily Standup	Development Team, Scrum Master (optional), PO (optional)	To sync on progress, share blockers, and coordinate efforts
Sprint Review	Development Team, Scrum Master, Product Owner, Stakeholders	To demo the work, get feedback, and assess if goals were met
Sprint Retrospective	Development Team, Scrum Master, Product Owner (optional)	To reflect on the process, identify what worked/what didn’t, and improve the next sprint
Backlog Refinement	Product Owner, Development Team, Scrum Master (optional)	To clarify upcoming stories, estimate work, and prepare for future sprint planning

Now let’s dive deeper and look at how each of these ceremonies works in practice:

Standups: Where You Talk Like a Human, Not a Robot

So how does the team actually stay connected day to day? That’s where standups come in.

Every morning, your team meets briefly – usually on Zoom or in a circle – and you answer 3 questions:

What did I work on yesterday?
What will I work on today?
What’s blocking me? Any impediments?

Example:

"Yesterday I cleaned up the signup validation logic. Today I’m working on the email verification flow. I’m stuck on SendGrid config – might need help setting up credentials."

It’s not about impressing anyone. It’s about keeping everyone in sync. Some days you’ll say, “I spent the whole day debugging a CSS bug that turned out to be a semicolon.” That’s okay.

How does it work?

The Scrum Master gathers everyone in a huddle room, the PO and Dev Team included, and opens the Standup. They are the facilitator of the ceremony. Everyone gets a chance to answer the 3 questions above (usually about 2-5 minutes each). It’s not a full report – it’s quick. When one person is done, they pass it on to someone else.

This ensures there is team cohesion and transparency.

Here is a video example of a standup.

Sprint Planning

The goal of the planning meeting is to answer the questions “What are we going to work on, and how are we going to do it?” It is critical for the team to have a shared goal and a shared commitment to this goal before beginning this ceremony.

Participants should:

Measure growth
Sync with the Scrum Master
Sync with the Product Owner

Sprint planning happens just before the sprint starts, and usually lasts for an hour or two. In this meeting, the team goes over a collection of user stories and discusses, plans, measures, and prioritizes them. This is where they decide what is going to be in scope for their upcoming sprint cycle.

The Product Owner will have a prioritized view of things in the backlog. They work with the team on each item or customer experience. Together, as a group, they go through the items, estimate the effort involved, and decide what they can commit to.

What’s a User Story and Why Does It Sound Like a Children’s Book?

So you might be wondering: how do you know what to work on? What to build? So much work, so little time? That’s where user stories come in.

In Scrum, teams don’t just write vague tasks like “code the login.” Instead, they write user stories – short, human-centered feature descriptions that describe what the user needs, why they need it, and what success looks like.

Here’s an example:

As a user, I want to be able to reset my password, so I can access my account if I forget it.

User stories are the scaffolding of teamwork. They’re written with empathy, not just efficiency. And each one comes with acceptance criteria – a checklist that clarifies what “done” actually means:

A “Forgot Password” link is visible
Clicking it shows a form
An email gets sent with a reset link

Once a story is agreed upon, developers break it down into tasks, like “build form,” “hook into backend,” or “handle email validation.” It’s collaborative, not prescriptive. And user stories have priority so you know what’s the most important and what’s the least.

A helpful rule of thumb many teams use is the Gherkin-style "Given–When–Then" format:

Given some initial context
When an event occurs
Then a specific outcome should happen

This ensures that everyone – devs, testers, and product owners – shares the same understanding of behavior and expectations.

Here is a great video example that outlines how to draft effective and powerful user stories.

What Counts as “Done”? Definition of Done and Why It’s Important

Now you might be wondering – how do I know when a task is done and can be closed out?

The Definition of Done is a type of documentation in the form of a team agreement. The Definition of Done identifies the conditions that need to be achieved in order for the product to be considered done (as in potentially shippable).

This is how we know that we "did the thing right". Meaning, we built the correct level of quality into the product. The Definition of Done is not the same as the acceptance criteria, which are written by the product owner to help us know we did the "right thing".

Every team has a Definition of Done – it’s not just “I pushed code.” It could mean:

Code is written
Reviewed by a peer
Merged into main
Tested on staging
Possibly deployed

This clarity keeps teams honest and accountable. The DoD sets a quality bar. It prevents ambiguity, rework, and “it works on my machine” moments. When every card on the board passes the same finish line, teams move faster – and trust each other more.

Everyone on a team should know what “done” means. Either it’s done as per the DoD or it’s not.

Here is a beautiful video highlighting the importance of DoD.

Demos, Retros, and Saying the Hard Things

Once you’ve built the product, then come demos (showcasing your work) and retros (analyzing as a team what went well and what to improve).

In the retro, everyone’s encouraged to speak up:

What went well?
What didn’t?
What should we try next time?

Example:

“We missed a lot of stories because we didn’t account for testing time. Maybe we buffer next sprint with fewer tasks.”

The goal is not to blame – it’s to improve. Over time, this feedback loop becomes gold. The Scrum Master usually facilitates, collects feedback (via tools like Parabol, Miro, or sticky notes), and helps turn insights into actionable experiments for the next sprint.

Over time, retros become the heartbeat of team evolution.

Here is a video highlighting the importance of a Retro and Sprint Review.

🧠 Why Retrospection Matters More Than You Think

The Sprint Retrospective is more than just another meeting. It’s a mirror for your team – a safe, structured space to pause, reflect, and improve together.

You discuss:

✅ what went well

❌ what did not go well

🔁 what could we do better next time

Great teams don't just deliver great software, they continually deliver better ways of working.

This is why many experienced Scrum practitioners consider the retro to be the most important event in Scrum. Code is deployed once, but process improvements grow exponentially, sprint after sprint.

Tools You Might Encounter

Scrum doesn’t require software, but real-world teams use a variety of tools:

Jira – Tracks sprints, issues, velocity
Trello – Simple board, good for small teams
Slack – Where standups often happen if async
Notion / Confluence – Docs, retros, notes
GitHub Projects – Lightweight planning for devs

Don’t worry if you’re not fluent in these yet. They’re tools – you’ll learn them on the job.

If You’re Preparing for a Job, Here’s What You Can Do

✍️ Practice writing user stories from your side projects
🧪 Run a mini-sprint: Plan your weekend project, set goals, and “review” it at the end
🤝 Contribute to an open-source project that uses Scrum or Agile workflows
🧾 Write about what you learned – maybe as a tutorial (hint hint)

Final Thoughts

So to recap, Scrum is a simple yet powerful way for teams to work together, stay organized, and deliver results quickly. It runs in short cycles called sprints, where the team plans what to do, checks in daily, shows their progress at the end, and reflects on how to improve.

The four key ceremonies – Sprint Planning, Daily Scrum, Sprint Review, and Sprint Retrospective – help keep everyone aligned and focused. With clear roles and regular feedback, Scrum makes it easier to handle changes, solve problems early, and continuously get better as a team.

But Scrum isn’t a magic spell. It’s just a way for humans to build complex things – together – without falling apart.

You don’t need to be a Scrum Master. You don’t need a certification. But if you understand how sprints work, what’s expected of you, and how to show up to meetings with clarity and candor, you’re 10 steps ahead of most.

Scrum helps teams talk, plan, build, and learn. And now? You can too.

If you liked this, please do share. You never know who it might help out.

Until then…keep learning, unlearning, and relearning!!!

Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code

Kuriko — Fri, 30 May 2025 18:21:29 +0000

The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design.

In this tutorial, I’ll show you how to build both single layer and multi-layer perceptrons (MLPs) across three frameworks:

Custom classifier
Scikit-learn’s MLPClassifier
Keras Sequential classifier using SGD and Adam optimizers.

This will help you learn about their various use cases and how they work.

What is a Perceptron?
How to Build a Single-Layered Classifier
What is a Multi-Layer Perceptron?
How to Build Multi-Layered Perceptrons
Understanding Optimizers
How to Build an MLP Classifier with SGD Optimizer
How to Build an MLP Classifier with Adam Optimizer
Final Results: Generalization
Conclusion

Prerequisites

Mathematics (Calculus, Linear Algebra, Statistics)
Coding in Python
Basic understanding of Machine Learning concepts

What is a Perceptron?

A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.

A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.

But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.

The perceptron consists of four main parts:

Input layer: Takes the initial numerical values into the system for further processing.
Weights: Combines input values with weights (and bias terms).
Activation function: Determines whether the neuron should fire based on the threshold value.
Output layer: Produces classification result.

It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.

So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.

Applications of Perceptrons

Perceptrons are applied to tasks such as:

Image classification: Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.
Linear regression: With a linear (identity) activation instead of the step function, a perceptron can predict continuous outputs based on input features. This makes it useful for solving linear regression problems.

How the Activation Function Works

For a single perceptron used for binary classification, the most common activation function is the step function (also known as the threshold function):

$$\phi(z) = \begin{cases} 1 &\text{if } z \geq \theta \\ \\ 0 &\text{if } z < \theta \end{cases}$$

where:

ϕ(z): the output of the activation function.
z: the weighted sum of the inputs plus the bias:

$$z = \sum_{i=1}^m w_i x_i + b$$

(x_i: input values, w_i: weight associated with each input, b: bias term)

θ is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.

In that case, the formula becomes:

$$\phi(z) = \begin{cases} 1 &\text{if } z \geq 0 \\ \\ 0 &\text{if } z < 0 \end{cases}$$

When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.

This occurs when the weighted sum is greater than or equal to zero, leading the perceptron to predict that the input belongs to this class.

While the step function is conceptually the original activation for a perceptron, its discontinuity at zero causes computational challenges.

In modern implementations, we can use other activation functions like the sigmoid function:

$$\sigma (z) = \frac {1} {1 + e^{-z}}$$

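Unlike the step function, the sigmoid outputs a continuous value between zero and one that depends on the weighted sum (z). To get a class label, we threshold that value, typically at 0.5.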
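As a quick comparison of the two activations, here is a minimal NumPy sketch. The sample values in z and the 0.5 cut-off used to turn sigmoid outputs into labels are assumptions chosen for illustration, and the step function follows the ≥ convention from the formula above.

```python
import numpy as np


def step(z, threshold=0.0):
    # Hard decision: 1 when z reaches the threshold, 0 otherwise
    return np.where(z >= threshold, 1, 0)


def sigmoid(z):
    # Smooth, differentiable squashing of z into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))


z = np.array([-2.0, 0.0, 2.0])          # illustrative weighted sums
step_out = step(z)                      # [0, 1, 1]
sigmoid_out = sigmoid(z)                # roughly [0.119, 0.5, 0.881]
labels = (sigmoid_out >= 0.5).astype(int)  # threshold probabilities at 0.5

print(step_out)
print(np.round(sigmoid_out, 3))
print(labels)
```

Both functions lead to the same labels here, but only the sigmoid reports how far an input is from the decision boundary.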

How the Loss Function Works

The loss function is a crucial concept in machine learning that quantifies the error or discrepancy between the model's predictions and the actual target values.

Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model's parameters in a way that minimizes this error and improves performance.

In a binary classification task, the model may adopt the hinge loss function to penalize misclassifications by incurring an additional cost for incorrect predictions:

$$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$

(h(x): the model’s raw prediction score, y: the true label, encoded as −1 or +1)
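Here is a small sketch of the hinge loss in NumPy. It assumes the usual hinge-loss convention that labels are encoded as −1 or +1 and that h(x) is a raw score rather than a 0/1 label. The y_true and scores arrays are made-up values.

```python
import numpy as np


def hinge_loss(y_true, scores):
    # Per-sample hinge loss: zero once the sample is on the correct side
    # of the boundary with a margin of at least 1
    return np.maximum(0.0, 1.0 - y_true * scores)


# Illustrative labels (-1 / +1) and raw model scores
y_true = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.3, -1.5, 0.4])

losses = hinge_loss(y_true, scores)
print(losses)          # per-sample losses
print(losses.mean())   # average loss over the batch
```

The second sample is classified correctly but sits inside the margin, so it still incurs a small loss. The fourth sample is misclassified and incurs the largest one.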

How to Build a Single-Layered Classifier

Now, let’s build a simple single-layer perceptron for binary classification.

1. Custom Classifier

Initialize the classifier

We’ll first initialize the classifier with weights, bias, number of epochs (n_iterations), and learning rate (learning_rate).

def __init__(self, learning_rate=0.01, n_iterations=1000):
    self.learning_rate = learning_rate
    self.n_iterations = n_iterations
    self.weights = None
    self.bias = None

Define the activation function

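Use a step function that returns zero if the input (x) is less than or equal to the threshold, and one otherwise. By default, the threshold is set to zero. Note that this implementation uses a strict inequality, so an input exactly equal to the threshold maps to zero, whereas the formula above maps it to one.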

def _step_function(self, x, threshold: int = 0):
    return np.where(x > threshold, 1, 0)

Train the model

Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: weights and bias.

This process is controlled by a specified number of training epochs defined by n_iterations.

In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined learning_rate.

def fit(self, X, y):
    n_samples, n_features = X.shape

    self.weights = np.zeros(n_features)
    self.bias = 0

    for _ in range(self.n_iterations):
        for i in range(n_samples):
            # compute weighted sum (z)
            z = np.dot(X[i], self.weights) + self.bias

            # apply the activation function
            y_pred = self._step_function(z)

            # update weights and bias
            self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
            self.bias += self.learning_rate * (y[i] - y_pred)

How the weights work in the iteration loop

The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.

Its iterative update in the for loop aims to reduce classification errors such that:

$$\begin {align*} w_j &:= w_j + \Delta w_j \\ & := w_j + \eta (y_i - \hat y_i)x_{ij} \\ &= \begin{cases} w_j &\text{(a) } y_i - \hat y_i = 0\\ w_j + \eta x_{ij} &\text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} &\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$

(w_j: j-th weight, η: learning rate, x_ij: j-th feature of the i-th sample, (y_i − ŷ_i): error)

This means that:

When the prediction is correct, the error is zero, so the weight is unchanged.
When the prediction is too low (y_i = 1 and ŷ_i = 0), the weight is shifted in the direction of the input (by +η·x_ij) to increase the weighted sum.
When the prediction is too high (y_i = 0 and ŷ_i = 1), the weight is shifted in the opposite direction (by −η·x_ij) to pull the weighted sum lower.

How the bias terms work in the iteration loop

The bias determines the decision boundary’s intercept (position from the origin).

Similar to weights, we adjust the bias term after each training sample to position the decision boundary:

$$\begin {align*} b &:= b + \Delta b \\ & := b + \eta (y_i - \hat y_i) \\ &= \begin{cases} b &\text{(a) } y_i - \hat y_i = 0\\ b + \eta &\text{(b) } y_i - \hat y_i = 1 \\ b - \eta &\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$

This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.
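To make the update rules concrete, here is one update step worked through in NumPy. The learning rate, the starting weights, and the single sample (x_i, y_i) are made-up values, and the strict inequality mirrors the _step_function defined earlier.

```python
import numpy as np

eta = 0.1                      # learning rate (illustrative value)
w = np.array([0.0, 0.0])       # weights start at zero, as in fit()
b = 0.0                        # bias starts at zero
x_i = np.array([2.0, -1.0])    # one made-up training sample
y_i = 1                        # its true label

z = np.dot(x_i, w) + b         # weighted sum: 0.0
y_hat = int(z > 0)             # prediction: 0, which is too low
error = y_i - y_hat            # case (b): error is +1

w = w + eta * error * x_i      # weights move toward the input
b = b + eta * error            # bias moves up by eta

z_new = np.dot(x_i, w) + b     # weighted sum after the update
print(w, b, z_new)
```

After this single step, the weighted sum for the same sample is positive, so the perceptron now classifies it as class one.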

Make a prediction

Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):

def predict(self, X):
    linear_output = np.dot(X, self.weights) + self.bias
    predictions = self._step_function(linear_output)
    return predictions

The entire classifier looks like this:

import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    def _step_function(self, x, threshold: int = 0):
        return np.where(x > threshold, 1, 0)

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iterations):
            for i in range(n_samples):
                linear_output = np.dot(X[i], self.weights) + self.bias
                y_pred = self._step_function(linear_output)
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
        return self

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        y_pred = self._step_function(linear_output)
        return y_pred

Simulate with synthetic datasets

First, we generate a synthetic, linearly separable dataset using make_blobs, then train the classifier we created and evaluate the decision boundary it learns.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import numpy as np

# create a mock dataset
X, y = make_blobs(n_features=2, centers=2, n_samples=1000, random_state=12)

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train the model
perceptron = Perceptron(learning_rate=0.1, n_iterations=1000).fit(X_train, y_train)

# make a prediction
y_pred_train = perceptron.predict(X_train)
y_pred_test = perceptron.predict(X_test)

# evaluate the results
acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(f"Accuracy (Train): {acc_train:.3} \nAccuracy (Test): {acc_test:.3}")

Results

The classifier generated a clear, highly accurate linear decision boundary.

Accuracy (Train): 0.981
Accuracy (Test): 0.975

2. Leverage Scikit-learn’s MLPClassifier

For our convenience, we’ll use scikit-learn’s built-in classifier (MLPClassifier) to build a similar, yet more robust classifier:

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(), # intentionally set empty to create a single layer perceptron
    activation='logistic', # choosing a sigmoid function as an activation function
    solver='sgd', # choosing SGD optimizer
    max_iter=1000,
    random_state=42,
    learning_rate='constant',
    learning_rate_init=0.1
).fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(f"MLPClassifier\nAccuracy (Train): {acc_train:.3} \nAccuracy (Test): {acc_test:.3}")

Results

The MLPClassifier generated a clear linear decision boundary with slightly better accuracy scores.

Accuracy (Train): 0.985
Accuracy (Test): 0.995

Limitations of Single-Layer Perceptrons

Now, let’s talk about the key differences between the MLPClassifier and our custom single-layer perceptron.

Unlike more general neural networks, single-layer perceptrons use a step function as their activation.

Due to its discontinuity at x=0, the step function is not differentiable over its entire domain (−∞ to ∞).

This fundamental property precludes the use of gradient-based optimization algorithms such as SGD or Adam, as these methods depend on computing gradients (the partial derivatives of the cost function), and the step function’s gradient is zero everywhere except at the discontinuity.

In contrast, most neural networks employ differentiable activation functions (for example, sigmoid, ReLU) and loss functions (for example, MSE, Cross-Entropy) for effective optimization.
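A short numerical check illustrates why this matters. The sketch below estimates derivatives with a central difference. The helper name central_difference, the step size h, and the sample points in z are illustrative choices.

```python
import numpy as np


def step(z):
    return np.where(z > 0, 1.0, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def central_difference(f, z, h=1e-4):
    # Numerical estimate of df/dz at each point in z
    return (f(z + h) - f(z - h)) / (2 * h)


z = np.array([-2.0, -0.5, 0.5, 2.0])   # points away from the jump at 0

step_grad = central_difference(step, z)        # flat everywhere: all zeros
sigmoid_grad = central_difference(sigmoid, z)  # positive everywhere

print(step_grad)
print(np.round(sigmoid_grad, 3))
```

With a zero gradient, a gradient-based optimizer receives no signal about which way to move the weights. The sigmoid always provides a non-zero slope to follow.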

Other challenges of a single-layer perceptron include:

Limited to linear separability: Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.
Lack of depth: Being single-layered, they cannot learn complex hierarchical representations.
Limited optimizer options: As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.

So, in the next section, you’ll learn about multi-layered perceptrons to overcome the disadvantages.

What is a Multi-Layer Perceptron?

An MLP is a class of feedforward artificial neural network that consists of at least three layers of nodes:

an input layer,
one or more hidden layers, and
an output layer.

Except for the input nodes, each node is a neuron that uses a nonlinear activation function.

MLPs are widely used for both classification and regression:

Classification tasks: Problems such as handwriting recognition and speech recognition.
Regression analysis: Problems where the relationship between input and output is complex.

How to Build Multi-Layered Perceptrons

Let’s handle a binary classification task using a standard MLP architecture.

Outline of the Project

Objective

Detect fraudulent transactions

Evaluation Metrics

Considering the cost of misclassification, we’ll prioritize improving Recall and Precision scores
Then check the overall correctness of the classification with the Accuracy Score: (TP + TN) / (TP + TN + FP + FN)

Cost of Misclassification (from high to low):

False Negative (FN): The model incorrectly identifies a fraudulent transaction as legitimate (Missing actual fraud)
False Positive (FP): The model incorrectly identifies a legitimate transaction as fraudulent (Blocking legitimate customers.)
True Positive (TP): The model correctly identifies a fraudulent transaction as fraud.
True Negative (TN): The model correctly identifies a non-fraudulent transaction as non-fraud.
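As a quick reference for how these four outcomes feed the metrics, here is a small sketch. The helper name and the counts are hypothetical and chosen only for illustration; they are not results from this project.

```python
def classification_metrics(tp, tn, fp, fn):
    """Recall, precision, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)      # share of actual fraud that was caught
    precision = tp / (tp + fp)   # share of fraud alerts that were real fraud
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, precision, accuracy

# hypothetical counts for a balanced 400-sample evaluation set
recall, precision, accuracy = classification_metrics(tp=150, tn=160, fp=40, fn=50)
print(f"Recall: {recall:.3f}, Precision: {precision:.3f}, Accuracy: {accuracy:.3f}")
```

Because a False Negative is the costliest error here, Recall (which falls as FN grows) is the first number to watch.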

Planning an MLP Architecture

In the network, 19 input features feed into the first hidden layer’s 30 neurons, which use a ReLU activation function.

Then, their outputs are passed to the second layer, culminating in sigmoid values as the final output.

During the optimization process, we’ll let the optimizer (SGD and Adam) perform forward and backward passes to adjust parameters.

Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using image source)

Especially in deeper networks, ReLU helps mitigate the vanishing gradient problem, where gradients become extremely small as they are backpropagated from the output layer toward the input layer.
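A rough way to see the effect is to multiply activation derivatives across layers. The sketch below is a simplification that ignores the weight matrices entirely (an assumption made only for illustration): the sigmoid derivative never exceeds 0.25, so the product collapses quickly, while an active ReLU unit contributes a factor of 1.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

def gradient_scale(derivative, z, n_layers):
    """Product of activation derivatives across n_layers (weights ignored)."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= derivative(z)
    return scale

for n_layers in (2, 10, 30):
    sig = gradient_scale(sigmoid_derivative, 0.0, n_layers)  # z=0 is sigmoid's best case
    relu = gradient_scale(relu_derivative, 1.0, n_layers)    # an active ReLU unit
    print(f"{n_layers:>2} layers | sigmoid: {sig:.3e} | relu: {relu:.1f}")
```

Note that ReLU is not a complete cure: an inactive unit (z <= 0) passes a gradient of exactly zero.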

Learn More: A Comprehensive Guide on Neural Network in Deep Learning

Preprocessing the Datasets

First, we consolidate three datasets – transaction, customer, and credit card – into a single DataFrame, independently sanitizing numerical and categorical data:

import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# download the raw data to local
import kagglehub
path = kagglehub.dataset_download("computingvictor/transactions-fraud-datasets")
dir = f'{path}/gd_card_flaud_demo'

def sanitize_df(amount_str):
    """Removes '$' and converts the string to a float."""
    if isinstance(amount_str, str):
        return float(amount_str.replace('$', ''))
    return amount_str

# load transaction data
trx_df = pd.read_csv(f'{dir}/transactions_data.csv')

# sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)
trx_df = trx_df[trx_df['errors'].isna()]
trx_df = trx_df.drop(columns=['merchant_city','merchant_state', 'date', 'mcc', 'errors'], axis='columns')
trx_df['amount'] = trx_df['amount'].apply(sanitize_df)

# merge the dataframe with fraud transaction flag.
with open(f'{dir}/train_fraud_labels.json', 'r') as fp:
    fraud_labels_json = json.load(fp=fp)

fraud_labels_dict = fraud_labels_json.get('target', {})
fraud_labels_series = pd.Series(fraud_labels_dict, name='is_fraud')
fraud_labels_series.index = fraud_labels_series.index.astype(int) # convert the datatype from string to integer
merged_df = pd.merge(trx_df, fraud_labels_series, left_on='id', right_index=True, how='left')
merged_df.fillna({'is_fraud': 'No'}, inplace=True)
merged_df['is_fraud'] = merged_df['is_fraud'].map({'Yes': 1, 'No': 0})

# load card data
card_df = pd.read_csv(f'{dir}/cards_data.csv')
card_df = card_df.drop(columns=['client_id', 'acct_open_date', 'card_number', 'expires', 'cvv'], axis='columns')
card_df['credit_limit'] = card_df['credit_limit'].apply(sanitize_df)

# merge transaction and card data
merged_df = pd.merge(left=merged_df, right=card_df, left_on='card_id', right_on='id', how='inner')
merged_df = merged_df.drop(columns=['id_y', 'card_id'], axis='columns')

# converts categorical variables into a new binary column (0 or 1)
categorical_cols = merged_df.select_dtypes(include=['object']).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=False, dtype=float) 
df = df.dropna().drop(['client_id', 'id_x'], axis=1)
print('\nDataFrame: \n', df.head(n=3))

DataFrame:

Our DataFrame shows an extremely skewed data distribution with:

Fraud samples: 1,191
Non-fraud samples: 11,477,397

For classification tasks, it's crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact on classification model performance, especially regarding the minority class.
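To see why the imbalance matters, consider a trivial baseline built from the class counts reported above. This sketch assumes a "model" that labels every transaction as non-fraud: it looks almost perfect on accuracy while catching no fraud at all.

```python
n_fraud = 1_191
n_non_fraud = 11_477_397

# a "model" that labels every transaction as non-fraud
tp, fn = 0, n_fraud
tn, fp = n_non_fraud, 0

baseline_accuracy = (tp + tn) / (tp + tn + fp + fn)
baseline_recall = tp / (tp + fn)

print(f"Accuracy: {baseline_accuracy:.4%}")
print(f"Recall:   {baseline_recall:.1%}")
```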

For our data, we’ll:

split the 1,191 fraud samples into training, validation, and test sets,
add an equal number of randomly chosen non-fraud samples from the DataFrame, and
adjust split balances later if generalization challenges arise.

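# separate the fraud and non-fraud samples
df_fraud = df[df['is_fraud'] == 1]
df_non_fraud = df[df['is_fraud'] == 0]

# define the desired size of the fraud samples for the validation and test sets
val_size_per_class = 200
test_size_per_class = 200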

# create test sets
X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=42)
X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=42)

# combine to form the balanced test set
X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_test = X_test['is_fraud']
X_test = X_test.drop('is_fraud', axis=1)

# remove sampled rows from the original dataframes to avoid data leakage
df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)


# create validation sets
X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=42)
X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=42)

# combine to form the balanced validation set
X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_val = X_val['is_fraud']
X_val = X_val.drop('is_fraud', axis=1)

# remove sampled rows from the remaining dataframes
df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)


# create training sets
min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))

X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=42)
X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=42)

X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_train = X_train['is_fraud']
X_train = X_train.drop('is_fraud', axis=1)


print("\n--- Final Dataset Shapes and Distributions ---")
print(f"X_train shape: {X_train.shape}, y_train distribution: {np.unique(y_train, return_counts=True)}")
print(f"X_val shape: {X_val.shape}, y_val distribution: {np.unique(y_val, return_counts=True)}")
print(f"X_test shape: {X_test.shape}, y_test distribution: {np.unique(y_test, return_counts=True)}")

After the operation, we secured 1,582 training, 400 validation, and 400 test samples, each dataset maintaining a 50:50 split between fraud and non-fraud transactions:

Considering the high dimensional feature space with 19 input features, we’ll apply SMOTE to resample the training data (SMOTE should not be applied to validation or test sets to avoid data leakage):

from imblearn.over_sampling import SMOTE
from collections import Counter

train_target = 2000

smote_train = SMOTE(
  sampling_strategy={0: train_target, 1: train_target},  # increase sample size to 2,000
  random_state=12
)
X_train, y_train = smote_train.fit_resample(X_train, y_train)

print(f"\nAfter SMOTE with custom sampling_strategy (target train: {train_target}):")
print(f"X_train_oversampled shape: {X_train.shape}")
print(f"y_train_oversampled distribution: {Counter(y_train)}")
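For intuition about what SMOTE generates, here is a simplified sketch of its core idea: pick a minority sample, pick one of its nearest neighbours, and create a new point somewhere on the line segment between them. The function name and toy data are hypothetical, and imbalanced-learn's actual implementation differs in its details.

```python
import numpy as np

def smote_like_sample(X_minority, k=3, rng=None):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    base = X_minority[rng.integers(len(X_minority))]
    distances = np.linalg.norm(X_minority - base, axis=1)
    neighbour_idx = np.argsort(distances)[1:k + 1]  # skip position 0, the point itself
    neighbour = X_minority[rng.choice(neighbour_idx)]
    lam = rng.random()  # interpolation factor in [0, 1)
    return base + lam * (neighbour - base)

rng = np.random.default_rng(12)
X_minority = rng.normal(size=(20, 2))  # toy data, not the fraud dataset
synthetic = np.array([smote_like_sample(X_minority, rng=rng) for _ in range(5)])
print(synthetic.shape)  # (5, 2)
```

Because each synthetic point is interpolated from existing training rows, fitting SMOTE on validation or test rows would leak their information into training, which is why it is applied to the training set only.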

We’ve secured 4,000 training samples, maintaining a 50:50 split between fraud and non-fraud transactions:

Lastly, we’ll apply column transformers to numerical and categorical features separately.

Column transformers are advantageous in handling datasets with multiple data types, as they can apply different transformations to different subsets of columns while preventing data leakage.

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),('onehot', OneHotEncoder(handle_unknown='ignore'))])

numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

Understanding Optimizers

In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training. Its primary role is to minimize the model’s loss function, enhancing performance.

Different optimization algorithms (optimizers) use distinct strategies to converge efficiently toward parameters that improve predictions.

In this article, we’ll use the SGD Optimizer and Adam Optimizer.

1. How a SGD (Stochastic Gradient Descent) Optimizer Works

SGD is a widely used optimization algorithm that computes the gradient (the partial derivatives of the cost function) using a small mini-batch of examples at each update step, so the parameters are updated many times per epoch:

$$\begin{align*} w_j &:= w_j - \eta \frac {\partial J} {\partial w_j} \\ \\ b &:= b - \eta \frac {\partial J} {\partial b} \end{align*}$$

(w: weight, b: bias, J: cost function, η: learning rate)

In binary classification, the cost function (J) is the binary cross-entropy between the label y and the prediction ŷ, where ŷ is the sigmoid function σ(z) applied to z, the weighted sum of the inputs plus the bias term:

$$\begin{align*} J(y, \hat y) &= -[y \log(\hat y) + (1-y) \log(1-\hat y)] \\ \\ \hat y &= \sigma (z) = \frac {1} {1+e^{-z}} \\ \\ z &= \sum_{i=1}^m w_i x_i + b \end{align*}$$
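The update rule and cost function can be sketched in a few lines of NumPy. This is a minimal illustration on a made-up mini-batch (the helper names, data, and learning rate are assumptions); the loss is averaged over the batch, and the gradient of that average with respect to the weights reduces to X^T (ŷ − y) / n.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y):
    """Binary cross-entropy J, averaged over the mini-batch."""
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def bce_gradients(w, b, X, y):
    """Analytic gradients dJ/dw and dJ/db of the averaged cross-entropy."""
    error = sigmoid(X @ w + b) - y  # (y_hat - y)
    return X.T @ error / len(y), error.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # one toy mini-batch with 3 features
y = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=float)
w, b, eta = np.zeros(3), 0.0, 0.1

loss_before = bce_loss(w, b, X, y)
dw, db = bce_gradients(w, b, X, y)
w, b = w - eta * dw, b - eta * db  # the SGD update rule
loss_after = bce_loss(w, b, X, y)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

With zero-initialized parameters every prediction starts at 0.5, so the initial loss is log(2); a single small step along the negative gradient lowers it.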

2. How Adam (Adaptive Moment Estimation) Optimizer Works

Adam is an optimization algorithm that computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

Adam optimizer combines the advantages of RMSprop (using squared gradients to scale the learning rate) and Momentum (using past gradients to accelerate convergence):

$$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$

where:

α: The learning rate (default is 0.001)
ϵ: A small positive constant used to avoid division by zero
m̂: First moment (mean) estimate with a bias correction, leveraging Momentum:

$$\begin{align*} \hat m_t &= \frac {m_t} {1 - \beta_1^t} \\ \\ m_t &= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{ \frac {\partial L} {\partial w_t}}_{\text{gradient}} \end{align*}$$

(β1: decay rate of the first moment, typically set to 0.9)

v̂: Second moment (uncentered variance) estimate with a bias correction, leveraging RMSprop:

$$\begin{align*} \hat v_t &= \frac {v_t} {1 - \beta_2^t} \\ \\ v_t &=\beta_2 v_{t-1} + (1- \beta_2) (\frac {\partial L} {\partial w_t})^2 \end {align*}$$

(β2: decay rate of the second moment, typically set to 0.999)

Since both m and v are initialized at zero, their raw values are biased toward zero during the first steps, so Adam computes the bias-corrected estimates to compensate.
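The effect of the bias correction is easiest to see on the very first step. The sketch below applies the formulas above to a single scalar parameter (the function name and the gradient value are hypothetical): the raw moments start far too small, while the corrected ones recover the gradient and its square, so the first update has a magnitude of roughly the learning rate.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v, m_hat, v_hat

w, m, v = 1.0, 0.0, 0.0
grad = 4.0  # made-up gradient value
w, m, v, m_hat, v_hat = adam_step(w, grad, m, v, t=1)

print(f"raw m: {m:.4f}, corrected m_hat: {m_hat:.4f}")
print(f"raw v: {v:.6f}, corrected v_hat: {v_hat:.4f}")
print(f"updated w: {w:.6f}")
```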


How to Build an MLP Classifier with SGD Optimizer

Custom Classifier

This process involves a forward pass and backpropagation, during which SGD updates the weights and biases using the gradients:

for i in range(0, n_samples, self.batch_size):
    # SGD starts with randomly selected mini-batch for the epoch
    X_batch = X_shuffled[i : i + self.batch_size]
    y_batch = y_shuffled[i : i + self.batch_size]

    # A. forward pass
    activations, zs = self._forward_pass(X_batch)
    y_pred = activations[-1]  # final output of the network

    # B. backpropagation
    # 1) calculate gradients for the output layer
    delta = y_pred - y_batch
    dW = np.dot(activations[-2].T, delta) / X_batch.shape[0]
    db = np.sum(delta, axis=0) / X_batch.shape[0]

    # 2) update output layer parameters
    self.weights[-1] -= self.learning_rate * dW
    self.biases[-1] -= self.learning_rate * db

    # 3) iterate backward from last hidden layer to the input layer
    for l in range(len(self.weights) - 2, -1, -1):
        delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
        dW = np.dot(activations[l].T, delta) / X_batch.shape[0]
        db = np.sum(delta, axis=0) / X_batch.shape[0]

        self.weights[l] -= self.learning_rate * dW
        self.biases[l] -= self.learning_rate * db

During the forward pass, the network calculates the weighted sum of the inputs plus the bias (z) in each layer, applies an activation function (ReLU) in each hidden layer, and then computes the predicted output (y_pred) using a sigmoid function.

def _forward_pass(self, X):
    activations = [X]
    zs = []

    # forward through hidden layers
    for i in range(len(self.weights) - 1):
        z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
        zs.append(z)
        a = self._relu(z) # using ReLU for hidden layers
        activations.append(a)

    # forward through output layer
    z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
    zs.append(z_output)

    # computes the final output using sigmoid function
    y_pred = 1 / (1 + np.exp(-np.clip(z_output, -500, 500)))  # clip to avoid overflow in exp
    activations.append(y_pred)
    return activations, zs

So the final classifier looks like this:

from sklearn.metrics import accuracy_score

class MLP_SGD:
    def __init__(self, hidden_layer_sizes=(10,), learning_rate=0.01, n_epochs=1000, batch_size=32):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.weights = []
        self.biases = []
        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def _sigmoid_derivative(self, x):
        s = self._sigmoid(x)
        return s * (1 - s)

    def _relu(self, x):
        return np.maximum(0, x)

    def _relu_derivative(self, x):
        return (x > 0).astype(float)

    def _initialize_parameters(self, n_features):
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [1]
        self.weights = []
        self.biases = []

        for i in range(len(layer_sizes) - 1):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+1]
            limit = np.sqrt(6 / (fan_in + fan_out))
            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((1, fan_out)))

    def _forward_pass(self, X):
        activations = [X]
        zs = []

        for i in range(len(self.weights) - 1):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        return activations, zs

    def _compute_loss(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(-1, 1)
        X = np.asarray(X)
        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() for w in self.weights])
        self.biases_history.append([b.copy() for b in self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[-1])
        self.loss_history.append(initial_loss)

        for epoch in range(self.n_epochs):
            # shuffle datasets
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            # mini-batch loop
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[-1]

                delta = y_pred - y_batch
                dW = np.dot(activations[-2].T, delta) / X_batch.shape[0]
                db = np.sum(delta, axis=0) / X_batch.shape[0]
                self.weights[-1] -= self.learning_rate * dW
                self.biases[-1] -= self.learning_rate * db

                for l in range(len(self.weights) - 2, -1, -1):
                    delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
                    dW = np.dot(activations[l].T, delta) / X_batch.shape[0]
                    db = np.sum(delta, axis=0) / X_batch.shape[0]

                    self.weights[l] -= self.learning_rate * dW
                    self.biases[l] -= self.learning_rate * db

            self.weights_history.append([w.copy() for w in self.weights])
            self.biases_history.append([b.copy() for b in self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[-1])
            self.loss_history.append(epoch_loss)

            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, Loss: {epoch_loss:.4f}")
        return self

    def predict_proba(self, X):
        activations, _ = self._forward_pass(X)
        return activations[-1]

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).flatten() # for 1D output

Training / Prediction

Train the model and make a prediction using training and validation datasets:

# 1. define the model
mlp_sgd = MLP_SGD(
  hidden_layer_sizes=(30, 30, ), # 2 hidden layers with 30 neurons each
  learning_rate=0.001,           # a step size
  n_epochs=1000,                 # number of epochs
  batch_size=32                  # mini-batch size
)

# 2. train the model
mlp_sgd.fit(X_train_processed, y_train)

# 3. make a prediction with training and validation datasets
y_pred_train = mlp_sgd.predict(X_train_processed)
y_pred_val = mlp_sgd.predict(X_val_processed)

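# 4. compute evaluation metrics for each split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Returns the evaluation metrics for one dataset split."""
    return {
        'conf_matrix': confusion_matrix(y_true, y_pred),
        'acc': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, pos_label=1),
        'recall': recall_score(y_true, y_pred, pos_label=1),
        'f1': f1_score(y_true, y_pred, pos_label=1),
    }

metrics_train = evaluate(y_train, y_pred_train)
metrics_val = evaluate(y_val, y_pred_val)

print(f"\nMLP (Custom SGD) Accuracy (Train): {metrics_train['acc']:.3f}")
print(f"MLP (Custom SGD) Accuracy (Validation): {metrics_val['acc']:.3f}")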

Results

Recall: 0.7930 — 0.6650 (from training to validation)
Precision: 0.7790 — 0.6786 (from training to validation)

The model learned the patterns in the training data, achieving a Recall of 79.3% (it identified roughly 80% of the fraudulent transactions), with a drop of about 13 points, to 66.5%, on the validation set.

Loss history:

We visualized the decision boundary using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.

Leverage Scikit-learn’s MLPClassifier

We can use Scikit-learn’s MLPClassifier to define a similar model, incorporating:

Early stopping using internal validation to prevent overfitting, and
L2 regularization with a small penalty strength (alpha), along with a small optimization tolerance (tol).

from sklearn.neural_network import MLPClassifier

# define a model
model_sklearn_mlp_sgd = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    activation='relu',
    solver='sgd',
    learning_rate_init=0.001,
    learning_rate='constant',
    momentum=0.9,
    nesterovs_momentum=True,
    alpha=0.00001,           # L2 regularization strength
    max_iter=3000,           # max epochs (keep it high)
    batch_size=16,           # mini-batch size
    random_state=42,
    early_stopping=True,     # apply early stopping
    n_iter_no_change=50,     # stop the iteration if internal validation score doesn't improve for 50 epochs
    validation_fraction=0.1, # proportion of training data for internal validation (default is 0.1)
    tol=1e-4,                # tolerance for optimization
    verbose=False,
)

# training
model_sklearn_mlp_sgd.fit(X_train_processed, y_train)

# make a prediction
y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)

Results

Recall: 0.7830 - 0.6200 (from training to validation)
Precision: 0.8208 - 0.6703 (from training to validation)

The model showed strong performance during training, achieving a Recall of 78.30%. Its performance declined on the validation set.

This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.

Leverage Keras Sequential Classifier

For the sequential classifier, we can further enhance the classifier by:

Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (y_train) to address dataset imbalance and promote faster convergence,
Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,
Including Precision and Recall in the model’s compilation metrics to track classification performance during training,
Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and
Utilizing a separate validation dataset for monitoring performance during training to help detect overfitting and guide hyperparameter tuning.
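Two of these items, the initial output bias and the class weights, reduce to simple formulas. The sketch below uses hypothetical helper names and toy labels; the weight formula is the "balanced" heuristic, n_samples / (n_classes * count_per_class). On a 50:50 training set such as ours after resampling, the bias is 0 and both weights are 1.0, so these two techniques only change the training once the classes are imbalanced.

```python
import numpy as np

def initial_output_bias(y):
    """Log-odds of the positive class: log(n_pos / n_neg)."""
    y = np.asarray(y)
    return float(np.log(np.sum(y == 1) / np.sum(y == 0)))

def balanced_class_weights(y):
    """n_samples / (n_classes * count_per_class), the 'balanced' heuristic."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y_balanced = np.array([0] * 2000 + [1] * 2000)  # same 50:50 ratio as the resampled y_train
y_skewed = np.array([0] * 900 + [1] * 100)      # hypothetical 9:1 imbalance

print(initial_output_bias(y_balanced), balanced_class_weights(y_balanced))
print(initial_output_bias(y_skewed), balanced_class_weights(y_skewed))
```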

import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping
from sklearn.utils import class_weight


# calculates an initial bias for the output layer 
initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])


# defines the model
model_keras_sgd = Sequential([
    Input(shape=(X_train_processed.shape[1],)), 
    Dense(30, activation='relu'),
    Dropout(0.1), # 10% of the neurons in that layer randomly dropped out
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid', # binary classification
          bias_initializer=tf.keras.initializers.Constant(initial_bias)) # to address the imbalanced datasets
])



# compiles the model with the SGD optimizer
opt = SGD(learning_rate=0.001)
model_keras_sgd.compile(
    optimizer=opt, 
    loss='binary_crossentropy',
    metrics=[
        'accuracy', # add several metrics to return
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc') 
    ]
)


# defines early stopping to prevent overfitting
early_stopping_callback = EarlyStopping(
    monitor='val_recall',  # monitor recall 
    mode='max',         # maximize recall
    patience=50,        # stop after 50 epochs without improvement in validation recall
    min_delta=1e-4,     # minimum change to be considered an improvement (tol)
    verbose=0
)


# compute the class weight
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))


# train the model
history = model_keras_sgd.fit(
    X_train_processed, y_train,
    epochs=1000,
    batch_size=32,
    validation_data=(X_val_processed, y_val), # use our external val set
    callbacks=[early_stopping_callback], # early stopping to prevent overfitting
    class_weight=class_weights_dict, # penalize misclassification of the minority class more heavily
    verbose=0
)

# evaluate
loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=0)
print(f"\n--- Keras Model Accuracy (Train) ---")
print(f"Loss: {loss_train:.4f}")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"AUC: {auc_train:.4f}")

loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=0)
print(f"\n--- Keras Model Accuracy (Validation) ---")
print(f"Loss: {loss_val:.4f}")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"AUC: {auc_val:.4f}")

# display model summary
model_keras_sgd.summary()

Results

Recall: 0.7125 — 0.7250 (from training to validation)
Precision: 0.7607 — 0.7545 (from training to validation)

Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.

It suggests that the regularization techniques are likely effective in preventing significant overfitting.

How to Build an MLP Classifier with Adam Optimizer

Custom Classifier

The Adam updates run inside the mini-batch loop, so the weights and biases are adjusted after every batch:

# apply Adam updates for output layer parameters
# 1) weights (w)
self.m_weights[-1] = self.beta1 * self.m_weights[-1] + (1 - self.beta1) * grad_w_output
self.v_weights[-1] = self.beta2 * self.v_weights[-1] + (1 - self.beta2) * (grad_w_output ** 2)
m_w_hat = self.m_weights[-1] / (1 - self.beta1**t)
v_w_hat = self.v_weights[-1] / (1 - self.beta2**t)
self.weights[-1] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

# 2) bias (b)
self.m_biases[-1] = self.beta1 * self.m_biases[-1] + (1 - self.beta1) * grad_b_output
self.v_biases[-1] = self.beta2 * self.v_biases[-1] + (1 - self.beta2) * (grad_b_output ** 2)
m_b_hat = self.m_biases[-1] / (1 - self.beta1**t)
v_b_hat = self.v_biases[-1] / (1 - self.beta2**t)
self.biases[-1] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)

Following the same forward and backward passes, we construct the final classifier on top of the MLP_SGD architecture, adding the Adam hyperparameters beta1, beta2, and epsilon:

class MLP_Adam:
    def __init__(self, hidden_layer_sizes=(10,), learning_rate=0.001, n_epochs=1000, batch_size=32,
                 beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        self.weights = [] 
        self.biases = []

        # Adam optimizer internal states for each parameter (weights and biases)
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def _sigmoid_derivative(self, x):
        s = self._sigmoid(x)
        return s * (1 - s)

    def _relu(self, x):
        return np.maximum(0, x)

    def _relu_derivative(self, x):
        return (x > 0).astype(float)

    def _initialize_parameters(self, n_features):
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [1]

        self.weights = []
        self.biases = []
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        for i in range(len(layer_sizes) - 1):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+1]
            limit = np.sqrt(6 / (fan_in + fan_out))

            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((1, fan_out)))

            self.m_weights.append(np.zeros((fan_in, fan_out)))
            self.v_weights.append(np.zeros((fan_in, fan_out)))
            self.m_biases.append(np.zeros((1, fan_out)))
            self.v_biases.append(np.zeros((1, fan_out)))


    def _forward_pass(self, X):
        activations = [X]
        zs = []

        for i in range(len(self.weights) - 1):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        return activations, zs

    def _compute_loss(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(-1, 1)
        X = np.asarray(X)

        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() for w in self.weights])
        self.biases_history.append([b.copy() for b in self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[-1])
        self.loss_history.append(initial_loss)

        # global time step for Adam bias correction
        t = 0

        for epoch in range(self.n_epochs):
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            # Mini-batch loop
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                t += 1

                # 1. forward pass
                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[-1] # Output of the network

                # 2. backpropagation
                delta = y_pred - y_batch
                grad_w_output = np.dot(activations[-2].T, delta) / X_batch.shape[0] # Average over batch
                grad_b_output = np.sum(delta, axis=0) / X_batch.shape[0]

                # apply Adam updates to weights
                self.m_weights[-1] = self.beta1 * self.m_weights[-1] + (1 - self.beta1) * grad_w_output
                self.v_weights[-1] = self.beta2 * self.v_weights[-1] + (1 - self.beta2) * (grad_w_output ** 2)
                m_w_hat = self.m_weights[-1] / (1 - self.beta1**t)
                v_w_hat = self.v_weights[-1] / (1 - self.beta2**t)
                self.weights[-1] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                # apply Adam updates to bias
                self.m_biases[-1] = self.beta1 * self.m_biases[-1] + (1 - self.beta1) * grad_b_output
                self.v_biases[-1] = self.beta2 * self.v_biases[-1] + (1 - self.beta2) * (grad_b_output ** 2)
                m_b_hat = self.m_biases[-1] / (1 - self.beta1**t)
                v_b_hat = self.v_biases[-1] / (1 - self.beta2**t)
                self.biases[-1] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


                # Propagate gradients backward through hidden layers
                for l in range(len(self.weights) - 2, -1, -1):
                    delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
                    grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[0]
                    grad_b_hidden = np.sum(delta, axis=0) / X_batch.shape[0]

                    # apply Adam updates to weights
                    self.m_weights[l] = self.beta1 * self.m_weights[l] + (1 - self.beta1) * grad_w_hidden
                    self.v_weights[l] = self.beta2 * self.v_weights[l] + (1 - self.beta2) * (grad_w_hidden ** 2)
                    m_w_hat = self.m_weights[l] / (1 - self.beta1**t)
                    v_w_hat = self.v_weights[l] / (1 - self.beta2**t)
                    self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                    # apply Adam updates to bias
                    self.m_biases[l] = self.beta1 * self.m_biases[l] + (1 - self.beta1) * grad_b_hidden
                    self.v_biases[l] = self.beta2 * self.v_biases[l] + (1 - self.beta2) * (grad_b_hidden ** 2)
                    m_b_hat = self.m_biases[l] / (1 - self.beta1**t)
                    v_b_hat = self.v_biases[l] / (1 - self.beta2**t)
                    self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


            self.weights_history.append([w.copy() for w in self.weights])
            self.biases_history.append([b.copy() for b in self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[-1])
            self.loss_history.append(epoch_loss)

            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, Loss: {epoch_loss:.4f}")
        return self


    def predict_proba(self, X):
        activations, _ = self._forward_pass(X)
        return activations[-1]

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).flatten()
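
The per-parameter Adam update inside fit can be isolated into a small helper, which makes the bias correction easier to see. The following is a minimal sketch, not part of the class above: the function name adam_step and its default hyperparameters (beta1=0.9, beta2=0.999, eps=1e-8) are assumptions based on commonly used Adam defaults.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (param, m, v) after one Adam step at global time step t >= 1."""
    # Exponential moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: compensates for m and v starting at zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update scaled by the corrected moment estimates
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Hypothetical weights and gradients, for illustration only
w = np.array([1.0, -2.0])
g = np.array([0.5, -4.0])
w_new, m, v = adam_step(w, g, np.zeros_like(w), np.zeros_like(w), t=1)
print(w_new)
```

On the first step, the bias-corrected update has a magnitude close to the learning rate regardless of the gradient's scale, so both parameters above move by roughly 0.001 in the direction opposite to their gradient.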

Training / Prediction

Train the model and make a prediction using training and validation datasets:

mlp_adam = MLP_Adam(hidden_layer_sizes=(30, 10), learning_rate=0.001, n_epochs=500, batch_size=32)
mlp_adam.fit(X_train_processed, y_train)

y_pred_train = mlp_adam.predict(X_train_processed)
y_pred_val = mlp_adam.predict(X_val_processed)

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

print(f"\nMLP (Custom Adam) Accuracy (Train): {acc_train:.3f}")
print(f"MLP (Custom Adam) Accuracy (Validation): {acc_val:.3f}")

Results

Recall: 0.9870–0.6150 (from training to validation)
Precision: 0.9811–0.6474 (from training to validation)

While the Adam optimizer outperformed SGD, the model exhibited significant overfitting: Recall fell by about 37 points (0.9870 to 0.6150) and Precision by about 33 points (0.9811 to 0.6474) between training and validation.
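
The code above prints accuracy, while the results are reported as Recall and Precision. As a rough sketch of how those two metrics are derived from predictions, the helper below counts true positives, false positives, and false negatives with NumPy. The function name precision_recall and the toy labels are hypothetical, not taken from the original notebook.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive class)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly flagged positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # negatives flagged as positive
    fn = np.sum((y_true == 1) & (y_pred == 0))  # positives that were missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

# Hypothetical labels and predictions
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 1, 1, 0]
precision_toy, recall_toy = precision_recall(y_true_toy, y_pred_toy)
print(f"Precision: {precision_toy:.2f}, Recall: {recall_toy:.2f}")
```

Running the same helper on the training and validation predictions and comparing the two pairs of numbers is one way to quantify the overfitting gap described above.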

Loss History

We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.

Leverage Scikit-learn’s MLPClassifier

We’ve switched the optimizer from SGD to Adam, keeping all other settings constant:

model_sklearn_mlp_adam = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    activation='relu',
    solver='adam',             # update the optimizer from SGD to Adam
    learning_rate_init=0.001,
    learning_rate='constant',
    alpha=0.0001,
    max_iter=3000,
    batch_size=16,
    random_state=42,
    early_stopping=True,
    n_iter_no_change=50,
    validation_fraction=0.1,
    tol=1e-4,
    verbose=False,
)

model_sklearn_mlp_adam.fit(X_train_processed, y_train)

y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)

Results

Recall: 0.8975–0.6400 (from training to validation)
Precision: 0.8864–0.6305 (from training to validation)

Despite a performance improvement compared to the SGD optimizer, the significant drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation data indicates that the model is still overfitting.

Leverage Keras Sequential Classifier

Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:

import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from sklearn.utils import class_weight


initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])
model_keras_adam = Sequential([
    Input(shape=(X_train_processed.shape[1],)),
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid',
          bias_initializer=tf.keras.initializers.Constant(initial_bias))
])


optimizer_keras = Adam(learning_rate=0.001)
model_keras_adam.compile(
    optimizer=optimizer_keras, 
    loss='binary_crossentropy', 
    metrics=[
        'accuracy',
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc') 
    ]
)

early_stopping_callback = EarlyStopping(
    monitor='val_recall',
    mode='max',
    patience=50,
    min_delta=1e-4,
    verbose=0
)

class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

model_keras_adam.fit(
    X_train_processed, y_train,
    epochs=1000,
    batch_size=32,
    validation_data=(X_val_processed, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict,
    verbose=0
)


loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=0)
print(f"\n--- Keras Model Accuracy (Train) ---")
print(f"Loss: {loss_train:.4f}")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"AUC: {auc_train:.4f}")


loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=0)
print(f"\n--- Keras Model Accuracy (Validation) ---")
print(f"Loss: {loss_val:.4f}")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"AUC: {auc_val:.4f}")


model_keras_adam.summary()
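
Two details in the Keras setup above are worth unpacking. The output bias is initialized to log(pos/neg), so the untrained sigmoid output starts at the positive-class rate, and the 'balanced' class weights scale each class inversely to its frequency. The NumPy sketch below reproduces both calculations on a hypothetical label array y_toy; the formula n_samples / (n_classes * class_count) is the one scikit-learn documents for class_weight='balanced'.

```python
import numpy as np

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)

# Initial output bias: log-odds of the positive class
pos = np.sum(y_toy == 1)
neg = np.sum(y_toy == 0)
initial_bias_toy = np.log(pos / neg)

# Passing the bias through a sigmoid recovers the positive-class rate
initial_prob = 1.0 / (1.0 + np.exp(-initial_bias_toy))

# 'balanced' class weights: n_samples / (n_classes * count of each class)
classes, counts = np.unique(y_toy, return_counts=True)
weights = len(y_toy) / (len(classes) * counts)
class_weights_toy = {int(c): float(w) for c, w in zip(classes, weights)}

print(f"Initial predicted probability: {initial_prob:.2f}")
print(f"Class weights: {class_weights_toy}")
```

With these toy labels, the rare positive class receives a weight nine times larger than the negative class, so each misclassified positive contributes more to the loss.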

Results

Recall: 0.7995–0.7500 (from training to validation)
Precision: 0.8409–0.8065 (from training to validation)

The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).

This indicates good generalization, with only minor performance degradation on unseen data.

Final Results: Generalization

Finally, we’ll evaluate the model’s ultimate performance on the test dataset, which has remained completely separate from all prior training and validation processes.

# Custom classifiers
y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

# MLPClassifier
y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

# Keras Sequential (evaluated on the held-out test set)
_, accuracy_test_sgd, precision_test_sgd, recall_test_sgd, auc_test_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=0)
_, accuracy_test_adam, precision_test_adam, recall_test_adam, auc_test_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=0)

Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an AUPRC (Area Under Precision-Recall Curve) of 0.72.
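
The evaluation code above reports Keras's AUC metric, which is the area under the ROC curve unless curve='PR' is passed, so an AUPRC figure has to be computed separately. One possible way, shown here as a sketch with hypothetical labels and scores, is scikit-learn's average_precision_score, which summarizes the precision-recall curve from predicted probabilities (for example, the output of predict_proba or model.predict).

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical true labels and predicted positive-class probabilities
y_true_toy = np.array([0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8])

# Average precision: a step-wise summary of the precision-recall curve
auprc_toy = average_precision_score(y_true_toy, y_score_toy)
print(f"AUPRC (average precision): {auprc_toy:.4f}")
```

Average precision is a common estimate of AUPRC, though it is not identical to a trapezoidal integration of the curve, so values from different tools may differ slightly.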

Conclusion

In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.

Our findings underscore that effective machine learning hinges on three critical factors:

robust data preprocessing (tailored to objectives and data distribution),
judicious model selection, and
strategic framework or library choices.

Choosing the right framework

Generally speaking, choose MLPClassifier when:

You’re primarily working with tabular data,
You want to prioritize simplicity, quick iteration, and seamless integration,
You have simple, shallow architectures, and
You have a moderate dataset size (manageable on a CPU).

Choose Keras Sequential when:

You’re dealing with image, text, audio, or other sequential data,
You’re building deep learning models such as CNNs, RNNs, LSTMs,
You need fine-grained control over the model architecture, training process, or custom components,
You need to leverage GPU acceleration,
You’re planning for production deployment, and
You want to experiment with more advanced deep learning techniques.

Limitations of MLPs

While Multilayer Perceptrons (MLPs) proved valuable, their computational cost and susceptibility to overfitting emerged as key challenges.

Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.

You can find more info about me on my Portfolio / LinkedIn / Github.

Learn Python for Data Science – Full Course for Beginners

Beau Carnes — Thu, 29 May 2025 18:37:23 +0000

If you're interested in data science but not sure where to begin, Python is a great starting point. It’s easy to pick up and has a bunch of libraries that make working with data a lot easier.

We just published a course on the freeCodeCamp.org YouTube channel that teaches you how to do data science using Python. Frank Andrade developed this course.

It starts with installation and setup, then covers Python fundamentals so you’re not lost if you’ve never coded before. From there, it gets into two of the most commonly used libraries in data science: Pandas and NumPy. Pandas helps you work with tables of data (think spreadsheets, but in Python), and NumPy is great for doing math on that data.

You’ll get to apply what you're learning right away with hands-on projects. The first one shows you how to scrape data from websites using Pandas. Then you’ll learn how to filter and clean that data, reshape it, and create pivot tables. There's also a project where you’ll build charts and graphs so you can actually visualize what the data is telling you. You’ll use real datasets and build things like bar charts and scatter plots to explore trends and patterns.

Once you're comfortable with those basics, the course introduces more useful techniques like using groupby and aggregate functions, combining different data sets, and using regular expressions to pull out specific patterns from text. These are skills you’ll need for any real data job, or even if you’re just trying to make sense of a big messy spreadsheet.

Later in the course, you'll start working with machine learning. It’s not super advanced, but it gives you a solid first look at how it works. You’ll use scikit-learn to build a simple text classification model. Basically, you’ll train a program to read some text and decide what category it belongs to. Think spam vs. not spam, or positive vs. negative reviews.

If you're new to data science and want to actually try things instead of just reading about them, this course is a solid pick. Everything is broken into small, manageable sections, and the projects help the ideas stick. It’s free, it’s on YouTube, and you can follow along at your own pace.

Are you ready to learn Data Science with Python? Watch the full course on the freeCodeCamp.org YouTube channel (17-hour watch):

How to Extract YouTube Analytics Data and Analyze in Python

Adejumo Ridwan Suleiman — Wed, 26 Mar 2025 16:05:29 +0000

If you’re a YouTube content creator, making data-driven decisions about what you post helps you target the right audience with your videos.

YouTube Studio provides YouTube Analytics, where you can get comprehensive data about your channel. But there is a caveat: most of the statistics provided by YouTube Analytics are descriptive and not predictive. Information like future views, future subscriber counts, and the factors influencing watch time or earnings is unavailable, so you’ll need to calculate these metrics yourself.

In this article, you’ll learn how to export data from YouTube Analytics to Python so you can analyze it further or create visualizations. You can even build your own custom dashboard using various Python libraries like Streamlit, Shiny, or Dash.

Here’s what we’ll cover:

Prerequisites
Step 1: Identify the Problem Statement
Step 2: Extract the Data
Step 3: Analyze the Data in Python
- Correlation Analysis
- Audience Retention Analysis
Conclusion

Prerequisites

Active YouTube and YouTube Studio Account
Jupyter Notebook, Google Colab, Kaggle, or any other environment that supports Python
Pandas library installed
Seaborn library installed
Matplotlib library installed

Step 1: Identify the Problem Statement

Before proceeding, we need to know what we’re looking for – because YouTube Analytics has many metrics, and this can get overwhelming. My channel doesn’t have a ton of subscribers, but I have quite a few videos and views. So we’ll use my data as an example.

Just note that this analysis I’ll conduct in this tutorial is specific to my channel and can vary from channel to channel. You’ll be able to use the techniques here to answer the same/similar questions using your data, but your results will be different from mine.

Here are the questions I would like to find an answer for:

Correlation Analysis

Views and watch time – Are longer watch times associated with higher views?
Views and subscribers – Do more views translate to more subscribers?
Impressions and Click-Through Rate (CTR%) – Do more impressions lead to better engagement?
Watch time and average view duration – Are longer videos watched more?

Audience Retention Analysis

Average view duration vs. Video length – Are longer videos watched in full?
Drop-off points – Which duration range has the best retention?
Retention Rate (%) – What share of each video is watched on average (average view duration divided by duration)?

Step 2: Extract the Data

This will open a dashboard showing comprehensive descriptive analytics of your YouTube channel. This can get overwhelming, as there are a lot of metrics and filters with various types of data. This is why I emphasized the importance of knowing your problem and identifying your questions before diving in.

You can select the range of data you are interested in using the date dropdown (1 in the image below) and the Compare to button (2) to compare data from different date ranges.

The column headers you see in the dashboard are the filters. Each contains different metrics, and you can find some metrics in one or more filters. You can play around with the tabs and dropdowns to understand them better.

This is just a foundation for understanding your YouTube channel performance. If you have a long-running channel with a large number of subscribers and views, trust me – you can get a lot of insights from your data.

For this tutorial, I will select my entire lifetime data (1) and click the download button at the top right-hand corner (2).

This will display two options: whether to open the data in Google Sheets in a new tab or download the CSV file.

Since we want to use the data in Python, select the option to download the CSV file. After downloading the file, extract the files from the zip folder, and inside the extracted folder, you will see three CSV files: Chart data.csv, Table data.csv, and Totals.csv.

For this tutorial, we are interested in the Table data.csv. Click the data to open and view it in Excel to do some manual data cleaning before importing the data in Python.

The data is a list of all the videos on my YouTube channel, which is forty (yours might have more or fewer). Remove the first row, which is the Total row, and save the changes.

Here are the columns in the dataset:

Content: The video id
Video title: The video title
Video publish time: The day the video was published
Duration: The video duration in seconds
Views: The number of views per video
Watch time: The estimated amount of video watch time by your audience in hours
Subscribers: Change in total subscribers found by subtracting subscribers lost from subscribers gained for the selected date and region.
Average view duration: Estimated average minutes watched per video.
Impressions: Number of times your videos were shown to viewers.
Impressions click-through rate (%): Percentage of impressions that resulted in a viewer clicking through to your video.

Step 3: Analyze the Data in Python

Go to your Jupyter Notebook and import the Pandas, Seaborn, and Matplotlib libraries.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Next, import the Table data.csv file.

# Load data
df = pd.read_csv("/content/Table data.csv")
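
If you'd rather skip the manual Excel step, the Total row can also be dropped in pandas. This is a sketch under the assumption that the export marks that row with the value Total in the Content column; the inline CSV below is hypothetical sample data standing in for Table data.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for "Table data.csv", including the leading Total row
sample_csv = io.StringIO(
    "Content,Video title,Duration,Views\n"
    "Total,,,1500\n"
    "abc123,Intro video,300,1000\n"
    "def456,Deep dive,1800,500\n"
)

raw = pd.read_csv(sample_csv)

# Keep only per-video rows (assumes the summary row is labeled "Total")
videos = raw[raw["Content"] != "Total"].reset_index(drop=True)
print(videos)
```

Filtering on the label rather than dropping the first row by position means the code does nothing harmful if the Total row has already been removed.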

Correlation Analysis

Concerning our problem statement, we are going to plot a correlation heatmap between the following variables: Views, Watch time (hours), Subscribers, Average view duration, Impressions, and Impressions click-through rate (%) to see the strength and direction of the relationship between them.

# Convert "Average view duration" (formatted as H:M:S) to seconds
df['Average view duration'] = pd.to_timedelta(df['Average view duration']).dt.total_seconds()

# Select relevant columns for correlation analysis
correlation_data = df[['Views', 'Watch time (hours)', 'Subscribers', 'Average view duration', 'Impressions', 'Impressions click-through rate (%)']]

# Compute correlation matrix
corr_matrix = correlation_data.corr()

# Visualization using a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("YouTube Analytics Correlation Heatmap")
plt.show()

The correlation coefficient ranges from -1 to 1, where values below 0 indicate a negative relationship and values above 0 indicate a positive relationship. The closer the value is to -1, the stronger the negative relationship, and the closer it is to 1, the stronger the positive relationship. Values near 0 indicate little or no linear relationship.
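
To make the interpretation concrete, here is a small sketch with made-up numbers (the column values below are hypothetical, not from the channel data). DataFrame.corr() computes the Pearson coefficient by default, so a perfectly increasing linear pair gives 1 and a perfectly decreasing one gives -1.

```python
import pandas as pd

# Hypothetical metrics: watch time rises with views, falling_metric drops
toy_metrics = pd.DataFrame({
    "Views": [100, 200, 300, 400, 500],
    "Watch time (hours)": [2.0, 4.0, 6.0, 8.0, 10.0],
    "falling_metric": [50, 40, 30, 20, 10],
})

# Pearson correlation matrix (the default method of DataFrame.corr)
toy_corr = toy_metrics.corr()
print(toy_corr.round(2))
```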

Based on the plot above, here are the key insights:

Views and watch time: There's a strong correlation (0.94) between views and watch time, suggesting that as videos get more views, they also accumulate more watch hours, proportionally.
Views and impressions: There's a strong correlation (0.89) between views and impressions, indicating that videos that are shown more frequently in recommendations and search results tend to get more views.
Average view duration: This metric has very weak correlations with almost all other metrics, most notably with views (0.06), subscribers (0.01), and impressions (0.03).
Subscribers and metrics: Subscribers have a moderate to strong correlation with views (0.75) and impressions (0.79) and a weaker correlation with click-through rate (0.54).
Click-through rate: Has moderate correlations with views (0.69) and watch time (0.66) but a weaker correlation with subscribers (0.54).

The most significant insight is that average view duration appears to operate independently from other metrics. This suggests that on my YouTube channel, a video's ability to retain viewers throughout its length isn't necessarily connected to how many people watch it, how often it's recommended, or how many subscribers the channel has.

This implies that the strategies I would implement to increase my views, subscribers, and impressions might differ from those needed to improve average view duration, an important factor in YouTube's recommendation algorithm. This means I need to look at other YouTube metrics that have a relationship with average view duration, which is a topic for another article.

Audience Retention Analysis

To analyze audience retention, we need to create a new variable Retention Rate (%), which is calculated by dividing a video’s Average view duration by the Duration and expressing it as a percentage.


# Calculate retention rate as (Average View Duration / Total Video Duration) * 100
df['Retention Rate (%)'] = (df['Average view duration'] / df['Duration']) * 100
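
As a quick worked example with hypothetical values: a 10-minute video (600 seconds) with an average view duration of 0:02:30 has a retention rate of 150 / 600 = 25%. The sketch below runs the same two steps, the timedelta conversion and the ratio, on a tiny made-up DataFrame.

```python
import pandas as pd

# Hypothetical videos: Duration in seconds, average view duration as H:MM:SS
toy_videos = pd.DataFrame({
    "Video title": ["Short clip", "Long tutorial"],
    "Duration": [600, 3000],
    "Average view duration": ["0:02:30", "0:05:00"],
})

# Convert H:MM:SS strings to seconds
toy_videos["Average view duration"] = pd.to_timedelta(
    toy_videos["Average view duration"]
).dt.total_seconds()

# Retention rate = average seconds watched / total seconds, as a percentage
toy_videos["Retention Rate (%)"] = (
    toy_videos["Average view duration"] / toy_videos["Duration"] * 100
)
print(toy_videos)
```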

Next, sort the videos in descending order of Retention Rate (%) and display the 10 videos with the highest retention rate.

# Sort videos by retention rate
df_sorted = df.sort_values(by='Retention Rate (%)', ascending=False)

# Display top 10 videos with highest retention
df_sorted[['Video title', 'Duration', 'Average view duration', 'Retention Rate (%)']].head(10)

From the table above, you will notice that most of the videos in the top 10 are no longer than 503 seconds, which is approximately 8 minutes. This implies that my audience is interested in short to mid-length videos.

Most videos with a high retention rate have a duration of less than 4 minutes, with retention rates ranging from 27% to 40%. With this insight, I can aim to keep the next videos I upload within 5 to 8 minutes.

Let’s take a look at the bottom 10 videos with a low retention rate:

# Sort videos by retention rate
df_sorted = df.sort_values(by='Retention Rate (%)', ascending=False)

# Display bottom 10 videos with lowest retention
df_sorted[['Video title', 'Duration', 'Average view duration', 'Retention Rate (%)']].tail(10)

From the above information, you will notice that long videos on my channel, spanning approximately 22 to 58 minutes, have a low retention rate. This further supports the claim above that my audience is more interested in shorter videos.

We can also create a scatter plot of Duration against Retention Rate (%) to summarize the above tables.

# Set style for plots
sns.set_style("whitegrid")

# Plot Retention Rate vs. Video Duration
plt.figure(figsize=(12, 6))

sns.scatterplot(data=df, x='Duration', y='Retention Rate (%)', hue='Views', size='Views', sizes=(20, 200), palette='coolwarm')
plt.title("Audience Retention vs. Video Duration")
plt.xlabel("Video Duration (seconds)")
plt.ylabel("Retention Rate (%)")
plt.legend(title="Views", loc="upper right")

plt.show()

The scatter plot above shows the relationship between audience retention rate (y-axis, measured as a percentage) and video duration (x-axis, measured in seconds) for various videos. Here are the key observations:

There's a clear negative correlation between video duration and retention rate – as videos get longer, the retention rate generally decreases.
The highest retention rates (35-40%) are found in shorter videos, mostly under 500 seconds (around 8 minutes).
Videos over 1500 seconds (25 minutes) consistently show retention rates below 15%.
The size and color of the dots represent the number of views, with larger, redder dots indicating more views (up to 1000) and smaller, blue dots representing fewer views (around 200).
Interestingly, some mid-length videos (around 500 seconds) have both higher view counts (indicated by larger red dots) and decent retention rates of about 25%.
The longest video in the dataset (at around 3500 seconds or 58 minutes) has a retention rate of about 14% and relatively few views.

This plot further confirms the claim that shorter videos tend to better maintain audience attention on my channel, though some mid-length videos can still perform well in terms of both retention and view count.
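
One way to answer the drop-off question from the problem statement (which duration range retains best) is to bucket videos by length and average the retention rate per bucket. The sketch below uses pd.cut on hypothetical data; the bin edges and labels are assumptions you would tune for your own channel.

```python
import pandas as pd

# Hypothetical durations (seconds) and retention rates (%)
toy_retention = pd.DataFrame({
    "Duration": [120, 240, 450, 500, 900, 1600, 3500],
    "Retention Rate (%)": [40.0, 36.0, 30.0, 26.0, 20.0, 12.0, 14.0],
})

# Assumed duration ranges; each interval includes its right edge
bins = [0, 300, 600, 1500, float("inf")]
labels = ["0-5 min", "5-10 min", "10-25 min", "25+ min"]
toy_retention["Duration bucket"] = pd.cut(
    toy_retention["Duration"], bins=bins, labels=labels
)

# Mean retention rate per duration bucket
retention_by_bucket = (
    toy_retention.groupby("Duration bucket", observed=True)["Retention Rate (%)"].mean()
)
print(retention_by_bucket)
print("Best bucket:", retention_by_bucket.idxmax())
```

Applied to the real df, the same grouping would turn the visual impression from the scatter plot into one average retention figure per duration range.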

Conclusion

What we’ve learned from my data is just the tip of the iceberg. YouTube has many metrics, and because my channel is not monetized and has few subscribers and videos, I don’t have data on monetization, demographics, and other metrics.

But after reading this article, I hope you can think of many more questions you could answer with these metrics. You can even forecast your views, subscriber counts, and revenue for the coming days or months. You can also perform a multivariate time series analysis to see how these factors affect your primary variable of interest.

If you find this article interesting, don’t forget to check out my blog for other interesting articles, follow me on Medium, connect on LinkedIn, and subscribe to my YouTube channel.
