<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Data Science - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Data Science - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 21 May 2026 10:20:56 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/data-science/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Clean Time Series Data in Python ]]>
                </title>
                <description>
                    <![CDATA[ Real-world time series data is rarely clean. Sensors drop out, systems clock-drift, pipelines duplicate records, and manual data entry introduces mistakes. By the time a dataset reaches your notebook, ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-clean-time-series-data-in-python/</link>
                <guid isPermaLink="false">6a0ad57ee4a28cf570ec90ac</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Bala Priya C ]]>
                </dc:creator>
                <pubDate>Mon, 18 May 2026 09:01:50 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5fc16e412cae9c5b190b6cdd/bf717910-4e75-44c5-8ea1-fd55eb574100.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Real-world time series data is rarely clean. Sensors drop out, systems clock-drift, pipelines duplicate records, and manual data entry introduces mistakes. By the time a dataset reaches your notebook, it has passed through collection, transmission, and storage, each step a potential source of corruption.</p>
<p>Cleaning time series data is harder than cleaning tabular data because time is a structural constraint. You can't shuffle rows or impute a missing value with a column mean without pulling future data into a past observation. Every cleaning decision has to respect temporal ordering, or it breaks the integrity of everything built on top of it.</p>
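<p>To make that leakage concrete, here is a minimal sketch on a made-up six-point series (the values and variable names are illustrative assumptions, not part of the dataset used later). It compares a global-mean fill, which uses readings recorded after the gap, with a fill that only uses the past:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Hypothetical hourly readings with one gap at 02:00
toy = pd.Series(
    [10.0, 12.0, np.nan, 20.0, 30.0, 40.0],
    index=pd.date_range("2024-06-01", periods=6, freq="h"),
)

# The global mean uses every observation, including those after the gap
global_fill = toy.fillna(toy.mean())

# The expanding mean at each step only sees values up to that timestamp
past_only_mean = toy.expanding().mean()
causal_fill = toy.fillna(past_only_mean)

gap_ts = toy.index[2]
print(f"Global-mean fill: {global_fill[gap_ts]:.1f}")
print(f"Past-only fill:   {causal_fill[gap_ts]:.1f}")
</code></pre>
<p>The global-mean fill comes out at 22.4 because it is pulled up by readings that hadn't happened yet at 02:00, while the past-only fill is 11.0, a value a live system could actually have computed at that moment.</p>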
<p>This guide walks through the full cleaning pipeline in Python: from raw data arrival to a dataset ready for feature engineering or modelling. We'll cover missing value detection and imputation, outlier identification and treatment, duplicate handling, frequency alignment, noise smoothing, and schema validation, applied to sample sensor data throughout.</p>
<p><a href="https://github.com/balapriyac/data-science-tutorials/blob/main/time-series-data-cleaning/time_series_data_cleaning.ipynb">You can get the Colab notebook from GitHub and follow along</a>.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along with this guide, you should be:</p>
<ul>
<li><p>Comfortable working with Python and pandas DataFrames</p>
</li>
<li><p>Familiar with time-indexed data</p>
</li>
<li><p>Aware of what feature engineering and machine learning modelling involve at a high level</p>
</li>
</ul>
<p>We'll use <code>pandas</code> and <code>numpy</code> for data manipulation, <code>scipy</code> for signal smoothing and statistical tests, <code>scikit-learn</code> for anomaly detection, and <code>statsmodels</code> for seasonal decomposition. Install them before running any code in this guide:</p>
<pre><code class="language-bash">pip install pandas numpy scipy scikit-learn statsmodels
</code></pre>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-how-to-audit-your-time-series-before-cleaning-it">How to Audit Your Time Series Before Cleaning It</a></p>
</li>
<li><p><a href="#heading-how-to-reindex-to-a-canonical-frequency">How to Reindex to a Canonical Frequency</a></p>
</li>
<li><p><a href="#heading-how-to-handle-missing-values">How to Handle Missing Values</a></p>
<ul>
<li><p><a href="#heading-forward-fill-for-step-function-signals">Forward Fill — For Step-Function Signals</a></p>
</li>
<li><p><a href="#heading-time-weighted-interpolation-for-continuous-signals">Time-Weighted Interpolation — For Continuous Signals</a></p>
</li>
<li><p><a href="#heading-seasonal-decomposition-imputation-for-long-gaps">Seasonal Decomposition Imputation — For Long Gaps</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-detect-and-handle-outliers">How to Detect and Handle Outliers</a></p>
<ul>
<li><p><a href="#heading-z-score-with-rolling-window">Z-Score with Rolling Window</a></p>
</li>
<li><p><a href="#heading-iqr-based-outlier-detection">IQR-Based Outlier Detection</a></p>
</li>
<li><p><a href="#heading-isolation-forest-for-multivariate-outlier-detection">Isolation Forest — For Multivariate Outlier Detection</a></p>
</li>
<li><p><a href="#heading-outlier-treatment">Outlier Treatment</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-how-to-remove-duplicates">How to Remove Duplicates</a></p>
</li>
<li><p><a href="#heading-frequency-alignment-and-resampling">Frequency Alignment and Resampling</a></p>
</li>
<li><p><a href="#heading-smoothing-noise">Smoothing Noise</a></p>
<ul>
<li><p><a href="#heading-exponential-weighted-moving-average">Exponential Weighted Moving Average</a></p>
</li>
<li><p><a href="#heading-savitzky-golay-filter">Savitzky-Golay Filter</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-schema-and-sanity-validation">Schema and Sanity Validation</a></p>
</li>
<li><p><a href="#heading-the-complete-cleaning-checklist">The Complete Cleaning Checklist</a></p>
</li>
</ul>
<h2 id="heading-how-to-audit-your-time-series-before-cleaning-it">How to Audit Your Time Series Before Cleaning It</h2>
<p>The first rule of data cleaning is: look before you cut. Before imputing, smoothing, or dropping anything, you need a complete picture of what's wrong and where.</p>
<p>A good audit covers the following:</p>
<ul>
<li><p>The time index: Is it regular? Are there gaps?</p>
</li>
<li><p>Missing value distribution: Are missing values random or clustered?</p>
</li>
<li><p>Value range: Are there obvious spikes, dips, or sensor failures?</p>
</li>
<li><p>Duplicate timestamps</p>
</li>
</ul>
<p>Let's spin up a sample dataset (with some of the above problems):</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Simulate one week of smart grid voltage readings (hourly)
# with realistic problems injected
periods = 168
index = pd.date_range("2024-06-01", periods=periods, freq="h")

voltage = (
    230.0
    + 3.5 * np.sin(2 * np.pi * np.arange(periods) / 24)
    + np.random.normal(0, 1.2, periods)
)

# Inject problems
voltage[14:17] = np.nan          # sensor dropout: 3 consecutive missing
voltage[42] = np.nan             # isolated missing
voltage[78] = 312.4              # spike outlier
voltage[101:104] = np.nan        # another dropout
voltage[130] = 187.2             # dip outlier

series = pd.Series(voltage, index=index, name="voltage_v")

# --- Audit ---
print("=== TIME SERIES AUDIT ===")
print(f"Period:        {series.index.min()} → {series.index.max()}")
print(f"Observations:  {len(series)}")
print(f"Expected freq: {pd.infer_freq(series.index)}")
print(f"\nMissing values: {series.isna().sum()} ({series.isna().mean()*100:.1f}%)")
print(f"Value range:    [{series.min():.2f}, {series.max():.2f}]")
print(f"Mean ± Std:     {series.mean():.2f} ± {series.std():.2f}")

# Identify consecutive missing runs
missing_mask = series.isna()
missing_runs = []
run_start = None
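for i, (ts, is_missing) in enumerate(missing_mask.items()):
    if is_missing and run_start is None:
        run_start = ts
    elif not is_missing and run_start is not None:
        missing_runs.append((run_start, missing_mask.index[i - 1]))
        run_start = None

# Close a run that extends to the very end of the series
if run_start is not None:
    missing_runs.append((run_start, missing_mask.index[-1]))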

print(f"\nMissing runs ({len(missing_runs)} total):")
for start, end in missing_runs:
    print(f"  {start} → {end}")
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">=== TIME SERIES AUDIT ===
Period:        2024-06-01 00:00:00 → 2024-06-07 23:00:00
Observations:  168
Expected freq: h

Missing values: 7 (4.2%)
Value range:    [187.20, 312.40]
Mean ± Std:     230.22 ± 7.81

Missing runs (3 total):
  2024-06-01 14:00:00 → 2024-06-01 16:00:00
  2024-06-02 18:00:00 → 2024-06-02 18:00:00
  2024-06-05 05:00:00 → 2024-06-05 07:00:00
</code></pre>
<p>This audit gives you a map of the damage before you start cleaning. The key task is distinguishing between <strong>isolated missing values</strong>, which are imputable with local context, and <strong>long missing runs</strong>, which may need a different strategy or flagging for downstream consumers.</p>
<h2 id="heading-how-to-reindex-to-a-canonical-frequency">How to Reindex to a Canonical Frequency</h2>
<p>Before imputing missing values, you need to confirm your time index is actually <em>regular</em>. A common problem in ingested time series is that missing timestamps are simply absent rather than represented as <code>NaN</code> rows — which means a <code>.fillna()</code> call will never find them.</p>
<pre><code class="language-python"># Simulate a sensor feed with missing timestamps (not just missing values)
irregular_index = index.delete([14, 15, 16, 42, 101, 102, 103])
irregular_series = series.dropna().reindex(irregular_index)

print(f"Original length:   {len(series)}")
print(f"Irregular length:  {len(irregular_series)}")
print(f"Inferred freq:     {pd.infer_freq(irregular_series.index)}")  # None = irregular

# Reindex to the full canonical hourly grid
canonical_index = pd.date_range(
    start=irregular_series.index.min(),
    end=irregular_series.index.max(),
    freq="h"
)

reindexed = irregular_series.reindex(canonical_index)

print(f"\nAfter reindex:")
print(f"Length:         {len(reindexed)}")
print(f"Missing values: {reindexed.isna().sum()}")
print(f"Inferred freq:  {pd.infer_freq(reindexed.index)}")
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Original length:   168
Irregular length:  161
Inferred freq:     None

After reindex:
Length:         168
Missing values: 7
Inferred freq:  h
</code></pre>
<p><code>pd.infer_freq</code> returning <code>None</code> is your signal that the index has gaps. After reindexing to the canonical grid, missing timestamps become explicit <code>NaN</code> rows, and now your imputation logic can find them.</p>
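<p>If you want to see exactly which timestamps are absent before reindexing, one option is an index set difference. This is a minimal sketch on a hypothetical eight-hour feed with three rows dropped (the names <code>observed</code>, <code>expected</code>, and <code>absent</code> are illustrative):</p>
<pre><code class="language-python">import pandas as pd

full = pd.date_range("2024-06-01", periods=8, freq="h")
observed = full.delete([2, 3, 6])  # hypothetical feed with three absent rows

# Build the grid the feed should have had, then subtract what arrived
expected = pd.date_range(observed.min(), observed.max(), freq="h")
absent = expected.difference(observed)

print(f"Absent timestamps: {len(absent)}")
print(absent)
</code></pre>
<p>Logging the absent timestamps alongside the reindexed series keeps a record of which rows were never delivered, as opposed to rows that arrived with a <code>NaN</code> value.</p>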
<h2 id="heading-how-to-handle-missing-values">How to Handle Missing Values</h2>
<p>Not all missing values should be handled the same way. A single isolated missing reading in a smooth signal is best filled with interpolation. A 3-hour sensor dropout in a volatile signal, however, might be better flagged than fabricated. Strategy should match both gap length and signal behavior.</p>
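<p>One way to encode "fill short gaps, flag long ones" is to measure the length of each missing run and only keep interpolated values where the run is short enough. The sketch below assumes a cut-off of three steps and uses a made-up series; both are placeholders to adapt to your signal:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Hypothetical signal: one isolated gap and one 4-step dropout
s = pd.Series(
    [1.0, np.nan, 3.0, 4.0, np.nan, np.nan, np.nan, np.nan, 9.0, 10.0],
    index=pd.date_range("2024-06-01", periods=10, freq="h"),
)

max_gap = 3  # assumed cut-off in steps; tune it per signal

is_missing = s.isna()
# A new run id starts whenever the missing/present state flips
run_id = is_missing.ne(is_missing.shift()).cumsum()
# Length of the missing run each row belongs to (0 for observed rows)
run_len = is_missing.astype(int).groupby(run_id).transform("sum")

too_long = run_len.gt(max_gap)

# Interpolate everything, then blank out the long runs again
filled = s.interpolate(method="time").mask(too_long)

result = pd.DataFrame({"value": filled, "long_gap": too_long})
print(result)
</code></pre>
<p>Here the isolated gap at 01:00 is filled with 2.0, while the four-step dropout stays <code>NaN</code> and carries a <code>long_gap</code> flag that downstream consumers can act on.</p>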
<h3 id="heading-forward-fill-for-step-function-signals">Forward Fill — For Step-Function Signals</h3>
<p>Forward fill is appropriate when the variable holds its last known value until something changes it — a machine state, a setpoint, a categorical flag.</p>
<pre><code class="language-python"># Equipment operating mode — a step signal
mode_data = pd.Series(
    ["running", "running", np.nan, np.nan, "idle", "idle", np.nan, "running"],
    index=pd.date_range("2024-06-01", periods=8, freq="h"),
    name="operating_mode"
)

filled_mode = mode_data.ffill()
print(pd.DataFrame({"original": mode_data, "ffill": filled_mode}))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">                    original    ffill
2024-06-01 00:00:00  running  running
2024-06-01 01:00:00  running  running
2024-06-01 02:00:00      NaN  running
2024-06-01 03:00:00      NaN  running
2024-06-01 04:00:00     idle     idle
2024-06-01 05:00:00     idle     idle
2024-06-01 06:00:00      NaN     idle
2024-06-01 07:00:00  running  running
</code></pre>
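<p>An unbounded forward fill will keep repeating the last state for as long as the gap lasts. If that is a concern for your signal, <code>ffill</code> accepts a <code>limit</code> argument. A small sketch, assuming that a state older than two hours should no longer be trusted (the two-hour limit is an illustrative choice):</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Hypothetical operating mode with a four-hour dropout
mode = pd.Series(
    ["running", np.nan, np.nan, np.nan, np.nan, "idle"],
    index=pd.date_range("2024-06-01", periods=6, freq="h"),
    name="operating_mode",
)

# Carry the last known state forward for at most two steps
bounded = mode.ffill(limit=2)
print(pd.DataFrame({"original": mode, "ffill_limit_2": bounded}))
</code></pre>
<p>The first two missing hours inherit <code>running</code>, and the remaining two stay missing so they can be flagged instead of silently assumed.</p>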
<h3 id="heading-time-weighted-interpolation-for-continuous-signals">Time-Weighted Interpolation — For Continuous Signals</h3>
<p>For continuous sensor readings, linear interpolation weighted by time handles irregular gaps correctly because it doesn't assume equal spacing.</p>
<pre><code class="language-python"># Fill the voltage series using time-based interpolation
voltage_clean = reindexed.interpolate(method="time")

# Compare original vs filled around the first gap
gap_window = voltage_clean["2024-06-01 12:00":"2024-06-01 18:00"]
original_window = reindexed["2024-06-01 12:00":"2024-06-01 18:00"]

comparison = pd.DataFrame({
    "original":     original_window,
    "interpolated": gap_window.round(3),
    "was_missing":  original_window.isna(),
})
print(comparison)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">                       original  interpolated  was_missing
2024-06-01 12:00:00  230.290355       230.290        False
2024-06-01 13:00:00  226.798197       226.798        False
2024-06-01 14:00:00         NaN       226.848         True
2024-06-01 15:00:00         NaN       226.897         True
2024-06-01 16:00:00         NaN       226.947         True
2024-06-01 17:00:00  226.996356       226.996        False
2024-06-01 18:00:00  225.410371       225.410        False
</code></pre>
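<p>On a regular hourly grid, <code>method="time"</code> and <code>method="linear"</code> give the same answer. The difference only shows up when spacing is uneven, as in this three-point sketch with made-up values:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Hypothetical readings: the gap sits 1 hour after the first point
# and 9 hours before the last one
irregular = pd.Series(
    [10.0, np.nan, 20.0],
    index=pd.to_datetime([
        "2024-06-01 00:00",
        "2024-06-01 01:00",
        "2024-06-01 10:00",
    ]),
)

by_position = irregular.interpolate(method="linear")  # ignores the timestamps
by_time = irregular.interpolate(method="time")        # weights by elapsed time

print(f"linear: {by_position.iloc[1]:.1f}")
print(f"time:   {by_time.iloc[1]:.1f}")
</code></pre>
<p>Positional interpolation returns the midpoint, 15.0, while time-based interpolation returns 11.0 because the missing reading is only one tenth of the way between its neighbours.</p>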
<h3 id="heading-seasonal-decomposition-imputation-for-long-gaps">Seasonal Decomposition Imputation — For Long Gaps</h3>
<p>For gaps longer than a few steps in a seasonal signal, interpolating across the gap ignores the seasonal pattern. A better approach is to decompose the series, impute each component separately, then reconstruct.</p>
<pre><code class="language-python">from statsmodels.tsa.seasonal import seasonal_decompose

# Use a longer series for decomposition (needs enough periods)
long_voltage = pd.Series(
    230.0
    + 3.5 * np.sin(2 * np.pi * np.arange(336) / 24)
    + np.random.normal(0, 1.0, 336),
    index=pd.date_range("2024-06-01", periods=336, freq="h")
)

# Inject a 6-hour gap
long_voltage.iloc[100:106] = np.nan

# Interpolate first to give decompose a complete series to work with
temp_filled = long_voltage.interpolate(method="time")
decomp = seasonal_decompose(temp_filled, model="additive", period=24)

# Reconstruct: trend + seasonal + zero residual for missing positions.
# Fill the trend's edge NaNs on the full series before selecting the gap.
reconstructed = long_voltage.copy()
missing_idx = long_voltage[long_voltage.isna()].index
trend_filled = decomp.trend.ffill().bfill()
reconstructed[missing_idx] = (
    trend_filled[missing_idx]
    + decomp.seasonal[missing_idx]
)

print(f"Missing before: {long_voltage.isna().sum()}")
print(f"Missing after:  {reconstructed.isna().sum()}")
print("\nFilled values at gap:")
print(reconstructed[missing_idx].round(3))
</code></pre>
<p>The seasonal decomposition imputation respects the time-of-day pattern. Each filled value is the sum of the trend and the seasonal component at that hour, so the values printed for the six-hour gap follow the expected daily curve instead of a straight line between the gap's endpoints.</p>
<h2 id="heading-how-to-detect-and-handle-outliers">How to Detect and Handle Outliers</h2>
<p>Outliers in time series are trickier than in tabular data because context matters. For example, an unusually high or low voltage might be a sensor spike or a genuine grid event. You need methods that use <em>temporal context</em>, not just global statistics.</p>
<h3 id="heading-z-score-with-rolling-window">Z-Score with Rolling Window</h3>
<p>A global Z-score misses local anomalies in non-stationary series. A rolling Z-score flags values that are unusual <em>relative to their local neighbourhood</em>.</p>
<p><strong>Note</strong>: A <strong>non-stationary series</strong> is a time series whose statistical properties—such as mean, variance, or trend—change over time instead of remaining constant.</p>
<pre><code class="language-python">window = 24  # 24-hour rolling window

roll_mean = voltage_clean.rolling(window, center=True, min_periods=1).mean()
roll_std  = voltage_clean.rolling(window, center=True, min_periods=1).std()

rolling_z = (voltage_clean - roll_mean) / roll_std

threshold = 3.0
outliers_z = rolling_z[rolling_z.abs() &gt; threshold]

print(f"Rolling Z-score outliers detected: {len(outliers_z)}")
print(outliers_z.round(3))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Rolling Z-score outliers detected: 2
2024-06-04 06:00:00    4.646
2024-06-06 10:00:00   -4.484
Name: voltage_v, dtype: float64
</code></pre>
<p>Z-score outlier detection works best for approximately Gaussian (normal) distributions because it assumes the data is centered around a mean with symmetric spread measured by standard deviation.</p>
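<p>To see why the rolling version matters for non-stationary data, consider a synthetic signal with a level shift halfway through. The sketch below is illustrative only: the level shift, the spike size, and the names are assumptions, not values from the voltage dataset.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

t = np.arange(200)
# Hypothetical signal: level shifts from about 10 to about 100 halfway through
values = np.repeat([10.0, 100.0], 100) + 0.5 * np.sin(2 * np.pi * t / 24)
values[50] += 10.0  # local spike, small relative to the global spread

signal = pd.Series(values, index=pd.date_range("2024-06-01", periods=200, freq="h"))

# Global Z-score: one mean and one std for the whole series
global_z = (signal - signal.mean()) / signal.std()

# Rolling Z-score: mean and std from the surrounding 24 hours
roll = signal.rolling(24, center=True, min_periods=1)
rolling_z = (signal - roll.mean()) / roll.std()

threshold = 3.0
spike_ts = signal.index[50]
print(f"Global z at spike:  {global_z[spike_ts]:.2f}")
print(f"Rolling z at spike: {rolling_z[spike_ts]:.2f}")
print(f"Flagged globally:   {int(global_z.abs().gt(threshold).sum())}")
print(f"Flagged locally:    {int(rolling_z.abs().gt(threshold).sum())}")
</code></pre>
<p>The level shift inflates the global standard deviation so much that the spike sits well inside three standard deviations, while the rolling window, which only sees the local level, pushes it past the threshold.</p>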
<h3 id="heading-iqr-based-outlier-detection">IQR-Based Outlier Detection</h3>
<p>The interquartile range (IQR) method is more robust for detecting outliers in non-Gaussian distributions. The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1), representing the spread of the middle 50% of the data.</p>
<pre><code class="language-python">Q1 = voltage_clean.quantile(0.25)
Q3 = voltage_clean.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = voltage_clean[
    (voltage_clean &lt; lower_bound) | (voltage_clean &gt; upper_bound)
]

print(f"IQR bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"Outliers detected: {len(outliers_iqr)}")
print(outliers_iqr.round(2))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">IQR bounds: [220.16, 239.46]
Outliers detected: 2
2024-06-04 06:00:00    312.4
2024-06-06 10:00:00    187.2
Name: voltage_v, dtype: float64
</code></pre>
<h3 id="heading-isolation-forest-for-multivariate-outlier-detection">Isolation Forest — For Multivariate Outlier Detection</h3>
<p>When you have multiple sensors, an isolated reading on one channel might look normal, but its combination with readings from other channels reveals the anomaly. Isolation Forest handles this naturally.</p>
<pre><code class="language-python">from sklearn.ensemble import IsolationForest

# Build a multi-sensor DataFrame
np.random.seed(42)
n = 200

sensor_df = pd.DataFrame({
    "voltage_v":    230 + 3 * np.sin(2 * np.pi * np.arange(n) / 24) + np.random.normal(0, 1, n),
    "current_a":    15  + 0.8 * np.sin(2 * np.pi * np.arange(n) / 24) + np.random.normal(0, 0.3, n),
    "frequency_hz": 50  + np.random.normal(0, 0.05, n),
}, index=pd.date_range("2024-06-01", periods=n, freq="h"))

# Inject a multivariate anomaly — voltage drops, current spikes together
sensor_df.iloc[88, 0] = 194.2   # voltage dip
sensor_df.iloc[88, 1] = 28.7    # current surge (consistent with fault)

clf = IsolationForest(contamination=0.02, random_state=42)
# fit_predict returns labels, not scores: -1 marks an anomaly, 1 a normal row
sensor_df["anomaly_score"] = clf.fit_predict(sensor_df[["voltage_v", "current_a", "frequency_hz"]])

anomalies = sensor_df[sensor_df["anomaly_score"] == -1]
print(f"Anomalies detected: {len(anomalies)}")
print(anomalies[["voltage_v", "current_a", "frequency_hz"]].round(2))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Anomalies detected: 4
                     voltage_v  current_a  frequency_hz
2024-06-02 07:00:00     234.75      15.84         49.90
2024-06-04 06:00:00     233.09      15.82         50.15
2024-06-04 16:00:00     194.20      28.70         50.08
2024-06-06 05:00:00     235.09      15.41         49.91
</code></pre>
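<p>Note that <code>contamination=0.02</code> tells the model to flag roughly 2% of the 200 rows, which is why it reports four anomalies even though only one (2024-06-04 16:00) was injected. In practice, you'd follow up the flagged rows with domain-specific threshold rules.</p>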
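<p>Because the -1/1 labels depend on the chosen contamination rate, it can help to look at the continuous score as well. scikit-learn's <code>IsolationForest</code> exposes <code>decision_function</code>, where lower values mean more anomalous. A minimal sketch on a hypothetical two-sensor frame (the data and seed are made up for illustration):</p>
<pre><code class="language-python">import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n = 200

# Hypothetical two-channel feed with one injected joint fault at row 88
df = pd.DataFrame({
    "voltage_v": 230 + rng.normal(0, 1, n),
    "current_a": 15 + rng.normal(0, 0.3, n),
})
df.loc[88, ["voltage_v", "current_a"]] = [194.2, 28.7]

clf = IsolationForest(contamination=0.02, random_state=42).fit(df)

# Lower score = more anomalous; negative scores fall on the anomaly side
scores = pd.Series(clf.decision_function(df), index=df.index)

print(f"Most anomalous row: {scores.idxmin()}")
print(scores.nsmallest(4).round(3))
</code></pre>
<p>Ranking by score makes it easier to separate a clear fault from rows that were only flagged because the contamination rate required a fixed share of anomalies.</p>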
<h3 id="heading-outlier-treatment">Outlier Treatment</h3>
<p>Once outliers are identified, you can handle them in several ways:</p>
<ul>
<li><p>Cap them using Winsorization by limiting extreme values to a threshold.</p>
</li>
<li><p>Replace them with interpolated or estimated values.</p>
</li>
<li><p>Flag them so the model can handle them appropriately.</p>
</li>
</ul>
<pre><code class="language-python"># Winsorize: cap at the IQR bounds
voltage_winsorized = voltage_clean.clip(lower=lower_bound, upper=upper_bound)

# Replace outliers with time-interpolated values
voltage_outlier_fixed = voltage_clean.copy()
voltage_outlier_fixed[outliers_iqr.index] = np.nan
voltage_outlier_fixed = voltage_outlier_fixed.interpolate(method="time")

print("Outlier treatment comparison:")
for ts in outliers_iqr.index:
    print(f"\n  {ts}")
    print(f"    Original:     {voltage_clean[ts]:.2f}")
    print(f"    Winsorized:   {voltage_winsorized[ts]:.2f}")
    print(f"    Interpolated: {voltage_outlier_fixed[ts]:.2f}")
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Outlier treatment comparison:

  2024-06-04 06:00:00
    Original:     312.40
    Winsorized:   239.46
    Interpolated: 232.01

  2024-06-06 10:00:00
    Original:     187.20
    Winsorized:   220.16
    Interpolated: 231.43
</code></pre>
<p>Winsorization preserves the point but clips it to a plausible range — useful when you want to retain the information that something anomalous happened. Interpolation treats the outlier as if it were missing — better when you believe the reading is simply wrong.</p>
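<p>The third option, flagging, isn't shown in the code above. One possible shape for it is a boolean column that travels with the data, next to a cleaned copy. The bounds and readings below are assumed for illustration:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Hypothetical readings with one spike and one dip
readings = pd.Series(
    [230.1, 229.8, 312.4, 230.5, 187.2, 230.0],
    index=pd.date_range("2024-06-01", periods=6, freq="h"),
    name="voltage_v",
)
lower, upper = 220.0, 240.0  # assumed plausibility bounds

flagged = pd.DataFrame({
    "voltage_v": readings,
    "is_outlier": ~readings.between(lower, upper),
})

# Keep a cleaned column next to the flag so a model can use either one
flagged["voltage_clean"] = (
    flagged["voltage_v"].mask(flagged["is_outlier"]).interpolate(method="time")
)

print(flagged)
</code></pre>
<p>The raw value is never overwritten, so a later analysis can still decide that the "outlier" was a genuine grid event.</p>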
<h2 id="heading-how-to-remove-duplicates">How to Remove Duplicates</h2>
<p>Duplicate timestamps are common when data pipelines retry on failure. Unlike tabular duplicates, time series duplicates aren't always identical: a retry might deliver a slightly different reading for the same timestamp.</p>
<pre><code class="language-python"># Inject duplicate timestamps with slightly different values (retry scenario).
# Insert from the highest position down so earlier positions don't shift,
# and place each duplicate right after its original reading.
dup_index = index.tolist()
dup_index.insert(56, index[55])  # retry duplicate
dup_index.insert(21, index[20])  # exact duplicate timestamp

dup_values = voltage_clean.tolist()
dup_values.insert(56, voltage_clean.iloc[55] + 0.7)  # slightly different value
dup_values.insert(21, voltage_clean.iloc[20])

dup_series = pd.Series(dup_values, index=pd.DatetimeIndex(dup_index), name="voltage_v")

print(f"Length with duplicates: {len(dup_series)}")
print(f"Duplicate timestamps:   {dup_series.index.duplicated().sum()}")

# Strategy 1: keep first (original reading)
dedup_first = dup_series[~dup_series.index.duplicated(keep="first")]

# Strategy 2: keep mean (average across retries)
dedup_mean = dup_series.groupby(level=0).mean()

print(f"\nAfter dedup (keep first): {len(dedup_first)}")
print(f"After dedup (mean):       {len(dedup_mean)}")

# Show the retry duplicate
ts_retry = index[55]
print(f"\nRetry duplicate at {ts_retry}:")
print(f"  Values:      {dup_series[ts_retry].values.round(3)}")
print(f"  Keep first:  {dedup_first[ts_retry]:.3f}")
print(f"  Mean:        {dedup_mean[ts_retry]:.3f}")
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Length with duplicates: 170
Duplicate timestamps:   2

After dedup (keep first): 168
After dedup (mean):       168

Retry duplicate at 2024-06-03 07:00:00:
  Values:      [234.498 235.198]
  Keep first:  234.498
  Mean:        234.848
</code></pre>
<p>For most sensor pipelines, keep-first is the right default; the first delivery is the original reading. Mean makes sense when retries come from independent sensors measuring the same quantity.</p>
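<p>Before picking a strategy, it can be useful to know whether the duplicates agree with each other. The following sketch separates exact duplicates from conflicting ones on a small made-up feed. It compares floats exactly, which is an assumption; real sensor data may need a tolerance.</p>
<pre><code class="language-python">import pandas as pd

# Hypothetical feed: 01:00 is an exact duplicate, 02:00 is a conflicting retry
ts = pd.to_datetime([
    "2024-06-01 00:00",
    "2024-06-01 01:00", "2024-06-01 01:00",
    "2024-06-01 02:00", "2024-06-01 02:00",
])
feed = pd.Series([230.0, 231.0, 231.0, 229.5, 230.2], index=ts, name="voltage_v")

by_ts = feed.groupby(level=0)
deliveries = by_ts.size()          # how many rows arrived per timestamp
distinct_values = by_ts.nunique()  # how many different readings they carried

duplicated_ts = deliveries[deliveries.gt(1)].index
conflicting_ts = distinct_values[distinct_values.gt(1)].index
exact_ts = duplicated_ts.difference(conflicting_ts)

print(f"Exact duplicates:       {list(exact_ts)}")
print(f"Conflicting duplicates: {list(conflicting_ts)}")
</code></pre>
<p>Exact duplicates can be dropped without a second thought. Conflicting ones are the cases where the keep-first versus mean decision actually changes the data.</p>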
<h2 id="heading-frequency-alignment-and-resampling">Frequency Alignment and Resampling</h2>
<p>Real pipelines often mix data at different frequencies. For example, you may need a 1-minute meter reading merged with an hourly weather feed. Before joining them, you need to align frequencies explicitly.</p>
<pre><code class="language-python"># 1-minute power draw readings: 42 kW baseline, +18 kW between 08:00 and 18:59
minute_index = pd.date_range("2024-06-01", periods=1440, freq="min")
business_hours = minute_index.hour.isin(range(8, 19)).astype(int)

power_1min = pd.Series(
    42 + 18 * business_hours + np.random.normal(0, 2, 1440),
    index=minute_index,
    name="power_kw"
)

# Downsample to hourly: mean is appropriate for power (average over the hour)
power_hourly_mean = power_1min.resample("h").mean().round(2)

# Downsample to hourly: max (peak demand within the hour)
power_hourly_max = power_1min.resample("h").max().round(2)

# Downsample to hourly: sum of kW-minutes divided by 60 gives energy in kWh
energy_hourly_kwh = (power_1min.resample("h").sum() / 60).round(3)

comparison = pd.DataFrame({
    "mean_kw":    power_hourly_mean,
    "peak_kw":    power_hourly_max,
    "energy_kwh": energy_hourly_kwh,
}).iloc[7:13]

print(comparison)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">                     mean_kw  peak_kw  energy_kwh
2024-06-01 07:00:00    42.13    46.28      42.133
2024-06-01 08:00:00    60.56    64.81      60.557
2024-06-01 09:00:00    59.91    64.88      59.912
2024-06-01 10:00:00    60.07    65.16      60.066
2024-06-01 11:00:00    60.08    64.99      60.083
2024-06-01 12:00:00    59.72    63.65      59.724
</code></pre>
<p>Which aggregation you choose matters enormously for downstream use. Mean power is right for load profiling. Peak power is right for capacity planning. Sum (converted to kWh) is right for billing. The <em>right</em> answer depends on the domain question you're asking, not on the resampling API.</p>
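<p>Once both feeds share a frequency, the join itself is short. The sketch below shows two directions on hypothetical data: downsampling the meter to the weather feed's hourly grid, or holding each hourly weather value forward onto the 1-minute grid. The <code>temp_c</code> feed and its values are made up for illustration.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Hypothetical 1-minute meter readings over three hours
minute_index = pd.date_range("2024-06-01", periods=180, freq="min")
power_1min = pd.Series(np.linspace(40.0, 60.0, 180), index=minute_index, name="power_kw")

# Hypothetical hourly weather feed
weather_hourly = pd.Series(
    [18.0, 19.5, 21.0],
    index=pd.date_range("2024-06-01", periods=3, freq="h"),
    name="temp_c",
)

# Option 1: bring the meter down to the hourly grid, then join
hourly = pd.concat([power_1min.resample("h").mean(), weather_hourly], axis=1)

# Option 2: hold each hourly weather value until the next one arrives
weather_1min = weather_hourly.reindex(minute_index, method="ffill")
per_minute = pd.concat([power_1min, weather_1min], axis=1)

print(hourly)
print(per_minute.iloc[58:62])
</code></pre>
<p>Option 1 keeps only measured weather values and loses intra-hour power detail. Option 2 keeps the power detail but repeats each weather reading 60 times, so a model should not treat those rows as independent observations.</p>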
<h2 id="heading-smoothing-noise">Smoothing Noise</h2>
<p>Raw sensor data often contains high-frequency noise that obscures the underlying signal. Smoothing before feature engineering prevents the model from fitting to noise, but over-smoothing destroys real variation.</p>
<h3 id="heading-exponential-weighted-moving-average">Exponential Weighted Moving Average</h3>
<p>An exponentially weighted moving average (EWMA) gives <em>more weight to recent observations</em> and adapts quickly to level changes, which makes it a better fit than a simple moving average for non-stationary signals.</p>
<pre><code class="language-python"># Noisy temperature sensor (°C)
temp_noisy = pd.Series(
    3.5
    + 1.2 * np.sin(2 * np.pi * np.arange(168) / 24)
    + np.random.normal(0, 0.8, 168),  # high noise
    index=pd.date_range("2024-06-01", periods=168, freq="h"),
    name="temperature_c"
)

temp_ewma = temp_noisy.ewm(span=6, adjust=False).mean()
temp_sma  = temp_noisy.rolling(window=6, center=True).mean()

comparison = pd.DataFrame({
    "raw":  temp_noisy,
    "ewma": temp_ewma.round(3),
    "sma":  temp_sma.round(3),
}).iloc[22:30]

print(comparison)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">                          raw   ewma    sma
2024-06-01 22:00:00  3.212372  2.843  3.035
2024-06-01 23:00:00  3.106840  2.918  3.176
2024-06-02 00:00:00  3.712290  3.145  3.011
2024-06-02 01:00:00  3.344376  3.202  3.294
2024-06-02 02:00:00  2.148946  2.901  3.705
2024-06-02 03:00:00  4.241105  3.284  4.087
2024-06-02 04:00:00  5.677429  3.968  4.381
2024-06-02 05:00:00  5.400083  4.377  4.765
</code></pre>
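<p>It helps to know what <code>span</code> means. With <code>adjust=False</code>, pandas uses the recursion <code>y[t] = alpha * x[t] + (1 - alpha) * y[t-1]</code> with <code>alpha = 2 / (span + 1)</code>. The sketch below checks that by hand on a short made-up series:</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

x = pd.Series([3.0, 5.0, 4.0, 8.0, 6.0])  # hypothetical readings
span = 6
alpha = 2 / (span + 1)

# Manual recursion: start at the first value, then blend each new
# observation with the previous smoothed value
manual = [x.iloc[0]]
for value in x.iloc[1:]:
    manual.append(alpha * value + (1 - alpha) * manual[-1])

pandas_ewma = x.ewm(span=span, adjust=False).mean()

print(f"alpha = {alpha:.3f}")
print(f"Matches pandas: {np.allclose(manual, pandas_ewma)}")
</code></pre>
<p>With <code>span=6</code>, each new reading contributes about 29% of the smoothed value. A larger span lowers alpha and smooths more heavily, at the cost of reacting more slowly to real changes.</p>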
<h3 id="heading-savitzky-golay-filter">Savitzky-Golay Filter</h3>
<p>For signals where you need to preserve peak shapes — not just smooth them away — the <a href="https://eigenvector.com/wp-content/uploads/2020/01/SavitzkyGolay.pdf">Savitzky-Golay filter</a> fits a polynomial over a sliding window and is better at maintaining the height of genuine spikes.</p>
<pre><code class="language-python">from scipy.signal import savgol_filter

temp_savgol = pd.Series(
    savgol_filter(temp_noisy.values, window_length=11, polyorder=2),
    index=temp_noisy.index,
    name="temp_savgol"
).round(3)

print(pd.DataFrame({
    "raw":    temp_noisy,
    "savgol": temp_savgol,
}).iloc[22:30])
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">                          raw  savgol
2024-06-01 22:00:00  3.212372   2.960
2024-06-01 23:00:00  3.106840   2.944
2024-06-02 00:00:00  3.712290   3.114
2024-06-02 01:00:00  3.344376   3.379
2024-06-02 02:00:00  2.148946   3.809
2024-06-02 03:00:00  4.241105   4.288
2024-06-02 04:00:00  5.677429   4.749
2024-06-02 05:00:00  5.400083   5.138
</code></pre>
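<p>The peak-preservation claim is easier to see on a noise-free signal. This sketch applies a centered moving average and a Savitzky-Golay filter, both with an 11-point window, to a synthetic bump of height 10 (the bump shape and width are assumptions chosen for illustration):</p>
<pre><code class="language-python">import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

t = np.arange(101)
# Hypothetical noise-free bump centered at t=50 with height 10
peak = pd.Series(10.0 * np.exp(-0.5 * ((t - 50) / 4.0) ** 2))

window = 11
sma = peak.rolling(window, center=True).mean()
savgol = pd.Series(savgol_filter(peak.values, window_length=window, polyorder=2))

print(f"True peak height:        {peak.max():.2f}")
print(f"After moving average:    {sma.max():.2f}")
print(f"After Savitzky-Golay:    {savgol.max():.2f}")
</code></pre>
<p>The moving average flattens the bump noticeably, while the local quadratic fit keeps the filtered peak much closer to its true height.</p>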
<h2 id="heading-schema-and-sanity-validation">Schema and Sanity Validation</h2>
<p>Cleaning without validation is incomplete. You need automated checks that run every time new data arrives — catching problems before they silently corrupt downstream models.</p>
<pre><code class="language-python">def validate_time_series(series: pd.Series, config: dict) -&gt; dict:
    """
    Run schema and sanity checks on a time series.
    Returns a report dict with pass/fail per check.
    """
    report = {}

    # Frequency check
    inferred = pd.infer_freq(series.index)
    report["freq_regular"] = inferred == config["expected_freq"]

    # Missing value threshold
    missing_rate = series.isna().mean()
    report["missing_below_threshold"] = missing_rate &lt;= config["max_missing_rate"]
    report["missing_rate"] = round(missing_rate, 4)

    # Value range check
    in_range = series.dropna().between(config["min_value"], config["max_value"])
    report["values_in_range"] = in_range.all()
    report["out_of_range_count"] = (~in_range).sum()

    # Duplicate timestamps
    report["no_duplicates"] = not series.index.duplicated().any()

    # Monotonic index
    report["index_monotonic"] = series.index.is_monotonic_increasing

    return report


config = {
    # pandas 2.2+ reports hourly data as "h" (older versions use "H")
    "expected_freq":    "h",
    "max_missing_rate": 0.05,
    "min_value":        210.0,
    "max_value":        250.0,
}

report = validate_time_series(voltage_outlier_fixed, config)

print("=== VALIDATION REPORT ===")
for check, result in report.items():
    if check in ("missing_rate", "out_of_range_count"):
        print(f"  {check}: {result}")
    else:
        status = "✓ PASS" if result else "✗ FAIL"
        print(f"  {status}  {check}")
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">=== VALIDATION REPORT ===
  ✓ PASS  freq_regular
  ✓ PASS  missing_below_threshold
  missing_rate: 0.0
  ✓ PASS  values_in_range
  out_of_range_count: 0
  ✓ PASS  no_duplicates
  ✓ PASS  index_monotonic
</code></pre>
<p>This validator is the kind of function you wrap around every data ingestion step in a production pipeline. Run it before cleaning to know what's broken, and after cleaning to confirm everything passed.</p>
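<p>To use a report like this as a gate, something has to turn a failed check into a hard stop. One possible wrapper is sketched below. The helper names <code>failed_checks</code> and <code>require_valid</code> are hypothetical, and the sample report is made up in the same shape the validator returns:</p>
<pre><code class="language-python">import numpy as np

def failed_checks(report):
    """Return the names of pass/fail checks that did not pass."""
    return [
        name for name, result in report.items()
        if isinstance(result, (bool, np.bool_)) and not result
    ]

def require_valid(report):
    """Raise instead of letting a failing batch continue downstream."""
    failures = failed_checks(report)
    if failures:
        raise ValueError("Validation failed: " + ", ".join(failures))

# Hypothetical report in the same shape validate_time_series returns
sample_report = {
    "freq_regular": True,
    "missing_below_threshold": False,
    "missing_rate": 0.08,
    "no_duplicates": True,
}

try:
    require_valid(sample_report)
except ValueError as err:
    print(err)
</code></pre>
<p>Numeric entries such as <code>missing_rate</code> are skipped because they are informational, so a rate of 0.0 is never mistaken for a failed check.</p>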
<h2 id="heading-the-complete-cleaning-checklist">The Complete Cleaning Checklist</h2>
<p>Here's the full sequence to run on any incoming time series dataset:</p>
<table>
<thead>
<tr>
<th>Step</th>
<th>Technique</th>
<th>When to Use</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Audit</strong></td>
<td>Index check, missing map, value range</td>
<td>Always — before anything else</td>
</tr>
<tr>
<td><strong>Reindex</strong></td>
<td><code>reindex</code> to canonical frequency</td>
<td>When timestamps are absent rather than NaN</td>
</tr>
<tr>
<td><strong>Missing: short gaps</strong></td>
<td>Time interpolation</td>
<td>Continuous signals, gaps ≤ 3 steps</td>
</tr>
<tr>
<td><strong>Missing: step signals</strong></td>
<td>Forward fill</td>
<td>Categorical or setpoint data</td>
</tr>
<tr>
<td><strong>Missing: long gaps</strong></td>
<td>Seasonal decomposition impute</td>
<td>Seasonal signals, gaps &gt; 6 steps</td>
</tr>
<tr>
<td><strong>Outliers: univariate</strong></td>
<td>Rolling Z-score or IQR</td>
<td>Single sensor, local anomalies</td>
</tr>
<tr>
<td><strong>Outliers: multivariate</strong></td>
<td>Isolation Forest</td>
<td>Multiple correlated sensors</td>
</tr>
<tr>
<td><strong>Outlier treatment</strong></td>
<td>Winsorize or interpolate</td>
<td>Depending on whether event is real</td>
</tr>
<tr>
<td><strong>Duplicates</strong></td>
<td>Keep first or group mean</td>
<td>Pipeline retry duplicates</td>
</tr>
<tr>
<td><strong>Resampling</strong></td>
<td><code>.resample()</code> with correct aggregation</td>
<td>Frequency alignment before joins</td>
</tr>
<tr>
<td><strong>Smoothing</strong></td>
<td>EWMA or Savitzky-Golay</td>
<td>Noisy sensors before feature engineering</td>
</tr>
<tr>
<td><strong>Validation</strong></td>
<td>Schema + sanity checks</td>
<td>After cleaning, and on every new batch</td>
</tr>
</tbody></table>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>The order matters. Reindex before imputing. Impute before smoothing. Validate after everything. Skipping steps or doing them out of order compounds errors in ways that are very difficult to trace back once you're looking at model predictions.</p>
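<p>As a rough illustration of that ordering, here is a minimal sketch on toy data. The function name <code>clean_in_order</code>, the <code>max_gap</code> and <code>span</code> defaults, and the sample readings are all assumptions made for this example, and the two assertions are only a stand-in for a fuller validator.</p>

```python
import pandas as pd


def clean_in_order(raw, step, max_gap=3, span=5):
    """Apply the cleaning steps in the recommended order (illustrative sketch)."""
    # Duplicates: keep the first record for each timestamp, then sort
    series = raw[~raw.index.duplicated(keep="first")].sort_index()

    # Reindex: make absent timestamps explicit as NaN
    full_index = pd.date_range(series.index.min(), series.index.max(), freq=step)
    series = series.reindex(full_index)

    # Impute: time interpolation, filling at most `max_gap` consecutive NaNs
    series = series.interpolate(method="time", limit=max_gap)

    # Smooth last, after short gaps have been filled
    smoothed = series.ewm(span=span, adjust=False).mean()

    # Validate after everything
    assert smoothed.index.is_monotonic_increasing
    assert not smoothed.index.duplicated().any()
    return smoothed


# Toy hourly readings with one duplicated and one absent timestamp
idx = pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 01:00",
    "2024-01-01 02:00", "2024-01-01 04:00",
])
raw = pd.Series([230.0, 231.0, 999.0, 232.0, 234.0], index=idx)

cleaned = clean_in_order(raw, step=pd.Timedelta(hours=1))
print(cleaned)
```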
<p>Time series cleaning isn't glamorous work, but a model trained on clean data and thoughtfully engineered features will almost always outperform a more sophisticated model trained on data that wasn't cleaned properly. Getting this pipeline right is the highest-leverage thing you can do before you try running even the simplest algorithm on your time series data.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Data Science Insights: Why the Mean Lies When Handling Messy Retail Data ]]>
                </title>
                <description>
                    <![CDATA[ In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on. Let's take the case of a retail shop. If we're looking at the average order value to u ]]>
                </description>
                <link>https://www.freecodecamp.org/news/data-science-insights-why-the-mean-lies-when-handling-messy-retail-data/</link>
                <guid isPermaLink="false">69fa21e5a386d7f121b5fe8c</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ statistics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rakshath Naik ]]>
                </dc:creator>
                <pubDate>Tue, 05 May 2026 16:59:17 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4441dcfc-d100-4613-9937-9c62449c6780.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on.</p>
<p>Let's take the case of a retail shop. If we're looking at the average order value to understand customer spending, we'd load the data, run the code, and get a result of $20 per order.</p>
<p>Done.</p>
<p>Except something looks odd.</p>
<p>When we take a closer look, we see that most customers are buying items worth $8–$15. So where's $20 coming from?</p>
<p>In this case, the problem isn’t the data – it’s the average. This is a classic textbook trap: the calculation works perfectly on tidy textbook examples, but real-world data doesn’t behave so nicely.</p>
<p>Some customers buy in bulk (very large orders), some return orders (negative quantities), and a few anomalies distort the entire picture.</p>
<p>In this article, we'll use the Online Retail Dataset to answer a simple but tricky question: What does “average” really mean in the real world?</p>
<h2 id="heading-table-of-contents">Table Of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-dataset">The Dataset</a></p>
</li>
<li><p><a href="#heading-mean-the-sensitive-giant">Mean: The Sensitive Giant</a></p>
</li>
<li><p><a href="#heading-median-the-robust-middle">Median: The Robust Middle</a></p>
</li>
<li><p><a href="#heading-beyond-averages-understanding-spread-with-quartiles">Beyond Averages: Understanding Spread with Quartiles</a></p>
</li>
<li><p><a href="#heading-applying-iqr-to-our-dataset">Applying IQR to Our Dataset</a></p>
</li>
<li><p><a href="#heading-final-comparison-and-insights">Final Comparison and Insights</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-connect-with-me">Connect with me</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along here, you'll need:</p>
<p><strong>Basic Python knowledge:</strong> Understanding of variables and functions.</p>
<p><strong>The Pandas library:</strong> Familiarity with loading data and basic DataFrame operations.</p>
<p><strong>A development environment:</strong> Access to a tool like Jupyter Notebook, VS Code, or Google Colab.</p>
<p><strong>A Dataset:</strong> For this analysis, I used the Online Retail Dataset, which is available for download <a href="https://archive.ics.uci.edu/dataset/352/online+retail">here</a>.</p>
<h2 id="heading-the-dataset"><strong>The Dataset</strong></h2>
<p>We'll work with the Online Retail Dataset, a real-world transactional dataset containing purchase records from a UK-based online retail store.</p>
<ol>
<li><p><strong>Source:</strong> UCI Machine Learning Repository</p>
</li>
<li><p><strong>Collected by:</strong> UK-based online retail company (2010–2011)</p>
</li>
<li><p><strong>Size:</strong> 541,909 transactions</p>
</li>
<li><p><strong>Features:</strong> 8 attributes (InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country)</p>
</li>
<li><p><strong>Ownership:</strong> Public dataset hosted by UCI</p>
</li>
<li><p><strong>License:</strong> Open for research and educational use</p>
</li>
</ol>
<h2 id="heading-mean-the-sensitive-giant">Mean: The Sensitive Giant</h2>
<p>In statistics and data analysis, the terms "<strong>average</strong>" and "<strong>arithmetic mean</strong>" are often used interchangeably. We aim to find the mean total price in our dataset. Mean in the context of the Online Retail Dataset is given as:</p>
<p>$$\text{Average Order Value} = \frac{\text{Sum of all TotalPrice values}}{\text{Number of transactions}}$$</p>
<p>In our dataset, the mean is calculated by summing all transaction values (including bulk purchases and returns) and dividing by the total number of transactions. This means every value, whether unusually high or negative, directly influences the final average.</p>
<pre><code class="language-python">import pandas as pd

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
df = pd.read_excel(url, engine='openpyxl')

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate the Mean (Average Order Value)
mean_value = df['TotalPrice'].mean()
print(f"Average Order Value (Mean): {mean_value:.2f}")
</code></pre>
<p>The results are as follows:</p>
<pre><code class="language-python">Average Order Value (Mean): 20.40
</code></pre>
<p>At first glance, the result looks reasonable: every transaction contributes equally. But that’s exactly where the problem lies. A few extremely high or low transactions can pull the mean away from the range where most customers actually spend.</p>
<p>Take a look at the graph for the mean below.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/583bebff-0e5e-44b8-80cb-48e4662b9abf.png" alt="The graph shows the calculated mean for the Online Retail Dataset, where we get a mean of 20.40" style="display:block;margin:0 auto" width="876" height="547" loading="lazy">

<p>The graph shows the mean Total Price for the Online Retail Dataset. We get a mean of 20.40. (Image by Author)</p>
<p>The graph shows <strong>a right-skewed distribution</strong> where the calculated mean of 20.40 is actually a textbook trap. The tallest bar clearly shows that the majority of transactions lie in the $8–$15 range, but the <strong>red line</strong> is being dragged to the right by the <strong>long tail</strong> of high-value bulk orders from a few customers.</p>
<p>In this scenario, the average price is well above what a typical customer actually spends because it's highly sensitive to outliers – and in reality, the bulk of the data lives in the lower price range.</p>
<p>Simply put, a handful of extreme values, especially those in the 200–300 range visible in the graph, pull the mean to the right.</p>
<h2 id="heading-median-the-robust-middle">Median: The Robust Middle</h2>
<p>When the mean is distorted by extreme values, we need a metric that remains unaffected by such outliers. This is where the median comes into play.</p>
<p>Median is defined as the <strong>middle value after sorting the data.</strong></p>
<p>In our dataset, we sort all the transactions and pick the middle one.</p>
<p>The formula for calculating the median is:</p>
<p>$$\text{Median} = \begin{cases} X_{\left[ \frac{n+1}{2} \right]} &amp; \text{if } n \text{ is odd} \\ \frac{X_{\left[ \frac{n}{2} \right]} + X_{\left[ \frac{n}{2} + 1 \right]}}{2} &amp; \text{if } n \text{ is even} \end{cases}$$</p>
<p>Unlike the mean, the median doesn't depend on extreme values, and it cares only about the position of the data, not the magnitude.</p>
<pre><code class="language-python"># Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate only the Median
median_value = df['TotalPrice'].median()
print(f"Typical Order Value (Median): {median_value:.2f}")
</code></pre>
<p>The results are as follows:</p>
<pre><code class="language-python">Typical Order Value (Median): 11.10
</code></pre>
<p>Now you'll notice that the result falls in the $8–$15 range, where most of the transactions are concentrated.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/d89a4912-0e44-485e-8ea0-ff559cea6eba.png" alt="The figure demonstrates the graph for the median, where we get an accurate value of the transactions by the customers." style="display:block;margin:0 auto" width="876" height="547" loading="lazy">

<p>The figure demonstrates the graph for the median, where we get an accurate value of the transactions by the customers. (Image by Author)</p>
<p>In the previous graph, the mean was pulled to the right by large orders, but the median just asks what the middle customer spends. So even if someone spends $300 or some transactions are negative, the median stays stable.</p>
<p>In the figure above, <strong>the median</strong> accurately marks the range where most customers' orders fall.</p>
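<p>A tiny example makes the contrast concrete. The order values below are made up for illustration and are not taken from the Online Retail Dataset.</p>

```python
from statistics import mean, median

# Hypothetical order values in dollars (illustrative only)
orders = [8, 10, 11, 12, 15]
print(mean(orders), median(orders))

# Add a single bulk order and recompute both measures
with_bulk = orders + [300]
print(round(mean(with_bulk), 2), median(with_bulk))
```

<p>One bulk order moves the mean from 11.2 to 59.33, while the median only shifts from 11 to 11.5.</p>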
<h2 id="heading-beyond-averages-understanding-spread-with-quartiles"><strong>Beyond Averages: Understanding Spread with Quartiles</strong></h2>
<p>So far, we've studied the median, but knowing the center is not enough.</p>
<p>To truly understand customer spending, we need to understand how the data is spread, and this is where quartiles come into play.</p>
<p>Quartiles divide the dataset into the following parts:</p>
<ol>
<li><p><strong>Q1 (25th percentile):</strong> 25% of transactions are below this.</p>
</li>
<li><p><strong>Q2 (50th percentile):</strong> Median</p>
</li>
<li><p><strong>Q3 (75th percentile):</strong> 75% of transactions are below this.</p>
</li>
</ol>
<p>The distance between Q3 and Q1 is formally expressed as the Interquartile Range (IQR):</p>
<p>$$IQR = Q_3 - Q_1$$</p>
<h3 id="heading-the-iqr-detecting-outliers"><strong>The IQR: Detecting Outliers</strong></h3>
<p>The IQR measures the spread of the middle 50%.</p>
<p>If the IQR is small, then the data is concentrated. If it's large, the data is spread out. The IQR also helps us identify outliers mathematically.</p>
<p>Outlier Rule:</p>
<ol>
<li><p><strong>Lower Bound = Q1 - 1.5 * IQR</strong></p>
</li>
<li><p><strong>Upper Bound = Q3 + 1.5 * IQR</strong></p>
</li>
</ol>
<h4 id="heading-a-simple-example-to-understand-iqr">A Simple Example to Understand IQR</h4>
<p>Consider the following transaction values:</p>
<p>$$\left[ 5, 8, 10, 12, 15, 18, 20 \right]$$</p>
<h4 id="heading-step-1-find-the-median-q2">Step 1: Find the Median (Q2):</h4>
<p>The middle value is:</p>
<p>$$Q_2 = 12$$</p>
<h4 id="heading-step-2-find-q1-lower-quartile">Step 2: Find Q1 (Lower Quartile):</h4>
<p>The lower half is [5, 8, 10]. The median of the lower half is:</p>
<p>$$Q_1 = 8$$</p>
<h4 id="heading-step-3-find-q3-upper-quartile">Step 3: Find Q3 (Upper Quartile):</h4>
<p>The upper half is [15, 18, 20]. The median of the upper half is:</p>
<p>$$Q_3 = 18$$</p>
<h4 id="heading-step-4-calculate-iqr">Step 4: Calculate IQR:</h4>
<p>$$IQR = Q_3 - Q_1 = 18 - 8 = 10$$</p>
<h4 id="heading-step-5-find-outlier-bounds">Step 5: Find Outlier Bounds:</h4>
<p>$$\begin{aligned} \text{Lower Bound} &amp;= Q_1 - 1.5 \times IQR = 8 - 15 = -7 \\ \text{Upper Bound} &amp;= Q_3 + 1.5 \times IQR = 18 + 15 = 33 \end{aligned}$$</p>
<p>Any value <strong>below -7 or above 33</strong> is an outlier (but in this demo problem, no outliers exist).</p>
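<p>The five steps above can be reproduced with Python's standard library. This is a sketch that assumes an odd number of sorted values, as in the example. It also shows that linear interpolation, which is the default convention in pandas' <code>quantile()</code>, gives slightly different quartiles on such a small sample.</p>

```python
from statistics import median, quantiles

values = [5, 8, 10, 12, 15, 18, 20]  # already sorted, odd length
half = len(values) // 2

# Median-of-halves method from the steps above (the middle value is excluded)
q1 = median(values[:half])
q3 = median(values[half + 1:])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print(q1, q3, iqr, lower_bound, upper_bound)

# Linear interpolation between neighbouring values instead of median-of-halves
q1_linear, _, q3_linear = quantiles(values, n=4, method="inclusive")
print(q1_linear, q3_linear)
```

<p>The first line prints <code>8 18 10 -7.0 33.0</code>, matching the hand calculation. The second prints <code>9.0 16.5</code>, so expect small differences between textbook quartiles and library output when the sample is tiny.</p>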
<h2 id="heading-applying-iqr-to-our-dataset"><strong>Applying IQR to Our Dataset</strong></h2>
<p>In our retail dataset, instead of neat values, we have bulk values and even negative returns.</p>
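<pre><code class="language-python"># 1. Calculate IQR and Bounds
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# 2. Flag the transactions that fall outside the bounds
is_outlier = (df['TotalPrice'] &lt; lower_bound) | (df['TotalPrice'] &gt; upper_bound)

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
print(f"Number of Outliers: {is_outlier.sum()}")
</code></pre>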
<p>When we calculate IQR for our dataset, we get:</p>
<pre><code class="language-python">Lower Bound: -18.75
Upper Bound: 42.45
Number of Outliers: 33180
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/e528db9b-57f9-4ee4-b331-143c2b1947fb.png" alt="The figure demonstrates the outlier range for our dataset" style="display:block;margin:0 auto" width="1036" height="547" loading="lazy">

<p>The graph demonstrates outliers, which are any values falling outside the range of -18.75 to 42.45. (Image by Author)</p>
<p>As the graph shows, the values outside the range -18.75 to 42.45 are considered outliers. These values will be removed.</p>
<h3 id="heading-revisiting-the-mean-after-removing-outliers">Revisiting the Mean After Removing Outliers</h3>
<p>Using the IQR method, we can now remove the extreme transactions that fall outside the typical spending range and recompute the mean.</p>
<pre><code class="language-python"># Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Original Mean
mean_value = df['TotalPrice'].mean()
print(f"Original Mean: {mean_value:.2f}")

# IQR Calculation
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")

# Remove Outliers
df_no_outliers = df[(df['TotalPrice'] &gt;= lower_bound) &amp; (df['TotalPrice'] &lt;= upper_bound)]

# New Mean after removing outliers
new_mean = df_no_outliers['TotalPrice'].mean()
print(f"Mean after removing outliers: {new_mean:.2f}")
</code></pre>
<p>After recomputing, we get:</p>
<pre><code class="language-python">Original Mean: 20.40
Lower Bound: -18.75
Upper Bound: 42.45
Mean after removing outliers: 11.63
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/17e6c2d0-883f-4e48-b45b-d1bf93164c63.png" alt="The graph demonstrates that the mean improves significantly after all outliers are removed. (Image by Author)" style="display:block;margin:0 auto" width="876" height="547" loading="lazy">

<p>Removing outliers significantly shifts the mean toward the region where most transactions occur. The mean is now 11.63, far more representative than the right-skewed 20.40 we got with the outliers included.</p>
<h2 id="heading-final-comparison-and-insights"><strong>Final Comparison and Insights</strong></h2>
<p>Looking at the results from all the graphs, we get a complete understanding of the dataset. The original mean was 20.40, which is significantly higher than the value of most transactions. The mean was pulled upward and distorted by a small number of high-value outliers.</p>
<p>The median, on the other hand, was 11.10, which lies within the range where most transactions are concentrated. This shows that the median is a much better representation of what a typical customer spends, as it's not affected by extreme values.</p>
<p>After removing the outliers using the IQR, the mean dropped to 11.63, bringing it very close to the median. This confirms that the earlier mean was not inherently wrong, but was simply influenced by extreme values in the data. Once those values were handled, the mean became a much more reliable measure of central tendency.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The results show that the mean can be misleading when data contains outliers. In our dataset, the original mean of 20.40 overstated customer spending, while the median (11.10) gave a more realistic picture. After removing outliers, the mean shifted to 11.63, aligning closely with the median.</p>
<p>This highlights a key lesson: <strong>The mean isn't wrong, but it must be used with an understanding of the data.</strong></p>
<p>Choosing the right measure of average depends on the dataset, and in messy real-world scenarios, the median or a cleaned mean often tells the true story.</p>
<h2 id="heading-connect-with-me"><strong>Connect with me</strong></h2>
<ol>
<li><p><a href="https://medium.com/@rakshathnaik62">Medium</a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/rakshath-/">LinkedIn</a></p>
</li>
</ol>
<p>If you want to dive deeper, you can visit: <a href="https://qubrica.com/mean-median-mode-python-guide/"><strong>Mean vs Median vs Mode: Understanding Central Tendency in Data Analysis</strong></a><strong>.</strong></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The Data Quality Handbook: Data Errors, the Developer's Role, and Validation Layers Explained. ]]>
                </title>
                <description>
                    <![CDATA[ In August 2012, Knight Capital, a major trading firm in the United States, deployed faulty trading software to its production system. The system used this incorrect configuration data and it triggered ]]>
                </description>
                <link>https://www.freecodecamp.org/news/data-quality-handbook-data-errors-the-developer-s-role-validation-layers/</link>
                <guid isPermaLink="false">69dea3b491716f3cfb75fd9d</guid>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Validation ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Testing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ handbook ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Great John ]]>
                </dc:creator>
                <pubDate>Tue, 14 Apr 2026 20:29:40 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4f0c9085-cb4f-4255-b7a0-e146eafc32c9.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In August 2012, Knight Capital, a major trading firm in the United States, deployed faulty trading software to its production system. The system ran with incorrect configuration data, which triggered millions of unintended stock trades.</p>
<p>The company lost about $440 million in just 45 minutes. Knight Capital nearly collapsed and had to be rescued by investors. It was later acquired by another firm.</p>
<p>When Target expanded into Canada, the company relied on a new supply chain system that contained incorrect product and inventory data. Product information in the database was incomplete and inaccurate. Prices, sizes, and product descriptions were entered incorrectly.</p>
<p>Inventory systems reported items in stock that were actually unavailable. Customers found empty shelves in stores despite the system showing stock. The company lost over $2 billion in the Canadian market. Target eventually shut down all Canadian stores in 2015.</p>
<p>One employee stated: “Even though we had a great supply chain system on paper, we didn’t have accurate data. Bad data leads to bad decisions.”</p>
<p>Another famous example of data-related engineering failures involves the Mars Climate Orbiter spacecraft. One engineering team used metric units (newtons). Another team used imperial units (pounds-force). The system failed to convert the data correctly. The spacecraft entered Mars' atmosphere at the wrong altitude. The mission failed and the spacecraft was destroyed. The loss was about $125 million.</p>
<p>In this article, we'll delve deep into what data quality truly means, the types of data errors that silently break systems, the developer’s responsibility in preventing them, and the validation layers that work together to keep bad data out of production.</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-importance-of-data-quality">The Importance of Data Quality</a></p>
<ul>
<li><p><a href="#heading-how-does-bad-data-happen-in-the-first-place">How Does Bad Data Happen in the First Place?</a></p>
</li>
<li><p><a href="#heading-the-cost-of-bad-data">The Cost of Bad Data</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-types-of-data-errors">Types of Data Errors</a></p>
<ul>
<li><p><a href="#heading-required-field-errors">Required Field Errors</a></p>
</li>
<li><p><a href="#heading-format-validation-errors">Format Validation Errors</a></p>
</li>
<li><p><a href="#heading-range-and-limit-errors">Range and Limit Errors</a></p>
</li>
<li><p><a href="#heading-logical-consistency-errors">Logical Consistency Errors</a></p>
</li>
<li><p><a href="#heading-duplicate-and-data-integrity-errors">Duplicate and Data Integrity Errors</a></p>
</li>
<li><p><a href="#heading-relational-errors-reference-integrity">Relational Errors (Reference Integrity)</a></p>
</li>
<li><p><a href="#heading-structural-errors-dropdowns-radio-buttons-enums">Structural Errors (Dropdowns, Radio Buttons, Enums)</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-what-makes-good-data">What Makes Good Data?</a></p>
<ul>
<li><p><a href="#heading-completeness">Completeness:</a></p>
</li>
<li><p><a href="#heading-uniqueness">Uniqueness:</a></p>
</li>
<li><p><a href="#heading-validity">Validity:</a></p>
</li>
<li><p><a href="#heading-timeliness">Timeliness:</a></p>
</li>
<li><p><a href="#heading-accuracy">Accuracy:</a></p>
</li>
<li><p><a href="#heading-consistency">Consistency:</a></p>
</li>
<li><p><a href="#heading-fitness-for-purpose">Fitness for Purpose:</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-data-validation-layers">Data Validation Layers</a></p>
<ul>
<li><p><a href="#heading-frontend-layer-protect-the-user-not-the-system">Frontend Layer — “Protect the User, Not the System”</a></p>
</li>
<li><p><a href="#heading-backend-validation-the-real-gatekeeper">Backend Validation — “The Real Gatekeeper”</a></p>
</li>
<li><p><a href="#heading-database-layer-protect-the-data-at-rest">Database Layer — “Protect the Data at Rest”</a></p>
</li>
<li><p><a href="#heading-service-layer-business-logic-validate-real-world-rules">Service Layer / Business Logic — “Validate Real-World Rules”</a></p>
</li>
<li><p><a href="#heading-jobs-queues-data-ingestion-validate-external-data">Jobs / Queues / Data Ingestion — “Validate External Data”</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-testing-strategies-to-protect-data-quality">Testing Strategies to Protect Data Quality</a></p>
<ul>
<li><p><a href="#heading-unit-testing-the-schema-amp-constraint-check">Unit Testing: The Schema &amp; Constraint Check</a></p>
</li>
<li><p><a href="#heading-integration-testing-the-flow-amp-lineage-check">Integration Testing: The Flow &amp; Lineage Check</a></p>
</li>
<li><p><a href="#heading-functional-testing-the-business-rule-check">Functional Testing: The Business Rule Check</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>A basic understanding of what data is</p>
</li>
<li><p>A basic understanding of data structures</p>
</li>
<li><p>An understanding of what an API is</p>
</li>
<li><p>An understanding of what a database is and what it does</p>
</li>
</ul>
<h2 id="heading-the-importance-of-data-quality">The Importance of Data Quality</h2>
<p>As you can see from just these few examples, the quality of the data you're working with really matters.</p>
<p>Gartner reports that organisations attribute <a href="https://www.forbes.com/councils/forbestechcouncil/2021/10/14/flying-blind-how-bad-data-undermines-business/"><strong>around $15 million in annual losses</strong></a> to poor‑quality data. The same research also shows that <a href="https://www.forbes.com/councils/forbestechcouncil/2021/10/14/flying-blind-how-bad-data-undermines-business/"><strong>nearly 60% of companies have no clear idea what bad data actually costs them</strong></a>, largely because they don’t track or measure data‑quality issues at all.</p>
<p>A 2016 study by IBM is even more eye-popping. IBM found that <a href="https://community.sap.com/t5/technology-blog-posts-by-sap/bad-data-costs-the-u-s-3-trillion-per-year/ba-p/13575387">poor data quality strips $3.1 trillion from the U.S. economy annually</a> due to lower productivity, system outages, and higher maintenance costs.</p>
<p>Bad data is, and will continue to be, the kryptonite of any organisation. This is even more concerning as more organisations now depend on data for strategy execution than ever before.</p>
<p>When data is wrong, incomplete, duplicated, or inconsistent, the consequences ripple outward: incorrect dashboards mislead teams, misled teams make poor decisions, and acting on those decisions produces faulty strategy and policy.</p>
<p>Eventually, the organisation pays the price, financially, operationally, and reputationally. And while money can be recovered, reputation rarely bounces back so easily.</p>
<h3 id="heading-how-does-bad-data-happen-in-the-first-place">How Does Bad Data Happen in the First Place?</h3>
<p>Form fields are usually the first place where data enters an application, so they’re often where bad data begins. This is why the developer’s role is so critical.</p>
<p>Many of the most damaging data errors don’t originate from malicious users or complex edge cases – they come from simple oversights that the system should never have allowed in the first place.</p>
<p>But it's equally important to recognise that data quality issues often originate <em>before</em> the data ever reaches an application. Upstream processes — how data is collected, measured, recorded, or pre‑validated — can introduce inaccuracies long before the system receives it.</p>
<p>For example, a nurse might weigh a patient using an uncalibrated mechanical scale, record the incorrect value on a paper form, and later have that value transcribed into the hospital system. By the time the data enters the application, the error is already embedded.</p>
<p>This means that maintaining data quality requires attention both to upstream data collection practices and to the system-level validation that developers control.</p>
<p>When the UI, backend, or API layer permits invalid, incomplete, inconsistent, or logically impossible data to enter the pipeline, the organisation inherits a long‑term liability. Even small choices — such as allowing empty fields, ignoring duplicates, or failing to enforce validation rules — can introduce errors that may only surface months later in reports or dashboards, leading to confusion and inaccurate insights.</p>
<h3 id="heading-the-cost-of-bad-data">The Cost of Bad Data</h3>
<p>Data quality can also be impacted at any stage of the data pipeline: before ingestion, in production, or even during analysis.</p>
<p>If bad data is caught in the UI, it's almost free, if we're thinking in terms of cost. If it's caught at the API layer, that's still pretty cheap. If it's caught in the database, the cost is moderate. And if it's caught in a report or ML model months later, that's expensive, and sometimes irreversible.</p>
<p>A key principle in modern data management is: the cheapest and safest place to catch bad data is at the source, and that is before ingestion. <a href="https://www.matillion.com/blog/the-1-10-100-rule-of-data-quality-a-critical-review-for-data-professionals">The well-known 1-10-100 Rule</a>, introduced by George Labovitz and Yu Sang Chang in 1992, clearly illustrates this idea.</p>
<p>According to the rule, it costs about $1 to validate data at the point of entry, $10 to correct it after it has entered the system, and $100 per record if the error goes unnoticed and causes problems further down the line.</p>
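<p>To see how quickly this adds up, here is a small arithmetic sketch. The batch size of 1,000 bad records is a hypothetical figure chosen only for illustration, and the stage labels are shorthand for the three stages of the rule.</p>

```python
# Per-record costs from the 1-10-100 rule, in dollars
COST_PER_RECORD = {"at entry": 1, "after entry": 10, "downstream": 100}

bad_records = 1_000  # hypothetical batch size, for illustration only

for stage, unit_cost in COST_PER_RECORD.items():
    total = unit_cost * bad_records
    print(f"{stage}: {total:,} dollars")
```

<p>The same 1,000 records cost 1,000 dollars to stop at entry, 10,000 to fix afterwards, and 100,000 if they slip through unnoticed.</p>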
<p>As the saying goes, an ounce of prevention is worth a pound of cure – and this is especially true when it comes to maintaining high-quality data.</p>
<p>To support this point, I’ve categorised the types of errors and oversights that developers can, and should, prevent before they ever reach the database, analytics layer, or reporting systems.</p>
<h2 id="heading-types-of-data-errors">Types of Data Errors</h2>
<h3 id="heading-required-field-errors">Required Field Errors</h3>
<p>If you build a registration form that lets a user submit it with important fields left empty (like first name, last name, email address, phone number, date of birth, or address), you're directly letting incomplete data enter the system.</p>
<p>I remember a scenario from my time as a data analyst where I was analysing a dataset containing different types of alarms triggered across several buildings. These alarms fell into categories such as aquarium alarms, intruder alarms, fire alarms, and maintenance alarms.</p>
<p>The purpose of the analysis was simple: identify which buildings had the highest frequency of alarms so that maintenance, resources, or investigations could be allocated appropriately.</p>
<p>Whenever an alarm went off, the security team recorded it using a software system. By the end of each month, we could view the cumulative alarms and generate insights.</p>
<p>But I encountered a major data quality issue. The security officers often selected the alarm category but failed to submit the building where the alarm occurred — and the system allowed this incomplete record to be saved into the database.</p>
<p>Every alarm had to occur in a specific building. Yet during analysis, I would see entries like “20 fire alarms” with no building information attached. Since I couldn’t determine where these alarms happened, the data became unusable. I had no choice but to delete those records because they provided no actionable value.</p>
<p>This is a classic example of poor data validation. If the developer had implemented proper constraints, the system would never allow an alarm to be submitted without a building name.</p>
<p>Required fields should be enforced at the UI and backend levels to prevent missing data from entering the system in the first place. These gaps lead to missing or unusable data in the database, often forcing teams to delete or manually repair records later.</p>
<p>To prevent these errors, you can use required‑field validation, disable the submit button until all mandatory fields are completed, and visually highlight missing fields with inline error messages.</p>
<p>Here's a practical code example of some bad code (no required checks):</p>
<pre><code class="language-plaintext">&lt;form id="signup"&gt;
  &lt;input type="text" id="name" placeholder="Full name"&gt;
  &lt;input type="email" id="email" placeholder="Email"&gt;
  &lt;button type="submit"&gt;Sign up&lt;/button&gt;
&lt;/form&gt;

&lt;script&gt;
document.getElementById("signup").addEventListener("submit", e =&gt; {
  const name = document.getElementById("name").value;
  const email = document.getElementById("email").value;
  console.log("Submitted:", { name, email });
});
&lt;/script&gt;
</code></pre>
<p>From the above code snippet, the core problem is that the form doesn't enforce required input. Neither HTML‑level validation (using the <code>required</code> attribute) nor JavaScript‑based checks are implemented. This omission allows users to submit the form without providing necessary information, making the form unreliable for collecting valid and complete user data.</p>
<p>From a usability and data quality perspective, this is problematic. Forms are typically designed to collect meaningful and complete information, and fields such as “Full name” and “Email” are usually essential. Without marking these inputs as required or validating them programmatically, we risk receiving blank or invalid submissions, which can compromise the quality of stored data and any processes that depend on it.</p>
<p>Here's an example of a better version (UI prevents empty submission):</p>
<pre><code class="language-plaintext">&lt;form id="signup"&gt;
  &lt;input type="text" id="name" placeholder="Full name" required&gt;
  &lt;input type="email" id="email" placeholder="Email" required&gt;
  &lt;button type="submit"&gt;Sign up&lt;/button&gt;
&lt;/form&gt;

&lt;script&gt;
document.getElementById("signup").addEventListener("submit", e =&gt; {
  if (!e.target.checkValidity()) {
    e.preventDefault();
    alert("Please fill in all required fields.");
  }
});
&lt;/script&gt;
</code></pre>
<p>In this revised version of the code, the addition of the <code>required</code> attribute to both the name and email input elements ensures that the browser won't allow the form to be submitted unless these fields are filled. This is an important step toward maintaining data completeness and improving the overall reliability of the form.</p>
<p>The <code>submit</code> handler adds a second check with <code>e.target.checkValidity()</code>. Browsers already run this validation before firing the <code>submit</code> event, so the handler only comes into play when native validation is switched off (for example, with the <code>novalidate</code> attribute on the form).</p>
<p>In that case, the conditional <code>e.preventDefault()</code> stops the default submission when the form is invalid, preventing incomplete or incorrect data from being sent.</p>
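<p>Since required fields should also be enforced on the backend, here is a minimal sketch of the same check written as a plain function that could run on the server. The helper name <code>findMissingFields</code> and the sample record are assumptions made for illustration, not part of any framework.</p>

```javascript
// Hypothetical helper: returns the names of required fields that are
// missing, null, or blank (whitespace only) in a submitted record.
function findMissingFields(record, requiredFields) {
  return requiredFields.filter((field) => {
    const value = record[field];
    return value === undefined || value === null || String(value).trim() === "";
  });
}

// Example: an alarm report saved with a blank building.
const alarm = { category: "fire", building: "   " };
const missing = findMissingFields(alarm, ["category", "building"]);
console.log(missing); // ["building"]
```

<p>A request handler could reject the record whenever the returned list is not empty, instead of saving it and leaving the gap for an analyst to find later.</p>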
<h3 id="heading-format-validation-errors">Format Validation Errors</h3>
<p>If a form accepts an email without an @ symbol, an email without a domain, a phone number containing letters, or a postcode/ZIP code in the wrong format, it lets invalid data enter the system.</p>
<p>The same applies when you allow a user to submit an impossible date (32/15/2025) or a credit card number with the wrong length.</p>
<p>These issues will cause the data analyst to spend more time cleaning the data, if it's even cleanable. And such incorrect inputs create unreliable data that breaks downstream processes and increases cleanup costs.</p>
<p>To prevent these types of errors, you can use regex validation, input masks, and field‑type restrictions (for example, numeric‑only fields for phone numbers) to enforce correct formats before submission.</p>
<p>Here's a bad example of allowing format validation errors:</p>
<pre><code class="language-plaintext">&lt;input id="phone" placeholder="Phone number"&gt;
&lt;button onclick="save()"&gt;Save&lt;/button&gt;

&lt;script&gt;
function save() {
  const phone = document.getElementById("phone").value;
  console.log("Saving phone:", phone);
}
&lt;/script&gt;
</code></pre>
<p>This code doesn't perform any checks on the format or structure of the phone number. The function simply retrieves whatever value exists (valid, invalid, or blank) and logs it to the console without any condition.</p>
<p>Here's the fixed version:</p>
<pre><code class="language-plaintext">&lt;input id="phone" placeholder="Phone number" required&gt;
&lt;button onclick="save()"&gt;Save&lt;/button&gt;

&lt;script&gt;
function save() {
  const phone = document.getElementById("phone").value;

  if (!/^\d+$/.test(phone)) {
    alert("Phone number must contain digits only.");
    return;
  }

  console.log("Saving phone:", phone);
}
&lt;/script&gt;
</code></pre>
<p>This version fixes the earlier mistake by introducing a clear validation rule. Before the system accepts the phone number, it checks whether the input contains only digits. The regular expression <code>^\d+$</code> ensures that the value is made up entirely of numbers, with no letters or symbols allowed. If the user enters anything invalid, the function stops and displays an error message instead of saving bad data.</p>
<p>This approach prevents the format error that occurred in the previous example. Instead of blindly trusting whatever the user types, the code now enforces a rule that matches the expected format of a phone number. This is what a responsible developer should do: verify the input before using it.</p>
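<p>The same idea extends to the other formats mentioned earlier. The sketch below shows one possible way to check an email's basic shape and to reject impossible dates such as 32/15/2025. The function names and the deliberately loose email pattern are assumptions for illustration; real email validation usually also involves a confirmation message.</p>

```javascript
// Deliberately loose shape check: something@something.something, no spaces.
const EMAIL_SHAPE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function looksLikeEmail(text) {
  return EMAIL_SHAPE.test(text.trim());
}

// True only when day/month/year name a real calendar date.
// Date.UTC rolls impossible values over (day 32 becomes the next month),
// so a round-trip comparison exposes them.
function isRealDate(day, month, year) {
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

console.log(looksLikeEmail("user@example.com")); // true
console.log(looksLikeEmail("user.example.com")); // false
console.log(isRealDate(32, 15, 2025)); // false
console.log(isRealDate(29, 2, 2024)); // true (leap year)
```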
<h3 id="heading-range-and-limit-errors">Range and Limit Errors</h3>
<p>Allowing users to enter values outside acceptable limits – such as negative ages, quantities below zero, discounts above 100%, or measurements far beyond realistic ranges – enables the ingestion of data that violates business rules. These errors distort analytics, break calculations, and create operational inconsistencies.</p>
<p>To mitigate these errors, you can apply min/max constraints, sliders, steppers, and numeric boundaries to ensure values fall within valid ranges.</p>
<p>Here's a bad example of allowing range and limit errors:</p>
<pre><code class="language-plaintext">&lt;input id="age" type="number"&gt;
&lt;button onclick="submitAge()"&gt;Submit&lt;/button&gt;

&lt;script&gt;
function submitAge() {
  console.log("Age:", document.getElementById("age").value);
}
&lt;/script&gt;
</code></pre>
<p>As seen above, we've created an input field for age but haven't specified any limits or constraints. The browser lets the user type any number, including values that make no sense, such as negative ages, extremely large ages, or decimals. The JavaScript function simply reads the value and logs it without checking whether the age is realistic.</p>
<p>Here's a better version:</p>
<pre><code class="language-plaintext">&lt;input id="age" type="number" min="0" max="120" required&gt;
&lt;button onclick="submitAge()"&gt;Submit&lt;/button&gt;

&lt;script&gt;
function submitAge() {
  const ageInput = document.getElementById("age");
  if (!ageInput.checkValidity()) {
    alert("Age must be between 0 and 120.");
    return;
  }
  console.log("Age:", ageInput.value);
}
&lt;/script&gt;
</code></pre>
<p>Now in this version, the inclusion of the <code>min="0"</code> and <code>max="120"</code> attributes sets clear boundaries for acceptable input values. This ensures that only realistic age values within a defined range are allowed, preventing invalid entries such as negative numbers or excessively large ages.</p>
<p>The JavaScript function further enhances this validation by using the <code>checkValidity()</code> method. This method checks whether the input satisfies all defined constraints, including the required condition and the specified numeric range. If the input doesn't meet these conditions, the function prevents further execution and displays an alert message, informing the user that the entered age must fall within the allowed range.</p>
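<p>Where the browser's built-in constraints aren't available (for example, when the value arrives through an API), a range check can be written by hand. This is a minimal sketch; <code>parseIntInRange</code> is a hypothetical helper name, and the 0 to 120 bounds are taken from the example above.</p>

```javascript
// Hypothetical helper: parses raw input and checks that it is a whole
// number within [min, max]. Returns the number, or null if invalid.
function parseIntInRange(raw, min, max) {
  const text = String(raw).trim();
  if (text === "") return null; // Number("") would otherwise become 0
  const value = Number(text);
  if (!Number.isInteger(value)) return null; // rejects NaN and decimals
  if (value < min || value > max) return null;
  return value;
}

console.log(parseIntInRange("34", 0, 120)); // 34
console.log(parseIntInRange("-5", 0, 120)); // null
console.log(parseIntInRange("12.5", 0, 120)); // null
console.log(parseIntInRange("abc", 0, 120)); // null
```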
<h3 id="heading-logical-consistency-errors">Logical Consistency Errors</h3>
<p>If you allow a user to select an end date before the start date, choose a checkout date earlier than check‑in at a hotel, or enter a delivery date before the order date, this will result in logically impossible data. The same applies when you allow a user to enter a graduation year earlier than their admission to a program, or submit working hours that exceed 24 hours in a day.</p>
<p>You can mitigate this by implementing cross‑field validation, business‑rule checks, and conditional logic that ensures related fields remain consistent.</p>
<p>Here's a bad example of a logical consistency error:</p>
<pre><code class="language-plaintext">&lt;input type="date" id="start"&gt;
&lt;input type="date" id="end"&gt;
&lt;button onclick="save()"&gt;Save&lt;/button&gt;

&lt;script&gt;
function save() {
  console.log({
    start: document.getElementById("start").value,
    end: document.getElementById("end").value
  });
}
&lt;/script&gt;
</code></pre>
<p>In the code above, the core issue is the complete absence of validation. Although the inputs use <code>type="date"</code>, which provides a structured way for users to select dates, the code doesn't enforce that either field is required. This means the user can leave one or both date fields empty, and the <code>save()</code> function will still run and log the values. As a result, the system may end up processing incomplete or meaningless data.</p>
<p>Beyond missing required checks, the code also fails to validate the logical relationship between the two dates. In any scenario involving a start date and an end date, it's expected that the start date shouldn't occur after the end date. But this code performs no such comparison.</p>
<p>This means that the user can select a start date that's later than the end date, and the system will accept it without warning. This leads to inconsistent or impossible data being recorded.</p>
<p>Also, the function simply logs the values without providing any feedback to the user. There's no mechanism to alert the user when a field is empty or when the dates are logically incorrect. This reduces usability and makes it difficult for users to understand or correct their mistakes.</p>
<p>Here's the fixed version:</p>
<pre><code class="language-plaintext">&lt;input type="date" id="start" required&gt;
&lt;input type="date" id="end" required&gt;
&lt;button onclick="save()"&gt;Save&lt;/button&gt;

&lt;script&gt;
function save() {
  const startValue = document.getElementById("start").value;
  const endValue = document.getElementById("end").value;

  // Extra safety: check empties (in case required is bypassed)
  if (!startValue || !endValue) {
    alert("Both start and end dates are required.");
    return;
  }

  const start = new Date(startValue);
  const end = new Date(endValue);

  if (end &lt; start) {
    alert("End date cannot be before start date.");
    return;
  }

  console.log({ start, end });
}
&lt;/script&gt;
</code></pre>
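<p>In this improved version, both date fields include the <code>required</code> attribute to mark them as mandatory. Because the button calls <code>save()</code> directly instead of submitting a form, the browser doesn't enforce <code>required</code> by itself here, so the function also checks explicitly that neither value is empty.</p>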
<p>Second, we've added a logical validation check to ensure that the relationship between the two dates is correct. After retrieving the values, the function converts them into <code>Date</code> objects and compares them to verify that the end date doesn't occur before the start date. If this condition is violated, the function stops execution and displays an alert informing the user of the error.</p>
<p>This prevents inconsistent or impossible date ranges from being accepted.</p>
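<p>The other examples in this section (graduation year versus admission year, hours worked in a day) follow the same pattern. One way to keep such cross‑field checks manageable is to express them as a list of rules, as in this sketch. The field names and messages are assumptions chosen for illustration.</p>

```javascript
// Each rule pairs a message with a check over the whole record.
const rules = [
  {
    message: "End date cannot be before start date.",
    // ISO yyyy-mm-dd strings sort in chronological order.
    isValid: (r) => r.endDate >= r.startDate,
  },
  {
    message: "Graduation year cannot be earlier than admission year.",
    isValid: (r) => r.graduationYear >= r.admissionYear,
  },
  {
    message: "Hours worked in a day must be between 0 and 24.",
    isValid: (r) => r.hoursWorked >= 0 && r.hoursWorked <= 24,
  },
];

// Returns the messages of every rule the record breaks.
function checkConsistency(record) {
  return rules.filter((rule) => !rule.isValid(record)).map((rule) => rule.message);
}

const errors = checkConsistency({
  startDate: "2025-03-10",
  endDate: "2025-03-01",
  admissionYear: 2020,
  graduationYear: 2024,
  hoursWorked: 26,
});
console.log(errors); // the end-date and hours-worked messages
```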
<h3 id="heading-duplicate-and-data-integrity-errors">Duplicate and Data Integrity Errors</h3>
<p>When you let a user submit an email that's already registered, choose a username that's already taken, or enter a duplicate employee ID or student number, this results in identity conflicts and duplicate records. Problems also arise when you allow users to upload unsupported file types, oversized files, or corrupted images.</p>
<p>Security risks can emerge when users are able to enter HTML/script tags (XSS), SQL‑injection patterns, or disallowed special characters. These issues compromise data quality, system integrity, and security.</p>
<p>You can prevent these types of issues by using uniqueness checks, file‑type and size validation, and input sanitization to block duplicates, invalid uploads, and malicious inputs.</p>
<p>Here's an example of a duplicate error:</p>
<pre><code class="language-plaintext">&lt;input id="email" placeholder="Enter email" required&gt;
&lt;button onclick="save()"&gt;Save&lt;/button&gt;

&lt;script&gt;
const savedEmails = [];

function save() {
  const email = document.getElementById("email").value;
  savedEmails.push(email);
  console.log("Saved emails:", savedEmails);
}
&lt;/script&gt;
</code></pre>
<p>This code blindly pushes every email into the <code>savedEmails</code> array without checking whether the email already exists. Because there is no duplicate detection, the user can enter the same email multiple times.</p>
<p>Here is the fixed version:</p>
<pre><code class="language-plaintext">&lt;input id="email" placeholder="Enter email" required&gt;
&lt;button onclick="save()"&gt;Save&lt;/button&gt;

&lt;script&gt;
const savedEmails = [];

function save() {
  const email = document.getElementById("email").value.trim();

  // Check if the field is empty
  if (!email) {
    alert("Please enter an email before saving.");
    return;
  }

  // Check for duplicate
  if (savedEmails.includes(email)) {
    alert("This email has already been saved.");
    return;
  }

  savedEmails.push(email);
  console.log("Saved emails:", savedEmails);
}
&lt;/script&gt;

</code></pre>
<p>In this improved version of the code, we've implemented proper validation steps to prevent duplicate email entries. Before saving the email, the function checks whether the value already exists in the <code>savedEmails</code> array using the <code>includes()</code> method. If the email is found, the function stops execution and displays an alert informing the user that the email has already been saved. This ensures that each email is stored only once, maintaining the uniqueness and integrity of the data.</p>
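<p>One gap worth noting is that <code>includes()</code> compares strings exactly, so <code>Ada@Example.com</code> and <code>ada@example.com</code> would count as different entries. The sketch below normalises the value first and uses a <code>Set</code> for the lookup. It assumes your system treats email addresses as case-insensitive, and the in-memory set is only a stand-in for a real server-side uniqueness check.</p>

```javascript
// In-memory stand-in for a real uniqueness check (assumption: a real
// system would rely on a server-side lookup or a UNIQUE constraint).
const registeredEmails = new Set();

function normalizeEmail(email) {
  return email.trim().toLowerCase();
}

// Returns true if the email was new and has been stored.
function registerEmail(email) {
  const key = normalizeEmail(email);
  if (key === "" || registeredEmails.has(key)) return false;
  registeredEmails.add(key);
  return true;
}

console.log(registerEmail("Ada@Example.com")); // true
console.log(registerEmail(" ada@example.com ")); // false
```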
<h3 id="heading-relational-errors-reference-integrity">Relational Errors (Reference Integrity)</h3>
<p>If you let a user select a city that doesn’t belong to the chosen country, a product ID that no longer exists, a retired SKU, or a shipping method unavailable in the selected region, this can result in broken references.</p>
<p>The same applies when users can select a manager from a different department, choose a fully booked time slot, or act outside their role because permissions aren't set correctly. These errors break relationships between tables and corrupt downstream joins and reports.</p>
<p>Here, you can use dependent dropdowns, real‑time lookups, and foreign‑key validation to help ensure that users can only select valid, existing, and compatible options.</p>
<p>Here's a bad example of a relational error:</p>
<pre><code class="language-plaintext">&lt;select id="country"&gt;
  &lt;option value="uk"&gt;United Kingdom&lt;/option&gt;
  &lt;option value="usa"&gt;United States&lt;/option&gt;
&lt;/select&gt;

&lt;select id="city"&gt;
  &lt;option value="london"&gt;London&lt;/option&gt;
  &lt;option value="manchester"&gt;Manchester&lt;/option&gt;
  &lt;option value="newyork"&gt;New York&lt;/option&gt;
  &lt;option value="losangeles"&gt;Los Angeles&lt;/option&gt;
&lt;/select&gt;

&lt;button onclick="save()"&gt;Save&lt;/button&gt;

&lt;script&gt;
function save() {
  const country = document.getElementById("country").value;
  const city = document.getElementById("city").value;

  console.log("Saving:", { country, city });
}
&lt;/script&gt;
</code></pre>
<p>The mistake in this code is that it treats country and city as completely independent fields, even though one depends on the other. By presenting all cities regardless of the selected country, the interface lets users create combinations that make no sense, such as “United Kingdom” with “New York” or “United States” with “Manchester.”</p>
<p>Also, because the <code>save()</code> function performs no validation and simply logs whatever the user selects, the system ends up accepting and storing relationships that should never exist. This breaks the logical link between the two fields and leads to invalid, inconsistent data that can corrupt downstream joins and reports.</p>
<p>Here's the fixed version:</p>
<pre><code class="language-plaintext">&lt;select id="country" onchange="loadCities()" required&gt;
  &lt;option value=""&gt;Select country&lt;/option&gt;
  &lt;option value="uk"&gt;United Kingdom&lt;/option&gt;
  &lt;option value="usa"&gt;United States&lt;/option&gt;
&lt;/select&gt;

&lt;select id="city" required disabled&gt;
  &lt;option value=""&gt;Select city&lt;/option&gt;
&lt;/select&gt;

&lt;button onclick="save()"&gt;Save&lt;/button&gt;

&lt;script&gt;
const citiesByCountry = {
  uk: ["London", "Manchester"],
  usa: ["New York", "Los Angeles"]
};

function loadCities() {
  const country = document.getElementById("country").value;
  const citySelect = document.getElementById("city");

  // Reset city dropdown
  citySelect.innerHTML = '&lt;option value=""&gt;Select city&lt;/option&gt;';

  // Disable if no country selected
  if (!country) {
    citySelect.disabled = true;
    return;
  }

  // Enable dropdown
  citySelect.disabled = false;

  // Load cities safely
  (citiesByCountry[country] || []).forEach(city =&gt; {
    const option = document.createElement("option");
    option.value = city.toLowerCase().replace(/\s+/g, ""); // remove ALL spaces
    option.textContent = city;
    citySelect.appendChild(option);
  });
}

function save() {
  const country = document.getElementById("country").value;
  const city = document.getElementById("city").value;

  // Required validation
  if (!country || !city) {
    alert("Please select both a country and a city.");
    return;
  }

  // Build list of valid cities for this country
  const validCities = (citiesByCountry[country] || [])
    .map(c =&gt; c.toLowerCase().replace(/\s+/g, ""));

  // Relational validation
  if (!validCities.includes(city)) {
    alert("Selected city does not belong to the chosen country.");
    return;
  }

  console.log("Saving:", { country, city });
}
&lt;/script&gt;
</code></pre>
<p>This improved code turns the country–city form into a controlled, relationship‑aware flow instead of two loose dropdowns.</p>
<p>When the user selects a country, the <code>loadCities()</code> function runs. It first clears the city dropdown and, if no country is selected, keeps the city field disabled so the user can't choose a city on its own.</p>
<p>Once a valid country is chosen, the city dropdown is enabled and populated only with the cities that belong to that specific country, using the <code>citiesByCountry</code> mapping. Also, the city values are normalised (lowercased and stripped of spaces) so they’re consistent and safe to compare.</p>
<p>When the user clicks “Save,” the <code>save()</code> function checks that both a country and a city have been selected. If either is missing, it shows an alert and stops. It then rebuilds the list of valid city values for the chosen country and verifies that the selected city is actually in that list.</p>
<h3 id="heading-structural-errors-dropdowns-radio-buttons-enums">Structural Errors (Dropdowns, Radio Buttons, Enums)</h3>
<p>If users can type a country as “U.S.A”, “USA”, “United States”, or “us”, enter gender as “male”, “Male”, “M”, or “man”, or type a department as “Engineering”, “Eng”, or “engineer”, this can result in inconsistent categorical data.</p>
<p>The same applies to currencies typed as “usd”, “USD”, “US Dollars”, product categories spelled differently, status values like “active”, “Active”, “ACT”, “enabled”, or boolean values like “yes”, “Yes”, “Y”, “1”.</p>
<p>These inconsistencies make analytics, grouping, and reporting unreliable, and the analyst will spend time cleaning and standardizing these values.</p>
<p>You should replace free‑text fields with dropdowns, radio buttons, and enums to enforce standardized categorical values.</p>
<p>Here's a bad example of a structural error:</p>
<pre><code class="language-plaintext">&lt;form id="profile"&gt;
  &lt;label&gt;Country&lt;/label&gt;
  &lt;input type="text" id="country" placeholder="Enter country"&gt;
  &lt;button type="submit"&gt;Save&lt;/button&gt;
&lt;/form&gt;

&lt;script&gt;
document.getElementById("profile").addEventListener("submit", e =&gt; {
  e.preventDefault();
  const country = document.getElementById("country").value;
  console.log("Saving:", country);
});
&lt;/script&gt;
</code></pre>
<p>The problem with this code is that it pretends to save a country value without doing any real validation or enforcing any rules, which makes the form unreliable and prone to bad data.</p>
<p>The form uses a plain text input for “country,” meaning the user can type anything they want — misspellings, random characters, invalid countries, or even leave it blank. Because the input isn’t marked as required and the JavaScript doesn’t check whether the field contains a meaningful value, the form will happily “save” an empty string or nonsense text.</p>
<p>The <code>submit</code> handler prevents the default form submission but does nothing beyond logging whatever the user typed, so the system accepts invalid, incomplete, or malformed data without question. In short, the code collects input but doesn't validate it, doesn't enforce correctness, and doesn't protect the system from bad or unusable values.</p>
<p>Here's the fixed version:</p>
<pre><code class="language-plaintext">&lt;form id="profile"&gt;
  &lt;label&gt;Country&lt;/label&gt;
  &lt;select id="country" required&gt;
    &lt;option value=""&gt;Select country&lt;/option&gt;
    &lt;option value="uk"&gt;United Kingdom&lt;/option&gt;
    &lt;option value="usa"&gt;United States&lt;/option&gt;
    &lt;option value="canada"&gt;Canada&lt;/option&gt;
  &lt;/select&gt;

  &lt;button type="submit"&gt;Save&lt;/button&gt;
&lt;/form&gt;

&lt;script&gt;
document.getElementById("profile").addEventListener("submit", e =&gt; {
  e.preventDefault();

  const country = document.getElementById("country").value;

  // Required validation
  if (!country) {
    alert("Please select a country before saving.");
    return;
  }

  console.log("Saving:", country);
});
&lt;/script&gt;
</code></pre>
<p>The biggest improvement is that we're no longer relying on a free‑text field for the country. By switching to a dropdown, the form now limits the user to a controlled set of valid options. This prevents misspellings, random text, or invalid country names from ever entering the system.</p>
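<p>A dropdown protects new records, but values already collected through a free‑text field still need to be mapped onto the controlled list. Here is a minimal sketch of that clean-up step. The alias table is an assumption based on the variants mentioned earlier and is not exhaustive.</p>

```javascript
// Illustrative alias table: canonical code -> accepted spellings.
const COUNTRY_ALIASES = {
  usa: ["usa", "u.s.a", "u.s.a.", "us", "united states"],
  uk: ["uk", "u.k.", "united kingdom"],
};

// Build a lookup from each alias to its canonical code.
const aliasToCode = new Map();
for (const [code, aliases] of Object.entries(COUNTRY_ALIASES)) {
  for (const alias of aliases) aliasToCode.set(alias, code);
}

// Returns the canonical code, or null when the value is not recognised.
function toCountryCode(raw) {
  return aliasToCode.get(String(raw).trim().toLowerCase()) ?? null;
}

console.log(toCountryCode("U.S.A")); // "usa"
console.log(toCountryCode(" United States ")); // "usa"
console.log(toCountryCode("Atlantis")); // null
```

<p>Returning <code>null</code> for unknown values, rather than guessing, keeps unrecognised entries visible so they can be reviewed by hand.</p>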
<p>These are the main types of data errors you might come across in your work. Now that we've discussed what causes them and some key fixes/preventative measures you can take, let's move on to data quality itself.</p>
<h2 id="heading-what-makes-good-data">What Makes Good Data?</h2>
<p>So what, in fact, is data quality? <a href="https://www.ibm.com/products/tutorials/6-pillars-of-data-quality-and-how-to-improve-your-data">IBM defines it</a> as the degree of accuracy, consistency, completeness, reliability, and relevance of the data collected, stored, and used within an organization or a specific context.</p>
<p>Let's look at each of these features of quality data a bit more closely to understand what they entail.</p>
<h3 id="heading-completeness">Completeness</h3>
<p>Completeness measures how much of the required data is actually present. When large portions of fields are missing, the dataset stops representing reality and any analysis built on it becomes unreliable.</p>
<p>An example would be a sign‑up form that stores users, but half of them are missing an email address. If you run an analysis on “email engagement,” your results will be skewed because a big chunk of users can’t even receive emails. This means that this data is incomplete.</p>
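<p>Completeness can be measured directly. The sketch below computes the share of rows in which a given field is filled in. The function name and the sample data are assumptions for illustration.</p>

```javascript
// Fraction of rows in which `field` has a non-blank value.
function completeness(rows, field) {
  if (rows.length === 0) return 0;
  const filled = rows.filter((row) => {
    const value = row[field];
    return value !== undefined && value !== null && String(value).trim() !== "";
  }).length;
  return filled / rows.length;
}

const users = [
  { id: 1, email: "a@example.com" },
  { id: 2, email: "" },
  { id: 3, email: null },
  { id: 4, email: "d@example.com" },
];
console.log(completeness(users, "email")); // 0.5
```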
<h3 id="heading-uniqueness">Uniqueness</h3>
<p>Uniqueness checks whether each real‑world entity appears only once in the dataset. Duplicate records inflate counts, break joins, and distort metrics.</p>
<p>An example would be a customer table containing two rows for the same person with the same customer ID. When calculating “active customers,” the system counts them twice, inflating revenue projections.</p>
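<p>A quick way to test for this is to count how often each key appears. The following sketch returns the key values that occur more than once. The function name and sample rows are hypothetical.</p>

```javascript
// Returns the key values that appear more than once.
function findDuplicates(rows, key) {
  const counts = new Map();
  for (const row of rows) {
    counts.set(row[key], (counts.get(row[key]) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([value]) => value);
}

const customers = [
  { customerId: 101, name: "Ada" },
  { customerId: 102, name: "Grace" },
  { customerId: 101, name: "Ada L." },
];
console.log(findDuplicates(customers, "customerId")); // [101]
```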
<h3 id="heading-validity">Validity</h3>
<p>Validity evaluates whether data follows the expected format, type, or business rules. This includes correct data types, allowed ranges, and patterns defined by the system.</p>
<p>An example would be a field meant to store dates that contains values like “32/99/2025” or “tomorrow.” These invalid entries break downstream ETL jobs that expect a proper date format.</p>
<h3 id="heading-timeliness">Timeliness</h3>
<p>Timeliness reflects whether data is available when it’s needed. Even accurate data becomes useless if it arrives too late for the process that depends on it. For example, after a customer places an order, the system should generate an order ID instantly.</p>
<h3 id="heading-accuracy">Accuracy</h3>
<p>Accuracy measures how closely data matches the real‑world truth. When multiple systems report the same metric, one must be designated as the authoritative source to avoid conflicting values.</p>
<h3 id="heading-consistency">Consistency</h3>
<p>Consistency checks whether data aligns across different datasets or within related fields. If two systems describe the same concept, their values shouldn't contradict each other.</p>
<p>For example, a company’s HR system reports 50 employees in Engineering, but the payroll system lists only 42. Since both describe the same group, the mismatch signals a data quality issue.</p>
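<p>A reconciliation check like this can be automated. The sketch below compares headcounts from two sources and lists the departments where they disagree. The function name and the shape of the inputs (an object mapping department to count) are assumptions.</p>

```javascript
// Returns departments whose headcount differs between the two sources.
function findMismatches(hrCounts, payrollCounts) {
  const departments = new Set([
    ...Object.keys(hrCounts),
    ...Object.keys(payrollCounts),
  ]);
  const mismatches = [];
  for (const dept of departments) {
    const hr = hrCounts[dept] ?? 0; // a missing department counts as 0
    const payroll = payrollCounts[dept] ?? 0;
    if (hr !== payroll) mismatches.push({ dept, hr, payroll });
  }
  return mismatches;
}

const report = findMismatches(
  { Engineering: 50, Sales: 12 },
  { Engineering: 42, Sales: 12 }
);
console.log(report); // [{ dept: "Engineering", hr: 50, payroll: 42 }]
```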
<h3 id="heading-fitness-for-purpose">Fitness for Purpose</h3>
<p>Fitness for purpose assesses whether the data is suitable for the specific business task at hand. Even complete, accurate, and timely data may be unhelpful if it doesn’t answer the intended question.</p>
<p>A dataset of website clicks might be perfect for analysing user engagement, for example, but it’s useless for forecasting revenue because it contains no purchase or pricing information.</p>
<h2 id="heading-data-validation-layers">Data Validation Layers</h2>
<p>Now that we've highlighted the characteristics that ensure quality data, it's important to discuss the layers of data validation.</p>
<p>There are five layers you'll need to check to enforce data quality.</p>
<h3 id="heading-frontend-layer-protect-the-user-not-the-system">Frontend Layer — “Protect the User, Not the System”</h3>
<p>Frontend validation plays an important role in enhancing the user experience – but it doesn't provide real protection for a system.</p>
<p>Since frontend logic operates within the user’s environment, we can't trust it as a mechanism for enforcing data quality. Any code executed in the browser is ultimately under the user’s control, meaning it can be disabled, modified, intercepted, or bypassed entirely.</p>
<p>For instance, a user can simply open browser developer tools, remove validation rules, and submit invalid or malicious data without restriction.</p>
<p>Frontend validation also can't reliably enforce complex business rules. Constraints such as ensuring that a discounted price is lower than the original price, validating that a start date precedes an end date, preventing stock levels from becoming negative, or confirming that a product belongs to a valid category within the database require deeper system-level checks against data the browser doesn't control.</p>
<p>At the frontend level, you typically validate required fields, email format, password strength, address fields, and payment input format.</p>
<p>So frontend validation doesn't guarantee data quality or security, as it can be bypassed through API tools (like Postman), disabled JavaScript, malicious bots, and third-party integrations.</p>
<p>Because of this, it's best to treat the frontend as a usability layer, not a trust layer.</p>
<h3 id="heading-backend-validation-the-real-gatekeeper">Backend Validation — “The Real Gatekeeper”</h3>
<p>You can only guarantee true data quality and system integrity at the backend and database layers.</p>
<p>The backend is responsible for enforcing request validation, implementing business logic, and managing authentication and authorization.</p>
<p>If validation fails here, invalid data is rejected before it can propagate. Without this layer, data corruption begins at ingestion.</p>
<p>For example:</p>
<pre><code class="language-plaintext">$request-&gt;validate([
   'name' =&gt; 'required|string|max:255',
   'price' =&gt; 'required|numeric|min:0',
   'stock' =&gt; 'required|integer|min:0',
   'category_id' =&gt; 'required|exists:categories,id',
]);
</code></pre>
<p>The code snippet above demonstrates how you can use request validation in Laravel to ensure that incoming data meets specific requirements before it's processed or stored in the database. This is an essential practice in web development, as it helps maintain data integrity, prevents errors, and enhances application security.</p>
<p>In this example, we're using the <code>$request-&gt;validate()</code> method to define a set of validation rules for four input fields: <code>name</code>, <code>price</code>, <code>stock</code>, and <code>category_id</code>. Each field is assigned a series of constraints that the incoming data must satisfy.</p>
<p>The name field is marked as required, meaning it must be included in the request and can't be empty. It must also be a string, ensuring that only textual data is accepted, and it's limited to a maximum length of 255 characters using <code>max:255</code>. This prevents excessively long inputs that could potentially cause issues in the database or user interface.</p>
<p>Similarly, the price field is required and must be numeric, allowing only numbers such as integers or decimal values. The rule <code>min:0</code> ensures that the price can't be negative, which is logically consistent for most product pricing scenarios.</p>
<p>The stock field is also required and must be an integer, meaning it can only accept whole numbers. This is appropriate for counting physical items. Like the price field, it includes a <code>min:0</code> rule to prevent negative stock values, which would not make sense in an inventory system.</p>
<p>Finally, the category_id field is validated to ensure it is both present and valid. The <code>required</code> rule ensures that a category is selected, while the <code>exists:categories,id</code> rule checks that the provided value corresponds to an existing id in the categories database table. This prevents invalid or non-existent category references, thereby preserving relational integrity within the database.</p>
<p>This layer validates null values, data types and formats, allowed ranges, and referential integrity (exists).</p>
<h3 id="heading-database-layer-protect-the-data-at-rest">Database Layer — “Protect the Data at Rest”</h3>
<p>Validation at the application level is insufficient on its own. You'll also need to enforce database-level constraints like NOT NULL constraints, UNIQUE constraints (email, SKU, order number), foreign keys (orders.user_id → users.id), and check constraints (for example, price &gt;= 0).</p>
<p>This layer is critical because application bugs may bypass validation, background jobs and imports may skip controllers, and malicious actors may attempt direct access.</p>
<p>The database layer is the final line of defense: its constraints enforce structural integrity even when application code is buggy or bypassed entirely.</p>
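<p>To see database-level enforcement in action, here is a minimal sketch that uses Python's built-in <code>sqlite3</code> module rather than the Laravel and MySQL stack used elsewhere in this article. The <code>categories</code> and <code>products</code> tables and the <code>try_insert</code> helper are illustrative assumptions, and constraint syntax varies slightly between database engines.</p>
<pre><code class="language-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical schema: every rule lives in the database itself
conn.executescript("""
CREATE TABLE categories (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE products (
    id          INTEGER PRIMARY KEY,
    sku         TEXT NOT NULL UNIQUE,
    price       REAL NOT NULL CHECK (price &gt;= 0),
    category_id INTEGER NOT NULL REFERENCES categories(id)
);
INSERT INTO categories (id, name) VALUES (1, 'Books');
""")


def try_insert(sku, price, category_id):
    """Attempt an insert that skips all application-level validation."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO products (sku, price, category_id) VALUES (?, ?, ?)",
                (sku, price, category_id),
            )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


print(try_insert("BOOK-1", 12.50, 1))  # valid row
print(try_insert("BOOK-1", 9.99, 1))   # duplicate SKU violates UNIQUE
print(try_insert("BOOK-2", -5, 1))     # negative price violates CHECK
print(try_insert("BOOK-3", 5, 999))    # unknown category violates the foreign key
print(try_insert(None, 5, 1))          # missing SKU violates NOT NULL
</code></pre>
<p>Only the first insert succeeds. The other four are rejected by the database even though no application validation ran, which is the situation a buggy import script or background job would create.</p>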
<h3 id="heading-service-layer-business-logic-validate-real-world-rules">Service Layer / Business Logic — “Validate Real-World Rules”</h3>
<p>The service layer is where the application stops asking “Is this data shaped correctly?” and starts asking “Is this allowed to happen in the real world?”</p>
<p>This layer enforces domain‑specific rules that can't be captured by simple request validation or database constraints. These rules reflect business truth, not structural correctness.</p>
<p><strong>Example:</strong></p>
<pre><code class="language-plaintext">if ($product-&gt;stock &lt; $quantity) {
   throw new OutOfStockException();
}
</code></pre>
<p>This prevents overselling and ensures the system reflects physical reality.</p>
<pre><code class="language-plaintext">if ($cartTotal !== $calculatedTotal) {
   throw new PriceMismatchException();
}
</code></pre>
<p>This protects revenue and prevents tampering.</p>
<p>In this layer, you enforce real‑world business rules by ensuring inventory correctness, recalculating totals, applying discount logic, and checking user‑specific limits.</p>
<h3 id="heading-jobs-queues-data-ingestion-validate-external-data">Jobs / Queues / Data Ingestion — “Validate External Data”</h3>
<p>When importing or processing external data (for example, supplier feeds), validation must occur before processing. You'll need to check that the data conforms to the expected schema, that the required columns are present with the correct data types, that any JSON is structurally valid, and that duplicate batches are detected.</p>
<p>This is because external data sources are a major source of data quality issues. Without validation here, corrupted data can silently enter the system at scale.</p>
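<p>As a rough illustration of such a pre-processing gate, the sketch below checks a JSON supplier feed in plain Python. The column contract in <code>REQUIRED_COLUMNS</code>, the <code>validate_batch</code> function, and the in-memory <code>seen_batches</code> set are all hypothetical. A real system would keep processed-batch hashes in persistent storage.</p>
<pre><code class="language-python">import hashlib
import json

# Hypothetical contract for a supplier feed: column name -&gt; accepted types
REQUIRED_COLUMNS = {"sku": (str,), "price": (int, float), "stock": (int,)}

seen_batches = set()  # stand-in for a persistent store of processed batch hashes


def validate_batch(raw_payload):
    """Return a list of problems; an empty list means the batch may be processed."""
    try:
        rows = json.loads(raw_payload)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(rows, list):
        return ["expected a JSON array of rows"]

    # Identical payloads hash to the same digest, which exposes re-sent batches
    digest = hashlib.sha256(raw_payload.encode("utf-8")).hexdigest()
    if digest in seen_batches:
        return ["duplicate batch: this payload was already processed"]

    problems = []
    for index, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append(f"row {index}: expected an object")
            continue
        for column, accepted in REQUIRED_COLUMNS.items():
            if column not in row:
                problems.append(f"row {index}: missing column '{column}'")
            elif isinstance(row[column], bool) or not isinstance(row[column], accepted):
                problems.append(f"row {index}: '{column}' has the wrong type")

    if not problems:
        seen_batches.add(digest)  # remember only batches that passed
    return problems


good = '[{"sku": "A1", "price": 9.5, "stock": 3}]'
bad = '[{"sku": "A2", "price": "cheap"}]'

print(validate_batch(good))  # first delivery passes
print(validate_batch(good))  # identical payload is flagged as a duplicate
print(validate_batch(bad))   # wrong type for price, missing stock column
</code></pre>
<p>Returning a list of problems instead of raising on the first one lets a job log everything that is wrong with a batch before rejecting it, which makes conversations with the data supplier much shorter.</p>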
<p>Now that we've discussed the layers of a modern application stack, it should be clear that data quality isn't something you “check once” at the UI.</p>
<p>It must be enforced repeatedly, at multiple depths of the system. Each layer catches a different class of defects, and together they form a defensive wall that prevents bad data from ever reaching storage, analytics, or downstream consumers.</p>
<h2 id="heading-testing-strategies-to-protect-data-quality">Testing Strategies to Protect Data Quality</h2>
<p>To wrap up, here are the three foundational testing strategies every developer should apply to protect data quality.</p>
<h3 id="heading-unit-testing">Unit Testing</h3>
<p>Unit tests are the first line of defense in data quality. In this context, a “unit” refers to a single column, a single transformation, or a single validation rule.</p>
<p>The purpose is straightforward: verify that the smallest building blocks of your data logic behave exactly as intended. This matters because if these low‑level rules are not tested and validated, incorrect or inconsistent data will flow into the database and contaminate everything built on top of it.</p>
<p>By isolating each rule or transformation, you can guarantee that schema constraints, field‑level assumptions, and low‑level logic remain correct before data ever flows into larger pipelines or business processes.</p>
<p>Typical questions answered at this layer include:</p>
<ol>
<li><p>Does this column allow nulls?</p>
</li>
<li><p>Does this regex correctly strip whitespace from email strings?</p>
</li>
<li><p>Does this transformation produce the expected output for a single row?</p>
</li>
</ol>
<p>This is where you can verify that the data contract is sound. If a column must be non‑null, unique, or follow a specific pattern, the unit test enforces it. When these rules fail here, they fail cheaply – before they can corrupt a table or mislead a dashboard.</p>
<p>To make this concrete, here’s what a unit test looks like in a real codebase. Even though this example comes from Laravel, the testing principle is identical to data‑quality unit tests: one rule, one expectation, isolated from everything else.</p>
<h4 id="heading-example-testing-a-discount-calculation-rule">Example: Testing a Discount Calculation Rule</h4>
<p>Imagine your e‑commerce shop has this rule:</p>
<ul>
<li><p>If a product costs more than £100, apply a 10% discount.</p>
</li>
<li><p>Otherwise, apply no discount.</p>
</li>
</ul>
<p>Let's say this is your discount logic:</p>
<pre><code class="language-plaintext">&lt;?php

namespace App\Services;

class DiscountService
{
    public function calculate(float $price): float
    {
        if ($price &gt; 100) {
            return $price * 0.10; // 10% discount
        }

        return 0;
    }
}
</code></pre>
<p>The unit test for this logic will be:</p>
<pre><code class="language-plaintext">&lt;?php

namespace Tests\Unit;

use Tests\TestCase;
use App\Services\DiscountService;

class DiscountServiceTest extends TestCase
{
    /** @test */
    public function it_applies_10_percent_discount_when_price_is_above_100()
    {
        $service = new DiscountService();

        $discount = $service-&gt;calculate(200);

        $this-&gt;assertEquals(20, $discount);
    }

    /** @test */
    public function it_applies_no_discount_when_price_is_100_or_below()
    {
        $service = new DiscountService();

        $discount = $service-&gt;calculate(100);

        $this-&gt;assertEquals(0, $discount);
    }
}
</code></pre>
<p>The <code>DiscountService</code> contains a simple rule: if a price is greater than 100, a 10% discount is applied. Otherwise, no discount is applied. The unit test verifies this rule in isolation, without involving controllers, databases, or HTTP requests. By testing the service directly, the developer ensures that the core calculation behaves exactly as intended.</p>
<p>The first test checks the positive case — a price of 200 should produce a discount of 20. The second test checks the boundary condition — a price of 100 should produce no discount. Together, these tests confirm both sides of the rule and protect against regressions if the logic changes in the future.</p>
<p>Because this is a Laravel example, you can run both unit tests (which verify isolated logic) and feature tests (which verify full application behaviour) with <code>php artisan test</code>. The command runs your suite in the <code>testing</code> environment, so as long as that environment is configured with its own database, your real data remains unaffected.</p>
<h3 id="heading-integration-testing-the-flow-amp-lineage-check">Integration Testing: The Flow &amp; Lineage Check</h3>
<p>While unit tests validate the correctness of individual rules, integration tests validate the movement of data across components. Integration testing verifies that multiple layers work together as a single data flow.</p>
<p>In the example below, the controller receives an order, calls the discount service, applies the transformation, and persists the result to the database. That interaction across layers is what elevates this from a unit test to an integration test. This is where you test the real‑world flow:</p>
<ol>
<li><p>Controller → Service → Repository → MySQL</p>
</li>
<li><p>Check if MySQL migrations run correctly</p>
</li>
<li><p>Check foreign keys enforce relationships</p>
</li>
<li><p>Check to ensure services interact with the database as expected</p>
</li>
<li><p>Check to ensure models and repositories behave consistently</p>
</li>
</ol>
<p>Integration tests reveal issues that only appear when components interact: incorrect joins, broken migrations, mismatched field names, or subtle type mismatches that unit tests cannot detect.</p>
<p>This is the layer where you catch the bugs that would otherwise silently corrupt data lineage.</p>
<p><strong>Here's an example:</strong></p>
<pre><code class="language-plaintext">&lt;?php

namespace Tests\Feature;

use Tests\TestCase;
use App\Models\Order;
use Illuminate\Foundation\Testing\RefreshDatabase;

class ApplyDiscountTest extends TestCase
{
    use RefreshDatabase;

    /** @test */
    public function check_it_persists_the_correct_discounted_total_to_the_database()
    {
        $order = Order::factory()-&gt;create(['subtotal' =&gt; 150]);

        $response = $this-&gt;postJson("/orders/{$order-&gt;id}/apply-discount");

        $response-&gt;assertStatus(200);

        $this-&gt;assertDatabaseHas('orders', [
            'id' =&gt; $order-&gt;id,
            'grand_total' =&gt; 135, // 150 - 10% discount
            'discount_total' =&gt; 15
        ]);
    }
}
</code></pre>
<p>This represents a full flow rather than a single rule:</p>
<ul>
<li><p>Controller → Service</p>
</li>
<li><p>Service → Calculation</p>
</li>
<li><p>Controller → Database write</p>
</li>
<li><p>Database → Final state</p>
</li>
</ul>
<p>This test begins by creating an order using an Eloquent factory. It immediately steps beyond the boundaries of a unit test, since it interacts with the database and relies on Laravel’s model layer to persist real data.</p>
<p>From there, the test sends an actual HTTP POST request to the <code>/orders/{id}/apply-discount</code> endpoint, which means it's not calling a method directly, but instead it's traveling through Laravel’s routing layer, invoking the controller responsible for handling the request, and triggering whatever business logic is responsible for calculating and applying the discount.</p>
<p>This movement through multiple layers (routing, controller, service logic, and model persistence) is precisely what defines integration testing: the goal is to verify that these components work together correctly as a system.</p>
<p>Once the request is processed, the test asserts that the response returns a successful status code, which confirms that the HTTP layer behaved as expected.</p>
<p>But the most important part comes afterward, when the test checks the database to ensure that the correct <code>grand_total</code> and <code>discount_total</code> were saved. This final assertion proves that the discount logic was executed, the model was updated, and the changes were successfully written to the database.</p>
<p>In other words, the test isn't merely checking whether a calculation is correct. It's also checking whether the entire pipeline –&nbsp;from receiving the request to updating the database –&nbsp;functions as a coherent whole.</p>
<h3 id="heading-functional-testing-the-business-rule-check">Functional Testing: The Business Rule Check</h3>
<p>Functional tests validate the entire user experience, from the moment a request enters the system to the moment a response is returned. This includes:</p>
<ul>
<li><p>HTTP requests</p>
</li>
<li><p>Controller logic</p>
</li>
<li><p>Validation rules</p>
</li>
<li><p>Service operations</p>
</li>
<li><p>Database writes</p>
</li>
<li><p>Redirects or rendered views</p>
</li>
</ul>
<p>This is where you test the business rules that govern real‑world behaviour:</p>
<p>“A student can't register for two exams at the same time.”</p>
<p>“A cart can't have negative quantities.”</p>
<p>“A user can't update their profile without a valid email.”</p>
<p>Functional tests ensure that the system behaves correctly from the perspective of the user and the business, not just the code.</p>
<h4 id="heading-heres-an-example-functional-test">Example: Functional Test</h4>
<pre><code class="language-plaintext">&lt;?php

namespace Tests\Feature;

use Tests\TestCase;
use App\Models\Product;
use Illuminate\Foundation\Testing\RefreshDatabase;

class CartQuantityFunctionalTest extends TestCase
{
    use RefreshDatabase;

    /** @test */
    public function a_user_cannot_set_a_negative_cart_quantity()
    {
        // Arrange: create a product
        $product = Product::factory()-&gt;create(['price' =&gt; 40]);

        // Simulate existing cart
        $this-&gt;withSession([
            'cart' =&gt; [
                $product-&gt;id =&gt; ['quantity' =&gt; 2]
            ]
        ]);

        // Act: user tries to update quantity to a negative number
        $response = $this-&gt;post('/cart/update', [
            'product_id' =&gt; $product-&gt;id,
            'quantity' =&gt; -5
        ]);

        // Assert: system rejects invalid business behaviour
        $response-&gt;assertStatus(302); // redirect back with errors
        $response-&gt;assertSessionHasErrors(['quantity']);

        // Assert: cart remains unchanged (business rule preserved)
        $this-&gt;assertEquals(2, session('cart')[$product-&gt;id]['quantity']);
    }
}
</code></pre>
<p>The test begins by creating a realistic environment in which a user interacts with a shopping cart. This is essential for understanding the behaviour the system is meant to enforce.</p>
<p>First, it generates a real product in the database using a factory, giving the product a price so that it resembles an item a customer might genuinely add to their cart.</p>
<p>Once the product exists, the test manually seeds the session with a cart containing that product and a quantity of two. This simulates a user who has already added the item to their cart in a previous interaction, and it establishes the baseline state the system must preserve if the user attempts an invalid update.</p>
<p>With the environment prepared, the test then imitates a user action by sending a POST request to the <code>/cart/update</code> endpoint. Instead of calling a method directly, it uses Laravel’s HTTP layer to reproduce the exact behaviour of a browser submitting a form. The request includes the product ID and a deliberately invalid quantity of negative five.</p>
<p>This is the heart of the scenario: the user is attempting something that violates the business rules of the application, and the test is designed to confirm that the system responds appropriately.</p>
<p>Now, when the request is processed, the test expects the application to reject the input, redirect the user back, and attach validation errors to the session. The assertion that the response has a 302 status code and contains validation errors confirms that the validation layer is functioning correctly and that the controller is enforcing the rule that quantities can't be negative.</p>
<p>The final part of the test is where the business rule is truly verified. After the failed update attempt, the test inspects the session to ensure that the cart remains unchanged. This is crucial because rejecting invalid input is only half of the requirement: the system must also protect the integrity of the existing cart data.</p>
<p>Functional tests answer questions like:</p>
<ul>
<li><p>Does the system prevent invalid real‑world behaviour?</p>
</li>
<li><p>Does the user get the correct feedback?</p>
</li>
<li><p>Does the data remain consistent after the request?</p>
</li>
<li><p>Does the final output match the business expectation?</p>
</li>
</ul>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Data quality is never the result of a single check or a single team. It emerges from a disciplined, layered approach where each testing level catches a different category of defects.</p>
<p>Unit tests safeguard the smallest rules, integration tests validate the flow of data across components, and functional tests enforce the business logic that governs real‑world behaviour.</p>
<p>When these layers operate together, bad data has nowhere to hide. When they don’t, even a minor oversight can slip through the cracks and escalate into a costly downstream failure.</p>
<p>So as you can see, your role in data quality is fundamentally proactive, not reactive. By designing systems with validation, integrity, and monitoring in mind, you ensure that data flowing through the pipeline is accurate, timely, complete, unique, and fit for purpose – supporting reliable analytics, reporting, and intelligent systems.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an End-to-End ML Platform Locally: From Experiment Tracking to CI/CD ]]>
                </title>
                <description>
                    <![CDATA[ Machine learning projects don’t end at training a model in a Jupyter notebook. The hard part is the “last mile”: turning that notebook model into something you can run reliably, update safely, and tru ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-end-to-end-ml-platform-locally-from-experiment-tracking-to-cicd/</link>
                <guid isPermaLink="false">69b9bab4c22d3eeb8afd5284</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Devops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Platform Engineering  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ FastAPI ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sandeep Bharadwaj Mannapur ]]>
                </dc:creator>
                <pubDate>Tue, 17 Mar 2026 20:33:56 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/8401d978-0bed-4534-af93-f6bfc1b77c89.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Machine learning projects don’t end at training a model in a Jupyter notebook. The hard part is the “last mile”: turning that notebook model into something you can run reliably, update safely, and trust over time.</p>
<p>Most ML systems fail in production for boring (and painful) reasons: the training code and the serving code drift apart, input data changes shape, a “small” preprocessing tweak breaks predictions, or the model silently degrades because real-world behavior shifts. None of these problems are solved by a better algorithm; they’re solved by engineering: repeatable pipelines, validation, versioning, monitoring, and automated checks.</p>
<p>In this hands-on handbook, you’ll build a complete mini ML platform on your local machine, an end-to-end project that takes a model from training to deployment with the core “last mile” infrastructure in place. We’ll use a fraud detection example (predicting fraudulent transactions), but the same workflow works for churn prediction or any binary classification problem. Everything runs locally (no cloud required), and every step is copy-paste runnable so you can follow along and verify outputs as you go.</p>
<p>By the end, you'll have a production-ready ML pipeline running on your machine – from training the model to serving predictions – with the infrastructure to test, monitor, and iterate with confidence. Let's dive in!</p>
<p>📦 <strong>Get the Complete Code</strong><br>All code from this handbook is available in a ready-to-run repository:<br><strong>Repository:</strong> <a href="https://github.com/sandeepmb/freecodecamp-local-ml-platform">https://github.com/sandeepmb/freecodecamp-local-ml-platform</a><br>Clone it and follow along, or use it as a reference implementation.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a href="#heading-project-overview-and-setup">Project Overview and Setup</a></p>
</li>
<li><p><a href="#heading-1-build-a-simple-model-and-api-the-naive-approach">Build a Simple Model and API (The Naive Approach)</a></p>
<ul>
<li><p><a href="#heading-11-train-a-quick-model">Train a Quick Model</a></p>
</li>
<li><p><a href="#heading-12-serve-predictions-with-fastapi">Serve Predictions with FastAPI</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-2-where-the-naive-approach-breaks">Where the Naive Approach Breaks</a></p>
<ul>
<li><p><a href="#heading-problem-1-no-experiment-tracking-reproducibility">Problem 1: No Experiment Tracking (Reproducibility)</a></p>
</li>
<li><p><a href="#heading-problem-2-model-versioning-and-deployment-chaos">Problem 2: Model Versioning and Deployment Chaos</a></p>
</li>
<li><p><a href="#heading-problem-3-no-data-validation-garbage-in-garbage-out">Problem 3: No Data Validation – Garbage In, Garbage Out</a></p>
</li>
<li><p><a href="#heading-problem-4-model-drift-performance-decay-over-time">Problem 4: Model Drift – Performance Decay Over Time</a></p>
</li>
<li><p><a href="#heading-problem-5-no-ci-cd-or-deployment-safety">Problem 5: No CI/CD or Deployment Safety</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-3-add-experiment-tracking-and-model-registry-with-mlflow">Add Experiment Tracking and Model Registry with MLflow</a></p>
<ul>
<li><p><a href="#heading-31-how-to-set-up-the-mlflow-tracking-server">How to Set Up the MLflow Tracking Server</a></p>
</li>
<li><p><a href="#heading-32-how-to-log-experiments-in-code">How to Log Experiments in Code</a></p>
</li>
<li><p><a href="#heading-33-how-to-use-the-model-registry">How to Use the Model Registry</a></p>
</li>
<li><p><a href="#heading-34-update-api-to-load-from-registry">Update API to Load from Registry</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-4-ensure-feature-consistency-with-feast">Ensure Feature Consistency with Feast</a></p>
<ul>
<li><p><a href="#heading-41-what-is-feast-and-why-use-it">What Is Feast and Why Use It?</a></p>
</li>
<li><p><a href="#heading-42-install-and-initialize-feast">Install and Initialize Feast</a></p>
</li>
<li><p><a href="#heading-43-define-feature-definitions">Define Feature Definitions</a></p>
</li>
<li><p><a href="#heading-44-materialize-features-to-the-online-store">Materialize Features to the Online Store</a></p>
</li>
<li><p><a href="#heading-45-retrieve-features-for-training-and-serving">Retrieve Features for Training and Serving</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-5-add-data-validation-with-great-expectations">Add Data Validation with Great Expectations</a></p>
<ul>
<li><p><a href="#heading-51-define-expectations">Define Expectations</a></p>
</li>
<li><p><a href="#heading-52-integrate-validation-into-fastapi">Integrate Validation into FastAPI</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-6-monitor-model-performance-and-data-drift">Monitor Model Performance and Data Drift</a></p>
<ul>
<li><p><a href="#heading-61-the-four-pillars-of-ml-observability">The Four Pillars of ML Observability</a></p>
</li>
<li><p><a href="#heading-62-build-a-drift-monitor-with-evidently">Build a Drift Monitor with Evidently</a></p>
</li>
<li><p><a href="#heading-63-production-monitoring-strategy">Production Monitoring Strategy</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-7-automate-testing-and-deployment-with-ci-cd">Automate Testing and Deployment with CI/CD</a></p>
<ul>
<li><p><a href="#heading-71-write-tests-for-data-and-model">Write Tests for Data and Model</a></p>
</li>
<li><p><a href="#heading-72-github-actions-workflow">GitHub Actions Workflow</a></p>
</li>
<li><p><a href="#heading-73-dockerize-the-application">Dockerize the Application</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-8-incident-response-playbook">Incident Response Playbook</a></p>
<ul>
<li><p><a href="#heading-scenario-false-positive-spike">Scenario: False Positive Spike</a></p>
</li>
<li><p><a href="#heading-scenario-gradual-performance-decay">Scenario: Gradual Performance Decay</a></p>
</li>
<li><p><a href="#heading-scenario-upstream-data-schema-change">Scenario: Upstream Data Schema Change</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-9-how-to-put-it-all-together">How to Put It All Together</a></p>
</li>
<li><p><a href="#heading-10-whats-next-scale-to-production">What’s Next: Scale to Production</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-references">References</a></p>
</li>
</ol>
<h2 id="heading-project-overview-and-setup"><strong>Project Overview and Setup</strong></h2>
<p>Before we jump into coding, let's set the stage. Our use case is <strong>credit card fraud detection</strong> – a binary classification problem where we predict whether a transaction is fraudulent (<code>is_fraud = 1</code>) or legitimate (<code>is_fraud = 0</code>). This is a common ML task and a good proxy for production ML challenges because fraud patterns can change over time (allowing us to discuss model drift), and bad input data (for example, malformed transaction info) can cause serious issues if not handled properly.</p>
<h3 id="heading-tech-stack"><strong>Tech Stack</strong></h3>
<p>We will use Python-based tools that are popular in MLOps but still beginner-friendly:</p>
<table>
<thead>
<tr>
<th><strong>Tool</strong></th>
<th><strong>Purpose</strong></th>
<th><strong>Why We Chose It</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>MLflow</strong></td>
<td>Experiment tracking and model registry</td>
<td>Open-source, widely adopted, great UI</td>
</tr>
<tr>
<td><strong>Feast</strong></td>
<td>Feature store for consistent feature serving</td>
<td>Production-grade, runs locally, same API for offline/online</td>
</tr>
<tr>
<td><strong>FastAPI</strong></td>
<td>High-performance web framework for serving predictions</td>
<td>Fast, automatic docs, modern Python</td>
</tr>
<tr>
<td><strong>Great Expectations</strong></td>
<td>Data validation framework</td>
<td>Declarative expectations, great reports</td>
</tr>
<tr>
<td><strong>Evidently</strong></td>
<td>Monitoring for data drift and model decay</td>
<td>Beautiful reports, easy to integrate</td>
</tr>
<tr>
<td><strong>Docker</strong></td>
<td>Containerization for environment consistency</td>
<td>Industry standard, works everywhere</td>
</tr>
<tr>
<td><strong>GitHub Actions</strong></td>
<td>CI/CD automation</td>
<td>Free for public repos, tight GitHub integration</td>
</tr>
</tbody></table>
<p>Let me explain each tool briefly:</p>
<p><strong>MLflow</strong> is an open-source platform designed to manage the ML lifecycle. It provides experiment tracking (logging parameters, metrics, and artifacts), a model registry (versioning models with aliases), and model serving capabilities. We'll use it to ensure our experiments are reproducible and our models are versioned.</p>
<p><strong>Feast</strong> (Feature Store) is an open-source feature store that helps manage and serve features consistently between training and inference. This prevents a common problem called "training-serving skew" where the features used in production differ slightly from those used in training, causing silent accuracy degradation.</p>
<p><strong>FastAPI</strong> is a modern, fast web framework for building APIs with Python. It's known for being easy to use, efficient, and producing automatic interactive documentation. We'll use it to serve our model predictions.</p>
<p><strong>Great Expectations</strong> is an open-source tool for data quality testing. It allows us to define "expectations" on data (like "amount should be positive" or "hour should be between 0 and 23") and test incoming data against them.</p>
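<p>The idea behind an expectation can be illustrated without the library. The sketch below is plain Python, not the Great Expectations API: the <code>EXPECTATIONS</code> list and the <code>failed_expectations</code> helper are hypothetical names used only to show the declarative style of pairing a readable rule with a check.</p>
<pre><code class="language-python"># Plain-Python illustration of "expectations"; NOT the Great Expectations API.
# Each expectation pairs a human-readable description with a check on one transaction.
EXPECTATIONS = [
    ("amount should be positive", lambda tx: tx["amount"] &gt; 0),
    ("hour should be between 0 and 23", lambda tx: tx["hour"] in range(24)),
]


def failed_expectations(transaction):
    """Return the descriptions of every expectation the transaction violates."""
    return [description for description, check in EXPECTATIONS if not check(transaction)]


print(failed_expectations({"amount": 42.0, "hour": 14}))  # no violations
print(failed_expectations({"amount": -3.0, "hour": 27}))  # both rules violated
</code></pre>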
<p><strong>Evidently</strong> is an open-source library for monitoring data and model performance over time. It can detect data drift (when input distributions change) and model decay (when accuracy drops).</p>
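<p>To build some intuition for data drift, here is a deliberately simple sketch that measures how far the mean of a feature has moved from its training-time value. This is not how Evidently works internally and it doesn't use the Evidently API. The <code>mean_shift</code> function, the sample values, and the threshold of 3.0 are all assumptions chosen for illustration.</p>
<pre><code class="language-python">import statistics


def mean_shift(reference, current):
    """Shift of the current mean, measured in reference standard deviations."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(current) - ref_mean) / ref_std


reference_amounts = [20, 25, 22, 30, 27, 24, 26, 23]  # amounts seen at training time
stable_amounts = [21, 26, 24, 29, 25]                 # recent data, similar behaviour
shifted_amounts = [80, 95, 70, 110, 90]               # recent data, behaviour changed

DRIFT_THRESHOLD = 3.0  # arbitrary cut-off for this illustration

for name, sample in [("stable", stable_amounts), ("shifted", shifted_amounts)]:
    score = mean_shift(reference_amounts, sample)
    status = "drift" if score &gt; DRIFT_THRESHOLD else "ok"
    print(name, round(score, 2), status)
</code></pre>
<p>A single summary statistic like this misses changes in the shape of a distribution, which is one reason to rely on a dedicated monitoring library rather than hand-rolled checks.</p>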
<p><strong>Docker</strong> ensures the same environment and dependencies in development and deployment, avoiding the classic "works on my machine" problem.</p>
<p><strong>GitHub Actions</strong> provides CI/CD automation. An efficient CI/CD pipeline helps integrate and deploy changes faster and with fewer errors.</p>
<p>💡 <strong>Mental Model</strong>: Think of this as building a "safety net" around your ML model. Each tool we add catches a different failure mode, like defensive driving for machine learning.</p>
<h3 id="heading-prerequisites"><strong>Prerequisites</strong></h3>
<p>You'll need:</p>
<ul>
<li><p><strong>Python 3.9+</strong> installed on your machine</p>
</li>
<li><p><strong>Docker Desktop</strong> installed and running</p>
</li>
<li><p><strong>GitHub account</strong> (if you want to try the CI/CD pipeline)</p>
</li>
<li><p><strong>Basic familiarity with Python</strong> and ML concepts (what training and prediction mean)</p>
</li>
</ul>
<p>You don't need MLOps or Kubernetes experience. Everything will be done locally with just Python and Docker – <strong>no cloud and no Kubernetes needed</strong>.</p>
<h3 id="heading-project-structure"><strong>Project Structure</strong></h3>
<p>Let's set up a basic project structure on your local machine. Open your terminal and run:</p>
<pre><code class="language-python"># Create project directory and subfolders
mkdir ml-platform-tutorial &amp;&amp; cd ml-platform-tutorial
mkdir -p data models src tests feature_repo

# Set up a virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
</code></pre>
<p>Your project structure should look like this:</p>
<pre><code class="language-python">ml-platform-tutorial/
├── data/              # Training and test datasets
├── models/            # Saved model files
├── src/               # Source code
├── tests/             # Test files
├── feature_repo/      # Feast feature repository
├── venv/              # Virtual environment
└── requirements.txt   # Dependencies
</code></pre>
<p>Next, create a <code>requirements.txt</code> with all the necessary libraries:</p>
<pre><code class="language-python"># requirements.txt

# Core ML libraries
pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0

# Experiment tracking and model registry
mlflow==2.10.0

# Feature store
feast==0.36.0

# API framework
fastapi==0.109.0
uvicorn==0.27.0
httpx==0.26.0

# Data validation
great-expectations==0.18.8

# Monitoring
evidently==0.7.20

# Testing
pytest==8.0.0
pytest-cov==4.1.0

# Utilities
pyarrow==15.0.0
pydantic==2.6.0
</code></pre>
<p>📌 <strong>Version Note:</strong> Exact versions are pinned to ensure reproducibility. Newer versions may work, but all examples were tested with the versions listed here.</p>
<p>Install the dependencies:</p>
<pre><code class="language-python">pip install -r requirements.txt
</code></pre>
<p>This might take a few minutes as it installs all the packages. Once complete, we're ready to start building our project step by step.</p>
<p><strong>Checkpoint:</strong> You should have a project folder with <code>data/</code>, <code>models/</code>, <code>src/</code>, <code>tests/</code>, and <code>feature_repo/</code> directories, and an activated virtual environment with all dependencies installed. Verify by running <code>python -c "import mlflow; import feast; import fastapi; print('All imports successful!')"</code>.</p>
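<p>If you'd also like to confirm that the installed versions match the pins, a small script along these lines can help. It's a sketch that only understands simple <code>name==version</code> lines, and the <code>parse_pins</code> and <code>check_pins</code> helpers are hypothetical names, not part of any tool in this stack.</p>
<pre><code class="language-python">from importlib import metadata


def parse_pins(requirements_text):
    """Extract {package: version} from lines of the form 'name==version'."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Compare pinned versions with what is installed in the active environment."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
        else:
            report[name] = "ok" if installed == wanted else f"found {installed}, expected {wanted}"
    return report


# A short excerpt is inlined here so the example runs anywhere
sample = """
# Core ML libraries
pandas==2.2.0
mlflow==2.10.0
"""

for package, status in check_pins(parse_pins(sample)).items():
    print(f"{package}: {status}")
</code></pre>
<p>To check the whole project, pass the contents of your own <code>requirements.txt</code> to <code>parse_pins</code> instead of the inline excerpt.</p>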
<p><strong>Figure 1: The Complete ML Platform We'll Build</strong></p>
<p><em>Don't worry if this looks complex. We'll build each component step by step, starting with the simplest piece and then connecting them together.</em></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771392341567/4bfdd727-32fb-4f30-a63e-c94f61a9f2db.png" alt="Architecture diagram of a local end-to-end machine learning platform for fraud detection. Transaction data flows through model training, experiment tracking and model registry in MLflow, feature management in Feast, data validation with Great Expectations, prediction serving through FastAPI, monitoring with Evidently, and automated testing and deployment with Docker and GitHub Actions." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-1-build-a-simple-model-and-api-the-naive-approach"><strong>1. Build a Simple Model and API (The Naive Approach)</strong></h2>
<p>To illustrate why we need all these tools, let's start by building a <strong>naive ML system without any MLOps infrastructure</strong>. We'll train a simple model and deploy it quickly, then observe what problems arise. This "naive approach" is how most ML projects start – and understanding its limitations will motivate the solutions we implement later.</p>
<h3 id="heading-11-train-a-quick-model"><strong>1.1 Train a Quick Model</strong></h3>
<p>First, we need some data. For simplicity, we'll generate a synthetic dataset for fraud detection so that we don't rely on any external data files. The dataset will have features like:</p>
<ul>
<li><p><code>amount</code>: Transaction amount in dollars</p>
</li>
<li><p><code>hour</code>: Hour of the day (0-23) when the transaction occurred</p>
</li>
<li><p><code>day_of_week</code>: Day of the week (0=Monday, 6=Sunday)</p>
</li>
<li><p><code>merchant_category</code>: Type of merchant (grocery, restaurant, retail, online, travel)</p>
</li>
<li><p><code>is_fraud</code>: Label indicating if the transaction is fraudulent (1) or legitimate (0)</p>
</li>
</ul>
<p>We'll make only about 2% of the transactions fraudulent, an imbalance typical of real fraud data. This imbalance matters because it affects how we evaluate our model.</p>
<p>Create <code>src/generate_data.py</code>:</p>
<pre><code class="language-python"># src/generate_data.py
"""
Generate synthetic fraud detection dataset.

This script creates realistic-looking transaction data where fraudulent
transactions have different patterns than legitimate ones:
- Fraud tends to have higher amounts
- Fraud tends to occur late at night
- Fraud is more common for online and travel merchants
"""
import pandas as pd
import numpy as np

def generate_transactions(n_samples=10000, fraud_ratio=0.02, seed=42):
    """
    Generate synthetic fraud detection dataset.
    
    Args:
        n_samples: Total number of transactions to generate
        fraud_ratio: Proportion of fraudulent transactions (default 2%)
        seed: Random seed for reproducibility
    
    Returns:
        DataFrame with transaction features and fraud labels
    
    Fraud transactions have different patterns:
    - Higher amounts (median ~$245 vs ~$33 for legit)
    - Late night hours (0-5, 23)
    - More likely to be online or travel merchants
    """
    np.random.seed(seed)
    n_fraud = int(n_samples * fraud_ratio)
    n_legit = n_samples - n_fraud

    # Legitimate transactions: normal shopping patterns
    # - Amounts follow a log-normal distribution (most small, some large)
    # - Hours are uniformly distributed throughout the day
    # - Merchant categories weighted toward everyday shopping
    legit = pd.DataFrame({
        "amount": np.random.lognormal(mean=3.5, sigma=1.2, size=n_legit),  # median ~$33
        "hour": np.random.randint(0, 24, size=n_legit),
        "day_of_week": np.random.randint(0, 7, size=n_legit),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_legit,
            p=[0.30, 0.25, 0.25, 0.15, 0.05]  # Weighted toward everyday shopping
        ),
        "is_fraud": 0
    })
    
    # Fraudulent transactions: suspicious patterns
    # - Higher amounts (fraudsters go big)
    # - Late night hours (less scrutiny)
    # - More online and travel (easier to exploit)
    fraud = pd.DataFrame({
        "amount": np.random.lognormal(mean=5.5, sigma=1.5, size=n_fraud),  # median ~$245
        "hour": np.random.choice([0, 1, 2, 3, 4, 5, 23], size=n_fraud),  # Late night
        "day_of_week": np.random.randint(0, 7, size=n_fraud),
        "merchant_category": np.random.choice(
            ["grocery", "restaurant", "retail", "online", "travel"],
            size=n_fraud,
            p=[0.05, 0.05, 0.10, 0.60, 0.20]  # Weighted toward online/travel
        ),
        "is_fraud": 1
    })
    
    # Combine and shuffle
    df = pd.concat([legit, fraud], ignore_index=True)
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    
    return df

if __name__ == "__main__":
    # Generate dataset
    print("Generating synthetic fraud detection dataset...")
    df = generate_transactions(n_samples=10000, fraud_ratio=0.02)
    
    # Split into train (80%) and test (20%)
    train_df = df.sample(frac=0.8, random_state=42)
    test_df = df.drop(train_df.index)
    
    # Save to CSV files
    train_df.to_csv("data/train.csv", index=False)
    test_df.to_csv("data/test.csv", index=False)
    
    # Print summary statistics
    print(f"\nDataset generated successfully!")
    print(f"Training set: {len(train_df):,} transactions")
    print(f"Test set: {len(test_df):,} transactions")
    print(f"Overall fraud ratio: {df['is_fraud'].mean():.2%}")
    print(f"\nLegitimate transactions - Median amount: ${df[df['is_fraud']==0]['amount'].median():.2f}")
    print(f"Fraudulent transactions - Median amount: ${df[df['is_fraud']==1]['amount'].median():.2f}")
    print(f"\nMerchant category distribution (fraud):")
    print(df[df['is_fraud']==1]['merchant_category'].value_counts(normalize=True))
</code></pre>
<p>Run the data generation script:</p>
<pre><code class="language-python">python src/generate_data.py
</code></pre>
<p>You should see output like:</p>
<pre><code class="language-python">Generating synthetic fraud detection dataset...

Dataset generated successfully!
Training set: 8,000 transactions
Test set: 2,000 transactions
Overall fraud ratio: 2.00%

Legitimate transactions - Median amount: $33.45
Fraudulent transactions - Median amount: $245.67

Merchant category distribution (fraud):
online        0.60
travel        0.20
retail        0.10
restaurant    0.05
grocery       0.05
</code></pre>
<p>Now you have <code>data/train.csv</code> and <code>data/test.csv</code> with ~8000 training and ~2000 testing transactions.</p>
<p><strong>Why This Matters:</strong> The synthetic data has realistic patterns: fraud is rare (2%), high-value, late-night, and concentrated in certain merchant categories. These patterns give our model something to learn.</p>
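<p>One detail worth checking is what a "typical" amount means for a log-normal distribution. For <code>np.random.lognormal(mean=mu, sigma=sigma)</code>, the median is <code>exp(mu)</code> while the mean is <code>exp(mu + sigma**2 / 2)</code>, so the two drift apart as <code>sigma</code> grows. The sketch below works both out for the parameters used in <code>generate_transactions</code>. The helper name <code>lognormal_summary</code> is ours for illustration and is not part of NumPy.</p>
<pre><code class="language-python">import math

def lognormal_summary(mu, sigma):
    """Return the theoretical (median, mean) of a log-normal distribution."""
    median = math.exp(mu)
    mean = math.exp(mu + sigma ** 2 / 2)
    return median, mean

# Parameters copied from generate_transactions()
legit_median, legit_mean = lognormal_summary(mu=3.5, sigma=1.2)
fraud_median, fraud_mean = lognormal_summary(mu=5.5, sigma=1.5)

print(f"Legitimate: median ~${legit_median:.0f}, mean ~${legit_mean:.0f}")
print(f"Fraud:      median ~${fraud_median:.0f}, mean ~${fraud_mean:.0f}")
</code></pre>
<p>The ~$33 and ~$245 figures correspond to the medians. The heavy right tail of the log-normal pulls the means well above them, so expect sample averages to be noticeably higher than the medians.</p>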
<p>Now, let's train a quick model. We'll use a simple <strong>Random Forest classifier</strong> from scikit-learn to predict <code>is_fraud</code>. In this naive version, we won't do much feature engineering – just label encode the categorical <code>merchant_category</code> and feed everything to the model.</p>
<p>Create <code>src/train_naive.py</code>:</p>
<pre><code class="language-python"># src/train_naive.py
"""
Train a fraud detection model - NAIVE VERSION.

This script demonstrates the "quick and dirty" approach to ML:
- No experiment tracking
- No model versioning
- Just train and save to a pickle file

We'll improve on this in later sections.
"""
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score,
    confusion_matrix,
    classification_report
)

def main():
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    print(f"Training samples: {len(train_df):,}")
    print(f"Test samples: {len(test_df):,}")
    print(f"Training fraud ratio: {train_df['is_fraud'].mean():.2%}")
    
    # Encode the categorical feature
    # We need to save the encoder to use the same mapping at inference time
    print("\nEncoding categorical features...")
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    print(f"Merchant category mapping: {dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))}")
    
    # Prepare features and labels
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    # Train a Random Forest classifier
    print("\nTraining Random Forest model...")
    model = RandomForestClassifier(
        n_estimators=100,      # Number of trees
        max_depth=10,          # Maximum depth of each tree
        random_state=42,       # For reproducibility
        n_jobs=-1              # Use all CPU cores
    )
    model.fit(X_train, y_train)
    print("Training complete!")
    
    # Evaluate on test data
    print("\n" + "="*50)
    print("MODEL EVALUATION")
    print("="*50)
    
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"F1-score:  {f1_score(y_test, y_pred):.4f}")
    
    print("\nConfusion Matrix:")
    cm = confusion_matrix(y_test, y_pred)
    print(f"  True Negatives:  {cm[0][0]:,} (correctly identified legitimate)")
    print(f"  False Positives: {cm[0][1]:,} (legitimate flagged as fraud)")
    print(f"  False Negatives: {cm[1][0]:,} (fraud missed - DANGEROUS!)")
    print(f"  True Positives:  {cm[1][1]:,} (correctly caught fraud)")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
    
    # Feature importance
    print("\nFeature Importance:")
    for name, importance in sorted(
        zip(feature_cols, model.feature_importances_),
        key=lambda x: x[1],
        reverse=True
    ):
        print(f"  {name}: {importance:.4f}")
    
    # Save the model and encoder together
    print("\nSaving model to models/model.pkl...")
    with open("models/model.pkl", "wb") as f:
        pickle.dump((model, encoder), f)
    
    print("\nModel trained and saved successfully!")
    print("\nWARNING: This naive approach has several problems:")
    print("  - No record of hyperparameters or metrics")
    print("  - No model versioning")
    print("  - No way to reproduce this exact model")
    print("  - We'll fix these issues in the following sections!")

if __name__ == "__main__":
    main()
</code></pre>
<p>Run the training script:</p>
<pre><code class="language-python">python src/train_naive.py
</code></pre>
<p>You should see output similar to:</p>
<pre><code class="language-python">Loading data...
Training samples: 8,000
Test samples: 2,000
Training fraud ratio: 2.00%

Encoding categorical features...
Merchant category mapping: {'grocery': 0, 'online': 1, 'restaurant': 2, 'retail': 3, 'travel': 4}

Training Random Forest model...
Training complete!

==================================================
MODEL EVALUATION
==================================================

Accuracy:  0.9820
Precision: 0.5294
Recall:    0.9000
F1-score:  0.6667

Confusion Matrix:
  True Negatives:  1,928 (correctly identified legitimate)
  False Positives: 32 (legitimate flagged as fraud)
  False Negatives: 4 (fraud missed - DANGEROUS!)
  True Positives:  36 (correctly caught fraud)

Feature Importance:
  amount: 0.5423
  hour: 0.2156
  merchant_encoded: 0.1345
  day_of_week: 0.1076
</code></pre>
<p><strong>Important observation:</strong> You'll see ~98% accuracy but a lower F1-score (around 0.5-0.7). <strong>With only 2% fraud, accuracy is extremely misleading!</strong> A model that always predicts "not fraud" would achieve 98% accuracy while catching zero fraud. This is why we focus on F1-score, precision, and recall for imbalanced classification problems.</p>
<p>💡 If you're new to imbalanced classification, remember: high accuracy can be meaningless when the positive class is rare.</p>
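<p>To make the point concrete, here is a minimal sketch that computes the four metrics directly from confusion-matrix counts. The function <code>metrics_from_counts</code> is a hypothetical helper written for this illustration (scikit-learn's metric functions do the same job in the training script), and the counts assume a 2,000-row test set with 40 fraudulent transactions.</p>
<pre><code class="language-python">def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when the model never predicts fraud
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# A "model" that labels every one of 2,000 test transactions as legitimate,
# assuming 40 of them (2%) are actually fraud.
baseline = metrics_from_counts(tp=0, fp=0, fn=40, tn=1960)
print(baseline)
</code></pre>
<p>The baseline scores 0.98 accuracy with zero recall and zero F1, which is why accuracy alone tells us almost nothing here.</p>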
<p>The script outputs a file <code>models/model.pkl</code> containing both the trained model and the label encoder (we need both for inference).</p>
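<p>The encoder has to travel with the model because the integer codes only mean something relative to the classes it was fitted on. The following sketch fits a stand-alone <code>LabelEncoder</code> on the five category names (rather than on <code>train.csv</code>, to keep it self-contained) and shows both the mapping and what happens with a category it has never seen. The category <code>"crypto"</code> is made up for the example.</p>
<pre><code class="language-python">from sklearn.preprocessing import LabelEncoder

categories = ["grocery", "restaurant", "retail", "online", "travel"]

encoder = LabelEncoder()
encoder.fit(categories)

# Classes are stored in sorted order; the integer code is the position.
mapping = {str(name): int(code) for name, code in
           zip(encoder.classes_, encoder.transform(encoder.classes_))}
print(mapping)

# A category the encoder never saw raises ValueError.
try:
    encoder.transform(["crypto"])
    unseen_error = None
except ValueError as exc:
    unseen_error = exc
    print(f"Unseen category rejected: {exc}")
</code></pre>
<p>Keep this behavior in mind for the serving code in the next section: it catches this <code>ValueError</code> and falls back to code 0, which the mapping shows is <code>grocery</code>.</p>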
<p><strong>Checkpoint:</strong> You should now have:</p>
<ul>
<li><p><code>data/train.csv</code> (~8,000 rows)</p>
</li>
<li><p><code>data/test.csv</code> (~2,000 rows)</p>
</li>
<li><p><code>models/model.pkl</code> (trained model + encoder)</p>
</li>
</ul>
<p>The model should show ~98% accuracy but F1 around 0.5-0.7. Verify the files exist: <code>ls -la data/ models/</code></p>
<h3 id="heading-12-serve-predictions-with-fastapi"><strong>1.2 Serve Predictions with FastAPI</strong></h3>
<p>Now that we have a model, let's deploy it as an API so that clients can get predictions. We'll use <strong>FastAPI</strong> because it's straightforward, very fast, and produces automatic interactive documentation.</p>
<p>FastAPI is known for:</p>
<ul>
<li><p><strong>Easy to use</strong>: Pythonic syntax with type hints</p>
</li>
<li><p><strong>High performance</strong>: One of the fastest Python frameworks</p>
</li>
<li><p><strong>Automatic documentation</strong>: Swagger UI out of the box</p>
</li>
<li><p><strong>Data validation</strong>: Using Pydantic models</p>
</li>
</ul>
<p>Create <code>src/serve_naive.py</code>:</p>
<pre><code class="language-python"># src/serve_naive.py
"""
Serve fraud detection model as a REST API - NAIVE VERSION.

This is a simple API that:
1. Loads the trained model at startup
2. Accepts transaction data via POST request
3. Returns fraud prediction

We'll improve this with validation, monitoring, and better
model loading in later sections.
"""
import pickle
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Load the trained model and encoder at startup
# This is loaded once when the server starts, not on every request
print("Loading model...")
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)
print("Model loaded successfully!")

# Create the FastAPI application
app = FastAPI(
    title="Fraud Detection API",
    description="""
    Predict whether a credit card transaction is fraudulent.
    
    This API accepts transaction details and returns:
    - Whether the transaction is predicted to be fraud
    - The probability of fraud (0.0 to 1.0)
    
    **Note:** This is the naive version without validation or monitoring.
    """,
    version="1.0.0"
)

# Define the input schema using Pydantic
# This provides automatic validation and documentation
class Transaction(BaseModel):
    """Schema for a transaction to be evaluated for fraud."""
    amount: float = Field(
        ...,
        description="Transaction amount in dollars",
        examples=[150.00]
    )
    hour: int = Field(
        ...,
        description="Hour of the day (0-23)",
        examples=[14]
    )
    day_of_week: int = Field(
        ...,
        description="Day of week (0=Monday, 6=Sunday)",
        examples=[3]
    )
    merchant_category: str = Field(
        ...,
        description="Type of merchant",
        examples=["online"]
    )

class PredictionResponse(BaseModel):
    """Schema for the prediction response."""
    is_fraud: bool = Field(description="Whether the transaction is predicted as fraud")
    fraud_probability: float = Field(description="Probability of fraud (0.0 to 1.0)")
    
@app.post("/predict", response_model=PredictionResponse)
def predict(transaction: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Takes transaction details and returns a fraud prediction
    along with the probability score.
    """
    # Convert the request to a dictionary
    # (model_dump() replaces the deprecated dict() in Pydantic v2)
    data = transaction.model_dump()
    
    # Encode the merchant category using the same encoder from training
    # This ensures consistency between training and serving
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        # Unknown merchant categories silently fall back to code 0,
        # which is "grocery" in the fitted encoder.
        # In production, we'd want better handling here
        data["merchant_encoded"] = 0
    
    # Prepare features in the same order as training
    X = [[
        data["amount"],
        data["hour"],
        data["day_of_week"],
        data["merchant_encoded"]
    ]]
    
    # Get prediction and probability
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0][1]  # Probability of class 1 (fraud)
    
    return PredictionResponse(
        is_fraud=bool(prediction),
        fraud_probability=round(float(probability), 4)
    )

@app.get("/health")
def health_check():
    """
    Health check endpoint.
    
    Returns the status of the API. Useful for:
    - Load balancer health checks
    - Kubernetes liveness probes
    - Monitoring systems
    """
    return {
        "status": "healthy",
        "model_loaded": model is not None
    }

@app.get("/")
def root():
    """Root endpoint with API information."""
    return {
        "message": "Fraud Detection API",
        "version": "1.0.0",
        "docs": "/docs",
        "health": "/health"
    }
</code></pre>
<p>A few important things to note about this code:</p>
<ol>
<li><p><strong>Pydantic Models</strong>: We use <code>BaseModel</code> to define the expected input JSON schema. FastAPI automatically validates incoming requests against this schema.</p>
</li>
<li><p><strong>Type Hints</strong>: The type hints (<code>float</code>, <code>int</code>, <code>str</code>) provide both documentation and runtime validation.</p>
</li>
<li><p><strong>Feature Encoding</strong>: On each request, we encode the merchant category using the same <code>LabelEncoder</code> we saved from training. This ensures consistency between training and serving.</p>
</li>
<li><p><strong>Health Endpoint</strong>: The <code>/health</code> endpoint is standard practice for production APIs. It lets load balancers and monitoring systems check whether the service is running.</p>
</li>
</ol>
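<p>It helps to be precise about what this validation covers. The sketch below uses a trimmed copy of the <code>Transaction</code> schema (the <code>Field</code> metadata is omitted for brevity) to show that Pydantic rejects values of the wrong type but, with plain type hints, says nothing about whether a value is plausible. Inside FastAPI, the first kind of failure is returned to the client as a 422 response.</p>
<pre><code class="language-python">from pydantic import BaseModel, ValidationError

class Transaction(BaseModel):
    amount: float
    hour: int
    day_of_week: int
    merchant_category: str

# Wrong type: rejected before our endpoint code would ever run.
try:
    Transaction(amount="not-a-number", hour=14, day_of_week=3,
                merchant_category="online")
    rejected_fields = []
except ValidationError as exc:
    rejected_fields = [err["loc"][0] for err in exc.errors()]
print("Rejected fields:", rejected_fields)

# Right types but impossible values: accepted without complaint.
odd = Transaction(amount=-500.0, hour=25, day_of_week=10,
                  merchant_category="online")
print("Accepted:", odd.amount, odd.hour, odd.day_of_week)
</code></pre>
<p>So the schema protects us from malformed requests, but not from a negative amount or an hour of 25.</p>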
<p>To run this API, use Uvicorn (an ASGI server):</p>
<pre><code class="language-python">uvicorn src.serve_naive:app --reload --host 0.0.0.0 --port 8000
</code></pre>
<p>The <code>--reload</code> flag enables auto-reload during development (the server restarts when you change code).</p>
<p>You should see:</p>
<pre><code class="language-python">Loading model...
Model loaded successfully!
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process
</code></pre>
<p>Now open your browser and go to <code>http://localhost:8000/docs</code>. You'll see the <strong>Swagger UI</strong> – an auto-generated interactive documentation where you can test the API directly from your browser!</p>
<p>Test the API using curl in another terminal:</p>
<pre><code class="language-python"># Test with a legitimate-looking transaction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}'
</code></pre>
<p>You should get a response like this (the exact probability will vary):</p>
<pre><code class="language-python">{"is_fraud": false, "fraud_probability": 0.02}
</code></pre>
<pre><code class="language-python"># Test with a suspicious transaction (high amount, late night, online)
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": 500.0, "hour": 3, "day_of_week": 1, "merchant_category": "online"}'
</code></pre>
<p>You should get a response like this (again, the exact probability will vary):</p>
<pre><code class="language-python">{"is_fraud": true, "fraud_probability": 0.78}
</code></pre>
<p><strong>We have a working model served as an API!</strong> In a real scenario, we could now integrate this API with a payment processing frontend, mobile app, or any system that needs fraud predictions.</p>
<p>But before we celebrate, let's examine this naive approach for potential pitfalls...</p>
<p><strong>Checkpoint:</strong> Your API should be running at <code>http://localhost:8000</code>. The Swagger UI at <code>/docs</code> should show three endpoints (<code>/predict</code>, <code>/health</code>, and <code>/</code>). Test with curl or the Swagger UI to verify predictions are returned.</p>
<h2 id="heading-2-where-the-naive-approach-breaks"><strong>2. Where the Naive Approach Breaks</strong></h2>
<p>Our quick-and-dirty ML pipeline works on the surface: it can train a model and serve predictions. However, <strong>hidden problems will emerge</strong> if we try to maintain or scale this system in production.</p>
<p>This section is critical: understanding these issues will motivate the solutions we implement in the following sections. Let's go through the problems one by one.</p>
<h3 id="heading-problem-1-no-experiment-tracking-reproducibility"><strong>Problem 1: No Experiment Tracking (Reproducibility)</strong></h3>
<p>Try this thought experiment: Run <code>train_naive.py</code> again with different hyperparameters (change <code>n_estimators</code> to 200, or <code>max_depth</code> to 15). Would you be able to <strong>exactly reproduce the previous model's results</strong> if someone asked?</p>
<p>Probably not. Currently, we have <strong>no record</strong> of:</p>
<ul>
<li><p>Which hyperparameters we used</p>
</li>
<li><p>What metrics we achieved</p>
</li>
<li><p>What version of the data we trained on</p>
</li>
<li><p>What library versions were installed</p>
</li>
<li><p>When the training happened</p>
</li>
<li><p>Who ran the training</p>
</li>
</ul>
<p>Three months from now, if your manager asks "How was this model trained? Can you reproduce the results?" – you'd be in trouble. You might have the code, but you don't know which version of the code, which parameters, or which data produced the model that's currently in production.</p>
<p><strong>Experiment tracking</strong> is the practice of logging all these details (code versions, parameters, metrics, data versions, artifacts) so experiments can be compared and replicated. Our naive approach lacks this entirely, making our results hard to trust or build upon.</p>
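<p>As a rough picture of what "logging all these details" involves, here is a hand-rolled sketch that gathers the facts about a run into a single JSON record. This is not how MLflow works internally and <code>build_run_record</code> is a hypothetical helper. The parameter values mirror <code>train_naive.py</code>, while the metric and the data bytes are placeholders.</p>
<pre><code class="language-python">import hashlib
import json
import platform
from datetime import datetime, timezone

def build_run_record(params, metrics, data_bytes):
    """Assemble the facts needed to identify and compare a training run."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "params": params,
        "metrics": metrics,
        # Hash of the training data, so we can tell which file was used
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }

# Placeholder values for illustration only
record = build_run_record(
    params={"n_estimators": 100, "max_depth": 10, "random_state": 42},
    metrics={"f1": 0.67},
    data_bytes=b"amount,hour,day_of_week,merchant_category,is_fraud\n",
)
print(json.dumps(record, indent=2))
</code></pre>
<p>Even this small record answers most of the questions in the list above. A tracking tool adds storage, a UI, and comparison across runs on top of the same idea.</p>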
<h3 id="heading-problem-2-model-versioning-and-deployment-chaos"><strong>Problem 2: Model Versioning and Deployment Chaos</strong></h3>
<p>We trained one model and saved it as <code>model.pkl</code>. Now consider this scenario:</p>
<ol>
<li><p>You train a new model with different hyperparameters</p>
</li>
<li><p>You overwrite <code>model.pkl</code> with the new model</p>
</li>
<li><p>You deploy it to production</p>
</li>
<li><p>Users start complaining about more false positives</p>
</li>
<li><p>You want to roll back to the previous model</p>
</li>
<li><p><strong>Problem:</strong> The previous model was overwritten and is gone forever</p>
</li>
</ol>
<p>There's no systematic versioning. Questions you cannot answer:</p>
<ul>
<li><p>Which model version is currently in production?</p>
</li>
<li><p>What were the metrics for model v1 vs v2?</p>
</li>
<li><p>When was each model trained and by whom?</p>
</li>
<li><p>Can we instantly roll back if the new model performs worse?</p>
</li>
<li><p>What changed between versions?</p>
</li>
</ul>
<p>Without version control for models, you're flying blind. Imagine deploying code without Git – that's what we're doing with our model.</p>
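<p>The smallest possible improvement over overwriting <code>model.pkl</code> is to never reuse a file name. The sketch below assumes a simple <code>model_v1.pkl</code>, <code>model_v2.pkl</code>, ... naming scheme and a hypothetical helper <code>save_versioned</code>. It uses placeholder dictionaries instead of real models and a temporary directory so it can run anywhere.</p>
<pre><code class="language-python">import pickle
import tempfile
from pathlib import Path

def save_versioned(obj, model_dir):
    """Pickle obj as model_v1.pkl, model_v2.pkl, ... without overwriting."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    existing = [int(p.stem.split("_v")[1]) for p in model_dir.glob("model_v*.pkl")]
    version = max(existing, default=0) + 1
    path = model_dir / f"model_v{version}.pkl"
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path

# Demonstrate in a temporary directory with placeholder "models"
with tempfile.TemporaryDirectory() as tmp:
    first = save_versioned({"max_depth": 10}, tmp)
    second = save_versioned({"max_depth": 15}, tmp)
    saved_names = sorted(p.name for p in Path(tmp).glob("*.pkl"))
    # Rolling back is just loading the earlier file
    with open(first, "rb") as f:
        rolled_back = pickle.load(f)

print(saved_names)
print(rolled_back)
</code></pre>
<p>This keeps old models around, but it still can't tell you which version is in production or how the versions compare. Those are the gaps a model registry fills.</p>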
<h3 id="heading-problem-3-no-data-validation-garbage-in-garbage-out"><strong>Problem 3: No Data Validation – Garbage In, Garbage Out</strong></h3>
<p>Pydantic checks that each field has the right type, but beyond that our API will accept <strong>any value</strong> and try to make a prediction. Let's see what happens with bad data.</p>
<p>Create a test script <code>src/test_bad_data.py</code>:</p>
<pre><code class="language-python"># src/test_bad_data.py
"""Test what happens when we send garbage data to the API."""
import requests

BASE_URL = "http://localhost:8000"

print("Testing API with various bad inputs...\n")

# Test 1: Negative amount
print("Test 1: Negative amount")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -500.0,        # Negative amount - impossible!
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 2: Invalid hour
print("Test 2: Hour = 25 (should be 0-23)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 25,              # Invalid hour!
    "day_of_week": 3,
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 3: Invalid day of week
print("Test 3: day_of_week = 10 (should be 0-6)")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 10,       # Invalid day!
    "merchant_category": "online"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 4: Unknown merchant category
print("Test 4: Unknown merchant category")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": 100.0,
    "hour": 14,
    "day_of_week": 3,
    "merchant_category": "unknown_category"  # Not in training data!
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

# Test 5: All bad at once
print("Test 5: Everything wrong")
response = requests.post(f"{BASE_URL}/predict", json={
    "amount": -1000.0,
    "hour": 99,
    "day_of_week": 15,
    "merchant_category": "totally_fake"
})
print(f"  Status: {response.status_code}")
print(f"  Response: {response.json()}\n")

print("Observation: The API happily accepts ALL garbage and returns predictions!")
print("This is dangerous - bad data leads to bad predictions with no warning.")
</code></pre>
<p>Run it (make sure your API is still running):</p>
<pre><code class="language-python">python src/test_bad_data.py
</code></pre>
<p>You'll see something like:</p>
<pre><code class="language-python">Testing API with various bad inputs...

Test 1: Negative amount
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.15}

Test 2: Hour = 25 (should be 0-23)
  Status: 200
  Response: {'is_fraud': False, 'fraud_probability': 0.08}

...

Observation: The API happily accepts ALL garbage and returns predictions!
</code></pre>
<p><strong>The API accepts garbage and returns predictions with no warning!</strong> In production, this could mean:</p>
<ul>
<li><p>Incorrect predictions based on impossible data</p>
</li>
<li><p>Fraud going undetected because of malformed input</p>
</li>
<li><p>Legitimate transactions blocked based on corrupted data</p>
</li>
<li><p>No way to debug why predictions are wrong</p>
</li>
</ul>
<p>As the saying goes: <strong>"Garbage in, garbage out."</strong> But even worse – we don't even know garbage went in!</p>
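<p>For a sense of what is missing, the checks we would want are not complicated. The following plain-Python sketch is a stand-in for illustration only (the tutorial uses Great Expectations for this in Section 5). The function <code>find_problems</code> and the rule set are assumptions based on the feature descriptions in Section 1.1.</p>
<pre><code class="language-python">VALID_CATEGORIES = {"grocery", "restaurant", "retail", "online", "travel"}

def find_problems(tx):
    """Return a list of human-readable problems with a transaction dict."""
    problems = []
    if not tx["amount"] > 0:
        problems.append("amount must be positive")
    if tx["hour"] not in range(24):
        problems.append("hour must be between 0 and 23")
    if tx["day_of_week"] not in range(7):
        problems.append("day_of_week must be between 0 and 6")
    if tx["merchant_category"] not in VALID_CATEGORIES:
        problems.append("unknown merchant_category")
    return problems

good = {"amount": 50.0, "hour": 14, "day_of_week": 3,
        "merchant_category": "grocery"}
bad = {"amount": -1000.0, "hour": 99, "day_of_week": 15,
       "merchant_category": "totally_fake"}

print(find_problems(good))
print(find_problems(bad))
</code></pre>
<p>The first call returns an empty list. The second returns all four problems, which is exactly the information the naive API throws away.</p>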
<h3 id="heading-problem-4-model-drift-performance-decay-over-time"><strong>Problem 4: Model Drift – Performance Decay Over Time</strong></h3>
<p>Here's a scenario that plays out in many production ML systems:</p>
<ol>
<li><p><strong>January</strong>: You train your model on historical fraud data. It achieves 98% accuracy and 0.67 F1-score. Everyone's happy.</p>
</li>
<li><p><strong>February</strong>: The model is deployed and working well. Fraud is being caught.</p>
</li>
<li><p><strong>March</strong>: Fraudsters adapt. They start using different patterns – smaller amounts, different merchant categories, different times of day.</p>
</li>
<li><p><strong>April</strong>: Your model's accuracy has dropped from 98% to 85%. F1-score dropped from 0.67 to 0.35. Fraud is slipping through.</p>
</li>
<li><p><strong>May</strong>: A major fraud incident occurs. Investigation reveals the model has been underperforming for 2 months.</p>
</li>
</ol>
<p><strong>The problem:</strong> Nobody noticed for 2 months because there was no monitoring.</p>
<p>This phenomenon is called <strong>data drift</strong> (when input data distributions change) or <strong>concept drift</strong> (when the relationship between inputs and outputs changes). Both are inevitable in real-world systems.</p>
<p>Without monitoring:</p>
<ul>
<li><p>You don't know when performance degrades</p>
</li>
<li><p>You don't know why performance degrades</p>
</li>
<li><p>You can't take corrective action until users complain</p>
</li>
<li><p>By then, significant damage may have occurred</p>
</li>
</ul>
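<p>One simple way to put a number on data drift is the Population Stability Index (PSI), which compares the share of each category in a reference window against a current window. The sketch below is a minimal hand-written version for a categorical feature, not the method Evidently uses in Section 6. The function <code>category_psi</code> and the January and April category mixes are invented for the example.</p>
<pre><code class="language-python">import math
from collections import Counter

def category_psi(reference, current, epsilon=1e-6):
    """Population Stability Index between two lists of category labels."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    psi = 0.0
    for category in set(ref_counts) | set(cur_counts):
        # epsilon keeps the logarithm finite when a category is absent
        ref_share = max(ref_counts[category] / len(reference), epsilon)
        cur_share = max(cur_counts[category] / len(current), epsilon)
        psi += (cur_share - ref_share) * math.log(cur_share / ref_share)
    return psi

# Hypothetical category mixes: training time vs. a later month
january = (["grocery"] * 30 + ["restaurant"] * 25 + ["retail"] * 25
           + ["online"] * 15 + ["travel"] * 5)
april = (["grocery"] * 15 + ["restaurant"] * 15 + ["retail"] * 20
         + ["online"] * 40 + ["travel"] * 10)

print(f"PSI, January vs January: {category_psi(january, january):.3f}")
print(f"PSI, January vs April:   {category_psi(january, april):.3f}")
</code></pre>
<p>Identical distributions give a PSI of zero, and the value grows as the mix shifts. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, though the right threshold depends on the application.</p>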
<h3 id="heading-problem-5-no-cicd-or-deployment-safety"><strong>Problem 5: No CI/CD or Deployment Safety</strong></h3>
<p>Our "deployment process" was literally:</p>
<ol>
<li><p>SSH into the server (or run locally)</p>
</li>
<li><p>Run <code>python src/train_naive.py</code></p>
</li>
<li><p>Copy <code>model.pkl</code> to the right place</p>
</li>
<li><p>Restart the API</p>
</li>
<li><p>Hope for the best</p>
</li>
</ol>
<p>What's missing:</p>
<ul>
<li><p><strong>No automated testing</strong>: A typo could break everything</p>
</li>
<li><p><strong>No staging environment</strong>: We test directly in production</p>
</li>
<li><p><strong>No gradual rollout</strong>: 100% of traffic hits the new model immediately</p>
</li>
<li><p><strong>No rollback capability</strong>: If something breaks, we have to manually fix it</p>
</li>
<li><p><strong>No audit trail</strong>: Who deployed what and when?</p>
</li>
</ul>
<p>This is how production incidents happen. A rushed deployment at 5 PM on Friday breaks the fraud detection system, and nobody notices until Monday when fraud losses have spiked.</p>
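<p>An automated safeguard does not have to be elaborate. As a sketch of the kind of check a CI pipeline could run before a model is deployed, the function below compares a candidate model's metrics with the current production model's and lists any reasons to block the release. The function name <code>deployment_blockers</code>, the thresholds, and the metric values are all hypothetical.</p>
<pre><code class="language-python">def deployment_blockers(candidate, production, min_recall=0.5, max_f1_drop=0.02):
    """List the reasons a candidate model should not replace production."""
    blockers = []
    if min_recall > candidate["recall"]:
        blockers.append(
            f"recall {candidate['recall']:.2f} is below the floor of {min_recall:.2f}"
        )
    if production["f1"] - candidate["f1"] > max_f1_drop:
        blockers.append(
            f"F1 fell from {production['f1']:.2f} to {candidate['f1']:.2f}"
        )
    return blockers

# Hypothetical metrics for the current model and two candidates
production = {"f1": 0.67, "recall": 0.62}
better = {"f1": 0.70, "recall": 0.66}
worse = {"f1": 0.35, "recall": 0.30}

for name, candidate in [("better", better), ("worse", worse)]:
    blockers = deployment_blockers(candidate, production)
    status = "OK to deploy" if not blockers else "BLOCKED: " + "; ".join(blockers)
    print(f"{name}: {status}")
</code></pre>
<p>Run as part of an automated pipeline, a check like this turns "hope for the best" into an explicit pass or fail before any traffic reaches the new model.</p>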
<p><strong>Figure 2: Problems with the Naive Approach</strong></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771392425864/75c51059-5ab3-4e08-b3ad-7f5e9c3e7445.png" alt="Diagram showing the weaknesses of a naive machine learning setup: manual training and deployment, no experiment tracking, no model versioning, inconsistent features between training and serving, no data validation, no drift or performance monitoring, and no CI/CD safeguards such as automated tests, rollback, or audit trail." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-summary-what-we-need-to-fix"><strong>Summary: What We Need to Fix</strong></h3>
<p>Our simple ML service is missing critical infrastructure. Here's the mapping of problems to solutions:</p>
<table>
<thead>
<tr>
<th><strong>Problem</strong></th>
<th><strong>Impact</strong></th>
<th><strong>Solution</strong></th>
<th><strong>Section</strong></th>
</tr>
</thead>
<tbody><tr>
<td>No experiment tracking</td>
<td>Can't reproduce or compare models</td>
<td>MLflow Tracking</td>
<td>3</td>
</tr>
<tr>
<td>No model versioning</td>
<td>Can't roll back or audit</td>
<td>MLflow Registry</td>
<td>3</td>
</tr>
<tr>
<td>No feature consistency</td>
<td>Training-serving skew</td>
<td>Feast Feature Store</td>
<td>4</td>
</tr>
<tr>
<td>No data validation</td>
<td>Garbage predictions</td>
<td>Great Expectations</td>
<td>5</td>
</tr>
<tr>
<td>No monitoring</td>
<td>Drift goes unnoticed</td>
<td>Evidently</td>
<td>6</td>
</tr>
<tr>
<td>No CI/CD</td>
<td>Risky deployments</td>
<td>GitHub Actions + Docker</td>
<td>7</td>
</tr>
</tbody></table>
<p><strong>The good news:</strong> We can fix each of these by incrementally adding components to our pipeline. Each tool addresses a specific problem, and together they form a robust ML platform.</p>
<p>Let's start fixing these issues, one by one.</p>
<h2 id="heading-3-add-experiment-tracking-and-model-registry-with-mlflow"><strong>3. Add Experiment Tracking and Model Registry with MLflow</strong></h2>
<p><strong>What breaks without this:</strong> You can't reproduce yesterday's results, can't compare experiments, and can't roll back when a new model fails in production.</p>
<p>Our first fix addresses <strong>Problems 1 and 2</strong>: experiment reproducibility and model versioning.</p>
<p><strong>MLflow</strong> is an open-source platform designed to manage the ML lifecycle. We'll use two of its key components:</p>
<ol>
<li><p><strong>MLflow Tracking</strong>: Log experiments (parameters, metrics, artifacts) so you can compare runs and reproduce results</p>
</li>
<li><p><strong>MLflow Model Registry</strong>: Version your models with aliases (champion, challenger) and manage the deployment lifecycle</p>
</li>
</ol>
<p><strong>Why This Matters:</strong> Without tracking, ML is guesswork. With MLflow, every run is logged with parameters, metrics, and artifacts. You can compare runs side-by-side, understand what actually improved your model, and reproduce any past experiment. The Model Registry adds governance – you know exactly which model is in production and can roll back in seconds.</p>
<h3 id="heading-31-how-to-set-up-the-mlflow-tracking-server"><strong>3.1 How to Set Up the MLflow Tracking Server</strong></h3>
<p>MLflow can log experiments to a local directory by default, but to use the full UI and model registry, it's best to run the MLflow tracking server.</p>
<p>Open a <strong>new terminal</strong> (keep it separate from your API terminal) and run:</p>
<pre><code class="language-bash"># Create a directory for MLflow data
mkdir -p mlruns

# Start the MLflow server
mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns
</code></pre>
<p>Let's break down these parameters:</p>
<ul>
<li><p><code>--host 0.0.0.0</code>: Listen on all network interfaces (use <code>127.0.0.1</code> instead if you only need access from your own machine)</p>
</li>
<li><p><code>--port 5000</code>: Run on port 5000</p>
</li>
<li><p><code>--backend-store-uri sqlite:///mlflow.db</code>: Store experiment metadata in a SQLite database (for production, you'd use PostgreSQL or MySQL)</p>
</li>
<li><p><code>--default-artifact-root ./mlruns</code>: Store model artifacts (files) in the <code>mlruns</code> directory</p>
</li>
</ul>
<p>You should see output similar to this (the exact lines depend on your MLflow version):</p>
<pre><code class="language-python">[INFO] Starting gunicorn 21.2.0
[INFO] Listening at: http://0.0.0.0:5000
</code></pre>
<p>Now open your browser and navigate to <code>http://localhost:5000</code>. You'll see the <strong>MLflow UI</strong> – it should be empty initially since we haven't logged any experiments yet.</p>
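<p>If you would rather confirm from Python that the server is reachable before running any training code, a small check like the one below can help. This is a minimal sketch rather than part of the original project: the helper name <code>tracking_server_is_up</code> is hypothetical, and it assumes the server answers a <code>/health</code> endpoint at the address used above.</p>
<pre><code class="language-python">import urllib.error
import urllib.request


# Hypothetical helper (not part of the tutorial project).
def tracking_server_is_up(base_url="http://localhost:5000", timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status
        return False
</code></pre>
<p>Calling <code>tracking_server_is_up()</code> at the top of a training script lets you fail early with a clear message instead of waiting for the first logging call to time out.</p>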
<h3 id="heading-32-how-to-log-experiments-in-code"><strong>3.2 How to Log Experiments in Code</strong></h3>
<p>Now let's modify our training script to log everything to MLflow. Create <code>src/train_mlflow.py</code>:</p>
<pre><code class="language-python"># src/train_mlflow.py
"""
Train fraud detection model with MLflow experiment tracking.

This script demonstrates proper ML experiment tracking:
- Log all hyperparameters
- Log all metrics (train and test)
- Log the trained model as an artifact
- Register the model in the Model Registry

Compare this to train_naive.py to see the difference!
"""
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    roc_auc_score
)
import pickle
from datetime import datetime

# Configure MLflow to use our tracking server
mlflow.set_tracking_uri("http://localhost:5000")

# Create or get the experiment
# All runs will be grouped under this experiment name
mlflow.set_experiment("fraud-detection")

def load_and_preprocess_data():
    """Load and preprocess the training and test data."""
    print("Loading data...")
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    # Encode categorical feature
    encoder = LabelEncoder()
    train_df["merchant_encoded"] = encoder.fit_transform(train_df["merchant_category"])
    test_df["merchant_encoded"] = encoder.transform(test_df["merchant_category"])
    
    # Prepare features
    feature_cols = ["amount", "hour", "day_of_week", "merchant_encoded"]
    X_train = train_df[feature_cols]
    y_train = train_df["is_fraud"]
    X_test = test_df[feature_cols]
    y_test = test_df["is_fraud"]
    
    return X_train, y_train, X_test, y_test, encoder

def train_and_log_model(
    n_estimators: int = 100,
    max_depth: int = 10,
    min_samples_split: int = 2,
    min_samples_leaf: int = 1
):
    """
    Train a model and log everything to MLflow.
    
    Args:
        n_estimators: Number of trees in the forest
        max_depth: Maximum depth of each tree
        min_samples_split: Minimum samples required to split a node
        min_samples_leaf: Minimum samples required at a leaf node
    """
    X_train, y_train, X_test, y_test, encoder = load_and_preprocess_data()
    
    # Build a descriptive run name, then start an MLflow run.
    # Everything logged inside the "with" block is associated with this run.
    run_name = f"rf_est{n_estimators}_depth{max_depth}_{datetime.now().strftime('%H%M%S')}"
    with mlflow.start_run(run_name=run_name):
        
        # Log all hyperparameters
        # These are the "knobs" we can tune
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_param("max_depth", max_depth)
        mlflow.log_param("min_samples_split", min_samples_split)
        mlflow.log_param("min_samples_leaf", min_samples_leaf)
        mlflow.log_param("model_type", "RandomForestClassifier")
        
        # Log data information
        mlflow.log_param("train_samples", len(X_train))
        mlflow.log_param("test_samples", len(X_test))
        mlflow.log_param("fraud_ratio", float(y_train.mean()))
        mlflow.log_param("n_features", X_train.shape[1])
        
        # Train the model
        print(f"\nTraining model: n_estimators={n_estimators}, max_depth={max_depth}")
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
            n_jobs=-1
        )
        model.fit(X_train, y_train)
        
        # Evaluate and log metrics for BOTH train and test sets
        # This helps detect overfitting
        for dataset_name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
            y_pred = model.predict(X)
            y_prob = model.predict_proba(X)[:, 1]
            
            # Calculate all metrics
            accuracy = accuracy_score(y, y_pred)
            precision = precision_score(y, y_pred, zero_division=0)
            recall = recall_score(y, y_pred, zero_division=0)
            f1 = f1_score(y, y_pred, zero_division=0)
            roc_auc = roc_auc_score(y, y_prob)
            
            # Log metrics with dataset prefix
            mlflow.log_metric(f"{dataset_name}_accuracy", accuracy)
            mlflow.log_metric(f"{dataset_name}_precision", precision)
            mlflow.log_metric(f"{dataset_name}_recall", recall)
            mlflow.log_metric(f"{dataset_name}_f1", f1)
            mlflow.log_metric(f"{dataset_name}_roc_auc", roc_auc)
            
            print(f"  {dataset_name.upper()} - Accuracy: {accuracy:.4f}, F1: {f1:.4f}, ROC-AUC: {roc_auc:.4f}")
        
        # Log feature importance, pairing each score with its training column name
        for feature, importance in zip(X_train.columns, model.feature_importances_):
            mlflow.log_metric(f"importance_{feature}", float(importance))
        
        # Log the model to MLflow AND register it in the Model Registry
        # This creates a new version of the model automatically
        print("\nRegistering model in MLflow Model Registry...")
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name="fraud-detection-model",
            input_example=X_train.iloc[:5]  # Example input for documentation
        )
        
        # Save and log the encoder as a separate artifact
        # We need this for inference
        with open("encoder.pkl", "wb") as f:
            pickle.dump(encoder, f)
        mlflow.log_artifact("encoder.pkl")
        
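        # Get the run and experiment IDs for reference
        # (the experiment ID is not guaranteed to be 1, so read it from the run)
        run_info = mlflow.active_run().info
        run_id = run_info.run_id
        print(f"\nMLflow Run ID: {run_id}")
        print(f"View this run: http://localhost:5000/#/experiments/{run_info.experiment_id}/runs/{run_id}")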
        
        return model, encoder

def run_experiment_sweep():
    """
    Run multiple experiments with different hyperparameters.
    
    This demonstrates how MLflow helps compare different configurations.
    """
    print("="*60)
    print("RUNNING HYPERPARAMETER EXPERIMENT SWEEP")
    print("="*60)
    
    # Define different configurations to try
    experiments = [
        {"n_estimators": 50, "max_depth": 5},
        {"n_estimators": 100, "max_depth": 10},
        {"n_estimators": 100, "max_depth": 15},
        {"n_estimators": 200, "max_depth": 10},
        {"n_estimators": 200, "max_depth": 20},
    ]
    
    for i, params in enumerate(experiments, 1):
        print(f"\n--- Experiment {i}/{len(experiments)} ---")
        train_and_log_model(**params)
    
    print("\n" + "="*60)
    print("EXPERIMENT SWEEP COMPLETE!")
    print("="*60)
    print("\nView all experiments at: http://localhost:5000")
    print("Compare runs to find the best hyperparameters!")

if __name__ == "__main__":
    run_experiment_sweep()
</code></pre>
<p>This script:</p>
<ol>
<li><p><strong>Connects to MLflow</strong>: <code>mlflow.set_tracking_uri("http://localhost:5000")</code></p>
</li>
<li><p><strong>Creates an experiment</strong>: <code>mlflow.set_experiment("fraud-detection")</code></p>
</li>
<li><p><strong>Logs parameters</strong>: All hyperparameters and data info</p>
</li>
<li><p><strong>Logs metrics</strong>: Accuracy, precision, recall, F1, ROC-AUC for both train and test sets</p>
</li>
<li><p><strong>Logs the model</strong>: Saves the trained model as an artifact</p>
</li>
<li><p><strong>Registers the model</strong>: Adds it to the Model Registry with automatic versioning</p>
</li>
</ol>
<p>Run the experiment sweep:</p>
<pre><code class="language-bash">python src/train_mlflow.py
</code></pre>
<p>You'll see output for each experiment:</p>
<pre><code class="language-python">============================================================
RUNNING HYPERPARAMETER EXPERIMENT SWEEP
============================================================

--- Experiment 1/5 ---
Loading data...
Training model: n_estimators=50, max_depth=5
  TRAIN - Accuracy: 0.9821, F1: 0.6545, ROC-AUC: 0.9234
  TEST - Accuracy: 0.9795, F1: 0.5714, ROC-AUC: 0.8956

Registering model in MLflow Model Registry...
MLflow Run ID: abc123...

--- Experiment 5/5 ---
Training model: n_estimators=200, max_depth=20
  TRAIN - Accuracy: 0.9856, F1: 0.7123, ROC-AUC: 0.9567
  TEST - Accuracy: 0.9810, F1: 0.6667, ROC-AUC: 0.9234

============================================================
EXPERIMENT SWEEP COMPLETE!
============================================================
</code></pre>
<p>All 5 runs are now logged to MLflow with full metrics comparison available in the UI.</p>
<p>Now refresh the MLflow UI at <code>http://localhost:5000</code>. You'll see:</p>
<ol>
<li><p><strong>Experiments tab</strong>: Shows the "fraud-detection" experiment with 5 runs</p>
</li>
<li><p><strong>Each run</strong>: Shows parameters, metrics, and artifacts</p>
</li>
<li><p><strong>Compare</strong>: You can select multiple runs and compare them side-by-side</p>
</li>
<li><p><strong>Models tab</strong>: Shows "fraud-detection-model" with 5 versions</p>
</li>
</ol>
<p><strong>MLflow Tracking UI: Compare runs, metrics, and models at a glance</strong></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771396202929/c5a7d547-31b6-4783-acea-f4e9433d81ef.png" alt="Screenshot of the MLflow Tracking UI for comparing runs, metrics, and models" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">
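<p>The same comparison can be done in code, which is handy once you have more runs than fit on one screen. The sketch below is an illustration, not part of the original project: <code>rank_runs</code> and <code>fetch_runs</code> are hypothetical helper names, and it assumes the metric names <code>train_f1</code> and <code>test_f1</code> logged by <code>train_mlflow.py</code>. It sorts runs by test F1 and reports the train/test gap as a rough overfitting signal.</p>
<pre><code class="language-python"># Hypothetical helpers (not part of the tutorial project).
def rank_runs(runs, metric="test_f1", train_metric="train_f1"):
    """Sort runs by the test metric (best first) and attach the train/test gap."""
    ranked = []
    for run in runs:
        gap = run[train_metric] - run[metric]
        ranked.append({**run, "overfit_gap": round(gap, 4)})
    return sorted(ranked, key=lambda r: r[metric], reverse=True)


def fetch_runs(experiment_name="fraud-detection"):
    """Pull logged runs from the tracking server as plain dicts."""
    import mlflow  # imported lazily so rank_runs works without MLflow installed

    mlflow.set_tracking_uri("http://localhost:5000")
    # search_runs returns a pandas DataFrame; metrics appear as "metrics.NAME" columns
    df = mlflow.search_runs(experiment_names=[experiment_name])
    return [
        {
            "run_id": row["run_id"],
            "train_f1": row["metrics.train_f1"],
            "test_f1": row["metrics.test_f1"],
        }
        for _, row in df.iterrows()
    ]
</code></pre>
<p>With the tracking server running, <code>rank_runs(fetch_runs())[0]</code> gives the run with the highest test F1. A large <code>overfit_gap</code> on that run is a hint to prefer a shallower configuration.</p>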

<h3 id="heading-33-how-to-use-the-model-registry"><strong>3.3 How to Use the Model Registry</strong></h3>
<p>The <strong>Model Registry</strong> provides a central hub for managing model versions and deciding which one is deployed.</p>
<p>In the MLflow UI:</p>
<ol>
<li><p>Click the <strong>"Models"</strong> tab in the top navigation</p>
</li>
<li><p>Click <strong>"fraud-detection-model"</strong></p>
</li>
<li><p>You'll see all 5 versions listed with their metrics</p>
</li>
</ol>
<p><strong>Model Aliases:</strong> MLflow now uses <strong>aliases</strong> instead of stages. If you've seen older tutorials using "Staging" and "Production" stages, aliases are the newer, more flexible approach.</p>
<ul>
<li><p><strong>@champion</strong>: The production model serving live traffic</p>
</li>
<li><p><strong>@challenger</strong>: Candidate model being tested</p>
</li>
<li><p>You can also create custom aliases such as @baseline or @canary.</p>
</li>
</ul>
<p><strong>Assign an alias:</strong></p>
<ol>
<li><p>Open MLflow UI → Models → fraud-detection-model</p>
</li>
<li><p>Click on the version you want to promote</p>
</li>
<li><p>Click <strong>"Add Alias"</strong></p>
</li>
<li><p>Enter <code>champion</code> and save</p>
</li>
</ol>
<p>Now you've assigned the <code>@champion</code> alias to your best model. Your API will load whichever version has this alias, making rollbacks as simple as moving the alias to a different version.</p>
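<p>If you prefer to script this step instead of clicking through the UI, the alias can also be moved with the MLflow client. The following is a minimal sketch under stated assumptions: the helper names <code>champion_uri</code> and <code>promote</code> are hypothetical, and it assumes the tracking server from section 3.1 is running on <code>localhost:5000</code>.</p>
<pre><code class="language-python"># Hypothetical helpers (not part of the tutorial project).
def champion_uri(model_name="fraud-detection-model", alias="champion"):
    """Build the registry URI that resolves to whichever version holds the alias."""
    return f"models:/{model_name}@{alias}"


def promote(version, model_name="fraud-detection-model", alias="champion"):
    """Point the alias at the given version (rolling back works the same way)."""
    from mlflow import MlflowClient  # lazy import: needs MLflow and a running server

    client = MlflowClient(tracking_uri="http://localhost:5000")
    client.set_registered_model_alias(model_name, alias, str(version))
    # Read the alias back to confirm which version now holds it
    return client.get_model_version_by_alias(model_name, alias).version
</code></pre>
<p>For example, <code>promote(3)</code> would make version 3 the champion, and <code>promote(2)</code> afterwards would roll back to version 2. Code that loads <code>champion_uri()</code> does not need to change in either case.</p>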
<p><strong>Figure 3: MLflow Model Lifecycle — From Training to Production</strong></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771396081377/da67d89f-b82d-4189-8150-ecc142ed198a.png" alt="Diagram showing the MLflow model lifecycle for a fraud detection system: a model is trained with experiment parameters, logged to MLflow tracking with metrics and artifacts, registered in the model registry as multiple versions, assigned aliases such as champion and challenger, and served in production by loading the model through the champion alias. The diagram also shows rollback by moving the alias to an earlier version and restarting the API." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-34-update-api-to-load-from-registry"><strong>3.4 Update API to Load from Registry</strong></h3>
<p>Now let's update our API to load the champion model from the MLflow Registry instead of a pickle file. Create <code>src/serve_mlflow.py</code>:</p>
<pre><code class="language-python"># src/serve_mlflow.py
"""
Serve fraud detection model from MLflow Model Registry.

This version loads the @champion model from MLflow, which means:
- Serves whichever version holds the @champion alias when the API starts
- Can roll back by changing the @champion alias
- No manual file copying needed
"""
import mlflow
import mlflow.sklearn
import pickle
import os
from fastapi import FastAPI
from pydantic import BaseModel, Field

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")

print("Loading model from MLflow Model Registry...")

# Load the champion model from the registry
# This automatically gets whichever version has the @champion alias
try:
    model = mlflow.sklearn.load_model("models:/fraud-detection-model@champion")
    print("Successfully loaded champion model from MLflow!")
except Exception as e:
    print(f"Error loading from MLflow: {e}")
    print("Make sure you've assigned the @champion alias to a model in the MLflow UI")
    raise

# Load the encoder from the local encoder.pkl written by train_mlflow.py
# (the same file is logged as a run artifact; in a real system, load it from MLflow too)
with open("encoder.pkl", "rb") as f:
    encoder = pickle.load(f)
print("Encoder loaded successfully!")

app = FastAPI(
    title="Fraud Detection API (MLflow)",
    description="""
    Fraud detection API that loads models from MLflow Model Registry.
    
    This version always serves the model with the @champion alias.
    To update the model:
    1. Train a new model with train_mlflow.py
    2. Compare metrics in MLflow UI
    3. Assign the @champion alias to the best version
    4. Restart this API
    
    To roll back: Move the @champion alias to a previous version in MLflow UI.
    """,
    version="2.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount in dollars", example=150.00)
    hour: int = Field(..., description="Hour of the day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Monday, 6=Sunday)", example=3)
    merchant_category: str = Field(..., description="Type of merchant", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    model_source: str = "MLflow Production"

@app.post("/predict", response_model=PredictionResponse)
def predict(tx: Transaction):
    """Predict whether a transaction is fraudulent using the champion model."""
    data = tx.dict()
    
    try:
        data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    except ValueError:
        data["merchant_encoded"] = 0
    
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        model_source="MLflow Production"
    )

@app.get("/health")
def health():
    return {"status": "healthy", "model_source": "MLflow Registry"}

@app.get("/model-info")
def model_info():
    """Get information about the currently loaded model."""
    return {
        "registry": "MLflow",
        "model_name": "fraud-detection-model",
        "alias": "champion",
        "tracking_uri": "http://localhost:5000"
    }
</code></pre>
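<p>One caveat with the server above: it reads <code>encoder.pkl</code> from the working directory, so the encoder is not tied to the model version that holds the alias. A possible refinement is to fetch the encoder from the run that produced the champion. The sketch below is illustrative only: both helper names are hypothetical, and it assumes the registered version keeps a reference to the training run that logged <code>encoder.pkl</code>, as <code>train_mlflow.py</code> does.</p>
<pre><code class="language-python">import pickle


# Hypothetical helpers (not part of the tutorial project).
def encoder_artifact_uri(run_id, artifact_name="encoder.pkl"):
    """Build the runs:/ URI for an artifact logged at the root of a run."""
    return f"runs:/{run_id}/{artifact_name}"


def load_champion_encoder(model_name="fraud-detection-model", alias="champion"):
    """Load the encoder logged by the same run that produced the aliased model."""
    import mlflow.artifacts  # lazy import: needs MLflow and a running server
    from mlflow import MlflowClient

    mlflow.set_tracking_uri("http://localhost:5000")
    # Resolve the alias to a concrete model version, then to its training run
    version = MlflowClient().get_model_version_by_alias(model_name, alias)
    local_path = mlflow.artifacts.download_artifacts(
        artifact_uri=encoder_artifact_uri(version.run_id)
    )
    with open(local_path, "rb") as f:
        return pickle.load(f)
</code></pre>
<p>Replacing the <code>open("encoder.pkl", "rb")</code> block with <code>encoder = load_champion_encoder()</code> would keep the model and its encoder in step when the alias moves.</p>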
<p>Stop your old API (Ctrl+C) and start this new one:</p>
<pre><code class="language-bash">uvicorn src.serve_mlflow:app --reload --host 0.0.0.0 --port 8000
</code></pre>
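<p>To check that the new API responds, you can send it a transaction from Python. This is a small sketch using only the standard library. The helper names are hypothetical, and the payload fields mirror the <code>Transaction</code> model defined above.</p>
<pre><code class="language-python">import json
import urllib.request


# Hypothetical helpers (not part of the tutorial project).
def build_predict_request(tx, base_url="http://localhost:8000"):
    """Create a POST request for /predict carrying the transaction as JSON."""
    body = json.dumps(tx).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(tx, base_url="http://localhost:8000", timeout=5.0):
    """Send the transaction to the running API and return the decoded response."""
    req = build_predict_request(tx, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
</code></pre>
<p>With the API running, <code>predict({"amount": 150.0, "hour": 14, "day_of_week": 3, "merchant_category": "online"})</code> should return a dictionary containing <code>is_fraud</code>, <code>fraud_probability</code>, and <code>model_source</code>.</p>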
<p>Now deploying a new model is a <strong>controlled, auditable process</strong>:</p>
<ol>
<li><p><strong>Train new model</strong> → Automatically registered as new version</p>
</li>
<li><p><strong>Compare metrics</strong> → Use MLflow UI to compare with the current champion</p>
</li>
<li><p><strong>Set as champion</strong> → Assign @champion alias in MLflow UI</p>
</li>
<li><p><strong>Restart API</strong> → Loads the new champion model</p>
</li>
<li><p><strong>Roll back if needed</strong> → Move @champion alias to previous version</p>
</li>
</ol>
<p><strong>Checkpoint:</strong></p>
<ul>
<li><p>MLflow UI (<code>http://localhost:5000</code>) should show the "fraud-detection" experiment with 5 runs</p>
</li>
<li><p>The "Models" tab should show "fraud-detection-model" with 5 versions</p>
</li>
<li><p>One version should have the @champion alias</p>
</li>
<li><p>The API should load and serve the @champion model</p>
</li>
</ul>
<h2 id="heading-4-ensure-feature-consistency-with-feast"><strong>4. Ensure Feature Consistency with Feast</strong></h2>
<p>⚠️ <strong>First time hearing about feature stores?</strong> Don't worry.<br>You don't need to master every Feast detail on the first read.<br>Focus on <em>why</em> feature consistency matters — you can revisit the implementation later.<br><strong>Key takeaway:</strong> Training and serving must compute features the same way, or your model silently fails.</p>
<p><strong>What breaks without this:</strong> Your model sees different feature values in production than it saw during training. Accuracy drops silently. This is called "training-serving skew" and it's one of the most common causes of ML system failures.</p>
<p><strong>Training-serving skew</strong> arises when the data transformations applied at training time differ from those applied at inference time. It is subtle because even small discrepancies can severely degrade performance.</p>
<p><strong>Why This Matters:</strong> Imagine you're computing "average transaction amount per merchant category" as a feature. During training, you compute it using pandas in a notebook. During serving, you compute it using SQL in a different system. Small differences in how these computations handle edge cases (nulls, rounding, time windows) cause the model to see different features in production than it was trained on.</p>
<p>The result? <strong>Silent failures</strong> where accuracy drops but nothing errors out. Your model is making predictions based on features it's never seen before, and you have no idea.</p>
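<p>A toy example makes the problem concrete. The sketch below is illustrative and uses made-up numbers: it computes the same "average amount" feature in two ways that differ only in how a missing value is handled, which is the kind of edge case described above.</p>
<pre><code class="language-python"># Illustrative only: the function names and amounts are made up for this example.
def mean_skipping_missing(values):
    """Training-style average: missing amounts are left out of the mean."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def mean_missing_as_zero(values):
    """Serving-style average: missing amounts are counted as 0.0."""
    filled = [0.0 if v is None else v for v in values]
    return sum(filled) / len(filled)


amounts = [100.0, None, 50.0]
train_feature = mean_skipping_missing(amounts)  # 150 / 2
serve_feature = mean_missing_as_zero(amounts)   # 150 / 3
print(train_feature, serve_feature)             # prints: 75.0 50.0
</code></pre>
<p>Both functions run without error, yet the model would be trained on 75.0 and served 50.0 for the same merchant category. Nothing in the logs would flag the difference.</p>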
<p>In our naive implementation, we did handle one simple case: we saved the <code>LabelEncoder</code> to ensure <code>merchant_category</code> is encoded the same way in training and serving. But imagine if we had more complex feature engineering:</p>
<ul>
<li><p>Rolling averages over time windows</p>
</li>
<li><p>User-level aggregations</p>
</li>
<li><p>Cross-feature interactions</p>
</li>
<li><p>Real-time features from streaming data</p>
</li>
</ul>
<p>Keeping all of these consistent by hand quickly becomes impractical.</p>
<h3 id="heading-41-what-is-feast-and-why-use-it"><strong>4.1 What is Feast and Why Use It?</strong></h3>
<p>In production ML platforms, teams use a <strong>feature store</strong> to guarantee feature consistency between training and serving. <strong>Feast</strong> is one popular open-source option.</p>
<p>In this tutorial, we use Feast not because you <em>must</em>, but because it makes the training-serving contract explicit and teachable. The principles apply whether you use Feast, Tecton, Featureform, or a custom solution.</p>
<p>Feast provides:</p>
<table>
<thead>
<tr>
<th><strong>Capability</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Single source of truth</strong></td>
<td>Define features once, use everywhere</td>
</tr>
<tr>
<td><strong>Offline/online consistency</strong></td>
<td>Same features for training and serving</td>
</tr>
<tr>
<td><strong>Point-in-time correctness</strong></td>
<td>Prevents data leakage in training</td>
</tr>
<tr>
<td><strong>Low-latency serving</strong></td>
<td>Millisecond feature retrieval</td>
</tr>
<tr>
<td><strong>Feature versioning</strong></td>
<td>Track changes to feature definitions</td>
</tr>
</tbody></table>
<p><strong>How Feast works:</strong></p>
<ol>
<li><p><strong>Define features</strong> in Python code (feature definitions)</p>
</li>
<li><p><strong>Materialize features</strong> from your data sources to the online store</p>
</li>
<li><p><strong>Retrieve features</strong> using the same API for both training (offline) and serving (online)</p>
</li>
</ol>
<p>This ensures that training and serving use <strong>exactly the same feature computation logic</strong>.</p>
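<p>Point-in-time correctness, listed in the table above, is easier to see with a toy lookup. The sketch below is a simplified illustration in plain Python, not Feast's implementation: the function name and the fraud-rate history are hypothetical, and it ignores details such as TTL. For each event, it returns the most recent feature value recorded at or before the event time, so values from the future are never used.</p>
<pre><code class="language-python">from bisect import bisect_right
from datetime import datetime


# Hypothetical helper illustrating the idea; Feast performs this join for you.
def feature_as_of(history, event_time):
    """Return the latest feature value recorded at or before event_time.

    history: list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None  # no feature value existed yet at event_time
    return history[idx - 1][1]


# Made-up history of the fraud_rate feature for one merchant category
fraud_rate_history = [
    (datetime(2024, 1, 1), 0.010),
    (datetime(2024, 2, 1), 0.048),
]
</code></pre>
<p>A transaction from 15 January 2024 gets <code>0.010</code>, even though a newer value exists by the time you train. Using <code>0.048</code> for that row would leak information the model could not have had at prediction time.</p>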
<h3 id="heading-42-install-and-initialize-feast"><strong>4.2 Install and Initialize Feast</strong></h3>
<p>We already installed Feast via <code>requirements.txt</code>. Now let's initialize a feature repository.</p>
<pre><code class="language-bash"># Navigate to the feature_repo directory
cd feature_repo

# Initialize Feast (this creates template files)
feast init . --minimal

# Go back to project root
cd ..
</code></pre>
<p>This creates the basic Feast structure:</p>
<pre><code class="language-python">feature_repo/
├── feature_store.yaml    # Feast configuration
└── __init__.py
</code></pre>
<h3 id="heading-43-define-feature-definitions"><strong>4.3 Define Feature Definitions</strong></h3>
<p>First, let's create the Feast configuration file:</p>
<pre><code class="language-yaml"># feature_repo/feature_store.yaml
project: fraud_detection
registry: ../data/registry.db
provider: local
online_store:
  type: sqlite
  path: ../data/online_store.db
offline_store:
  type: file
entity_key_serialization_version: 3
</code></pre>
<p>This configuration:</p>
<ul>
<li><p>Names our project "fraud_detection"</p>
</li>
<li><p>Uses SQLite for the online store (for production, you'd use Redis or DynamoDB)</p>
</li>
<li><p>Uses local files for the offline store (for production, you'd use BigQuery or Snowflake)</p>
</li>
</ul>
<p>Now create the feature definitions:</p>
<pre><code class="language-python"># feature_repo/features.py
"""
Feast feature definitions for fraud detection.

This file defines:
- Entities: The keys we use to look up features (merchant_category)
- Data Sources: Where the raw feature data comes from (Parquet file)
- Feature Views: The features themselves and their schemas

The key insight: These definitions are the SINGLE SOURCE OF TRUTH.
Both training and serving use these exact definitions.
"""
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# =============================================================================
# ENTITIES
# =============================================================================
# An entity is the "key" we use to look up features.
# For merchant-level features, the entity is merchant_category.

merchant = Entity(
    name="merchant_category",
    description="Merchant category for the transaction (for example, 'online', 'grocery')",
    value_type=ValueType.STRING,
)

# =============================================================================
# DATA SOURCES
# =============================================================================
# Data sources tell Feast where to find the raw feature data.
# For local development, we use a Parquet file.
# For production, this could be BigQuery, Snowflake, S3, etc.

merchant_stats_source = FileSource(
    name="merchant_stats_source",
    path="../data/merchant_features.parquet",  # We'll create this file
    timestamp_field="event_timestamp",       # Required for point-in-time joins
)

# =============================================================================
# FEATURE VIEWS
# =============================================================================
# A Feature View defines a group of related features.
# It specifies:
# - Which entity the features are for
# - The schema (names and types of features)
# - Where the data comes from
# - How long features are valid (TTL)

merchant_stats_fv = FeatureView(
    name="merchant_stats",
    description="Aggregated statistics per merchant category",
    entities=[merchant],
    ttl=timedelta(days=7),  # Features are valid for 7 days
    schema=[
        Field(name="avg_amount", dtype=Float32, description="Average transaction amount"),
        Field(name="transaction_count", dtype=Int64, description="Number of transactions"),
        Field(name="fraud_rate", dtype=Float32, description="Historical fraud rate"),
    ],
    source=merchant_stats_source,
    online=True,  # Enable online serving (low-latency retrieval)
)
</code></pre>
<h3 id="heading-44-materialize-features-to-online-store"><strong>4.4 Materialize Features to Online Store</strong></h3>
<p>Now we need to:</p>
<ol>
<li><p>Compute the features from our training data</p>
</li>
<li><p>Save them in a format Feast can read</p>
</li>
<li><p>Apply the Feast definitions</p>
</li>
<li><p>Materialize features to the online store</p>
</li>
</ol>
<p>Create <code>src/prepare_feast_features.py</code>:</p>
<pre><code class="language-python"># src/prepare_feast_features.py
"""
Prepare feature data for Feast.

This script:
1. Computes aggregated merchant features from training data
2. Saves them in Parquet format (Feast's offline store format)
3. Applies Feast feature definitions
4. Materializes features to the online store for low-latency serving

Run this whenever your training data changes or you want to refresh features.
"""
import pandas as pd
import numpy as np
from datetime import datetime
import subprocess
import os

def compute_merchant_features(df: pd.DataFrame) -&gt; pd.DataFrame:
    """
    Compute aggregated features by merchant category.
    
    THIS IS THE SINGLE SOURCE OF TRUTH FOR FEATURE COMPUTATION.
    
    Both training and serving will use features computed by this exact logic.
    Any change here automatically applies everywhere.
    
    Args:
        df: Transaction DataFrame with columns: amount, merchant_category, is_fraud
        
    Returns:
        DataFrame with computed features per merchant category
    """
    print("Computing merchant-level features...")
    
    # Group by merchant category and compute aggregates.
    # Named aggregation yields flat, explicitly named columns.
    stats = df.groupby('merchant_category').agg(
        avg_amount=('amount', 'mean'),
        transaction_count=('amount', 'count'),
        fraud_rate=('is_fraud', 'mean'),
    ).reset_index()
    
    # Add timestamp for Feast (required for point-in-time correct joins)
    stats['event_timestamp'] = datetime.now()
    
    # Convert types to match Feast schema
    stats['avg_amount'] = stats['avg_amount'].astype('float32')
    stats['transaction_count'] = stats['transaction_count'].astype('int64')
    stats['fraud_rate'] = stats['fraud_rate'].astype('float32')
    
    return stats

def main():
    print("="*60)
    print("FEAST FEATURE PREPARATION")
    print("="*60)
    
    # Load training data
    print("\n1. Loading training data...")
    train_df = pd.read_csv('data/train.csv')
    print(f"   Loaded {len(train_df):,} transactions")
    
    # Compute merchant features
    print("\n2. Computing merchant features...")
    merchant_features = compute_merchant_features(train_df)
    
    print("\n   Computed features:")
    print(merchant_features.to_string(index=False))
    
    # Save as Parquet (required format for Feast file source)
    print("\n3. Saving features to Parquet...")
    os.makedirs('data', exist_ok=True)
    output_path = 'data/merchant_features.parquet'
    merchant_features.to_parquet(output_path, index=False)
    print(f"   Saved to {output_path}")
    
    # Apply Feast feature definitions
    print("\n4. Applying Feast feature definitions...")
    try:
        result = subprocess.run(
            ['feast', 'apply'],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Feature definitions applied successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error applying Feast: {e.stderr}")
        raise
    
    # Materialize features to online store
    print("\n5. Materializing features to online store...")
    try:
        result = subprocess.run(
            ['feast', 'materialize-incremental', datetime.now().isoformat()],
            cwd='feature_repo',
            capture_output=True,
            text=True,
            check=True
        )
        print("   Features materialized successfully!")
        if result.stdout:
            print(f"   {result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"   Error materializing: {e.stderr}")
        raise
    
    print("\n" + "="*60)
    print("FEAST FEATURE PREPARATION COMPLETE!")
    print("="*60)
    print("\nYou can now:")
    print("  - Retrieve features for training: get_training_features()")
    print("  - Retrieve features for serving: get_online_features()")
    print("  - List registered feature views: feast feature-views list")

if __name__ == "__main__":
    main()
</code></pre>
<p>Run the feature preparation:</p>
<pre><code class="language-bash">python src/prepare_feast_features.py
</code></pre>
<p>You should see output along these lines (abridged here; the exact formatting and numbers will differ):</p>
<pre><code class="language-python">============================================================
FEAST FEATURE PREPARATION
============================================================

1. Loading training data... 8,000 transactions
2. Computing merchant features...
   grocery: avg=$31.24, fraud_rate=0.85%
   online: avg=$98.45, fraud_rate=4.87%
   restaurant: avg=$28.12, fraud_rate=0.50%
   retail: avg=$45.67, fraud_rate=1.02%
   travel: avg=$156.23, fraud_rate=4.18%
3. Saving to data/merchant_features.parquet ✓
4. Applying Feast definitions... ✓
5. Materializing to online store... ✓

FEAST FEATURE PREPARATION COMPLETE!
</code></pre>
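<p>Both Feast CLI steps in <code>prepare_feast_features.py</code> follow the same pattern: run a command, print its output, and surface <code>stderr</code> on failure. A minimal sketch of that pattern as a reusable helper might look like this. The name <code>run_step</code> is hypothetical and not part of the tutorial's code, and the demo calls the Python interpreter instead of <code>feast</code> so it runs without a feature repo:</p>
<pre><code class="language-python">import subprocess
import sys

def run_step(args, cwd=None):
    """Run a CLI command, return its stdout, and re-raise on failure."""
    try:
        result = subprocess.run(
            args,
            cwd=cwd,
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    except subprocess.CalledProcessError as e:
        print(f"   Command {args} failed: {e.stderr}")
        raise
    return result.stdout

# Demo with a harmless command; in the script this could be
# run_step(['feast', 'apply'], cwd='feature_repo').
output = run_step([sys.executable, "-c", "print('ok')"])
print(output.strip())
</code></pre>
<p>Keeping the <code>raise</code> matters: the script should stop if <code>feast apply</code> fails, because materializing against a stale registry would hide the real error.</p>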
<h3 id="heading-45-retrieve-features-for-training-and-serving"><strong>4.5 Retrieve Features for Training and Serving</strong></h3>
<p>Now let's create utilities to retrieve features consistently for both training and serving:</p>
<pre><code class="language-python"># src/feast_features.py
"""
Feast feature retrieval for training and serving.

This module provides functions to retrieve features from Feast:
- get_training_features(): For offline training (historical features)
- get_online_features(): For real-time serving (low-latency)

IMPORTANT: Both functions use the SAME feature definitions,
ensuring consistency between training and serving.
"""
import pandas as pd
from feast import FeatureStore
from datetime import datetime

# Initialize Feast store (points to our feature_repo)
store = FeatureStore(repo_path="feature_repo")

def get_training_features(df: pd.DataFrame) -&gt; pd.DataFrame:
    """
    Get features for training using Feast's offline store.
    
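    Feast performs a point-in-time correct join to prevent data leakage:
    for each row it returns the feature values that were valid as of that
    row's event_timestamp, so a model never trains on future data.
    This demo stamps every row with the current time (see the note in the
    code below), so it always retrieves the latest feature values.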
    
    Args:
        df: DataFrame with at least 'merchant_category' column
        
    Returns:
        DataFrame with original columns plus Feast features
    """
    print("Retrieving training features from Feast offline store...")
    
    # Prepare entity dataframe with timestamps
    # Each row needs: entity key(s) + event_timestamp
    entity_df = df[['merchant_category']].copy()
    entity_df['event_timestamp'] = datetime.now()  # See note below
    entity_df = entity_df.drop_duplicates()
    
    # ⚠️ Simplification: For clarity, we use the current timestamp here.
    # In real systems, this would be the actual event time of each transaction.
    
    # Retrieve historical features
    # Feast handles the point-in-time join automatically
    training_data = store.get_historical_features(
        entity_df=entity_df,
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
    ).to_df()
    
    # Merge features back with original dataframe
    result = df.merge(
        training_data[['merchant_category', 'avg_amount', 'transaction_count', 'fraud_rate']],
        on='merchant_category',
        how='left'
    )
    
    print(f"Retrieved features for {len(entity_df)} unique merchants")
    return result

def get_online_features(merchant_category: str) -&gt; dict:
    """
    Get features for real-time serving using Feast's online store.
    
    This is optimized for low-latency retrieval (milliseconds).
    Use this in your prediction API for real-time inference.
    
    Args:
        merchant_category: The merchant category to look up
        
    Returns:
        Dictionary with feature names and values
    """
    # Retrieve from online store (low-latency)
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": merchant_category}],
    ).to_dict()
    
    # Format the response
    return {
        'merchant_avg_amount': feature_vector['avg_amount'][0],
        'merchant_tx_count': feature_vector['transaction_count'][0],
        'merchant_fraud_rate': feature_vector['fraud_rate'][0],
    }

def get_online_features_batch(merchant_categories: list) -&gt; pd.DataFrame:
    """
    Get features for multiple merchants at once (batch serving).
    
    More efficient than calling get_online_features() in a loop.
    
    Args:
        merchant_categories: List of merchant categories to look up
        
    Returns:
        DataFrame with features for each merchant
    """
    feature_vector = store.get_online_features(
        features=[
            "merchant_stats:avg_amount",
            "merchant_stats:transaction_count",
            "merchant_stats:fraud_rate",
        ],
        entity_rows=[{"merchant_category": mc} for mc in merchant_categories],
    ).to_df()
    
    return feature_vector

if __name__ == "__main__":
    # Test the feature retrieval functions
    print("="*60)
    print("TESTING FEAST FEATURE RETRIEVAL")
    print("="*60)
    
    # Test offline retrieval (for training)
    print("\n1. Testing OFFLINE feature retrieval (for training)...")
    train_df = pd.read_csv('data/train.csv').head(10)
    enriched = get_training_features(train_df)
    print("\n   Sample enriched training data:")
    print(enriched[['amount', 'merchant_category', 'avg_amount', 'fraud_rate']].head())
    
    # Test online retrieval (for serving)
    print("\n2. Testing ONLINE feature retrieval (for serving)...")
    for category in ['online', 'grocery', 'travel', 'restaurant', 'retail']:
        features = get_online_features(category)
        print(f"   {category}: avg_amount=${features['merchant_avg_amount']:.2f}, "
              f"fraud_rate={features['merchant_fraud_rate']:.2%}")
    
    # Test batch retrieval
    print("\n3. Testing BATCH online retrieval...")
    batch_features = get_online_features_batch(['online', 'grocery', 'travel'])
    print(batch_features)
    
    print("\n" + "="*60)
    print("FEAST FEATURE RETRIEVAL TEST COMPLETE!")
    print("="*60)
</code></pre>
<p>Test the feature retrieval:</p>
<pre><code class="language-python">python src/feast_features.py
</code></pre>
<p>You should see output similar to this (abbreviated):</p>
<pre><code class="language-python">============================================================
TESTING FEAST FEATURE RETRIEVAL
============================================================

1. Testing OFFLINE feature retrieval (for training)...
Retrieving training features from Feast offline store...
Retrieved features for 5 unique merchants

   Sample enriched training data:
   amount merchant_category  avg_amount  fraud_rate
    45.23           grocery       31.24      0.0085
   123.45            online       98.45      0.0487
    ...

2. Testing ONLINE feature retrieval (for serving)...
   online: avg_amount=$98.45, fraud_rate=4.87%
   grocery: avg_amount=$31.24, fraud_rate=0.85%
   travel: avg_amount=$156.23, fraud_rate=4.18%
   restaurant: avg_amount=$28.12, fraud_rate=0.50%
   retail: avg_amount=$45.67, fraud_rate=1.02%

3. Testing BATCH online retrieval...
  merchant_category  avg_amount  transaction_count  fraud_rate
               online       98.45               1234      0.0487
              grocery       31.24               2345      0.0085
               travel      156.23                478      0.0418
</code></pre>
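<p>The point-in-time join mentioned in <code>get_training_features()</code> can be pictured as an "as-of" lookup: for each event, take the most recent feature value recorded at or before the event time. Feast does this inside its offline store. The pure-Python sketch below only illustrates the idea; <code>HISTORY</code> and <code>feature_as_of</code> are hypothetical names, and the values are made up:</p>
<pre><code class="language-python">from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one merchant category, sorted by time:
# (timestamp at which the value became valid, fraud_rate)
HISTORY = [
    (datetime(2024, 1, 1), 0.03),
    (datetime(2024, 1, 10), 0.05),
]

def feature_as_of(history, event_time):
    """Return the latest value recorded at or before event_time, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    if idx == 0:
        return None  # no feature value existed yet at that time
    return history[idx - 1][1]

# A transaction on Jan 5 sees the Jan 1 value, not the later Jan 10 one.
print(feature_as_of(HISTORY, datetime(2024, 1, 5)))
# A transaction on Jan 12 sees the Jan 10 value.
print(feature_as_of(HISTORY, datetime(2024, 1, 12)))
</code></pre>
<p>This is why the entity dataframe should carry each transaction's real event time in a production setup: with a single "now" timestamp, every row resolves to the newest value.</p>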
<h3 id="heading-why-feast-over-custom-code"><strong>Why Feast Over Custom Code?</strong></h3>
<table>
<thead>
<tr>
<th><strong>Aspect</strong></th>
<th><strong>Custom Code</strong></th>
<th><strong>Feast</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Consistency</strong></td>
<td>Manual effort to keep in sync</td>
<td>Automatic - same definitions everywhere</td>
</tr>
<tr>
<td><strong>Point-in-time correctness</strong></td>
<td>Must implement yourself</td>
<td>Built-in</td>
</tr>
<tr>
<td><strong>Online serving</strong></td>
<td>Must build your own cache</td>
<td>Built-in online store</td>
</tr>
<tr>
<td><strong>Feature versioning</strong></td>
<td>Not supported</td>
<td>Built-in</td>
</tr>
<tr>
<td><strong>Scalability</strong></td>
<td>Limited</td>
<td>Production-ready (BigQuery, Redis, etc.)</td>
</tr>
<tr>
<td><strong>Team collaboration</strong></td>
<td>Difficult</td>
<td>Feature registry with documentation</td>
</tr>
<tr>
<td><strong>Monitoring</strong></td>
<td>Manual</td>
<td>Built-in feature statistics</td>
</tr>
</tbody></table>
<p>💡 <strong>Mental Model</strong>: Treat feature definitions like database schemas.<br>You wouldn't compute a column one way in your application and a different way in your reports. Features deserve the same discipline — define once, use everywhere.</p>
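<p>One small way to apply "define once, use everywhere" in <code>feast_features.py</code> itself is to keep the feature references in a single constant instead of repeating the list in each function. This is a sketch, not the tutorial's code: <code>MERCHANT_FEATURES</code> and <code>feature_names</code> are hypothetical names.</p>
<pre><code class="language-python"># A single source of truth for the feature references used in this section.
MERCHANT_FEATURES = [
    "merchant_stats:avg_amount",
    "merchant_stats:transaction_count",
    "merchant_stats:fraud_rate",
]

def feature_names(refs):
    """Strip the 'view:' prefix to get the bare feature names."""
    return [ref.split(":", 1)[1] for ref in refs]

print(feature_names(MERCHANT_FEATURES))
</code></pre>
<p>Passing <code>MERCHANT_FEATURES</code> to both <code>store.get_historical_features()</code> and <code>store.get_online_features()</code> would keep the training and serving paths from drifting apart when a feature is added or renamed.</p>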
<p><strong>Checkpoint:</strong> After running <code>prepare_feast_features.py</code>, you should have:</p>
<ul>
<li><p><code>data/merchant_features.parquet</code> (computed features)</p>
</li>
<li><p><code>data/registry.db</code> (Feast registry)</p>
</li>
<li><p><code>data/online_store.db</code> (SQLite online store)</p>
</li>
</ul>
<p>Running <code>python src/feast_features.py</code> should successfully retrieve features for all merchant categories.</p>
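<p>If you prefer to verify the checkpoint from code, a short script can report which files are missing. This is a minimal sketch: <code>missing_artifacts</code> is a hypothetical helper, and it assumes the paths are relative to the directory you pass in (adjust <code>base_dir</code> if your Feast files live elsewhere):</p>
<pre><code class="language-python">from pathlib import Path

# Paths from the checkpoint above; assumed to be relative to base_dir.
EXPECTED_ARTIFACTS = [
    "data/merchant_features.parquet",
    "data/registry.db",
    "data/online_store.db",
]

def missing_artifacts(base_dir="."):
    """Return the expected files that do not exist under base_dir."""
    base = Path(base_dir)
    return [path for path in EXPECTED_ARTIFACTS if not (base / path).exists()]

missing = missing_artifacts()
print("All artifacts present" if not missing else f"Missing: {missing}")
</code></pre>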
<h2 id="heading-5-add-data-validation-with-great-expectations"><strong>5. Add Data Validation with Great Expectations</strong></h2>
<p><strong>What breaks without this:</strong> Your API accepts garbage input (negative amounts, invalid hours) and returns meaningless predictions. Worse, you have no idea it happened.</p>
<p>Recall that our API currently trusts input blindly. We saw how garbage data produces a prediction with no warning. <strong>Great Expectations</strong> is an open-source tool for data quality testing – defining rules (expectations) and testing data against them.</p>
<p><strong>Why This Matters:</strong> Data validation acts as a gatekeeper. Bad data is rejected <strong>before</strong> it can harm predictions. As the saying goes, "Garbage in, garbage out" – feeding unreliable data yields unreliable results. With validation, we transform this to "Garbage in, <strong>error out</strong>" – much better for debugging and reliability.</p>
<h3 id="heading-51-define-expectations"><strong>5.1 Define Expectations</strong></h3>
<p>What are reasonable expectations for our transaction data? Based on domain knowledge:</p>
<table>
<thead>
<tr>
<th><strong>Field</strong></th>
<th><strong>Expectation</strong></th>
<th><strong>Reason</strong></th>
</tr>
</thead>
<tbody><tr>
<td><code>amount</code></td>
<td>Positive (&gt; 0)</td>
<td>Negative transactions don't make sense</td>
</tr>
<tr>
<td><code>amount</code></td>
<td>At most $50,000</td>
<td>Extremely large amounts are outliers/errors</td>
</tr>
<tr>
<td><code>hour</code></td>
<td>0-23 inclusive</td>
<td>Valid hours in a day</td>
</tr>
<tr>
<td><code>day_of_week</code></td>
<td>0-6 inclusive</td>
<td>Valid days (Mon=0, Sun=6)</td>
</tr>
<tr>
<td><code>merchant_category</code></td>
<td>One of known categories</td>
<td>Must match training data</td>
</tr>
<tr>
<td>All fields</td>
<td>Not null</td>
<td>Required for prediction</td>
</tr>
</tbody></table>
<p>Create <code>src/data_validation.py</code>:</p>
<pre><code class="language-python"># src/data_validation.py
"""
Data validation for fraud detection.

This module provides functions to validate input data BEFORE making predictions.
Invalid data is rejected with clear error messages.

The key insight: It's better to reject bad input than to make garbage predictions.
"""
import pandas as pd
from typing import Dict, List, Any, Optional

# Define the valid merchant categories (must match training data!)
VALID_CATEGORIES = ["grocery", "restaurant", "retail", "online", "travel"]

def validate_transaction(data: Dict[str, Any]) -&gt; Dict[str, Any]:
    """
    Validate a single transaction for fraud prediction.
    
    Checks all business rules and data quality requirements.
    Returns a dictionary with 'valid' (bool) and 'errors' (list).
    
    Args:
        data: Dictionary with transaction fields
        
    Returns:
        {"valid": bool, "errors": list of error messages}
        
    Example:
        &gt;&gt;&gt; validate_transaction({"amount": -100, "hour": 25,
        ...                       "day_of_week": 3, "merchant_category": "grocery"})
        {'valid': False, 'errors': ['amount must be positive', 'hour must be between 0 and 23 (got 25)']}
    """
    errors = []
    
    # ==========================================================================
    # Amount Validation
    # ==========================================================================
    amount = data.get("amount")
    if amount is None:
        errors.append("amount is required")
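    elif isinstance(amount, bool) or not isinstance(amount, (int, float)):
        # bool is a subclass of int, so reject True/False explicitly
        errors.append(f"amount must be a number (got {type(amount).__name__})")
    elif amount != amount:
        # NaN is the only value that is not equal to itself; without this
        # check it would slip past both range comparisons below
        errors.append("amount must be a valid number (got NaN)")
    elif amount &lt;= 0:
        errors.append("amount must be positive")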
    elif amount &gt; 50000:
        errors.append(f"amount exceeds maximum allowed value of $50,000 (got ${amount:,.2f})")
    
    # ==========================================================================
    # Hour Validation
    # ==========================================================================
    hour = data.get("hour")
    if hour is None:
        errors.append("hour is required")
    elif isinstance(hour, bool) or not isinstance(hour, int):
        errors.append(f"hour must be an integer (got {type(hour).__name__})")
    elif not (0 &lt;= hour &lt;= 23):
        errors.append(f"hour must be between 0 and 23 (got {hour})")
    
    # ==========================================================================
    # Day of Week Validation
    # ==========================================================================
    day = data.get("day_of_week")
    if day is None:
        errors.append("day_of_week is required")
    elif isinstance(day, bool) or not isinstance(day, int):
        errors.append(f"day_of_week must be an integer (got {type(day).__name__})")
    elif not (0 &lt;= day &lt;= 6):
        errors.append(f"day_of_week must be between 0 (Monday) and 6 (Sunday) (got {day})")
    
    # ==========================================================================
    # Merchant Category Validation
    # ==========================================================================
    category = data.get("merchant_category")
    if category is None:
        errors.append("merchant_category is required")
    elif not isinstance(category, str):
        errors.append(f"merchant_category must be a string (got {type(category).__name__})")
    elif category not in VALID_CATEGORIES:
        errors.append(
            f"merchant_category must be one of {VALID_CATEGORIES} (got '{category}')"
        )
    
    return {
        "valid": len(errors) == 0,
        "errors": errors
    }

def validate_batch(df: pd.DataFrame) -&gt; Dict[str, Any]:
    """
    Validate a batch of transactions using Great Expectations.
    
    This is useful for validating training data or batch prediction requests.
    Uses Great Expectations for more sophisticated validation.
    
    Args:
        df: DataFrame with transaction data
        
    Returns:
        Dictionary with validation results
    """
    import great_expectations as gx
    
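    # Wrap the DataFrame in a Great Expectations dataset.
    # Note: gx.from_pandas() belongs to the legacy dataset API, which is not
    # available in Great Expectations 1.0 and later, so this function needs
    # an older release installed.
    ge_df = gx.from_pandas(df)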
    
    results = []
    
    # Amount expectations
    r = ge_df.expect_column_values_to_be_between(
        'amount', min_value=0.01, max_value=50000, mostly=0.99
    )
    results.append(('amount_range', r.success, r.result))
    
    # Hour expectations
    r = ge_df.expect_column_values_to_be_between(
        'hour', min_value=0, max_value=23
    )
    results.append(('hour_range', r.success, r.result))
    
    # Day of week expectations
    r = ge_df.expect_column_values_to_be_between(
        'day_of_week', min_value=0, max_value=6
    )
    results.append(('day_range', r.success, r.result))
    
    # Merchant category expectations
    r = ge_df.expect_column_values_to_be_in_set(
        'merchant_category', VALID_CATEGORIES
    )
    results.append(('category_valid', r.success, r.result))
    
    # No nulls in critical fields
    for col in ['amount', 'hour', 'day_of_week', 'merchant_category']:
        r = ge_df.expect_column_values_to_not_be_null(col)
        results.append((f'{col}_not_null', r.success, r.result))
    
    # Summarize results
    passed = sum(1 for _, success, _ in results if success)
    total = len(results)
    
    return {
        'success': passed == total,
        'passed': passed,
        'total': total,
        'pass_rate': passed / total,
        'details': {name: {'passed': success, 'result': result} 
                   for name, success, result in results}
    }

if __name__ == "__main__":
    print("="*60)
    print("TESTING DATA VALIDATION")
    print("="*60)
    
    # Test single transaction validation
    print("\n1. Single Transaction Validation")
    print("-"*40)
    
    test_cases = [
        {
            "name": "Valid transaction",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Negative amount",
            "data": {"amount": -100.0, "hour": 14, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Invalid hour",
            "data": {"amount": 50.0, "hour": 25, "day_of_week": 3, "merchant_category": "grocery"}
        },
        {
            "name": "Unknown merchant",
            "data": {"amount": 50.0, "hour": 14, "day_of_week": 3, "merchant_category": "unknown"}
        },
        {
            "name": "Everything wrong",
            "data": {"amount": -999, "hour": 99, "day_of_week": 15, "merchant_category": "fake"}
        },
    ]
    
    for tc in test_cases:
        result = validate_transaction(tc["data"])
        status = "PASS" if result["valid"] else "FAIL"
        print(f"\n{tc['name']}: {status}")
        if result["errors"]:
            for error in result["errors"]:
                print(f"  - {error}")
    
    # Test batch validation
    print("\n\n2. Batch Validation with Great Expectations")
    print("-"*40)
    
    train_df = pd.read_csv('data/train.csv')
    results = validate_batch(train_df)
    
    print(f"\nTraining data validation: {results['passed']}/{results['total']} checks passed")
    print(f"Pass rate: {results['pass_rate']:.1%}")
    
    if not results['success']:
        print("\nFailed checks:")
        for name, detail in results['details'].items():
            if not detail['passed']:
                print(f"  - {name}")
</code></pre>
<h3 id="heading-when-to-use-which-validation-approach"><strong>When to Use Which Validation Approach</strong></h3>
<table>
<thead>
<tr>
<th><strong>Approach</strong></th>
<th><strong>Use Case</strong></th>
<th><strong>Latency</strong></th>
<th><strong>When to Use</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Custom Python</strong> (<code>validate_transaction</code>)</td>
<td>Real-time API requests</td>
<td>&lt;1ms</td>
<td>Every prediction request</td>
</tr>
<tr>
<td><strong>Great Expectations</strong></td>
<td>Batch data quality</td>
<td>Seconds</td>
<td>Training data, periodic audits, CI/CD</td>
</tr>
</tbody></table>
<p>We use <strong>both</strong> in this tutorial because they serve different purposes:</p>
<ul>
<li><p>Custom validation is your <strong>runtime gatekeeper</strong> — fast enough for every request</p>
</li>
<li><p>Great Expectations is your <strong>batch auditor</strong> — thorough checks on datasets</p>
</li>
</ul>
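<p>The two approaches can also meet in the middle. For a quick audit of a small batch, the per-request validator can be applied row by row and its error messages tallied. The sketch below assumes a validator that returns the same <code>{"valid": ..., "errors": [...]}</code> shape as <code>validate_transaction</code>; <code>tally_errors</code> and <code>demo_validate</code> are hypothetical names, and the stand-in validator has a single rule so the example runs on its own:</p>
<pre><code class="language-python">from collections import Counter

def tally_errors(rows, validate):
    """Count how often each error message occurs across a batch of rows."""
    counts = Counter()
    for row in rows:
        counts.update(validate(row)["errors"])
    return counts

def demo_validate(row):
    # Stand-in with a single rule so the sketch is self-contained;
    # in the project you would pass validate_transaction instead.
    errors = [] if row["hour"] in range(24) else ["hour must be between 0 and 23"]
    return {"valid": not errors, "errors": errors}

rows = [{"hour": 14}, {"hour": 25}, {"hour": 99}]
print(tally_errors(rows, demo_validate).most_common())
</code></pre>
<p>Because the validator's messages include the offending value, you may want to tally by rule name instead when auditing large batches; this sketch keeps the messages as-is for simplicity.</p>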
<h3 id="heading-52-integrate-validation-into-fastapi"><strong>5.2 Integrate Validation into FastAPI</strong></h3>
<p>Now let's update our API to reject invalid input with clear error messages:</p>
<pre><code class="language-python"># src/serve_validated.py
"""
Serve fraud detection model with input validation.

This version adds data validation BEFORE making predictions:
- Invalid inputs are rejected with HTTP 400 and clear error messages
- Valid inputs are processed and predictions returned

This is much safer than the naive version which accepted garbage.
"""
import pickle
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.data_validation import validate_transaction

# Load model
with open("models/model.pkl", "rb") as f:
    model, encoder = pickle.load(f)

app = FastAPI(
    title="Fraud Detection API (Validated)",
    description="""
    Fraud detection API with input validation.
    
    All inputs are validated before prediction:
    - amount: Must be positive and at most $50,000
    - hour: Must be 0-23
    - day_of_week: Must be 0-6
    - merchant_category: Must be one of: grocery, restaurant, retail, online, travel
    
    Invalid inputs return HTTP 400 with detailed error messages.
    """,
    version="3.0.0"
)

class Transaction(BaseModel):
    amount: float = Field(..., description="Transaction amount (must be positive)", example=150.00)
    hour: int = Field(..., description="Hour of day (0-23)", example=14)
    day_of_week: int = Field(..., description="Day of week (0=Mon, 6=Sun)", example=3)
    merchant_category: str = Field(..., description="Merchant type", example="online")

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    validation_passed: bool = True

class ValidationErrorResponse(BaseModel):
    detail: dict

@app.post("/predict", response_model=PredictionResponse, responses={400: {"model": ValidationErrorResponse}})
def predict(tx: Transaction):
    """
    Predict whether a transaction is fraudulent.
    
    Input is validated before prediction. Invalid inputs return HTTP 400.
    """
    data = tx.dict()
    
    # VALIDATE INPUT BEFORE MAKING PREDICTION
    validation = validate_transaction(data)
    
    if not validation["valid"]:
        raise HTTPException(
            status_code=400,
            detail={
                "message": "Validation failed",
                "errors": validation["errors"],
                "input": data
            }
        )
    
    # Input is valid - make prediction
    data["merchant_encoded"] = encoder.transform([data["merchant_category"]])[0]
    X = [[data["amount"], data["hour"], data["day_of_week"], data["merchant_encoded"]]]
    
    pred = model.predict(X)[0]
    prob = model.predict_proba(X)[0][1]
    
    return PredictionResponse(
        is_fraud=bool(pred),
        fraud_probability=round(float(prob), 4),
        validation_passed=True
    )

@app.get("/health")
def health():
    return {"status": "healthy", "validation": "enabled"}
</code></pre>
<p>Start the validated API:</p>
<pre><code class="language-python">uvicorn src.serve_validated:app --reload --host 0.0.0.0 --port 8000
</code></pre>
<p>Now test with bad data:</p>
<pre><code class="language-python">curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"amount": -500, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}'
</code></pre>
<p>Response (HTTP 400):</p>
<pre><code class="language-python">{
  "detail": {
    "message": "Validation failed",
    "errors": [
      "amount must be positive",
      "hour must be between 0 and 23 (got 25)",
      "day_of_week must be between 0 (Monday) and 6 (Sunday) (got 10)",
      "merchant_category must be one of ['grocery', 'restaurant', 'retail', 'online', 'travel'] (got 'fake')"
    ],
    "input": {"amount": -500.0, "hour": 25, "day_of_week": 10, "merchant_category": "fake"}
  }
}
</code></pre>
<p><strong>This is a huge improvement!</strong> Instead of silently accepting garbage and returning meaningless predictions, we now:</p>
<ul>
<li><p>Reject invalid input immediately</p>
</li>
<li><p>Provide clear, actionable error messages</p>
</li>
<li><p>Return the original input for debugging</p>
</li>
<li><p>Use proper HTTP status codes (400 for client error)</p>
</li>
</ul>
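<p>On the client side, the structured 400 body is easy to turn into a readable message. Here is a minimal sketch using only the standard library; <code>summarize_validation_error</code> is a hypothetical helper, and the example body is shortened from the response shown above:</p>
<pre><code class="language-python">import json

def summarize_validation_error(body):
    """Turn the API's HTTP 400 JSON body into a short readable message."""
    detail = json.loads(body)["detail"]
    lines = [detail["message"] + ":"]
    lines.extend(f"  - {error}" for error in detail["errors"])
    return "\n".join(lines)

# Example body with the same structure as the API's error response.
body = json.dumps({
    "detail": {
        "message": "Validation failed",
        "errors": ["amount must be positive"],
        "input": {"amount": -500.0},
    }
})
print(summarize_validation_error(body))
</code></pre>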
<p><strong>Checkpoint:</strong> Your validated API should:</p>
<ul>
<li><p>Accept valid transactions and return predictions</p>
</li>
<li><p>Reject invalid transactions with HTTP 400 and detailed error messages</p>
</li>
<li><p>Show validation errors for each invalid field</p>
</li>
</ul>
<h2 id="heading-6-monitor-model-performance-and-data-drift"><strong>6. Monitor Model Performance and Data Drift</strong></h2>
<p><strong>What breaks without this:</strong> Your model's accuracy drops from 98% to 70% over two months. Nobody notices until customers complain. By then, significant damage has occurred.</p>
<p>Even with a great model and clean input data, <strong>time can be an enemy</strong>. Model performance can decline as real-world data evolves – this is known as <strong>model drift</strong> or <strong>model decay</strong>.</p>
<p><strong>Why This Matters:</strong> In traditional software, you monitor CPU, memory, error rates, and response times. In ML, you must <strong>also</strong> monitor:</p>
<ul>
<li><p>Data quality (are inputs within expected ranges?)</p>
</li>
<li><p>Model performance (is accuracy holding up?)</p>
</li>
<li><p>Data drift (has input distribution changed?)</p>
</li>
<li><p>Prediction drift (has the distribution of predictions changed?)</p>
</li>
</ul>
<p>Without monitoring, your model could fail silently for weeks while fraud slips through, good customers are blocked, and revenue is lost.</p>
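<p>Prediction drift is the cheapest of these signals to track because it needs no labels: compare how often the model flags fraud now against a reference window. The sketch below is one simple way to do that, not the tutorial's monitoring code; the function names are hypothetical and the numbers are made up:</p>
<pre><code class="language-python">def fraud_flag_rate(predictions):
    """Share of predictions flagged as fraud (0.0 for an empty list)."""
    if not predictions:
        return 0.0
    return sum(1 for p in predictions if p) / len(predictions)

def prediction_shift(reference_preds, current_preds):
    """Compare the fraud-flag rate of two prediction windows."""
    ref = fraud_flag_rate(reference_preds)
    cur = fraud_flag_rate(current_preds)
    return {"reference_rate": ref, "current_rate": cur, "shift": abs(cur - ref)}

# Made-up example: 2 of 100 flagged in the reference window, 10 of 100 now.
reference = [1] * 2 + [0] * 98
current = [1] * 10 + [0] * 90
print(prediction_shift(reference, current))
</code></pre>
<p>A jump like this does not say whether fraud really increased or the inputs changed, but it is a cheap trigger for a closer look at the other signals.</p>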
<h3 id="heading-61-the-four-pillars-of-ml-observability"><strong>6.1 The Four Pillars of ML Observability</strong></h3>
<table>
<thead>
<tr>
<th><strong>Pillar</strong></th>
<th><strong>What to Monitor</strong></th>
<th><strong>Why It Matters</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Data Quality</strong></td>
<td>Are inputs valid? Nulls? Outliers?</td>
<td>Bad data causes bad predictions</td>
</tr>
<tr>
<td><strong>Model Performance</strong></td>
<td>Accuracy, precision, recall, F1</td>
<td>Is the model still working?</td>
</tr>
<tr>
<td><strong>Data Drift</strong></td>
<td>Has input distribution changed from training?</td>
<td>Model may not generalize to new data</td>
</tr>
<tr>
<td><strong>Prediction Drift</strong></td>
<td>Has prediction distribution changed?</td>
<td>May indicate data or concept drift</td>
</tr>
</tbody></table>
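<p>For the model performance pillar, the metrics in the table can be computed once true labels for past predictions become available. The following is a minimal standard-library sketch for the positive (fraud) class; <code>classification_metrics</code> is a hypothetical helper and the labels are made up:</p>
<pre><code class="language-python">def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)      # fraud, flagged
    fp = sum(1 for t, p in pairs if not t and p)  # legitimate, flagged
    fn = sum(1 for t, p in pairs if t and not p)  # fraud, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = fraud, 0 = legitimate.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
</code></pre>
<p>With heavily imbalanced fraud data, precision and recall are usually more informative than accuracy, since a model that never flags fraud can still score a high accuracy.</p>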
<h3 id="heading-62-build-a-drift-monitor-with-evidently"><strong>6.2 Build a Drift Monitor with Evidently</strong></h3>
<p><strong>Evidently</strong> is an open-source library specifically designed for ML monitoring. It can detect drift, generate reports, and integrate with monitoring systems.</p>
<p>Create <code>src/monitoring.py</code>:</p>
<pre><code class="language-python"># src/monitoring.py
"""
Model monitoring with Evidently.

This module provides tools to:
1. Detect data drift between training and production data
2. Generate detailed HTML reports
3. Track drift over time
4. Alert when drift exceeds thresholds

In production, you would run drift checks periodically (hourly, daily)
and alert when significant drift is detected.
"""
import pandas as pd
import numpy as np
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric
)
from datetime import datetime
from typing import List, Dict, Any, Optional

class DriftMonitor:
    """
    Monitor for detecting data drift between reference (training) and current data.
    
    Implementation Note: We use two tools here:
    1. SciPy's two-sample KS test: a lightweight statistical check that
       check_drift() uses to flag drifted columns and raise alerts
    2. Evidently: a full-featured library that generate_report() uses to
       build a detailed HTML report
    
    Keeping the KS test separate is defensive coding: if Evidently fails to
    generate a report, drift detection still works.
    
    Usage:
        monitor = DriftMonitor(training_data)
        result = monitor.check_drift(production_data)
        if result['drift_detected']:
            alert("Drift detected!")
    """
    
    def __init__(self, reference_data: pd.DataFrame, feature_columns: Optional[List[str]] = None):
        """
        Initialize the drift monitor with reference (training) data.
        
        Args:
            reference_data: The training data to compare against
            feature_columns: Columns to monitor (default: all numeric columns)
        """
        self.reference = reference_data
        self.feature_columns = feature_columns or reference_data.select_dtypes(
            include=[np.number]
        ).columns.tolist()
        self.history: List[Dict[str, Any]] = []
        
        print(f"Drift monitor initialized with {len(self.reference):,} reference samples")
        print(f"Monitoring columns: {self.feature_columns}")
    
    def check_drift(self, current_data: pd.DataFrame, threshold: float = 0.1) -&gt; Dict[str, Any]:
        """
        Check for drift between reference and current data.
        
        Args:
            current_data: Current/production data to check
            threshold: Drift share threshold for alerting (default 10%)
            
        Returns:
            Dictionary with drift results
        """
        from scipy import stats
        
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        # Simple statistical drift detection using KS test
        drifted_columns = []
        for col in self.feature_columns:
            statistic, p_value = stats.ks_2samp(
                ref_subset[col].dropna(),
                cur_subset[col].dropna()
            )
            if p_value &lt; 0.05:  # 5% significance level
                drifted_columns.append(col)
        
        n_features = len(self.feature_columns)
        n_drifted = len(drifted_columns)
        drift_share = n_drifted / n_features if n_features &gt; 0 else 0
        
        result = {
            'timestamp': datetime.now().isoformat(),
            'drift_detected': n_drifted &gt; 0,
            'drift_share': drift_share,
            'drifted_columns': drifted_columns,
            'n_features': n_features,
            'n_drifted': n_drifted,
            'current_samples': len(current_data),
            'threshold': threshold,
            'alert': drift_share &gt; threshold
        }
        
        self.history.append(result)
        
        return result
    
    def generate_report(self, current_data: pd.DataFrame, output_path: str = "drift_report.html"):
        """
        Generate a detailed HTML drift report using Evidently.
        
        The report is written to output_path; open it in a browser to inspect drift patterns.
        """
        ref_subset = self.reference[self.feature_columns]
        cur_subset = current_data[self.feature_columns]
        
        try:
            report = Report(metrics=[DataDriftPreset()])
            report.run(reference_data=ref_subset, current_data=cur_subset)
            
            # Save HTML report
            report.save_html(output_path)
            
            print(f"Drift report saved to {output_path}")
            print("Open this file in a browser to view detailed visualizations.")
        except Exception as e:
            print(f"Could not generate Evidently report: {e}")
            print("Use check_drift() for KS-test-based drift detection instead.")
    
    def get_alerts(self, threshold: float = 0.1) -&gt; List[Dict[str, Any]]:
        """
        Get all alerts from history where drift exceeded threshold.
        """
        return [
            {
                'timestamp': r['timestamp'],
                'severity': 'HIGH' if r['drift_share'] &gt; 0.3 else 'MEDIUM',
                'drift_share': r['drift_share'],
                'message': f"Drift detected: {r['drift_share']:.1%} of features drifted",
                'drifted_columns': r['drifted_columns']
            }
            for r in self.history
            if r['drift_share'] &gt; threshold
        ]
    
    def summary(self) -&gt; Dict[str, Any]:
        """Get summary statistics from monitoring history."""
        if not self.history:
            return {"message": "No drift checks performed yet"}
        
        drift_shares = [r['drift_share'] for r in self.history]
        alerts = [r for r in self.history if r['alert']]
        
        return {
            'total_checks': len(self.history),
            'total_alerts': len(alerts),
            'avg_drift_share': np.mean(drift_shares),
            'max_drift_share': np.max(drift_shares),
            'first_check': self.history[0]['timestamp'],
            'last_check': self.history[-1]['timestamp']
        }


def simulate_drift_scenarios():
    """
    Demonstrate drift detection with different scenarios.
    
    This simulates what happens when production data differs from training data.
    """
    from src.generate_data import generate_transactions
    
    print("="*70)
    print("DRIFT DETECTION SIMULATION")
    print("="*70)
    
    # Load reference (training) data
    print("\n1. Loading reference data (training set)...")
    reference = pd.read_csv('data/train.csv')
    feature_cols = ['amount', 'hour', 'day_of_week']
    
    # Initialize drift monitor
    monitor = DriftMonitor(reference, feature_cols)
    
    # Scenario 1: Similar data (should show minimal drift)
    print("\n" + "-"*70)
    print("SCENARIO 1: Test data (similar distribution)")
    print("-"*70)
    test_data = pd.read_csv('data/test.csv')
    result = monitor.check_drift(test_data)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 2: Fraud spike (10% fraud instead of 2%)
    print("\n" + "-"*70)
    print("SCENARIO 2: Fraud spike (10% fraud rate instead of 2%)")
    print("-"*70)
    fraud_spike = generate_transactions(n_samples=2000, fraud_ratio=0.10, seed=101)
    result = monitor.check_drift(fraud_spike)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 3: Amount inflation (everything costs more)
    print("\n" + "-"*70)
    print("SCENARIO 3: Amount inflation (2x multiplier)")
    print("-"*70)
    inflated = test_data.copy()
    inflated['amount'] = inflated['amount'] * 2
    result = monitor.check_drift(inflated)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Scenario 4: Time shift (more late-night transactions)
    print("\n" + "-"*70)
    print("SCENARIO 4: Time shift (mostly late-night transactions)")
    print("-"*70)
    night_shift = test_data.copy()
    night_shift['hour'] = np.random.choice([0, 1, 2, 3, 22, 23], size=len(night_shift))
    result = monitor.check_drift(night_shift)
    print(f"  Drift detected: {result['drift_detected']}")
    print(f"  Drift share: {result['drift_share']:.1%}")
    print(f"  Drifted columns: {result['drifted_columns']}")
    print(f"  Alert triggered: {result['alert']}")
    
    # Generate detailed report for the most drifted scenario
    print("\n" + "-"*70)
    print("GENERATING DETAILED REPORT")
    print("-"*70)
    monitor.generate_report(night_shift, "drift_report.html")
    
    # Print summary
    print("\n" + "-"*70)
    print("MONITORING SUMMARY")
    print("-"*70)
    summary = monitor.summary()
    print(f"  Total checks: {summary['total_checks']}")
    print(f"  Total alerts: {summary['total_alerts']}")
    print(f"  Average drift share: {summary['avg_drift_share']:.1%}")
    print(f"  Maximum drift share: {summary['max_drift_share']:.1%}")
    
    # Print alerts
    alerts = monitor.get_alerts()
    if alerts:
        print(f"\n  Alerts ({len(alerts)}):")
        for alert in alerts:
            print(f"    [{alert['severity']}] {alert['message']}")
    
    print("\n" + "="*70)
    print("DRIFT DETECTION SIMULATION COMPLETE")
    print("="*70)
    print("\nOpen drift_report.html in your browser to see detailed visualizations!")


if __name__ == "__main__":
    simulate_drift_scenarios()
</code></pre>
<p>Run the drift simulation:</p>
<pre><code class="language-python">python src/monitoring.py
</code></pre>
<p>The output shows how each scenario affects the drift results. Then open <code>drift_report.html</code> in your browser to inspect the visualizations of the drift patterns.</p>
<h3 id="heading-63-production-monitoring-strategy"><strong>6.3 Production Monitoring Strategy</strong></h3>
<p>In a production environment, you would:</p>
<ol>
<li><p><strong>Log all predictions</strong> to a database or data warehouse</p>
</li>
<li><p><strong>Run drift checks periodically</strong> (hourly for high-traffic systems, daily for lower traffic)</p>
</li>
<li><p><strong>Set up alerts</strong> when drift exceeds thresholds (integrate with PagerDuty, Slack, etc.)</p>
</li>
<li><p><strong>Trigger retraining</strong> if drift is severe or sustained</p>
</li>
<li><p><strong>Create dashboards</strong> to track drift over time (Grafana, Datadog, etc.)</p>
</li>
</ol>
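<p>The "run drift checks periodically" and "trigger retraining if drift is severe or sustained" steps can be sketched as a small decision function over the <code>drift_share</code> values stored in <code>monitor.history</code>. This is a minimal sketch, not part of the project code: the function name <code>decide_action</code> and the <code>sustained_checks</code> parameter are assumptions for illustration, and the 0.1 and 0.3 defaults simply reuse the thresholds from <code>get_alerts</code>.</p>
<pre><code class="language-python"># Hypothetical helper: turn a history of drift_share values into an action.
def decide_action(drift_shares, alert_threshold=0.1, severe_threshold=0.3, sustained_checks=3):
    """Return 'ok', 'alert', or 'retrain' for a list of drift_share values (oldest first)."""
    if not drift_shares:
        return "ok"

    latest = drift_shares[-1]

    # Severe drift in the latest check: retrain right away
    if latest &gt; severe_threshold:
        return "retrain"

    # Sustained drift: the last N checks were all above the alert threshold
    recent = drift_shares[-sustained_checks:]
    if len(recent) == sustained_checks and all(s &gt; alert_threshold for s in recent):
        return "retrain"

    # A single moderate drift reading: notify, but do not retrain yet
    if latest &gt; alert_threshold:
        return "alert"

    return "ok"
</code></pre>
<p>For example, <code>decide_action([r['drift_share'] for r in monitor.history])</code> could run after each scheduled check, with the result deciding whether to notify a channel or start a retraining job.</p>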
<p><strong>Checkpoint:</strong> Running <code>python src/monitoring.py</code> should:</p>
<ul>
<li><p>Show minimal drift for similar data (test set)</p>
</li>
<li><p>Show significant drift for modified data (fraud spike, inflation, time shift)</p>
</li>
<li><p>Generate an HTML report that you can view in your browser</p>
</li>
</ul>
<h2 id="heading-7-automate-testing-and-deployment-with-cicd"><strong>7. Automate Testing and Deployment with CI/CD</strong></h2>
<p><strong>What breaks without this:</strong> A typo in your code breaks the API. You deploy on Friday at 5 PM. Nobody notices until Monday. Fraud losses spike over the weekend.</p>
<p><strong>CI/CD</strong> (Continuous Integration/Continuous Deployment) ensures reliable, repeatable releases. As JFrog notes: <em>"A strong CI/CD pipeline enables ML teams to build robust, bug-free models more quickly and efficiently."</em></p>
<p><strong>Why This Matters:</strong> In ML, changes aren't just code – they're also data and models. CI/CD ensures that when you change training logic, data preprocessing, or hyperparameters, tests verify the change doesn't break anything before it reaches production. It's the difference between deploying with confidence and deploying with crossed fingers.</p>
<h3 id="heading-71-write-tests-for-data-and-model"><strong>7.1 Write Tests for Data and Model</strong></h3>
<p>Create <code>tests/test_data_and_model.py</code>:</p>
<pre><code class="language-python"># tests/test_data_and_model.py
"""
Tests for data quality and model performance.

These tests run in CI/CD to ensure:
1. Data meets quality requirements
2. Model meets performance thresholds
3. No regressions are introduced

Run with: pytest tests/test_data_and_model.py -v
"""
import pandas as pd
import pickle
import pytest
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

class TestDataQuality:
    """Tests for training data quality."""
    
    @pytest.fixture
    def train_data(self):
        return pd.read_csv("data/train.csv")
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_train_data_has_expected_columns(self, train_data):
        """Training data must have all required columns."""
        required_columns = {"amount", "hour", "day_of_week", "merchant_category", "is_fraud"}
        actual_columns = set(train_data.columns)
        missing = required_columns - actual_columns
        assert not missing, f"Missing columns: {missing}"
    
    def test_train_data_not_empty(self, train_data):
        """Training data must have rows."""
        assert len(train_data) &gt; 0, "Training data is empty"
        assert len(train_data) &gt;= 1000, f"Training data too small: {len(train_data)} rows"
    
    def test_no_negative_amounts(self, train_data):
        """Transaction amounts must be non-negative."""
        negative_count = (train_data["amount"] &lt; 0).sum()
        assert negative_count == 0, f"Found {negative_count} negative amounts"
    
    def test_amounts_reasonable(self, train_data):
        """Transaction amounts should be within reasonable bounds."""
        max_amount = train_data["amount"].max()
        assert max_amount &lt;= 100000, f"Max amount {max_amount} exceeds reasonable limit"
    
    def test_hours_valid(self, train_data):
        """Hours must be 0-23."""
        invalid = train_data[(train_data["hour"] &lt; 0) | (train_data["hour"] &gt; 23)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid hours"
    
    def test_days_valid(self, train_data):
        """Days of week must be 0-6."""
        invalid = train_data[(train_data["day_of_week"] &lt; 0) | (train_data["day_of_week"] &gt; 6)]
        assert len(invalid) == 0, f"Found {len(invalid)} invalid days"
    
    def test_merchant_categories_valid(self, train_data):
        """Merchant categories must be from known set."""
        valid_categories = {"grocery", "restaurant", "retail", "online", "travel"}
        actual_categories = set(train_data["merchant_category"].unique())
        invalid = actual_categories - valid_categories
        assert not invalid, f"Invalid merchant categories: {invalid}"
    
    def test_fraud_ratio_reasonable(self, train_data):
        """Fraud ratio should be realistic (between 0.1% and 50%)."""
        fraud_ratio = train_data["is_fraud"].mean()
        assert 0.001 &lt;= fraud_ratio &lt;= 0.5, f"Fraud ratio {fraud_ratio:.2%} is unrealistic"
    
    def test_no_nulls_in_critical_columns(self, train_data):
        """Critical columns must not have null values."""
        critical = ["amount", "hour", "day_of_week", "merchant_category", "is_fraud"]
        for col in critical:
            null_count = train_data[col].isnull().sum()
            assert null_count == 0, f"Column {col} has {null_count} null values"


class TestModelPerformance:
    """Tests for model performance thresholds."""
    
    @pytest.fixture
    def model_and_encoder(self):
        with open("models/model.pkl", "rb") as f:
            return pickle.load(f)
    
    @pytest.fixture
    def test_data(self):
        return pd.read_csv("data/test.csv")
    
    def test_model_loads_successfully(self, model_and_encoder):
        """Model file must load without errors."""
        model, encoder = model_and_encoder
        assert model is not None, "Model is None"
        assert encoder is not None, "Encoder is None"
    
    def test_model_can_predict(self, model_and_encoder, test_data):
        """Model must be able to make predictions."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        predictions = model.predict(X)
        assert len(predictions) == len(X), "Prediction count mismatch"
    
    def test_accuracy_threshold(self, model_and_encoder, test_data):
        """Model accuracy must be at least 90%."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        accuracy = model.score(X, y)
        assert accuracy &gt;= 0.90, f"Accuracy {accuracy:.2%} below 90% threshold"
    
    def test_f1_threshold(self, model_and_encoder, test_data):
        """Model F1-score must be at least 0.3 (sanity check for imbalanced data)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        f1 = f1_score(y, y_pred)
        assert f1 &gt;= 0.3, f"F1-score {f1:.2f} below 0.3 threshold"
    
    def test_precision_not_zero(self, model_and_encoder, test_data):
        """Model precision must be greater than 0 (at least one fraud prediction is correct)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        precision = precision_score(y, y_pred, zero_division=0)
        assert precision &gt; 0, "Model has zero precision (no correct fraud predictions)"
    
    def test_recall_not_zero(self, model_and_encoder, test_data):
        """Model recall must be greater than 0 (catches at least some fraud)."""
        model, encoder = model_and_encoder
        test_data["merchant_encoded"] = encoder.transform(test_data["merchant_category"])
        X = test_data[["amount", "hour", "day_of_week", "merchant_encoded"]]
        y = test_data["is_fraud"]
        y_pred = model.predict(X)
        recall = recall_score(y, y_pred, zero_division=0)
        assert recall &gt; 0, "Model has zero recall (misses all fraud)"
</code></pre>
<p>Create <code>tests/test_api.py</code>:</p>
<pre><code class="language-python"># tests/test_api.py
"""
Tests for the FastAPI prediction service.

These tests ensure the API:
1. Returns correct responses for valid inputs
2. Rejects invalid inputs with proper error messages
3. Health check works

Run with: pytest tests/test_api.py -v
Note: Requires the API to be running on localhost:8000
"""
import pytest
import httpx

BASE_URL = "http://localhost:8000"

class TestPredictionEndpoint:
    """Tests for the /predict endpoint."""
    
    def test_valid_prediction_returns_200(self):
        """Valid input should return HTTP 200 with prediction."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        assert "is_fraud" in data
        assert "fraud_probability" in data
        assert isinstance(data["is_fraud"], bool)
        assert 0 &lt;= data["fraud_probability"] &lt;= 1
    
    def test_high_risk_transaction(self):
        """High-risk transaction should have higher fraud probability."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 500.0,
            "hour": 3,  # Late night
            "day_of_week": 1,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 200
        data = response.json()
        # High-risk transactions should have elevated probability
        # (not asserting exact value as model may vary)
        assert data["fraud_probability"] &gt;= 0.0
    
    def test_negative_amount_rejected(self):
        """Negative amount should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": -100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
        assert "errors" in response.json()["detail"]
    
    def test_invalid_hour_rejected(self):
        """Invalid hour should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 25,  # Invalid
            "day_of_week": 3,
            "merchant_category": "online"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_invalid_merchant_rejected(self):
        """Unknown merchant category should be rejected with 400."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14,
            "day_of_week": 3,
            "merchant_category": "unknown_category"
        }, timeout=10)
        
        assert response.status_code == 400
    
    def test_missing_field_rejected(self):
        """Missing required field should be rejected."""
        response = httpx.post(f"{BASE_URL}/predict", json={
            "amount": 100.0,
            "hour": 14
            # Missing day_of_week and merchant_category
        }, timeout=10)
        
        assert response.status_code == 422  # Pydantic validation error


class TestHealthEndpoint:
    """Tests for the /health endpoint."""
    
    def test_health_returns_200(self):
        """Health endpoint should return 200."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        assert response.status_code == 200
    
    def test_health_returns_healthy_status(self):
        """Health endpoint should indicate healthy status."""
        response = httpx.get(f"{BASE_URL}/health", timeout=10)
        data = response.json()
        assert data["status"] == "healthy"
</code></pre>
<p>Run tests locally:</p>
<pre><code class="language-python"># Run data and model tests (API not needed)
pytest tests/test_data_and_model.py -v

# Run API tests (requires API to be running)
pytest tests/test_api.py -v
</code></pre>
<h3 id="heading-72-github-actions-workflow"><strong>7.2 GitHub Actions Workflow</strong></h3>
<p>⚠️ <strong>Note for Production Teams</strong><br>In real ML teams, you typically don't retrain full models inside CI — it's slow and resource-intensive.<br>Here we do it to keep everything local, reproducible, and self-contained for learning.<br>Production pipelines usually separate training (scheduled jobs) from testing (CI/CD).</p>
<p>Create <code>.github/workflows/ci.yml</code>:</p>
<pre><code class="language-python"># .github/workflows/ci.yml
name: ML Pipeline CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: 'pip'
      
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      
      - name: Generate training data
        run: python src/generate_data.py
      
      - name: Train model
        run: python src/train_naive.py
      
      - name: Run data and model tests
        run: pytest tests/test_data_and_model.py -v --tb=short
      
      - name: Build Docker image
        run: docker build -t fraud-detection-api .
      
      - name: Run container for API tests
        run: |
          docker run -d -p 8000:8000 --name test-api fraud-detection-api
          sleep 10  # Wait for API to start
          curl -f http://localhost:8000/health || exit 1
      
      - name: Run API tests
        run: pytest tests/test_api.py -v --tb=short
      
      - name: Cleanup
        if: always()
        run: docker stop test-api || true
</code></pre>
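<p>A fixed <code>sleep 10</code> can be too short on a slow runner and wastes time on a fast one. One alternative, sketched here under the assumption that you add a small helper script to the repository (the names <code>wait_until_ready</code> and <code>health_ok</code> are hypothetical), is to poll <code>/health</code> until it answers or a timeout expires. It uses only the standard library, so it does not depend on the test requirements being installed.</p>
<pre><code class="language-python"># Hypothetical helper: poll the API until it is ready instead of sleeping.
import time
import urllib.request


def wait_until_ready(check, timeout=30.0, interval=1.0):
    """Call check() until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() &gt;= deadline:
            return False
        time.sleep(interval)


def health_ok(url="http://localhost:8000/health"):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses
        return False
</code></pre>
<p>In the workflow, the "Run container for API tests" step could call such a script in place of the <code>sleep</code>, failing the job when <code>wait_until_ready(health_ok, timeout=60)</code> returns <code>False</code>.</p>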
<h3 id="heading-73-dockerize-the-application"><strong>7.3 Dockerize the Application</strong></h3>
<p>Create <code>Dockerfile</code>:</p>
<pre><code class="language-python"># Dockerfile
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update &amp;&amp; apt-get install -y \
    curl \
    &amp;&amp; rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ src/
COPY models/ models/
COPY data/ data/

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the API
CMD ["uvicorn", "src.serve_validated:app", "--host", "0.0.0.0", "--port", "8000"]
</code></pre>
<p>Create <code>.dockerignore</code>:</p>
<pre><code class="language-python"># .dockerignore
venv/
__pycache__/
*.pyc
.git/
.github/
mlruns/
*.db
*.html
.pytest_cache/
</code></pre>
<p>Build and run locally:</p>
<pre><code class="language-python"># Build the Docker image
docker build -t fraud-detection-api .

# Run the container
docker run -p 8000:8000 fraud-detection-api

# Test it
curl http://localhost:8000/health
</code></pre>
<p><strong>Checkpoint:</strong></p>
<ul>
<li><p>All tests pass: <code>pytest tests/test_data_and_model.py -v</code></p>
</li>
<li><p>Docker image builds successfully</p>
</li>
<li><p>Container runs and responds to health checks</p>
</li>
</ul>
<h2 id="heading-8-incident-response-playbook"><strong>8. Incident Response Playbook</strong></h2>
<p>When things go wrong in production (and they will), you need a plan. This section provides playbooks for common ML incidents.</p>
<h3 id="heading-scenario-false-positive-spike"><strong>Scenario: False Positive Spike</strong></h3>
<p><strong>Symptoms:</strong> Your fraud model suddenly flags 40% of legitimate transactions as fraud, blocking customers and overwhelming your manual review team.</p>
<p><strong>Severity:</strong> HIGH - Direct customer impact</p>
<p><strong>Phase 1: Mitigation (0-5 minutes)</strong></p>
<ol>
<li><p><strong>Acknowledge the incident</strong> - Notify stakeholders that you're aware and responding</p>
</li>
<li><p><strong>Roll back to previous model</strong> - In MLflow UI, move the @champion alias to the previous model version</p>
</li>
<li><p><strong>Restart the API</strong> - <code>docker restart fraud-api</code> or redeploy</p>
</li>
<li><p><strong>Verify</strong> - Check that false positive rate has returned to normal</p>
</li>
<li><p><strong>Communicate</strong> - "Issue detected and mitigated. Investigating root cause."</p>
</li>
</ol>
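<p>The rollback in step 2 can also be scripted, which helps when the UI is slow to reach during an incident. The sketch below assumes an MLflow 2.x client that provides <code>get_model_version_by_alias</code> and <code>set_registered_model_alias</code>. It also assumes that the version numbered one below the current champion is the last known-good model, which may not hold in your registry. The function name <code>rollback_alias</code> is hypothetical.</p>
<pre><code class="language-python"># Hypothetical helper: move the @champion alias back by one model version.
def rollback_alias(client, model_name, alias="champion"):
    """Point `alias` at the version just before the one it currently targets.

    `client` is expected to behave like mlflow.tracking.MlflowClient.
    Returns the version number the alias now points to.
    """
    current = int(client.get_model_version_by_alias(model_name, alias).version)
    if current &lt;= 1:
        raise ValueError(f"{model_name} v{current} has no earlier version to roll back to")

    # Assumption: the previous numeric version is the last known-good model
    previous = current - 1
    client.set_registered_model_alias(model_name, alias, str(previous))
    return previous
</code></pre>
<p>With a real tracking server you would pass <code>mlflow.tracking.MlflowClient()</code> as <code>client</code> along with the registered model name used in your project. Because the API loads the model at startup, the restart in step 3 is still needed afterwards.</p>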
<p><strong>Phase 2: Diagnosis (5-60 minutes)</strong></p>
<ol>
<li><p><strong>Check drift report</strong> - Run <code>python src/monitoring.py</code> with recent production data</p>
</li>
<li><p><strong>Check data validation logs</strong> - Did upstream data format change?</p>
</li>
<li><p><strong>Check recent deployments</strong> - Was there a new model or code deployed recently?</p>
</li>
<li><p><strong>Compare metrics</strong> - What's different between the rolled-back and problematic model?</p>
</li>
</ol>
<p><strong>Example root causes:</strong></p>
<ul>
<li><p>Upstream system sent amounts in cents instead of dollars</p>
</li>
<li><p>New merchant category appeared that wasn't in training data</p>
</li>
<li><p>Holiday shopping patterns differed significantly from training data</p>
</li>
</ul>
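<p>The first root cause, amounts arriving in cents instead of dollars, leaves a recognizable fingerprint: the typical amount jumps by a factor of about 100. A rough check for this, assuming you have a sample of reference amounts and recent production amounts as plain lists, could look like the sketch below. The function names and the 25% tolerance are illustrative choices, not values from the project.</p>
<pre><code class="language-python"># Hypothetical diagnostic: detect a unit change by comparing medians.
from statistics import median


def scale_shift_factor(reference, current):
    """Return the ratio of the current median to the reference median."""
    ref_median = median(reference)
    if ref_median == 0:
        raise ValueError("reference median is zero; ratio is undefined")
    return median(current) / ref_median


def looks_like_cents(reference, current, factor=100.0, tolerance=0.25):
    """True if current values look like reference values multiplied by `factor`."""
    ratio = scale_shift_factor(reference, current)
    return abs(ratio - factor) &lt;= factor * tolerance
</code></pre>
<p>The median is used rather than the mean so that a handful of very large transactions does not dominate the ratio.</p>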
<p><strong>Phase 3: Remediation (1-24 hours)</strong></p>
<ol>
<li><p><strong>Fix the root cause</strong> - Add validation for the edge case, or update training data</p>
</li>
<li><p><strong>Retrain if needed</strong> - Include new patterns in training data</p>
</li>
<li><p><strong>Add test case</strong> - Prevent this from happening again</p>
</li>
<li><p><strong>Document</strong> - Add to runbook for future reference</p>
</li>
</ol>
<h3 id="heading-scenario-gradual-performance-decay"><strong>Scenario: Gradual Performance Decay</strong></h3>
<p><strong>Symptoms:</strong> Monitoring shows fraud recall dropping 2% per week over a month. No sudden failures, just slow degradation.</p>
<p><strong>Severity:</strong> MEDIUM - Gradual impact, time to respond</p>
<p><strong>Response:</strong></p>
<ol>
<li><p><strong>Investigate drift report</strong> - Look for gradual distribution changes</p>
<pre><code class="language-python">python src/monitoring.py
</code></pre>
</li>
<li><p><strong>Collect recent labeled data</strong> - Get confirmed fraud cases from the past month</p>
</li>
<li><p><strong>Analyze patterns</strong> - What's different about recent fraud?</p>
<ul>
<li><p>New attack vectors?</p>
</li>
<li><p>Different time patterns?</p>
</li>
<li><p>New merchant categories?</p>
</li>
</ul>
</li>
<li><p><strong>Retrain on combined data</strong> - Include both old and new patterns</p>
<pre><code class="language-python">python src/train_mlflow.py
</code></pre>
</li>
<li><p><strong>Deploy via canary</strong> - Route 10% of traffic to the new model first</p>
<ul>
<li><p>Monitor metrics for 1-2 days</p>
</li>
<li><p>If metrics improve, increase to 50%, then 100%</p>
</li>
<li><p>If metrics worsen, roll back</p>
</li>
</ul>
</li>
<li><p><strong>Set up recurring retraining</strong> - Schedule weekly or monthly retraining</p>
</li>
</ol>
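<p>The canary step above needs a way to send a fixed share of traffic to the new model. One common approach, sketched here as an assumption rather than something the project implements, is to hash a stable request identifier into a bucket between 0 and 1 and compare it with the canary fraction. The function name <code>route_request</code> and the use of a transaction ID as the key are hypothetical.</p>
<pre><code class="language-python"># Hypothetical router: deterministic canary split based on a hashed ID.
import hashlib


def route_request(transaction_id, canary_fraction=0.10):
    """Return 'challenger' for roughly `canary_fraction` of IDs, else 'champion'."""
    if not 0.0 &lt;= canary_fraction &lt;= 1.0:
        raise ValueError("canary_fraction must be between 0 and 1")

    digest = hashlib.sha256(str(transaction_id).encode("utf-8")).hexdigest()

    # Map the first 8 hex characters (32 bits) to a bucket in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000

    return "challenger" if bucket &lt; canary_fraction else "champion"
</code></pre>
<p>Because the split depends only on the ID, the same transaction always reaches the same model, which keeps per-model metrics comparable. Raising <code>canary_fraction</code> from 0.10 to 0.50 and then 1.0 only moves additional IDs to the challenger and never moves any back.</p>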
<h3 id="heading-scenario-upstream-data-schema-change"><strong>Scenario: Upstream Data Schema Change</strong></h3>
<p><strong>Symptoms:</strong> API starts returning 500 errors. Logs show <code>KeyError: 'merchant_category'</code>.</p>
<p><strong>Severity:</strong> HIGH - Service is down</p>
<p><strong>Response:</strong></p>
<ol>
<li><p><strong>Check error logs</strong> - Identify the exact error</p>
<pre><code class="language-python">KeyError: 'merchant_category'
</code></pre>
</li>
<li><p><strong>Check upstream data</strong> - Did the field name change?</p>
<ul>
<li><p><code>merchant_category</code> -&gt; <code>category</code></p>
</li>
<li><p><code>amount</code> -&gt; <code>transaction_amount</code></p>
</li>
</ul>
</li>
<li><p><strong>Immediate fix</strong> - Add field name mapping</p>
<pre><code class="language-python"># Quick fix in API
if 'category' in data and 'merchant_category' not in data:
    data['merchant_category'] = data['category']
</code></pre>
</li>
<li><p><strong>Long-term fix</strong> - Add validation that catches schema changes</p>
<pre><code class="language-python">required_fields = ['amount', 'hour', 'day_of_week', 'merchant_category']
missing = [f for f in required_fields if f not in data]
if missing:
    raise ValidationError(f"Missing fields: {missing}")
</code></pre>
</li>
<li><p><strong>Add integration test</strong> - Test with upstream system in CI/CD</p>
</li>
</ol>
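<p>The immediate fix and the long-term fix above can be combined into one helper that renames known aliases and then checks for required fields. This is a sketch under stated assumptions: the alias table contains only the two renames listed in step 2, the function name <code>normalize_payload</code> is hypothetical, and it raises a plain <code>ValueError</code> rather than whatever error type your API uses.</p>
<pre><code class="language-python"># Hypothetical helper: map renamed upstream fields, then validate the schema.
FIELD_ALIASES = {"category": "merchant_category", "transaction_amount": "amount"}
REQUIRED_FIELDS = ("amount", "hour", "day_of_week", "merchant_category")


def normalize_payload(data, aliases=FIELD_ALIASES, required=REQUIRED_FIELDS):
    """Return a copy of `data` with known aliases renamed; raise if fields are missing."""
    normalized = dict(data)  # Leave the caller's dictionary untouched

    for old_name, new_name in aliases.items():
        # Only rename when the expected name is absent
        if old_name in normalized and new_name not in normalized:
            normalized[new_name] = normalized.pop(old_name)

    missing = [field for field in required if field not in normalized]
    if missing:
        raise ValueError(f"Missing fields: {missing}")

    return normalized
</code></pre>
<p>Keeping the alias table in one place means the next upstream rename is a one-line change, and the explicit error replaces the bare <code>KeyError</code> with a message that names the missing fields.</p>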
<h2 id="heading-9-how-to-put-it-all-together"><strong>9. How to Put It All Together</strong></h2>
<p>Let's step back and appreciate what we've built. Our initial naive system has transformed into a <strong>local ML platform</strong> with production-grade components.</p>
<blockquote>
<p>💡 <strong>Mental Model</strong>: Each tool in this stack is a "catch net" for a specific failure mode:</p>
<ul>
<li><p>MLflow catches "which model is this?"</p>
</li>
<li><p>Feast catches "are features consistent?"</p>
</li>
<li><p>Great Expectations catches "is this data valid?"</p>
</li>
<li><p>Evidently catches "has the world changed?"</p>
</li>
<li><p>CI/CD catches "did we break something?"</p>
</li>
</ul>
<p>Together, they form defense-in-depth for ML systems.</p>
</blockquote>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>Tool</strong></th>
<th><strong>Problem Solved</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Experiment Tracking</strong></td>
<td>MLflow</td>
<td>Every run logged, reproducible</td>
</tr>
<tr>
<td><strong>Model Registry</strong></td>
<td>MLflow</td>
<td>Versioned models, rollback capability</td>
</tr>
<tr>
<td><strong>Feature Store</strong></td>
<td>Feast</td>
<td>Consistent features, no training-serving skew</td>
</tr>
<tr>
<td><strong>Data Validation</strong></td>
<td>Great Expectations</td>
<td>Bad data rejected with clear errors</td>
</tr>
<tr>
<td><strong>Monitoring</strong></td>
<td>Evidently</td>
<td>Drift detected before it causes problems</td>
</tr>
<tr>
<td><strong>Containerization</strong></td>
<td>Docker</td>
<td>Environment consistency everywhere</td>
</tr>
<tr>
<td><strong>CI/CD</strong></td>
<td>GitHub Actions</td>
<td>Automated testing and safe deployments</td>
</tr>
</tbody></table>
<h3 id="heading-the-complete-workflow"><strong>The Complete Workflow</strong></h3>
<p>Here's how all the pieces work together in practice:</p>
<ol>
<li><p><strong>Data arrives</strong> - New transaction data comes in from upstream systems</p>
</li>
<li><p><strong>Validation gate</strong> - Great Expectations rules check data quality. Bad data is rejected with clear error messages before it can cause harm.</p>
</li>
<li><p><strong>Feature computation</strong> - Feast computes features using the same definitions for both training and serving. No more training-serving skew.</p>
</li>
<li><p><strong>Training</strong> - When you retrain, MLflow logs all parameters, metrics, and artifacts. Every experiment is reproducible and comparable.</p>
</li>
<li><p><strong>Model registry</strong> - Trained models are automatically versioned. You can compare metrics, promote the best version by moving the @champion alias to it, and roll back if needed.</p>
</li>
<li><p><strong>Serving</strong> - FastAPI loads the @champion model from MLflow. Each request is validated, features are retrieved from Feast, and predictions are returned.</p>
</li>
<li><p><strong>Monitoring</strong> - Evidently checks for drift periodically. If input distributions change significantly, alerts are triggered.</p>
</li>
<li><p><strong>Retraining loop</strong> - When drift is detected, you retrain on new data, compare metrics, and promote if better. The cycle continues.</p>
</li>
<li><p><strong>CI/CD safety net</strong> - All code changes go through automated tests. Docker ensures environment consistency. Nothing reaches production without passing the pipeline.</p>
</li>
</ol>
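<p>The "compare metrics, and promote if better" part of the retraining loop is worth making explicit, because "better" is easy to leave vague. The sketch below is one possible rule, not the project's: promote only when the candidate improves F1 by a minimum margin and does not lower recall. The function name <code>should_promote</code>, the metric dictionary layout, and the 0.01 margin are all assumptions.</p>
<pre><code class="language-python"># Hypothetical promotion gate: compare a candidate model with the champion.
def should_promote(candidate, champion, primary="f1", guardrail="recall", min_gain=0.01):
    """True if the candidate gains at least `min_gain` on `primary`
    without scoring lower than the champion on `guardrail`."""
    gain = candidate[primary] - champion[primary]
    guardrail_holds = candidate[guardrail] &gt;= champion[guardrail]
    return gain &gt;= min_gain and guardrail_holds
</code></pre>
<p>Both models should be scored on the same recent, labeled evaluation set. Otherwise the comparison reflects differences in the data rather than in the models.</p>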
<h2 id="heading-10-whats-next-scale-to-production"><strong>10. What's Next: Scale to Production</strong></h2>
<p>This project runs locally, but the principles and tools extend directly to production deployments. Here's how each component scales:</p>
<h3 id="heading-scaling-feast-for-production"><strong>Scaling Feast for Production</strong></h3>
<p>We used Feast with a local SQLite online store and Parquet files as the offline store. For production:</p>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>Local</strong></th>
<th><strong>Production</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Online Store</td>
<td>SQLite</td>
<td>Redis, DynamoDB, or PostgreSQL</td>
</tr>
<tr>
<td>Offline Store</td>
<td>Parquet files</td>
<td>BigQuery, Snowflake, or Redshift</td>
</tr>
<tr>
<td>Feature Server</td>
<td>Embedded</td>
<td>Dedicated Feast serving cluster</td>
</tr>
</tbody></table>
<p>Benefits at scale:</p>
<ul>
<li><p>Sub-10ms feature retrieval</p>
</li>
<li><p>Horizontal scaling for high throughput</p>
</li>
<li><p>Feature monitoring and statistics</p>
</li>
<li><p>Point-in-time joins at petabyte scale</p>
</li>
</ul>
<h3 id="heading-scaling-mlflow-for-production"><strong>Scaling MLflow for Production</strong></h3>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>Local</strong></th>
<th><strong>Production</strong></th>
</tr>
</thead>
<tbody><tr>
<td>Backend Store</td>
<td>SQLite</td>
<td>PostgreSQL or MySQL</td>
</tr>
<tr>
<td>Artifact Store</td>
<td>Local filesystem</td>
<td>S3, GCS, or Azure Blob</td>
</tr>
<tr>
<td>Tracking Server</td>
<td>Single instance</td>
<td>Load-balanced cluster</td>
</tr>
</tbody></table>
<h3 id="heading-kubernetes-deployment"><strong>Kubernetes Deployment</strong></h3>
<p>When you outgrow Docker Compose:</p>
<ul>
<li><p><strong>KServe or Seldon</strong> for serverless model serving with auto-scaling</p>
</li>
<li><p><strong>Horizontal Pod Autoscaler</strong> to scale based on CPU/memory/custom metrics</p>
</li>
<li><p><strong>Canary deployments</strong> to safely roll out new models (route 10% traffic first)</p>
</li>
<li><p><strong>GPU scheduling</strong> for inference-heavy models</p>
</li>
</ul>
<h3 id="heading-advanced-monitoring"><strong>Advanced Monitoring</strong></h3>
<p>Expand observability with:</p>
<ul>
<li><p><strong>Prometheus + Grafana</strong> for real-time dashboards</p>
</li>
<li><p><strong>OpenTelemetry</strong> for distributed tracing</p>
</li>
<li><p><strong>PagerDuty/Slack integration</strong> for alerts</p>
</li>
<li><p><strong>Labeled data collection</strong> for continuous model evaluation</p>
</li>
</ul>
<h3 id="heading-ab-testing-and-multi-armed-bandits"><strong>A/B Testing and Multi-Armed Bandits</strong></h3>
<p>Use the model registry to:</p>
<ul>
<li><p>Serve <strong>multiple models</strong> concurrently (champion vs challengers)</p>
</li>
<li><p><strong>Route traffic</strong> dynamically based on context</p>
</li>
<li><p><strong>Collect metrics</strong> for each model variant</p>
</li>
<li><p><strong>Automatically promote</strong> the best performer</p>
</li>
</ul>
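<p>To make that loop concrete, here is a minimal sketch of an epsilon-greedy router, one of the simplest multi-armed bandit strategies. It is not part of this handbook's repository: the class name <code>EpsilonGreedyRouter</code>, the variant names, and the simulated success rates are all made up for illustration. A real setup would load each variant from the model registry and record a real business metric as the reward.</p>
<pre><code class="language-python">import random


class EpsilonGreedyRouter:
    """Route requests across model variants and keep a running mean reward per variant."""

    def __init__(self, variants, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {name: 0 for name in variants}
        self.mean_reward = {name: 0.0 for name in variants}

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best variant so far.
        explore = self.rng.choices([True, False], weights=[self.epsilon, 1 - self.epsilon])[0]
        if explore:
            return self.rng.choice(list(self.counts))
        return max(self.mean_reward, key=self.mean_reward.get)

    def record(self, name, reward):
        # Incremental mean update, so no per-request history has to be stored.
        self.counts[name] += 1
        self.mean_reward[name] += (reward - self.mean_reward[name]) / self.counts[name]


# Hypothetical success rates, used only to simulate feedback.
true_rates = {"champion": 0.60, "challenger": 0.70}
router = EpsilonGreedyRouter(true_rates, epsilon=0.1, seed=42)
feedback = random.Random(7)

for _ in range(5000):
    name = router.choose()
    rate = true_rates[name]
    reward = feedback.choices([1.0, 0.0], weights=[rate, 1 - rate])[0]
    router.record(name, reward)

print(router.counts)
print({name: round(value, 3) for name, value in router.mean_reward.items()})
</code></pre>
<p>With <code>epsilon=0.1</code>, roughly one request in ten goes to a random variant, so every model keeps collecting metrics while most traffic flows to the current best performer. Automatic promotion then amounts to checking which variant holds the highest running mean once each has enough observations.</p>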
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Congratulations on building a production-ready ML system on your local machine!</p>
<p>What we assembled here is a microcosm of real-world ML platforms:</p>
<ul>
<li><p>We started with just a model saved to a pickle file</p>
</li>
<li><p>We ended up with <strong>MLOps best practices</strong>: experiment tracking, model versioning, feature stores, data validation, monitoring, containerization, and CI/CD</p>
</li>
</ul>
<p><strong>The tools we used are production-grade:</strong></p>
<ul>
<li><p><strong>MLflow</strong> powers ML platforms at companies like Microsoft, Facebook, and Databricks</p>
</li>
<li><p><strong>Feast</strong> is used by companies like Gojek, Shopify, and Robinhood</p>
</li>
<li><p><strong>FastAPI</strong> is one of the fastest Python web frameworks</p>
</li>
<li><p><strong>Great Expectations</strong> is used at companies like GitHub and Shopify</p>
</li>
<li><p><strong>Evidently</strong> is used for monitoring ML in production at scale</p>
</li>
</ul>
<p><strong>The principles apply at any scale:</strong></p>
<ul>
<li><p>Always track experiments</p>
</li>
<li><p>Always version models</p>
</li>
<li><p>Always validate data</p>
</li>
<li><p>Always monitor for drift</p>
</li>
<li><p>Always containerize for consistency</p>
</li>
<li><p>Always automate testing</p>
</li>
</ul>
<h3 id="heading-next-steps-you-can-try"><strong>Next Steps You Can Try</strong></h3>
<ol>
<li><p><strong>Deploy to the cloud</strong> - Push your Docker container to AWS ECS, Google Cloud Run, or Azure Container Instances</p>
</li>
<li><p><strong>Add model explainability</strong> - Use SHAP or LIME to explain individual predictions</p>
</li>
<li><p><strong>Implement A/B testing</strong> - Serve multiple models and compare performance</p>
</li>
<li><p><strong>Add feature importance monitoring</strong> - Track how feature importance changes over time</p>
</li>
<li><p><strong>Set up real-time alerting</strong> - Connect Evidently to Slack or PagerDuty</p>
</li>
<li><p><strong>Implement continuous training</strong> - Automatically retrain when drift is detected</p>
</li>
<li><p><strong>Add bias and fairness monitoring</strong> - Ensure your model treats all groups fairly</p>
</li>
</ol>
<p>Remember that productionizing ML is an <strong>iterative process</strong>. There's always another layer of robustness to add, another edge case to handle, another metric to track. But with the foundation you've built here, you're well on your way to taking models from promising notebook experiments to deployed, monitored, and maintainable production applications.</p>
<p>Happy building, and may your models be accurate and your pipelines resilient!</p>
<h2 id="heading-get-the-complete-code">Get the Complete Code</h2>
<p>The entire project from this handbook is available as a public GitHub repository:</p>
<p><strong>🔗</strong> <a href="http://github.com/sandeepmb/freecodecamp-local-ml-platform"><strong>github.com/sandeepmb/freecodecamp-local-ml-platform</strong></a></p>
<p>The repository includes:</p>
<ul>
<li><p>All source code (<code>src/</code> directory)</p>
</li>
<li><p>Test files (<code>tests/</code> directory)</p>
</li>
<li><p>Feast feature definitions (<code>feature_repo/</code>)</p>
</li>
<li><p>Docker and CI/CD configuration</p>
</li>
<li><p>Ready-to-run scripts</p>
</li>
</ul>
<p><strong>Quick Start:</strong></p>
<pre><code class="language-bash">git clone https://github.com/sandeepmb/freecodecamp-local-ml-platform.git
cd freecodecamp-local-ml-platform
python -m venv venv &amp;&amp; source venv/bin/activate
pip install -r requirements.txt
python src/generate_data.py
python src/train_naive.py
</code></pre>
<hr>
<h2 id="heading-references"><strong>References</strong></h2>
<ul>
<li><p><a href="https://mlflow.org/docs/latest/">MLflow Documentation</a> - Experiment tracking and model registry</p>
</li>
<li><p><a href="https://docs.feast.dev/">Feast Documentation</a> - Feature store</p>
</li>
<li><p><a href="https://docs.feast.dev/getting-started/quickstart">Feast Quickstart</a> - Getting started with Feast</p>
</li>
<li><p><a href="https://fastapi.tiangolo.com/">FastAPI Documentation</a> - Modern Python web framework</p>
</li>
<li><p><a href="https://greatexpectations.io/">Great Expectations</a> - Data validation</p>
</li>
<li><p><a href="https://docs.evidentlyai.com/">Evidently AI Documentation</a> - ML monitoring</p>
</li>
<li><p><a href="https://jfrog.com/learn/mlops/cicd-for-machine-learning/">CI/CD for Machine Learning (JFrog)</a> - CI/CD best practices</p>
</li>
<li><p><a href="https://www.qwak.com/post/training-serving-skew-in-machine-learning">Training-Serving Skew Explained</a> - Understanding skew</p>
</li>
<li><p><a href="https://docs.docker.com/">Docker Documentation</a> - Containerization</p>
</li>
<li><p><a href="https://docs.github.com/en/actions">GitHub Actions Documentation</a> - CI/CD automation</p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ A Comprehensive Guide to Financial Storytelling using Data Visualization ]]>
                </title>
                <description>
                    <![CDATA[ In any analysis project, raw tables of numbers often don’t tell the full story. Visualisations simplify complexity by transforming data into shapes that our brains can quickly understand, emphasising  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/financial-storytelling-using-data-visualization/</link>
                <guid isPermaLink="false">69b1ced06c896b0519c207be</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ finance ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nikhil Adithyan ]]>
                </dc:creator>
                <pubDate>Wed, 11 Mar 2026 20:21:36 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/64ab3674-959f-44b5-8be2-4ca00a798621.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In any analysis project, raw tables of numbers often don’t tell the full story. Visualisations simplify complexity by transforming data into shapes that our brains can quickly understand, emphasising trends, outliers, and regime shifts that might be overlooked in raw data.</p>
<p>This is especially vital in finance and trading, where clear visuals can uncover risks, opportunities, and patterns, directly affecting decisions on position sizing, timing, and confidence.</p>
<p>Today, we'll use FMP APIs to interpret earnings data: extracting announcements, surprises, and price reactions across almost 1,000 stocks to identify actionable patterns in post‑earnings movements.</p>
<p>Here’s exactly what we’ll build:</p>
<ul>
<li><p><strong>Sector heatmap</strong>: Maps strongest 3/10-day post-earnings reactions by sector/market-cap buckets.</p>
</li>
<li><p><strong>EPS scatter</strong>: Tests if earnings beats drive returns (sector-colored, with regression).</p>
</li>
<li><p><strong>Return violins</strong>: Shows 3-day post-earnings volatility/skew by sector and market-cap.</p>
</li>
<li><p><strong>Mega-tech time series</strong>: Tracks AAPL/MSFT/NVDA post-earnings patterns over time.</p>
</li>
<li><p><strong>Monthly seasonality</strong>: Reveals calendar edges in post-earnings returns/surprises.</p>
</li>
<li><p><strong>Regime cross-section</strong>: Tests sector robustness across bull/bear/sideways markets.</p>
</li>
</ul>
<h3 id="heading-what-well-cover">What we'll cover:</h3>
<ol>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-data-extraction">Data Extraction</a></p>
</li>
<li><p><a href="#heading-storytelling-with-charts-and-visuals">Storytelling with Charts and Visuals</a></p>
<ul>
<li><p><a href="#heading-sector-heatmap">Sector Heatmap</a></p>
</li>
<li><p><a href="#heading-megacap-tech-time-series">Mega‑Cap Tech Time Series</a></p>
</li>
<li><p><a href="#heading-eps-surprise-scatter-plot">EPS Surprise Scatter Plot</a></p>
</li>
<li><p><a href="#heading-return-distribution-violins">Return Distribution Violins</a></p>
</li>
<li><p><a href="#heading-monthly-seasonality">Monthly Seasonality</a></p>
</li>
<li><p><a href="#heading-regime-cross-section">Regime Cross-Section</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-what-did-we-get-out-of-all-this-storyline">What Did We Get Out of All This Storyline?</a></p>
</li>
<li><p><a href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ol>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you should be comfortable with Python and basic data manipulation in pandas.</p>
<p>This is a code-first guide. I’ll focus on the workflow and the story the charts reveal, and I won’t explain every line of Python. You should be comfortable reading pandas code, loops, and basic plotting logic so you can follow along without needing a step-by-step breakdown of each block.</p>
<p>You’ll need:</p>
<ul>
<li><p>Python 3.10+</p>
</li>
<li><p>A Financial Modeling Prep (FMP) API key</p>
</li>
<li><p>pandas, numpy, matplotlib, seaborn, scipy installed</p>
</li>
<li><p>Enough local compute and patience to run API loops across a large stock universe</p>
</li>
</ul>
<h2 id="heading-data-extraction">Data Extraction</h2>
<p>In the first part of this article, we need to collect all the data required for our visualisation exercise. Using FMP’s Stock Screener API, we will retrieve NASDAQ stocks. The first API call will return 1,000 stocks.</p>
<pre><code class="language-python">import requests
import pandas as pd
import numpy as np
import json
from datetime import datetime, timedelta
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

token = 'YOUR FMP TOKEN'

url = f'https://financialmodelingprep.com/stable/company-screener'
querystring = {"apikey":token,"country":"US", "exchange": "NASDAQ", "isActiveTrading": True, "isEtf": False, "isFund": False}
resp = requests.get(url, querystring).json()

df_universe = pd.DataFrame(resp)
df_universe = df_universe[df_universe['exchangeShortName'] == 'NASDAQ'].copy()
df_universe
</code></pre>
<p>This will give us 1,000 stocks! Next, we'll bin the market capitalisation to gain a better understanding of the results later on, and we will keep only four columns that are necessary: the symbol, name, market cap, and sector.</p>
<pre><code class="language-python">bins = [0,
        250_000_000,    # 250M
        2_000_000_000,  # 2B
        10_000_000_000, # 10B
        200_000_000_000,# 200B
        float("inf")]

labels = ["Micro", "Small", "Mid", "Large", "Mega"]

df_universe["marketCap"] = pd.cut(df_universe["marketCap"], bins=bins, labels=labels, right=False)
df_universe = df_universe[['symbol', 'companyName', 'marketCap', 'sector']]
df_universe
</code></pre>
<img src="https://cdn-images-1.medium.com/max/1000/0*rAiF7Q5TqSNlRG4h.png" alt="0*rAiF7Q5TqSNlRG4h" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">
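<p>If you're wondering what <code>right=False</code> does in the binning step, here's a small sketch using the same bins and labels on a handful of made-up market caps. Each bucket includes its lower edge and excludes its upper edge, so a company sitting exactly on a boundary falls into the larger bucket.</p>
<pre><code class="language-python">import pandas as pd

bins = [0, 250_000_000, 2_000_000_000, 10_000_000_000, 200_000_000_000, float("inf")]
labels = ["Micro", "Small", "Mid", "Large", "Mega"]

# Made-up market caps; the second and third sit exactly on a bin edge.
caps = pd.Series([100_000_000, 250_000_000, 2_000_000_000, 50_000_000_000, 3_000_000_000_000])

buckets = pd.cut(caps, bins=bins, labels=labels, right=False)
print(buckets.tolist())  # ['Micro', 'Small', 'Mid', 'Large', 'Mega']
</code></pre>
<p>A cap of exactly 250M lands in "Small" rather than "Micro". With the default <code>right=True</code>, it would land in "Micro" instead.</p>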

<p>Now it is time to retrieve the earnings using FMP’s Earnings Report API. We'll loop through each symbol and collect all the earnings the endpoint provides to us.</p>
<pre><code class="language-python">symbols = df_universe['symbol'].to_list()

all_dfs = []

for symbol in symbols:
    url = f"https://financialmodelingprep.com/stable/earnings?symbol={symbol}"
    params = {"apikey": token}
    resp = requests.get(url, params=params)

    if resp.status_code != 200:
        print(f"Error for {symbol}: {resp.status_code} - {resp.text}")
        continue

    data = resp.json()
    if not data:
        print(f"No data for {symbol}")
        continue

    df_symbol = pd.DataFrame(data)
    df_symbol["symbol"] = symbol
    all_dfs.append(df_symbol)

# Single DataFrame with all earnings
df_earnings = pd.concat(all_dfs, ignore_index=True)
df_earnings = df_earnings.dropna(subset=['epsActual', 'epsEstimated', 'revenueActual','revenueEstimated'])
df_earnings
</code></pre>
<p>Now we'll calculate the surprise, both for earnings and revenue in percentage terms, so we can later compare apples with apples! We'll keep everything from 2010 onwards.</p>
<pre><code class="language-python">df_earnings["eps_surprise"] = ((df_earnings["epsActual"] - df_earnings["epsEstimated"]) /
                               abs(df_earnings["epsEstimated"]) * 100).round(2)

df_earnings["revenue_surprise"] = ((df_earnings["revenueActual"] - df_earnings["revenueEstimated"]) /
                                   abs(df_earnings["revenueEstimated"]) * 100).round(2)

df_earnings = df_earnings[['symbol', 'date', 'eps_surprise', 'revenue_surprise']]

df_earnings["date"] = pd.to_datetime(df_earnings["date"])
df_earnings = df_earnings[df_earnings["date"] &gt; "2009-12-31"]
</code></pre>
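<p>One thing to watch with this formula: when the estimate is exactly zero, the division produces an infinite value, and <code>dropna()</code> won't remove it later. The sketch below shows one possible guard on a few made-up rows. It's an optional addition to the pipeline above, not something the original code does.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Made-up rows; the second has an estimate of exactly zero.
df = pd.DataFrame({
    "epsActual": [1.10, 0.05, -0.20],
    "epsEstimated": [1.00, 0.00, -0.25],
})

raw = (df["epsActual"] - df["epsEstimated"]) / df["epsEstimated"].abs() * 100

# Dividing by a zero estimate yields inf; mark it as missing so it can be dropped later.
df["eps_surprise"] = raw.replace([np.inf, -np.inf], np.nan).round(2)
print(df)
</code></pre>
<p>Without the guard, a single infinite surprise would turn any group average that includes it into infinity.</p>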
<p>As the final data-gathering step, we'll use FMP’s Historical Index Full Chart API to loop through the stocks in our dataframe, retrieve their historical daily prices, and calculate each stock's return over the 3 and 10 trading days before and after the earnings announcement.</p>
<pre><code class="language-python">unique_symbols = df_earnings["symbol"].unique()

price_results = []

print(f"Processing {len(unique_symbols)} symbols...")

for symbol in unique_symbols:
    # Fetch full historical prices
    url = f"https://financialmodelingprep.com/stable/historical-price-eod/full"
    params = {"apikey":token, "symbol":symbol, "from":'2009-10-01'}
    resp = requests.get(url, params=params)

    if resp.status_code != 200:
        print(f"Error for {symbol}: {resp.status_code}")
        continue

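    data = resp.json()
    if not data:
        # Skip symbols with no price history instead of failing on a missing "date" column
        print(f"No price data for {symbol}")
        continue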

    hist_df = pd.DataFrame(data)
    hist_df["date"] = pd.to_datetime(hist_df["date"])
    hist_df = hist_df.sort_values("date").reset_index(drop=True)

    # Get matching earnings rows
    earnings_symbol = df_earnings[df_earnings["symbol"] == symbol].copy()

    for _, row in earnings_symbol.iterrows():
        earn_date = pd.to_datetime(row["date"]).date()

        # === 3-DAY WINDOWS ===
        pre3_mask = (hist_df["date"].dt.date &lt; earn_date) &amp; \
                    (hist_df["date"].dt.date &gt;= earn_date - timedelta(days=10))
        pre3 = hist_df[pre3_mask].tail(3)

        post3_mask = (hist_df["date"].dt.date &gt; earn_date) &amp; \
                     (hist_df["date"].dt.date &lt;= earn_date + timedelta(days=10))
        post3 = hist_df[post3_mask].head(3)

        pre3_start = pre3["close"].iloc[0] if len(pre3) &gt;= 3 else None
        pre3_end = pre3["close"].iloc[-1] if len(pre3) &gt;= 1 else None
        post3_end = post3["close"].iloc[-1] if len(post3) &gt;= 3 else None

        pct_pre_3d = ((pre3_end - pre3_start) / pre3_start * 100) if pre3_start and pre3_end else None
        pct_post_3d = ((post3_end - pre3_end) / pre3_end * 100) if pre3_end and post3_end else None

        # === 10-DAY WINDOWS ===
        pre10_mask = (hist_df["date"].dt.date &lt; earn_date) &amp; \
                     (hist_df["date"].dt.date &gt;= earn_date - timedelta(days=20))
        pre10 = hist_df[pre10_mask].tail(10)

        post10_mask = (hist_df["date"].dt.date &gt; earn_date) &amp; \
                      (hist_df["date"].dt.date &lt;= earn_date + timedelta(days=20))
        post10 = hist_df[post10_mask].head(10)

        pre10_start = pre10["close"].iloc[0] if len(pre10) &gt;= 10 else None
        pre10_end = pre10["close"].iloc[-1] if len(pre10) &gt;= 1 else None
        post10_end = post10["close"].iloc[-1] if len(post10) &gt;= 10 else None

        pct_pre_10d = ((pre10_end - pre10_start) / pre10_start * 100) if pre10_start and pre10_end else None
        pct_post_10d = ((post10_end - pre10_end) / pre10_end * 100) if pre10_end and post10_end else None

        price_results.append({
            "symbol": symbol,
            "earn_date": earn_date,
            "month": earn_date.month,
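            # Compare against None so a genuine 0.0% return is kept rather than discarded
            "pct_pre_3d": round(pct_pre_3d, 2) if pct_pre_3d is not None else None,
            "pct_post_3d": round(pct_post_3d, 2) if pct_post_3d is not None else None,
            "pct_pre_10d": round(pct_pre_10d, 2) if pct_pre_10d is not None else None,
            "pct_post_10d": round(pct_post_10d, 2) if pct_post_10d is not None else None,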
            "eps_surprise": row["eps_surprise"],
            "revenue_surprise": row["revenue_surprise"]
        })



df_earnings = pd.DataFrame(price_results)
df_earnings.dropna(inplace=True)
df_earnings = df_universe.merge(df_earnings, on="symbol")
df_earnings
</code></pre>
<p>As you can see, at the end of the code, we have also merged the initial dataset, so all the information, such as name, marketCap, and sector, is now in a single dataset.</p>
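<p>To sanity-check the window logic, it helps to run it on a tiny synthetic price series where you can verify the answer by hand. The sketch below uses a hypothetical helper, <code>post_earnings_return</code>, which follows the same convention as the loop above: the base price is the last close before the earnings date, and the end price is the n-th close after it. Unlike the loop, it doesn't cap the window at a fixed number of calendar days.</p>
<pre><code class="language-python">import pandas as pd


def post_earnings_return(prices, earn_date, n_days):
    """Percent change from the last close before earn_date to the n-th close after it.

    prices: Series of closes indexed by a sorted DatetimeIndex of trading days.
    Returns None when there is not enough data on either side of the event.
    """
    earn_ts = pd.Timestamp(earn_date)
    # Position of the first trading day on or after the earnings date.
    first_on_or_after = int(prices.index.searchsorted(earn_ts, side="left"))
    # Position of the first trading day strictly after the earnings date.
    first_after = int(prices.index.searchsorted(earn_ts, side="right"))

    base_pos = first_on_or_after - 1
    end_pos = first_after + n_days - 1
    valid = range(len(prices))
    if base_pos not in valid or end_pos not in valid:
        return None

    base = prices.iloc[base_pos]
    return (prices.iloc[end_pos] - base) / base * 100


# Ten weekdays starting Monday 2024-01-01 (bdate_range skips weekends, not holidays).
dates = pd.bdate_range("2024-01-01", periods=10)
closes = pd.Series([100, 101, 102, 103, 104, 105, 106, 107, 108, 109], index=dates, dtype=float)

# Earnings on Thu 4 Jan: base is the Wed 3 Jan close (102), end is the Tue 9 Jan close (106).
print(post_earnings_return(closes, "2024-01-04", 3))
# Too close to the end of the series, so there is no third post-earnings close.
print(post_earnings_return(closes, "2024-01-11", 3))
</code></pre>
<p>Note that the earnings-day close itself is skipped on both sides, so the "post" return includes the reaction on the announcement day.</p>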
<h2 id="heading-storytelling-with-charts-and-visuals">Storytelling with Charts and Visuals</h2>
<h3 id="heading-sector-heatmap">Sector Heatmap</h3>
<p>First, we'll present the Sector Heatmap of average 3-day post-earnings returns segmented by sector and market-cap category. This basic visualisation highlights areas with the most significant reactions, enabling traders to swiftly identify high-alpha sectors and market caps for earnings strategies.</p>
<pre><code class="language-python"># Aggregate: average post-earnings returns and EPS surprise
agg = (
    df_earnings
    .dropna(subset=['pct_post_3d', 'pct_post_10d', 'eps_surprise', 'marketCap', 'sector'])
    .groupby(['sector', 'marketCap'])
    .agg(
        avg_post3d=('pct_post_3d', 'mean'),
        avg_post10d=('pct_post_10d', 'mean'),
        avg_eps_surprise=('eps_surprise', 'mean')
    )
    .reset_index()
)

# Heatmap: average 3-day post-earnings return
heatmap_3d = agg.pivot(index='sector', columns='marketCap', values='avg_post3d')

plt.figure(figsize=(12, 8))
sns.heatmap(
    heatmap_3d,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    center=0,
    linewidths=0.5,
    linecolor='grey'
)
plt.title('Average 3-Day Post-Earnings Return by Sector and Market-Cap Bucket')
plt.xlabel('Market-cap bucket')
plt.ylabel('Sector')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
</code></pre>
<img src="https://cdn-images-1.medium.com/max/1000/0*u0AOCzVCWJ4NQMIS.png" alt="Heatmap of average 3-day post-earnings returns by sector and market-cap bucket for NASDAQ stocks" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Consumer Cyclical and Materials are performing really well, with small and mid caps seeing positive reactions over 1.1%. Real Estate is also doing great, jumping up to +4.0% in mid caps. Energy and Financials are holding steady, staying close to zero. Technology, on the other hand, is showing more muted gains, under 1.1%, indicating there might be limited immediate upside from the big tech earnings.</p>
<p>Building on the 3‑day heatmap, we'll now look at the Sector Heatmap for average <em>10‑day</em> post‑earnings returns by sector and market‑cap category. This extends the timeframe to capture momentum persistence, revealing which sectors maintain or reverse short‑term reactions.</p>
<pre><code class="language-python"># Heatmap: average 10-day post-earnings return
heatmap_10d = agg.pivot(index='sector', columns='marketCap', values='avg_post10d')

plt.figure(figsize=(12, 8))
sns.heatmap(
    heatmap_10d,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    center=0,
    linewidths=0.5,
    linecolor='grey'
)
plt.title('Average 10-Day Post-Earnings Return by Sector and Market-Cap Bucket')
plt.xlabel('Market-cap bucket')
plt.ylabel('Sector')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
</code></pre>
<img src="https://cdn-images-1.medium.com/max/1000/0*DB7p_HYR-6jWYaaP.png" alt="Heatmap of average 10-day post-earnings returns by sector and market-cap bucket for NASDAQ stocks" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Consumer Cyclical stands out with peaks at 3.2% (mega caps), and Industrials and Health Care show consistent gains in mid and large caps around 1.1%. Real Estate has eased after its 3-day surge. Technology has seen a small boost in mega caps (+1.8%) but remains less active overall compared to cyclicals.</p>
<h3 id="heading-megacap-tech-time-series"><strong>Mega‑Cap Tech Time&nbsp;Series</strong></h3>
<p>Extending the heatmaps, we’ll now look at a Mega-Cap Tech time series. It tracks 10-day post-earnings returns over time for AAPL, MSFT, NVDA, and a few other mega-cap tech names.</p>
<p>A bubble chart works well here because it encodes more than one thing at once. The x-axis is the earnings date, the y-axis is the 10-day post-earnings return, the bubble size scales with the absolute EPS surprise magnitude, and the color shows whether the surprise was a beat or a miss. This makes it easy to spot outlier quarters and see whether big surprises consistently lead to bigger post-earnings moves.</p>
<pre><code class="language-python"># Define mega-cap tech tickers (top ones from data: AAPL, MSFT, NVDA, AMZN, GOOG/GOOGL, META)
tech_tickers = ['AAPL', 'MSFT', 'NVDA', 'AMZN', 'GOOG', 'GOOGL', 'META']

# Filter data for mega-cap tech
df_tech = (
    df_earnings[df_earnings['symbol'].isin(tech_tickers)]
    .dropna(subset=['earn_date', 'pct_post_10d', 'eps_surprise'])
    .sort_values('earn_date')
    .assign(
        earn_date=lambda x: pd.to_datetime(x['earn_date'])
    )
)

# Create time-series plot: pct_post_10d vs earn_date, sized/color by eps_surprise
plt.figure(figsize=(14, 8))

# Scatter plot
scatter = plt.scatter(
    df_tech['earn_date'],
    df_tech['pct_post_10d'],
    s=np.abs(df_tech['eps_surprise']) * 50 + 20,  # Size by abs(eps_surprise)
    c=df_tech['eps_surprise'],
    cmap='RdYlBu_r',
    alpha=0.7,
    edgecolors='black',
    linewidth=0.5
)

plt.colorbar(scatter, label='EPS Surprise (%)')
plt.xlabel('Earnings Date')
plt.ylabel('10-Day Post-Earnings Return (%)')
plt.title('Mega-Cap Tech: 10-Day Post-Earnings Returns vs Time\n(Point size/color by EPS Surprise)')
plt.grid(True, alpha=0.3)

# Add trend line (fit on elapsed days so the slope reads as % per day, not per nanosecond)
x_days = (df_tech['earn_date'] - df_tech['earn_date'].min()).dt.days.to_numpy()
z = np.polyfit(x_days, df_tech['pct_post_10d'], 1)
p = np.poly1d(z)
plt.plot(df_tech['earn_date'], p(x_days), "r--", alpha=0.8, linewidth=2, label=f'Trend: {z[0]:.4f}% per day')

plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
</code></pre>
<img src="https://cdn-images-1.medium.com/max/1500/1*vFJ_bKUzT1WGiJaiF53tEg.png" alt="Bubble chart of 10-day post-earnings returns over time for AAPL, MSFT, NVDA, AMZN, GOOG, GOOGL, META. Bubble size reflects EPS surprise magnitude. Color reflects beat or miss" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>That large red bubble around 2018 is most likely <strong>AAPL’s Q4 2018 earnings miss</strong> (announced in January 2019 and covering calendar Q4 2018, which is Apple's fiscal Q1 2019). It stands out because:</p>
<ul>
<li><p><strong>Large size</strong> = massive EPS surprise magnitude (Apple cut guidance dramatically, ~10% miss)</p>
</li>
<li><p><strong>Red colour</strong> = negative surprise</p>
</li>
<li><p><strong>Low Y position</strong> = poor 10‑day return (~-10% range visible)</p>
</li>
</ul>
<p>This was Apple’s infamous “iPhone demand warning” that triggered the January 2019 market panic. It's a perfect example of how one outlier event can drag the whole trend line downward in your visualisation.</p>
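<p>You can see how much a single point matters with a quick experiment. The sketch below fits the same kind of straight line with <code>np.polyfit</code> on made-up quarterly returns, once with a large negative outlier at the end and once without it. The numbers are synthetic and only meant to show the effect.</p>
<pre><code class="language-python">import numpy as np

# Synthetic 10-day returns for 12 consecutive quarters (made-up numbers).
x = np.arange(12, dtype=float)
y = np.array([2.0, 1.5, 2.5, 1.0, 2.0, 3.0, 1.5, 2.5, 2.0, 1.0, 2.5, -10.0])  # last point is the outlier

# Least-squares line through all points, then through all but the outlier.
slope_all, intercept_all = np.polyfit(x, y, 1)
slope_trimmed, intercept_trimmed = np.polyfit(x[:-1], y[:-1], 1)

print(f"slope with outlier:    {slope_all:.3f}")
print(f"slope without outlier: {slope_trimmed:.3f}")
</code></pre>
<p>Because least squares penalises squared errors, one extreme quarter can flip an essentially flat trend into a clearly negative one. When you read a trend line on a chart like this, it's worth checking whether it survives removing the most extreme points.</p>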
<h3 id="heading-eps-surprise-scatter-plot">EPS Surprise Scatter&nbsp;Plot</h3>
<p>After identifying major tech trends, let's now look at the <strong>EPS Surprise Scatter</strong> plots. This plot checks a simple hypothesis. Do earnings beats lead to positive returns, and do misses lead to negative returns? We plot EPS surprise on the x-axis and post-earnings returns on the y-axis, then add a regression line to show the average relationship.</p>
<pre><code class="language-python"># Prepare data: drop NaNs and convert earn_date if needed (not used here)
df_plot = (
    df_earnings
    .dropna(subset=['eps_surprise', 'pct_post_3d', 'pct_post_10d', 'sector'])
    .copy()
)

# 1. Scatter: EPS Surprise vs 3-Day Post-Return, colored by sector
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.scatterplot(
    data=df_plot,
    x='eps_surprise',
    y='pct_post_3d',
    hue='sector',
    alpha=0.6,
    s=40
)

# Regression line (overall)
slope, intercept, r_value, p_value, std_err = stats.linregress(df_plot['eps_surprise'], df_plot['pct_post_3d'])
line = slope * df_plot['eps_surprise'] + intercept
plt.plot(df_plot['eps_surprise'], line, 'red', linestyle='--', linewidth=2,
         label=f'y = {slope:.3f}x + {intercept:.2f}\nR²={r_value**2:.3f}')
plt.xlabel('EPS Surprise (%)')
plt.ylabel('3-Day Post-Earnings Return (%)')
plt.title('EPS Surprise vs 3-Day Post-Return by Sector')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)

# 2. Scatter: EPS Surprise vs 10-Day Post-Return, colored by sector
plt.subplot(1, 2, 2)
sns.scatterplot(
    data=df_plot,
    x='eps_surprise',
    y='pct_post_10d',
    hue='sector',
    alpha=0.6,
    s=40
)

# Regression line (overall)
slope10, intercept10, r_value10, p_value10, std_err10 = stats.linregress(df_plot['eps_surprise'], df_plot['pct_post_10d'])
line10 = slope10 * df_plot['eps_surprise'] + intercept10
plt.plot(df_plot['eps_surprise'], line10, 'red', linestyle='--', linewidth=2,
         label=f'y = {slope10:.3f}x + {intercept10:.2f}\nR²={r_value10**2:.3f}')
plt.xlabel('EPS Surprise (%)')
plt.ylabel('10-Day Post-Earnings Return (%)')
plt.title('EPS Surprise vs 10-Day Post-Return by Sector')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Optional: Summary table of correlations by sector
corr_3d = df_plot.groupby('sector')[['eps_surprise', 'pct_post_3d']].corr().unstack().xs('pct_post_3d', level=1, axis=1)['eps_surprise']
corr_10d = df_plot.groupby('sector')[['eps_surprise', 'pct_post_10d']].corr().unstack().xs('pct_post_10d', level=1, axis=1)['eps_surprise']

corr_df = pd.DataFrame({
    'Corr_EPS_3Day': corr_3d.round(3),
    'Corr_EPS_10Day': corr_10d.round(3)
}).sort_values('Corr_EPS_10Day', ascending=False)

print(corr_df)
</code></pre>
<img src="https://cdn-images-1.medium.com/max/1500/1*rEAHbGRiyJs-NT9VPRudDQ.png" alt="Scatter plot of EPS surprise versus post-earnings returns with sector colors and overall regression line" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The red dashed trend line illustrates the <em>typical</em> relationship: for every 1% EPS beat, stocks tend to gain about 0.05–0.1% over 3 to 10 days. The gentle slope suggests that while surprises can give a little boost, <strong>they don’t guarantee large moves</strong>.</p>
<p>You’ll notice that Consumer Cyclical dots mainly cluster in the upper right (beats leading to gains), and Real Estate shows a steeper increase. The wide spread around the line indicates that other factors often influence stock movements beyond surprises.</p>
<h3 id="heading-return-distribution-violins">Return Distribution Violins</h3>
<p>Heatmaps show averages, but averages can hide risk. Violin plots show the full distribution of returns, including how wide the outcomes are and whether the tails are heavy. Here we plot 3-day post-earnings return distributions by sector and by market-cap bucket.</p>
<pre><code class="language-python"># Prepare data
df_plot = (
    df_earnings
    .dropna(subset=['pct_post_3d', 'sector', 'marketCap'])
    .copy()
)

# 1. Violin plot: 3-day post-returns by sector
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
sns.violinplot(
    data=df_plot,
    x='sector',
    y='pct_post_3d',
    inner='quartile',
    palette='Set2'
)
plt.title('Distribution of 3-Day Post-Earnings Returns by Sector (Violin)')
plt.xlabel('Sector')
plt.ylabel('3-Day Post-Earnings Return (%)')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3)

# 2. Violin plot: 3-day post-returns by market-cap group
plt.subplot(1, 2, 2)
sns.violinplot(
    data=df_plot,
    x='marketCap',
    y='pct_post_3d',
    inner='quartile',
    palette='Set3'
)
plt.title('Distribution of 3-Day Post-Earnings Returns by Market-Cap (Violin)')
plt.xlabel('Market-cap bucket')
plt.ylabel('3-Day Post-Earnings Return (%)')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary statistics table
summary = df_plot.groupby(['sector', 'marketCap'])['pct_post_3d'].agg(['mean', 'median', 'std', 'count']).round(2)
print("Summary Statistics: Mean/Median/Std/Count of 3-Day Returns by Sector &amp; Market-Cap")
print(summary)
</code></pre>
<img src="https://cdn-images-1.medium.com/max/1500/1*JLOvSp-2jwD5_ZNqeqdBEw.png" alt="Violin plots showing distribution of 3-day post-earnings returns by sector and by market-cap bucket" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>All violins concentrate near zero with modest variations (±5%), indicating that post-earnings reactions are <em>generally noisy and lack a clear direction.</em> Markets efficiently incorporate expectations, resulting in little predictable advantage. Consumer Cyclical and Materials sectors display slightly more frequent upside surprises, while small caps exhibit the greatest variability, reflecting higher risk and occasional gains. Not every visualization reveals alpha; this one honestly illustrates the difficulty involved.</p>
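<p>If you want numbers to go with the shapes, a few summary statistics capture most of what a violin shows: the spread, the skew, and the tails. Here's a rough sketch on synthetic data, where both groups share the same average return but have very different risk. The group parameters are made up, and the percentile column names are just a suggestion.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic 3-day returns: same mean, very different spread (made-up parameters).
df = pd.DataFrame({
    "marketCap": ["Small"] * 500 + ["Mega"] * 500,
    "pct_post_3d": np.concatenate([rng.normal(0.5, 8.0, 500), rng.normal(0.5, 2.0, 500)]),
})

# Named aggregation: each keyword becomes a column in the result.
summary = df.groupby("marketCap")["pct_post_3d"].agg(
    mean="mean",
    std="std",
    skew="skew",
    p05=lambda s: s.quantile(0.05),
    p95=lambda s: s.quantile(0.95),
).round(2)
print(summary)
</code></pre>
<p>A heatmap of means would colour these two groups almost identically, while the standard deviation and the 5th and 95th percentiles make the difference in risk obvious.</p>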
<h3 id="heading-monthly-seasonality">Monthly Seasonality</h3>
<p>After observing narrow return distributions near zero, let's now look at Monthly Seasonality in four panels: average 3‑day and 10‑day post‑earnings returns, EPS surprises, and event counts by month. This reveals calendar effects (systematic seasonal biases) that can influence the timing of entries despite noisy individual responses.</p>
<pre><code class="language-python"># 1. Ensure earn_date is datetime
df_month = (
    df_earnings
    .dropna(subset=['earn_date', 'pct_post_3d', 'pct_post_10d', 'eps_surprise'])
    .copy()
)

df_month['earn_date'] = pd.to_datetime(df_month['earn_date'])

# 2. Derive month number and name
df_month['month_num'] = df_month['earn_date'].dt.month
df_month['month_name'] = df_month['earn_date'].dt.strftime('%b')

# 3. Aggregate averages by month
monthly_agg = (
    df_month
    .groupby('month_num')
    .agg(
        pct_post_3d_mean=('pct_post_3d', 'mean'),
        pct_post_10d_mean=('pct_post_10d', 'mean'),
        eps_surprise_mean=('eps_surprise', 'mean'),
        n_obs=('earn_date', 'count')
    )
    .reset_index()
    .sort_values('month_num')
)

# Keep a stable month order and names
month_order = monthly_agg['month_num'].tolist()
month_labels = df_month.drop_duplicates('month_num').set_index('month_num')['month_name'].reindex(month_order)

monthly_agg['month_name'] = month_labels.values

# 4. Plot bar charts
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Monthly Seasonality of Post-Earnings Returns and EPS Surprise', fontsize=16)

# Avg 3-day return
axes[0, 0].bar(monthly_agg['month_name'], monthly_agg['pct_post_3d_mean'], color='skyblue')
axes[0, 0].set_title('Avg 3-Day Post-Earnings Return by Month')
axes[0, 0].set_ylabel('Return (%)')
axes[0, 0].grid(alpha=0.3)

# Avg 10-day return
axes[0, 1].bar(monthly_agg['month_name'], monthly_agg['pct_post_10d_mean'], color='lightgreen')
axes[0, 1].set_title('Avg 10-Day Post-Earnings Return by Month')
axes[0, 1].set_ylabel('Return (%)')
axes[0, 1].grid(alpha=0.3)

# Avg EPS surprise
axes[1, 0].bar(monthly_agg['month_name'], monthly_agg['eps_surprise_mean'], color='salmon')
axes[1, 0].set_title('Avg EPS Surprise by Month')
axes[1, 0].set_ylabel('EPS Surprise')
axes[1, 0].grid(alpha=0.3)

# Number of observations
axes[1, 1].bar(monthly_agg['month_name'], monthly_agg['n_obs'], color='gold')
axes[1, 1].set_title('Number of Earnings Events by Month')
axes[1, 1].set_ylabel('Count')
axes[1, 1].grid(alpha=0.3)

for ax in axes.ravel():
    ax.set_xlabel('Month')
    ax.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()
</code></pre>
<img src="https://cdn-images-1.medium.com/max/1500/1*HjdZDaUhudYQZPNvOqy-_Q.png" alt="Four-panel bar charts showing monthly averages of 3-day returns, 10-day returns, EPS surprise, and event counts" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Jan/Oct tend to have the best 3‑day returns, about 0.8%, while May/Jul usually see weaker results. The 10‑day trends show a similar but gentler pattern, with February and August reaching peaks. EPS surprises are slightly negative in January and May, possibly due to tough comparisons, and there are fewer events in July, August, and December because of holidays. While there’s a hint of seasonality, its impact is quite small, around 0.5%.</p>
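<p>One way to judge whether a gap of roughly 0.5% between months is meaningful is to compare each monthly mean with its standard error. The sketch below is only an illustration: it uses synthetic data in a hypothetical <code>demo</code> DataFrame rather than the article’s <code>df_month</code>, so its numbers say nothing about real earnings. Only the calculation carries over.</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for df_month (assumption: real values would come from df_earnings)
demo = pd.DataFrame({
    "month_num": rng.integers(1, 13, size=600),
    "pct_post_3d": rng.normal(loc=0.2, scale=5.0, size=600),
})

# Mean, spread, and sample size of the 3-day return for each month
monthly_check = (
    demo.groupby("month_num")["pct_post_3d"]
    .agg(mean="mean", std="std", n="count")
)

# Standard error of the mean, and the mean expressed in standard-error units
monthly_check["std_err"] = monthly_check["std"] / np.sqrt(monthly_check["n"])
monthly_check["t_stat"] = monthly_check["mean"] / monthly_check["std_err"]

print(monthly_check.round(2))
</code></pre>
<p>As a rough rule of thumb, a <code>t_stat</code> whose absolute value sits well below 2 means the monthly average is hard to tell apart from zero, which would fit the cautious reading above.</p>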
<h3 id="heading-regime-cross-section">Regime Cross-Section</h3>
<p>Finally, after the subtle monthly patterns, we'll look at the Regime Cross‑Section: sector 10‑day post‑earnings returns by market regime (heatmap at the top, bars below). This stress‑tests the earlier findings (do the patterns persist across bull, bear, and COVID eras?) and reveals rotation opportunities and regime dependence.</p>
<pre><code class="language-python"># Prepare data with year extraction
df_regimes = (
    df_earnings
    .dropna(subset=['earn_date', 'pct_post_10d', 'sector'])
    .copy()
)

df_regimes['earn_date'] = pd.to_datetime(df_regimes['earn_date'])
df_regimes['year'] = df_regimes['earn_date'].dt.year

# Define market regimes (adjust years based on your data/market history)
# Example: Bull (2023-2025), Bear/Transition (2022), COVID (2020-2021), etc.
def assign_regime(year):
    if year &gt;= 2023:
        return 'Bull (2023+)'
    elif year == 2022:
        return 'Bear (2022)'
    elif 2020 &lt;= year &lt;= 2021:
        return 'COVID Recovery'
    elif 2018 &lt;= year &lt;= 2019:
        return 'Pre-COVID'
    else:
        return 'Earlier'

df_regimes['market_regime'] = df_regimes['year'].apply(assign_regime)

# 1. Aggregate: average 10-day returns by sector and regime/year
agg_data = (
    df_regimes
    .groupby(['sector', 'market_regime'])['pct_post_10d']
    .agg(['mean', 'count'])
    .reset_index()
    .query('count &gt;= 5')  # Filter low-sample regimes
)

# 2. Visualization: Heatmap first (quick overview)
plt.figure(figsize=(12, 8))

plt.subplot(2, 1, 1)
pivot_heatmap = agg_data.pivot(index='sector', columns='market_regime', values='mean')
sns.heatmap(pivot_heatmap, annot=True, fmt='.2f', cmap='RdYlGn', center=0, linewidths=0.5)
plt.title('Average 10-Day Post-Earnings Returns: Sector x Market Regime Heatmap')

# 3. Bar charts: By regime (grouped by sector)
plt.subplot(2, 1, 2)
regime_order = agg_data.groupby('market_regime')['mean'].mean().sort_values(ascending=False).index
sns.barplot(data=agg_data, x='market_regime', y='mean', hue='sector',
            palette='Set2', order=regime_order)
plt.title('Average 10-Day Returns by Market Regime (Colored by Sector)')
plt.ylabel('10-Day Post-Return (%)')
plt.xlabel('Market Regime')
plt.xticks(rotation=45, ha='right')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# 4. Summary tables
print("Average Returns by Sector x Market Regime (min 5 obs):")
print(agg_data.pivot(index='sector', columns='market_regime', values='mean').round(2))

# 5. Ranking: top three sectors within each regime
print("\nTop 3 Sectors by Regime:")
for regime in regime_order:
    regime_data = agg_data[agg_data['market_regime'] == regime].sort_values('mean', ascending=False)
    print(f"\n{regime}:")
    print(regime_data[['sector', 'mean', 'count']].round(2).head(3))
</code></pre>
<img src="https://cdn-images-1.medium.com/max/1000/0*Pn2sH97R2DCpRl7u.png" alt="Heatmap and bar chart showing average 10-day post-earnings returns by sector across market regimes" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Consumer Cyclical does well during Bull (2023+) and COVID Recovery (~1.5–2%), but it’s less favorable in Bear 2022. Utilities turned negative before COVID. The bottom bars show the COVID era led overall gains (~1%), with Basic Materials and Industrials being the strongest. The recent Bull remains positive but less so. Sector leadership shifts depending on the market regime; there are no consistent winners.</p>
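<p>The <code>assign_regime</code> function behind this chart runs once per row through <code>apply</code>. As an alternative sketch, the same year-to-regime mapping can be written with <code>pd.cut</code>, which bins the whole column in one call. The <code>years</code> Series below is made-up sample input, and the bin edges simply restate the cut-offs used in <code>assign_regime</code>.</p>
<pre><code class="language-python">import pandas as pd

# Made-up sample years covering every regime
years = pd.Series([2016, 2018, 2019, 2020, 2021, 2022, 2023, 2025], name="year")

# Each bin includes its right edge: (2017, 2019] covers 2018 and 2019, and so on
regime = pd.cut(
    years,
    bins=[-float("inf"), 2017, 2019, 2021, 2022, float("inf")],
    labels=["Earlier", "Pre-COVID", "COVID Recovery", "Bear (2022)", "Bull (2023+)"],
)

print(pd.DataFrame({"year": years, "market_regime": regime}))
</code></pre>
<p>Keeping the boundaries in one list makes it easier to adjust the regimes to your own data or market history.</p>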
<h2 id="heading-what-did-we-get-out-of-all-this-storyline">What Did We Get Out of All This Storyline?</h2>
<p>Guiding you through six interconnected visualizations, we’ve turned 15 years of earnings data into a clear and engaging story.</p>
<p>Each chart responds to a specific question, yet together, they paint a bigger picture: earnings surprises influence markets, but not in the same way everywhere. Some sectors, periods, and regimes show modest, recurring advantages, while others don’t.</p>
<p>Here’s what the data shows us:</p>
<ul>
<li><p><strong>No definitive alpha here, but specific opportunities are present</strong>: Markets are mostly efficient (returns hover near zero with weak surprise correlations), yet Consumer Cyclicals and Materials consistently show upside potential across different timeframes and market sizes. Timing your sector choice is important.</p>
</li>
<li><p><strong>Timing windows alter the story</strong>: 3-day reactions benefit Real Estate mid-caps (+4%), while 10-day reactions shift leadership to Consumer Cyclical mega-caps (+3.2%). Don’t assume all earnings reactions occur at the same pace.</p>
</li>
<li><p><strong>Mega-tech hype isn’t eternal</strong>: The bubble chart shows AAPL/MSFT/NVDA delivered strong returns from 2020–2022, but the falling trend since then indicates waning market enthusiasm. Don’t chase yesterday’s overhyped stocks.</p>
</li>
<li><p><strong>Calendar patterns reward patience</strong>: January and October deliver slightly stronger post-earnings returns (~0.8%), while July, August, and December see fewer earnings events. Combining seasonal timing with sector choices may add a small edge.</p>
</li>
<li><p><strong>Market regimes change winners</strong>: Cyclicals did well during the COVID recovery and the bull run (2023+) but lagged in the 2022 bear market, while Industrials peaked during the recovery. There are no universal “best performers,” only the best performers <em>for now</em>. Adjust to the regime.</p>
</li>
<li><p><strong>The actionable setup</strong>: Small to mid-cap cyclical longs in January during bull markets combine all these signals for maximum conviction: this is where sector timing, seasonality, and regime alignment converge.</p>
</li>
</ul>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>This exercise shows why visualization is important in finance: raw tables of returns and surprises wouldn’t reveal these patterns.</p>
<ul>
<li><p>Heatmaps instantly highlighted sector winners.</p>
</li>
<li><p>Scatter plots demonstrated the weak surprise‑return connection.</p>
</li>
<li><p>Bubble charts narrated the mega‑tech story over time.</p>
</li>
<li><p>Violins unveiled the harsh truth that markets are noisy.</p>
</li>
<li><p>Cross‑sectional regime analysis reminded us that yesterday’s approach doesn’t ensure tomorrow’s returns.</p>
</li>
</ul>
<p>The effort to interpret this data pays off: you shift from passive observation to active pattern recognition. You see not just what occurred, but where and when it happened. In trading and analysis, understanding the shape of complexity often surpasses having a perfect formula.</p>
<p>Visual storytelling turns data into intuition. And intuition, based on evidence, outperforms guesswork every time.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Spam Email Detector with Python and Naive Bayes Classifier ]]>
                </title>
                <description>
                    <![CDATA[ Ever wondered how Gmail knows that an email promising you $10 million is spam? Or how it catches those "You've won a free iPhone!" messages before they reach your inbox? In this tutorial, you'll build ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-spam-email-detector-with-python-and-naive-bayes-classifier/</link>
                <guid isPermaLink="false">69b0a8f8abc0d95001af6574</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ algorithms ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Maku Gideon ]]>
                </dc:creator>
                <pubDate>Tue, 10 Mar 2026 23:27:52 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/92eb401b-fce3-411b-9b0b-02ba486586cb.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Ever wondered how Gmail knows that an email promising you $10 million is spam? Or how it catches those "You've won a free iPhone!" messages before they reach your inbox?</p>
<p>In this tutorial, you'll build your own spam email classifier from scratch using the Naive Bayes algorithm. By the end, you'll have a working model that achieves over 97% accuracy—and you'll understand exactly how it works under the hood.</p>
<p>This project was inspired by the <a href="https://www.amazon.com/dp/B08WK2HCWL">Python Machine Learning Workbook for Beginners</a> by AI Publishing, which offers excellent hands-on ML projects for those starting their journey. <em>(Note: I have no affiliation with the authors — I simply found it a useful resource.)</em></p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-why-naive-bayes-for-spam-detection">Why Naive Bayes for Spam Detection?</a></p>
</li>
<li><p><a href="#heading-how-to-set-up-your-environment">How to Set Up Your Environment</a></p>
</li>
<li><p><a href="#heading-how-to-load-and-explore-the-dataset">How to Load and Explore the Dataset</a></p>
</li>
<li><p><a href="#heading-how-to-visualize-the-data-distribution">How to Visualize the Data Distribution</a></p>
</li>
<li><p><a href="#heading-how-to-analyze-word-patterns-with-word-clouds">How to Analyze Word Patterns with Word Clouds</a></p>
</li>
<li><p><a href="#heading-how-to-preprocess-the-text-data">How to Preprocess the Text Data</a></p>
</li>
<li><p><a href="#heading-how-to-convert-text-to-numerical-features">How to Convert Text to Numerical Features</a></p>
</li>
<li><p><a href="#heading-how-to-train-the-naive-bayes-classifier">How to Train the Naive Bayes Classifier</a></p>
</li>
<li><p><a href="#heading-how-to-evaluate-model-performance">How to Evaluate Model Performance</a></p>
</li>
<li><p><a href="#heading-how-to-test-on-individual-emails">How to Test on Individual Emails</a></p>
</li>
<li><p><a href="#heading-key-takeaways">Key Takeaways</a></p>
</li>
</ul>
<h2 id="heading-what-youll-learn">What You'll Learn</h2>
<ul>
<li><p>How email spam filters actually work</p>
</li>
<li><p>The intuition behind the Naive Bayes algorithm</p>
</li>
<li><p>Text preprocessing techniques for machine learning</p>
</li>
<li><p>How to evaluate classification models</p>
</li>
<li><p>Building a complete spam detection pipeline in Python</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>You should have basic familiarity with Python and some understanding of fundamental machine learning concepts. Don't worry if you're still learning—I'll explain everything as we go.</p>
<h2 id="heading-why-naive-bayes-for-spam-detection">Why Naive Bayes for Spam Detection?</h2>
<p>Before we dive into code, let's understand why Naive Bayes is particularly well-suited for this task.</p>
<p>Imagine you receive an email containing words like "free," "winner," "click here," and "limited time offer." Your brain immediately flags this as suspicious. The Naive Bayes algorithm does something similar—it calculates the probability that an email is spam based on the words it contains.</p>
<p>The algorithm is called "naive" because it makes a simplifying assumption: it treats each word as independent of every other word. In reality, word combinations matter (think "free trial" vs. "free money"), but this simplification works remarkably well in practice.</p>
<p><strong>Why Choose Naive Bayes?</strong></p>
<ul>
<li><p><strong>Speed</strong>: It trains incredibly fast, even on large datasets</p>
</li>
<li><p><strong>Efficiency</strong>: Requires minimal training data to produce reliable results</p>
</li>
<li><p><strong>Simplicity</strong>: Easy to implement and interpret</p>
</li>
<li><p><strong>Performance</strong>: Despite its simplicity, it is often competitive with more complex algorithms for text classification</p>
</li>
</ul>
<p><strong>Limitations to keep in mind:</strong></p>
<ul>
<li><p>The independence assumption means it can't capture relationships between words</p>
</li>
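<li><p>If a word appears in the test data but never appeared in one class during training, the unsmoothed algorithm assigns it zero probability for that class, which wipes out the whole product. Additive (Laplace) smoothing avoids this, and scikit-learn's <code>MultinomialNB</code> applies it by default through its <code>alpha=1.0</code> parameter</p>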
</li>
</ul>
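<p>To make the idea concrete, here is a minimal from-scratch sketch of the calculation described above, including additive smoothing. The training messages are made up, the helper names (<code>count_words</code>, <code>log_score</code>, <code>classify</code>) are hypothetical, and both classes are assumed to be equally likely. It works with log probabilities so that multiplying many small numbers doesn't underflow.</p>
<pre><code class="language-python">import math
from collections import Counter

# Made-up training messages, only for illustration
spam_docs = ["free money now", "win free prize"]
ham_docs = ["meeting at noon", "project status update now"]

def count_words(docs):
    """Return per-word counts and the total number of words in a class."""
    counts = Counter(word for doc in docs for word in doc.split())
    return counts, sum(counts.values())

spam_counts, spam_total = count_words(spam_docs)
ham_counts, ham_total = count_words(ham_docs)
vocab_size = len(set(spam_counts) | set(ham_counts))

def log_score(message, counts, total, prior, alpha=1.0):
    # log P(class) plus the sum of log P(word | class), with add-alpha smoothing
    score = math.log(prior)
    for word in message.split():
        score += math.log((counts[word] + alpha) / (total + alpha * vocab_size))
    return score

def classify(message):
    scores = {
        "spam": log_score(message, spam_counts, spam_total, prior=0.5),
        "ham": log_score(message, ham_counts, ham_total, prior=0.5),
    }
    # Pick the class with the higher log score
    return max(scores, key=scores.get)

print(classify("free prize"))
print(classify("project meeting"))
</code></pre>
<p>Because of the <code>alpha</code> term, a word that a class has never seen lowers that class's score instead of forcing it to zero.</p>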
<p>Now let's build our spam detector.</p>
<h2 id="heading-how-to-set-up-your-environment">How to Set Up Your Environment</h2>
<p>First, install the required libraries. Run this in a Jupyter notebook cell, or drop the leading <code>%</code> to run it in your terminal:</p>
<pre><code class="language-python">%pip install nltk wordcloud numpy pandas seaborn matplotlib scikit-learn
</code></pre>
<p>Here's a quick summary of what each library does:</p>
<ul>
<li><p><code>re</code>: Python's built-in regular expression module, used for cleaning text with pattern matching (no installation needed)</p>
</li>
<li><p><code>nltk</code>: provides the list of English stop words</p>
</li>
<li><p><code>wordcloud</code>: for visualizing which words appear most frequently</p>
</li>
<li><p><code>numpy</code> and <code>pandas</code>: for data loading and manipulation</p>
</li>
<li><p><code>seaborn</code> and <code>matplotlib</code>: for charts and visualizations</p>
</li>
<li><p><code>scikit-learn</code>: provides the Naive Bayes classifier, vectorizer, and evaluation tools</p>
</li>
</ul>
<p>Once installation is complete, import everything at the top of your script or notebook. Grouping all imports at the top is a Python best practice — it makes dependencies easy to spot at a glance.</p>
<pre><code class="language-python"># Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Natural language processing
import nltk
import re
from nltk.corpus import stopwords

# Machine learning
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Word cloud visualization
from wordcloud import WordCloud
</code></pre>
<h2 id="heading-how-to-load-and-explore-the-dataset">How to Load and Explore the Dataset</h2>
<p>We'll use a dataset of labeled emails. You can download it from <a href="https://bit.ly/3j9Uh7h">Kaggle</a> or use any similar email dataset with <code>text</code> and <code>spam</code> columns.</p>
<p>Use pandas' <code>read_csv()</code> function to load the dataset from a CSV file into a DataFrame — a table-like structure that makes it easy to inspect and manipulate data. The <code>head()</code> method then displays the first 5 rows so you can confirm the data loaded correctly and understand its structure.</p>
<pre><code class="language-python">message_dataset = pd.read_csv('emails.csv')
message_dataset.head()
</code></pre>
<p><strong>Output:</strong></p>
<table>
<thead>
<tr>
<th></th>
<th>text</th>
<th>spam</th>
</tr>
</thead>
<tbody><tr>
<td>0</td>
<td>Subject: naturally irresistible your corporate...</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>Subject: the stock trading gunslinger fanny i...</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>Subject: unbelievable new homes made easy im ...</td>
<td>1</td>
</tr>
<tr>
<td>3</td>
<td>Subject: 4 color printing special request add...</td>
<td>1</td>
</tr>
<tr>
<td>4</td>
<td>Subject: do not have money , get software cds ...</td>
<td>1</td>
</tr>
</tbody></table>
<p>Next, check the DataFrame's <code>shape</code> attribute. It returns a tuple of (rows, columns) and is a quick way to confirm that the full dataset loaded.</p>
<pre><code class="language-python"># Get the dimensions of our dataset (rows, columns)
message_dataset.shape
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">(5728, 2)
</code></pre>
<p>The dataset contains 5,728 emails with two columns: <code>text</code> (the email content) and <code>spam</code> (1 for spam, 0 for legitimate emails).</p>
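<p>Before moving on, it can also help to check for missing values and duplicate rows, since either can skew a classifier. The sketch below runs on a tiny made-up DataFrame named <code>sample</code> with the same two columns, not on <code>emails.csv</code> itself, so the counts are purely illustrative.</p>
<pre><code class="language-python">import pandas as pd

# Tiny made-up stand-in with the same columns as the real dataset
sample = pd.DataFrame({
    "text": ["Subject: free offer", "Subject: meeting notes", "Subject: free offer", None],
    "spam": [1, 0, 1, 0],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())

# Drop rows without text, then drop exact duplicates
cleaned = sample.dropna(subset=["text"]).drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(duplicate_rows)
print(cleaned.shape)
</code></pre>
<p>Note that <code>reset_index(drop=True)</code> renumbers the rows, so any row positions you rely on later would shift if rows are removed.</p>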
<h2 id="heading-how-to-visualize-the-data-distribution">How to Visualize the Data Distribution</h2>
<p>Before training any model, it's crucial to understand your data. Let's see how spam and legitimate emails are distributed.</p>
<p><code>value_counts()</code> tallies how many emails belong to each class (spam vs. legitimate). Chaining <code>.plot(kind="pie")</code> on the result converts those counts directly into a pie chart. The <code>autopct="%1.0f%%"</code> argument tells matplotlib to label each slice with its percentage, rounded to the nearest whole number.</p>
<pre><code class="language-python">plt.rcParams["figure.figsize"] = [8, 10]
message_dataset.spam.value_counts().plot(kind="pie", autopct="%1.0f%%")
</code></pre>
<p><strong>Output:</strong></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769379922505/de29f062-db6b-4f7c-87ad-bf03440cc3fc.png" alt="Pie chart showing roughly 76% legitimate emails and 24% spam emails" width="600" height="400" loading="lazy">

<p>You'll see that approximately 24% of emails in the dataset are spam, while 76% are legitimate. This is a moderately imbalanced dataset, which we'll keep in mind when evaluating our model.</p>
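<p>One practical consequence of this imbalance is that a model which always predicts "legitimate" would already look fairly accurate. The sketch below computes that majority-class baseline. The <code>labels</code> Series is made up to mirror the rough 76/24 split, so treat it as an illustration and not as the real data.</p>
<pre><code class="language-python">import pandas as pd

# Made-up labels with roughly the same 76/24 split as the dataset
labels = pd.Series([0] * 76 + [1] * 24, name="spam")

class_share = labels.value_counts(normalize=True)  # fraction of emails per class
majority_class = class_share.idxmax()              # the most common label
baseline_accuracy = class_share.max()              # accuracy of always guessing it

print(f"Always predicting {majority_class} scores {baseline_accuracy:.0%}")
</code></pre>
<p>Any trained model should clearly beat this number before its accuracy counts as good.</p>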
<h2 id="heading-how-to-analyze-word-patterns-with-word-clouds">How to Analyze Word Patterns with Word Clouds</h2>
<p>Word clouds provide an intuitive visualization of the most frequent words in a text corpus. Words that appear more often are rendered larger. Let's create separate word clouds for spam and legitimate emails to identify distinguishing patterns.</p>
<p>First, we need to remove stop words — common words like "the," "is," and "at" that appear everywhere and carry no meaningful signal for classification. NLTK's <code>stopwords.words("english")</code> returns a pre-built list of these words. The <code>apply()</code> method runs a function across every row in the column, and the lambda inside it splits each email into individual words, filters out any stop words, then rejoins the remaining words into a clean string.</p>
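<pre><code class="language-python"># Download the stop word list once; later runs reuse the cached copy
nltk.download("stopwords")
stop = stopwords.words("english")

message_dataset["text_without_sw"] = message_dataset["text"].apply(
    # Join with a space so the remaining words stay separated
    lambda x: " ".join([item for item in x.split() if item not in stop])
)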
</code></pre>
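<p>NLTK's English stop word list is all lowercase, so a capitalized word such as "The" slips past a case-sensitive membership test. The sketch below lowercases each word before checking it. To stay runnable without a download, it uses a small hand-picked <code>stop_words</code> set as a stand-in for the NLTK list, and <code>remove_stop_words</code> is a hypothetical helper name.</p>
<pre><code class="language-python"># Small hand-picked subset standing in for stopwords.words("english")
stop_words = {"the", "is", "at", "a", "to"}

def remove_stop_words(text, stop_words):
    """Drop stop words, comparing in lowercase so 'The' is treated like 'the'."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

result = remove_stop_words("The offer is valid at the store", stop_words)
print(result)
</code></pre>
<p>Using a set instead of a list also makes each membership check faster, which adds up across thousands of emails.</p>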
<p>Now let's visualize the spam emails. We filter the DataFrame to rows where <code>spam == 1</code>, join all that text into a single large string, and pass it to <code>WordCloud().generate()</code>. The <code>imshow()</code> function renders the resulting image, and <code>axis("off")</code> hides the x/y axes since they're not meaningful for an image display.</p>
<pre><code class="language-python">message_dataset_spam = message_dataset[message_dataset["spam"] == 1]

plt.rcParams["figure.figsize"] = [8, 10]
text = ' '.join(message_dataset_spam['text_without_sw'])
wordcloud2 = WordCloud().generate(text)

plt.imshow(wordcloud2)
plt.axis("off")
plt.show()
</code></pre>
<p><strong>Output:</strong></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769379941156/d33d8b61-7ec4-4044-a98a-c7b9ce95eabf.png" alt="Word cloud of the most frequent words in spam emails" width="600" height="400" loading="lazy">

<p>Now do the same for legitimate emails by filtering to rows where <code>spam == 0</code>:</p>
<pre><code class="language-python">message_dataset_ham = message_dataset[message_dataset["spam"] == 0]

plt.rcParams["figure.figsize"] = [8, 10]
text = ' '.join(message_dataset_ham['text_without_sw'])
wordcloud2 = WordCloud().generate(text)

plt.imshow(wordcloud2)
plt.axis("off")
plt.show()
</code></pre>
<p><strong>Output:</strong></p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769379947878/368e5251-3072-423f-b298-fb00d03254f3.png" alt="Word cloud of the most frequent words in legitimate emails" width="600" height="400" loading="lazy">

<p><strong>Key observations:</strong></p>
<ul>
<li><p><strong>Spam emails</strong> frequently contain promotional language: "free," "money," "offer," "click," "please"</p>
</li>
<li><p><strong>Legitimate emails</strong> contain more conversational and work-related terms: "company," "time," "thanks"</p>
</li>
</ul>
<p>You'll also notice the word "enron" appearing prominently in the legitimate emails cloud. This is because the non-spam emails in this dataset are drawn from the publicly available <strong>Enron email corpus</strong>, a large collection of real internal emails from Enron Corporation that was made public during the federal investigation that followed the company's collapse in 2001. It has since become one of the most widely used benchmark datasets in NLP research.</p>
<p>These patterns give us confidence that word-based classification will work well.</p>
<h2 id="heading-how-to-preprocess-the-text-data">How to Preprocess the Text Data</h2>
<p>Raw text needs cleaning before machine learning algorithms can process it effectively. Let's first separate our features from our labels. In ML terminology, <code>X</code> holds the inputs (the email text we use to make predictions) and <code>y</code> holds the target labels (1 for spam, 0 for legitimate).</p>
<pre><code class="language-python">X = message_dataset["text"]
y = message_dataset["spam"]
</code></pre>
<p>Now we'll define a function to clean the text. The <code>re.sub()</code> function from Python's built-in <code>re</code> module performs pattern-based substitution using regular expressions. We call it three times in sequence:</p>
<ol>
<li><p><code>re.sub('[^a-zA-Z]', ' ', doc)</code> — replaces anything that isn't a letter (numbers, punctuation, symbols) with a space. This strips noise that doesn't help with classification.</p>
</li>
<li><p><code>re.sub(r'\s+[a-zA-Z]\s+', ' ', document)</code> — removes isolated single characters (like "I" or "a" left behind after removing punctuation) by matching any single letter surrounded by whitespace.</p>
</li>
<li><p><code>re.sub(r'\s+', ' ', document)</code> — collapses multiple consecutive spaces into a single space, tidying up any extra gaps created by the previous two steps.</p>
</li>
</ol>
<pre><code class="language-python">def clean_text(doc):
    document = re.sub('[^a-zA-Z]', ' ', doc)
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    document = re.sub(r'\s+', ' ', document)
    return document
</code></pre>
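<p>To see what each substitution contributes, the sketch below applies the same three patterns one step at a time to a made-up sentence. The <code>sample</code> string and the <code>step1</code> to <code>step3</code> names are only for illustration.</p>
<pre><code class="language-python">import re

sample = "Hello!!! You won 2 FREE tickets, a gift 4 you."

step1 = re.sub('[^a-zA-Z]', ' ', sample)       # non-letters become spaces
step2 = re.sub(r'\s+[a-zA-Z]\s+', ' ', step1)  # drops the lone "a"
step3 = re.sub(r'\s+', ' ', step2)             # collapses runs of spaces

# repr() makes leading and trailing spaces visible
print(repr(step1))
print(repr(step2))
print(repr(step3))
</code></pre>
<p>Two details are worth knowing. The final period turns into a trailing space that the function keeps, so you may want to add <code>.strip()</code>. The function also leaves capitalization alone, which is fine here because <code>TfidfVectorizer</code> lowercases text by default.</p>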
<p>Apply this cleaning function to every email in the dataset. We first convert the pandas Series to a plain Python list using <code>list()</code>, then loop through each email, clean it, and collect the results in <code>X_sentences</code>.</p>
<pre><code class="language-python"># Create an empty list to store cleaned emails
X_sentences = []

# Convert the pandas Series to a list for iteration
emails = list(X)

# Clean each email and add it to our list
for email in emails:
    X_sentences.append(clean_text(email))
</code></pre>
<h2 id="heading-how-to-convert-text-to-numerical-features">How to Convert Text to Numerical Features</h2>
<p>Machine learning algorithms work with numbers, not text. We need to transform our cleaned text into a numerical representation.</p>
<p><strong>TF-IDF (Term Frequency-Inverse Document Frequency)</strong> is a great choice for this. It assigns each word a score that reflects how important it is to a particular document relative to the entire dataset. A word that appears often in one email but rarely across all emails gets a high score — meaning it's distinctive and likely meaningful. Common words that appear everywhere get a lower score.</p>
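<p>As a rough illustration of the scoring, the sketch below computes TF-IDF weights by hand for three made-up, pre-tokenized documents. It follows the smoothed formula that scikit-learn documents as its default, <code>idf = ln((1 + n) / (1 + df)) + 1</code>, followed by L2 normalization. The <code>idf</code> and <code>tfidf</code> helper names are hypothetical, and the real vectorizer also handles tokenization and the filtering options described next.</p>
<pre><code class="language-python">import math

# Made-up, already tokenized documents
docs = [
    ["free", "money", "now"],
    ["meeting", "now"],
    ["free", "free", "prize", "now"],
]
n_docs = len(docs)

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger values
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf(doc):
    # Raw term count times idf, then scaled so the vector has length 1
    raw = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = tfidf(docs[0])
for term in sorted(weights, key=weights.get, reverse=True):
    print(term, round(weights[term], 3))
</code></pre>
<p>In the first document, "money" appears in only one document and ends up with the highest weight, while "now" appears in all three and gets the lowest.</p>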
<p><code>TfidfVectorizer</code> from scikit-learn handles this transformation. The parameters we set control what gets included:</p>
<ul>
<li><p><code>max_features=2500</code> — only keeps the 2,500 most frequent words, discarding rare ones that don't generalize well</p>
</li>
<li><p><code>min_df=5</code> — ignores words that appear in fewer than 5 emails (too rare to be useful)</p>
</li>
<li><p><code>max_df=0.7</code> — ignores words that appear in more than 70% of all emails (too common to be distinctive)</p>
</li>
<li><p><code>stop_words=stopwords.words('english')</code> — removes common English words like "the" and "is"</p>
</li>
</ul>
<p><code>fit_transform()</code> does two things in one step: it learns the vocabulary from our text (fit), then converts each email into a numerical vector based on that vocabulary (transform). The result is a sparse matrix, which stores only non-zero values for efficiency. Calling <code>.toarray()</code> converts it into a regular dense NumPy array, which is easier to inspect. This step is optional here, since <code>MultinomialNB</code> also accepts sparse input, and skipping it saves memory on larger datasets.</p>
<pre><code class="language-python">vectorizer = TfidfVectorizer(
    max_features=2500,
    min_df=5,
    max_df=0.7,
    stop_words=stopwords.words('english')
)

X = vectorizer.fit_transform(X_sentences).toarray()
</code></pre>
<p>Each email is now represented as a vector of 2,500 numbers, where each number is the TF-IDF score for a specific word.</p>
<h2 id="heading-how-to-train-the-naive-bayes-classifier">How to Train the Naive Bayes Classifier</h2>
<p>Now comes the exciting part — training our model! First, split the data into training and test sets using <code>train_test_split()</code>. This function randomly shuffles and divides both <code>X</code> and <code>y</code> simultaneously, keeping labels aligned with their corresponding emails. Setting <code>test_size=0.20</code> reserves 20% of the data for testing. Setting <code>random_state=42</code> seeds the random number generator so you get the same split every time you run the code, making your results reproducible.</p>
<pre><code class="language-python">X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42
)
</code></pre>
<p>Now train the Multinomial Naive Bayes classifier. <code>MultinomialNB</code> is designed for count-like features such as word counts. TF-IDF scores are fractional weights instead of true counts, but the scikit-learn documentation notes that they also work in practice. Calling <code>fit(X_train, y_train)</code> trains the model by having it estimate the probability of each word appearing in spam versus legitimate emails across the training set. Those probability tables are what the model uses later to classify new emails.</p>
<pre><code class="language-python">
spam_detector = MultinomialNB()
spam_detector.fit(X_train, y_train)
</code></pre>
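<p>One caveat about the steps above: the vectorizer was fitted on every email before the split, so the vocabulary and IDF weights were influenced by the test emails too. A common way to avoid this is to split the raw text first and let a scikit-learn <code>Pipeline</code> fit the vectorizer on the training portion only. The sketch below shows the pattern on made-up toy emails. In the tutorial's setting, the inputs would be <code>X_sentences</code> and <code>y</code>, and the vectorizer would keep the same parameters as before.</p>
<pre><code class="language-python">from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy emails: 1 = spam, 0 = legitimate
texts = [
    "free money offer click now",
    "win a free prize today",
    "limited time offer claim your money",
    "click here for a free gift",
    "meeting agenda for monday",
    "please review the attached report",
    "project status update and schedule",
    "thanks for the notes from the call",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class balance in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training text only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(list(zip(test_texts, predictions)))
</code></pre>
<p>A pipeline also bundles the vectorizer and classifier into a single object, so new emails can be passed to <code>model.predict()</code> as plain strings.</p>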
<p>That's it! Naive Bayes trains remarkably fast: on a dataset of this size, fitting typically takes well under a second.</p>
<h2 id="heading-how-to-evaluate-model-performance">How to Evaluate Model Performance</h2>
<p>Let's see how well our spam detector performs on emails it has never seen before. The <code>predict()</code> method takes the test set features and returns a predicted label (0 or 1) for each email, based on the probability tables the model learned during training.</p>
<pre><code class="language-python">
y_pred = spam_detector.predict(X_test)
</code></pre>
<p>Now evaluate the predictions using three different tools from scikit-learn's <code>metrics</code> module:</p>
<ul>
<li><p><code>confusion_matrix()</code> — produces a 2×2 grid comparing actual vs. predicted labels, showing exactly where the model gets things right and wrong</p>
</li>
<li><p><code>classification_report()</code> — prints precision, recall, and F1-score for each class, giving a more complete picture than accuracy alone</p>
</li>
<li><p><code>accuracy_score()</code> — returns the overall percentage of correct predictions</p>
</li>
</ul>
<pre><code class="language-python">
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">[[849   7]
 [ 18 272]]

              precision    recall  f1-score   support

           0       0.98      0.99      0.99       856
           1       0.97      0.94      0.96       290

    accuracy                           0.98      1146
   macro avg       0.98      0.96      0.97      1146
weighted avg       0.98      0.98      0.98      1146

0.9781849912739965
</code></pre>
<p>Our model achieves <strong>97.82% accuracy</strong>! Let's break down what the confusion matrix tells us:</p>
<ul>
<li><p><strong>849</strong>: Legitimate emails correctly identified as legitimate (True Negatives)</p>
</li>
<li><p><strong>7</strong>: Legitimate emails incorrectly marked as spam (False Positives)</p>
</li>
<li><p><strong>18</strong>: Spam emails that slipped through as legitimate (False Negatives)</p>
</li>
<li><p><strong>272</strong>: Spam emails correctly caught (True Positives)</p>
</li>
</ul>
<p>The classification report shows:</p>
<ul>
<li><p><strong>For legitimate emails (class 0)</strong>: 98% precision, 99% recall</p>
</li>
<li><p><strong>For spam emails (class 1)</strong>: 97% precision, 94% recall</p>
</li>
</ul>
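<p>The figures in the report follow directly from the four confusion matrix counts. As a quick check, this short sketch recomputes accuracy and the spam-class precision, recall, and F1-score by hand from the numbers printed above.</p>
<pre><code class="language-python"># Counts taken from the confusion matrix output above
tn, fp, fn, tp = 849, 7, 18, 272

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of emails flagged as spam, how many were spam
recall = tp / (tp + fn)                      # of actual spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
</code></pre>
<p>Precision matters most when wrongly blocking a legitimate email is costly, while recall matters most when letting spam through is the bigger problem.</p>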
<p>These numbers are impressive, especially considering the simplicity of our approach.</p>
<h2 id="heading-how-to-test-on-individual-emails">How to Test on Individual Emails</h2>
<p>Let's verify our model works by testing it on a specific email. We'll first print the cleaned text at index 56 and its actual label to see what we're working with. Then we'll ask the model to predict it.</p>
<pre><code class="language-python">
print(X_sentences[56])
print(y[56])
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">Subject localized software all languages available hello we would like to offer localized software versions german french spanish uk and many others aii iisted software is available for immediate downioad no need to wait week for cd deiivery just few exampies norton lnternet security pro windows xp professionai with sp fuil version corei draw graphics suite dreamweaver mx homesite inciudinq macromedia studio mx just browse our site and find any software you need in your native ianguaqe best reqards kayieen 
1
</code></pre>
<p>This is clearly a spam email trying to sell pirated software. The actual label is 1 (spam). Now pass this single email through the same pipeline: first transform it into a TF-IDF vector using the already-fitted <code>vectorizer</code>, then call <code>predict()</code> on the result. It's important to reuse the vectorizer that was fitted earlier instead of fitting a new one, so the word-to-index mapping matches what the model was trained on.</p>
<pre><code class="language-python">
print(spam_detector.predict(vectorizer.transform([X_sentences[56]])))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="language-plaintext">[1]
</code></pre>
<p>The model correctly identifies this promotional email as spam.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ol>
<li><p><strong>Naive Bayes is powerful for text classification</strong> despite its simplifying assumptions. For spam detection, it achieves excellent accuracy with minimal computational cost.</p>
</li>
<li><p><strong>Text preprocessing matters</strong>. Removing noise (special characters, numbers, extra spaces) helps the algorithm focus on meaningful patterns.</p>
</li>
<li><p><strong>TF-IDF captures word importance effectively</strong>. It gives higher weight to distinctive words that help differentiate spam from legitimate emails.</p>
</li>
<li><p><strong>Always evaluate with multiple metrics</strong>. Accuracy alone can be misleading, especially with imbalanced datasets. Precision, recall, and F1-score give a complete picture.</p>
</li>
<li><p><strong>Start simple</strong>. Before reaching for complex deep learning models, try classical algorithms like Naïve Bayes. They're interpretable, fast, and often surprisingly effective.</p>
</li>
</ol>
<h2 id="heading-next-steps">Next Steps</h2>
<p>Want to improve this spam detector further? Here are some ideas:</p>
<ul>
<li><p><strong>Experiment with different vectorizers</strong>: Try CountVectorizer or word embeddings (Word2Vec, GloVe)</p>
</li>
<li><p><strong>Handle class imbalance</strong>: Use techniques like SMOTE or adjust class weights</p>
</li>
<li><p><strong>Feature engineering</strong>: Add features like email length, number of links, or sender domain</p>
</li>
<li><p><strong>Try other algorithms</strong>: Compare with SVM, Random Forest, or gradient boosting</p>
</li>
<li><p><strong>Deploy the model</strong>: Build a simple API using Flask or FastAPI</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>You've built a spam email classifier that achieves over 97% accuracy using the Naïve Bayes algorithm. Along the way, you learned about text preprocessing, feature extraction with TF-IDF, and model evaluation techniques.</p>
<p>The beauty of this approach is its simplicity. With just a few dozen lines of code, you've created something that actually works—and now you understand the principles behind commercial spam filters.</p>
<p>Feel free to experiment with the code, try different parameters, and see how the results change. That's the best way to deepen your understanding.</p>
<h2 id="heading-references">References</h2>
<ul>
<li><a href="https://www.amazon.com/dp/B08WK2HCWL">Python Machine Learning Workbook for Beginners: 10 Machine Learning Projects Explained from Scratch</a> by AI Publishing</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create Boxplots and Model Data in R Using ggplot2 ]]>
                </title>
                <description>
                    <![CDATA[ In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-boxplots-and-model-data-in-r/</link>
                <guid isPermaLink="false">69693680d6f0e208b327d21c</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ R Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiffany Mojo Omondi ]]>
                </dc:creator>
                <pubDate>Thu, 15 Jan 2026 18:48:32 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768418231372/f36e1cca-eed9-4620-bd7c-19788d8beafe.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn about statistical modelling using linear regression and logistic regression in R.</p>
<p>By the end of this article, you should understand how to create boxplots in R, why they matter, and how they fit into a real-world analytics workflow.</p>
<h2 id="heading-table-of-contents"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-load-and-inspect-the-data">How to Load and Inspect the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-clean-and-prepare-the-data">How to Clean and Prepare the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-use-boxplots">How to Use Boxplots</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-create-boxplots-with-ggplot2">How to Create Boxplots with ggplot2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-perform-exploratory-data-analysis-eda">How to Perform Exploratory Data Analysis</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-linear-regression-models">How to Build Linear Regression Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-logistic-regression-models">How to Build Logistic Regression Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-visualization-comes-before-modeling">Why Visualization Comes Before Modeling</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>Before you begin, you should be comfortable with the following:</p>
<ul>
<li><p>Basic R syntax (variables, functions, data frames).</p>
</li>
<li><p>Installing and loading R packages.</p>
</li>
<li><p>Understanding what rows and columns represent in a dataset.</p>
</li>
<li><p>Very basic statistics (mean, median, distributions).</p>
</li>
</ul>
<h2 id="heading-how-to-set-up-your-r-environment">How to Set Up Your R Environment</h2>
<p>Start by installing and loading the packages you will need.</p>
<pre><code class="lang-r">install.packages(c(<span class="hljs-string">"tidyverse"</span>, <span class="hljs-string">"ggplot2"</span>))
<span class="hljs-keyword">library</span>(tidyverse)
<span class="hljs-keyword">library</span>(ggplot2)
</code></pre>
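<p><code>tidyverse</code> provides tools for data manipulation and visualization. <code>ggplot2</code> is the visualization engine you will use for boxplots. It is part of the tidyverse, so <code>library(tidyverse)</code> already attaches it, and the separate <code>library(ggplot2)</code> call simply makes the dependency explicit. Loading the libraries makes their functions available for use.</p>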
<h2 id="heading-how-to-load-and-inspect-the-data">How to Load and Inspect the Data</h2>
<p>First, download the <a target="_blank" href="https://www.kaggle.com/datasets/saadharoon27/hr-analytics-dataset">HR Analytics dataset by Saad Haroon from Kaggle</a>.</p>
<p>Assuming the downloaded file is saved as "C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv", load it into R by passing that path to <code>read.csv</code>.</p>
<p>You can view a sample of the dataset by running the <code>head</code> function. To view the structure of the dataset, you can run the <code>str</code> function.</p>
<pre><code class="lang-r">hr &lt;- read.csv(<span class="hljs-string">"C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv"</span>)
head(hr)
str(hr)
</code></pre>
<p>The <code>read.csv</code> function imports the dataset into R. The <code>head</code> function shows the first six rows so you can preview the data. The <code>str</code> function reveals data types, helping you spot categorical versus numeric variables early.</p>
<p>Understanding your data structure early prevents errors later when plotting or modeling. Once you run the <code>head</code> function, you should see the following in your console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768489839861/f304305e-b889-4e25-8315-ff24c5201681.png" alt="first-six-rows-of-a-hr-dataset-shown-in-the-r-console" class="image--center mx-auto" width="1753" height="347" loading="lazy"></p>
<p>From this output, you can see the following:</p>
<h3 id="heading-structure">Structure</h3>
<ul>
<li><p>Each row represents <strong>one employee</strong>.</p>
</li>
<li><p>Each column represents a <strong>feature/variable</strong> about the employee.</p>
</li>
</ul>
<h3 id="heading-key-columns-amp-meaning">Key Columns &amp; Meaning</h3>
<ul>
<li><p><code>EmpID</code> → Employee identifier</p>
</li>
<li><p><code>Age</code> → Age in years</p>
</li>
<li><p><code>AgeGroup</code> → Age category (for example, <code>18-25</code>)</p>
</li>
<li><p><code>Attrition</code> → Whether the employee left or not (<code>Yes/No</code>)</p>
</li>
<li><p><code>BusinessTravel</code> → Travel frequency (<code>Travel_Rarely</code>, <code>Travel_Frequently</code>, <code>Non-Travel</code>)</p>
</li>
<li><p><code>Department</code> → Employee department</p>
</li>
<li><p><code>DistanceFromHome</code> → Distance from home to office (km)</p>
</li>
<li><p><code>Education</code> / <code>EducationField</code> → Level and field of education</p>
</li>
<li><p><code>EmployeeCount</code> → Usually 1 per employee (redundant)</p>
</li>
<li><p><code>Gender</code> → Male / Female</p>
</li>
<li><p><code>JobRole</code> / <code>JobSatisfaction</code> → Job title and satisfaction level</p>
</li>
<li><p><code>MonthlyIncome</code> / <code>SalarySlab</code> → Salary amount and category</p>
</li>
<li><p><code>YearsAtCompany</code> / <code>YearsInCurrentRole</code> → Experience metrics</p>
</li>
<li><p><code>OverTime</code> → Works overtime (<code>Yes/No</code>)</p>
</li>
<li><p>Other features: <code>PerformanceRating</code>, <code>TrainingTimesLastYear</code>, <code>WorkLifeBalance</code>, <code>StockOptionLevel</code>, and so on.</p>
</li>
</ul>
<h3 id="heading-data-types"><strong>Data Types</strong></h3>
<ul>
<li><p><strong>Numeric</strong> → <code>Age</code>, <code>DistanceFromHome</code>, <code>MonthlyIncome</code>, <code>YearsAtCompany</code></p>
</li>
<li><p><strong>Categorical / Character</strong> → <code>Attrition</code>, <code>Gender</code>, <code>Department</code>, <code>JobRole</code></p>
</li>
</ul>
<h3 id="heading-observations"><strong>Observations</strong></h3>
<ul>
<li><p>The dataset is tabular, like a spreadsheet.</p>
</li>
<li><p>There are multiple categorical columns.</p>
</li>
<li><p>There are multiple numeric columns.</p>
</li>
<li><p>Some columns appear to be constant, so they don’t provide useful information (for example, <code>EmployeeCount</code>).</p>
</li>
</ul>
<p>From the <code>str</code> function, you can gather that:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768488901453/80d8cae9-d569-4749-8028-0a6e9cc128c4.png" alt="r-output-showing-structure-of-hr-dataset" class="image--center mx-auto" width="1046" height="612" loading="lazy"></p>
<p>The dataset contains 1,480 observations and 38 variables. Each row represents one employee, and each column represents a feature about that employee.</p>
<p>Each column has a name, data type, and example values. For instance, <code>Age</code> and <code>DistanceFromHome</code> are numeric (<code>int</code>), with values like 28 or 12. <code>EmpID</code> and <code>Department</code> are character strings (<code>chr</code>), with examples like Research &amp; Development or Sales. Other features include <code>JobRole</code> (Analyst, Manager) and <code>Attrition</code> (Yes/No).</p>
<p>The dataset contains mixed data types. Some columns are numeric, such as <code>MonthlyIncome</code> or <code>YearsAtCompany</code>. Some are character or categorical, like <code>Gender</code> (Male/Female) and <code>BusinessTravel</code> (Travel_Rarely, Travel_Frequently). A few columns are redundant or constant. For example, <code>EmployeeCount</code> has the same value of 1 for all rows and does not provide useful information.</p>
<h2 id="heading-how-to-clean-and-prepare-the-data">How to Clean and Prepare the Data</h2>
<p>Before visualization, you must clean your data. To find out what needs cleaning, start by inspecting the data.</p>
<p>Run the <code>summary</code> function to view summary statistics for each column. Then combine <code>is.na</code> with <code>colSums</code> to count the missing values in each column.</p>
<pre><code class="lang-r">summary(hr)
colSums(is.na(hr))
</code></pre>
<p>The <code>summary</code> function gives quick statistics and helps you spot suspicious values. The <code>is.na</code> function marks missing entries, and <code>colSums</code> adds them up per column. Boxplots are sensitive to extreme values, so knowing what you are working with is critical.</p>
<p>After running the <code>summary</code> function, the following will appear in your console:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768490404469/ef3bd30d-c3c9-4cf0-9c91-80a0e56f52f5.png" alt="r-summary-output-of-hr-dataset-showing-statistical-distributions" class="image--center mx-auto" width="1778" height="495" loading="lazy"></p>
<p>This shows the basic statistics of each column. After running the <code>is.na</code> function, the following will also appear in your console:  </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768490678134/00a12c24-224e-4c8f-80ee-bc7bbd4d8ca6.png" alt="r-output-showing-missing-value-counts-per-column-in-hr-dataset" class="image--center mx-auto" width="1832" height="198" loading="lazy"></p>
<p>From this output, you can see that only <code>YearsWithCurrManager</code> has missing values. Its count is <code>57</code>, meaning that <strong>57 employees</strong> don’t have a value for this column.</p>
<p>You can drop this whole column along with redundant columns such as <code>EmployeeCount</code>. The code below also drops <code>Over18</code> and <code>StandardHours</code>, which likewise hold a single value for every row in this dataset.</p>
<pre><code class="lang-r">hr &lt;- hr %&gt;% select(-c(EmployeeCount, Over18, StandardHours, YearsWithCurrManager))
</code></pre>
<p>To verify that the columns are gone, list the remaining column names:</p>
<pre><code class="lang-r">colnames(hr)
</code></pre>
<p>Now we need to convert important categorical variables to factors. Doing this tells R that a column holds a fixed set of categories (for example, ‘Yes’ and ‘No’ for <code>Attrition</code>) rather than free-form text.</p>
<pre><code class="lang-r">hr$Attrition &lt;- as.factor(hr$Attrition)
hr$JobRole &lt;- as.factor(hr$JobRole)
hr$Department &lt;- as.factor(hr$Department)
</code></pre>
<p>This also ensures ggplot2 treats them correctly when grouping.</p>
<h2 id="heading-how-to-use-boxplots">How to Use Boxplots</h2>
<p>A boxplot displays key features of a distribution. The median is shown by the line in the middle of the box. The box itself spans the interquartile range (IQR), from the first quartile to the third. By default in ggplot2, each whisker extends to the most extreme value within 1.5 × IQR of the box, and points beyond the whiskers are drawn individually as outliers.</p>
<p>Boxplots are most useful when you want to compare distributions across groups, such as income by job role or age by attrition status.</p>
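<p>To make the pieces of a boxplot concrete, here is a minimal sketch of the arithmetic behind the box, the whiskers, and the outlier points. It is written in Python rather than R purely to spell out the calculation. The income values are made up, the names are illustrative, and it assumes the common 1.5 × IQR whisker rule with quartiles computed by linear interpolation (which is also what R's <code>quantile</code> does by default).</p>
<pre><code class="lang-python">import statistics

# Hypothetical monthly incomes, invented for illustration only
incomes = [2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 20000]

# Quartiles: the box runs from q1 to q3, with a line at the median
q1, median, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond each edge of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value is an outlier when clamping it into the fence range changes it
outliers = [v for v in incomes if min(max(v, lower_fence), upper_fence) != v]

# Whiskers end at the most extreme values that are not outliers
inside = [v for v in incomes if v not in outliers]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3)             # 3000.0 4000.0 5000.0
print(whisker_low, whisker_high)  # 2000 5500
print(outliers)                   # [20000]
</code></pre>
<p>The single very large income does not stretch the whisker. It is reported as a separate point, which is why boxplots are handy for spotting extreme values.</p>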
<p>Let’s start with a simple boxplot of monthly income.</p>
<pre><code class="lang-r">ggplot(hr, aes(y = MonthlyIncome)) +
  geom_boxplot(fill = <span class="hljs-string">"blue"</span>) +
  labs(
    title = <span class="hljs-string">"Distribution of Monthly Income"</span>,
    y = <span class="hljs-string">"Monthly Income"</span>)
</code></pre>
<p>The <code>aes</code> function tells ggplot which variable to map to which axis. <code>geom_boxplot</code> draws the boxplot. The <code>labs</code> function sets the plot’s text labels, here the title and the <code>y</code> axis label.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766410411798/200b1c22-3b73-49f0-ba30-9b83d28f3055.png" alt="A-vertical-boxplot-showing-the-distribution-of-employee-monthly-income." class="image--center mx-auto" width="473" height="523" loading="lazy"></p>
<h2 id="heading-how-to-create-boxplots-with-ggplot2">How to Create Boxplots with ggplot2</h2>
<p>Now let’s compare <code>MonthlyIncome</code> across the values of <code>JobRole</code>.</p>
<pre><code class="lang-r">ggplot(hr, aes(x = JobRole, y = MonthlyIncome)) +
  geom_boxplot(fill = <span class="hljs-string">"lightblue"</span>) +
  theme(axis.text.x = element_text(angle = <span class="hljs-number">45</span>, hjust = <span class="hljs-number">1</span>)) +
  labs(
    title = <span class="hljs-string">"Monthly Income by Job Role"</span>,
    x = <span class="hljs-string">"Job Role"</span>,
    y = <span class="hljs-string">"Monthly Income"</span>)
</code></pre>
<p>Mapping <code>JobRole</code> to the x aesthetic draws one box per job role. The axis labels are rotated by 45 degrees to improve readability. This visualization quickly reveals income differences across roles.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766508710023/c12ca136-38bf-492e-af90-24d7021b54a4.png" alt="Multiple-boxplots-comparing-monthly-income-distributions-across-different-job-roles." class="image--center mx-auto" width="852" height="522" loading="lazy"></p>
<h2 id="heading-how-to-perform-exploratory-data-analysis-eda">How to Perform Exploratory Data Analysis (EDA)</h2>
<p>Exploratory data analysis involves using visual methods to ask questions and gain a deeper understanding of the data.</p>
<p>As an example, we can look at <code>YearsAtCompany</code> by <code>Department</code>.</p>
<pre><code class="lang-r">ggplot(hr, aes(x = Department, y = YearsAtCompany)) +
  geom_boxplot(fill = <span class="hljs-string">"darkblue"</span>) +
  labs(
    title = <span class="hljs-string">"Years at Company by Department"</span>,
    y = <span class="hljs-string">"Years at Company"</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766512679598/5e5da8cd-8fe7-4fae-bbe9-362af901b330.png" alt="Boxplots-showing-employee-tenure-across-departments." class="image--center mx-auto" width="842" height="518" loading="lazy"></p>
<h2 id="heading-how-to-build-linear-regression-models">How to Build Linear Regression Models</h2>
<p>To see how linear regression works in practice, model <code>MonthlyIncome</code> as a function of <code>YearsAtCompany</code> using the two commands below.</p>
<p>The first command fits the model, and the second displays a summary of it.</p>
<pre><code class="lang-r">hr_lm &lt;- lm(MonthlyIncome ~ YearsAtCompany, data = hr)
summary(hr_lm)
</code></pre>
<p>Linear regression estimates how income changes with tenure. This works when the variables are numeric.</p>
<p>After running the code, your console should show you this output:</p>
<pre><code class="lang-r">Call:
lm(formula = MonthlyIncome ~ YearsAtCompany, data = hr)

Residuals:
   Min     1Q Median     3Q    Max 
 -<span class="hljs-number">9506</span>  -<span class="hljs-number">2488</span>  -<span class="hljs-number">1186</span>   <span class="hljs-number">1403</span>  <span class="hljs-number">15483</span> 

Coefficients:
               Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)     <span class="hljs-number">3734.47</span>     <span class="hljs-number">159.41</span>   <span class="hljs-number">23.43</span>   &lt;<span class="hljs-number">2e-16</span> ***
YearsAtCompany   <span class="hljs-number">395.25</span>      <span class="hljs-number">17.14</span>   <span class="hljs-number">23.07</span>   &lt;<span class="hljs-number">2e-16</span> ***
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

Residual standard error: <span class="hljs-number">4032</span> on <span class="hljs-number">1478</span> degrees of freedom
Multiple R-squared:  <span class="hljs-number">0.2647</span>,    Adjusted R-squared:  <span class="hljs-number">0.2642</span> 
<span class="hljs-literal">F</span>-statistic:   <span class="hljs-number">532</span> on <span class="hljs-number">1</span> and <span class="hljs-number">1478</span> DF,  p-value: &lt; <span class="hljs-number">2.2e-16</span>
</code></pre>
<p>Let’s interpret this model.</p>
<p>The intercept is the predicted income at 0 years: an employee who has just joined the company is predicted to earn $3734.47 per month.</p>
<p>For each year an employee spends at the company, their monthly income is predicted to increase by $395.25.</p>
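<p>Put together, the intercept and slope define a simple prediction rule. The sketch below writes that rule out in Python rather than R, only to make the arithmetic explicit. The function name is hypothetical, and the two constants are copied from the coefficient table above.</p>
<pre><code class="lang-python"># Coefficients copied from the summary(hr_lm) output above
INTERCEPT = 3734.47  # predicted monthly income at 0 years
SLOPE = 395.25       # predicted increase per additional year


def predict_monthly_income(years_at_company):
    """Return the fitted line's predicted monthly income."""
    return INTERCEPT + SLOPE * years_at_company


for years in (0, 5, 10):
    print(years, round(predict_monthly_income(years), 2))
# 0 3734.47
# 5 5710.72
# 10 7686.97
</code></pre>
<p>In R itself, <code>predict(hr_lm, newdata = data.frame(YearsAtCompany = 10))</code> should give the same kind of result directly from the fitted model.</p>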
<p>Both coefficients have p-values &lt; <code>2e-16</code>, so they are highly statistically significant. This is strong evidence that tenure and income are associated, although a regression like this can’t show on its own that one causes the other.</p>
<p>The model’s R-squared is <code>0.2647</code>. This means about 26% of the variation in monthly income is explained by the years an employee spends at the company. This is low, so other factors like role, department, or education likely affect income too.</p>
<p>The model’s F-statistic is <code>532</code>, with a p-value &lt; <code>2.2e-16</code>. This means the model is statistically significant overall.</p>
<p>In general, the longer an employee stays at a company, the more they earn: roughly $395 more in monthly income for each additional year. But years at the company alone explain only about a quarter of the variation in income. You need to consider other variables for better predictions.</p>
<h2 id="heading-how-to-build-logistic-regression-models">How to Build Logistic Regression Models</h2>
<p>You can now learn how to predict attrition. The first command generates the model while the second displays it.</p>
<pre><code class="lang-r">hr_glm &lt;- glm(
  Attrition ~ MonthlyIncome + YearsAtCompany,
  data = hr,
  family = binomial)

summary(hr_glm)
</code></pre>
<p>Your console should show this as an output when you run both commands.</p>
<pre><code class="lang-r">Call:
glm(formula = Attrition ~ MonthlyIncome + YearsAtCompany, family = binomial, 
    data = hr)

Coefficients:
                 Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept)    -<span class="hljs-number">8.094e-01</span>  <span class="hljs-number">1.375e-01</span>  -<span class="hljs-number">5.886</span> <span class="hljs-number">3.96e-09</span> ***
MonthlyIncome  -<span class="hljs-number">9.449e-05</span>  <span class="hljs-number">2.302e-05</span>  -<span class="hljs-number">4.104</span> <span class="hljs-number">4.05e-05</span> ***
YearsAtCompany -<span class="hljs-number">5.047e-02</span>  <span class="hljs-number">1.792e-02</span>  -<span class="hljs-number">2.817</span>  <span class="hljs-number">0.00485</span> ** 
---
Signif. codes:  <span class="hljs-number">0</span> ‘***’ <span class="hljs-number">0.001</span> ‘**’ <span class="hljs-number">0.01</span> ‘*’ <span class="hljs-number">0.05</span> ‘.’ <span class="hljs-number">0.1</span> ‘ ’ <span class="hljs-number">1</span>

(Dispersion parameter <span class="hljs-keyword">for</span> binomial family taken to be <span class="hljs-number">1</span>)

    Null deviance: <span class="hljs-number">1305.4</span>  on <span class="hljs-number">1479</span>  degrees of freedom
Residual deviance: <span class="hljs-number">1252.5</span>  on <span class="hljs-number">1477</span>  degrees of freedom
AIC: <span class="hljs-number">1258.5</span>

Number of Fisher Scoring iterations: <span class="hljs-number">5</span>
</code></pre>
<p>Logistic regression is used for binary outcomes, that is, yes or no. It estimates probability.</p>
<p>Let’s interpret this logistic regression model. The model predicts whether an employee is likely to leave the company (<code>Attrition</code>) based on their <code>MonthlyIncome</code> and <code>YearsAtCompany</code>.</p>
<p>The intercept is <code>-0.809</code>. This is the baseline log-odds of leaving when their income and years at the company are zero.</p>
<p><code>MonthlyIncome</code> has a coefficient of <code>-0.0000945</code>. The negative sign means that as income increases, the log-odds of leaving decrease. The value looks tiny because it is the change per single unit of income.</p>
<p><code>YearsAtCompany</code> has a coefficient of <code>-0.0505</code>. This shows that the longer employees stay, the less likely they are to leave: each additional year lowers the log-odds of attrition by about 0.05.</p>
<p>All coefficients are statistically significant at the 0.01 level. Significance tells you the effects are unlikely to be zero, not that they are large, so read the coefficient sizes alongside the p-values.</p>
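<p>Log-odds are hard to read directly, so it can help to convert the coefficients into odds ratios and into a predicted probability. The following is a minimal sketch in Python, used here only to show the arithmetic. The function name and the example employee (income of 5000, 5 years at the company) are hypothetical, and the coefficients are copied from the output above.</p>
<pre><code class="lang-python">import math

# Coefficients copied from the summary(hr_glm) output above
INTERCEPT = -0.8094
B_INCOME = -9.449e-05  # change in log-odds per unit of monthly income
B_YEARS = -0.05047     # change in log-odds per year at the company


def attrition_probability(monthly_income, years_at_company):
    """Turn the model's log-odds into a probability of leaving."""
    log_odds = INTERCEPT + B_INCOME * monthly_income + B_YEARS * years_at_company
    return 1 / (1 + math.exp(-log_odds))


# Odds ratios: exponentiating a coefficient gives the multiplicative
# change in the odds of leaving for a one-step increase in that variable
odds_ratio_per_1000_income = math.exp(B_INCOME * 1000)
odds_ratio_per_year = math.exp(B_YEARS)

print(round(odds_ratio_per_1000_income, 2))  # 0.91
print(round(odds_ratio_per_year, 2))         # 0.95

# Hypothetical employee: monthly income 5000, 5 years at the company
print(round(attrition_probability(5000, 5), 2))  # 0.18
</code></pre>
<p>An odds ratio of about 0.91 means that each extra 1000 in monthly income is associated with roughly 9% lower odds of leaving, holding tenure fixed. In R, <code>exp(coef(hr_glm))</code> gives the per-unit odds ratios.</p>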
<p>The model’s residual deviance is <code>1252.5</code>, lower than the null deviance of <code>1305.4</code>. This means the model explains some of the variation in attrition.</p>
<p>The key takeaway is that if an employee earns more and stays longer at the company, they are less likely to leave. These factors matter, but other elements also influence attrition.</p>
<h2 id="heading-why-visualization-comes-before-modeling">Why Visualization Comes Before Modeling</h2>
<p>Boxplots help you to:</p>
<ul>
<li><p><strong>Detect outliers:</strong> Boxplots highlight extreme values that can distort model results.</p>
</li>
<li><p><strong>Compare groups:</strong> Boxplots allow quick comparison of distributions across different categories.</p>
</li>
<li><p><strong>Form hypotheses:</strong> Visual patterns assist in identifying relationships worth testing in a model.</p>
</li>
<li><p><strong>Validate modeling assumptions:</strong> Boxplots help check distribution shape and variance before modeling.</p>
</li>
</ul>
<p>Modeling without visualization often leads to misinterpretation or false confidence.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to load and clean data, how boxplots work, and why they matter. You also learned how to use ggplot2 to compare distributions, perform exploratory data analysis (EDA), build linear and logistic regression models, and link visualization insights to modeling results.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Neural Networks Work – Explained Using the Straight Line Equation y = ax + b ]]>
                </title>
                <description>
                    <![CDATA[ Did you know that every data scientist who builds a complex neural network starts with a fundamental question, “How does the output change when the input changes?“ A straight line equation y = ax+b answers it in the simplest way possible. y can incre... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/neural-networks-explained-using-y-ax-b/</link>
                <guid isPermaLink="false">695ef4246f1bfe13bf31abe9</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Samyukta Hegde ]]>
                </dc:creator>
                <pubDate>Thu, 08 Jan 2026 00:02:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767800625537/5bb99a58-d247-4933-b60b-fd2c14651542.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Did you know that every data scientist who builds a complex neural network starts with a fundamental question: “How does the output change when the input changes?”</p>
<p>A straight line equation <code>y = ax+b</code> answers it in the simplest way possible. <code>y</code> can increase, decrease, or stay the same when <code>x</code> changes.</p>
<p>On the other hand, a deep neural network tries to answer it in a flexible way. This is possible only because multiple layers of straight line calculations are stacked on top of one another, along with non-linear adjustments that help the network adapt and produce the desired result.</p>
<p>Since a straight line is the essence of neural networks, I think it’s time we try to understand the subtle details of <code>y = ax+b</code>, which I refer to as the <strong>magical equation</strong>. We’ll also go through the basics of linear regression and classification, which should help you understand the progression of a simple straight line to a complex deep neural network.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-yaxb">y=ax+b</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-linear-regression">Linear Regression</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-linear-classification">Linear Classification</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-comparison">Comparison</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-additions-to-help-build-deep-neural-networks">Key Additions to Help Build Deep Neural Networks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-modelling-a-deep-neural-network">Modelling a Deep Neural Network</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A basic understanding of linear algebra, particularly <code>y=ax+b</code>.</p>
</li>
<li><p>General idea about linear regression and classification.</p>
</li>
<li><p>Familiarity with the concept of deep neural networks.</p>
</li>
</ul>
<h2 id="heading-yaxb">y=ax+b</h2>
<p>A straight line simply means that the output changes steadily as the input changes. There are no surprises (that is, no non-linearity). Let’s analyze it properly.</p>
<pre><code class="lang-plaintext">y =&gt; Output variable
x =&gt; Input variable
a =&gt; Amount by which y changes when x changes (slope)
b =&gt; Value of y when x is 0 (y intercept)
</code></pre>
<p>We can take an example and model it in the same form to understand it better.</p>
<p>Ms. Poly is a math teacher who wants to formulate a study plan for her students to excel in an upcoming final exam. For simplicity, she creates a rule of thumb using only one factor: the number of hours studied per week. It has a direct impact on the marks scored by a student.</p>
<p>Before beginning, she makes certain assumptions:</p>
<ul>
<li><p>Every student is capable of scoring at least 30 without studying.</p>
</li>
<li><p>For every hour a student studies, an additional 3 marks can be scored.</p>
</li>
</ul>
<p>She then comes up with the following equation based on her ideas: <code>y = 3x+30</code></p>
<pre><code class="lang-plaintext">y =&gt; Marks scored.
x =&gt; Number of hours studied.
a=3 =&gt; Increase in marks for every hour studied
b=30 =&gt; Minimum marks
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764650083131/997f2a53-78ac-4b6f-a0c1-b995fb515075.png" alt="Plot of y=3x+30" class="image--center mx-auto" width="1920" height="1080" loading="lazy"></p>
<p>In the above graph, she plots the points based on the results of the equation. As expected, it is a straight line. If she needs the marks scored for <code>9</code> hours of study, she can get it by just substituting <code>x=9</code> in <code>y=3x+30</code>. Note that the data (<code>x</code> and <code>y</code>) are just based on her hunch and aren’t real.</p>
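<p>Ms. Poly's rule of thumb is small enough to write as a one-line function. This is a minimal sketch in Python. The function and parameter names are illustrative, and the default slope and intercept are her assumed values of 3 and 30.</p>
<pre><code class="lang-python">def predicted_marks(hours_studied, slope=3, intercept=30):
    """Apply y = ax + b with a = slope and b = intercept."""
    return slope * hours_studied + intercept


print(predicted_marks(0))  # 30, the marks expected without studying
print(predicted_marks(9))  # 57, the result of substituting x = 9
</code></pre>
<p>Keeping the slope and intercept as parameters makes it easy to try other lines later, once real data suggests that 3 and 30 are not the right values.</p>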
<p>But Ms. Poly wants to guide her students on how to prepare for the final exam based on actual data. So she conducts a pop quiz and grades it. In order to formulate a study plan, she interviews her students and collects information on how many hours they study math per week. She creates a table with two columns: number of hours studied (<code>x</code>) per week and marks scored (<code>y</code>). She tries her old formula <code>y=3x+30</code>, but it doesn’t seem to work. Thus, she doesn’t have any sensible equation describing the relation between <code>x</code> and <code>y</code>.</p>
<p>Let’s assume that a new student who hasn’t attended any exam (no <code>y</code> available) joins the class the next day, and Ms. Poly only knows the number of hours dedicated per week (<code>x</code>). How can she answer the question below?</p>
<p><em>If the new student studies for a certain number of hours (</em><code>x</code><em>), what can be the marks scored (</em><code>y</code><em>) in the exam?</em></p>
<p>It’s impossible unless there’s an equation defining the sample data. So, her task is to find one that fits the given points. This process is called curve fitting or regression.</p>
<h2 id="heading-linear-regression">Linear Regression</h2>
<p>The core idea of linear regression is to find a straight line that captures the trend of the existing data so that predictions can be made for new input data. Now, let’s dive straight into the example to understand the concept better.</p>
<p>Ms. Poly is determined to arrive at a solution. She plots the collected data on a graph to get a better picture.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651274954/0aa2dfc2-d846-40e6-872d-e7d5abe598a8.png" alt="Input Data" class="image--center mx-auto" width="1920" height="1080" loading="lazy"></p>
<p>She has absolutely no idea how <code>x</code> and <code>y</code> are related. So, she must figure out a formula, by trial and error, that roughly fits the points. She has to start with an intuitive guess, try to improve it in the subsequent steps and then arrive at the best possible solution.</p>
<p><strong>Trial 1</strong>: Ms. Poly begins with her previous straight line equation.</p>
<p><code>y = 3x+30</code></p>
<p>She substitutes different values of <code>x</code> and plots the resulting line alongside the collected input data. This way she can get a clear picture of the differences between her assumption and reality.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651323645/a3e79765-99bc-42be-8836-82119d7fbf66.png" alt="Linear Regression-Trial 1" class="image--center mx-auto" width="1920" height="1080" loading="lazy"></p>
<p><strong>Trial 2</strong>: She observes that the line needs a little more slope. This simply means that, in reality, more marks are being scored for every additional hour of study. By changing it from <code>3</code> to <code>4</code>, the equation becomes:</p>
<p><code>y = 4x+30</code></p>
<p>The following graph depicts the new line alongside the sample data:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651379913/42a8fc61-7927-46de-aadf-b691544b9a1b.png" alt="Linear Regression-Trial 2" class="image--center mx-auto" width="1920" height="1080" loading="lazy"></p>
<p><strong>Trial 3:</strong> It looks better but she feels there is a need to shift the whole line upwards. This means that higher marks are being scored even if a student doesn’t dedicate any time for math in a week. She decides to retain the previous slope but changes the starting marks by <code>10</code>, thus arriving at:</p>
<p><code>y = 4x+40</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651454435/5fea2d39-8254-48e6-be14-69c803982ec7.png" alt="Linear Regression-Trial 3" class="image--center mx-auto" width="1920" height="1080" loading="lazy"></p>
<p>This particular line covers most of the points and can be considered the best possible solution.</p>
<p>Now, if she wishes to estimate the marks scored by the new student who studied for <code>3.5</code> hours, she plugs the value into the formula and calculates the answer: <code>y = 4*(3.5)+40 = 54</code></p>
<p>We saw how Ms. Poly arrived at a straight line equation to predict the output for an unknown input. Now she can chalk out a study plan for her class based on the equation.</p>
<p>Here, an expression is formulated to ascertain the change in output when the input changes. It looks like Ms. Poly is thinking like a data scientist. She has in fact modelled a very simple neural network for regression. The equation <code>y=4x+40</code> can be considered as the only neuron (processing unit) within it. She’s adjusted the parameters <code>a</code> (weight) and <code>b</code> (bias) to arrive at the final formula which covers most of the points (thus minimizing the loss).</p>
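<p>The idea of a line that “covers most of the points” can be made measurable. The sketch below scores each of the three trial lines with the mean squared error, one common choice of loss that the story above doesn’t name explicitly. The (hours, marks) pairs are made up for illustration, since the real table is only shown in the plots. Only the pair (3, 52) is mentioned in this article.</p>
<pre><code class="lang-python"># Hypothetical (hours studied, marks scored) pairs, made up for illustration
hours = [1, 2, 3, 4, 5, 6]
marks = [45, 47, 52, 57, 59, 65]


def mean_squared_error(a, b):
    """Average squared gap between the line y = a*x + b and the marks."""
    gaps = [(a * x + b) - y for x, y in zip(hours, marks)]
    return sum(gap ** 2 for gap in gaps) / len(gaps)


# The three trial lines as (slope, intercept) pairs
trials = {"Trial 1": (3, 30), "Trial 2": (4, 30), "Trial 3": (4, 40)}
losses = {name: mean_squared_error(a, b) for name, (a, b) in trials.items()}

for name, loss in losses.items():
    print(name, round(loss, 2))
</code></pre>
<p>On this made-up data the loss shrinks with every trial, which matches Ms. Poly’s visual impression that each line fits better than the previous one.</p>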
<p>Here’s a breakdown of the <code>y = 4x+40</code> equation:</p>
<pre><code class="lang-plaintext">y =&gt; Marks scored.
x =&gt; Number of hours studied.
a=4 =&gt; Increase in marks for every hour studied
b=40 =&gt; Minimum marks
</code></pre>
<p>At present, it is a rudimentary neural network which has no layering or non-linearity.</p>
<p>Now let’s shift our attention to a completely different scenario. Ms. Poly, being a teacher, wants to ensure that all her students pass the exam. Assume that, as an end result, she’s not interested in predicting the marks scored. She just wants to know:</p>
<p><em>If a student studies for a certain number of hours (</em><code>x</code><em>), will the student pass or fail (</em><code>y</code><em>) the exam?</em></p>
<p>This leads her to the process of classification.</p>
<h2 id="heading-linear-classification">Linear Classification</h2>
<p>The linear classification process uses a simple straight line to divide the data into categories or classes. The line acts as a boundary so that the classes fall on either side of it. First, Ms. Poly defines the boundary condition for pass and fail.</p>
<p><em>If marks scored&gt;=50, pass</em></p>
<p><em>If marks scored&lt;50, fail</em></p>
<p>According to the data table, <code>x=3</code> corresponds to <code>y=52</code> (boundary condition). Therefore she considers <code>x=3</code> as the classification line.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651531018/e669ed7b-1c86-4093-b7e5-feb06464ebfe.png" alt="Linear Classification" class="image--center mx-auto" width="1920" height="1080" loading="lazy"></p>
<p><code>x=3</code> seems to segregate the points into the categories properly. She tries to confirm it by substituting another value. Thus, if a student studied for <code>9</code> hours, the score would lie towards the right side of <code>x=3</code>. So, they’d pass as per the classification equation.</p>
<p>Again, she’s arrived at an expression to ascertain the change in output when the input changes. But here, she has modelled a basic neural network for classification. The equation <code>x=3</code> is the only neuron within it. It can be thought of as having two parts, as explained below.</p>
<ol>
<li><p><strong>Pre-Activation Part:</strong> This portion of the neuron computes an intermediate value which is helpful in further processing. She’s figured out the parameters <code>a</code> (weight) and <code>b</code> (bias) to arrive at the following formula: <code>z = x-3</code></p>
<pre><code class="lang-plaintext"> z =&gt; Intermediate Value.
 x =&gt; Number of hours studied.
 a=1 =&gt; Influence of the number of hours studied on the marks scored
 b=-3 =&gt; Minimum number of hours to study to pass the exam = 3
</code></pre>
</li>
<li><p><strong>Activation Part:</strong> This portion triggers the neuron to make decisions based on a threshold value. The following equation segregates the points into two classes.</p>
<pre><code class="lang-plaintext"> y = 1 (Pass) if z&gt;=0
 y = 0 (Fail) if z&lt;0
</code></pre>
</li>
</ol>
<p>This is a very plain neural network which has no layering or non-linearity but has pre-activation and activation parts inside a neuron.</p>
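<p>A minimal sketch of this single neuron in Python could look like the following. The function names are illustrative only.</p>
<pre><code class="lang-python">def pre_activation(hours, a=1, b=-3):
    """Intermediate value z = a*x + b, which is z = x - 3 here."""
    return a * hours + b


def activation(z):
    """Step rule: 1 (pass) when z is zero or positive, otherwise 0 (fail)."""
    return 1 if z &gt;= 0 else 0


def will_pass(hours):
    """The whole neuron: pre-activation followed by activation."""
    return activation(pre_activation(hours))


for hours in (1, 3, 9):
    print(hours, "pass" if will_pass(hours) == 1 else "fail")
</code></pre>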
<h2 id="heading-comparison">Comparison</h2>
<p>We looked at the examples of both linear regression and classification used by Ms. Poly. <strong>Regression</strong> helps in predicting a value while <strong>Classification</strong> helps in decision making. Let’s draw a small table to summarize the differences.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764652317811/f4411011-fcd3-4a53-b116-a3c8a27c81d8.png" alt="Comparison between Linear Regression and Classification" class="image--center mx-auto" width="1565" height="756" loading="lazy"></p>
<p>Upon careful observation, we notice that both answer the question of how a change in input affects the output, but at a slightly higher level of complexity than a plain straight line. That’s because, in both regression and classification, we have to figure out the equation parameters by trial and error.</p>
<p>Here, since the requirements are simple, Ms. Poly just uses a straight line to solve both. A simple linear equation can handle only one steady trend. But in real life, problems that need solving are far more challenging and unpredictable. Some examples are:</p>
<p><strong>Image Classification</strong>: An output label is produced based on the input images.</p>
<p><strong>Text Translation</strong>: An English sentence can be given as an input to be translated into, say, Spanish.</p>
<p><strong>Chatbots</strong>: A text prompt is typed in by a user and a meaningful and relevant output is generated.</p>
<p>She would probably need a deep neural network if both the data and the task were complex. That presents another question: <strong>How does one build a deep neural network?</strong></p>
<p>We will explore it further by extending the same example to a more realistic version.</p>
<h2 id="heading-key-additions-to-help-build-deep-neural-networks">Key Additions to Help Build Deep Neural Networks</h2>
<p>In the above sections, we noted that Ms. Poly was interested in predicting the exam results of a student using just one factor - number of hours studied. However, in practice, is that one factor sufficient in determining the marks scored or whether the student passes the exam?</p>
<p>No. It’s not enough. She needs to take into account a lot of aspects like:</p>
<ul>
<li><p>Number of hours studied</p>
</li>
<li><p>Number of hours of sleep/rest</p>
</li>
<li><p>Burnout due to over-studying</p>
</li>
<li><p>Difficulty level of topics in math</p>
</li>
<li><p>Pattern of the exam, and so on.</p>
</li>
</ul>
<p>None of the above factors acts independently, and none has a simple linear relation with the marks scored. So, she has to solve this problem by stacking the contributing factors one above the other in layers and also adding the element of non-linearity. Let’s take a look at each in detail.</p>
<h3 id="heading-layering">Layering</h3>
<p>Burnout leads to a lower score whereas good sleep increases the score. But burnout can be reduced if the student is well rested. So, the impact on the final score when these two factors interact should be taken into account. This is possible only when the system solves it in layers. The first layer can deal with how they independently influence the score, and the next layer can explore the interaction between them.</p>
<h3 id="heading-non-linearity">Non-Linearity</h3>
<p>If the number of hours studied increases, the score might increase but when burnout overpowers the effect of study hours, the score reduces. The combined effect results in a non-linear graph. There is a rise and then dip in the score based on number of hours studied. It’s evident that the relationship is not straightforward as in a straight line. That’s where it becomes necessary to add non-linearity in the calculations. It helps the system to respond differently according to the conditions, allowing for flexibility in dealing with real world data and conditions.</p>
<p>Thus, Ms. Poly would have to extend the idea of linear regression/classification by including layering and non-linearity to build a fully functional neural network to help build a practical study plan.</p>
<h2 id="heading-modelling-a-deep-neural-network">Modelling a Deep Neural Network</h2>
<p>Ms. Poly should start the work on modelling a deep neural network by following the steps mentioned below:</p>
<h3 id="heading-step-1-define-the-problem-clearly"><strong>Step #1 - Define the Problem Clearly</strong></h3>
<p>The following factors should be considered before she begins the process of modelling:</p>
<ul>
<li><p>What are the input features?</p>
</li>
<li><p>What are the output features?</p>
</li>
<li><p>What type of problem is it (regression/classification)?</p>
</li>
</ul>
<h3 id="heading-step-2-define-the-input-layer"><strong>Step #2 - Define the Input Layer</strong></h3>
<p>The input features form the first layer. There is no computation in this stage. They are represented as:</p>
<pre><code class="lang-plaintext">x1: Number of hours studied
x2: Number of hours of sleep/rest
x3: Burnout due to over-studying
x4: Difficulty level of topics in Maths
x5: Pattern of the exam
</code></pre>
<h3 id="heading-step-3-define-the-first-hidden-layer"><strong>Step #3 - Define the First Hidden Layer</strong></h3>
<p>This step consists of two parts:</p>
<p><strong>Apply Linear Transformation</strong>: The actual learning begins here. A straight line equation is used to understand the combined effect of the inputs. The general formula is <code>z=Wx+b</code>.</p>
<pre><code class="lang-plaintext">z: Intermediate value or Pre-activation
W: Weight matrix which consists of values corresponding to the impact of
each input feature
x: Vector consisting of input features, [x1, x2, x3, x4, x5]
b: Bias which represents the initial assumptions of the teacher (when x=0)
</code></pre>
<p>It looks similar to a linear regression/classification equation. At first, <code>W</code> and <code>b</code> are initialized to random values. Then, in the subsequent steps, they are adjusted as was done in the earlier examples. We can consider the following combinations, assuming we have two neurons in this layer:</p>
<p><strong>Neuron 1:</strong> It can focus on study hours, burnout, and rest, with other features contributing less significantly.</p>
<p><strong>Neuron 2</strong>: It can emphasize more on the difficulty level of the topic and the exam type compared to other inputs.</p>
<p>It’s important to note that this linear step doesn’t capture interactions between the features. Each neuron only adds up the independent, weighted contributions of its inputs. We don’t yet know how one input feature influences another. For example, we know sleep increases the score and burnout reduces it, but what we don’t know at this stage is whether sleep reduces burnout, which in turn can influence the final score.</p>
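<p>To make <code>z=Wx+b</code> concrete, here is a small sketch for a layer with two neurons and the five inputs listed earlier. Every number in it (the student’s feature values, the weights and the biases) is hypothetical and chosen only so that Neuron 1 leans on study hours, rest and burnout, while Neuron 2 leans on topic difficulty and exam pattern.</p>
<pre><code class="lang-python"># Hypothetical feature values for one student: [x1, x2, x3, x4, x5]
x = [6.0, 7.0, 2.0, 3.0, 1.0]

# One row of weights per neuron (made-up values, normally learned in training)
W = [
    [0.8, 0.5, -0.6, 0.1, 0.1],  # Neuron 1: study hours, rest, burnout
    [0.1, 0.1, 0.1, -0.7, 0.4],  # Neuron 2: topic difficulty, exam pattern
]
b = [0.5, 1.0]  # one bias per neuron (also made up)


def linear_transform(W, x, b):
    """Compute z = Wx + b, one weighted sum plus bias per neuron."""
    return [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + bias
        for row, bias in zip(W, b)
    ]


z = linear_transform(W, x, b)
print(z)  # two pre-activation values, one per neuron
</code></pre>
<p>Each value in <code>z</code> is just a sum of separate contributions, which is why this step alone can’t express an effect such as sleep reducing burnout.</p>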
<p><strong>Add Non-Linearity</strong>: This step, also called activation, helps in capturing the complexities in different combinations of the features. Less study results in low marks, and too much burnout also results in low marks. It means there is a curve in the score graph which can’t be represented by a linear equation. The activation function is applied to the intermediate value and can be expressed as:</p>
<p><strong>a = g(z)</strong></p>
<pre><code class="lang-plaintext">a: Activation output
g: Activation function
z: Intermediate value or Pre-activation
</code></pre>
<p>For example, <code>ReLU</code> is an activation function which outputs <code>z</code> if <code>z</code> is positive, and <code>0</code> otherwise.</p>
<p><strong>a = ReLU(z) = max(0, z)</strong></p>
<p>We can see that it has no steady slope and is a non-linear activation function. It can suit this scenario as it lets the value pass through to the next layer only if the combined effect of features is greater than 0. Neuron 1 will let its output go to the next layer only if the intermediate value (<code>z</code>) that results from study hours, burnout and rest is large enough to influence the final decision. Otherwise, it’s ignored. There are multiple options for non-linear activation functions that one can choose from.</p>
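<p>In Python, ReLU takes a single line. The sketch below applies it to two hypothetical pre-activation values to show that positive values pass through unchanged while negative ones are replaced by 0.</p>
<pre><code class="lang-python">def relu(z):
    """ReLU activation: pass z through if it is positive, otherwise 0."""
    return max(0.0, z)


# Applied element by element to a layer's pre-activation values
pre_activations = [8.0, -0.2]  # hypothetical values, not from real data
activations = [relu(z) for z in pre_activations]
print(activations)  # [8.0, 0.0]
</code></pre>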
<h3 id="heading-step-4-stack-layers-one-above-the-other"><strong>Step #4 - Stack Layers One Above the Other</strong></h3>
<p>This step helps in learning the mutual interactions between the inferences learned from the first hidden layer. The network attempts to understand the intricate details of the influencing factors and build a stable system. It is here that details such as whether sleep reduces burnout are figured out. Every layer consists of linear and non-linear transformations applied to its input, which is the set of values obtained from the previous layer. In the same way, multiple layers can be stacked one over the other. In this example, for representation, we have taken two hidden layers with two neurons each. The number of layers and neurons can vary based on requirements.</p>
<h3 id="heading-step-5-define-the-output-features"><strong>Step #5 - Define the Output Feature(s)</strong></h3>
<p>This is the final stage in a deep neural network. Ms. Poly can decide what she wants for output: predict the marks scored by a student or predict if the student passes/fails the exam. If she wants the final marks scored, she just has to apply a linear transformation in the neuron in the final layer to produce the output. If she wants pass/fail status, she has to apply both linear and non-linear transformations to achieve the desired results.</p>
<p>The diagram below shows an abstract representation of the deep neural network.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766153114888/1e513840-483a-43cf-b062-ce5af886a04e.png" alt="Abstract Representation of a Deep Neural Network" class="image--center mx-auto" width="1024" height="768" loading="lazy"></p>
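<p>The forward pass through such a network can be sketched in a few lines of Python. The sketch assumes the layout described above (five inputs, two hidden layers with two neurons each, and one output neuron) with ReLU in the hidden layers. All weights, biases and input values are hypothetical placeholders. In a real network they would be learned during training.</p>
<pre><code class="lang-python">def relu(z):
    """ReLU activation: z if positive, otherwise 0."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation=None):
    """One layer: z = Wx + b for every neuron, then an optional activation."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


# Hypothetical parameters (placeholders, not trained values)
W1 = [[0.8, 0.5, -0.6, 0.1, 0.1], [0.1, 0.1, 0.1, -0.7, 0.4]]
b1 = [0.5, 1.0]
W2 = [[0.9, -0.3], [0.2, 0.6]]
b2 = [0.0, 0.1]
W_out = [[4.0, 2.0]]
b_out = [20.0]


def predict_marks(features):
    """Forward pass for regression: the output neuron stays linear."""
    h1 = dense(features, W1, b1, relu)  # first hidden layer
    h2 = dense(h1, W2, b2, relu)        # second hidden layer
    return dense(h2, W_out, b_out)[0]   # output layer


def predict_pass(features, pass_mark=50):
    """Classification variant: a step is applied to the linear output."""
    return 1 if predict_marks(features) &gt;= pass_mark else 0


student = [6.0, 7.0, 2.0, 3.0, 1.0]  # hypothetical [x1, x2, x3, x4, x5]
print(predict_marks(student), predict_pass(student))
</code></pre>
<p>The only difference between the two outputs is the last step: the regression version returns the linear result as it is, while the classification version passes it through a threshold.</p>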
<p>The next steps are:</p>
<p><strong>Training the model</strong>: The network is trained in the following way:</p>
<ul>
<li><p>Random weights and biases are assigned to the linear transformation portions of the network.</p>
</li>
<li><p>Then the network makes a prediction which is compared with the expected result.</p>
</li>
<li><p>If there are gaps between the actual result and the predicted result, corrections are made in weights and biases (this step is similar to what was done in linear regression and classification).</p>
</li>
<li><p>The steps above are repeated until the results improve.</p>
</li>
</ul>
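<p>The training steps above can be sketched for the simplest possible network, the single neuron <code>y = ax+b</code>. The article doesn’t say how the corrections are computed. This sketch assumes gradient descent on the mean squared error, which is one common choice. The data points, learning rate and step count are illustrative, and the parameters start at 0 rather than at random values so that the run is repeatable.</p>
<pre><code class="lang-python"># Hypothetical (hours studied, marks scored) pairs, made up for illustration
hours = [1, 2, 3, 4, 5, 6]
marks = [45, 47, 52, 57, 59, 65]


def train(xs, ys, learning_rate=0.02, steps=5000):
    """Fit y = a*x + b by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0  # starting guess (a real network would start randomly)
    n = len(xs)
    for _ in range(steps):
        # Gap between prediction and expected result for every point
        gaps = [(a * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to a and b
        grad_a = 2 * sum(gap * x for gap, x in zip(gaps, xs)) / n
        grad_b = 2 * sum(gaps) / n
        # Correct the weight and bias a little in the downhill direction
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


a, b = train(hours, marks)
print(round(a, 2), round(b, 2))
</code></pre>
<p>On this made-up data the loop settles near a slope of 4 and an intercept of 40, close to the line Ms. Poly found by hand.</p>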
<p><strong>Using the model</strong>: After the model has been trained, it is capable of yielding results for new input values.</p>
<h2 id="heading-final-thoughts"><strong>Final Thoughts</strong></h2>
<p>In this article, we began with the basics of a straight line equation. Then we gradually navigated through slightly more elaborate concepts like linear regression and classification. They laid the groundwork for delving into the seemingly mysterious deep neural networks, which are in fact built by stacking layers of linear transformations and non-linear activations that help capture sophisticated real-world patterns.</p>
<p>Despite all the complexities and layers, we can see that the straight line remains the foundation upon which neural networks are built. As we saw earlier, the equation that a deep neural network begins with is our <em>magical equation:</em> <code>y = ax+b</code>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Common Pitfalls to Avoid When Analyzing and Modeling Data ]]>
                </title>
                <description>
                    <![CDATA[ Working with data at any level, whether as an analyst, engineer, scientist, or decision-maker, involves going through a range of challenges. Even experienced teams can run into issues that quietly affect the quality of their work. A mislabeled column... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/common-pitfalls-to-avoid-when-analyzing-and-modeling-data/</link>
                <guid isPermaLink="false">68ee54b2edcf5de25dd4bb13</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oyedele Tioluwani ]]>
                </dc:creator>
                <pubDate>Tue, 14 Oct 2025 13:48:34 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760449475934/80950373-2a61-4b75-bd8f-b0dfd08f6e21.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Working with data at any level, whether as an analyst, engineer, scientist, or decision-maker, involves going through a range of challenges. Even experienced teams can run into issues that quietly affect the quality of their work. A mislabeled column, an unclear definition, or a data leak that slips by unnoticed can all lead to results that do not hold up when it matters most.</p>
<p>Reliable analysis depends on how data is handled throughout the process. From collection and preparation to modeling and interpretation, each step carries its own risks. Many of the most persistent problems come not from technical gaps, but from missing checks or assumptions that go unspoken.</p>
<p>This guide highlights some of the most common pitfalls in data analysis and shows where they tend to appear. Along the way, it covers:</p>
<ul>
<li><p>Biased or unclear inputs that cause trouble early on</p>
</li>
<li><p>Validation mistakes that distort model performance</p>
</li>
<li><p>Misinterpretation of results that leads to the wrong conclusions</p>
</li>
<li><p>Workflow gaps that slow teams down or create confusion</p>
</li>
<li><p>Practical steps you can take to catch and correct these issues</p>
</li>
</ul>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-data-collection-pitfalls">Data Collection Pitfalls</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-data-preparation-pitfalls">Data Preparation Pitfalls</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-modeling-and-validation-pitfalls">Modeling and Validation Pitfalls</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-interpretation-and-communication-pitfalls">Interpretation and Communication Pitfalls</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-organizational-and-workflow-pitfalls">Organizational and Workflow Pitfalls</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-data-collection-pitfalls"><strong>Data Collection Pitfalls</strong></h2>
<p>A lot of data issues begin before any modeling takes place. The way data is collected helps shape what your analysis can reveal. Once the inputs are biased or inconsistent, even solid techniques may lead to unreliable results.</p>
<p>One common issue is the bias in data sources. When a large portion of the data comes from digital channels like websites or apps, it creates an imbalance. For instance, if a model is trained only on web traffic, it could miss users who engage through offline means, like in-person visits or phone support. This then results in blind spots that limit how well the model performs once deployed.</p>
<p>Inconsistent definitions across systems also pose a major challenge. A simple label like “customer” could represent various things - it could refer to an active user in one database, a prospect in another, or even a past buyer elsewhere. Without shared definitions, one can end up using the same terms to mean very different things, and this leads to confusion and misaligned metrics.</p>
<p>A third issue is the lack of metadata or data provenance. Without clear records of where the data came from or how it has changed over time, it becomes harder to trace issues, explain outputs, or reproduce results.</p>
<p><strong>The way out:</strong></p>
<ul>
<li><p>Combine data from multiple sources to build a more complete and representative picture</p>
</li>
<li><p>Use stratified sampling to reduce bias where possible</p>
</li>
<li><p>Set up regular audits to catch data drift or gaps early</p>
</li>
<li><p>Maintain a shared data dictionary and align terms across teams</p>
</li>
<li><p>Track data lineage with tools like dbt, Apache Atlas, or OpenMetadata</p>
</li>
</ul>
<p>Getting data collection right sets a strong foundation for analysis and helps prevent issues down the line.</p>
<h2 id="heading-data-preparation-pitfalls"><strong>Data Preparation Pitfalls</strong></h2>
<p>Once the data has been collected, the next step involves cleaning and shaping it for use. This is another delicate stage where data analysts often encounter issues. Some choices that seem helpful at first can create problems later, especially when they aren’t documented or tested properly.</p>
<p><strong>Silent Data Leakage</strong></p>
<p>Data leakage occurs when a model learns from information that it would not have access to at prediction time. Let’s say, for example, you’re building a model in January to predict whether a customer will make a purchase in February. If your dataset includes transactions from February, and you use them to calculate a feature like “days since last purchase”, then your model is learning from data it wouldn’t realistically have at prediction time.</p>
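<p>As a small sketch of the January/February example, the function below computes “days since last purchase” from only those transactions dated before the prediction date. The customer IDs, dates and the helper name <code>days_since_last_purchase</code> are hypothetical.</p>
<pre><code class="lang-python">from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date)
transactions = [
    ("c1", date(2025, 1, 5)),
    ("c1", date(2025, 2, 10)),  # happens after the prediction date
    ("c2", date(2025, 1, 20)),
]

prediction_date = date(2025, 2, 1)  # the model predicts February purchases


def days_since_last_purchase(customer_id, as_of):
    """Use only purchases made strictly before the as_of date."""
    past = [d for cid, d in transactions if cid == customer_id and d &lt; as_of]
    if not past:
        return None  # no purchase history available at prediction time
    return (as_of - max(past)).days


print(days_since_last_purchase("c1", prediction_date))  # 27
</code></pre>
<p>The February purchase of customer <code>c1</code> is ignored, so the feature reflects only what would have been known when the prediction was made.</p>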
<p><strong>Improper Handling of Missing Values</strong></p>
<p>Many people who work with data treat missing values as gaps that simply need to be filled. But in certain cases, the fact that data is missing can be just as meaningful as the value itself. In a customer churn dataset, some users might have blank entries for recent activities because they have already stopped engaging with the product. Filling those gaps with averages or zeros without context could make the model treat them the same as users who simply haven’t generated enough data yet, which can be misleading.</p>
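<p>One way to keep that signal is to record an indicator before filling anything in. The sketch below uses made-up values and mean imputation purely for illustration. Whether the mean is a sensible fill value depends on the dataset.</p>
<pre><code class="lang-python"># Hypothetical churn records; None marks a missing "recent activity" value
recent_logins = [12, None, 7, None, 3]

# Keep the fact that a value was missing as its own feature
was_missing = [1 if value is None else 0 for value in recent_logins]

# Fill the gaps only after the indicator has been recorded
observed = [value for value in recent_logins if value is not None]
fill_value = sum(observed) / len(observed)  # mean of the observed values
filled = [fill_value if value is None else value for value in recent_logins]

print(was_missing)  # [0, 1, 0, 1, 0]
print(filled)
</code></pre>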
<p><strong>Over-aggressive Outlier Removal</strong></p>
<p>It’s tempting to remove extreme values to simplify modeling, but outliers often represent rare yet important events. In fraud detection, for instance, the anomalies are the very signals the models need to learn from. Discarding them automatically based on z-scores or quantiles may improve short-term accuracy while weakening long-term reliability.</p>
<p><strong>The way out</strong></p>
<ul>
<li><p>To avoid data leakage, create training and test splits before engineering features. Make use of chronological splits when modeling time-based behavior, and regularly audit feature logic.</p>
</li>
<li><p>For missing values, go through the missingness patterns first. Use indicator variables where necessary, and treat the missingness as a signal, rather than just a defect.</p>
</li>
<li><p>With outliers, analyze their sources before removing them. If they turn out to be genuine observations, try using robust models that can handle skewed data, or flag them for downstream use instead of deleting them.</p>
</li>
</ul>
<p>Getting this stage right protects your models from brittle and unstable behavior.</p>
<h2 id="heading-modeling-and-validation-pitfalls"><strong>Modeling and Validation Pitfalls</strong></h2>
<p>A common thought in this field is that models are only as reliable as the assumptions built into them. Mistakes at this phase often surface late, sometimes after the models have been deployed, making them harder to catch and more expensive to fix.</p>
<p><strong>Overfitting Through Hyperparameter Tuning</strong></p>
<p>Trying to make a model perfect with the training data can lead to patterns that don’t hold up in practice. When one tests hundreds of hyperparameter combinations without proper checks, the model often ends up learning noise rather than signals in the data, thereby resulting in excellent scores during cross-validation but weak performance in production. For instance, a churn model might show an excellent performance during development, but once it is deployed to a new region with a slight difference in customer behavior, it then starts to miss the mark.</p>
<p><strong>Validation Leakage</strong></p>
<p>Leakage can occur when the validation process accidentally gives the model access to target-related information. One common case is target encoding, where features like average purchase per customer group are calculated on the full dataset rather than only on the training set. This can lead to inflated validation scores and a false sense of confidence.</p>
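<p>The sketch below illustrates the safer variant for a single train/validation split: the group averages are computed from the training rows only and then looked up for the validation rows. The rows, group names and the fallback to the overall training mean for unseen groups are assumptions made for this example.</p>
<pre><code class="lang-python">from collections import defaultdict

# Hypothetical rows: (customer_group, purchase_amount)
train_rows = [("a", 10.0), ("a", 30.0), ("b", 50.0)]
valid_rows = [("a", 500.0), ("c", 70.0)]


def fit_group_means(rows):
    """Learn the average target per group from the given rows only."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, amount in rows:
        totals[group] += amount
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


group_means = fit_group_means(train_rows)  # validation rows are never used
overall_mean = sum(amount for _, amount in train_rows) / len(train_rows)

# Encode validation rows with training statistics; unseen groups fall back
valid_encoded = [
    group_means.get(group, overall_mean) for group, _ in valid_rows
]
print(valid_encoded)  # [20.0, 30.0]
</code></pre>
<p>The large validation purchase of 500.0 never influences the encoding, which is exactly what keeps the validation score honest.</p>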
<p><strong>Ignoring Data Drift and Concept Drift</strong></p>
<p>Data changes over time, and so do the basic relationships that models rely on. A model trained on behavior from eight months ago may not reflect current realities. Imagine a fraud detection model built before a major policy shift or product change: it is very likely to miss the new fraud patterns that arise afterwards.</p>
<p><strong>The Way Out</strong></p>
<ul>
<li><p>Use nested cross-validation (a technique that separates hyperparameter tuning from final evaluation by using two loops of cross-validation) to avoid overfitting during the model selection. After this, you can then compare results against simple baselines to keep complexity in check.</p>
</li>
<li><p>Treat feature engineering as part of the pipeline and apply it within each training fold to avoid leakage. For time-sensitive data, validate progressively to reflect real-world use.</p>
</li>
<li><p>Check for drift using techniques like the Kolmogorov-Smirnov test or the Population Stability Index, and link alerts to retraining processes so models can evolve with data.</p>
</li>
</ul>
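<p>As one illustration of a drift check, the Population Stability Index can be computed with the standard library alone. The sketch below assumes the usual definition, the sum over bins of (actual share - expected share) × ln(actual share / expected share). The sample values, the bin boundaries and the small floor used for empty bins are hypothetical choices.</p>
<pre><code class="lang-python">import math


def proportions(values, edges):
    """Share of values in each bin; edges are the inner bin boundaries."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        bin_index = sum(1 for edge in edges if value &gt;= edge)
        counts[bin_index] += 1
    # A small floor avoids division by zero and log(0) for empty bins
    return [max(count / len(values), 1e-6) for count in counts]


def population_stability_index(expected, actual, edges):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected_share = proportions(expected, edges)
    actual_share = proportions(actual, edges)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_share, actual_share)
    )


# Hypothetical feature values at training time and in production
training = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
production = [2, 3, 4, 4, 5, 5, 5, 6, 6, 7]
edges = [2.5, 4.5]  # illustrative bin boundaries

print(population_stability_index(training, training, edges))    # 0.0
print(population_stability_index(training, production, edges))  # above 0
</code></pre>
<p>A value of 0 means the two distributions fall into the bins in identical proportions, and the value grows as they diverge. The threshold that should trigger an alert or retraining is a judgment call for each team.</p>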
<p>These steps go a long way in keeping your models solid in production and ready for whatever the data throws at them.</p>
<h2 id="heading-interpretation-and-communication-pitfalls"><strong>Interpretation and Communication Pitfalls</strong></h2>
<p>Clear, responsible communication is just as important as accurate modeling. But it is very easy to slip into habits that make results look more certain, more compelling, or more reliable than they really are. These missteps can lead teams to act on insights that don’t hold up.</p>
<p><strong>Overconfidence in Statistical Significance</strong></p>
<p>Testing lots of variables without making adjustments can make weak signals look important. Imagine you run a dozen A/B tests and pick the one with a p-value below 0.05. Without correcting for multiple comparisons, there’s a good chance that result is just noise.</p>
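<p>The arithmetic behind that “good chance” is short. The sketch below assumes the twelve tests are independent and that none of the variants has a real effect. Under those assumptions it computes the probability that at least one test still comes out below 0.05.</p>
<pre><code class="lang-python">alpha = 0.05    # significance threshold used for each individual test
num_tests = 12  # "a dozen" A/B tests

# Chance of at least one false positive if no variant has any real effect
# (assumes the tests are independent)
family_wise_error = 1 - (1 - alpha) ** num_tests
print(round(family_wise_error, 2))  # 0.46

# Bonferroni correction: a stricter per-test threshold
bonferroni_alpha = alpha / num_tests
print(round(bonferroni_alpha, 4))  # 0.0042
</code></pre>
<p>In other words, with a dozen uncorrected tests there is close to a coin-flip chance of finding a “significant” result purely by accident.</p>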
<p><strong>Ignoring Practical Significance</strong></p>
<p>A result can be statistically significant but still meaningless when viewed in context. For example, a 0.1% lift in clickthrough rate may be technically real but not worth the cost of rolling out a change across the product.</p>
<p><strong>Model Explainability Missteps</strong></p>
<p>When explanation tools are used without context, they can confuse rather than clarify. Showing a ranked list of SHAP values might look impressive, but if the stakeholders don’t understand what the features mean or how they interact, the takeaway is lost.</p>
<p><strong>The Way Out</strong></p>
<ul>
<li><p>Be cautious with statistical significance. If you’re running several tests, apply corrections for multiple comparisons (Bonferroni or Benjamini-Hochberg methods, for instance) and avoid selectively reporting only the findings that look significant and ignoring those that don’t. </p>
</li>
<li><p>Look beyond what is statistically true and ask whether it is practically useful. A small, significant change might not be worth acting on at the end of the day.</p>
</li>
<li><p>When using explainability tools like SHAP or LIME, don’t assume the outputs speak for themselves. Add plain-language summaries, relevant examples, and business contexts to make them actionable. It is better to explain less with clarity than more with confusion.</p>
</li>
</ul>
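<p>As a rough illustration of the two corrections mentioned above, the sketch below applies Bonferroni and Benjamini-Hochberg to a handful of made-up p-values. The function names and the data are assumptions for the example, not part of any particular library.</p>

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    whose p-value satisfies p <= (rank / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.008, 0.02, 0.03, 0.27, 0.6]  # made-up values
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

<p>On this toy input, Bonferroni keeps only the two smallest p-values, while Benjamini-Hochberg, which controls the false discovery rate instead of the chance of any false positive, keeps four.</p>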
<p>These habits make your results easier to trust, interpret, and apply, which is ultimately the point of the work.</p>
<h2 id="heading-organizational-and-workflow-pitfalls"><strong>Organizational and Workflow Pitfalls</strong></h2>
<p>Analytics is most effective when it is collaborative and responsive. Gaps in team structure or feedback processes can slow progress and limit the value of your work.</p>
<p>Teams working in isolation are a frequent issue. When analysts, engineers, and business stakeholders do not share tools or goals, efforts get duplicated and insights become fragmented. For example, one team might define active users based on weekly logins, while another uses monthly engagements, resulting in mismatched reports.</p>
<p>Lack of feedback from deployed models is another pitfall. If no one tracks what happens after predictions are made, teams miss the opportunity to refine and improve their processes. For example, if a loan approval model is deployed without any follow-up on repayment behavior, it becomes difficult to tell whether the model is supporting sound lending decisions or increasing default risk.</p>
<p><strong>The Way Out</strong></p>
<ul>
<li><p>Encourage collaboration by forming cross-functional teams and coordinating around shared planning cycles. Align on definitions early and rely on centralized dashboards to ensure that everyone is working from the same source of truth.</p>
</li>
<li><p>Create feedback loops and make them a standard part of your workflow. Track real-world outcomes, and schedule regular post-deployment reviews to understand what is working and what is not.</p>
</li>
<li><p>Include end users alongside data teams and treat their input as essential to improving the system.</p>
</li>
</ul>
<p>Taking these actions helps analytics stay practical, consistent, and responsive to real needs.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Each stage of the data workflow benefits from clarity, structure, and shared understanding. The table below summarizes a key pitfall from each stage, along with its consequences and the recommended way out, to help teams build more reliable models and deliver results that hold up in real-world settings.</p>
<table><tbody><tr><td><p><strong>Category</strong></p></td><td><p><strong>Pitfall</strong></p></td><td><p><strong>Consequences</strong></p></td><td><p><strong>Recommended Approach</strong></p></td></tr><tr><td><p><strong>Data collection</strong></p></td><td><p>Unreliable sources</p></td><td><p>Skewed insights</p></td><td><p>Validate source quality and apply consistent standards</p></td></tr><tr><td><p><strong>Data preparation</strong></p></td><td><p>Silent data leakage</p></td><td><p>Inflated model performance without real-world value</p></td><td><p>Use proper data splits and audit derived features</p></td></tr><tr><td><p><strong>Modeling &amp; validation</strong></p></td><td><p>Overfitting through hyperparameter tuning</p></td><td><p>Strong validation results that don’t translate to reality</p></td><td><p>Use nested cross-validation (a structure where tuning happens inside training folds) and keep simple baselines for comparison</p></td></tr><tr><td><p><strong>Interpretation &amp; communication</strong></p></td><td><p>Overconfidence in statistical significance</p></td><td><p>Misleading conclusions from small or selective effects</p></td><td><p>Adjust for multiple comparisons and report confidence intervals alongside p-values</p></td></tr><tr><td><p><strong>Organizational &amp; workflow</strong></p></td><td><p>Fragmented teams</p></td><td><p>Redundant work and inconsistent metrics</p></td><td><p>Encourage collaboration with shared planning, dashboards, and definitions</p></td></tr></tbody></table>

<p>Strong analytic practice is built over time. Keeping these pitfalls in view helps teams stay consistent, improve delivery, and create results that stay useful across projects and contexts.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Forecast Time Series Data with Python Darts ]]>
                </title>
                <description>
                    <![CDATA[ When analyzing time series data, your main objective is to consider the period during which the data is collected and how your variable of interest changes over time. There are various libraries for time series forecasting in Python, and Darts is one... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-forecast-time-series-data-with-python-darts/</link>
                <guid isPermaLink="false">68e40c4dd441014d7e52dc0d</guid>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Adejumo Ridwan Suleiman ]]>
                </dc:creator>
                <pubDate>Mon, 06 Oct 2025 18:37:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759775700643/6f7d18b3-2060-4708-b56e-3450acf58546.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When analyzing time series data, your main objective is to consider the period during which the data is collected and how your variable of interest changes over time.</p>
<p>There are various libraries for time series forecasting in Python, and <a target="_blank" href="https://unit8co.github.io/darts/">Darts</a> is one of them. Darts is a high-level forecasting library with algorithms to handle many kinds of time series data, regardless of the trend they exhibit.</p>
<p>This tutorial will walk you through how you can forecast time series data using Python Darts. This will help you draw meaningful insights whenever you come across time series data such as stock prices, weather measurements, and so on.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-python-darts">What is Python Darts?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-set-up-dependencies">How to Set Up Dependencies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-the-dataset">Understanding the Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-prepare-the-data-for-darts">How to Prepare the Data for Darts</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-a-forecasting-model">How to Build a Forecasting Model</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-classical-model">Classical Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-machine-learning-models">Machine Learning Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-forecast-with-deep-learning-models">How to Forecast with Deep Learning models</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-model-evaluation">Model Evaluation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-backtesting">Backtesting</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-hyper-parameter-tuning">Hyperparameter Tuning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-real-world-use-cases">Real-World Use Cases</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-best-practices">Best Practices</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-what-is-python-darts">What is Python Darts?</h2>
<p>Python Darts is an open-source library for time series analysis and forecasting. It offers a wide range of models, from statistical time series models like ARIMA and SARIMA, to Prophet, to machine learning and deep learning models like LightGBM and LSTM.</p>
<p>It also has tools for imputing missing values in time series data, and it can handle problems ranging from univariate and multivariate to hierarchical time series.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before we proceed, you will need to have the following:</p>
<ul>
<li><p>Python 3.9+ installed.</p>
</li>
<li><p>Jupyter Notebook, Google Colab, or Positron to run your code.</p>
</li>
<li><p>Download the <a target="_blank" href="https://www.kaggle.com/datasets/kalilurrahman/netflix-stock-data-live-and-latest">Netflix stock data</a>.</p>
</li>
<li><p>Have the following libraries installed:</p>
<ul>
<li><p><code>darts</code> for time series analysis</p>
</li>
<li><p><code>pandas</code> for data wrangling</p>
</li>
<li><p><code>matplotlib</code> for data visualization</p>
</li>
<li><p><code>lightgbm</code> for the gradient boosting model used later in the tutorial</p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-how-to-set-up-dependencies">How to Set Up Dependencies</h2>
<p>Load the following libraries.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> darts
<span class="hljs-keyword">from</span> darts <span class="hljs-keyword">import</span> TimeSeries
<span class="hljs-keyword">from</span> darts.models <span class="hljs-keyword">import</span> ARIMA
<span class="hljs-keyword">from</span> darts.models <span class="hljs-keyword">import</span> RegressionModel
<span class="hljs-keyword">from</span> lightgbm <span class="hljs-keyword">import</span> LGBMRegressor
<span class="hljs-keyword">from</span> darts.models <span class="hljs-keyword">import</span> RNNModel
<span class="hljs-keyword">from</span> darts.metrics <span class="hljs-keyword">import</span> mape
<span class="hljs-keyword">import</span> itertools
</code></pre>
<h2 id="heading-understanding-the-dataset">Understanding the Dataset</h2>
<p>The Netflix stock data contains historical daily prices of Netflix stock from 2002 to the present.</p>
<p>Load the data and have a preview of it.</p>
<pre><code class="lang-python">netflix = pd.read_csv(<span class="hljs-string">"/kaggle/input/netflix-stock-data-live-and-latest/Netflix_stock_history.csv"</span>)
netflix[<span class="hljs-string">'Date'</span>] = pd.to_datetime(netflix[<span class="hljs-string">'Date'</span>], utc=<span class="hljs-literal">True</span>).dt.tz_convert(<span class="hljs-literal">None</span>)
netflix.head()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757927775470/2d4b542c-3869-40c5-844c-a733b5cc4bea.png" alt="Image showing the first 5 rows of the Netflix stock data" class="image--center mx-auto" width="1059" height="484" loading="lazy"></p>
<p>To forecast time series data, we need a <code>Date</code> column, which we already have, and then the variable of interest. We have several variables, but for this tutorial, we will focus on the <code>Close</code> variable of Netflix stocks.</p>
<p>Let’s visualize the data to see how Netflix’s closing price has performed over the years.</p>
<pre><code class="lang-python">netflix.plot(x=<span class="hljs-string">'Date'</span>, y=<span class="hljs-string">'Close'</span>, figsize=(<span class="hljs-number">10</span>,<span class="hljs-number">5</span>))
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1757928810807/75a1fa13-4f2e-4bdd-a539-5eaf2663843a.png" alt="Image showing a line chart of Netflix stock data from 2000 to date" class="image--center mx-auto" width="1036" height="517" loading="lazy"></p>
<p>From the chart above, you can see that Netflix stock has grown roughly exponentially in recent years. This means that the data is non-stationary: its statistical properties, such as the mean and variance, change over time instead of staying constant.</p>
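<p>To see what non-stationarity looks like in numbers, here is a small sketch on synthetic data (not the Netflix series). It compares the mean of the first and second half of an exponentially growing series, then does the same for its period-over-period percentage changes.</p>

```python
from statistics import mean

# Synthetic prices growing about 1% per step (made-up data for illustration)
prices = [100 * 1.01 ** t for t in range(200)]

first_half, second_half = prices[:100], prices[100:]
print(mean(first_half), mean(second_half))      # the level shifts a lot

# Percentage changes from one step to the next are far more stable
returns = [prices[t] / prices[t - 1] - 1 for t in range(1, len(prices))]
print(mean(returns[:100]), mean(returns[100:]))
```

<p>The raw series has a very different average in each half, while the percentage changes stay essentially constant. This is why differencing or working with returns is a common first step before fitting classical models.</p>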
<p>There are a lot of random fluctuations in the data, which might make it difficult to forecast. Such data usually requires advanced models to handle the various fluctuations or noise present in the data.</p>
<h2 id="heading-how-to-prepare-the-data-for-darts"><strong>How to Prepare the Data for Darts</strong></h2>
<p>Before preparing the data for Darts, you need to take note of one thing.</p>
<p>If you look at the data preview from earlier, you will notice that prices are recorded only on trading days, so weekends and market holidays are missing. Darts expects a time index with a regular frequency, so we need to fill in those missing dates.</p>
<p>Copy and paste this code into your notebook.</p>
<pre><code class="lang-python">start = netflix[<span class="hljs-string">'Date'</span>].min()
end = netflix[<span class="hljs-string">'Date'</span>].max()

netflix = (
    netflix.set_index(<span class="hljs-string">'Date'</span>)
           .reindex(pd.date_range(start=start, end=end, freq=<span class="hljs-string">'D'</span>))
           .ffill()
           .reset_index()
           .rename(columns={<span class="hljs-string">'index'</span>: <span class="hljs-string">'Date'</span>})
)
netflix.head()
</code></pre>
<p>The code above ensures the <code>netflix</code> dataset has a continuous daily time series by filling in missing dates.</p>
<p>First, it finds the earliest <code>start</code> and latest <code>end</code> dates in the data, then creates a full daily date range between them.</p>
<p>By setting the <code>Date</code> column as the index and using <code>.reindex()</code> method, it inserts rows for any missing dates, which initially contain <code>NaN</code>.</p>
<p>The <code>.ffill()</code> method (forward fill) replaces these gaps by carrying forward the last known value, which is common for stock data when markets are closed, such as weekends.</p>
<p>Finally, the index is reset, and the column is renamed back to <code>Date</code>, producing a clean, continuous dataset ready for time series analysis.</p>
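<p>If the chain of methods feels abstract, the same steps can be tried on a tiny made-up DataFrame containing just a Friday and the following Monday. The prices and variable names here are assumptions for illustration only.</p>

```python
import pandas as pd

# Toy trading-day data: a Friday and the following Monday (made-up prices)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05", "2024-01-08"]),
    "Close": [100.0, 103.0],
})

full_range = pd.date_range(toy["Date"].min(), toy["Date"].max(), freq="D")
filled = (
    toy.set_index("Date")
       .reindex(full_range)   # Saturday and Sunday appear as NaN rows
       .ffill()               # carry Friday's close forward
       .reset_index()
       .rename(columns={"index": "Date"})
)
print(filled)
```

<p>The result has four rows, with Saturday and Sunday both showing Friday’s closing price.</p>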
<p>Next, we need to convert the data to a Darts <code>TimeSeries</code> object to make it usable by the Darts library.</p>
<pre><code class="lang-python">series = TimeSeries.from_dataframe(
    netflix,
    time_col=<span class="hljs-string">'Date'</span>,
    value_cols=<span class="hljs-string">'Close'</span>,
)
</code></pre>
<p>The code above converts the <code>netflix</code> DataFrame into a Darts <code>TimeSeries</code> object, which is optimized for time series modeling and forecasting.</p>
<p>It takes the <code>Date</code> column (<code>time_col='Date'</code>) as the timeline and the <code>Close</code> column (<code>value_cols='Close'</code>) as the target values to forecast.</p>
<p>The resulting <code>series</code> object is now structured for use with Darts’ advanced forecasting models like ARIMA, Prophet, RNNs, and other time series algorithms.</p>
<p>Just like you would with any other machine learning model, you need to split your data into a training set and a validation set.</p>
<pre><code class="lang-python">train, val = series.split_before(<span class="hljs-number">0.8</span>)
</code></pre>
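<p>Unlike a typical machine learning split, a time series split is not shuffled: the training set is the earliest part of the series and the validation set is whatever comes after. The sketch below shows the idea on a plain list. It is not how Darts implements <code>split_before</code>, and the helper name is hypothetical.</p>

```python
def chronological_split(values, train_fraction=0.8):
    """Split an ordered sequence into an early part and a late part, without shuffling."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]

observations = list(range(10))  # stand-in for ten time-ordered points
train_part, val_part = chronological_split(observations)
print(train_part)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_part)    # [8, 9]
```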
<h2 id="heading-how-to-build-a-forecasting-model"><strong>How to Build a Forecasting Model</strong></h2>
<p>When building a forecasting model, you have the privilege of trying various models and picking the best-performing one.</p>
<p>The Darts library has various algorithms for time series analysis, from popular statistical algorithms like the Auto Regressive Integrated Moving Average (ARIMA) and Moving Average (MA) models, to machine learning and deep learning algorithms like LightGBM and Long Short-Term Memory (LSTM).</p>
<p>Note that I will only demonstrate how these algorithms work, so we won’t focus on getting accurate model metrics. But with further feature engineering, hyperparameter tuning, and cross-validation, you can get good results on your own.</p>
<h3 id="heading-classical-model">Classical Model</h3>
<p>The classical approach uses statistical time series models such as ARIMA. ARIMA is made up of the following components:</p>
<ul>
<li><p><strong>AR (AutoRegressive):</strong> Predict the current value by looking at previous ones.</p>
</li>
<li><p><strong>I (Integrated):</strong> Remove trends by focusing on changes instead of raw values.</p>
</li>
<li><p><strong>MA (Moving Average):</strong> Learn from the errors of past predictions to improve accuracy.</p>
</li>
</ul>
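<p>The "Integrated" part is the easiest one to see in isolation. As a minimal sketch with made-up numbers, the helper below (a hypothetical name, not a Darts function) replaces each value with its change from the previous value, which is what the differencing order <code>d</code> controls.</p>

```python
def difference(values, order=1):
    """Apply first differencing `order` times."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values

trend = [3 * t + 10 for t in range(8)]      # steadily rising series
print(difference(trend))                     # [3, 3, 3, 3, 3, 3, 3]

quadratic = [t ** 2 for t in range(6)]      # accelerating series
print(difference(quadratic, order=2))        # [2, 2, 2, 2]
```

<p>One round of differencing removes a linear trend, and two rounds remove a quadratic one, leaving a flat series that the AR and MA parts can model.</p>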
<p>Run the code below in your notebook to fit an ARIMA model.</p>
<pre><code class="lang-python">arima_model = ARIMA()
arima_model.fit(train)
arima_forecast = arima_model.predict(len(val))
</code></pre>
<p>To visualize the forecast by the model, call the <code>.plot()</code> method on the <code>arima_forecast</code> object.</p>
<pre><code class="lang-python">series.plot(label=<span class="hljs-string">'actual'</span>)
arima_forecast.plot(label=<span class="hljs-string">'forecast'</span>)
plt.legend()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758028284156/a40f2341-cfc6-4a9f-8297-e0511c2bb254.png" alt="Image showing the ARIMA model forecast of netflix stock " class="image--center mx-auto" width="820" height="646" loading="lazy"></p>
<p>You can improve the model by adding some additional parameters to the <code>ARIMA()</code> class. You can read more about that in the <a target="_blank" href="https://unit8co.github.io/darts/generated_api/darts.models.forecasting.arima.html">Darts documentation</a>.</p>
<h3 id="heading-machine-learning-models"><strong>Machine Learning Models</strong></h3>
<p>Classical models like ARIMA are linear, so they struggle with non-linear patterns in the data. Machine learning models fill this gap. We’ll use the LightGBM model as an example.</p>
<p>LightGBM is a gradient boosting framework that builds decision trees sequentially. Each new tree is added to correct the errors of the previous trees.</p>
<p>Although it was not designed to handle time series, with some feature engineering such as lags, rolling statistics, and seasonal indicators, you can make it learn patterns from time series data.</p>
<p>Run this code in your notebook to fit a LightGBM model on the Netflix data.</p>
<pre><code class="lang-python">lgbm = LGBMRegressor()
lgbm_model = RegressionModel(lags=<span class="hljs-number">12</span>, model=lgbm)
lgbm_model.fit(train)
lgbm_forecast = lgbm_model.predict(len(val))
</code></pre>
<p>In the code above, the <code>lags</code> argument is set to <code>12</code>, which means the model uses the closing prices from the previous 12 days as input features when predicting each day.</p>
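<p>To get a feel for what lag features look like, here is a rough pandas sketch on made-up prices. It only illustrates the idea of turning past values into columns; it is not how Darts builds its features internally, and it uses 3 lags instead of 12 to keep the table small.</p>

```python
import pandas as pd

closes = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name="Close")  # made-up values
n_lags = 3  # the tutorial uses 12; 3 keeps the table readable

# Each lag_k column holds the value from k steps earlier
features = pd.concat(
    {f"lag_{k}": closes.shift(k) for k in range(1, n_lags + 1)}, axis=1
)
table = features.assign(target=closes).dropna()
print(table)
```

<p>Each row now pairs a target value with the three values that came before it, which is the tabular shape a regression model like LightGBM can learn from.</p>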
<p>Let’s have a view of the forecast by running the following code.</p>
<pre><code class="lang-python">series.plot(label=<span class="hljs-string">'actual'</span>)
lgbm_forecast.plot(label=<span class="hljs-string">'forecast'</span>)
plt.legend()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758029933172/54f34a69-4f6b-4b44-85ab-d0b45931d701.png" alt="Image showing the LightGBM model forecast of netflix stock " class="image--center mx-auto" width="813" height="631" loading="lazy"></p>
<p>You can read more about tuning the LightGBM model from the <a target="_blank" href="https://unit8co.github.io/darts/generated_api/darts.models.forecasting.lgbm.html">Darts documentation</a> to improve the above model.</p>
<h3 id="heading-how-to-forecast-with-deep-learning-models"><strong>How to Forecast with Deep Learning models</strong></h3>
<p>You can go for deep learning models designed for time series, such as LSTM, a kind of Recurrent Neural Network (RNN) designed to capture long-term dependencies in sequential data.</p>
<p>Run the following code to build the LSTM model.</p>
<pre><code class="lang-python">lstm_model = RNNModel(model=<span class="hljs-string">'LSTM'</span>, input_chunk_length=<span class="hljs-number">12</span>, output_chunk_length=<span class="hljs-number">6</span>, n_epochs=<span class="hljs-number">100</span>)
lstm_model.fit(train)
lstm_forecast = lstm_model.predict(len(val))
</code></pre>
<p>Now let’s visualize the forecast and see what we have.</p>
<pre><code class="lang-python">series.plot(label=<span class="hljs-string">'actual'</span>)
lstm_forecast.plot(label=<span class="hljs-string">'forecast'</span>)
plt.legend()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758116174578/2ff80218-2254-452d-8d4c-2f85c61612de.png" alt="Image showing the LSTM model forecast of Netflix stock " class="image--center mx-auto" width="682" height="526" loading="lazy"></p>
<p>You can look up the <a target="_blank" href="https://unit8co.github.io/darts/generated_api/darts.models.forecasting.rnn_model.html">Darts documentation</a> to improve the model and check out other deep learning models also.</p>
<h2 id="heading-model-evaluation"><strong>Model Evaluation</strong></h2>
<p>Now that you have three models, you need to select the best one among them using the Mean Absolute Percentage Error (MAPE).</p>
<p>It expresses the average absolute error as a percentage of the actual values, and the closer your value is to 0, the better your model.</p>
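<p>To make the definition concrete, here is a hand-rolled version of the metric on made-up numbers. It is a sketch for understanding only; in the tutorial we use the <code>mape</code> function from Darts.</p>

```python
def mean_absolute_percentage_error(actual, forecast):
    """MAPE in percent; undefined if any actual value is zero."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)

actual = [100, 200, 400]      # made-up observations
forecast = [110, 180, 400]    # made-up predictions
result = mean_absolute_percentage_error(actual, forecast)
print(f"MAPE: {result:.2f}%")
```

<p>The three forecasts are off by 10%, 10%, and 0%, so the average error is about 6.67%.</p>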
<p>Run the following to print the MAPE of each respective model.</p>
<pre><code class="lang-python">arima_error = mape(val, arima_forecast)
print(<span class="hljs-string">"MAPE:"</span>, arima_error)
lgbm_error = mape(val, lgbm_forecast)
print(<span class="hljs-string">"MAPE:"</span>, lgbm_error)
lstm_error = mape(val, lstm_forecast)
print(<span class="hljs-string">"MAPE:"</span>, lstm_error)
</code></pre>
<pre><code class="lang-bash">&gt; MAPE: 38.33262525601514
&gt; MAPE: 39.00241495209449
&gt; MAPE: 38.82910057097827
</code></pre>
<p>The model with the lowest MAPE is the ARIMA model, at approximately 38.33%, which makes it our best-performing model of the three.</p>
<h2 id="heading-backtesting">Backtesting</h2>
<p>Darts has a feature called backtesting that allows you to evaluate your models based on historical data, using a rolling forecast.</p>
<p>Backtesting is like a time machine for forecasting. It simulates how your model would have performed in the past by repeatedly training it on historical data up to a certain point, making a prediction for the next step, then moving forward, and repeating the process.</p>
<p>This rolling evaluation simulates how the model would behave in real-world conditions, where future data is unknown, helping you measure its consistency and reliability over time, instead of just testing it once on a single validation set.</p>
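<p>The loop behind this idea can be sketched in plain Python. The example below uses a naive "repeat the last value" forecaster and made-up data, so it shows the mechanics only; the function name and parameters are assumptions that loosely mirror the Darts arguments used later.</p>

```python
def rolling_backtest(values, start, horizon=1, stride=1):
    """Expanding-window backtest with a naive 'last value' forecaster.

    Returns one (forecast, actual) pair per forecast origin.
    """
    results = []
    for origin in range(start, len(values) - horizon + 1, stride):
        history = values[:origin]              # data known at forecast time
        forecast = history[-1]                 # naive model: repeat the last value
        actual = values[origin + horizon - 1]  # value `horizon` steps ahead
        results.append((forecast, actual))
    return results

series_values = [10, 12, 11, 13, 15, 14, 16, 18]  # made-up data
pairs = rolling_backtest(series_values, start=5)
print(pairs)  # [(15, 14), (14, 16), (16, 18)]
```

<p>Each pair comes from a different point in time, and the forecaster only ever sees data from before the value it is predicting.</p>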
<p>Since the ARIMA model is currently our best-performing model, run the code below to implement backtesting.</p>
<pre><code class="lang-python">
<span class="hljs-comment"># Perform backtesting on the training + validation series</span>
backtest_series = train.concatenate(val)

<span class="hljs-comment"># Backtest</span>
backtest_forecast = arima_model.historical_forecasts(
    series=backtest_series,
    start=<span class="hljs-number">0.8</span>,          <span class="hljs-comment"># fraction of the series to start forecasting from</span>
    forecast_horizon=len(val),
    stride=<span class="hljs-number">1</span>,           <span class="hljs-comment"># step size of rolling forecast</span>
    retrain=<span class="hljs-literal">True</span>,       <span class="hljs-comment"># retrain the model at each step</span>
    verbose=<span class="hljs-literal">True</span>
)

<span class="hljs-comment"># Compute metrics</span>
error = mape(backtest_series[-len(val):], backtest_forecast)
print(<span class="hljs-string">f"MAPE: <span class="hljs-subst">{error:<span class="hljs-number">.2</span>f}</span>%"</span>)
</code></pre>
<pre><code class="lang-bash">&gt; historical forecasts: 100%|██████████| 1/1 [00:02&lt;00:00,  2.69s/it]MAPE: 47.27%
</code></pre>
<p>In the code above,</p>
<ul>
<li><p>The <code>start</code> argument defines where backtesting begins. Here, <code>0.8</code> means forecasts start 80% of the way through the series, so they cover the last 20% of the data.</p>
</li>
<li><p>The <code>forecast_horizon</code> is how many steps ahead to forecast at each point.</p>
</li>
<li><p>The <code>stride</code> is the number of time steps between two consecutive forecasts.</p>
</li>
<li><p>The <code>retrain=True</code> refits the model at each step for realistic evaluation.</p>
</li>
</ul>
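<p>You can see that the MAPE after backtesting is higher than the single-split result. Note, however, that with <code>forecast_horizon=len(val)</code> only one forecast fits between the start point and the end of the series, which is why the progress bar shows a single iteration. For a true rolling evaluation, use a shorter horizon (for example, 7 or 30 days) so that the model is evaluated at many different points in time.</p>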
<p>On your own, you can try to replicate backtesting for the other models.</p>
<h2 id="heading-hyper-parameter-tuning">Hyperparameter Tuning</h2>
<p>The ARIMA model has three hyperparameters:</p>
<ul>
<li><p><code>p</code> which is the AR order</p>
</li>
<li><p><code>d</code> which is the differencing order</p>
</li>
<li><p><code>q</code> which is the MA order</p>
</li>
</ul>
<p>You can use either grid or random search to tune your ARIMA model in Darts.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Define possible values</span>
p_values = range(<span class="hljs-number">0</span>, <span class="hljs-number">4</span>)
d_values = range(<span class="hljs-number">0</span>, <span class="hljs-number">3</span>)
q_values = range(<span class="hljs-number">0</span>, <span class="hljs-number">4</span>)

best_mape = float(<span class="hljs-string">'inf'</span>)
best_params = <span class="hljs-literal">None</span>

<span class="hljs-keyword">for</span> p, d, q <span class="hljs-keyword">in</span> itertools.product(p_values, d_values, q_values):
    <span class="hljs-keyword">try</span>:
        arima_model = ARIMA(p=p, d=d, q=q)
        arima_model.fit(train)
        arima_forecast = arima_model.predict(len(val))
        arima_error = mape(val, arima_forecast)
        <span class="hljs-keyword">if</span> arima_error &lt; best_mape:
            best_mape = arima_error
            best_params = (p, d, q)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-comment"># Some combinations may fail</span>
        <span class="hljs-keyword">continue</span>

print(<span class="hljs-string">f"Best ARIMA params: p=<span class="hljs-subst">{best_params[<span class="hljs-number">0</span>]}</span>, d=<span class="hljs-subst">{best_params[<span class="hljs-number">1</span>]}</span>, q=<span class="hljs-subst">{best_params[<span class="hljs-number">2</span>]}</span> with MAPE=<span class="hljs-subst">{best_mape:<span class="hljs-number">.2</span>f}</span>%"</span>)
</code></pre>
<pre><code class="lang-bash">&gt; Best ARIMA params: p=2, d=0, q=3 with MAPE=35.95%
</code></pre>
<p>In the above code, you define a range of possible values for the <code>p</code>, <code>d</code>, and <code>q</code> components, iterate over every combination of those values, and keep the combination with the lowest MAPE.</p>
<p>Note that each model has its own hyperparameters to tune, and you will need to check <a target="_blank" href="https://unit8co.github.io/darts/userguide/hyperparameter_optimization.html">the Darts documentation</a> for the hyperparameters of other models.</p>
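<p>Grid search tries every combination, which gets slow as the grid grows. Random search evaluates only a random subset. The sketch below shows the idea with a hypothetical scoring function standing in for "fit ARIMA and return the validation MAPE", so the names and the toy score are assumptions, not Darts APIs.</p>

```python
import itertools
import random

def random_search(param_grid, score_fn, n_trials, seed=0):
    """Score a random subset of the grid and return (best_score, best_params)."""
    combos = list(itertools.product(*param_grid.values()))
    rng = random.Random(seed)
    sampled = rng.sample(combos, k=min(n_trials, len(combos)))
    scored = [(score_fn(*combo), combo) for combo in sampled]
    return min(scored)  # lowest score wins

grid = {"p": range(0, 4), "d": range(0, 3), "q": range(0, 4)}

# Hypothetical stand-in for fitting a model and measuring its error
def toy_score(p, d, q):
    return abs(p - 2) + abs(d - 1) + abs(q - 1)

best_score, best_params = random_search(grid, toy_score, n_trials=15)
print(best_score, best_params)
```

<p>With 15 trials, this evaluates fewer than a third of the 48 combinations in the grid. The trade-off is that the best combination may not be among those sampled.</p>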
<h2 id="heading-real-world-use-cases"><strong>Real-World Use Cases</strong></h2>
<p>Forecasting time series data has a lot of real-world applications, some of which are:</p>
<ul>
<li><p><strong>Stock price prediction:</strong> Like the dataset used in this tutorial, forecasting is used in finance for stock price prediction, allowing investors to manage risk.</p>
</li>
<li><p><strong>Demand forecasting for inventory:</strong> As a store owner, you can forecast demand for a product based on its past sales. This tells you which products are likely to be in high demand.</p>
</li>
<li><p><strong>Energy consumption prediction:</strong> Governments, industries, and consumers can plan and manage energy production, distribution, and consumption efficiently, based on data from past usage. This helps to avoid blackouts and wastage, enabling them to prepare ahead.</p>
</li>
</ul>
<h2 id="heading-best-practices">Best Practices</h2>
<ul>
<li><p><strong>Always visualize residuals:</strong> Residuals are the differences between actual values and forecasted values. Visualize them to detect outliers and unusual events.</p>
</li>
<li><p><strong>Perform proper backtesting:</strong> Backtesting gives you a more realistic picture of how a model performs, because it evaluates the model at many points in time. Backtest all your candidate models before choosing one to forecast with.</p>
</li>
<li><p><strong>Avoid data leakage:</strong> Never let your model see validation data during training, and keep your splits in chronological order. When you cross-validate, use time-aware schemes such as backtesting instead of random shuffling.</p>
</li>
<li><p><strong>Use domain knowledge for feature engineering:</strong> Ensure you understand the data you are working with. This comes in handy in feature engineering, when you want to come up with new features to help your forecasting model, especially in multivariate time series forecasting.</p>
</li>
</ul>
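<p>As a small illustration of the first practice, the sketch below computes residuals on made-up numbers and flags any that sit more than two standard deviations from the mean. The threshold is a common rule of thumb, not a Darts default.</p>

```python
from statistics import mean, stdev

actual = [100, 102, 101, 103, 150, 104, 105]    # made-up values with one spike
forecast = [101, 101, 102, 102, 103, 104, 104]  # made-up predictions

residuals = [a - f for a, f in zip(actual, forecast)]
center, spread = mean(residuals), stdev(residuals)

# Flag residuals more than 2 standard deviations away from the mean residual
outliers = [i for i, r in enumerate(residuals) if abs(r - center) > 2 * spread]
print(residuals)  # [-1, 1, -1, 1, 47, 0, 1]
print(outliers)   # [4]
```

<p>Plotting <code>residuals</code> with matplotlib would make the spike at index 4 stand out immediately.</p>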
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>This tutorial is more like an overview, especially if you are new to time series, but you can build a lot just from what you have learned.</p>
<p>You now have an idea of what time series and forecasting are, and how you can use the Darts Python library to work with them.</p>
<p>You also learned of various models for forecasting time series data, and how you can apply techniques such as backtesting and hyperparameter tuning to achieve better results.</p>
<p>Another interesting thing about Darts is its ability to handle <a target="_blank" href="https://unit8co.github.io/darts/userguide/timeseries.html#hierarchical-time-series">hierarchical time series</a>, where individual series are organized into levels that add up to aggregated series.</p>
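<p>As a tiny illustration of what a hierarchy means, consider two hypothetical stores whose daily sales add up to a regional total. The names and numbers below are made up, and this is plain Python, not the Darts hierarchy API.</p>

```python
# Hypothetical daily sales for two stores in one region (made-up numbers)
bottom_level = {
    "store_a": [5, 7, 6],
    "store_b": [3, 4, 8],
}

# The regional series is the element-wise sum of its children
region_total = [sum(day) for day in zip(*bottom_level.values())]
print(region_total)  # [8, 11, 14]
```

<p>Hierarchical forecasting methods aim to keep forecasts consistent with this structure, so that store-level forecasts still add up to the regional one.</p>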
<p>Darts is one of the most powerful time series libraries in Python and has a lot of models to handle various cases. You can proceed to explore models such as <a target="_blank" href="https://unit8co.github.io/darts/generated_api/darts.models.forecasting.transformer_model.html">Transformers</a> and also <a target="_blank" href="https://unit8co.github.io/darts/examples/01-multi-time-series-and-covariates.html">multi-series forecasting</a>, which are used for special use cases.</p>
<p>If you are interested in more data science and statistics articles, don’t forget to check out <a target="_blank" href="https://learndata.xyz/blog">my blog</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Graph Algorithms in Python: BFS, DFS, and Beyond ]]>
                </title>
                <description>
                    <![CDATA[ Have you ever wondered how Google Maps finds the fastest route or how Netflix recommends what to watch? Graph algorithms are behind these decisions. Graphs, made up of nodes (points) and edges (connections), are one of the most powerful data structur... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/graph-algorithms-in-python-bfs-dfs-and-beyond/</link>
                <guid isPermaLink="false">68b86be0956e509211153b48</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ graphs ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oyedele Tioluwani ]]>
                </dc:creator>
                <pubDate>Wed, 03 Sep 2025 16:25:04 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756916679855/9b173128-ed79-4ae0-8cc8-79fca17662dd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Have you ever wondered how Google Maps finds the fastest route or how Netflix recommends what to watch? Graph algorithms are behind these decisions.</p>
<p>Graphs, made up of nodes (points) and edges (connections), are one of the most powerful data structures in computer science. They help model relationships efficiently, from social networks to transportation systems.</p>
<p>In this guide, we will explore two core traversal techniques: Breadth-First Search (BFS) and Depth-First Search (DFS). Moving on from there, we will cover advanced algorithms like Dijkstra’s, A*, Kruskal’s, Prim’s, and Bellman-Ford.</p>
<h3 id="heading-table-of-contents">Table of Contents:</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-understanding-graphs-in-python">Understanding Graphs in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ways-to-represent-graphs-in-python">Ways to Represent Graphs in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-breadth-first-search-bfs">Breadth-First Search (BFS)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-depth-first-search-dfs">Depth-First Search (DFS)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-dijkstras-algorithm">Dijkstra’s Algorithm</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-a-search">A* Search</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-kruskals-algorithm">Kruskal’s Algorithm</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prims-algorithm">Prim’s Algorithm</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-bellman-ford-algorithm">Bellman-Ford Algorithm</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-optimizing-graph-algorithms-in-python">Optimizing Graph Algorithms in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-takeaways">Key Takeaways</a></p>
</li>
</ol>
<h2 id="heading-understanding-graphs-in-python">Understanding Graphs in Python</h2>
<p>A graph consists of <strong>nodes (vertices)</strong> and <strong>edges (relationships)</strong>.</p>
<p>For example, in a social network, people are nodes and friendships are edges. On a road map, cities are nodes and roads are edges.</p>
<p>There are a few different types of graphs:</p>
<ul>
<li><p><strong>Directed</strong>: edges have direction (one-way streets, task scheduling).</p>
</li>
<li><p><strong>Undirected</strong>: edges go both ways (mutual friendships).</p>
</li>
<li><p><strong>Weighted</strong>: edges have values (distances, costs).</p>
</li>
<li><p><strong>Unweighted</strong>: all edges count the same (basic subway routes).</p>
</li>
</ul>
<p>Now that you know what graphs are, let’s look at the different ways they can be represented in Python.</p>
<h2 id="heading-ways-to-represent-graphs-in-python">Ways to Represent Graphs in Python</h2>
<p>Before diving into traversal and pathfinding, it’s important to know how graphs can be represented. Different problems call for different representations.</p>
<h3 id="heading-adjacency-matrix">Adjacency Matrix</h3>
<p>An adjacency matrix is a 2D array where each cell <code>(i, j)</code> shows whether there is an edge from node <code>i</code> to node <code>j</code>.</p>
<ul>
<li><p>In an <strong>unweighted graph</strong>, <code>0</code> means no edge, and <code>1</code> means an edge exists.</p>
</li>
<li><p>In a <strong>weighted graph</strong>, the cell holds the edge weight.</p>
</li>
</ul>
<p>This makes checking whether two nodes are directly connected a constant-time lookup. The trade-off is memory: the matrix always stores V × V cells for V nodes, so it grows quadratically even when the graph has few edges.</p>
<pre><code class="lang-python">graph = [
    [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>],
    [<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>],
    [<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>]
]
</code></pre>
<p>Here, the matrix shows a fully connected graph of 3 nodes. For example, <code>graph[0][1]</code> is <code>1</code>, which means there is an edge from node 0 to node 1.</p>
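<p>For the weighted case, here is a minimal sketch of building the matrix from an edge list. The helper name <code>build_matrix</code> and the sample edges are illustrative assumptions, and the sketch uses <code>0</code> to mean “no edge”, which only works if no real edge has weight 0:</p>

```python
def build_matrix(num_nodes, edges, directed=False):
    """Build a weighted adjacency matrix; 0 means "no edge"."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j, weight in edges:
        matrix[i][j] = weight
        if not directed:
            matrix[j][i] = weight  # mirror the entry for undirected graphs
    return matrix

# Hypothetical sample data: (from_node, to_node, weight)
edges = [(0, 1, 7), (0, 2, 3), (1, 2, 5)]
matrix = build_matrix(3, edges)

print(matrix)        # [[0, 7, 3], [7, 0, 5], [3, 5, 0]]
print(matrix[0][2])  # 3: the weight is read with a single index lookup
```

<p>If zero-weight edges are possible in your data, a sentinel such as <code>None</code> or <code>float('inf')</code> is a safer “no edge” marker.</p>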
<h3 id="heading-adjacency-list">Adjacency List</h3>
<p>An adjacency list represents each node along with the list of nodes it connects to.</p>
<p>This is usually more efficient for sparse graphs (where not every node is connected to every other node). It saves memory because only actual edges are stored instead of an entire grid.</p>
<pre><code class="lang-python">graph = {
    <span class="hljs-string">'A'</span>: [<span class="hljs-string">'B'</span>,<span class="hljs-string">'C'</span>],
    <span class="hljs-string">'B'</span>: [<span class="hljs-string">'A'</span>,<span class="hljs-string">'C'</span>],
    <span class="hljs-string">'C'</span>: [<span class="hljs-string">'A'</span>,<span class="hljs-string">'B'</span>]
}
</code></pre>
<p>Here, node <code>A</code> connects to <code>B</code> and <code>C</code>, and so on. Checking whether two specific nodes are connected means scanning a neighbor list, which is slower than a matrix lookup. But for large, sparse graphs, the memory savings make it the better option.</p>
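<p>Graph data often arrives as a plain list of edges rather than a ready-made dict. As a sketch (the function name <code>build_adjacency_list</code> is an assumption for illustration), <code>collections.defaultdict</code> can group those edges into an adjacency list:</p>

```python
from collections import defaultdict

def build_adjacency_list(edges, directed=False):
    """Group an edge list into {node: [neighbors]}."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        if directed:
            adjacency[v]  # touch v so nodes without outgoing edges still get an entry
        else:
            adjacency[v].append(u)  # store the reverse direction too
    return dict(adjacency)

edges = [('A', 'B'), ('A', 'C'), ('B', 'C')]
graph = build_adjacency_list(edges)

print(graph)              # {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}
print('C' in graph['A'])  # True, found by scanning A's neighbor list
```

<p>If you need fast membership checks, storing each neighbor collection as a <code>set</code> instead of a list is a common alternative, at the cost of losing insertion order.</p>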
<h3 id="heading-using-networkx">Using NetworkX</h3>
<p>When working on real-world applications, writing your own adjacency lists and matrices can get tedious. That’s where <strong>NetworkX</strong> comes in: it’s a Python library that simplifies graph creation and analysis.</p>
<p>With just a few lines of code, you can build graphs, visualize them, and run advanced algorithms without reinventing the wheel.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> networkx <span class="hljs-keyword">as</span> nx
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

G = nx.Graph()
G.add_edges_from([(<span class="hljs-string">'A'</span>,<span class="hljs-string">'B'</span>), (<span class="hljs-string">'A'</span>,<span class="hljs-string">'C'</span>), (<span class="hljs-string">'B'</span>,<span class="hljs-string">'C'</span>)])
nx.draw(G, with_labels=<span class="hljs-literal">True</span>)
plt.show()
</code></pre>
<p>This builds a triangle-shaped graph with nodes A, B, and C. NetworkX also lets you easily run algorithms like shortest paths or spanning trees without manually coding them.</p>
<p>Now that we’ve seen different ways to represent graphs, let’s move on to traversal methods, starting with Breadth-First Search (BFS).</p>
<h2 id="heading-breadth-first-search-bfs">Breadth-First Search (BFS)</h2>
<p>The basic idea behind BFS is to explore a graph one layer at a time. It looks at all the neighbors of a starting node before moving on to the next level. A queue is used to keep track of what comes next.</p>
<p>BFS is particularly useful for:</p>
<ul>
<li><p>Finding the shortest path in unweighted graphs</p>
</li>
<li><p>Detecting connected components</p>
</li>
<li><p>Crawling web pages</p>
</li>
</ul>
<p>Here’s an example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> deque

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">bfs</span>(<span class="hljs-params">graph, start</span>):</span>
    visited = {start}
    queue = deque([start])

    <span class="hljs-keyword">while</span> queue:
        node = queue.popleft()
        print(node, end=<span class="hljs-string">" "</span>)
        <span class="hljs-keyword">for</span> neighbor <span class="hljs-keyword">in</span> graph[node]:
            <span class="hljs-keyword">if</span> neighbor <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> visited:
                visited.add(neighbor)
                queue.append(neighbor)


graph = {
    <span class="hljs-string">'A'</span>: [<span class="hljs-string">'B'</span>,<span class="hljs-string">'C'</span>],
    <span class="hljs-string">'B'</span>: [<span class="hljs-string">'A'</span>,<span class="hljs-string">'D'</span>,<span class="hljs-string">'E'</span>],
    <span class="hljs-string">'C'</span>: [<span class="hljs-string">'A'</span>,<span class="hljs-string">'F'</span>],
    <span class="hljs-string">'D'</span>: [<span class="hljs-string">'B'</span>],
    <span class="hljs-string">'E'</span>: [<span class="hljs-string">'B'</span>,<span class="hljs-string">'F'</span>],
    <span class="hljs-string">'F'</span>: [<span class="hljs-string">'C'</span>,<span class="hljs-string">'E'</span>]
}

bfs(graph, <span class="hljs-string">'A'</span>)
</code></pre>
<p>Here’s what’s going on in this code:</p>
<ul>
<li><p><code>graph</code> is a dict where each node maps to a list of neighbors.</p>
</li>
<li><p><code>deque</code> is used as a FIFO queue so we visit nodes level-by-level.</p>
</li>
<li><p><code>visited</code> keeps track of nodes we’ve already discovered so we don’t enqueue them twice or loop forever on cycles.</p>
</li>
<li><p>In the loop, we pop a node, print it, then for each unvisited neighbor, we mark it visited and enqueue it.</p>
</li>
</ul>
<p>And here’s the output:</p>
<pre><code class="lang-python">A B C D E F
</code></pre>
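<p>The traversal above only prints nodes. To use BFS for the shortest-path case mentioned earlier, one common approach is to remember each node’s parent and walk those links back from the goal. This is a sketch under that assumption, and <code>bfs_shortest_path</code> is a name chosen for illustration:</p>

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Return one shortest path (fewest edges) from start to goal, or None."""
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])

    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk parent links back to start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None  # goal is not reachable from start

graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D', 'E'],
    'C': ['A', 'F'],
    'D': ['B'],
    'E': ['B', 'F'],
    'F': ['C', 'E']
}

print(bfs_shortest_path(graph, 'A', 'F'))  # ['A', 'C', 'F']
```

<p>Because BFS reaches nodes level by level, the first time the goal comes off the queue it has been reached with the fewest possible edges.</p>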
<p>Now that we have seen how BFS works, let’s turn to its counterpart: Depth-First Search (DFS).</p>
<h2 id="heading-depth-first-search-dfs">Depth-First Search (DFS)</h2>
<p>DFS works differently from BFS. Instead of moving level by level, it follows one path as far as it can go before backtracking. Think of it as diving deep down a trail, then returning to explore the others.</p>
<p>We can implement DFS in two ways:</p>
<ul>
<li><p><strong>Recursive DFS</strong>, which uses the function call stack</p>
</li>
<li><p><strong>Iterative DFS</strong>, which uses an explicit stack</p>
</li>
</ul>
<p>DFS is especially useful for:</p>
<ul>
<li><p>Cycle detection</p>
</li>
<li><p>Maze solving and puzzles</p>
</li>
<li><p>Topological sorting</p>
</li>
</ul>
<p>Here’s an example of recursive DFS:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">dfs_recursive</span>(<span class="hljs-params">graph, node, visited=None</span>):</span>
    <span class="hljs-keyword">if</span> visited <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        visited = set()
    <span class="hljs-keyword">if</span> node <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> visited:
        print(node, end=<span class="hljs-string">" "</span>)
        visited.add(node)
        <span class="hljs-keyword">for</span> neighbor <span class="hljs-keyword">in</span> graph[node]:
            dfs_recursive(graph, neighbor, visited)

graph = {
    <span class="hljs-string">'A'</span>: [<span class="hljs-string">'B'</span>,<span class="hljs-string">'C'</span>],
    <span class="hljs-string">'B'</span>: [<span class="hljs-string">'A'</span>,<span class="hljs-string">'D'</span>,<span class="hljs-string">'E'</span>],
    <span class="hljs-string">'C'</span>: [<span class="hljs-string">'A'</span>,<span class="hljs-string">'F'</span>],
    <span class="hljs-string">'D'</span>: [<span class="hljs-string">'B'</span>],
    <span class="hljs-string">'E'</span>: [<span class="hljs-string">'B'</span>,<span class="hljs-string">'F'</span>],
    <span class="hljs-string">'F'</span>: [<span class="hljs-string">'C'</span>,<span class="hljs-string">'E'</span>]
}

dfs_recursive(graph, <span class="hljs-string">'A'</span>)
</code></pre>
<ul>
<li><p><code>visited</code> is a set that tracks nodes already processed so you don’t loop forever on cycles.</p>
</li>
<li><p>On each call, if <code>node</code> hasn’t been seen, it’s printed, marked visited, then the function recurses into each neighbor.</p>
</li>
</ul>
<p>Traversal order:</p>
<pre><code class="lang-python">A B D E F C
</code></pre>
<p>Explanation: DFS visits B after A, goes deeper into D, then backtracks to B to explore E, continues from E to F, and finally reaches C through F.</p>
<p>And here’s an example of iterative DFS:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">dfs_iterative</span>(<span class="hljs-params">graph, start</span>):</span>
    visited = set()
    stack = [start]

    <span class="hljs-keyword">while</span> stack:
        node = stack.pop()
        <span class="hljs-keyword">if</span> node <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> visited:
            print(node, end=<span class="hljs-string">" "</span>)
            visited.add(node)
            stack.extend(reversed(graph[node]))

dfs_iterative(graph, <span class="hljs-string">'A'</span>)
</code></pre>
<ul>
<li><p><code>visited</code> tracks nodes you’ve already processed so you don’t loop on cycles.</p>
</li>
<li><p><code>stack</code> is LIFO (last in, first out) – you <code>pop()</code> the top node, process it, then push its neighbors.</p>
</li>
<li><p><code>reversed(graph[node])</code> pushes neighbors in reverse so they’re visited in the original left-to-right order (mimicking the usual recursive DFS).</p>
</li>
</ul>
<p>Here’s the output:</p>
<pre><code class="lang-python">A B D E F C
</code></pre>
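<p>As an example of one of the use cases listed above, here is a minimal sketch of topological sorting with DFS. It assumes the input is a directed graph with no cycles, and the <code>tasks</code> graph is made-up sample data:</p>

```python
def topological_sort(graph):
    """Order the nodes of a directed acyclic graph so every edge points forward."""
    visited = set()
    order = []

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for neighbor in graph[node]:
            visit(neighbor)
        order.append(node)  # post-order: added after everything it points to

    for node in graph:
        visit(node)
    return order[::-1]  # reverse post-order is a valid topological order

# Hypothetical task graph: an edge u -> v means "u must come before v"
tasks = {
    'shop': ['cook'],
    'cook': ['eat'],
    'eat': ['wash up'],
    'wash up': []
}

print(topological_sort(tasks))  # ['shop', 'cook', 'eat', 'wash up']
```

<p>This sketch does not check for cycles. On a graph that contains one, it still returns a list, but that list is not a valid ordering.</p>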
<p>With BFS and DFS explained, we can now move on to algorithms that solve more complex problems, starting with Dijkstra’s shortest path algorithm.</p>
<h2 id="heading-dijkstras-algorithm">Dijkstra’s Algorithm</h2>
<p>Dijkstra’s algorithm is built on a simple rule: always visit the node with the smallest known distance first. By repeating this, it uncovers the shortest path from a starting node to all others in a weighted graph that doesn’t have negative edges.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> heapq

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">dijkstra</span>(<span class="hljs-params">graph, start</span>):</span>
    heap = [(<span class="hljs-number">0</span>, start)]
    shortest_path = {node: float(<span class="hljs-string">'inf'</span>) <span class="hljs-keyword">for</span> node <span class="hljs-keyword">in</span> graph}
    shortest_path[start] = <span class="hljs-number">0</span>

    <span class="hljs-keyword">while</span> heap:
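        cost, node = heapq.heappop(heap)
        <span class="hljs-keyword">if</span> cost &gt; shortest_path[node]:
            <span class="hljs-keyword">continue</span>  <span class="hljs-comment"># stale entry: a shorter path to node was already found</span>
        <span class="hljs-keyword">for</span> neighbor, weight <span class="hljs-keyword">in</span> graph[node]: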
            new_cost = cost + weight
            <span class="hljs-keyword">if</span> new_cost &lt; shortest_path[neighbor]:
                shortest_path[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    <span class="hljs-keyword">return</span> shortest_path

graph = {
    <span class="hljs-string">'A'</span>: [(<span class="hljs-string">'B'</span>,<span class="hljs-number">1</span>), (<span class="hljs-string">'C'</span>,<span class="hljs-number">4</span>)],
    <span class="hljs-string">'B'</span>: [(<span class="hljs-string">'A'</span>,<span class="hljs-number">1</span>), (<span class="hljs-string">'C'</span>,<span class="hljs-number">2</span>), (<span class="hljs-string">'D'</span>,<span class="hljs-number">5</span>)],
    <span class="hljs-string">'C'</span>: [(<span class="hljs-string">'A'</span>,<span class="hljs-number">4</span>), (<span class="hljs-string">'B'</span>,<span class="hljs-number">2</span>), (<span class="hljs-string">'D'</span>,<span class="hljs-number">1</span>)],
    <span class="hljs-string">'D'</span>: [(<span class="hljs-string">'B'</span>,<span class="hljs-number">5</span>), (<span class="hljs-string">'C'</span>,<span class="hljs-number">1</span>)]
}

print(dijkstra(graph, <span class="hljs-string">'A'</span>))
</code></pre>
<p>Here’s what’s going on in this code:</p>
<ul>
<li><p><code>graph</code> is an adjacency list: each node maps to a list of <code>(neighbor, weight)</code> pairs.</p>
</li>
<li><p><code>shortest_path</code> stores the current best-known distance to each node (∞ initially, 0 for <code>start</code>).</p>
</li>
<li><p><code>heap</code> (priority queue) holds frontier nodes as <code>(cost, node)</code>, always popping the smallest cost first.</p>
</li>
<li><p>For each popped <code>node</code>, it relaxes its edges: for each <code>(neighbor, weight)</code>, compute <code>new_cost</code>. If <code>new_cost</code> beats <code>shortest_path[neighbor]</code>, update it and push the neighbor with that cost.</p>
</li>
</ul>
<p>And here’s the output:</p>
<pre><code class="lang-python">{<span class="hljs-string">'A'</span>: <span class="hljs-number">0</span>, <span class="hljs-string">'B'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'C'</span>: <span class="hljs-number">3</span>, <span class="hljs-string">'D'</span>: <span class="hljs-number">4</span>}
</code></pre>
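<p>The function above returns distances only. If you also need the route itself, one option is to record the node each best distance came from. The sketch below is a variant written for illustration (<code>dijkstra_path</code> and <code>previous</code> are assumed names, not part of the code above):</p>

```python
import heapq

def dijkstra_path(graph, start, goal):
    """Return (path, cost) for the cheapest route from start to goal."""
    dist = {node: float('inf') for node in graph}
    dist[start] = 0
    previous = {}  # previous[n] = node we came from on the best path to n
    heap = [(0, start)]

    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist[node]:
            continue  # stale entry: a cheaper route to node was already found
        for neighbor, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < dist[neighbor]:
                dist[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))

    if dist[goal] == float('inf'):
        return None, float('inf')  # goal is unreachable

    path = [goal]
    while path[-1] != start:  # follow the back-links to the start
        path.append(previous[path[-1]])
    return path[::-1], dist[goal]

graph = {
    'A': [('B', 1), ('C', 4)],
    'B': [('A', 1), ('C', 2), ('D', 5)],
    'C': [('A', 4), ('B', 2), ('D', 1)],
    'D': [('B', 5), ('C', 1)]
}

print(dijkstra_path(graph, 'A', 'D'))  # (['A', 'B', 'C', 'D'], 4)
```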
<p>Moving on, let’s look at an extension of this algorithm: A* Search.</p>
<h2 id="heading-a-search">A* Search</h2>
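<p>A* works like Dijkstra’s but adds a heuristic function that estimates the remaining cost from a node to the goal. The heuristic steers the search toward the goal, so A* usually explores fewer nodes than Dijkstra’s. As long as the heuristic never overestimates the true remaining cost, the path A* returns is still the shortest one.</p>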
<pre><code class="lang-python"><span class="hljs-keyword">import</span> heapq

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">heuristic</span>(<span class="hljs-params">node, goal</span>):</span>
    heuristics = {<span class="hljs-string">'A'</span>: <span class="hljs-number">4</span>, <span class="hljs-string">'B'</span>: <span class="hljs-number">2</span>, <span class="hljs-string">'C'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'D'</span>: <span class="hljs-number">0</span>}
    <span class="hljs-keyword">return</span> heuristics.get(node, <span class="hljs-number">0</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">a_star</span>(<span class="hljs-params">graph, start, goal</span>):</span>
    g_costs = {node: float(<span class="hljs-string">'inf'</span>) <span class="hljs-keyword">for</span> node <span class="hljs-keyword">in</span> graph}
    g_costs[start] = <span class="hljs-number">0</span>
    came_from = {}

    heap = [(heuristic(start, goal), start)]

    <span class="hljs-keyword">while</span> heap:
        f, node = heapq.heappop(heap)

        <span class="hljs-keyword">if</span> f &gt; g_costs[node] + heuristic(node, goal):
            <span class="hljs-keyword">continue</span>

        <span class="hljs-keyword">if</span> node == goal:
            path = [node]
            <span class="hljs-keyword">while</span> node <span class="hljs-keyword">in</span> came_from:
                node = came_from[node]
                path.append(node)
            <span class="hljs-keyword">return</span> path[::<span class="hljs-number">-1</span>], g_costs[path[<span class="hljs-number">0</span>]]

        <span class="hljs-keyword">for</span> neighbor, weight <span class="hljs-keyword">in</span> graph[node]:
            new_g = g_costs[node] + weight
            <span class="hljs-keyword">if</span> new_g &lt; g_costs[neighbor]:
                g_costs[neighbor] = new_g
                came_from[neighbor] = node
                heapq.heappush(heap, (new_g + heuristic(neighbor, goal), neighbor))

    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>, float(<span class="hljs-string">'inf'</span>)

graph = {
    <span class="hljs-string">'A'</span>: [(<span class="hljs-string">'B'</span>,<span class="hljs-number">1</span>), (<span class="hljs-string">'C'</span>,<span class="hljs-number">4</span>)],
    <span class="hljs-string">'B'</span>: [(<span class="hljs-string">'A'</span>,<span class="hljs-number">1</span>), (<span class="hljs-string">'C'</span>,<span class="hljs-number">2</span>), (<span class="hljs-string">'D'</span>,<span class="hljs-number">5</span>)],
    <span class="hljs-string">'C'</span>: [(<span class="hljs-string">'A'</span>,<span class="hljs-number">4</span>), (<span class="hljs-string">'B'</span>,<span class="hljs-number">2</span>), (<span class="hljs-string">'D'</span>,<span class="hljs-number">1</span>)],
    <span class="hljs-string">'D'</span>: []
}

print(a_star(graph, <span class="hljs-string">'A'</span>, <span class="hljs-string">'D'</span>))
</code></pre>
<p>This one’s a little more complex, so here’s what’s going on:</p>
<ul>
<li><p><code>graph</code>: adjacency list – each node maps to <code>[(neighbor, weight), ...]</code>.</p>
</li>
<li><p><code>heuristic(node, goal)</code>: returns an estimate <code>h(node)</code> of the remaining cost to the goal (a lower value means the node is closer). It accepts <code>goal</code>, but this demo ignores it and looks the value up in a fixed dict.</p>
</li>
<li><p><code>g_costs</code>: best known cost from <code>start</code> to each node (∞ initially, 0 for start).</p>
</li>
<li><p><code>heap</code>: min-heap of <code>(priority, node)</code> where <code>priority = g + h</code>.</p>
</li>
<li><p><code>came_from</code>: backpointers to reconstruct the path once we pop the goal.</p>
</li>
</ul>
<p>Then in the main loop:</p>
<ul>
<li><p>We pop the node with the smallest priority.</p>
</li>
<li><p>If the popped priority is larger than <code>g_costs[node] + heuristic(node, goal)</code>, the entry is stale (a cheaper route to that node was found after it was pushed), so we skip it.</p>
</li>
<li><p>If it’s the goal, we backtrack via <code>came_from</code> to build the path and return it with <code>g_costs[goal]</code>.</p>
</li>
<li><p>Otherwise, we relax the edges: for each <code>(neighbor, weight)</code>, compute <code>new_g = g_costs[node] + weight</code>. If <code>new_g</code> improves <code>g_costs[neighbor]</code>, update it, set <code>came_from[neighbor] = node</code>, and push <code>(new_g + heuristic(neighbor, goal), neighbor)</code>.</p>
</li>
</ul>
<p>Output:</p>
<pre><code class="lang-python">([<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'D'</span>], <span class="hljs-number">4</span>)
</code></pre>
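<p>The demo heuristic is a hard-coded table, but in practice the estimate is usually computed from the node and the goal. As a sketch, assume the nodes are <code>(row, col)</code> cells on a grid where each step up, down, left, or right costs 1. Manhattan distance then never overestimates the remaining cost. The grid format (a list of strings with <code>#</code> for walls) and the function names are assumptions for this example:</p>

```python
import heapq

def manhattan(cell, goal):
    """Estimate the remaining steps between two (row, col) cells."""
    return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

def a_star_grid(grid, start, goal):
    """Return the fewest steps from start to goal on the grid, or None."""
    rows, cols = len(grid), len(grid[0])
    g_costs = {start: 0}
    heap = [(manhattan(start, goal), start)]

    while heap:
        f, cell = heapq.heappop(heap)
        if f > g_costs[cell] + manhattan(cell, goal):
            continue  # stale heap entry
        if cell == goal:
            return g_costs[cell]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != '#':
                new_g = g_costs[cell] + 1  # every step costs 1
                if new_g < g_costs.get((nr, nc), float('inf')):
                    g_costs[(nr, nc)] = new_g
                    priority = new_g + manhattan((nr, nc), goal)
                    heapq.heappush(heap, (priority, (nr, nc)))
    return None  # goal is walled off

grid = [
    "..#.",
    "..#.",
    "....",
]

print(a_star_grid(grid, (0, 0), (0, 3)))  # 7
```

<p>The wall in the third column forces the route down to the bottom row and back up, which is why the answer is 7 steps even though the straight-line estimate from the start is 3.</p>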
<p>Next up, let’s move from shortest paths to spanning trees. This is where Kruskal’s algorithm comes in.</p>
<h2 id="heading-kruskals-algorithm">Kruskal’s Algorithm</h2>
<p>Kruskal’s algorithm builds a Minimum Spanning Tree (MST) by sorting all edges from smallest to largest and adding them one at a time, as long as they don’t create a cycle. This makes it a greedy algorithm as it always picks the cheapest option available at each step.</p>
<p>The implementation uses a Disjoint Set (Union-Find) data structure to efficiently check whether adding an edge would create a cycle. Each node starts in its own set, and as edges are added, sets are merged.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DisjointSet</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, nodes</span>):</span>
        self.parent = {node: node <span class="hljs-keyword">for</span> node <span class="hljs-keyword">in</span> nodes}
        self.rank = {node: <span class="hljs-number">0</span> <span class="hljs-keyword">for</span> node <span class="hljs-keyword">in</span> nodes}
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">find</span>(<span class="hljs-params">self, node</span>):</span>
        <span class="hljs-keyword">if</span> self.parent[node] != node:
            self.parent[node] = self.find(self.parent[node])
        <span class="hljs-keyword">return</span> self.parent[node]
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">union</span>(<span class="hljs-params">self, node1, node2</span>):</span>
        r1, r2 = self.find(node1), self.find(node2)
        <span class="hljs-keyword">if</span> r1 != r2:
            <span class="hljs-keyword">if</span> self.rank[r1] &gt; self.rank[r2]:
                self.parent[r2] = r1
            <span class="hljs-keyword">else</span>:
                self.parent[r1] = r2
                <span class="hljs-keyword">if</span> self.rank[r1] == self.rank[r2]:
                    self.rank[r2] += <span class="hljs-number">1</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">kruskal</span>(<span class="hljs-params">graph</span>):</span>
    edges = sorted(graph, key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">2</span>])
    mst, ds = [], DisjointSet({u <span class="hljs-keyword">for</span> e <span class="hljs-keyword">in</span> graph <span class="hljs-keyword">for</span> u <span class="hljs-keyword">in</span> e[:<span class="hljs-number">2</span>]})
    <span class="hljs-keyword">for</span> u,v,w <span class="hljs-keyword">in</span> edges:
        <span class="hljs-keyword">if</span> ds.find(u) != ds.find(v):
            ds.union(u,v)
            mst.append((u,v,w))
    <span class="hljs-keyword">return</span> mst

graph = [(<span class="hljs-string">'A'</span>,<span class="hljs-string">'B'</span>,<span class="hljs-number">1</span>), (<span class="hljs-string">'A'</span>,<span class="hljs-string">'C'</span>,<span class="hljs-number">4</span>), (<span class="hljs-string">'B'</span>,<span class="hljs-string">'C'</span>,<span class="hljs-number">2</span>), (<span class="hljs-string">'B'</span>,<span class="hljs-string">'D'</span>,<span class="hljs-number">5</span>), (<span class="hljs-string">'C'</span>,<span class="hljs-string">'D'</span>,<span class="hljs-number">1</span>)]
print(kruskal(graph))
</code></pre>
<p>Output:</p>
<pre><code class="lang-python">[(<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-number">1</span>), (<span class="hljs-string">'C'</span>, <span class="hljs-string">'D'</span>, <span class="hljs-number">1</span>), (<span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-number">2</span>)]
</code></pre>
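<p>To see the cycle check at work, the sketch below replays the same edges and records which ones are accepted and which are rejected. It uses a simplified union-find (a plain <code>parent</code> dict with path halving and no ranks), which is an assumption made to keep the example short:</p>

```python
def find(parent, node):
    """Follow parent links to the set's root, shortening the path on the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node

edges = [('A', 'B', 1), ('A', 'C', 4), ('B', 'C', 2), ('B', 'D', 5), ('C', 'D', 1)]
parent = {node: node for edge in edges for node in edge[:2]}

accepted, rejected = [], []
for u, v, w in sorted(edges, key=lambda edge: edge[2]):
    root_u, root_v = find(parent, u), find(parent, v)
    if root_u == root_v:
        rejected.append((u, v, w))  # already connected: this edge would close a cycle
    else:
        parent[root_u] = root_v     # merge the two sets
        accepted.append((u, v, w))

print(accepted)                        # [('A', 'B', 1), ('C', 'D', 1), ('B', 'C', 2)]
print(rejected)                        # [('A', 'C', 4), ('B', 'D', 5)]
print(sum(w for _, _, w in accepted))  # 4
```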
<p>Here, the MST consists of the three cheapest edges that connect all four nodes without forming a cycle, for a total weight of 4. Next, let’s look at another way to build an MST.</p>
<h2 id="heading-prims-algorithm">Prim’s Algorithm</h2>
<p>Prim’s algorithm also finds an MST, but it grows the tree step by step. It starts with one node and repeatedly <strong>adds the smallest edge</strong> that connects the current tree to a new node. Think of it as expanding a connected “island” until all nodes are included.</p>
<p>This implementation uses a <strong>priority queue (heapq)</strong> to always select the smallest available edge efficiently.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> heapq

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">prim</span>(<span class="hljs-params">graph, start</span>):</span>
    mst, visited = [], {start}
    edges = [(w, start, n) <span class="hljs-keyword">for</span> n,w <span class="hljs-keyword">in</span> graph[start]]
    heapq.heapify(edges)

    <span class="hljs-keyword">while</span> edges:
        w,u,v = heapq.heappop(edges)
        <span class="hljs-keyword">if</span> v <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> visited:
            visited.add(v)
            mst.append((u,v,w))
            <span class="hljs-keyword">for</span> n, weight <span class="hljs-keyword">in</span> graph[v]:  <span class="hljs-comment"># separate name so the popped w is not overwritten</span>
                <span class="hljs-keyword">if</span> n <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> visited:
                    heapq.heappush(edges, (weight, v, n))
    <span class="hljs-keyword">return</span> mst

graph = {
    <span class="hljs-string">'A'</span>:[(<span class="hljs-string">'B'</span>,<span class="hljs-number">1</span>),(<span class="hljs-string">'C'</span>,<span class="hljs-number">4</span>)],
    <span class="hljs-string">'B'</span>:[(<span class="hljs-string">'A'</span>,<span class="hljs-number">1</span>),(<span class="hljs-string">'C'</span>,<span class="hljs-number">2</span>),(<span class="hljs-string">'D'</span>,<span class="hljs-number">5</span>)],
    <span class="hljs-string">'C'</span>:[(<span class="hljs-string">'A'</span>,<span class="hljs-number">4</span>),(<span class="hljs-string">'B'</span>,<span class="hljs-number">2</span>),(<span class="hljs-string">'D'</span>,<span class="hljs-number">1</span>)],
    <span class="hljs-string">'D'</span>:[(<span class="hljs-string">'B'</span>,<span class="hljs-number">5</span>),(<span class="hljs-string">'C'</span>,<span class="hljs-number">1</span>)]
}
print(prim(graph,<span class="hljs-string">'A'</span>))
</code></pre>
<p>Output:</p>
<pre><code class="lang-python">[(<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-number">1</span>), (<span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-number">2</span>), (<span class="hljs-string">'C'</span>, <span class="hljs-string">'D'</span>, <span class="hljs-number">1</span>)]
</code></pre>
<p>Notice how the algorithm gradually expands from node <code>A</code>, always picking the lowest-weight edge that connects a new node.</p>
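<p>Prim’s and Kruskal’s can list the edges in a different order, and may even pick different edges when weights tie, but any two minimum spanning trees of the same graph have the same total weight. As a quick sanity check, the sketch below verifies that an edge list really is a spanning tree and compares the totals. The helper <code>is_spanning_tree</code> is a hypothetical name, and the two result lists are copied from the outputs shown in this article:</p>

```python
def is_spanning_tree(nodes, tree_edges):
    """Check that tree_edges connect every node using exactly len(nodes) - 1 edges."""
    if len(tree_edges) != len(nodes) - 1:
        return False
    adjacency = {node: [] for node in nodes}
    for u, v, _ in tree_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:  # iterative DFS over the tree edges only
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(nodes)

nodes = ['A', 'B', 'C', 'D']
prim_result = [('A', 'B', 1), ('B', 'C', 2), ('C', 'D', 1)]
kruskal_result = [('A', 'B', 1), ('C', 'D', 1), ('B', 'C', 2)]

print(is_spanning_tree(nodes, prim_result))     # True
print(is_spanning_tree(nodes, kruskal_result))  # True
print(sum(w for _, _, w in prim_result) == sum(w for _, _, w in kruskal_result))  # True
```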
<p>Let’s now look at an algorithm that can handle graphs with negative edges: Bellman-Ford.</p>
<h2 id="heading-bellman-ford-algorithm">Bellman-Ford Algorithm</h2>
<p>Bellman-Ford is a shortest path algorithm that can handle negative edge weights, unlike Dijkstra’s. It works by <strong>relaxing all edges repeatedly</strong>: if the current path to a node can be improved by going through another node, it updates the distance. After <code>V-1</code> iterations (where <code>V</code> is the number of vertices), all shortest paths are guaranteed to be found, provided no negative weight cycle is reachable from the start node.</p>
<p>This makes it slower than Dijkstra’s (<code>O(V·E)</code> time, compared with roughly <code>O(E log V)</code> for a heap-based Dijkstra’s) but more versatile. The full algorithm can also detect negative weight cycles by making one more pass after the main loop and checking for further improvements.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">bellman_ford</span>(<span class="hljs-params">graph, start</span>):</span>
    dist = {node: float(<span class="hljs-string">'inf'</span>) <span class="hljs-keyword">for</span> node <span class="hljs-keyword">in</span> graph}
    dist[start] = <span class="hljs-number">0</span>
    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(len(graph)<span class="hljs-number">-1</span>):
        <span class="hljs-keyword">for</span> u <span class="hljs-keyword">in</span> graph:
            <span class="hljs-keyword">for</span> v,w <span class="hljs-keyword">in</span> graph[u]:
                <span class="hljs-keyword">if</span> dist[u] + w &lt; dist[v]:
                    dist[v] = dist[u] + w
    <span class="hljs-keyword">return</span> dist

graph = {
    <span class="hljs-string">'A'</span>:[(<span class="hljs-string">'B'</span>,<span class="hljs-number">4</span>),(<span class="hljs-string">'C'</span>,<span class="hljs-number">2</span>)],
    <span class="hljs-string">'B'</span>:[(<span class="hljs-string">'C'</span>,<span class="hljs-number">-1</span>),(<span class="hljs-string">'D'</span>,<span class="hljs-number">2</span>)],
    <span class="hljs-string">'C'</span>:[(<span class="hljs-string">'D'</span>,<span class="hljs-number">3</span>)],
    <span class="hljs-string">'D'</span>:[]
}
print(bellman_ford(graph,<span class="hljs-string">'A'</span>))
</code></pre>
<p>Output:</p>
<pre><code class="lang-python">{<span class="hljs-string">'A'</span>: <span class="hljs-number">0</span>, <span class="hljs-string">'B'</span>: <span class="hljs-number">4</span>, <span class="hljs-string">'C'</span>: <span class="hljs-number">2</span>, <span class="hljs-string">'D'</span>: <span class="hljs-number">5</span>}
</code></pre>
<p>Here, the shortest path to each node is found, even though there’s a negative edge (<code>B → C</code> with weight -1). The function above skips the negative cycle check for brevity. A full implementation detects a negative cycle by running one more pass after the <code>V-1</code> iterations and noticing that some distance can still be improved.</p>
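<p>As a minimal sketch of that check, the variant below adds one extra pass over the edges and raises an error if any distance still improves. The name <code>bellman_ford_checked</code>, the choice of <code>ValueError</code>, and the small <code>cyclic_graph</code> are illustrative assumptions, not part of the original example.</p>

```python
def bellman_ford_checked(graph, start):
    """Bellman-Ford that raises ValueError on a reachable negative cycle."""
    dist = {node: float('inf') for node in graph}
    dist[start] = 0
    # Relax every edge V-1 times, exactly as in the basic version.
    for _ in range(len(graph) - 1):
        for u in graph:
            for v, w in graph[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
    # One extra pass: any further improvement means a negative cycle.
    for u in graph:
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                raise ValueError("graph contains a negative weight cycle")
    return dist


# Hypothetical graph: A -> B -> C -> A has total weight 1 - 2 - 1 = -2.
cyclic_graph = {
    'A': [('B', 1)],
    'B': [('C', -2)],
    'C': [('A', -1)],
}

try:
    bellman_ford_checked(cyclic_graph, 'A')
except ValueError as err:
    print(err)
```

<p>On a graph without a negative cycle, this version returns the same distances as the basic function.</p>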
<p>With the main algorithms explained, let’s move on to some practical tips for making these implementations more efficient in Python.</p>
<h2 id="heading-optimizing-graph-algorithms-in-python">Optimizing Graph Algorithms in Python</h2>
<p>When graphs get bigger, little tweaks in how you write your code can make a big difference. Here are a few simple but powerful tricks to keep things running smoothly.</p>
<p><strong>1. Use</strong> <code>deque</code> for BFS<br>If you use a regular Python list as a queue, popping items from the front takes longer the bigger the list gets. With <code>collections.deque</code>, you get instant (<code>O(1)</code>) pops from both ends. It’s basically built for this kind of job.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> deque

queue = deque([start])  <span class="hljs-comment"># fast pops and appends</span>
</code></pre>
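<p>To see the queue in context, here is a small, self-contained BFS sketch built around <code>deque</code>. The function name <code>bfs_order</code> and the <code>example_graph</code> are assumptions made for illustration.</p>

```python
from collections import deque


def bfs_order(graph, start):
    """Return nodes in the order BFS visits them."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()  # O(1), unlike list.pop(0)
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical unweighted graph stored as an adjacency list.
example_graph = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['D'],
    'D': [],
}
print(bfs_order(example_graph, 'A'))  # ['A', 'B', 'C', 'D']
```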
<p><strong>2. Go Iterative with DFS</strong><br>Recursive DFS looks neat, but Python doesn’t like going too deep – you’ll hit a recursion limit if your graph is very large. The fix? Write DFS in an iterative style with a stack. Same idea, no recursion errors.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">dfs_iterative</span>(<span class="hljs-params">graph, start</span>):</span>
    visited, stack = set(), [start]
    <span class="hljs-keyword">while</span> stack:
        node = stack.pop()
        <span class="hljs-keyword">if</span> node <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> visited:
            visited.add(node)
            stack.extend(graph[node])
    <span class="hljs-keyword">return</span> visited
</code></pre>
<p><strong>3. Let NetworkX Do the Heavy Lifting</strong><br>For practice and learning, writing your own graph code is great. But if you’re working on a real-world problem – say analyzing a social network or planning routes – the NetworkX library saves tons of time. It comes with optimized versions of almost every common graph algorithm plus nice visualization tools.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> networkx <span class="hljs-keyword">as</span> nx

G = nx.Graph()
G.add_edges_from([(<span class="hljs-string">'A'</span>,<span class="hljs-string">'B'</span>), (<span class="hljs-string">'A'</span>,<span class="hljs-string">'C'</span>), (<span class="hljs-string">'B'</span>,<span class="hljs-string">'D'</span>), (<span class="hljs-string">'C'</span>,<span class="hljs-string">'D'</span>)])

print(nx.shortest_path(G, source=<span class="hljs-string">'A'</span>, target=<span class="hljs-string">'D'</span>))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-python">[<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'D'</span>]
</code></pre>
<p>Instead of worrying about queues and stacks, you can let NetworkX handle the details and focus on what the results mean.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li><p>An adjacency matrix is fast for lookups but is memory-heavy.</p>
</li>
<li><p>An adjacency list is space-efficient for sparse graphs.</p>
</li>
<li><p>NetworkX makes graph analysis much easier for real-world projects.</p>
</li>
<li><p>BFS explores layer by layer, DFS explores deeply before backtracking.</p>
</li>
<li><p>Dijkstra’s and A* handle shortest paths.</p>
</li>
<li><p>Kruskal’s and Prim’s build spanning trees.</p>
</li>
<li><p>Bellman-Ford works with negative weights.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Graphs are everywhere, from maps to social networks, and the algorithms you have seen here are the building blocks for working with them. Whether it is finding paths, building spanning trees, or handling tricky weights, these tools open up a wide range of problems you can solve.</p>
<p>Keep experimenting and try out libraries like NetworkX when you are ready to take on bigger projects.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ A Beginner Developer's Guide to Scrum ]]>
                </title>
                <description>
                    <![CDATA[ Let me guess: you’re learning to code…alone. You’ve been grinding through tutorials. You've built a portfolio site, maybe deployed a few projects on GitHub. And now you're trying to land a job or join a team. Then the interviews start. Suddenly, peop... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/a-beginner-developers-guide-to-scrum/</link>
                <guid isPermaLink="false">68813c7579e092b166d373b6</guid>
                
                    <category>
                        <![CDATA[ Scrum ]]>
                    </category>
                
                    <category>
                        <![CDATA[ agile development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ project management ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Developer ]]>
                    </category>
                
                    <category>
                        <![CDATA[ interview ]]>
                    </category>
                
                    <category>
                        <![CDATA[ guide ]]>
                    </category>
                
                    <category>
                        <![CDATA[ education ]]>
                    </category>
                
                    <category>
                        <![CDATA[ learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Productivity ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Product Management ]]>
                    </category>
                
                    <category>
                        <![CDATA[ software development ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Career ]]>
                    </category>
                
                    <category>
                        <![CDATA[ workflow ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Aditya Vikram Kashyap ]]>
                </dc:creator>
                <pubDate>Wed, 23 Jul 2025 19:48:05 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753300058064/7046dd6c-1d9e-4f06-9ca1-65b3bb7eec83.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Let me guess: you’re learning to code…alone.</p>
<p>You’ve been grinding through tutorials. You've built a portfolio site, maybe deployed a few projects on GitHub. And now you're trying to land a job or join a team.</p>
<p>Then the interviews start.</p>
<p>Suddenly, people ask:</p>
<ul>
<li><p>"Are you familiar with Agile?"</p>
</li>
<li><p>"Have you worked in a Scrum environment?"</p>
</li>
<li><p>"What’s your experience with sprints?"</p>
</li>
</ul>
<p>Cue the imposter syndrome. Because no one teaches this stuff in JavaScript 101.</p>
<p>This guide is for you.</p>
<p>I’ll help the Scrum process – a very common way developers work together – <em>make actual sense</em>. I’ll walk you through the basics, but also tell you what developers actually <em>do</em>, how standups feel when you're new, and what’s expected of you when you’re no longer coding in a vacuum.</p>
<p>Let’s break it down.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-even-is-scrum">What Even Is Scrum?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-three-roles-in-scrum-and-who-does-what">The Three Roles in Scrum (and Who Does What)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-scrum-rhythm-what-a-sprint-actually-looks-like">The Scrum Rhythm: What a Sprint Actually Looks Like</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-who-attends-the-ceremonies">Who Attends the Ceremonies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-standups-where-you-talk-like-a-human-not-a-robot">Standups: Where You Talk Like a Human, Not a Robot</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-sprint-planning">Sprint Planning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-whats-a-user-story-and-why-does-it-sound-like-a-childrens-book">What’s a User Story and Why Does It Sound Like a Children’s Book?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-counts-as-done-definition-of-done-and-why-its-important">What Counts as “Done”? Definition of Done and Why It’s Important</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-demos-retros-and-saying-the-hard-things">Demos, Retros, and Saying the Hard Things</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tools-you-might-encounter">Tools You Might Encounter</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-if-youre-preparing-for-a-job-heres-what-you-can-do">If You’re Preparing for a Job, Here’s What You Can Do</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h2 id="heading-what-even-is-scrum"><strong>What Even Is Scrum?</strong></h2>
<p>Scrum is not a tool. It’s not software. It’s not some elite thing only PMs care about.</p>
<p>It’s a lightweight framework that helps software teams build things incrementally, together, in short focused cycles called sprints.</p>
<p>Scrum is used by everyone from FAANG teams to indie dev shops because it helps:</p>
<ul>
<li><p>Keep teams aligned</p>
</li>
<li><p>Deliver working software fast</p>
</li>
<li><p>Course-correct often</p>
</li>
<li><p>Spot problems early (before they go nuclear)</p>
</li>
</ul>
<p>It’s the opposite of the old-school “build for a year and pray it works” model.</p>
<h2 id="heading-the-three-roles-in-scrum-and-who-does-what"><strong>The Three Roles in Scrum (and Who Does What)</strong></h2>
<p>Scrum officially defines three roles. Here's what that means in practice:</p>
<h3 id="heading-1-product-owner-po"><strong>1. Product Owner (PO)</strong></h3>
<p>Think: Vision-holder. They decide <em>what</em> the team builds and <em>why</em>. A product owner:</p>
<ul>
<li><p>Writes user stories (think of these as feature requests written from a user’s point of view)</p>
</li>
<li><p>Prioritizes the work</p>
</li>
<li><p>Clarifies what success looks like</p>
</li>
<li><p>Says “yes” or “not yet” to features</p>
</li>
</ul>
<h3 id="heading-2-scrum-master-sm"><strong>2. Scrum Master (SM)</strong></h3>
<p>Think: Air-traffic controller meets therapist. They make sure the process works. They are master facilitators, for example between the Developers and the Product Owner. A Scrum Master:</p>
<ul>
<li><p>Facilitates meetings</p>
</li>
<li><p>Removes blockers (“Your AWS access is stuck? I’ll escalate it.”)</p>
</li>
<li><p>Coaches the team on Scrum practices</p>
</li>
<li><p>Doesn’t manage people – manages <em>flow</em></p>
</li>
</ul>
<h3 id="heading-3-developers-you"><strong>3. Developers (YOU!)</strong></h3>
<p>Think: Builders. You write code, test it, ship it, fix it, and improve it. You also:</p>
<ul>
<li><p>Break down stories into tasks</p>
</li>
<li><p>Pick up work from the team board (like Jira or Trello)</p>
</li>
<li><p>Communicate progress</p>
</li>
<li><p>Demo what you’ve built at the end of the sprint</p>
</li>
</ul>
<p>You might also work with designers, testers, or DevOps folks – but within Scrum, you’re all “developers” building a product together.</p>
<h2 id="heading-the-scrum-rhythm-what-a-sprint-actually-looks-like"><strong>The Scrum Rhythm: What a Sprint Actually Looks Like</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752809790048/253fd92b-1ebe-4f3e-bfbc-48719676dc82.png" alt="253fd92b-1ebe-4f3e-bfbc-48719676dc82" class="image--center mx-auto" width="900" height="530" loading="lazy"></p>
<p>Image Source: <a target="_blank" href="https://www.invensislearning.com/blog/what-are-scrum-ceremonies/">https://www.invensislearning.com/blog/what-are-scrum-ceremonies/</a></p>
<h3 id="heading-understanding-the-scrum-cycle"><strong>Understanding the Scrum Cycle</strong></h3>
<p>So, what does it <em>actually</em> look like when a team uses Scrum to build software?</p>
<p>Let’s walk through a full sprint – not just the buzzwords, but what really happens when a group of humans tries to plan, build, review, and improve together. Think of this as your backstage pass to the rhythm of modern teamwork.</p>
<h3 id="heading-step-1-build-and-refine-the-product-backlog">📦 Step 1: Build and Refine the Product Backlog</h3>
<p>Before any coding starts, the team needs to agree on <em>what</em> they might build – not just this week, but in the near future too.</p>
<p>That’s where the <strong>Product Backlog</strong> comes in. This is a big, running list of everything the product might need – features, bug fixes, improvements, ideas, and maybe a few wild dreams. It’s like the wishlist for the product, but more organized (ideally).</p>
<p>The Product Owner is responsible for maintaining and prioritizing this list. They decide what’s most important to work on based on customer needs, business goals, and feedback.</p>
<p>But the PO doesn’t do this in isolation. Enter the <strong>Backlog Refinement meeting</strong>.</p>
<p>In these sessions, the Scrum Team – that’s the PO, the Scrum Master (SM), and the Developers – comes together to:</p>
<ul>
<li><p><strong>Review</strong> the most important upcoming items</p>
</li>
<li><p><strong>Clarify</strong> any vague or confusing parts of each task</p>
</li>
<li><p><strong>Break big items</strong> down into smaller, buildable pieces called <strong>user stories</strong></p>
</li>
<li><p><strong>Estimate effort</strong> (how much time or complexity is involved for each story)</p>
</li>
</ul>
<p>This meeting makes sure the team isn’t caught off guard in the sprint – that they understand the work ahead and can actually start sprinting when the time comes.</p>
<h3 id="heading-step-2-sprint-planning-what-are-we-building-this-time">🧭 Step 2: Sprint Planning – What Are We Building This Time?</h3>
<p>Now that we’ve got a solid backlog, it’s time to pick what to build <em>right now</em>.</p>
<p>At the start of each sprint (which typically lasts 1 to 4 weeks), the team holds a <strong>Sprint Planning meeting</strong>. This meeting sets the stage for the entire sprint – it’s like the huddle before the big game.</p>
<p>In Sprint Planning, the team:</p>
<ul>
<li><p>Reviews the top items from the backlog</p>
</li>
<li><p>Discusses what can realistically be completed based on their availability and capacity</p>
</li>
<li><p>Chooses a handful of these stories to commit to</p>
</li>
<li><p><strong>Defines a Sprint Goal</strong> – a simple statement that captures the purpose of this sprint</p>
</li>
</ul>
<p>For example, the Sprint Goal might be:<br>🎯 <em>“Allow users to reset their passwords.”</em></p>
<p>Every user story chosen should contribute to that goal. The collection of these stories becomes the <strong>Sprint Backlog</strong> – basically, the to-do list for the sprint.</p>
<p>So when we say:</p>
<p>“The team selects an ordered list of user stories to comprise the Sprint Backlog for the next sprint, which will be achievable to satisfy the Sprint Goal...”</p>
<p>We’re really just saying:<br>👉 <em>“Pick a realistic number of important tasks that, if completed, will help us hit our target for the sprint.”</em></p>
<p>Not too vague. Not too ambitious. Just achievable and focused.</p>
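<p>If it helps to see the capacity idea as arithmetic, here is a rough sketch that walks a prioritized backlog and keeps the stories that still fit. The function <code>plan_sprint</code>, the story titles, the point values, and the capacity of 15 are all made up for illustration. Real teams decide this through discussion, not a script.</p>

```python
def plan_sprint(backlog, capacity):
    """Pick stories in priority order that fit the remaining capacity."""
    sprint_backlog = []
    remaining = capacity
    for title, points in backlog:  # backlog is assumed to be priority-ordered
        if points <= remaining:
            sprint_backlog.append(title)
            remaining -= points
    return sprint_backlog, remaining


# Hypothetical backlog: (story title, estimated story points).
backlog = [
    ("Reset password form", 5),
    ("Send reset email", 8),
    ("Audit log page", 13),
    ("Fix login typo", 1),
]

chosen, left_over = plan_sprint(backlog, capacity=15)
print(chosen)     # ['Reset password form', 'Send reset email', 'Fix login typo']
print(left_over)  # 1
```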
<h3 id="heading-step-3-daily-standups-stay-in-sync">☀️ Step 3: Daily Standups – Stay in Sync</h3>
<p>Now the sprint is underway! But how does everyone stay aligned and avoid working in silos?</p>
<p>That’s where the <strong>Daily Standup</strong> comes in. Every day – usually in the morning – the team has a quick check-in (about 15 minutes) where each person answers three questions:</p>
<ol>
<li><p><strong>What did I do yesterday?</strong></p>
</li>
<li><p><strong>What am I working on today?</strong></p>
</li>
<li><p><strong>Is anything blocking me?</strong> (that is, am I stuck?)</p>
</li>
</ol>
<p>Example:</p>
<p>“Yesterday I set up the login API integration. Today I’ll work on the UI validation. I’m blocked on getting access to the staging database – may need help.”</p>
<p>These standups keep the team in sync and surface blockers early so they can be addressed quickly. They’re not about micromanaging or showing off. They’re about visibility and support.</p>
<h3 id="heading-whats-a-sprint-burndown-chart">📉 What’s a Sprint Burndown Chart?</h3>
<p>You might hear your team mention a “burndown chart.” No, this isn’t about things going down in flames (hopefully).</p>
<p>A <strong>Sprint Burndown Chart</strong> is a graph that shows how much work is left in the sprint – day by day.</p>
<ul>
<li><p>The <strong>y-axis</strong> is the amount of work remaining (often measured in story points or tasks)</p>
</li>
<li><p>The <strong>x-axis</strong> is time – the days of the sprint, from the first day to the last</p>
</li>
</ul>
<p>The line should ideally trend downward as work gets completed – hence “burning down.” If it flattens out or slopes up, that’s a red flag that the team might be stuck, behind schedule, or not updating the board.</p>
<p>Think of it as a visual heartbeat of the sprint. You can learn more via a practical example <a target="_blank" href="https://youtu.be/2K84aZn9AY8?si=tS8oMGxVD0CYtnlw">in this video</a>.</p>
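<p>The numbers behind a burndown chart are simple to compute. The sketch below assumes a hypothetical 5-day sprint with 20 story points and made-up daily completion figures. It returns the actual remaining work alongside an “ideal” straight line that reaches zero on the last day.</p>

```python
def burndown(total_points, completed_per_day):
    """Return (actual, ideal) remaining-work lists, one value per day boundary."""
    days = len(completed_per_day)
    actual = [total_points]
    for done in completed_per_day:
        actual.append(actual[-1] - done)
    # Ideal line: work drops by the same amount every day and ends at zero.
    ideal = [total_points - total_points * day / days for day in range(days + 1)]
    return actual, ideal


# Hypothetical sprint: 20 points committed, points finished on each of 5 days.
actual, ideal = burndown(20, [3, 0, 5, 4, 6])
for day, (a, i) in enumerate(zip(actual, ideal)):
    print(f"Day {day}: {a} points left (ideal {i:.0f})")
```

<p>In this made-up data, nothing was completed on the second day, so the line goes flat there. That is the kind of plateau a team would want to talk about at standup.</p>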
<h3 id="heading-step-4-sprint-review-show-what-youve-built">🖥️ Step 4: Sprint Review – Show What You’ve Built</h3>
<p>At the end of the sprint, the team holds a <strong>Sprint Review</strong> (also called a “demo”). This is where you show what was actually built during the sprint.</p>
<ul>
<li><p>The <strong>Developers</strong> demo working features – live, not just screenshots</p>
</li>
<li><p>The <strong>Product Owner</strong> reviews whether the Sprint Goal was achieved</p>
</li>
<li><p>Stakeholders may ask questions, give feedback, or suggest tweaks</p>
</li>
</ul>
<p>This meeting isn’t just for show – it’s a feedback loop. It helps the team validate that what they built is useful, usable, and meets expectations. If changes are needed, those get added to the backlog for future sprints.</p>
<h3 id="heading-step-5-sprint-retrospective-look-back-to-move-forward">🔍 Step 5: Sprint Retrospective – Look Back to Move Forward</h3>
<p>Once the review is done, the team shifts focus from <em>what</em> they built to <em>how</em> they worked together.</p>
<p>Enter the <strong>Sprint Retrospective</strong> – a meeting to reflect on the process, not the product.</p>
<p>The team discusses:</p>
<ul>
<li><p>✅ What went well</p>
</li>
<li><p>❌ What didn’t go so well</p>
</li>
<li><p>🔁 What could be improved next time</p>
</li>
</ul>
<p>This isn’t about pointing fingers. It’s about learning, adapting, and continuously improving how the team collaborates.</p>
<p>The <strong>Scrum Master</strong> often facilitates this meeting and helps turn feedback into action items for the next sprint. For example:</p>
<p>“We underestimated testing time. Next sprint, let’s budget for QA earlier.”</p>
<p>The best teams take retros seriously. Why? Because even if your code is perfect, your <em>process</em> needs tuning too – and small process changes often lead to big gains.</p>
<h3 id="heading-scrum-is-a-loop">♻️ Scrum Is a Loop</h3>
<p>Here’s the rhythm:</p>
<ol>
<li><p>Plan the sprint</p>
</li>
<li><p>Check in daily</p>
</li>
<li><p>Build and demo the product</p>
</li>
<li><p>Reflect and improve</p>
</li>
</ol>
<p>Then do it all over again – with slightly better coordination and slightly more trust each time.</p>
<p>It’s not about being fast. It’s about being intentional, consistent, and collaborative.</p>
<h3 id="heading-example-sprint">Example Sprint</h3>
<p>Let’s say, for example, that your team does 4-week sprints. (Keep in mind that Sprints can differ by team, nature of product, release cycles, and so on.)</p>
<p>Here’s the rough beat:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Week</strong></td><td><strong>What Happens (Sprint Ceremonies)</strong></td><td><strong>Your Role</strong></td></tr>
</thead>
<tbody>
<tr>
<td>1</td><td><strong>Sprint Planning</strong></td><td>Help estimate effort, pick what to build</td></tr>
<tr>
<td>1-4</td><td><strong>Daily Standups</strong> (15 mins)</td><td>Share what you’re doing &amp; any blockers</td></tr>
<tr>
<td>1-3</td><td><strong>Development Time</strong></td><td>Code, test, commit, fix, push, repeat</td></tr>
<tr>
<td>3.5-4</td><td><strong>Sprint Review</strong></td><td>Demo what you built</td></tr>
<tr>
<td>4</td><td><strong>Sprint Retrospective</strong></td><td>Reflect on how the sprint went as a team</td></tr>
</tbody>
</table>
</div><p>Scrum works in <strong>loops</strong>. Every 1-4 weeks (depending on your cadence and sprint cycle), your team should have working, demo-able software to show for it – even if it’s small.</p>
<p>And no, it’s not about “speed.” It’s about consistency, communication, and collaboration.</p>
<h2 id="heading-who-attends-the-ceremonies"><strong>Who Attends the Ceremonies</strong></h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Ceremony</strong></td><td><strong>Who Attends</strong></td><td><strong>Why They’re There</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Sprint Planning</strong></td><td>Product Owner (PO), Scrum Master (SM), Development Team</td><td>To define what will be delivered and how the work will be accomplished</td></tr>
<tr>
<td><strong>Daily Standup</strong></td><td>Development Team, Scrum Master (optional), PO (optional)</td><td>To sync on progress, share blockers, and coordinate efforts</td></tr>
<tr>
<td><strong>Sprint Review</strong></td><td>Development Team, Scrum Master, Product Owner, Stakeholders</td><td>To demo the work, get feedback, and assess if goals were met</td></tr>
<tr>
<td><strong>Sprint Retrospective</strong></td><td>Development Team, Scrum Master, Product Owner (optional)</td><td>To reflect on the process, identify what worked/what didn’t, and improve the next sprint</td></tr>
<tr>
<td><strong>Backlog Refinement</strong></td><td>Product Owner, Development Team, Scrum Master (optional)</td><td>To clarify upcoming stories, estimate work, and prepare for future sprint planning</td></tr>
</tbody>
</table>
</div><p>Now let’s dive deeper and look at how each of these ceremonies works in practice:</p>
<h2 id="heading-standups-where-you-talk-like-a-human-not-a-robot"><strong>Standups: Where You Talk Like a Human, Not a Robot</strong></h2>
<p>So how does the team actually stay connected day to day? That’s where standups come in.</p>
<p>Every morning, your team meets briefly – usually on Zoom or in a circle – and you answer 3 questions:</p>
<ol>
<li><p>What did I work on yesterday?</p>
</li>
<li><p>What will I work on today?</p>
</li>
<li><p>What’s blocking me? Any impediments?</p>
</li>
</ol>
<p>Example:</p>
<p>"Yesterday I cleaned up the signup validation logic. Today I’m working on the email verification flow. I’m stuck on SendGrid config – might need help setting up credentials."</p>
<p>It’s not about impressing anyone. It’s about keeping everyone in sync. Some days you’ll say, “I spent the whole day debugging a CSS bug that turned out to be a semicolon.” That’s okay.</p>
<p>How does it work?</p>
<p>The Scrum Master gathers everyone in a huddle room, the PO and Dev Team included, and opens the Standup. They are the facilitator of the ceremony. Everyone gets a chance to answer the 3 questions above (usually about 2-5 minutes each). It’s not a full report – it’s quick. When one person is done, they pass it on to someone else.</p>
<p>This ensures there is team cohesion and transparency.</p>
<p><a target="_blank" href="https://youtu.be/q_R9wQY4G5I?si=W1AcvcLXB-mnUM1f">Here is a video example of a standup</a>.</p>
<h2 id="heading-sprint-planning"><strong>Sprint Planning</strong></h2>
<p>The goal of the planning meeting is to answer the questions “What are we going to work on, and how are we going to do it?” It is critical for the team to have a shared goal and a shared commitment to this goal before beginning this ceremony.</p>
<p>Participants should:</p>
<ul>
<li><p>Measure growth</p>
</li>
<li><p>Sync with the Scrum Master</p>
</li>
<li><p>Sync with the Product Owner</p>
</li>
</ul>
<p>Sprint planning happens just before the sprint starts, and usually lasts for an hour or two. In this meeting, the team goes over a collection of <strong>user stories</strong> and discusses, plans, estimates, and prioritizes them. This is where they decide what is going to be in scope for their upcoming sprint cycle.</p>
<p>The Product Owner will have a prioritized view of things in the backlog. They work through each item or customer experience with the team. Together, the group estimates the work and decides what it can commit to.</p>
<h2 id="heading-whats-a-user-story-and-why-does-it-sound-like-a-childrens-book"><strong>What’s a User Story and Why Does It Sound Like a Children’s Book?</strong></h2>
<p>So you might be wondering: how do you know what to work on? What to build? So much work, so little time? That’s where <strong>user stories</strong> come in.</p>
<p>In Scrum, teams don’t just write vague tasks like “code the login.” Instead, they write user stories – short, human-centered feature descriptions that describe what the user needs, why they need it, and what success looks like.</p>
<p>Here’s an example:</p>
<p><em>As a user, I want to be able to reset my password, so I can access my account if I forget it.</em></p>
<p>User stories are the scaffolding of teamwork. They’re written with empathy, not just efficiency. And each one comes with <strong>acceptance criteria</strong> – a checklist that clarifies what “done” actually means:</p>
<ul>
<li><p>A “Forgot Password” link is visible</p>
</li>
<li><p>Clicking it shows a form</p>
</li>
<li><p>An email gets sent with a reset link</p>
</li>
</ul>
<p>Once a story is agreed upon, developers break it down into tasks, like “build form,” “hook into backend,” or “handle email validation.” It’s collaborative, not prescriptive. And user stories have priority so you know what’s the most important and what’s the least.</p>
<p>A helpful rule of thumb many teams use is the <a target="_blank" href="https://medium.com/@nic/writing-user-stories-with-gherkin-dda63461b1d2"><strong>Gherkin</strong>-style "Given–When–Then"</a> format:</p>
<ul>
<li><p><strong>Given</strong> some initial context</p>
</li>
<li><p><strong>When</strong> an event occurs</p>
</li>
<li><p><strong>Then</strong> a specific outcome should happen</p>
</li>
</ul>
<p>This ensures that everyone – devs, testers, and product owners – shares the same understanding of behavior and expectations.</p>
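<p>One reason developers like this format is that it maps almost directly onto an automated test. Here is a hedged sketch for the password reset story. <code>FakeMailer</code>, <code>request_password_reset</code>, and the <code>example.com</code> addresses are hypothetical stand-ins invented for this example, not a real library or app.</p>

```python
class FakeMailer:
    """Hypothetical stand-in that records emails instead of sending them."""

    def __init__(self):
        self.sent = []

    def send(self, to, body):
        self.sent.append((to, body))


def request_password_reset(email, known_users, mailer):
    """Hypothetical app function: email a reset link to a known user."""
    if email in known_users:
        mailer.send(email, f"Reset link: https://example.com/reset?user={email}")
        return True
    return False


def test_reset_email_is_sent():
    # Given a registered user
    known_users = {"ada@example.com"}
    mailer = FakeMailer()
    # When they request a password reset
    accepted = request_password_reset("ada@example.com", known_users, mailer)
    # Then an email with a reset link is sent to them
    assert accepted
    assert len(mailer.sent) == 1
    assert mailer.sent[0][0] == "ada@example.com"
    assert "Reset link" in mailer.sent[0][1]


test_reset_email_is_sent()
```

<p>The three comments mirror the three parts of the story, so a product owner can read the test and recognize the behavior they asked for.</p>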
<p><a target="_blank" href="https://www.youtube.com/watch?v=7hoGqhb6qAs">Here is a great video example</a> that outlines how to draft effective and powerful user stories.</p>
<h2 id="heading-what-counts-as-done-definition-of-done-and-why-its-important"><strong>What Counts as “Done”? Definition of Done and Why It’s Important</strong></h2>
<p>Now you might be wondering – how do I know when a task is done and can be closed out?</p>
<p>The <strong>Definition of Done</strong> is a type of documentation in the form of a <strong>team agreement</strong>. The Definition of Done identifies the conditions that need to be achieved in order for the product to be considered done (as in <strong>potentially shippable</strong>).</p>
<p>This is how we know that we "did the thing right", meaning we built the correct level of quality into the product. The Definition of Done is not the same as the acceptance criteria, which are written by the product owner to help us know we did the "right thing".</p>
<p>Every team has a Definition of Done – it’s not just “I pushed code.” It could mean:</p>
<ul>
<li><p>Code is written</p>
</li>
<li><p>Reviewed by a peer</p>
</li>
<li><p>Merged into main</p>
</li>
<li><p>Tested on staging</p>
</li>
<li><p>Possibly deployed</p>
</li>
</ul>
<p>This clarity keeps teams honest and accountable. The DoD sets a quality bar. It prevents ambiguity, rework, and “it works on my machine” moments. When every card on the board passes the same finish line, teams move faster – and trust each other more.</p>
<p>Everyone on the team should know what “done” means. Either it’s Done as per DoD standards or it’s not.</p>
<p><a target="_blank" href="https://youtu.be/pYOJyQoBT3U?si=nVygkQQx79NaAOo4">Here is a beautiful video</a> highlighting the importance of DoD.</p>
<h2 id="heading-demos-retros-and-saying-the-hard-things"><strong>Demos, Retros, and Saying the Hard Things</strong></h2>
<p>Once you’ve built the product, then come demos (showcasing your work) and retros (analyzing as a team what went well and which areas to improve).</p>
<p>In the retro, everyone’s encouraged to speak up:</p>
<ul>
<li><p>What went well?</p>
</li>
<li><p>What didn’t?</p>
</li>
<li><p>What should we try next time?</p>
</li>
</ul>
<p>Example:</p>
<p>“We missed a lot of stories because we didn’t account for testing time. Maybe we buffer next sprint with fewer tasks.”</p>
<p>The goal is not to blame – it’s to <em>improve</em>. Over time, this feedback loop becomes gold. The Scrum Master usually facilitates, collects feedback (via tools like Parabol, Miro, or sticky notes), and helps turn insights into actionable experiments for the next sprint.</p>
<p>Over time, retros become the heartbeat of team evolution.</p>
<p><a target="_blank" href="https://youtu.be/5eu1HotNmWs?si=1DZaSmztB6rHyawj">Here is a video</a> highlighting the importance of a Retro and Sprint Review.</p>
<h3 id="heading-why-retrospection-matters-more-than-you-think">🧠 Why Retrospection Matters More Than You Think</h3>
<p>The Sprint Retrospective is more than just another meeting. It’s a mirror for your team – a safe, structured space to pause, reflect, and improve together.</p>
<p>You discuss:</p>
<p>✅ what went well</p>
<p>❌ what did not go well</p>
<p>🔁 what could we do better next time</p>
<p>Great teams don't just deliver great software, they continually deliver better ways of working.</p>
<p>This is why many experienced Scrum practitioners consider the retro to be the most important event in Scrum. Code is deployed once, but process improvements grow exponentially, sprint after sprint.</p>
<h2 id="heading-tools-you-might-encounter"><strong>Tools You Might Encounter</strong></h2>
<p>Scrum doesn’t require software, but real-world teams use a variety of tools:</p>
<ul>
<li><p><strong>Jira</strong> – Tracks sprints, issues, velocity</p>
</li>
<li><p><strong>Trello</strong> – Simple board, good for small teams</p>
</li>
<li><p><strong>Slack</strong> – Where standups often happen if async</p>
</li>
<li><p><strong>Notion / Confluence</strong> – Docs, retros, notes</p>
</li>
<li><p><strong>GitHub Projects</strong> – Lightweight planning for devs</p>
</li>
</ul>
<p>Don’t worry if you’re not fluent in these yet. They’re tools – you’ll learn them on the job.</p>
<h2 id="heading-if-youre-preparing-for-a-job-heres-what-you-can-do"><strong>If You’re Preparing for a Job, Here’s What You Can Do</strong></h2>
<ul>
<li><p>✍️ Practice writing user stories from your side projects</p>
</li>
<li><p>🧪 Run a mini-sprint: Plan your weekend project, set goals, and “review” it at the end</p>
</li>
<li><p>🤝 Contribute to an open-source project that uses Scrum or Agile workflows</p>
</li>
<li><p>🧾 Write about what you learned – maybe as a tutorial (<em>hint hint</em>)</p>
</li>
</ul>
<h2 id="heading-final-thoughts"><strong>Final Thoughts</strong></h2>
<p>So to recap, Scrum is a simple yet powerful way for teams to work together, stay organized, and deliver results quickly. It runs in short cycles called <strong>sprints</strong>, where the team plans what to do, checks in daily, shows their progress at the end, and reflects on how to improve.</p>
<p>The four key ceremonies – <strong>Sprint Planning</strong>, <strong>Daily Scrum</strong>, <strong>Sprint Review</strong>, and <strong>Sprint Retrospective</strong> – help keep everyone aligned and focused. With clear roles and regular feedback, Scrum makes it easier to handle changes, solve problems early, and continuously get better as a team.</p>
<p>But Scrum isn’t a magic spell. It’s just a way for humans to build complex things – together – without falling apart.</p>
<p>You don’t need to be a Scrum Master. You don’t need a certification. But if you understand how sprints work, what’s expected of you, and how to show up to meetings with clarity and candor, you’re 10 steps ahead of most.</p>
<p>Scrum helps teams talk, plan, build, and learn. And now? You can too.</p>
<p>If you liked this, please do share. You never know who it might help out.</p>
<p>Until then…keep learning, unlearning, and relearning!!!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code ]]>
                </title>
                <description>
                    <![CDATA[ The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design. In this tutorial, I’ll show you how to build both single layer and multi-layer perceptrons (MLPs) across three frameworks: Custom class... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-multilayer-perceptron-with-examples-and-python-code/</link>
                <guid isPermaLink="false">6839f729798ea464918cffe8</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ neural networks ]]>
                    </category>
                
                    <category>
                        <![CDATA[ binary classification ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MLP (Multi-Layer Perceptrons) ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Fri, 30 May 2025 18:21:29 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748616370600/01903917-4be7-476b-90d1-18295d19edef.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The <strong>perceptron</strong> is a fundamental concept in deep learning, with many algorithms stemming from its original design.</p>
<p>In this tutorial, I’ll show you how to build both single-layer and multi-layer perceptrons (MLPs) across three frameworks:</p>
<ul>
<li><p>Custom classifier</p>
</li>
<li><p>Scikit-learn’s MLPClassifier</p>
</li>
<li><p>Keras Sequential classifier using SGD and Adam optimizers.</p>
</li>
</ul>
<p>This will help you learn about their various use cases and how they work.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-a-perceptron">What is a Perceptron?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-a-single-layered-classifier">How to Build a Single-Layered Classifier</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-a-multi-layer-perceptron">What is a Multi-Layer Perceptron?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-multi-layered-perceptrons">How to Build Multi-Layered Perceptrons</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-optimizers">Understanding Optimizers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-an-mlp-classifier-with-sgd-optimizer">How to Build an MLP Classifier with SGD Optimizer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-an-mlp-classifier-with-adam-optimizer">How to Build an MLP Classifier with Adam Optimizer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-results-generalization">Final Results: Generalization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>Mathematics (Calculus, Linear Algebra, Statistics)</p>
</li>
<li><p>Coding in Python</p>
</li>
<li><p>Basic understanding of Machine Learning concepts</p>
</li>
</ul>
<h2 id="heading-what-is-a-perceptron">What is a Perceptron?</h2>
<p>A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.</p>
<p>A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.</p>
<p>But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.</p>
<p>The perceptron consists of four main parts:</p>
<ul>
<li><p><strong>Input layer</strong>: Takes the initial numerical values into the system for further processing.</p>
</li>
<li><p><strong>Weights</strong>: Scale each input value before the values are summed together with a bias term.</p>
</li>
<li><p><strong>Activation function</strong>: Determines whether the neuron should fire based on the threshold value.</p>
</li>
<li><p><strong>Output layer</strong>: Produces the classification result.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748438698612/5b2920db-4ec1-455b-840e-7b5e9d6c2e75.png" alt="Image: Organization of a perceptron. Source: Rosenblatt 1958" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.</p>
<p>So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.</p>
<h3 id="heading-applications-of-perceptrons">Applications of Perceptrons</h3>
<p>Perceptrons are applied to tasks such as:</p>
<ul>
<li><p><strong>Image classification:</strong> Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.</p>
</li>
<li><p><strong>Linear regression:</strong> With a linear (identity) activation instead of a threshold, a perceptron can predict continuous outputs based on input features. This makes it useful for solving linear regression problems.</p>
</li>
</ul>
<h3 id="heading-how-the-activation-function-works">How the Activation Function Works</h3>
<p>For a single perceptron used for binary classification, the most common activation function is the <strong>step function</strong> (also known as the threshold function):</p>
<p>$$\phi(z) = \begin{cases} 1 &amp;\text{if } z \geq \theta \\ \\ 0 &amp;\text{if } z &lt; \theta \end{cases}$$</p><p>where:</p>
<ul>
<li><p><code>ϕ(z)</code>: the output of the activation function.</p>
</li>
<li><p><code>z</code>: the weighted sum of the inputs plus the bias:</p>
</li>
</ul>
<p>$$z = \sum_{i=1}^m w_i x_i + b$$</p><p>(<code>x_i</code>: input values, <code>w_i</code>: weight associated with each input, <code>b</code>: bias term)</p>
<p><code>θ</code> is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.</p>
<p>In that case, the formula becomes:</p>
<p>$$\phi(z) = \begin{cases} 1 &amp;\text{if } z \geq 0 \\ \\ 0 &amp;\text{if } z &lt; 0 \end{cases}$$</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748439460839/e74f1c1c-4e89-419b-aa9e-24a297d81ff5.png" alt="Image: Step Function (Author)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.</p>
<p>This occurs <strong>when the weighted sum reaches the threshold (zero here),</strong> leading the perceptron to predict that the input is in this binary class.</p>
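<p>To make the formula concrete, here is a minimal sketch that computes the weighted sum and applies the step function. The input, weight, and bias values are hypothetical and chosen only for illustration.</p>

```python
import numpy as np

def step(z, theta=0.0):
    """Step activation: 1 when z reaches the threshold theta, else 0."""
    return np.where(z >= theta, 1, 0)

# Hypothetical values, chosen only for illustration
x = np.array([2.0, -1.0, 0.5])   # input values x_i
w = np.array([0.4, 0.3, -0.2])   # weights w_i
b = -0.1                         # bias term

# Weighted sum of the inputs plus the bias
z = np.dot(w, x) + b

print("z =", z)
print("class =", int(step(z)))
```

<p>Because the weighted sum here is positive, the perceptron assigns the input to class one. Lowering the bias far enough would flip the prediction, which is how the bias acts as an adjustable threshold.</p>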
<p>While the step function is conceptually the original activation for a perceptron, its discontinuity at zero causes computational challenges.</p>
<p>In modern implementations, we can use other activation functions like the <strong>sigmoid</strong> function:</p>
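<p>$$\sigma (z) = \frac {1} {1 + e^{-z}}$$</p><p>Unlike the step function, the sigmoid function outputs a continuous value between zero and one depending on the weighted sum (z). To get a class label, we threshold that value, typically at 0.5.</p>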
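<p>As a rough illustration, the sketch below evaluates the sigmoid on a few hypothetical weighted sums and converts the results into class labels. The 0.5 cut-off is a common convention that I am assuming here, not something fixed by the function itself.</p>

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued weighted sum to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weighted sums
z = np.array([-4.0, 0.0, 4.0])

probs = sigmoid(z)                    # continuous values between 0 and 1
labels = (probs >= 0.5).astype(int)   # assumed cut-off of 0.5

print(probs)
print(labels)
```

<p>The smooth curve is what makes the sigmoid usable with gradient-based training: small changes in z always produce small changes in the output.</p>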
<h3 id="heading-how-the-loss-function-works">How the Loss Function Works</h3>
<p>The <strong>loss function</strong> is a crucial concept in machine learning that quantifies the error or discrepancy between the model's predictions and the actual target values.</p>
<p>Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model's parameters in a way that minimizes this error and improves performance.</p>
<p>In a binary classification task, the model may adopt the <strong>hinge loss function</strong> to penalize misclassifications by incurring an additional cost for incorrect predictions:</p>
<p>$$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$</p><p>(<code>h(x)</code>: the model’s raw output score, <code>y</code>: true label encoded as −1 or +1)</p>
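<p>Here is a small sketch of the hinge loss on hypothetical scores. It assumes the labels are encoded as −1 and +1 and that the model returns a raw score rather than a 0/1 label, which is the usual setting for this loss.</p>

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Mean hinge loss; y_true must be encoded as -1 or +1."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

# Hypothetical labels and raw model scores
y_true = np.array([1, -1, 1, -1])
scores = np.array([2.0, -0.5, -1.0, 0.3])

# Loss for each sample before averaging
per_sample = np.maximum(0.0, 1.0 - y_true * scores)

print(per_sample)
print(hinge_loss(y_true, scores))
```

<p>The first sample is classified correctly with a comfortable margin, so it contributes no loss. The second is correct but too close to the boundary, and the last two are on the wrong side, so they are penalized more heavily.</p>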
<h2 id="heading-how-to-build-a-single-layered-classifier">How to Build a Single-Layered Classifier</h2>
<p>Now, let’s build a simple single-layer perceptron for binary classification.</p>
<h3 id="heading-1-custom-classifier">1. Custom Classifier</h3>
<h4 id="heading-initialize-the-classifier">Initialize the classifier</h4>
<p>We’ll first initialize the classifier with a <code>learning_rate</code> and a number of epochs (<code>n_iterations</code>). The <code>weights</code> and <code>bias</code> start as <code>None</code> and are set during training.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
    self.learning_rate = learning_rate
    self.n_iterations = n_iterations
    self.weights = <span class="hljs-literal">None</span>
    self.bias = <span class="hljs-literal">None</span>
</code></pre>
<h4 id="heading-define-the-activation-function">Define the activation function</h4>
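<p>Use a step function that returns 1 if the input (x) is greater than the threshold, and 0 otherwise. By default, the <code>threshold</code> is set to zero. Note that this implementation uses a strict inequality, so an input of exactly zero maps to 0, unlike the ≥ in the formula above.</p>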
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
     <span class="hljs-keyword">return</span> np.where(x &gt; threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
</code></pre>
<h4 id="heading-train-the-model">Train the model</h4>
<p>Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: <code>weights</code> and <code>bias</code>.</p>
<p>This process is controlled by a specified number of training epochs defined by <code>n_iterations</code>.</p>
<p>In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined <code>learning_rate</code>.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
    n_samples, n_features = X.shape

    self.weights = np.zeros(n_features)
    self.bias = <span class="hljs-number">0</span>

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
            <span class="hljs-comment"># compute weighted sum (z)</span>
            z = np.dot(X[i], self.weights) + self.bias

            <span class="hljs-comment"># apply the activation function</span>
            y_pred = self._step_function(z)

            <span class="hljs-comment"># update weights and bias</span>
            self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
            self.bias += self.learning_rate * (y[i] - y_pred)
</code></pre>
<h4 id="heading-how-the-weights-work-in-the-iteration-loop">How the weights work in the iteration loop</h4>
<p>The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.</p>
<p>Its iterative update in the <code>for</code> loop aims to reduce classification errors such that:</p>
<p>$$\begin{align*} w_j &amp;:= w_j + \Delta w_j \\ &amp;:= w_j + \eta (y_i - \hat y_i)x_{ij} \\ &amp;= \begin{cases} w_j &amp;\text{(a) } y_i - \hat y_i = 0\\ w_j + \eta x_{ij} &amp;\text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} &amp;\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$</p><p>(<code>w_j</code>: j-th weight, <code>η</code>: learning rate, <code>x_ij</code>: j-th feature of the i-th sample, <code>y_i − ŷ_i</code>: error)</p>
<p>This means that:</p>
<ol>
<li><p>When the prediction is <strong>correct</strong>, the error is zero, so the weight is unchanged.</p>
</li>
<li><p>When the prediction is <strong>too low</strong> (y_i = 1 and ŷ_i = 0), the weight is moved in the direction of the input to increase the weighted sum.</p>
</li>
<li><p>When the prediction is <strong>too high</strong> (y_i = 0 and ŷ_i = 1), the weight is moved in the opposite direction to pull the weighted sum lower.</p>
</li>
</ol>
<h4 id="heading-how-the-bias-terms-work-in-the-iteration-loop">How the bias terms work in the iteration loop</h4>
<p>The bias determines the decision boundary’s intercept (position from the origin).</p>
<p>Similar to weights, we adjust the bias terms in each epoch to position the decision boundary:</p>
<p>$$\begin {align*} b &amp;:= b + \Delta b \\ &amp; := b + \eta (y_i - \hat y_i) \\ &amp;= \begin{cases} b &amp;\text{(a) } y_i - \hat y_i = 0\\ b + \eta &amp;\text{(b) } y_i - \hat y_i = 1 \\ b - \eta &amp;\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$</p><p>This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.</p>
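<p>To see both update rules at work, here is a sketch of a single update step on one misclassified sample. All of the numbers are hypothetical, picked so the arithmetic is easy to follow.</p>

```python
import numpy as np

eta = 0.1                       # learning rate
w = np.array([0.2, -0.3])       # current weights (hypothetical)
b = 0.0                         # current bias (hypothetical)
x_i = np.array([1.0, 2.0])      # one training sample
y_i = 1                         # its true label

# Prediction before the update
z = np.dot(x_i, w) + b
y_hat = 1 if z > 0 else 0

# Case (b): the prediction is too low, so the error is +1
error = y_i - y_hat
w_new = w + eta * error * x_i
b_new = b + eta * error

# Weighted sum for the same sample after the update
z_new = np.dot(x_i, w_new) + b_new

print("before:", z, "after:", z_new)
```

<p>After the update, the weighted sum for this sample moves from negative to positive, so the same input would now be classified correctly.</p>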
<h4 id="heading-make-a-prediction">Make a prediction</h4>
<p>Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
      linear_output = np.dot(X, self.weights) + self.bias
      predictions = self._step_function(linear_output)
      <span class="hljs-keyword">return</span> predictions
</code></pre>
<p><strong>The entire classifier looks like this:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Perceptron</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = <span class="hljs-literal">None</span>
        self.bias = <span class="hljs-literal">None</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
        <span class="hljs-keyword">return</span> np.where(x &gt; threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
                linear_output = np.dot(X[i], self.weights) + self.bias
                y_pred = self._step_function(linear_output)
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
        <span class="hljs-keyword">return</span> self

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
        linear_output = np.dot(X, self.weights) + self.bias
        y_pred = self._step_function(linear_output)
        <span class="hljs-keyword">return</span> y_pred
</code></pre>
<h4 id="heading-simulate-with-synthetic-datasets">Simulate with synthetic datasets</h4>
<p>First, we generate a synthetic, linearly separable dataset using <code>make_blobs</code>, then train the classifier we created and evaluate its accuracy.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> make_blobs
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># create a mock dataset</span>
X, y = make_blobs(n_features=<span class="hljs-number">2</span>, centers=<span class="hljs-number">2</span>, n_samples=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">12</span>)

<span class="hljs-comment"># split</span>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># train the model</span>
perceptron = Perceptron(learning_rate=<span class="hljs-number">0.1</span>, n_iterations=<span class="hljs-number">1000</span>).fit(X_train, y_train)

<span class="hljs-comment"># make a prediction</span>
y_pred_train = perceptron.predict(X_train)
y_pred_test = perceptron.predict(X_test)

<span class="hljs-comment"># evaluate the results</span>
acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(<span class="hljs-string">f"Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>}</span> \nAccuracy (Test): <span class="hljs-subst">{acc_test:<span class="hljs-number">.3</span>}</span>"</span>)
</code></pre>
<h4 id="heading-results">Results</h4>
<p>The classifier generated a clear, highly accurate linear decision boundary.</p>
<ul>
<li><p><em>Accuracy (Train): 0.981</em></p>
</li>
<li><p><em>Accuracy (Test): 0.975</em></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440470195/0a01c5ad-124e-4f59-b4d5-9ee5dd5b23ce.png" alt="Decision boundary of single-layer perceptron (Custom classifier)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-2-leverage-sckitlearns-mcp-classifier">2. Leverage Scikit-learn’s MLPClassifier</h3>
<p>For convenience, we’ll use scikit-learn’s built-in classifier (<code>MLPClassifier</code>) to build a similar, yet more robust classifier:</p>
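<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier

model = MLPClassifier(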
    hidden_layer_sizes=(), <span class="hljs-comment"># intentionally set empty to create a single layer perceptron</span>
    activation=<span class="hljs-string">'logistic'</span>, <span class="hljs-comment"># choosing a sigmoid function as an activation function</span>
    solver=<span class="hljs-string">'sgd'</span>, <span class="hljs-comment"># choosing SGD optimizer</span>
    max_iter=<span class="hljs-number">1000</span>,
    random_state=<span class="hljs-number">42</span>, 
    learning_rate=<span class="hljs-string">'constant'</span>, 
    learning_rate_init=<span class="hljs-number">0.1</span>
).fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(<span class="hljs-string">f"MLPClassifier\nAccuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>}</span> \nAccuracy (Test): <span class="hljs-subst">{acc_test:<span class="hljs-number">.3</span>}</span>"</span>)
</code></pre>
<h4 id="heading-results-1">Results</h4>
<p>The MLPClassifier generated a clear linear decision boundary with slightly better accuracy scores.</p>
<ul>
<li><p><em>Accuracy (Train): 0.985</em></p>
</li>
<li><p><em>Accuracy (Test): 0.995</em></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440118956/f5391f47-711a-4948-b956-1a76dbd7ca92.png" alt="Decision boundary of single-layer perceptron (MLPClassifier)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-limitations-of-single-layer-perceptrons">Limitations of Single-Layer Perceptrons</h3>
<p>Now, let’s talk about the key differences between the MLPClassifier and our custom single-layer perceptron.</p>
<p>Unlike more general neural networks, single-layer perceptrons use a <strong>step function</strong> as their activation.</p>
<p>Due to its discontinuity at x=0, the step function is not differentiable there, and its derivative is zero everywhere else.</p>
<p>This precludes the use of <strong>gradient-based optimization algorithms</strong> such as SGD or Adam, because these methods rely on gradients (the partial derivatives of the cost function with respect to each parameter) to decide how to update the weights.</p>
<p>In contrast, most neural networks employ differentiable activation functions (for example, <strong>sigmoid</strong>, <strong>ReLU</strong>) and loss functions (for example, <strong>MSE</strong>, <strong>Cross-Entropy</strong>) for effective optimization.</p>
<p>Other challenges of a single-layer perceptron include:</p>
<ul>
<li><p><strong>Limited to linear separability:</strong> Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.</p>
</li>
<li><p><strong>Lack of depth:</strong> Being single-layered, they cannot learn complex hierarchical representations.</p>
</li>
<li><p><strong>Limited optimizer options:</strong> As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.</p>
</li>
</ul>
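<p>The linear separability limit is easy to reproduce. The sketch below trains the same perceptron update rule on the classic XOR problem, a tiny dataset I am using here purely as an illustration. Since no straight line separates the two classes, a linear classifier can get at most three of the four points right.</p>

```python
import numpy as np

# XOR: the label is 1 only when exactly one input is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

w = np.zeros(2)
b = 0.0
eta = 0.1

# Same update rule as the custom perceptron, with a step activation
for _ in range(1000):
    for x_i, y_i in zip(X, y):
        y_hat = 1 if np.dot(x_i, w) + b > 0 else 0
        w += eta * (y_i - y_hat) * x_i
        b += eta * (y_i - y_hat)

y_pred = np.where(X @ w + b > 0, 1, 0)
accuracy = np.mean(y_pred == y)

print("Accuracy on XOR:", accuracy)
```

<p>No matter how many epochs you run, the accuracy cannot reach 1.0 on this data, which is the motivation for adding hidden layers.</p>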
<p>So, in the next section, you’ll learn about multi-layer perceptrons, which overcome these limitations.</p>
<h2 id="heading-what-is-a-multi-layer-perceptron">What is a Multi-Layer Perceptron?</h2>
<p>An MLP is a class of feedforward artificial neural network that consists of at least <strong>three layers</strong> of nodes:</p>
<ul>
<li><p>an input layer,</p>
</li>
<li><p>one or more hidden layers, and</p>
</li>
<li><p>an output layer.</p>
</li>
</ul>
<p>Except for the input nodes, each node is a neuron that uses a <strong>nonlinear</strong> activation function.</p>
<p>MLPs are used for both classification and regression:</p>
<ul>
<li><p><strong>Classification tasks:</strong> Typical examples include handwriting recognition and speech recognition.</p>
</li>
<li><p><strong>Regression analysis:</strong> They are also applied in regression problems where the relationship between input and output is complex.</p>
</li>
</ul>
<h2 id="heading-how-to-build-multi-layered-perceptrons">How to Build Multi-Layered Perceptrons</h2>
<p>Let’s handle a binary classification task using a standard MLP architecture.</p>
<h3 id="heading-outline-of-the-project">Outline of the Project</h3>
<h4 id="heading-objective">Objective</h4>
<ul>
<li>Detect fraudulent transactions</li>
</ul>
<h4 id="heading-evaluation-metrics">Evaluation Metrics</h4>
<ul>
<li><p>Considering the cost of misclassification, we’ll prioritize improving <strong>Recall</strong> and <strong>Precision scores</strong></p>
</li>
<li><p>Then check the overall correctness of the classification with the <strong>Accuracy</strong> score: (TP + TN) / (TP + TN + FP + FN)</p>
</li>
</ul>
<p><strong>Cost of Misclassification (from high to low):</strong></p>
<ul>
<li><p><strong>False Negative (FN):</strong> The model incorrectly identifies a fraudulent transaction as legitimate (missing actual fraud).</p>
</li>
<li><p><strong>False Positive (FP):</strong> The model incorrectly identifies a legitimate transaction as fraudulent (blocking legitimate customers).</p>
</li>
<li><p><strong>True Positive (TP):</strong> The model correctly identifies a fraudulent transaction as fraud.</p>
</li>
<li><p><strong>True Negative (TN):</strong>  The model correctly identifies a non-fraudulent transaction as non-fraud.</p>
</li>
</ul>
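<p>To show how these four outcomes turn into the metrics above, here is a small sketch with hypothetical counts for an imbalanced fraud dataset. The numbers are made up for illustration and are not results from this project.</p>

```python
# Hypothetical confusion-matrix counts for 10,000 transactions
tp = 80     # fraud correctly flagged
fp = 40     # legitimate transactions flagged as fraud
fn = 20     # fraud that slipped through
tn = 9860   # legitimate transactions correctly passed

precision = tp / (tp + fp)                   # share of flagged transactions that are fraud
recall = tp / (tp + fn)                      # share of fraud that was caught
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

<p>In this made-up case, accuracy looks excellent even though a fifth of the fraud is missed and a third of the alerts are false alarms. That gap is why Recall and Precision come first for this task.</p>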
<h3 id="heading-planning-an-mlp-architecture">Planning an MLP Architecture</h3>
<p>In the network, 19 input features feed into the first hidden layer’s 30 neurons, which use a ReLU activation function.</p>
<p>Then, their outputs are passed to the second layer, culminating in sigmoid values as the final output.</p>
<p>During the optimization process, we’ll let the optimizers (SGD and Adam) perform forward and backward passes to adjust parameters.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440761512/37753a4c-f7f8-44bc-bea9-c50360830456.png" alt="Standard MLP Architecture for Binary Classification Tasks" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using <a target="_blank" href="https://www.researchgate.net/publication/355148120_SS-MLP_A_Novel_Spectral-Spatial_MLP_Architecture_for_Hyperspectral_Image_Classification">image source</a>)</p>
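<p>As a rough sketch of the forward pass through such a network, the NumPy code below pushes a small batch through one hidden layer of 30 ReLU neurons and a single sigmoid output. The weights and inputs are random placeholders, and the variable names are my own, so treat this as an illustration of the data flow rather than the model we train later.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_features, n_hidden = 19, 30

# Randomly initialized parameters (placeholders, not trained values)
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

# A hypothetical batch of 4 transactions with 19 features each
X_batch = rng.normal(size=(4, n_features))

hidden = relu(X_batch @ W1 + b1)     # hidden activations, shape (4, 30)
probs = sigmoid(hidden @ W2 + b2)    # fraud probabilities, shape (4, 1)

print(hidden.shape, probs.shape)
```

<p>During training, the backward pass computes how the loss changes with respect to each of these weight matrices and bias vectors, and the optimizer uses those gradients to update them.</p>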
<p>Especially in deeper networks, <strong>ReLU</strong> is advantageous in preventing <a target="_blank" href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem#:~:text=In%20machine%20learning%2C%20the%20vanishing,derivative%20of%20the%20loss%20function">vanishing gradient problems</a>, where gradients become extremely small as they are backpropagated from the output layer toward the earlier layers.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440797954/ba19bf66-cdb9-4bfb-9b92-e1e3f72e9fc7.png" alt="Comparison of major activation functions: From left to right: Sigmoid, Tanh, ReLU" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
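<p>A quick back-of-the-envelope check shows why. The sigmoid’s derivative never exceeds 0.25, while ReLU’s derivative is exactly 1 for positive inputs. The sketch below multiplies that best-case factor across 10 layers, a depth I picked for illustration. It ignores the weight matrices, which also scale the gradient in a real network.</p>

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return np.where(z > 0, 1.0, 0.0)

n_layers = 10  # hypothetical depth

# Best-case activation derivative multiplied across all layers
sigmoid_factor = sigmoid_grad(0.0) ** n_layers
relu_factor = relu_grad(1.0) ** n_layers

print("sigmoid:", sigmoid_factor)
print("relu:   ", relu_factor)
```

<p>Even in the best case, the sigmoid factor shrinks to less than one millionth after 10 layers, while the ReLU factor stays at 1 for active neurons.</p>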
<p><a target="_blank" href="https://medium.com/data-science-collective/a-comprehensive-guide-on-neural-network-in-deep-learning-442ba9f1f0e5">Learn More: A Comprehensive Guide on Neural Network in Deep Learning</a></p>
<h3 id="heading-preprocessing-the-datasets">Preprocessing the Datasets</h3>
<p>First, we consolidate <a target="_blank" href="https://www.kaggle.com/datasets/computingvictor/transactions-fraud-datasets">three datasets  –  transaction, customer, and credit card</a>  –  into a single DataFrame, independently sanitizing numerical and categorical data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler, OneHotEncoder
<span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
<span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline

<span class="hljs-comment"># download the raw data to local</span>
<span class="hljs-keyword">import</span> kagglehub
path = kagglehub.dataset_download(<span class="hljs-string">"computingvictor/transactions-fraud-datasets"</span>)
dir = <span class="hljs-string">f'<span class="hljs-subst">{path}</span>/gd_card_flaud_demo'</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">sanitize_df</span>(<span class="hljs-params">amount_str</span>):</span>
    <span class="hljs-string">"""Removes '$' and converts the string to a float."""</span>
    <span class="hljs-keyword">if</span> isinstance(amount_str, str):
        <span class="hljs-keyword">return</span> float(amount_str.replace(<span class="hljs-string">'$'</span>, <span class="hljs-string">''</span>))
    <span class="hljs-keyword">return</span> amount_str

<span class="hljs-comment"># load transaction data</span>
trx_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/transactions_data.csv'</span>)

<span class="hljs-comment"># sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)</span>
trx_df = trx_df[trx_df[<span class="hljs-string">'errors'</span>].isna()]
trx_df = trx_df.drop(columns=[<span class="hljs-string">'merchant_city'</span>,<span class="hljs-string">'merchant_state'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'mcc'</span>, <span class="hljs-string">'errors'</span>], axis=<span class="hljs-string">'columns'</span>)
trx_df[<span class="hljs-string">'amount'</span>] = trx_df[<span class="hljs-string">'amount'</span>].apply(sanitize_df)

<span class="hljs-comment"># merge the dataframe with fraud transaction flag.</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/train_fraud_labels.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> fp:
    fraud_labels_json = json.load(fp=fp)

fraud_labels_dict = fraud_labels_json.get(<span class="hljs-string">'target'</span>, {})
fraud_labels_series = pd.Series(fraud_labels_dict, name=<span class="hljs-string">'is_fraud'</span>)
fraud_labels_series.index = fraud_labels_series.index.astype(int) <span class="hljs-comment"># convert the datatype from string to integer</span>
merged_df = pd.merge(trx_df, fraud_labels_series, left_on=<span class="hljs-string">'id'</span>, right_index=<span class="hljs-literal">True</span>, how=<span class="hljs-string">'left'</span>)
merged_df.fillna({<span class="hljs-string">'is_fraud'</span>: <span class="hljs-string">'No'</span>}, inplace=<span class="hljs-literal">True</span>)
merged_df[<span class="hljs-string">'is_fraud'</span>] = merged_df[<span class="hljs-string">'is_fraud'</span>].map({<span class="hljs-string">'Yes'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'No'</span>: <span class="hljs-number">0</span>})

<span class="hljs-comment"># load card data</span>
card_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/cards_data.csv'</span>)
card_df = card_df.drop(columns=[<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'acct_open_date'</span>, <span class="hljs-string">'card_number'</span>, <span class="hljs-string">'expires'</span>, <span class="hljs-string">'cvv'</span>], axis=<span class="hljs-string">'columns'</span>)
card_df[<span class="hljs-string">'credit_limit'</span>] = card_df[<span class="hljs-string">'credit_limit'</span>].apply(sanitize_df)

<span class="hljs-comment"># merge transaction and card data</span>
merged_df = pd.merge(left=merged_df, right=card_df, left_on=<span class="hljs-string">'card_id'</span>, right_on=<span class="hljs-string">'id'</span>, how=<span class="hljs-string">'inner'</span>)
merged_df = merged_df.drop(columns=[<span class="hljs-string">'id_y'</span>, <span class="hljs-string">'card_id'</span>], axis=<span class="hljs-string">'columns'</span>)

<span class="hljs-comment"># converts categorical variables into a new binary column (0 or 1)</span>
categorical_cols = merged_df.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=<span class="hljs-literal">False</span>, dtype=float) 
df = df.dropna().drop([<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'id_x'</span>], axis=<span class="hljs-number">1</span>)
print(<span class="hljs-string">'\nDataFrame: \n'</span>, df.head(n=<span class="hljs-number">3</span>))
</code></pre>
<p>DataFrame:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440856826/ba79bdaf-e0a1-457f-ab19-fda3e0f08141.png" alt="Base DataFrame" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Our DataFrame shows an extremely <strong>skewed data distribution</strong> with:</p>
<ul>
<li><p>Fraud samples: 1,191</p>
</li>
<li><p>Non-fraud samples: 11,477,397</p>
</li>
</ul>
<p>For classification tasks, <strong>it's crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact</strong> on classification model performance, especially regarding the minority class.</p>
<p>For our data, we’ll:</p>
<ol>
<li><p>split the 1,191 fraud samples into training, validation, and test sets,</p>
</li>
<li><p>add an equal number of randomly chosen non-fraud samples from the DataFrame, and</p>
</li>
<li><p>adjust split balances later if generalization challenges arise.</p>
</li>
</ol>
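<pre><code class="lang-python"><span class="hljs-comment"># separate fraud and non-fraud rows before sampling</span>
df_fraud = df[df[<span class="hljs-string">'is_fraud'</span>] == <span class="hljs-number">1</span>]
df_non_fraud = df[df[<span class="hljs-string">'is_fraud'</span>] == <span class="hljs-number">0</span>]

<span class="hljs-comment"># define the desired size of the fraud samples for the validation and test sets</span>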
val_size_per_class = <span class="hljs-number">200</span>
test_size_per_class = <span class="hljs-number">200</span>

<span class="hljs-comment"># create test sets</span>
X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)
X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># combine to form the balanced test set</span>
X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_test = X_test[<span class="hljs-string">'is_fraud'</span>]
X_test = X_test.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)

<span class="hljs-comment"># remove sampled rows from the original dataframes to avoid data leakage</span>
df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)


<span class="hljs-comment"># create validation sets</span>
X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)
X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># combine to form the balanced validation set</span>
X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_val = X_val[<span class="hljs-string">'is_fraud'</span>]
X_val = X_val.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)

<span class="hljs-comment"># remove sampled rows from the remaining dataframes</span>
df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)


<span class="hljs-comment"># create training sets</span>
min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))

X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)
X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)

X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_train = X_train[<span class="hljs-string">'is_fraud'</span>]
X_train = X_train.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)


print(<span class="hljs-string">"\n--- Final Dataset Shapes and Distributions ---"</span>)
print(<span class="hljs-string">f"X_train shape: <span class="hljs-subst">{X_train.shape}</span>, y_train distribution: <span class="hljs-subst">{np.unique(y_train, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
print(<span class="hljs-string">f"X_val shape: <span class="hljs-subst">{X_val.shape}</span>, y_val distribution: <span class="hljs-subst">{np.unique(y_val, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
print(<span class="hljs-string">f"X_test shape: <span class="hljs-subst">{X_test.shape}</span>, y_test distribution: <span class="hljs-subst">{np.unique(y_test, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
</code></pre>
<p>After the operation, we secured 1,582 training, 400 validation, and 400 test samples, each dataset maintaining a <strong>50:50 split between fraud and non-fraud transactions</strong>:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*IZtK3l0hSqmkOrm9h_d9Jw.png" alt="X, y datasets shape" width="600" height="400" loading="lazy"></p>
<p>Considering the high dimensional feature space with 19 input features, we’ll apply <strong>SMOTE</strong> to resample the training data (SMOTE should not be applied to validation or test sets to avoid data leakage):</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> imblearn.over_sampling <span class="hljs-keyword">import</span> SMOTE
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter

train_target = <span class="hljs-number">2000</span>

smote_train = SMOTE(
  sampling_strategy={<span class="hljs-number">0</span>: train_target, <span class="hljs-number">1</span>: train_target},  <span class="hljs-comment"># increase sample size to 2,000</span>
  random_state=<span class="hljs-number">12</span>
)
X_train, y_train = smote_train.fit_resample(X_train, y_train)

print(<span class="hljs-string">f"\nAfter SMOTE with custom sampling_strategy (target train: <span class="hljs-subst">{train_target}</span>):"</span>)
print(<span class="hljs-string">f"X_train_oversampled shape: <span class="hljs-subst">{X_train.shape}</span>"</span>)
print(<span class="hljs-string">f"y_train_oversampled distribution: <span class="hljs-subst">{Counter(y_train)}</span>"</span>)
</code></pre>
<p>We’ve secured 4,000 training samples, maintaining a 50:50 split between fraud and non-fraud transactions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440986995/ed079321-3972-4226-b1a8-244010445162.png" alt="Training sample shape after SMOTE" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
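<p>Conceptually, SMOTE builds each synthetic sample by picking a point from the class being oversampled, choosing one of its k nearest same-class neighbors, and interpolating at a random position on the segment between the two. The NumPy sketch below illustrates only that interpolation idea on toy data. The <code>smote_like_oversample</code> helper and its parameters are hypothetical names for illustration, not the <code>imblearn</code> implementation:</p>

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between same-class neighbors."""
    rng = np.random.default_rng(seed)

    # pairwise Euclidean distances between the samples of one class
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, len(X_min), size=n_new)            # random base points
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor per base
    gap = rng.random((n_new, 1))                              # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# toy data (assumption): 20 samples with 3 features
X_minority = np.random.default_rng(1).normal(size=(20, 3))
X_synthetic = smote_like_oversample(X_minority, n_new=30)
print(X_synthetic.shape)  # (30, 3)
```

<p>Because every synthetic row is a convex combination of two real rows, it always stays inside the range already covered by the original samples.</p>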
<p>Lastly, we’ll apply <strong>column transformers</strong> to numerical and categorical features separately.</p>
<p>Column transformers are advantageous in handling datasets with multiple data types, as they can apply different transformations to different subsets of columns while preventing data leakage.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
<span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline

categorical_features = X_train.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns.tolist()
categorical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'most_frequent'</span>)),(<span class="hljs-string">'onehot'</span>, OneHotEncoder(handle_unknown=<span class="hljs-string">'ignore'</span>))])

numerical_features = X_train.select_dtypes(include=[<span class="hljs-string">'int64'</span>, <span class="hljs-string">'float64'</span>]).columns.tolist()
numerical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)), (<span class="hljs-string">'scaler'</span>, StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        (<span class="hljs-string">'num'</span>, numerical_transformer, numerical_features),
        (<span class="hljs-string">'cat'</span>, categorical_transformer, categorical_features)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)
</code></pre>
<h2 id="heading-understanding-optimizers">Understanding Optimizers</h2>
<p>In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training. Its primary role is to minimize the model’s loss function, enhancing performance.</p>
<p>Optimizers differ in the strategy they use to converge efficiently toward parameters that improve the model's predictions.</p>
<p>In this article, we’ll use the SGD Optimizer and Adam Optimizer.</p>
<h3 id="heading-1-how-a-sgd-stochastic-gradient-descent-optimizer-works">1. How an SGD (Stochastic Gradient Descent) Optimizer Works</h3>
<p>SGD is a widely used optimization algorithm that computes the gradient (partial derivative of the cost function) on a small mini-batch of examples and updates the parameters at each step, so one epoch consists of many updates:</p>
<p>$$\begin{align*} w_j &amp;:= w_j - \eta \frac {\partial J} {\partial w_j} \\ \\ b &amp;:= b - \eta \frac {\partial J} {\partial b} \end{align*}$$</p><p>(w: weight, b: bias, J: cost function, <em>η</em>: learning rate)</p>
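<p>As a minimal sketch of this update rule, the snippet below runs mini-batch SGD on a toy linear model with a mean squared error cost. The data, the <code>sgd_step</code> helper, and the hyperparameters are illustrative assumptions, not part of the classifier built later in this article:</p>

```python
import numpy as np

def sgd_step(w, b, grad_w, grad_b, eta):
    """Apply one SGD update: parameter := parameter - eta * gradient."""
    return w - eta * grad_w, b - eta * grad_b

rng = np.random.default_rng(0)

# toy data (assumption): y = 3*x + 1 plus a little noise
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=256)

w, b = np.zeros(1), 0.0
eta, batch_size = 0.1, 32

for epoch in range(50):
    order = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        X_b, y_b = X[idx], y[idx]
        error = X_b @ w + b - y_b                # prediction error on the mini-batch
        grad_w = 2 * X_b.T @ error / len(idx)    # dJ/dw for J = mean(error**2)
        grad_b = 2 * error.mean()                # dJ/db
        w, b = sgd_step(w, b, grad_w, grad_b, eta)

print(w, b)  # should end up close to 3 and 1
```

<p>Each pass through the inner loop is one parameter update, so a single epoch here performs eight updates rather than one.</p>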
<p>In binary classification, the cost function (J) is the binary cross-entropy of the prediction ŷ, which is obtained by applying a sigmoid function (σ(z)) to z, the weighted sum of the inputs plus the bias term:</p>
<p>$$\begin{align*} J(y, \hat y) &amp;= -[y \log(\hat y) + (1-y) \log(1-\hat y)] \\ \\ \hat y &amp;= \sigma (z) = \frac {1} {1+e^{-z}} \\ \\ z &amp;= \sum_{i=1}^m w_i x_i + b \end{align*}$$</p><h3 id="heading-2-how-adam-adaptive-moment-estimation-optimizer-works">2. How Adam (Adaptive Moment Estimation) Optimizer Works</h3>
<p>Adam is an optimization algorithm that computes <strong>individual adaptive learning rates</strong> for different parameters from estimates of first and second moments of the gradients.</p>
<p>Adam optimizer combines the advantages of <a target="_blank" href="https://keras.io/api/optimizers/rmsprop/"><strong>RMSprop</strong></a> (using squared gradients to scale the learning rate) and <a target="_blank" href="https://optimization.cbe.cornell.edu/index.php?title=Momentum"><strong>Momentum</strong></a> (using past gradients to accelerate convergence):</p>
<p>$$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$</p><p>where:</p>
<ul>
<li><p><code>α</code>: The learning rate (default is 0.001)</p>
</li>
<li><p><code>ϵ</code>: A small positive constant used to avoid division by zero</p>
</li>
<li><p><code>m^</code>: First moment (mean) estimate with a bias correction, leveraging <strong>Momentum</strong>:</p>
</li>
</ul>
<p>$$\begin{align*} \hat m_t &amp;= \frac {m_t} {1 - \beta_1^t} \\ \\ m_t &amp;= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{ \frac {\partial L} {\partial w_t}}_{\text{gradient}} \end{align*}$$</p><p>(<code>β1</code>: <strong>decay rate</strong> for the first moment, typically set to 0.9)</p>
<p><code>v^</code>: Second moment (variance) estimate with a bias correction, leveraging <strong>RMSprop</strong>:</p>
<p>$$\begin{align*} \hat v_t &amp;= \frac {v_t} {1 - \beta_2^t} \\ \\ v_t &amp;=\beta_2 v_{t-1} + (1- \beta_2) \left(\frac {\partial L} {\partial w_t}\right)^2 \end{align*}$$</p><p>(<code>β2</code>: <strong>decay rate</strong> for the second moment, typically set to 0.999)</p>
<p>Since both <code>m</code> and <code>v</code> are initialized at zero, their raw estimates are biased toward zero during the first steps. Adam divides them by the bias-correction terms shown above to compensate.</p>
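<p>The Adam equations translate almost line by line into NumPy. The following is a minimal sketch on a toy quadratic objective. The <code>adam_step</code> helper, the objective, and the value <code>eps=1e-8</code> are assumptions made for illustration:</p>

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (Momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSprop part)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# toy objective (assumption): L(w) = sum((w - target)**2), gradient 2 * (w - target)
target = np.array([1.0, -2.0])
w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)

for t in range(1, 2001):
    grad = 2 * (w - target)
    w, m, v = adam_step(w, grad, m, v, t, alpha=0.01)

print(w)  # should end up close to the target
```

<p>On the very first step the bias correction makes the update roughly <code>alpha</code> in size for every parameter, regardless of the gradient's magnitude, which is why the correction matters most early in training.</p>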
<p>Learn More: <a target="_blank" href="https://medium.com/@kuriko-iwai/a-comprehensive-guide-on-neural-network-in-deep-learning-9c795a1f1648">A Comprehensive Guide on Neural Network in Deep Learning</a></p>
<h2 id="heading-how-to-build-an-mlp-classifier-with-sgd-optimizer">How to Build an MLP Classifier with SGD Optimizer</h2>
<h3 id="heading-custom-classifier">Custom Classifier</h3>
<p>This process involves a <strong>forward pass</strong> and <strong>backpropagation</strong>, during which SGD uses the gradients to update the weights and biases step by step:</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
    <span class="hljs-comment"># SGD starts with randomly selected mini-batch for the epoch</span>
    X_batch = X_shuffled[i : i + self.batch_size]
    y_batch = y_shuffled[i : i + self.batch_size]

    <span class="hljs-comment"># A. forward pass</span>
    activations, zs = self._forward_pass(X_batch)
    y_pred = activations[<span class="hljs-number">-1</span>]  <span class="hljs-comment"># final output of the network</span>

    <span class="hljs-comment"># B. backpropagation</span>
    <span class="hljs-comment"># 1) calculate gradients for the output layer</span>
    delta = y_pred - y_batch
    dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
    db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

    <span class="hljs-comment"># 2) update output layer parameters</span>
    self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
    self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db

    <span class="hljs-comment"># 3) iterate backward from last hidden layer to the input layer</span>
    <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
        delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
        dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
        db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

        self.weights[l] -= self.learning_rate * dW
        self.biases[l] -= self.learning_rate * db
</code></pre>
<p>During the forward pass, each layer computes the weighted sum of its inputs plus the bias (z). Hidden layers apply an activation function (ReLU) to z, and the output layer computes the predicted output (y_pred) using a sigmoid function.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
    activations = [X]
    zs = []

    <span class="hljs-comment"># forward through hidden layers</span>
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
        z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
        zs.append(z)
        a = self._relu(z) <span class="hljs-comment"># using ReLU for hidden layers</span>
        activations.append(a)

    <span class="hljs-comment"># forward through output layer</span>
    z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
    zs.append(z_output)

    <span class="hljs-comment"># computes the final output using sigmoid function</span>
    y_pred = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(z_output, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))
    activations.append(y_pred)
    <span class="hljs-keyword">return</span> activations, zs
</code></pre>
<p>So the final classifier looks like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_SGD</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.01</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span></span>):</span>
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.weights = []
        self.biases = []
        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        s = self._sigmoid(x)
        <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> (x &gt; <span class="hljs-number">0</span>).astype(float)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]
        self.weights = []
        self.biases = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
            limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))
            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
        activations = [X]
        zs = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
            z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        <span class="hljs-keyword">return</span> activations, zs

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
        y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
        loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
        <span class="hljs-keyword">return</span> loss

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
        X = np.asarray(X)
        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
        self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
        self.loss_history.append(initial_loss)

        <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
            <span class="hljs-comment"># shuffle datasets</span>
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            <span class="hljs-comment"># mini-batch loop</span>
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[<span class="hljs-number">-1</span>]

                delta = y_pred - y_batch
                dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
                self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
                self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db

                <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                    delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                    dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                    db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                    self.weights[l] -= self.learning_rate * dW
                    self.biases[l] -= self.learning_rate * db

            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(epoch_loss)

            <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
        <span class="hljs-keyword">return</span> self

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
        activations, _ = self._forward_pass(X)
        <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
        probabilities = self.predict_proba(X)
        <span class="hljs-keyword">return</span> (probabilities &gt;= threshold).astype(int).flatten() <span class="hljs-comment"># for 1D output</span>
</code></pre>
<h3 id="heading-training-prediction">Training / Prediction</h3>
<p>Train the model and make a prediction using training and validation datasets:</p>
<pre><code class="lang-python"><span class="hljs-comment"># 1. define the model</span>
mlp_sgd = MLP_SGD(
  hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>, ), <span class="hljs-comment"># 2 hidden layers with 30 neurons each</span>
  learning_rate=<span class="hljs-number">0.001</span>,           <span class="hljs-comment"># a step size</span>
  n_epochs=<span class="hljs-number">1000</span>,                 <span class="hljs-comment"># number of epochs</span>
  batch_size=<span class="hljs-number">32</span>                  <span class="hljs-comment"># mini-batch size</span>
)

<span class="hljs-comment"># 2. train the model</span>
mlp_sgd.fit(X_train_processed, y_train)

<span class="hljs-comment"># 3. make a prediction with training and validation datasets</span>
y_pred_train = mlp_sgd.predict(X_train_processed)
y_pred_val = mlp_sgd.predict(X_val_processed)

<span class="hljs-comment"># 4. compute evaluation metrics</span>
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

<span class="hljs-comment"># fraud-class (label 1) metrics on the validation set</span>
conf_matrix = confusion_matrix(y_val, y_pred_val)
precision = precision_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)
recall = recall_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)
f1 = f1_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)

print(<span class="hljs-string">f"\nMLP (Custom SGD) Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"MLP (Custom SGD) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-results-2">Results</h3>
<ul>
<li><p>Recall: <em>0.7930 — 0.6650 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.7790 — 0.6786 (from training to validation)</em></p>
</li>
</ul>
<p>The model learned the training patterns reasonably well, achieving a training <strong>Recall of 79.3%</strong> (it identified roughly 80% of the fraudulent transactions). Recall fell by about 13 points, to 66.5%, on the validation set.</p>
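<p>For reference, recall and precision for the fraud class come straight from the confusion matrix counts: recall = TP / (TP + FN) and precision = TP / (TP + FP). The sketch below uses made-up labels purely to show the arithmetic. They are not the article's predictions:</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# hypothetical labels, for illustration only (1 = fraud, 0 = non-fraud)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# for binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall_demo = tp / (tp + fn)       # share of actual fraud that was caught
precision_demo = tp / (tp + fp)    # share of fraud alerts that were correct

print(recall_demo, precision_demo)                 # 0.75 0.75
print(recall_score(y_true_demo, y_pred_demo),      # same values from scikit-learn
      precision_score(y_true_demo, y_pred_demo))
```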
<p><strong>Loss history:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441103897/088deb38-846d-4026-a706-701be93036ca.png" alt="Loss by epoch, weight history, bias history (Source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>We visualized the <strong>decision boundary</strong> using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442430297/032ee809-1b7e-4bb1-81c0-8715361658a5.png" alt="Image: Decision Boundary of MLP Classifier with SGD optimizer (Source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
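<p>The plotting code is not shown in this article. One common way to produce such a figure, sketched here under that assumption, is to fit PCA on the features, lay a grid over the 2D PCA plane, map the grid points back to feature space with <code>inverse_transform</code>, and classify them. The synthetic data and the scikit-learn <code>MLPClassifier</code> stand-in below are illustrative only:</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

# synthetic stand-in data and model (assumptions), not the fraud dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(30, 30), max_iter=500, random_state=0)
clf.fit(X_demo, y_demo)

# project the samples onto the first two principal components
pca = PCA(n_components=2).fit(X_demo)
X_2d = pca.transform(X_demo)

# build a grid that covers the 2D PCA plane
xs = np.linspace(X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1, 100)
ys = np.linspace(X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1, 100)
xx, yy = np.meshgrid(xs, ys)
grid_2d = np.c_[xx.ravel(), yy.ravel()]

# map grid points back to the original feature space and classify them
Z = clf.predict(pca.inverse_transform(grid_2d)).reshape(xx.shape)
print(Z.shape)  # (100, 100)
```

<p>The resulting <code>Z</code> can then be drawn with a filled contour plot, for example matplotlib's <code>contourf(xx, yy, Z)</code>, with the projected samples scattered on top.</p>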
<h3 id="heading-leverage-sckitlearns-mcp-classifier">Leverage Scikit-learn’s MLP Classifier</h3>
<p>We can use Scikit-learn’s <code>MLPClassifier</code> to define a similar model, incorporating:</p>
<ul>
<li><p><strong>Early stopping</strong> using internal validation to prevent overfitting, and</p>
</li>
<li><p><strong>L2 regularization</strong> with a small penalty strength (<code>alpha</code>).</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier

<span class="hljs-comment"># define a model</span>
model_sklearn_mlp_sgd = MLPClassifier(
    hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
    activation=<span class="hljs-string">'relu'</span>,
    solver=<span class="hljs-string">'sgd'</span>,
    learning_rate_init=<span class="hljs-number">0.001</span>,
    learning_rate=<span class="hljs-string">'constant'</span>,
    momentum=<span class="hljs-number">0.9</span>,
    nesterovs_momentum=<span class="hljs-literal">True</span>,
    alpha=<span class="hljs-number">0.00001</span>,           <span class="hljs-comment"># L2 regularization strength</span>
    max_iter=<span class="hljs-number">3000</span>,           <span class="hljs-comment"># max epochs (keep it high)</span>
    batch_size=<span class="hljs-number">16</span>,           <span class="hljs-comment"># mini-batch size</span>
    random_state=<span class="hljs-number">42</span>,
    early_stopping=<span class="hljs-literal">True</span>,     <span class="hljs-comment"># apply early stopping</span>
    n_iter_no_change=<span class="hljs-number">50</span>,     <span class="hljs-comment"># stop the iteration if internal validation score doesn't improve for 50 epochs</span>
    validation_fraction=<span class="hljs-number">0.1</span>, <span class="hljs-comment"># proportion of training data for internal validation (default is 0.1)</span>
    tol=<span class="hljs-number">1e-4</span>,                <span class="hljs-comment"># tolerance for optimization</span>
    verbose=<span class="hljs-literal">False</span>,
)

<span class="hljs-comment"># training</span>
model_sklearn_mlp_sgd.fit(X_train_processed, y_train)

<span class="hljs-comment"># make a prediction</span>
y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)
</code></pre>
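<p>To see what early stopping actually did, the fitted estimator exposes a few attributes. The sketch below uses synthetic data and a shorter patience than the article's configuration, purely to keep the demo fast. Note that <code>MLPClassifier</code> scores its internal validation split with accuracy, not Recall.</p>

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# synthetic, imbalanced stand-in data (assumption, not the article's dataset)
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=42)

model = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    solver='sgd',
    learning_rate_init=0.001,
    max_iter=300,
    batch_size=16,
    early_stopping=True,      # hold out part of the training data internally
    validation_fraction=0.1,
    n_iter_no_change=10,      # smaller patience than the article, to keep the demo fast
    tol=1e-4,
    random_state=42,
)
model.fit(X, y)

# n_iter_: epochs actually run
# validation_scores_: accuracy on the held-out split, recorded during training
print("epochs run:", model.n_iter_)
print("best internal validation accuracy:", model.best_validation_score_)
print("scores recorded:", len(model.validation_scores_))
```

<p>Because the stopping criterion is accuracy on a 10% hold-out, it may not track Recall on the minority class closely, which is worth keeping in mind when reading the results.</p>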
<h3 id="heading-results-3">Results</h3>
<ul>
<li><p>Recall: <em>0.7830 → 0.6200 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.8208 → 0.6703 (from training to validation)</em></p>
</li>
</ul>
<p>The model performed well during training, achieving a <strong>Recall of 78.30%</strong>, but its performance declined on the validation set: Recall fell to 62.00% and Precision from 82.08% to 67.03%.</p>
<p>This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.</p>
<h3 id="heading-leverage-keras-sequential-classifier">Leverage Keras Sequential Classifier</h3>
<p>For the sequential classifier, we can further enhance the classifier by:</p>
<ul>
<li><p>Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (<code>y_train</code>) to address dataset imbalance and promote faster convergence,</p>
</li>
<li><p>Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,</p>
</li>
<li><p>Including Precision and Recall in the model’s compilation metrics so that classification performance is tracked during training (Recall is also the metric monitored for early stopping),</p>
</li>
<li><p>Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and</p>
</li>
<li><p>Utilizing a separate validation dataset for monitoring performance during training to help detect overfitting and guide hyperparameter tuning.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
<span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
<span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> SGD
<span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
<span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight


<span class="hljs-comment"># calculates an initial bias for the output layer </span>
initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])


<span class="hljs-comment"># defines the model</span>
model_keras_sgd = Sequential([
    Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>), <span class="hljs-comment"># 10% of the neurons in that layer randomly dropped out</span>
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, <span class="hljs-comment"># binary classification</span>
          bias_initializer=tf.keras.initializers.Constant(initial_bias)) <span class="hljs-comment"># to address the imbalanced datasets</span>
])



<span class="hljs-comment"># compiles the model with the SGD optimizer</span>
opt = SGD(learning_rate=<span class="hljs-number">0.001</span>)
model_keras_sgd.compile(
    optimizer=opt, 
    loss=<span class="hljs-string">'binary_crossentropy'</span>,
    metrics=[
        <span class="hljs-string">'accuracy'</span>, <span class="hljs-comment"># add several metrics to return</span>
        tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
        tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
        tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
    ]
)


<span class="hljs-comment"># defines early stopping to prevent overfitting</span>
early_stopping_callback = EarlyStopping(
    monitor=<span class="hljs-string">'val_recall'</span>,  <span class="hljs-comment"># monitor recall </span>
    mode=<span class="hljs-string">'max'</span>,         <span class="hljs-comment"># maximize recall</span>
    patience=<span class="hljs-number">50</span>,        <span class="hljs-comment"># stop after 50 epochs without val_recall improvement</span>
    min_delta=<span class="hljs-number">1e-4</span>,     <span class="hljs-comment"># minimum change to be considered an improvement (tol)</span>
    verbose=<span class="hljs-number">0</span>
)


<span class="hljs-comment"># compute the class weight</span>
class_weights = class_weight.compute_class_weight(
    class_weight=<span class="hljs-string">'balanced'</span>,
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))


<span class="hljs-comment"># train the model</span>
history = model_keras_sgd.fit(
    X_train_processed, y_train,
    epochs=<span class="hljs-number">1000</span>,
    batch_size=<span class="hljs-number">32</span>,
    validation_data=(X_val_processed, y_val), <span class="hljs-comment"># use our external val set</span>
    callbacks=[early_stopping_callback], <span class="hljs-comment"># early stopping to prevent overfitting</span>
    class_weight=class_weights_dict, <span class="hljs-comment"># penalize misclassification of the minority class more heavily</span>
    verbose=<span class="hljs-number">0</span>
)

<span class="hljs-comment"># evaluate</span>
loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Train) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)

loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Validation) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)

<span class="hljs-comment"># display model summary</span>
model_keras_sgd.summary()
</code></pre>
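<p>Two of the ingredients in the code above can be checked by hand. The sketch below uses made-up labels with 5% positives (an assumption for illustration) to show that the log-odds bias maps back to the positive-class base rate through the sigmoid, and that the <code>'balanced'</code> class weights follow <code>n_samples / (n_classes * count_per_class)</code>.</p>

```python
import numpy as np
from sklearn.utils import class_weight

# toy labels (hypothetical): 5% positives, mimicking an imbalanced fraud dataset
y = np.array([1] * 50 + [0] * 950)

# log-odds of the positive class: log(n_pos / n_neg)
initial_bias = np.log(np.sum(y == 1) / np.sum(y == 0))

# if the weighted inputs to the output unit are near zero at initialization,
# its output is close to sigmoid(bias), which equals the positive-class base rate
base_rate = 1 / (1 + np.exp(-initial_bias))
print(f"initial bias: {initial_bias:.4f}")
print(f"initial predicted probability: {base_rate:.4f}")  # 0.0500

# 'balanced' weights follow n_samples / (n_classes * count_per_class)
classes = np.unique(y)
weights = class_weight.compute_class_weight(class_weight='balanced', classes=classes, y=y)
manual = len(y) / (len(classes) * np.bincount(y))
print(dict(zip(classes.tolist(), weights.tolist())))
```

<p>With this split, an error on a positive example is weighted 10.0 and an error on a negative example about 0.53, so the rare class contributes roughly as much to the loss as the common one.</p>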
<h3 id="heading-results-4">Results</h3>
<ul>
<li><p>Recall: <em>0.7125 → 0.7250 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.7607 → 0.7545 (from training to validation)</em></p>
</li>
</ul>
<p>Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.</p>
<p>It suggests that the regularization techniques are likely effective in preventing significant overfitting.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441165170/4e0528e3-514a-454c-b52a-2a0318ba405a.png" alt="Image: Summary of the Keras Sequential Model with SGD Optimizer" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-build-an-mlp-classifier-with-adam-optimizer">How to Build an MLP Classifier with Adam Optimizer</h2>
<h3 id="heading-custom-classifier-1">Custom Classifier</h3>
<p>Adam updates the weights and biases once per mini-batch, inside the mini-batch loop. For the output layer, the update looks like this:</p>
<pre><code class="lang-python"><span class="hljs-comment"># apply Adam updates for output layer parameters</span>
<span class="hljs-comment"># 1) weights (w)</span>
self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

<span class="hljs-comment"># 2) bias (b)</span>
self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
</code></pre>
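<p>The same update can be written as a small standalone function, which makes the role of bias correction easier to see. This is a minimal sketch with hypothetical names (<code>adam_step</code> is not part of the article's class). On the first step (<code>t=1</code>) the corrected moments reduce to the gradient and its square, so every parameter moves by roughly the learning rate regardless of the gradient's scale.</p>

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (moving average of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (moving average of squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero-initialized m
    v_hat = v / (1 - beta2 ** t)                # bias correction for the zero-initialized v
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# two parameters whose gradients differ by four orders of magnitude
w = np.array([1.0, 1.0])
grad = np.array([100.0, -0.01])
m = np.zeros_like(w)
v = np.zeros_like(w)

w_new, m, v = adam_step(w, grad, m, v, t=1)
print(w - w_new)  # each entry has magnitude close to lr = 0.001
```

<p>Plain SGD with the same learning rate would have moved the first parameter 10,000 times further than the second; Adam's per-parameter scaling is what evens this out.</p>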
<p>The final classifier reuses the forward and backward passes of the <code>MLP_SGD</code> architecture and adds Adam’s hyperparameters <code>beta1</code>, <code>beta2</code>, and <code>epsilon</code>, along with first and second moment estimates for every layer:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_Adam</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span>,
                 beta1=<span class="hljs-number">0.9</span>, beta2=<span class="hljs-number">0.999</span>, epsilon=<span class="hljs-number">1e-8</span></span>):</span>
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        self.weights = [] 
        self.biases = []

        <span class="hljs-comment"># Adam optimizer internal states for each parameter (weights and biases)</span>
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        s = self._sigmoid(x)
        <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> (x &gt; <span class="hljs-number">0</span>).astype(float)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]

        self.weights = []
        self.biases = []
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
            limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))

            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))

            self.m_weights.append(np.zeros((fan_in, fan_out)))
            self.v_weights.append(np.zeros((fan_in, fan_out)))
            self.m_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
            self.v_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
        activations = [X]
        zs = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
            z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        <span class="hljs-keyword">return</span> activations, zs

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
        y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
        loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
        <span class="hljs-keyword">return</span> loss

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
        X = np.asarray(X)

        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
        self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
        self.loss_history.append(initial_loss)

        <span class="hljs-comment"># global time step for Adam bias correction</span>
        t = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            <span class="hljs-comment"># Mini-batch loop</span>
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                t += <span class="hljs-number">1</span>

                <span class="hljs-comment"># 1. forward pass</span>
                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[<span class="hljs-number">-1</span>] <span class="hljs-comment"># Output of the network</span>

                <span class="hljs-comment"># 2. backpropagation</span>
                delta = y_pred - y_batch
                grad_w_output = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>] <span class="hljs-comment"># Average over batch</span>
                grad_b_output = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                <span class="hljs-comment"># snapshot the pre-update weights so deltas are propagated through the weights used in the forward pass</span>
                weights_prev = [w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights]

                <span class="hljs-comment"># apply Adam updates to weights</span>
                self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
                self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
                m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                <span class="hljs-comment"># apply Adam updates to bias</span>
                self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
                self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
                m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


                <span class="hljs-comment"># Propagate gradients backward through hidden layers</span>
                <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                    delta = np.dot(delta, weights_prev[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z), using pre-update weights</span>
                    grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                    grad_b_hidden = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                    <span class="hljs-comment"># apply Adam updates to weights</span>
                    self.m_weights[l] = self.beta1 * self.m_weights[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_hidden
                    self.v_weights[l] = self.beta2 * self.v_weights[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_hidden ** <span class="hljs-number">2</span>)
                    m_w_hat = self.m_weights[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_w_hat = self.v_weights[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                    <span class="hljs-comment"># apply Adam updates to bias</span>
                    self.m_biases[l] = self.beta1 * self.m_biases[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_hidden
                    self.v_biases[l] = self.beta2 * self.v_biases[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_hidden ** <span class="hljs-number">2</span>)
                    m_b_hat = self.m_biases[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_b_hat = self.v_biases[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(epoch_loss)

            <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
        <span class="hljs-keyword">return</span> self


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
        activations, _ = self._forward_pass(X)
        <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
        probabilities = self.predict_proba(X)
        <span class="hljs-keyword">return</span> (probabilities &gt;= threshold).astype(int).flatten()
</code></pre>
<h3 id="heading-training-prediction-1">Training / Prediction</h3>
<p>Train the model and make a prediction using training and validation datasets:</p>
<pre><code class="lang-python">mlp_adam = MLP_Adam(hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">10</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">500</span>, batch_size=<span class="hljs-number">32</span>)
mlp_adam.fit(X_train_processed, y_train)

y_pred_train = mlp_adam.predict(X_train_processed)
y_pred_val = mlp_adam.predict(X_val_processed)

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

print(<span class="hljs-string">f"\nMLP (Custom Adam) Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"MLP (Custom Adam) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-results-5">Results</h3>
<ul>
<li><p>Recall: <em>0.9870 → 0.6150 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.9811 → 0.6474 (from training to validation)</em></p>
</li>
</ul>
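<p>While the Adam optimizer fit the training data far better than SGD, the model overfit significantly: Recall fell by about 37 points and Precision by about 33 points between training and validation, and the validation scores are slightly below those of the custom SGD model (Recall 0.6650, Precision 0.6786).</p>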
<p><strong>Loss History</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442341394/3183a9b1-5df0-4f74-9473-6b5b595dc9c0.png" alt="Loss by epoch, middle: weights history by epoch, right: bias history by epoch (source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442311514/34f004c9-bf1d-41e5-a0af-08c62802b78c.png" alt="Decision Boundary of MLP with Adam Optimizer (source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
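<h3 id="heading-leverage-sckitlearns-mcp-classifier-1">Leverage Scikit-learn’s MLP Classifier</h3>
<p>We’ve switched the optimizer from SGD to Adam and kept the other settings the same, with two exceptions: the SGD-only momentum options are dropped, and <code>alpha</code> is 0.0001 here rather than 0.00001:</p>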
<pre><code class="lang-python">model_sklearn_mlp_adam = MLPClassifier(
    hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
    activation=<span class="hljs-string">'relu'</span>,
    solver=<span class="hljs-string">'adam'</span>,             <span class="hljs-comment"># update the optimizer from SGD to Adam</span>
    learning_rate_init=<span class="hljs-number">0.001</span>,
    learning_rate=<span class="hljs-string">'constant'</span>,
    alpha=<span class="hljs-number">0.0001</span>,
    max_iter=<span class="hljs-number">3000</span>,
    batch_size=<span class="hljs-number">16</span>,
    random_state=<span class="hljs-number">42</span>,
    early_stopping=<span class="hljs-literal">True</span>,
    n_iter_no_change=<span class="hljs-number">50</span>,
    validation_fraction=<span class="hljs-number">0.1</span>,
    tol=<span class="hljs-number">1e-4</span>,
    verbose=<span class="hljs-literal">False</span>,
)

model_sklearn_mlp_adam.fit(X_train_processed, y_train)

y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)
</code></pre>
<h3 id="heading-results-6">Results</h3>
<ul>
<li><p>Recall: <em>0.8975 → 0.6400 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.8864 → 0.6305 (from training to validation)</em></p>
</li>
</ul>
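<p>Training scores improved compared to the SGD optimizer, but the sharp drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation indicates that the model is still overfitting. On the validation set, Recall is only slightly higher than with SGD (0.6400 versus 0.6200), while Precision is lower (0.6305 versus 0.6703).</p>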
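<p>To compare the models reported so far on equal footing, the train-to-validation gaps can be computed directly from the numbers in the Results sections above. The snippet below is plain arithmetic on those reported scores; the dictionary layout and model labels are only an illustration.</p>

```python
# (train, validation) scores as reported in the Results sections above
results = {
    "Custom MLP (SGD)":       {"recall": (0.7930, 0.6650), "precision": (0.7790, 0.6786)},
    "sklearn MLP (SGD)":      {"recall": (0.7830, 0.6200), "precision": (0.8208, 0.6703)},
    "Keras Sequential (SGD)": {"recall": (0.7125, 0.7250), "precision": (0.7607, 0.7545)},
    "Custom MLP (Adam)":      {"recall": (0.9870, 0.6150), "precision": (0.9811, 0.6474)},
    "sklearn MLP (Adam)":     {"recall": (0.8975, 0.6400), "precision": (0.8864, 0.6305)},
}

# gap in percentage points: positive means the score dropped on validation
gaps = {}
for name, scores in results.items():
    gaps[name] = {
        metric: round((train - val) * 100, 2)
        for metric, (train, val) in scores.items()
    }
    print(name.ljust(24),
          f"recall gap: {gaps[name]['recall']:6.2f}",
          f"precision gap: {gaps[name]['precision']:6.2f}")

# model with the largest Recall gap
worst = max(gaps, key=lambda name: gaps[name]["recall"])
print("largest Recall gap:", worst)
```

<p>Among these five runs, the custom Adam model shows the largest gap, and the Keras model with dropout, class weights, and an external validation set shows the smallest.</p>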
<h3 id="heading-leverage-keras-sequential-classifier-1">Leverage Keras Sequential Classifier</h3>
<p>Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
<span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
<span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> Adam
<span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
<span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight


initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])
model_keras_adam = Sequential([
    Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, 
          bias_initializer=tf.keras.initializers.Constant(initial_bias))
])


optimizer_keras = Adam(learning_rate=<span class="hljs-number">0.001</span>)
model_keras_adam.compile(
    optimizer=optimizer_keras, 
    loss=<span class="hljs-string">'binary_crossentropy'</span>, 
    metrics=[
        <span class="hljs-string">'accuracy'</span>,
        tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
        tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
        tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
    ]
)

early_stopping_callback = EarlyStopping(
    monitor=<span class="hljs-string">'val_recall'</span>,
    mode=<span class="hljs-string">'max'</span>,
    patience=<span class="hljs-number">50</span>,
    min_delta=<span class="hljs-number">1e-4</span>,
    verbose=<span class="hljs-number">0</span>
)

class_weights = class_weight.compute_class_weight(
    class_weight=<span class="hljs-string">'balanced'</span>,
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

model_keras_adam.fit(
    X_train_processed, y_train,
    epochs=<span class="hljs-number">1000</span>,
    batch_size=<span class="hljs-number">32</span>,
    validation_data=(X_val_processed, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict,
    verbose=<span class="hljs-number">0</span>
)


loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Train) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)


loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Validation) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)


model_keras_adam.summary()
</code></pre>
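<p>To make the <code>class_weight='balanced'</code> setting and the <code>initial_bias</code> term in the code above more concrete, here is a minimal sketch of the arithmetic behind them. The labels are hypothetical (8 negatives, 2 positives) and are not the article’s dataset. The weight formula follows the "balanced" heuristic documented by scikit-learn.</p>

```python
import math
from collections import Counter

# Hypothetical, imbalanced binary labels (not the article's dataset)
y_train = [0] * 8 + [1] * 2

counts = Counter(y_train)
n_samples = len(y_train)
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_that_class)
class_weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}
print(class_weights)

# Initial output bias: log(positives / negatives), so that sigmoid(bias)
# equals the positive-class rate before any training happens
initial_bias = math.log(counts[1] / counts[0])
positive_rate = 1 / (1 + math.exp(-initial_bias))
print(round(initial_bias, 4), round(positive_rate, 4))
```

<p>The minority class receives the larger weight, and the bias starts the sigmoid output at the observed positive rate instead of 0.5.</p>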
<h3 id="heading-results-7">Results</h3>
<ul>
<li><p><em>Recall: 0.7995–0.7500 (from training to validation)</em></p>
</li>
<li><p><em>Precision: 0.8409–0.8065 (from training to validation)</em></p>
</li>
</ul>
<p>The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).</p>
<p>This indicates good generalization, with only minor performance degradation on unseen data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441767800/fe43f181-4323-461f-b56a-125fc78e9c84.png" alt="Image: Keras Sequential Model with Adam Optimizer (Source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
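<p>As a reminder of what the two reported metrics measure, the sketch below computes precision and recall from raw predictions. The labels and predictions are hypothetical and chosen only to illustrate the definitions. They are not outputs of the models in this article.</p>

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: share of predicted positives that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```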
<h2 id="heading-final-results-generalization">Final Results: Generalization</h2>
<p>Finally, we’ll evaluate the model’s ultimate performance on the test dataset, which has remained completely separate from all prior training and validation processes.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Custom classifiers</span>
y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

<span class="hljs-comment"># MLPClassifier</span>
y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

<span class="hljs-comment"># Keras Sequential</span>
_, accuracy_val_sgd, precision_val_sgd, recall_val_sgd, auc_val_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=<span class="hljs-number">0</span>)
_, accuracy_val_adam, precision_val_adam, recall_val_adam, auc_val_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=<span class="hljs-number">0</span>)
</code></pre>
<p>Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an <strong>AUPRC (Area Under Precision-Recall Curve) of 0.72.</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748874699534/f0f008c4-9067-4e2a-b070-4bb5cbae8f23.png" alt="Precision-Recall Curves for Six Classifier Models: Comparing Custom, MLP, and Keras Sequential Classifiers with SGD and Adam Optimizers (Source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
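<p>One common way to summarize a precision-recall curve as a single number is average precision, a step-wise estimate of the area under the curve. The sketch below is an illustration with hypothetical labels and scores. It is not necessarily how the AUPRC values in the figure were computed.</p>

```python
def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve (ties in scores are not handled)."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("y_true must contain at least one positive label")
    # Rank examples from the most to the least confident positive prediction
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at this rank, weighted by the recall gained (1 / total_positives)
            ap += (true_positives / rank) / total_positives
    return ap

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(round(average_precision(y_true, y_score), 4))
```

<p>A higher value means the model keeps precision high as recall increases, which is why this metric is often preferred over accuracy on imbalanced data.</p>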
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.</p>
<p>Our findings underscore that effective machine learning hinges on three critical factors:</p>
<ol>
<li><p><strong>robust data preprocessing</strong> (tailored to objectives and data distribution),</p>
</li>
<li><p><strong>judicious model selection</strong>, and</p>
</li>
<li><p><strong>strategic framework or library choices</strong>.</p>
</li>
</ol>
<h3 id="heading-choosing-the-right-framework"><strong>Choosing the right framework</strong></h3>
<p>Generally speaking, choose <code>MLPClassifier</code> when:</p>
<ul>
<li><p>You’re primarily working with <strong>tabular data,</strong></p>
</li>
<li><p>You want to prioritize <strong>simplicity, quick iteration, and seamless integration,</strong></p>
</li>
<li><p>You have simple, shallow architectures, and</p>
</li>
<li><p>You have a moderate dataset size (manageable on a CPU).</p>
</li>
</ul>
<p>Choose Keras <code>Sequential</code> when:</p>
<ul>
<li><p>You’re dealing with <strong>image, text, audio, or other sequential data,</strong></p>
</li>
<li><p>You’re building <strong>deep learning models</strong> such as CNNs, RNNs, LSTMs,</p>
</li>
<li><p>You need <strong>fine-grained control</strong> over the model architecture, training process, or custom components,</p>
</li>
<li><p>You need to leverage <strong>GPU acceleration</strong>,</p>
</li>
<li><p>You’re planning for <strong>production deployment</strong>, and</p>
</li>
<li><p>You want to experiment with more advanced deep learning techniques.</p>
</li>
</ul>
<h3 id="heading-limitation-of-mlps">Limitations of MLPs</h3>
<p>While Multilayer Perceptrons (MLPs) proved valuable, their computational cost and susceptibility to overfitting emerged as key challenges.</p>
<p>Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.</p>
<p>You can find more info about me on my <a target="_blank" href="https://kuriko.vercel.app/">Portfolio</a> / <a target="_blank" href="https://www.linkedin.com/in/k-i-i">LinkedIn</a> / <a target="_blank" href="https://github.com/versionhq/multi-agent-system">GitHub</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn Python for Data Science – Full Course for Beginners ]]>
                </title>
                <description>
                    <![CDATA[ If you're interested in data science but not sure where to begin, Python is a great starting point. It’s easy to pick up and has a bunch of libraries that make working with data a lot easier. We just published a course on the freeCodeCamp.org YouTube... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-python-for-data-science-full-course/</link>
                <guid isPermaLink="false">6838a96337f76032ee644b85</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 29 May 2025 18:37:23 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748543757666/140c2e3b-a0f6-4ef1-b50f-0c6b6c86ecb1.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you're interested in data science but not sure where to begin, Python is a great starting point. It’s easy to pick up and has a bunch of libraries that make working with data a lot easier.</p>
<p>We just published a course on the <a target="_blank" href="http://freeCodeCamp.org">freeCodeCamp.org</a> YouTube channel that teaches you how to do data science using Python. Frank Andrade developed this course.</p>
<p>It starts with installation and setup, then covers Python fundamentals so you’re not lost if you’ve never coded before. From there, it gets into two of the most commonly used libraries in data science: Pandas and NumPy. Pandas helps you work with tables of data (think spreadsheets, but in Python), and NumPy is great for doing math on that data.</p>
<p>You’ll get to apply what you're learning right away with hands-on projects. The first one shows you how to scrape data from websites using Pandas. Then you’ll learn how to filter and clean that data, reshape it, and create pivot tables. There's also a project where you’ll build charts and graphs so you can actually visualize what the data is telling you. You’ll use real datasets and build things like bar charts and scatter plots to explore trends and patterns.</p>
<p>Once you're comfortable with those basics, the course introduces more useful techniques like using <code>groupby</code> and aggregate functions, combining different data sets, and using regular expressions to pull out specific patterns from text. These are skills you’ll need for any real data job, or even if you’re just trying to make sense of a big messy spreadsheet.</p>
<p>Later in the course, you'll start working with machine learning. It’s not super advanced, but it gives you a solid first look at how it works. You’ll use scikit-learn to build a simple text classification model. Basically, you’ll train a program to read some text and decide what category it belongs to. Think spam vs. not spam, or positive vs. negative reviews.</p>
<p>If you're new to data science and want to actually try things instead of just reading about them, this course is a solid pick. Everything is broken into small, manageable sections, and the projects help the ideas stick. It’s free, it’s on YouTube, and you can follow along at your own pace.</p>
<p>Are you ready to learn Data Science with Python? Watch the <a target="_blank" href="https://www.youtube.com/watch?v=CMEWVn1uZpQ">full course on the freeCodeCamp.org YouTube channel</a> (17-hour watch):</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/CMEWVn1uZpQ" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Extract YouTube Analytics Data and Analyze in Python ]]>
                </title>
                <description>
                    <![CDATA[ If you’re a YouTube content creator, you’ll make data-driven decisions when posting content. This helps you target the right audience when creating your videos. YouTube Studio provides YouTube Analytics, where you can get comprehensive data about you... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/extract-youtube-analytics-data-and-analyze-in-python/</link>
                <guid isPermaLink="false">67e425c92a171465d4fb4cce</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Adejumo Ridwan Suleiman ]]>
                </dc:creator>
                <pubDate>Wed, 26 Mar 2025 16:05:29 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743005089726/39e2323d-8f7b-4bf4-94cb-288aeb9cea4f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’re a YouTube content creator, making data-driven decisions about what you post helps you target the right audience for your videos.</p>
<p>YouTube Studio provides YouTube Analytics, where you can get comprehensive data about your channel. But there is a caveat: most of the statistics provided by YouTube Analytics are descriptive and not predictive. Information like future views, future subscriber counts, and the factors influencing watch time or earnings is unavailable, so you’ll need to calculate these metrics yourself.</p>
<p>In this article, you’ll learn how to export data from YouTube Analytics to Python so you can analyze it further or create visualizations. You can even build your own custom dashboard using various Python libraries like <a target="_blank" href="https://streamlit.io/">Streamlit</a>, <a target="_blank" href="https://shiny.posit.co/py/">Shiny</a>, or <a target="_blank" href="https://dash.plotly.com/">Dash</a>.</p>
<h3 id="heading-heres-what-we">Here’s what we’ll cover</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-identify-the-problem-statement">Step 1: Identify the Problem Statement</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-extract-the-data">Step 2: Extract the Data</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-analyze-the-data-in-python">Step 3: Analyze the Data in Python</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-correlation-analysis">Correlation Analysis</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-audience-retention-analysis">Audience Retention Analysis</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>Active YouTube and YouTube Studio Account</p>
</li>
<li><p>Jupyter Notebook, Google Colab, Kaggle, or any other environment that supports Python</p>
</li>
<li><p><a target="_blank" href="https://pandas.pydata.org/">Pandas</a> library installed</p>
</li>
<li><p><a target="_blank" href="https://seaborn.pydata.org/">Seaborn</a> library installed</p>
</li>
<li><p><a target="_blank" href="https://matplotlib.org/">Matplotlib</a> library installed</p>
</li>
</ul>
<h2 id="heading-step-1-identify-the-problem-statement">Step 1: Identify the Problem Statement</h2>
<p>Before proceeding, we need to know what we’re looking for – because YouTube Analytics has many metrics, and this can get overwhelming. My channel doesn’t have a ton of subscribers, but I have quite a few videos and views. So we’ll use my data as an example.</p>
<p>Just note that the analysis I’ll conduct in this tutorial is specific to my channel, and results can vary from channel to channel. You’ll be able to use the techniques here to answer the same or similar questions using your data, but your results will be different from mine.</p>
<p>Here are the questions I would like to find an answer for:</p>
<ol>
<li><strong>Correlation Analysis</strong></li>
</ol>
<ul>
<li><p><strong>Views and watch time</strong> – Are longer watch times associated with higher views?</p>
</li>
<li><p><strong>Views and subscribers</strong> – Do more views translate to more subscribers?</p>
</li>
<li><p><strong>Impressions and Click-Through Rate (CTR%) –</strong> Do more impressions lead to better engagement?</p>
</li>
<li><p><strong>Watch time and average view duration</strong> – Are longer videos watched more?</p>
</li>
</ul>
<ol start="2">
<li><strong>Audience Retention Analysis</strong></li>
</ol>
<ul>
<li><p><strong>Average view duration vs. Video length</strong> – Are longer videos watched in full?</p>
</li>
<li><p><strong>Drop-off points</strong> – Which duration range has the best retention?</p>
</li>
<li><p><strong>Retention Rate (%)</strong> – Watch time divided by duration?</p>
</li>
</ul>
<h2 id="heading-step-2-extract-the-data">Step 2: Extract the Data</h2>
<p>Sign in to your YouTube Studio account, go to the Analytics tab, and click Advanced mode.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742548010236/1392de34-a280-4117-9a3d-feda80392f62.png" alt="Image showing YouTube Analytics Dashboard and the Advanced Mode" class="image--center mx-auto" width="1920" height="927" loading="lazy"></p>
<p>This will open a dashboard showing comprehensive descriptive analytics of your YouTube channel. This can get overwhelming, as there are a lot of metrics and filters with various types of data. This is why I emphasized the importance of knowing your problem and identifying your questions before diving in.</p>
<p>You can select the range of data you are interested in using the date dropdown (1 in the image below) and the Compare to button (2) to compare data from different date ranges.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742548329162/3b8be0ea-769a-4723-b427-f911b3cfec83.png" alt="Image showing the date dropdown and the Compare to button" class="image--center mx-auto" width="1914" height="904" loading="lazy"></p>
<p>The column headers you see in the dashboard are the filters. Each contains different metrics, and you can find some metrics in one or more filters. You can play around with the tabs and dropdowns to understand them better.</p>
<p>This is just a foundation for understanding your YouTube channel performance. If you have a long-running channel with a large number of subscribers and views, trust me – you can get a lot of insights from your data.</p>
<p>For this tutorial, I will select my entire lifetime data (1) and click the download button at the top right-hand corner (2).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742548442210/8fbddcac-98cb-4e52-9355-5383e6afc172.png" alt="Image showing the lifetime option under the date dropdown" class="image--center mx-auto" width="1900" height="915" loading="lazy"></p>
<p>This will display two options: whether to open the data in Google Sheets in a new tab or download the CSV file.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742548490620/c8829a2b-228b-45fd-8789-45dfb397f2da.png" alt="Image showing the download options to open the data in a google sheets new tab or download the csv" class="image--center mx-auto" width="718" height="474" loading="lazy"></p>
<p>Since we want to use the data in Python, select the option to download the CSV file. After downloading the file, extract the files from the zip folder, and inside the extracted folder, you will see three CSV files: <code>Chart data.csv</code>, <code>Table data.csv</code>, and <code>Totals.csv</code>.</p>
<p>For this tutorial, we are interested in the <code>Table data.csv</code>. Click the data to open and view it in Excel to do some manual data cleaning before importing the data in Python.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742548741025/ace69aaf-bb0e-40de-aa1e-e716bb4182aa.png" alt="Image showing the Table data in Excel" class="image--center mx-auto" width="1891" height="604" loading="lazy"></p>
<p>The data is a list of all the videos on my YouTube channel, which is forty (yours might have more or fewer). Remove the first row, which is the <code>Total</code> row, and save the changes.</p>
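<p>If you prefer to skip the manual edit in Excel, the same cleanup can be done in Pandas. The sketch below uses a small inline CSV as a hypothetical stand-in for <code>Table data.csv</code>, and it assumes the summary row is labelled <code>Total</code> in the <code>Content</code> column. Check your own export before relying on that.</p>

```python
import io
import pandas as pd

# Hypothetical stand-in for "Table data.csv"; the real export has more columns
csv_text = """Content,Video title,Views
Total,,1500
abc123,Intro to pandas,900
def456,Plotting basics,600
"""

df = pd.read_csv(io.StringIO(csv_text))

# Keep every row except the summary row (assumed to be labelled "Total")
df = df[df["Content"] != "Total"].reset_index(drop=True)
print(df)
```

<p>Filtering by label rather than by position avoids dropping a real video if the export ever places the summary row somewhere else.</p>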
<p>Here are the columns in the dataset:</p>
<ul>
<li><p><code>Content</code>: The video id</p>
</li>
<li><p><code>Video title</code>: The video title</p>
</li>
<li><p><code>Video publish time</code>: The day the video was published</p>
</li>
<li><p><code>Duration</code>: The video duration in seconds</p>
</li>
<li><p><code>Views</code>: The number of views per video</p>
</li>
<li><p><code>Watch time</code>: The estimated amount of video watch time by your audience in hours</p>
</li>
<li><p><code>Subscribers</code>: Change in total subscribers found by subtracting subscribers lost from subscribers gained for the selected date and region.</p>
</li>
<li><p><code>Average view duration</code>: Estimated average minutes watched per video.</p>
</li>
<li><p><code>Impressions</code>: Number of times your videos were shown to viewers.</p>
</li>
<li><p><code>Impressions click-through rate (%)</code>: Percentage of impressions that resulted in a viewer clicking through to watch your video.</p>
</li>
</ul>
<h2 id="heading-step-3-analyze-the-data-in-python">Step 3: Analyze the Data in Python</h2>
<p>Go to your Jupyter Notebook and import the Pandas, Seaborn, and Matplotlib libraries.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
</code></pre>
<p>Next, import the <code>Table data.csv</code> file.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Load data</span>
df = pd.read_csv(<span class="hljs-string">"/content/Table data.csv"</span>)
</code></pre>
<h3 id="heading-correlation-analysis">Correlation Analysis</h3>
<p>To address the questions in our problem statement, we are going to plot a <a target="_blank" href="https://www.quanthub.com/how-to-read-a-correlation-heatmap/">correlation heatmap</a> of the following variables: <code>Views</code>, <code>Watch time (hours)</code>, <code>Subscribers</code>, <code>Average view duration</code>, <code>Impressions</code>, and <code>Impressions click-through rate (%)</code> to see the strength and direction of the relationship between them.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Convert "Average view duration" (formatted as H:M:S) to seconds</span>
df[<span class="hljs-string">'Average view duration'</span>] = pd.to_timedelta(df[<span class="hljs-string">'Average view duration'</span>]).dt.total_seconds()

<span class="hljs-comment"># Select relevant columns for correlation analysis</span>
correlation_data = df[[<span class="hljs-string">'Views'</span>, <span class="hljs-string">'Watch time (hours)'</span>, <span class="hljs-string">'Subscribers'</span>, <span class="hljs-string">'Average view duration'</span>, <span class="hljs-string">'Impressions'</span>, <span class="hljs-string">'Impressions click-through rate (%)'</span>]]

<span class="hljs-comment"># Compute correlation matrix</span>
corr_matrix = correlation_data.corr()

<span class="hljs-comment"># Visualization using a heatmap</span>
plt.figure(figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">6</span>))
sns.heatmap(corr_matrix, annot=<span class="hljs-literal">True</span>, cmap=<span class="hljs-string">'coolwarm'</span>, fmt=<span class="hljs-string">".2f"</span>, linewidths=<span class="hljs-number">0.5</span>)
plt.title(<span class="hljs-string">"YouTube Analytics Correlation Heatmap"</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742632975699/427811d8-09ca-4a8d-8fdc-98cdaf5b7033.png" alt="Correlation heatmap showing the relationship between the selected variables" class="image--center mx-auto" width="1174" height="913" loading="lazy"></p>
<p>The correlation coefficient ranges from -1 to 1. Values below 0 indicate a negative relationship, while values above 0 indicate a positive one. The closer the value is to -1 or 1, the stronger the relationship, and values near 0 indicate little or no linear relationship.</p>
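<p>To see where these numbers come from, here is a small sketch of the Pearson correlation coefficient, which is the default method used by Pandas’ <code>.corr()</code>. The values below are hypothetical and only illustrate a strong positive and a perfect negative relationship.</p>

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of co-deviations, and each variable's squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, not taken from the channel data
views = [100, 200, 300, 400, 500]
watch_hours = [5, 9, 16, 19, 26]   # rises with views
drop_off = [50, 40, 30, 20, 10]    # falls as views rise

print(round(pearson(views, watch_hours), 2))  # close to 1: strong positive
print(round(pearson(views, drop_off), 2))     # -1.0: perfect negative
```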
<p>Based on the plot above, here are the key insights:</p>
<ul>
<li><p><strong>Views and watch time</strong>: There's a strong correlation (0.94) between views and watch time, suggesting that as videos get more views, they also accumulate more watch hours, proportionally.</p>
</li>
<li><p><strong>Views and impressions</strong>: There's a strong correlation (0.89) between views and impressions, indicating that videos that are shown more frequently in recommendations and search results tend to get more views.</p>
</li>
<li><p><strong>Average view duration</strong>: This metric has very weak correlations with almost all other metrics, most notably views (0.06), subscribers (0.01), and impressions (0.03).</p>
</li>
<li><p><strong>Subscribers and metrics</strong>: Subscribers have a moderate to strong correlation with views (0.75) and impressions (0.79) and a weaker correlation with click-through rate (0.54).</p>
</li>
<li><p><strong>Click-through rate</strong>: Has moderate correlations with views (0.69) and watch time (0.66) but a weaker correlation with subscribers (0.54).</p>
</li>
</ul>
<p>The most significant insight is that average view duration appears to operate independently from other metrics. This suggests that on my YouTube channel, a video's ability to retain viewers throughout its length isn't necessarily connected to how many people watch it, how often it's recommended, or how many subscribers the channel has.</p>
<p>This implies that the strategies I would implement to increase my views, subscribers, and impressions might differ from those needed to improve average view duration, an important factor in YouTube's recommendation algorithm. This means I need to look at other YouTube metrics that have a relationship with average view duration, which is a topic for another article.</p>
<h3 id="heading-audience-retention-analysis">Audience Retention Analysis</h3>
<p>To analyze audience retention, we need to create a new variable <code>Retention Rate (%)</code>, which is calculated by dividing a video’s <code>Average view duration</code> by the <code>Duration</code> and expressing it as a percentage.</p>
<pre><code class="lang-python">
<span class="hljs-comment"># Calculate retention rate as (Average View Duration / Total Video Duration) * 100</span>
df[<span class="hljs-string">'Retention Rate (%)'</span>] = (df[<span class="hljs-string">'Average view duration'</span>] / df[<span class="hljs-string">'Duration'</span>]) * <span class="hljs-number">100</span>
</code></pre>
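<p>As a worked example of the formula, the sketch below computes the retention rate for a single hypothetical video. The helper names <code>hms_to_seconds</code> and <code>retention_rate</code> are made up for illustration and are not part of Pandas or the original analysis.</p>

```python
def hms_to_seconds(text):
    """Convert an "H:M:S" string such as "0:02:30" to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def retention_rate(average_view_duration, duration_seconds):
    """Share of the video watched on average, as a percentage."""
    if duration_seconds == 0:
        raise ValueError("duration must be positive")
    return hms_to_seconds(average_view_duration) / duration_seconds * 100

# Hypothetical video: 503 seconds long, watched for 2 min 30 s on average
print(round(retention_rate("0:02:30", 503), 1))
```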
<p>Next, sort the videos in descending order based on <code>Retention Rate (%)</code> and display the top 10 videos with the highest retention rate.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Sort videos by retention rate</span>
df_sorted = df.sort_values(by=<span class="hljs-string">'Retention Rate (%)'</span>, ascending=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Display top 10 videos with highest retention</span>
df_sorted[[<span class="hljs-string">'Video title'</span>, <span class="hljs-string">'Duration'</span>, <span class="hljs-string">'Average view duration'</span>, <span class="hljs-string">'Retention Rate (%)'</span>]].head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742634265073/fc5bac65-18f3-467a-a8da-85f95ae00488.png" alt="Image showing top ten videos by retention rate" class="image--center mx-auto" width="1194" height="550" loading="lazy"></p>
<p>From the table above, you will notice that most of the videos in the top 10 are no longer than 503 seconds, which is approximately 8 minutes. This implies that my audience is interested in short to mid-length videos.</p>
<p>Most videos with a high retention rate have a duration of less than 4 minutes, with retention rates ranging from 27% to 40%. With this insight, I can aim to keep my next uploads within 5 to 8 minutes.</p>
<p>Let’s take a look at the bottom 10 videos with a low retention rate:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Sort videos by retention rate</span>
df_sorted = df.sort_values(by=<span class="hljs-string">'Retention Rate (%)'</span>, ascending=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Display bottom 10 videos with lowest retention</span>
df_sorted[[<span class="hljs-string">'Video title'</span>, <span class="hljs-string">'Duration'</span>, <span class="hljs-string">'Average view duration'</span>, <span class="hljs-string">'Retention Rate (%)'</span>]].tail(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742634531458/28b1d8e8-38d9-480e-8259-a30f659386a3.png" alt="Image showing bottom ten videos by retention rate" class="image--center mx-auto" width="1168" height="538" loading="lazy"></p>
<p>From the above information, you will notice that long videos in my channel spanning approximately 22 - 58 minutes have a low retention rate. This further supports the claim above that my audience is more interested in shorter videos.</p>
<p>We can also create a scatter plot of <code>Duration</code> against <code>Retention Rate (%)</code> to summarize the above tables.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Set style for plots</span>
sns.set_style(<span class="hljs-string">"whitegrid"</span>)

<span class="hljs-comment"># Plot Retention Rate vs. Video Duration</span>
plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))

sns.scatterplot(data=df, x=<span class="hljs-string">'Duration'</span>, y=<span class="hljs-string">'Retention Rate (%)'</span>, hue=<span class="hljs-string">'Views'</span>, size=<span class="hljs-string">'Views'</span>, sizes=(<span class="hljs-number">20</span>, <span class="hljs-number">200</span>), palette=<span class="hljs-string">'coolwarm'</span>)
plt.title(<span class="hljs-string">"Audience Retention vs. Video Duration"</span>)
plt.xlabel(<span class="hljs-string">"Video Duration (seconds)"</span>)
plt.ylabel(<span class="hljs-string">"Retention Rate (%)"</span>)
plt.legend(title=<span class="hljs-string">"Views"</span>, loc=<span class="hljs-string">"upper right"</span>)

plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742634776775/e024b61c-d86f-45d6-b8fb-13ff87e101e9.png" alt="Scatter plot showing audience retention against video duration" class="image--center mx-auto" width="1486" height="820" loading="lazy"></p>
<p>The <a target="_blank" href="https://byjus.com/commerce/scatter-diagram/">scatter plot</a> above shows the relationship between audience retention rate (y-axis, measured as a percentage) and video duration (x-axis, measured in seconds) for various videos. Here are the key observations:</p>
<ul>
<li><p>There's a clear negative correlation between video duration and retention rate – as videos get longer, the retention rate generally decreases.</p>
</li>
<li><p>The highest retention rates (35-40%) are found in shorter videos, mostly under 500 seconds (around 8 minutes).</p>
</li>
<li><p>Videos over 1500 seconds (25 minutes) consistently show retention rates below 15%.</p>
</li>
<li><p>The size and color of the dots represent the number of views: larger, redder dots indicate more views (up to 1000), and smaller, bluer dots indicate fewer views (around 200).</p>
</li>
<li><p>Interestingly, some mid-length videos (around 500 seconds) have both higher view counts (indicated by larger red dots) and decent retention rates of about 25%.</p>
</li>
<li><p>The longest video in the dataset (at around 3500 seconds or 58 minutes) has a retention rate of about 14% and relatively few views.</p>
</li>
</ul>
<p>This plot further confirms the claim that shorter videos tend to better maintain audience attention on my channel, though some mid-length videos can still perform well in terms of both retention and view count.</p>
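<p>If you want a number to go with the visual impression, one option is to compute a correlation coefficient between the two columns. The snippet below is only a minimal sketch: the <code>sample</code> DataFrame holds made-up placeholder values, not my channel's data, so swap in your own <code>df</code> with the same column names. It reports Pearson's r for the linear relationship and a rank-based (Spearman-style) coefficient, which is less sensitive to a few very long videos.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical placeholder values for illustration only (not the channel's real data).
# Replace `sample` with the DataFrame you built from your own YouTube export.
sample = pd.DataFrame({
    "Duration": [120, 300, 480, 900, 1500, 2400, 3500],
    "Retention Rate (%)": [38.0, 35.0, 30.0, 22.0, 14.0, 12.0, 14.0],
})

# Pearson's r: strength of the linear relationship between the two columns
pearson_r = sample["Duration"].corr(sample["Retention Rate (%)"])

# Spearman's rho: Pearson's r computed on the ranks, which captures any
# monotonic relationship and is less affected by outliers
spearman_r = sample["Duration"].rank().corr(sample["Retention Rate (%)"].rank())

print(f"Pearson r: {pearson_r:.2f}")
print(f"Spearman rho: {spearman_r:.2f}")
</code></pre>
<p>A value close to -1 would back up the negative relationship seen in the plot, while a value near 0 would suggest the pattern is weaker than it looks. With only a handful of videos, treat the result as a rough indication rather than proof.</p>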
<h2 id="heading-conclusion">Conclusion</h2>
<p>What we’ve learned from my data is just the tip of the iceberg. YouTube has many metrics, and because my channel is not monetized and has few subscribers and videos, I don’t have data on monetization, demographics, and other metrics.</p>
<p>But after reading this article, I hope you can think of many more questions you could answer with these metrics. You can even forecast your views, subscriber counts, and revenue for the coming days or months. You can also perform a multivariate time series analysis to see how these factors affect your primary variable of interest.</p>
<p>If you found this article interesting, don’t forget to check out my <a target="_blank" href="https://learndata.xyz/blog">blog</a> for other interesting articles, follow me on <a target="_blank" href="https://medium.com/@adejumo999">Medium</a>, connect on <a target="_blank" href="https://www.linkedin.com/in/adejumoridwan/">LinkedIn</a>, and subscribe to my <a target="_blank" href="http://www.youtube.com/@learndata_xyz">YouTube channel</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
