<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Kuriko - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Kuriko - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Tue, 19 May 2026 10:28:08 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/kuriko/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Build End-to-End Machine Learning Lineage ]]>
                </title>
                <description>
                    <![CDATA[ Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance. While many services for tracking ML lineage exist, creating a comprehensive and manageabl... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-end-to-end-machine-learning-lineage/</link>
                <guid isPermaLink="false">68f0f6719ac2ae80d4c5be03</guid>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Thu, 16 Oct 2025 13:43:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760622158648/b990ff01-06f0-495d-8554-f832813609ab.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance.</p>
<p>While many services for tracking ML lineage exist, creating a comprehensive and manageable lineage often proves complicated.</p>
<p>In this article, I’ll walk you through integrating a comprehensive ML lineage solution for an ML application deployed on serverless AWS Lambda, covering the end-to-end pipeline stages:</p>
<ul>
<li><p>ETL pipeline</p>
</li>
<li><p>Data drift detection</p>
</li>
<li><p>Preprocessing</p>
</li>
<li><p>Model tuning</p>
</li>
<li><p>Risk and fairness evaluation.</p>
</li>
</ul>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-machine-learning-lineage">What is Machine Learning Lineage?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-well-build">What We’ll Build</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-the-system-architecture-ai-pricing-for-retailers">The System Architecture - AI Pricing for Retailers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-ml-lineage">The ML Lineage</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-workflow-in-action">Workflow in Action</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-initiating-a-dvc-project">Step 1: Initiating a DVC Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-the-ml-lineage">Step 2: The ML Lineage</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-stage-1-the-etl-pipeline">Stage 1: The ETL Pipeline</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-2-the-data-drift-check">Stage 2: The Data Drift Check</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-3-preprocessing">Stage 3: Preprocessing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-4-tuning-the-model">Stage 4: Tuning the Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-5-performing-inference">Stage 5: Performing Inference</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-6-assessing-model-risk-and-fairness">Stage 6: Assessing Model Risk and Fairness</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-test-in-local">Test in Local</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-deploying-the-dvc-project">Step 3: Deploying the DVC Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-configuring-scheduled-run-with-prefect">Step 4: Configuring Scheduled Run with Prefect</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-configuring-the-docker-image-registry">Configuring the Docker Image Registry</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configure-prefect-tasks-and-flows">Configure Prefect Tasks and Flows</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-test-in-local-1">Test in Local</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-deploying-the-application">Step 5: Deploying the Application</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-test-in-local-2">Test in Local</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h3 id="heading-prerequisites">Prerequisites:</h3>
<ul>
<li><p>Knowledge of key Machine Learning / Deep Learning concepts including the full lifecycle: data handling, model training, tuning, and validation.</p>
</li>
<li><p>Proficiency in Python, with experience using major ML libraries.</p>
</li>
<li><p>Basic understanding of DevOps principles.</p>
</li>
</ul>
<h3 id="heading-tools-well-use">Tools we’ll use:</h3>
<p>Here is a summary of the tools we’re going to use to track the ML lineage:</p>
<ul>
<li><p><strong>DVC</strong>: An open-source version control system for data. Used to track the ML lineage.</p>
</li>
<li><p><strong>AWS S3</strong>: A secure object storage service from AWS. Used as remote storage.</p>
</li>
<li><p><strong>Evidently AI</strong>: An open-source ML and LLM observability framework. Used to detect data drift.</p>
</li>
<li><p><strong>Prefect</strong>: A workflow orchestration engine. Used to manage the scheduled runs of the lineage.</p>
</li>
</ul>
<h2 id="heading-what-is-machine-learning-lineage">What is Machine Learning Lineage?</h2>
<p><strong>Machine learning (ML) lineage</strong> is a framework for tracking and understanding the complete lifecycle of a machine learning model.</p>
<p>It contains information at different levels such as:</p>
<ul>
<li><p><strong>Code:</strong> The scripts, libraries, and configurations for model training.</p>
</li>
<li><p><strong>Data:</strong> The original data, transformations, and features.</p>
</li>
<li><p><strong>Experiments:</strong> Training runs, hyperparameter tuning results.</p>
</li>
<li><p><strong>Models:</strong> The trained models and their versions.</p>
</li>
<li><p><strong>Predictions:</strong> The outputs of deployed models.</p>
</li>
</ul>
<p>ML lineage is essential for multiple reasons:</p>
<ul>
<li><p><strong>Reproducibility:</strong> Recreate the same model and prediction for validation.</p>
</li>
<li><p><strong>Root cause analysis:</strong> Trace back to the data, code, or configuration change when a model fails in production.</p>
</li>
<li><p><strong>Compliance:</strong> Some regulated industries require proof of model training to ensure fairness, transparency, and adherence to laws like GDPR and the EU AI Act.</p>
</li>
</ul>
<h2 id="heading-what-well-build">What We’ll Build</h2>
<p>In this project, I’ll integrate an ML lineage into <a target="_blank" href="https://levelup.gitconnected.com/building-a-dynamic-pricing-system-with-a-multi-layered-neural-network-c2a4c70bfcec">this price prediction system built on AWS Lambda architecture</a> using DVC, an open-source version control system for ML applications.</p>
<p>The diagram below illustrates the system architecture and the ML lineage we’ll integrate:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759825040233/5027e5dd-a2fc-4d35-b7a3-4d9184f5f179.png" alt="Figure A. A comprehensive ML lineage for an ML application on serverless Lambda (Created by Kuriko IWAI)" class="image--center mx-auto" width="25020" height="7926" loading="lazy"></p>
<p><strong>Figure A:</strong> A comprehensive ML lineage for an ML application on serverless Lambda (Created by <a target="_blank" href="https://kuriko-iwai.vercel.app/">Kuriko IWAI</a>)</p>
<h3 id="heading-the-system-architecture-ai-pricing-for-retailers">The System Architecture: AI Pricing for Retailers</h3>
<p>The system operates as a containerized, serverless microservice designed to provide optimal price recommendations to maximize retailer sales.</p>
<p>Its core intelligence comes from AI models trained on historical purchase data to predict the quantity of the product sold at various prices, allowing sellers to determine the best price.</p>
<p>For consistent deployment, the prediction logic and its dependencies are packaged into a Docker container image and stored in AWS ECR (Elastic Container Registry).</p>
<p>The prediction is then served by an AWS Lambda function, which retrieves and runs the container from ECR and exposes the result via AWS API Gateway for the Flask application to consume.</p>
<p>If you want to see how to build this from the ground up, you can follow along with my tutorial <a target="_blank" href="https://www.freecodecamp.org/news/how-to-build-a-machine-learning-system-on-serverless-architecture/">How to Build a Machine Learning System on Serverless Architecture</a>.</p>
<h3 id="heading-the-ml-lineage">The ML Lineage</h3>
<p>In the system, GitHub handles the code lineage, while DVC captures the lineage of:</p>
<ul>
<li><p><strong>Data</strong> (blue boxes): ETL and preprocessing.</p>
</li>
<li><p><strong>Experiments</strong> (light orange): Hyperparameter tuning and validation.</p>
</li>
<li><p><strong>Models</strong> and <strong>Prediction</strong> (dark orange): Final model artifacts and prediction results.</p>
</li>
</ul>
<p><strong>DVC</strong> tracks the lineage through separate stages, from data extraction to fairness testing (yellow rows in Figure A).</p>
<p>For each stage, DVC uses an <strong>MD5</strong> or <strong>SHA256 hash</strong> to track and push metadata like artifacts, metrics, and reports to its remote on <strong>AWS S3</strong>.</p>
<p>The pipeline incorporates <strong>Evidently AI</strong> to handle data drift tests, which are essential for identifying shifts in data distributions that could compromise the model's generalization capabilities in production.</p>
<p>Only models that successfully pass both the data drift and fairness tests can serve predictions via the AWS API gateway (red box in Figure A).</p>
<p>Lastly, this entire lineage process is triggered weekly by the open-source workflow scheduler, <strong>Prefect</strong>.</p>
<p>Prefect prompts DVC to check for updates in data and scripts, and executes the full lineage process if changes are detected.</p>
<h2 id="heading-workflow-in-action">Workflow in Action</h2>
<p>The building process involves five main steps:</p>
<ol>
<li><p>Initiate a DVC project</p>
</li>
<li><p>Define the lineage stages with the DVC script <code>dvc.yaml</code> and the corresponding Python scripts</p>
</li>
<li><p>Deploy the DVC project</p>
</li>
<li><p>Configure scheduled run with Prefect</p>
</li>
<li><p>Deploy the application</p>
</li>
</ol>
<p>Let’s walk through each step together.</p>
<h2 id="heading-step-1-initiating-a-dvc-project">Step 1: Initiating a DVC Project</h2>
<p>The first step is to initiate a DVC project:</p>
<pre><code class="lang-bash">dvc init
</code></pre>
<p>This command automatically creates a <code>.dvc</code> directory at the root of the project folder:</p>
<pre><code class="lang-bash">.
.dvc/
│
└── cache/         <span class="hljs-comment"># [.gitignore] store dvc caches (cached actual data files)</span>
└── tmp/           <span class="hljs-comment"># [.gitignore]</span>
└── .gitignore     <span class="hljs-comment"># gitignore cache, tmp, and config.local</span>
└── config         <span class="hljs-comment"># dvc config for production</span>
└── config.local   <span class="hljs-comment"># [.gitignore] dvc config for local</span>
</code></pre>
<p>DVC keeps the Git repository fast and lightweight by storing large data files outside of it.</p>
<p>The process involves caching the original data in the local <code>.dvc/cache</code> directory, creating a small <code>.dvc</code> metadata file which contains an MD5 hash and a link to the original data file path, pushing <em>only</em> the small metadata files to Git, and pushing the original data to the DVC remote.</p>
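<p>To make the hashing idea concrete, here is a minimal sketch of content-addressed caching. It is not DVC's actual code: <code>md5_of_file</code> and <code>cache_path_for</code> are hypothetical helpers, and the two-character folder layout is a simplified assumption used only for illustration.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(md5_hex, cache_root=Path(".dvc/cache")):
    """Hypothetical layout: the first two hex characters become a folder name."""
    return cache_root / md5_hex[:2] / md5_hex[2:]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # stand-in for a large data file such as data/original_df.parquet
        data_file = Path(tmp) / "original_df.parquet"
        data_file.write_bytes(b"hello")

        file_hash = md5_of_file(data_file)
        print(file_hash)                  # the value a small metadata file would record
        print(cache_path_for(file_hash))  # where the cached copy would live
```

<p>Because the path is derived from the content hash, two files with identical content map to the same cache entry, and any change to the data produces a new hash that Git can track through the small metadata file.</p>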
<h2 id="heading-step-2-the-ml-lineage">Step 2: The ML Lineage</h2>
<p>Next, we’ll configure the ML lineage with the following stages:</p>
<ol>
<li><p><code>etl_pipeline</code>: Extract, clean, and impute the original data, then perform feature engineering.</p>
</li>
<li><p><code>data_drift_check</code>: Run data drift tests. If they fail, the system exits.</p>
</li>
<li><p><code>preprocess</code>: Create training, validation, and test datasets.</p>
</li>
<li><p><code>tune_primary_model</code>: Tune hyperparameters and train the model.</p>
</li>
<li><p><code>inference_primary_model</code>: Perform inference on the test dataset.</p>
</li>
<li><p><code>assess_model_risk</code>: Run risk and fairness tests.</p>
</li>
</ol>
<p>Each stage requires defining the DVC command and its corresponding Python script.</p>
<p>Let’s get started.</p>
<h3 id="heading-stage-1-the-etl-pipeline">Stage 1: The ETL Pipeline</h3>
<p>The first stage is to extract, clean, and impute the original data, then perform feature engineering.</p>
<h4 id="heading-dvc-configuration"><strong>DVC Configuration</strong></h4>
<p>We’ll create the <code>dvc.yaml</code> file at the root of the project directory and add the <code>etl_pipeline</code> stage:</p>
<p><code>dvc.yaml</code></p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment"># the main command dvc will run in this stage</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">python</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>

    <span class="hljs-comment"># dependencies necessary to run the main command</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>

    <span class="hljs-comment"># output paths for dvc to track</span>
    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/original_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/processed_df.parquet</span>
</code></pre>
<p>The <code>dvc.yaml</code> file defines a sequence of steps (stages) using sections like:</p>
<ul>
<li><p><code>cmd</code>: The shell command to be executed for that stage</p>
</li>
<li><p><code>deps</code>: Dependencies needed to run the <code>cmd</code></p>
</li>
<li><p><code>params</code>: Default parameters for the <code>cmd</code> defined in the <code>params.yaml</code> file</p>
</li>
<li><p><code>metrics</code>: The metrics files to track</p>
</li>
<li><p><code>reports</code>: The report files to track</p>
</li>
<li><p><code>plots</code>: The DVC plot files for visualization</p>
</li>
<li><p><code>outs</code>: The output files produced by the <code>cmd</code>, which DVC will track</p>
</li>
</ul>
<p>The configuration helps DVC ensure reproducibility by explicitly listing dependencies, outputs, and the commands of each stage. It also helps it manage the lineage by establishing a <strong>Directed Acyclic Graph (DAG)</strong> of the workflow, linking each stage to the next.</p>
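<p>As a rough illustration of how a DAG can be derived from <code>deps</code> and <code>outs</code>, the sketch below links each stage to whichever stage produces one of its dependencies, then sorts the result with the standard library. The stage dictionary is a simplified assumption (the <code>deps</code> shown and the <code>models/model.pth</code> path are hypothetical), and this is not how DVC is implemented internally.</p>

```python
from graphlib import TopologicalSorter

# Simplified stage definitions: only deps and outs matter for ordering.
# The deps and the model path here are illustrative assumptions.
stages = {
    "tune_primary_model": {
        "deps": ["data/x_train_df.parquet"],
        "outs": ["models/model.pth"],
    },
    "preprocess": {
        "deps": ["data/processed_df.parquet"],
        "outs": ["data/x_train_df.parquet"],
    },
    "etl_pipeline": {
        "deps": ["src/data_handling/etl_pipeline.py"],
        "outs": ["data/processed_df.parquet"],
    },
}


def build_stage_graph(stages):
    """Map each stage to the set of stages that produce one of its deps."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


def execution_order(stages):
    """Return the stages in an order where producers run before consumers."""
    return list(TopologicalSorter(build_stage_graph(stages)).static_order())


if __name__ == "__main__":
    print(execution_order(stages))
```

<p>The order in which the stages are written does not matter: the output of one stage appearing as a dependency of another is what creates the edge between them.</p>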
<h4 id="heading-python-scripts"><strong>Python Scripts</strong></h4>
<p>Next, let’s add Python scripts, ensuring the data is stored using the file paths specified in the <code>outs</code> section of the <code>dvc.yaml</code> file:</p>
<p><code>src/data_handling/etl_pipeline.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

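<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">etl_pipeline</span>(<span class="hljs-params">stockcode: str = <span class="hljs-string">''</span>, impute_stockcode: bool = <span class="hljs-literal">False</span></span>):</span>
    <span class="hljs-comment"># stockcode and impute_stockcode are accepted to match the CLI call below; this excerpt runs the full pipeline</span>
    <span class="hljs-comment"># extract the entire data</span>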
    df = scripts.extract_original_dataframe()

    <span class="hljs-comment"># save as a parquet file</span>
    ORIGINAL_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'original_df.parquet'</span>)
    df.to_parquet(ORIGINAL_DF_PATH, index=<span class="hljs-literal">False</span>) <span class="hljs-comment"># dvc tracked</span>

    <span class="hljs-comment"># transform</span>
    df = scripts.structure_missing_values(df=df)
    df = scripts.handle_feature_engineering(df=df)

    PROCESSED_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'processed_df.parquet'</span>)
    df.to_parquet(PROCESSED_DF_PATH, index=<span class="hljs-literal">False</span>) <span class="hljs-comment"># dvc tracked</span>
    <span class="hljs-keyword">return</span> df

<span class="hljs-comment"># for dvc execution</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:  
    parser = argparse.ArgumentParser(description=<span class="hljs-string">"run etl pipeline"</span>)
    parser.add_argument(<span class="hljs-string">'--stockcode'</span>, type=str, default=<span class="hljs-string">''</span>, help=<span class="hljs-string">"specific stockcode to process. empty runs full pipeline."</span>)
    parser.add_argument(<span class="hljs-string">'--impute'</span>, action=<span class="hljs-string">'store_true'</span>, help=<span class="hljs-string">"flag to create imputation values"</span>)
    args = parser.parse_args()

    etl_pipeline(stockcode=args.stockcode, impute_stockcode=args.impute)
</code></pre>
<h4 id="heading-outputs"><strong>Outputs</strong></h4>
<p>The original and processed Pandas DataFrames are saved as Parquet files, which DVC tracks and stores in its cache:</p>
<ul>
<li><p><code>data/original_df.parquet</code></p>
</li>
<li><p><code>data/processed_df.parquet</code></p>
</li>
</ul>
<h3 id="heading-stage-2-the-data-drift-check">Stage 2: The Data Drift Check</h3>
<p>Before jumping into preprocessing, we’ll run data drift tests to ensure there is no notable drift in the data. To do this, we’ll use <strong>Evidently AI</strong>, an open-source ML and LLM observability framework.</p>
<h4 id="heading-what-is-data-drift">What is Data Drift?</h4>
<p>Data drift refers to any change in the statistical properties of the data, such as the mean, variance, or distribution, relative to the data the model was trained on.</p>
<p>There are three main types of data drift:</p>
<ul>
<li><p><strong>Covariate Drift</strong> (Feature Drift): A change in the input feature distribution.</p>
</li>
<li><p><strong>Prior Probability Drift</strong> (Label Drift): A change in the target variable distribution.</p>
</li>
<li><p><strong>Concept Drift</strong>: A change in the relationship between the input data and the target variable.</p>
</li>
</ul>
<p>Data drift compromises the model's generalization capabilities over time, making its detection after deployment crucial.</p>
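<p>To give a feel for what a covariate drift test measures, here is a minimal sketch that compares a reference sample with a current sample using the two-sample Kolmogorov-Smirnov statistic. This is a swapped-in, hand-rolled technique for illustration and not necessarily the test Evidently applies; the <code>0.1</code> threshold and the synthetic price data are assumptions.</p>

```python
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    gap = 0.0
    # the largest gap always occurs at one of the observed values
    for x in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


def has_drifted(reference, current, threshold=0.1):
    """Flag drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, current) > threshold


if __name__ == "__main__":
    random.seed(0)
    # synthetic unit prices: same distribution vs. a shifted mean
    reference = [random.gauss(10, 2) for _ in range(500)]
    similar = [random.gauss(10, 2) for _ in range(500)]
    shifted = [random.gauss(14, 2) for _ in range(500)]

    print("similar sample:", round(ks_statistic(reference, similar), 3))
    print("shifted sample:", round(ks_statistic(reference, shifted), 3))
```

<p>A statistic near 0 means the two samples follow nearly the same distribution, while a value near 1 means they barely overlap. A production-grade check would also account for sample size and use a proper significance test per feature.</p>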
<h4 id="heading-dvc-configuration-1">DVC Configuration</h4>
<p>We’ll add the <code>data_drift_check</code> stage right after the <code>etl_pipeline</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
     <span class="hljs-comment"># the main command dvc will run in this stage</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/data_handling/report_data_drift.py
      data/processed/processed_df.csv 
      data/processed_df_${params.stockcode}.parquet
      reports/data_drift_report_${params.stockcode}.html
      metrics/data_drift_${params.stockcode}.json
      ${params.stockcode}
</span>
    <span class="hljs-comment"># default values to the parameters (defined in the params.yaml file)</span>
    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>

    <span class="hljs-comment"># dependencies necessary to run the main command</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/report_data_drift.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/</span>

    <span class="hljs-comment"># output file paths for dvc to track</span>
    <span class="hljs-attr">plots:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/data_drift_report_${params.stockcode}.html:</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/data_drift_${params.stockcode}.json:</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>
</code></pre>
<p>Then, add default values to the parameters passed to the DVC command:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">stockcode:</span> <span class="hljs-string">&lt;STOCKCODE</span> <span class="hljs-string">OF</span> <span class="hljs-string">CHOICE&gt;</span>
</code></pre>
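<p>To make the <code>${params.stockcode}</code> placeholders concrete, the sketch below mimics how a dotted placeholder can be resolved against the nested values in <code>params.yaml</code>. The <code>resolve</code> helper is hypothetical and only approximates the idea (DVC's own templating is more capable), and the stock code <code>85123A</code> is a made-up example value.</p>

```python
import re

# mirrors the structure of params.yaml; the stockcode value is a made-up example
params = {"params": {"stockcode": "85123A"}}


def resolve(template, values):
    """Replace each ${a.b} placeholder with the matching nested value."""

    def lookup(match):
        node = values
        for key in match.group(1).split("."):
            node = node[key]  # raises KeyError if the parameter is missing
        return str(node)

    return re.sub(r"\$\{([\w.]+)\}", lookup, template)


if __name__ == "__main__":
    print(resolve("reports/data_drift_report_${params.stockcode}.html", params))
    print(resolve("metrics/data_drift_${params.stockcode}.json", params))
```

<p>Changing the value in <code>params.yaml</code> therefore changes the resolved command and output paths, which is why each stock code gets its own report and metrics file.</p>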
<h4 id="heading-python-scripts-1">Python Scripts</h4>
<p>After <a target="_blank" href="https://docs.evidentlyai.com/quickstart_ml#1-1-set-up-evidently-cloud">generating an API token from the Evidently AI workspace</a>, we’ll add a Python script to detect data drift and store the results in the <code>metrics</code> variable:</p>
<p><code>src/data_handling/report_data_drift.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">from</span> evidently <span class="hljs-keyword">import</span> Dataset, DataDefinition, Report
<span class="hljs-keyword">from</span> evidently.presets <span class="hljs-keyword">import</span> DataDriftPreset
<span class="hljs-keyword">from</span> evidently.ui.workspace <span class="hljs-keyword">import</span> CloudWorkspace

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># initiate evidently cloud workspace</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    ws = CloudWorkspace(token=os.getenv(<span class="hljs-string">'EVENTLY_API_TOKEN'</span>), url=<span class="hljs-string">'https://app.evidently.cloud'</span>)

    <span class="hljs-comment"># retrieve evidently project</span>
    project = ws.get_project(<span class="hljs-string">'EVIDENTLY AI PROJECT ID'</span>)

    <span class="hljs-comment"># retrieve paths from the command line args</span>
    REFERENCE_DATA_PATH = sys.argv[<span class="hljs-number">1</span>]
    CURRENT_DATA_PATH = sys.argv[<span class="hljs-number">2</span>]
    REPORT_OUTPUT_PATH = sys.argv[<span class="hljs-number">3</span>]
    METRICS_OUTPUT_PATH = sys.argv[<span class="hljs-number">4</span>]
    STOCKCODE = sys.argv[<span class="hljs-number">5</span>]

    <span class="hljs-comment"># create folders if not exist</span>
    os.makedirs(os.path.dirname(REPORT_OUTPUT_PATH), exist_ok=<span class="hljs-literal">True</span>)
    os.makedirs(os.path.dirname(METRICS_OUTPUT_PATH), exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># extract datasets</span>
    reference_data_full = pd.read_csv(REFERENCE_DATA_PATH)
    reference_data_stockcode = reference_data_full[reference_data_full[<span class="hljs-string">'stockcode'</span>] == STOCKCODE]
    current_data_stockcode = pd.read_parquet(CURRENT_DATA_PATH)

    <span class="hljs-comment"># define data schema</span>
    nums, cats = scripts.categorize_num_cat_cols(df=reference_data_stockcode)
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> nums: current_data_stockcode[col] = pd.to_numeric(current_data_stockcode[col], errors=<span class="hljs-string">'coerce'</span>)

    schema = DataDefinition(numerical_columns=nums, categorical_columns=cats)

    <span class="hljs-comment"># define evidently dataset w/ the data schema</span>
    eval_data_1 = Dataset.from_pandas(reference_data_stockcode, data_definition=schema)
    eval_data_2 = Dataset.from_pandas(current_data_stockcode, data_definition=schema)

    <span class="hljs-comment"># execute drift detection</span>
    report = Report(metrics=[DataDriftPreset()])
    data_eval = report.run(reference_data=eval_data_1, current_data=eval_data_2)
    data_eval.save_html(REPORT_OUTPUT_PATH)

    <span class="hljs-comment"># create metrics for dvc tracking</span>
    report_dict = json.loads(data_eval.json())
    num_drifts = report_dict[<span class="hljs-string">'metrics'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'value'</span>][<span class="hljs-string">'count'</span>]
    shared_drifts = report_dict[<span class="hljs-string">'metrics'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'value'</span>][<span class="hljs-string">'share'</span>]
    metrics = dict(
        drift_detected=bool(num_drifts &gt; <span class="hljs-number">0.0</span>), num_drifts=num_drifts, shared_drifts=shared_drifts,
        num_cols=nums,
        cat_cols=cats,
        stockcode=STOCKCODE,
        timestamp=datetime.datetime.now().isoformat(),
    )

    <span class="hljs-comment"># write metrics file</span>
    <span class="hljs-keyword">with</span> open(METRICS_OUTPUT_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        json.dump(metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... drift metrics saved to <span class="hljs-subst">{METRICS_OUTPUT_PATH}</span>... '</span>)

    <span class="hljs-comment"># stop the system if data drift is found</span>
    <span class="hljs-keyword">if</span> num_drifts &gt; <span class="hljs-number">0.0</span>: sys.exit(<span class="hljs-string">'❌ FATAL: data drift detected. stopping pipeline'</span>)
</code></pre>
<p>If data drift is found, the script immediately exits using the final <code>sys.exit</code> command.</p>
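<p>The reason this halts the lineage is the process exit code: calling <code>sys.exit</code> with a message prints it to standard error and exits with status 1, and a runner that checks exit codes can refuse to continue. The sketch below demonstrates that behavior with a child Python process. The <code>run_stage</code> helper is hypothetical and only stands in for whatever launches the stage command.</p>

```python
import subprocess
import sys


def run_stage(code):
    """Run a snippet in a child interpreter; return its exit code and stderr."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr.strip()


# stand-ins for a drift check that fails and one that passes
FAILING_STAGE = "import sys; sys.exit('FATAL: data drift detected')"
PASSING_STAGE = "print('no drift')"


if __name__ == "__main__":
    for name, code in [("drift", FAILING_STAGE), ("clean", PASSING_STAGE)]:
        exit_code, message = run_stage(code)
        if exit_code != 0:
            print(f"{name}: stopped with exit code {exit_code} ({message})")
            break  # later stages never run
        print(f"{name}: ok, continuing")
```

<p>Note that in the script above the metrics file is written before the exit call, so the drift results are still saved to disk for inspection even when the run stops.</p>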
<h4 id="heading-outputs-1">Outputs</h4>
<p>The script generates two files that DVC will track:</p>
<ul>
<li><p><code>reports/data_drift_report.html</code>: The data drift report in an HTML file.</p>
</li>
<li><p><code>metrics/data_drift.json</code>: The data drift metrics in a JSON file, including drift results along with feature columns and a timestamp:</p>
</li>
</ul>
<p><code>metrics/data_drift.json</code>:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"drift_detected"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">"num_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-attr">"shared_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-attr">"num_cols"</span>: [
        <span class="hljs-string">"invoiceno"</span>,
        <span class="hljs-string">"invoicedate"</span>,
        <span class="hljs-string">"unitprice"</span>,
        <span class="hljs-string">"product_avg_quantity_last_month"</span>,
        <span class="hljs-string">"product_max_price_all_time"</span>,
        <span class="hljs-string">"unitprice_vs_max"</span>,
        <span class="hljs-string">"unitprice_to_avg"</span>,
        <span class="hljs-string">"unitprice_squared"</span>,
        <span class="hljs-string">"unitprice_log"</span>
    ],
    <span class="hljs-attr">"cat_cols"</span>: [
        <span class="hljs-string">"stockcode"</span>,
        <span class="hljs-string">"customerid"</span>,
        <span class="hljs-string">"country"</span>,
        <span class="hljs-string">"year"</span>,
        <span class="hljs-string">"year_month"</span>,
        <span class="hljs-string">"day_of_week"</span>,
        <span class="hljs-string">"is_registered"</span>
    ],
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:24:29.899495"</span>
}
</code></pre>
<p>The drift test results are also available on the Evidently workspace dashboard for further analysis:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/0*2C1ICzvVazAUH7fk.png" alt="Figure B. Screenshot of the Evidently workspace dashboard" width="600" height="400" loading="lazy"></p>
<p><strong>Figure B.</strong> Screenshot of the Evidently workspace dashboard</p>
<h3 id="heading-stage-3-preprocessing">Stage 3: Preprocessing</h3>
<p>If no data drift is detected, the lineage moves on to the preprocessing stage.</p>
<h4 id="heading-dvc-configuration-2">DVC Configuration</h4>
<p>We’ll add the <code>preprocess</code> stage right after the <code>data_drift_check</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/data_handling/preprocess.py --target_col ${params.target_col} --should_scale ${params.should_scale} --verbose ${params.verbose}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/preprocess.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils</span>

    <span class="hljs-comment"># params from params.yaml</span>
    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.target_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.should_scale</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.verbose</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-comment"># train, val, test datasets</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_train_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_val_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_test_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_train_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_val_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_test_df.parquet</span>

      <span class="hljs-comment"># preprocessed input datasets</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_train_processed.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_val_processed.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_test_processed.parquet</span>

      <span class="hljs-comment"># trained preprocessor and human readable feature names for shap analysis</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">preprocessors/column_transformer.pkl</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">preprocessors/feature_names.json</span>
</code></pre>
<p>And then add default values of the parameters used in the <code>cmd</code>:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>
</code></pre>
<h4 id="heading-python-scripts-2">Python Scripts</h4>
<p>Next, we’ll add a Python script to create training, validation, and test datasets and preprocess input data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess</span>(<span class="hljs-params">stockcode: str = <span class="hljs-string">''</span>, target_col: str = <span class="hljs-string">'quantity'</span>, should_scale: bool = True, verbose: bool = False</span>):</span>
    <span class="hljs-comment"># initiate metrics to track (dvc)</span>
    DATA_DRIFT_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'data_drift_<span class="hljs-subst">{stockcode}</span>.json'</span>)

    <span class="hljs-keyword">if</span> os.path.exists(DATA_DRIFT_METRICS_PATH):
        <span class="hljs-keyword">with</span> open(DATA_DRIFT_METRICS_PATH, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
            metrics = json.load(f)
    <span class="hljs-keyword">else</span>: metrics = dict()

    <span class="hljs-comment"># load processed df from dvc cache</span>
    PROCESSED_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'processed_df.parquet'</span>)
    df = pd.read_parquet(PROCESSED_DF_PATH)

    <span class="hljs-comment"># categorize num and cat columns</span>
    num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
    <span class="hljs-keyword">if</span> verbose: main_logger.info(<span class="hljs-string">f'num_cols: <span class="hljs-subst">{num_cols}</span> \ncat_cols: <span class="hljs-subst">{cat_cols}</span>'</span>)

    <span class="hljs-comment"># structure cat cols</span>
    <span class="hljs-keyword">if</span> cat_cols:
        <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cat_cols: df[col] = df[col].astype(<span class="hljs-string">'string'</span>)

    <span class="hljs-comment"># initiate preprocessor (either load from the dvc cache or create from scratch)</span>
    PREPROCESSOR_PATH = os.path.join(<span class="hljs-string">'preprocessors'</span>, <span class="hljs-string">'column_transformer.pkl'</span>)
    <span class="hljs-keyword">try</span>:
        preprocessor = joblib.load(PREPROCESSOR_PATH)
    <span class="hljs-keyword">except</span> Exception:
        preprocessor = scripts.create_preprocessor(num_cols=num_cols <span class="hljs-keyword">if</span> should_scale <span class="hljs-keyword">else</span> [], cat_cols=cat_cols)

    <span class="hljs-comment"># creates train, val, test datasets</span>
    y = df[target_col]
    X = df.copy().drop(target_col, axis=<span class="hljs-string">'columns'</span>)

    <span class="hljs-comment"># split</span>
    test_size, random_state = <span class="hljs-number">50000</span>, <span class="hljs-number">42</span>
    X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, shuffle=<span class="hljs-literal">False</span>)
    X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, random_state=random_state, shuffle=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># store train, val, test datasets (dvc track)</span>
    X_train.to_parquet(<span class="hljs-string">'data/x_train_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    X_val.to_parquet(<span class="hljs-string">'data/x_val_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    X_test.to_parquet(<span class="hljs-string">'data/x_test_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_train.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_train_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_val.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_val_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_test.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_test_df.parquet'</span>, index=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># preprocess</span>
    X_train = preprocessor.fit_transform(X_train)
    X_val = preprocessor.transform(X_val)
    X_test = preprocessor.transform(X_test)

    <span class="hljs-comment"># store preprocessed input data (dvc track)</span>
    pd.DataFrame(X_train).to_parquet(<span class="hljs-string">f'data/x_train_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)
    pd.DataFrame(X_val).to_parquet(<span class="hljs-string">f'data/x_val_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)
    pd.DataFrame(X_test).to_parquet(<span class="hljs-string">f'data/x_test_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)

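    <span class="hljs-comment"># save the fitted preprocessor and feature names (dvc track) for inference and shap</span>
    os.makedirs(<span class="hljs-string">'preprocessors'</span>, exist_ok=<span class="hljs-literal">True</span>)
    joblib.dump(preprocessor, PREPROCESSOR_PATH)
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">'preprocessors/feature_names.json'</span>, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        feature_names = preprocessor.get_feature_names_out()
        json.dump(feature_names.tolist(), f)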

    <span class="hljs-keyword">return</span>  X_train, X_val, X_test, y_train, y_val, y_test, preprocessor


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    parser = argparse.ArgumentParser(description=<span class="hljs-string">'run data preprocessing'</span>)
    parser.add_argument(<span class="hljs-string">'--stockcode'</span>, type=str, default=<span class="hljs-string">''</span>, help=<span class="hljs-string">'specific stockcode'</span>)
    parser.add_argument(<span class="hljs-string">'--target_col'</span>, type=str, default=<span class="hljs-string">'quantity'</span>, help=<span class="hljs-string">'the target column name'</span>)
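    <span class="hljs-comment"># parse booleans explicitly: type=bool would treat any non-empty string (including 'False') as True</span>
    parser.add_argument(<span class="hljs-string">'--should_scale'</span>, type=<span class="hljs-keyword">lambda</span> s: s.lower() == <span class="hljs-string">'true'</span>, default=<span class="hljs-literal">True</span>, help=<span class="hljs-string">'flag to scale numerical features'</span>)
    parser.add_argument(<span class="hljs-string">'--verbose'</span>, type=<span class="hljs-keyword">lambda</span> s: s.lower() == <span class="hljs-string">'true'</span>, default=<span class="hljs-literal">False</span>, help=<span class="hljs-string">'flag for verbose logging'</span>)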
    args = parser.parse_args()

    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = preprocess(
        target_col=args.target_col,
        should_scale=args.should_scale,
        verbose=args.verbose,
        stockcode=args.stockcode,
    )
</code></pre>
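<p>Boolean values in the DVC <code>cmd</code> reach the script as plain strings, and Python treats any non-empty string as truthy, so <code>bool('False')</code> evaluates to <code>True</code>. Here is a minimal sketch of an explicit converter. <code>str_to_bool</code> and <code>build_parser</code> are hypothetical helpers for illustration, not part of the project code:</p>

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert a command-line string such as 'True' or 'false' to a bool."""
    normalized = value.strip().lower()
    if normalized in {"true", "1", "yes"}:
        return True
    if normalized in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Parser with the two boolean flags used by the preprocess stage."""
    parser = argparse.ArgumentParser(description="run data preprocessing")
    parser.add_argument("--should_scale", type=str_to_bool, default=True)
    parser.add_argument("--verbose", type=str_to_bool, default=False)
    return parser


# simulate the arguments DVC would pass on the command line
args = build_parser().parse_args(["--should_scale", "True", "--verbose", "False"])
print(args.should_scale, args.verbose)
```

<p>Raising <code>argparse.ArgumentTypeError</code> for unknown values makes a typo in <code>params.yaml</code> fail loudly instead of silently enabling a flag.</p>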
<h4 id="heading-outputs-2">Outputs</h4>
<p>This stage generates the necessary datasets for both model training and inference:</p>
<p>Input features:</p>
<ul>
<li><p><code>data/x_train_df.parquet</code></p>
</li>
<li><p><code>data/x_val_df.parquet</code></p>
</li>
<li><p><code>data/x_test_df.parquet</code></p>
</li>
</ul>
<p>Preprocessed input features:</p>
<ul>
<li><p><code>data/x_train_processed.parquet</code></p>
</li>
<li><p><code>data/x_val_processed.parquet</code></p>
</li>
<li><p><code>data/x_test_processed.parquet</code></p>
</li>
</ul>
<p>Target variables:</p>
<ul>
<li><p><code>data/y_train_df.parquet</code></p>
</li>
<li><p><code>data/y_val_df.parquet</code></p>
</li>
<li><p><code>data/y_test_df.parquet</code></p>
</li>
</ul>
<p>The preprocessor and human-readable feature names are also stored in cache for inference and SHAP feature impact analysis later:</p>
<ul>
<li><p><code>preprocessors/column_transformer.pkl</code></p>
</li>
<li><p><code>preprocessors/feature_names.json</code></p>
</li>
</ul>
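<p>To make the split sizes concrete: the script calls <code>train_test_split</code> twice with <code>shuffle=False</code> and an absolute <code>test_size</code> of 50,000 rows. The last 50,000 rows become the test set, the 50,000 rows before them become the validation set, and everything earlier is used for training. The sketch below only computes the resulting row ranges. <code>split_boundaries</code> is a hypothetical helper, and the total of 200,000 rows is a made-up figure:</p>

```python
def split_boundaries(n_rows: int, holdout_size: int) -> dict[str, range]:
    """Row ranges produced by two consecutive unshuffled hold-out splits."""
    if 2 * holdout_size >= n_rows:
        raise ValueError("not enough rows for train, validation, and test sets")
    test_start = n_rows - holdout_size       # first split: the tail becomes the test set
    val_start = test_start - holdout_size    # second split: the tail of the rest becomes validation
    return {
        "train": range(0, val_start),
        "val": range(val_start, test_start),
        "test": range(test_start, n_rows),
    }


# made-up dataset size, for illustration only
bounds = split_boundaries(n_rows=200_000, holdout_size=50_000)
for name, rows in bounds.items():
    print(name, rows.start, rows.stop, len(rows))
```

<p>Because nothing is shuffled, the row order of <code>processed_df.parquet</code> determines which records end up in each set.</p>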
<p>Lastly, DVC adds the <code>preprocess_status</code>, <code>x_train_processed_path</code>, and <code>preprocessor_path</code> to the data summary metrics file <code>data.json</code> created in Stage 2 to track the end-to-end process of Stages 2 and 3:</p>
<p><code>metrics/data.json</code>:</p>
<pre><code class="lang-python">{
    <span class="hljs-string">"drift_detected"</span>: false,
    <span class="hljs-string">"num_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-string">"shared_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-string">"num_cols"</span>: [
        <span class="hljs-string">"invoiceno"</span>,
        <span class="hljs-string">"invoicedate"</span>,
        <span class="hljs-string">"unitprice"</span>,
        <span class="hljs-string">"product_avg_quantity_last_month"</span>,
        <span class="hljs-string">"product_max_price_all_time"</span>,
        <span class="hljs-string">"unitprice_vs_max"</span>,
        <span class="hljs-string">"unitprice_to_avg"</span>,
        <span class="hljs-string">"unitprice_squared"</span>,
        <span class="hljs-string">"unitprice_log"</span>
    ],
    <span class="hljs-string">"cat_cols"</span>: [
        <span class="hljs-string">"stockcode"</span>,
        <span class="hljs-string">"customerid"</span>,
        <span class="hljs-string">"country"</span>,
        <span class="hljs-string">"year"</span>,
        <span class="hljs-string">"year_month"</span>,
        <span class="hljs-string">"day_of_week"</span>,
        <span class="hljs-string">"is_registered"</span>
    ],
    <span class="hljs-string">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:24:29.899495"</span>,

    <span class="hljs-comment"># updates</span>
    <span class="hljs-string">"preprocess_status"</span>: <span class="hljs-string">"completed"</span>,
    <span class="hljs-string">"x_train_processed_path"</span>: <span class="hljs-string">"data/x_train_processed_85123A.parquet"</span>,
    <span class="hljs-string">"preprocessor_path"</span>: <span class="hljs-string">"preprocessors/column_transformer.pkl"</span>
}
</code></pre>
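<p>The code that appends these keys isn’t shown in this article. One way to do it is to load the existing file, merge in the new keys, and write the result back, as in this sketch. <code>update_metrics</code> is a hypothetical helper, and the demo writes to a temporary directory rather than the project’s <code>metrics/</code> folder:</p>

```python
import json
import os
import tempfile


def update_metrics(path: str, updates: dict) -> dict:
    """Merge new keys into a JSON metrics file, creating the file if it is missing."""
    metrics = {}
    if os.path.exists(path):
        with open(path, "r") as f:
            metrics = json.load(f)
    metrics.update(updates)
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=4)
    return metrics


# demo in a temporary directory so no project files are touched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "metrics", "data.json")
    update_metrics(demo_path, {"drift_detected": False})
    merged = update_metrics(demo_path, {"preprocess_status": "completed"})
    print(sorted(merged))
```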
<p>Next, let’s move on to the model/experiment lineage.</p>
<h3 id="heading-stage-4-tuning-the-model">Stage 4: Tuning the Model</h3>
<p>Now that we’ve created the datasets, we’ll tune and train the primary model. It’s a multi-layered feedforward network built with <strong>PyTorch</strong>, and it uses the training and validation datasets created in the <code>preprocess</code> stage.</p>
<h4 id="heading-dvc-configuration-3">DVC Configuration</h4>
<p>First, we’ll add the <code>tune_primary_model</code> stage right after the <code>preprocess</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/main.py
      data/x_train_processed_${params.stockcode}.parquet
      data/x_val_processed_${params.stockcode}.parquet
      data/y_train_df_${params.stockcode}.parquet
      data/y_val_df_${params.stockcode}.parquet
      ${tuning.should_local_save}
      ${tuning.grid}
      ${tuning.n_trials}
      ${tuning.num_epochs}
      ${params.stockcode}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/main.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.n_trials</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.grid</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.should_local_save</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">models/production/dfn_best_${params.stockcode}.pth</span> <span class="hljs-comment"># dvc track</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_val_${params.stockcode}.json</span> <span class="hljs-comment"># dvc track</span>
</code></pre>
<p>Then we’ll add default values to the parameters:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>

<span class="hljs-attr">tuning:</span>
  <span class="hljs-attr">n_trials:</span> <span class="hljs-number">100</span>
  <span class="hljs-attr">num_epochs:</span> <span class="hljs-number">3000</span>
  <span class="hljs-attr">should_local_save:</span> <span class="hljs-literal">False</span>
  <span class="hljs-attr">grid:</span> <span class="hljs-literal">False</span>
</code></pre>
<h4 id="heading-python-scripts-3">Python Scripts</h4>
<p>Next, we’ll add the Python scripts to tune the model using <strong>Bayesian optimization</strong> and then train the optimal model on the complete <code>X_train</code> and <code>y_train</code> datasets created in the <code>preprocess</code> stage.</p>
<p><code>src/model/torch_model/main.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn

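<span class="hljs-keyword">import</span> src.model.torch_model.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger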


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">tune_and_train</span>(<span class="hljs-params">
        X_train, X_val, y_train, y_val,
        stockcode: str = <span class="hljs-string">''</span>,
        should_local_save: bool = True,
        grid: bool = False,
        n_trials: int = <span class="hljs-number">50</span>,
        num_epochs: int = <span class="hljs-number">3000</span>
    </span>) -&gt; tuple[nn.Module, dict]:</span>

    <span class="hljs-comment"># perform bayesian optimization</span>
    best_dfn, best_optimizer, best_batch_size, best_checkpoint = scripts.bayesian_optimization(
        X_train, X_val, y_train, y_val, n_trials=n_trials, num_epochs=num_epochs
    )

    <span class="hljs-comment"># save the model artifact (dvc track)</span>
    DFN_FILE_PATH = os.path.join(<span class="hljs-string">'models'</span>, <span class="hljs-string">'production'</span>, <span class="hljs-string">f'dfn_best_<span class="hljs-subst">{stockcode}</span>.pth'</span> <span class="hljs-keyword">if</span> stockcode <span class="hljs-keyword">else</span> <span class="hljs-string">'dfn_best.pth'</span>)
    os.makedirs(os.path.dirname(DFN_FILE_PATH), exist_ok=<span class="hljs-literal">True</span>)
    torch.save(best_checkpoint, DFN_FILE_PATH)

    <span class="hljs-keyword">return</span> best_dfn, best_checkpoint



<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">track_metrics_by_stockcode</span>(<span class="hljs-params">X_val, y_val, best_model, checkpoint: dict, stockcode: str</span>):</span>
    MODEL_VAL_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'dfn_val_<span class="hljs-subst">{stockcode}</span>.json'</span>)
    os.makedirs(os.path.dirname(MODEL_VAL_METRICS_PATH), exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># validate the tuned model</span>
    _, mse, exp_mae, rmsle = scripts.perform_inference(model=best_model, X=X_val, y=y_val)
    model_version = <span class="hljs-string">f"dfn_<span class="hljs-subst">{stockcode}</span>_<span class="hljs-subst">{os.getpid()}</span>"</span>
    metrics = dict(
        stockcode=stockcode,
        mse_val=mse,
        mae_val=exp_mae,
        rmsle_val=rmsle,
        model_version=model_version,
        hparams=checkpoint[<span class="hljs-string">'hparams'</span>],
        optimizer=checkpoint[<span class="hljs-string">'optimizer_name'</span>],
        batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>],
        lr=checkpoint[<span class="hljs-string">'lr'</span>],
        timestamp=datetime.datetime.now().isoformat()
    )
    <span class="hljs-comment"># store the validation results (dvc track)</span>
    <span class="hljs-keyword">with</span> open(MODEL_VAL_METRICS_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        json.dump(metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... validation metrics saved to <span class="hljs-subst">{MODEL_VAL_METRICS_PATH}</span> ...'</span>)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># fetch command arg values</span>
    X_TRAIN_PATH = sys.argv[<span class="hljs-number">1</span>]
    X_VAL_PATH = sys.argv[<span class="hljs-number">2</span>]
    Y_TRAIN_PATH = sys.argv[<span class="hljs-number">3</span>]
    Y_VAL_PATH = sys.argv[<span class="hljs-number">4</span>]
    SHOULD_LOCAL_SAVE = sys.argv[<span class="hljs-number">5</span>].lower() == <span class="hljs-string">'true'</span>  <span class="hljs-comment"># case-insensitive boolean parsing</span>
    GRID = sys.argv[<span class="hljs-number">6</span>].lower() == <span class="hljs-string">'true'</span>
    N_TRIALS = int(sys.argv[<span class="hljs-number">7</span>])
    NUM_EPOCHS = int(sys.argv[<span class="hljs-number">8</span>])
    STOCKCODE = str(sys.argv[<span class="hljs-number">9</span>])

    <span class="hljs-comment"># extract training and validation datasets from dvc cache</span>
    X_train, X_val = pd.read_parquet(X_TRAIN_PATH), pd.read_parquet(X_VAL_PATH)
    y_train, y_val = pd.read_parquet(Y_TRAIN_PATH), pd.read_parquet(Y_VAL_PATH)

    <span class="hljs-comment"># tuning</span>
    best_model, checkpoint = tune_and_train(
        X_train, X_val, y_train, y_val,
        stockcode=STOCKCODE, should_local_save=SHOULD_LOCAL_SAVE, grid=GRID, n_trials=N_TRIALS, num_epochs=NUM_EPOCHS
    )

    <span class="hljs-comment"># metrics tracking</span>
    track_metrics_by_stockcode(X_val, y_val, best_model=best_model, checkpoint=checkpoint, stockcode=STOCKCODE)
</code></pre>
<h4 id="heading-outputs-3">Outputs</h4>
<p>The stage generates two files:</p>
<ul>
<li><p><code>models/production/dfn_best.pth</code>: Contains the model checkpoint, including the optimal hyperparameter set.</p>
</li>
<li><p><code>metrics/dfn_val.json</code>: Contains tuning results, model version, timestamp, and validation results for MSE, MAE, and RMSLE:</p>
</li>
</ul>
<p><code>metrics/dfn_val.json</code>:</p>
<pre><code class="lang-yaml">{
    <span class="hljs-attr">"stockcode":</span> <span class="hljs-string">"85123A"</span>,
    <span class="hljs-attr">"mse_val":</span> <span class="hljs-number">0.6137686967849731</span>,
    <span class="hljs-attr">"mae_val":</span> <span class="hljs-number">9.092489242553711</span>,
    <span class="hljs-attr">"rmsle_val":</span> <span class="hljs-number">0.6953299045562744</span>,
    <span class="hljs-attr">"model_version":</span> <span class="hljs-string">"dfn_85123A_35604"</span>,
    <span class="hljs-attr">"hparams":</span> {
        <span class="hljs-attr">"num_layers":</span> <span class="hljs-number">4</span>,
        <span class="hljs-attr">"batch_norm":</span> <span class="hljs-literal">false</span>,
        <span class="hljs-attr">"dropout_rate_layer_0":</span> <span class="hljs-number">0.13765888061300502</span>,
        <span class="hljs-attr">"n_units_layer_0":</span> <span class="hljs-number">184</span>,
        <span class="hljs-attr">"dropout_rate_layer_1":</span> <span class="hljs-number">0.5509872409359128</span>,
        <span class="hljs-attr">"n_units_layer_1":</span> <span class="hljs-number">122</span>,
        <span class="hljs-attr">"dropout_rate_layer_2":</span> <span class="hljs-number">0.2408753527744403</span>,
        <span class="hljs-attr">"n_units_layer_2":</span> <span class="hljs-number">35</span>,
        <span class="hljs-attr">"dropout_rate_layer_3":</span> <span class="hljs-number">0.03451842588822594</span>,
        <span class="hljs-attr">"n_units_layer_3":</span> <span class="hljs-number">224</span>,
        <span class="hljs-attr">"learning_rate":</span> <span class="hljs-number">0.026240673135104406</span>,
        <span class="hljs-attr">"optimizer":</span> <span class="hljs-string">"adamax"</span>,
        <span class="hljs-attr">"batch_size":</span> <span class="hljs-number">64</span>
    },
    <span class="hljs-attr">"optimizer":</span> <span class="hljs-string">"adamax"</span>,
    <span class="hljs-attr">"batch_size":</span> <span class="hljs-number">64</span>,
    <span class="hljs-attr">"lr":</span> <span class="hljs-number">0.026240673135104406</span>,
    <span class="hljs-attr">"timestamp":</span> <span class="hljs-string">"2025-10-07T00:31:08.700294"</span>
}
</code></pre>
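<p>For reference, here is a sketch of the textbook definitions of the three metrics in plain Python. The project’s <code>scripts.perform_inference</code> isn’t shown here and may compute them differently (for example, on a transformed target scale), so treat this as an illustration with made-up numbers, not the project’s implementation:</p>

```python
import math


def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmsle(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared logarithmic error; inputs must be non-negative."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(y_true))


# made-up targets and predictions
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
print(mse(y_true, y_pred), mae(y_true, y_pred), rmsle(y_true, y_pred))
```

<p>RMSLE compares values on a log scale, so it penalizes relative errors rather than absolute ones, which can be useful for skewed targets such as sales quantities.</p>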
<h3 id="heading-stage-5-performing-inference">Stage 5: Performing Inference</h3>
<p>After the model tuning phase is complete, we’ll configure the test inference for a final evaluation.</p>
<p>The final evaluation uses the MSE, MAE, and RMSLE metrics, as well as SHAP for feature impact and interpretability analysis.</p>
<p><strong>SHAP</strong> <strong>(SHapley Additive exPlanations)</strong> is a framework for quantifying how much each feature contributes to a model’s prediction by using the concept of Shapley values from game theory.</p>
<p>The SHAP values then feed into future EDA and feature engineering.</p>
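<p>To see what a Shapley value measures, here is a brute-force sketch for a toy model with three features. It averages each feature’s marginal contribution over every possible ordering of the features. The <code>shap</code> library approximates these values for real models. <code>shapley_values</code>, <code>toy_model</code>, and the numbers below are made up for illustration:</p>

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, instance: dict, baseline: dict) -> dict:
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(instance)
    contributions = {name: 0.0 for name in features}
    for order in permutations(features):
        current = dict(baseline)            # start from the baseline input
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # switch one feature to its actual value
            value = predict(current)
            contributions[name] += value - previous
            previous = value
    n_orderings = factorial(len(features))
    return {name: total / n_orderings for name, total in contributions.items()}


def toy_model(x: dict) -> float:
    # hypothetical model: two linear terms plus one interaction term
    return 2.0 * x["unitprice"] + x["unitprice"] * x["is_registered"] + 3.0 * x["day_of_week"]


instance = {"unitprice": 4.0, "is_registered": 1.0, "day_of_week": 2.0}
baseline = {"unitprice": 0.0, "is_registered": 0.0, "day_of_week": 0.0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

<p>The values add up to the difference between the prediction for <code>instance</code> and the prediction for <code>baseline</code>, and the interaction term is split evenly between the two features involved. Enumerating every ordering is only feasible for a handful of features.</p>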
<h4 id="heading-dvc-configuration-4">DVC Configuration</h4>
<p>First, we’ll add the <code>inference_primary_model</code> stage to the DVC configuration.</p>
<p>This stage includes a <code>plots</code> section, where DVC tracks and versions the visualization files generated from the SHAP values.</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">inference_primary_model:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/inference.py
      data/x_test_processed_${params.stockcode}.parquet
      data/y_test_df_${params.stockcode}.parquet
      models/production/dfn_best_${params.stockcode}.pth
      ${params.stockcode}
      ${tracking.sensitive_feature_col}
      ${tracking.privileged_group}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/inference.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">models/production/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.sensitive_feature_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.privileged_group</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_inf_${params.stockcode}.json:</span> <span class="hljs-comment"># dvc track</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>

    <span class="hljs-attr">plots:</span>
      <span class="hljs-comment"># shap summary / beeswarm plot for global interpretability</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_shap_summary_${params.stockcode}.json:</span>
          <span class="hljs-attr">template:</span> <span class="hljs-string">simple</span>
          <span class="hljs-attr">x:</span> <span class="hljs-string">shap_value</span>
          <span class="hljs-attr">y:</span> <span class="hljs-string">feature_name</span>
          <span class="hljs-attr">title:</span> <span class="hljs-string">SHAP</span> <span class="hljs-string">Beeswarm</span> <span class="hljs-string">Plot</span>

      <span class="hljs-comment"># shap mean absolute vals - feature importance bar plot</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_shap_mean_abs_${params.stockcode}.json:</span>
          <span class="hljs-attr">template:</span> <span class="hljs-string">bar</span>
          <span class="hljs-attr">x:</span> <span class="hljs-string">mean_abs_shap</span>
          <span class="hljs-attr">y:</span> <span class="hljs-string">feature_name</span>
          <span class="hljs-attr">title:</span> <span class="hljs-string">Mean</span> <span class="hljs-string">Absolute</span> <span class="hljs-string">SHAP</span> <span class="hljs-string">Importance</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/dfn_inference_results_${params.stockcode}.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_raw_shap_values_${params.stockcode}.parquet</span> <span class="hljs-comment"># save raw shap vals for detailed analysis later</span>
</code></pre>
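<p>The two plot entries reference the fields <code>shap_value</code>, <code>mean_abs_shap</code>, and <code>feature_name</code>. The sketch below shows one way to shape a matrix of SHAP values into matching records, assuming the report files are JSON arrays of objects. <code>shap_plot_records</code> is a hypothetical helper, and the feature names and values are made up:</p>

```python
import json


def shap_plot_records(shap_values: list[list[float]], feature_names: list[str]):
    """Build long-format records from a (samples x features) matrix of SHAP values."""
    # one record per (sample, feature) pair for the beeswarm-style plot
    summary = [
        {"feature_name": name, "shap_value": value}
        for row in shap_values
        for name, value in zip(feature_names, row)
    ]
    # one record per feature for the importance bar plot
    n_samples = len(shap_values)
    mean_abs = [
        {
            "feature_name": name,
            "mean_abs_shap": sum(abs(row[j]) for row in shap_values) / n_samples,
        }
        for j, name in enumerate(feature_names)
    ]
    return summary, mean_abs


# made-up SHAP values for two samples and two features
names = ["num__unitprice", "cat__country_France"]
values = [[0.5, -0.25], [-1.5, 0.75]]
summary, mean_abs = shap_plot_records(values, names)
print(json.dumps(mean_abs, indent=2))
```

<p>Each list can then be written with <code>json.dump</code> to the corresponding path under <code>reports/</code>.</p>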
<h4 id="heading-python-scripts-4">Python Scripts</h4>
<p>Next, we’ll add a script that runs inference with the trained model:</p>
<p><code>src/model/torch_model/inference.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> shap

<span class="hljs-keyword">import</span> src.model.torch_model.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># load test dataset</span>
    X_TEST_PATH = sys.argv[<span class="hljs-number">1</span>]
    Y_TEST_PATH = sys.argv[<span class="hljs-number">2</span>]
    X_test, y_test = pd.read_parquet(X_TEST_PATH), pd.read_parquet(Y_TEST_PATH)

    <span class="hljs-comment"># create X_test w/ column names for shap analysis and sensitive feature tracking</span>
    X_test_with_col_names = X_test.copy()
    FEATURE_NAMES_PATH = os.path.join(<span class="hljs-string">'preprocessors'</span>, <span class="hljs-string">'feature_names.json'</span>)
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">with</span> open(FEATURE_NAMES_PATH, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f: feature_names = json.load(f)
    <span class="hljs-keyword">except</span> FileNotFoundError: feature_names = X_test.columns.tolist()
    <span class="hljs-keyword">if</span> len(X_test_with_col_names.columns) == len(feature_names): X_test_with_col_names.columns = feature_names

    <span class="hljs-comment"># reconstruct the optimal model tuned in the previous stage</span>
    MODEL_PATH = sys.argv[<span class="hljs-number">3</span>]
    checkpoint = torch.load(MODEL_PATH)
    model = scripts.load_model(checkpoint=checkpoint)

    <span class="hljs-comment"># perform inference</span>
    y_pred, mse, exp_mae, rmsle = scripts.perform_inference(model=model, X=X_test, y=y_test, batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>])

    <span class="hljs-comment"># create result df w/ y_pred, y_true, and sensitive features</span>
    STOCKCODE = sys.argv[<span class="hljs-number">4</span>]
    SENSITIVE_FEATURE = sys.argv[<span class="hljs-number">5</span>]
    PRIVILEGED_GROUP = sys.argv[<span class="hljs-number">6</span>]
    inference_df = pd.DataFrame(y_pred.cpu().numpy().flatten(), columns=[<span class="hljs-string">'y_pred'</span>])
    inference_df[<span class="hljs-string">'y_true'</span>] = y_test
    inference_df[SENSITIVE_FEATURE] = X_test_with_col_names[<span class="hljs-string">f'cat__<span class="hljs-subst">{SENSITIVE_FEATURE}</span>_<span class="hljs-subst">{str(PRIVILEGED_GROUP)}</span>'</span>].astype(bool)
    inference_df.to_parquet(path=os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">f'dfn_inference_results_<span class="hljs-subst">{STOCKCODE}</span>.parquet'</span>))

    <span class="hljs-comment"># record inference metrics</span>
    MODEL_INF_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'dfn_inf_<span class="hljs-subst">{STOCKCODE}</span>.json'</span>)
    os.makedirs(os.path.dirname(MODEL_INF_METRICS_PATH), exist_ok=<span class="hljs-literal">True</span>)
    model_version = <span class="hljs-string">f"dfn_<span class="hljs-subst">{STOCKCODE}</span>_<span class="hljs-subst">{os.getpid()}</span>"</span>
    inf_metrics = dict(
        stockcode=STOCKCODE,
        mse_inf=mse,
        mae_inf=exp_mae,
        rmsle_inf=rmsle,
        model_version=model_version,
        hparams=checkpoint[<span class="hljs-string">'hparams'</span>],
        optimizer=checkpoint[<span class="hljs-string">'optimizer_name'</span>],
        batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>],
        lr=checkpoint[<span class="hljs-string">'lr'</span>],
        timestamp=datetime.datetime.now().isoformat()
    )
    <span class="hljs-keyword">with</span> open(MODEL_INF_METRICS_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f: <span class="hljs-comment"># dvc track</span>
        json.dump(inf_metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... inference metrics saved to <span class="hljs-subst">{MODEL_INF_METRICS_PATH}</span> ...'</span>)


    <span class="hljs-comment">## shap analysis</span>
    <span class="hljs-comment"># compute shap vals</span>
    model.eval()

    <span class="hljs-comment"># prepare background data</span>
    X_test_tensor = torch.from_numpy(X_test.values.astype(np.float32)).to(device_type)

    <span class="hljs-comment"># take the small samples from x_test as background</span>
    background = X_test_tensor[np.random.choice(X_test_tensor.shape[<span class="hljs-number">0</span>], <span class="hljs-number">100</span>, replace=<span class="hljs-literal">False</span>)].to(device_type)

    <span class="hljs-comment"># define deepexplainer</span>
    explainer = shap.DeepExplainer(model, background)

    <span class="hljs-comment"># compute shap vals</span>
    shap_values = explainer.shap_values(X_test_tensor) <span class="hljs-comment"># outputs = numpy array or tensor</span>

    <span class="hljs-comment"># convert shap array to pandas df</span>
    <span class="hljs-keyword">if</span> isinstance(shap_values, list): shap_values = shap_values[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">if</span> isinstance(shap_values, torch.Tensor): shap_values = shap_values.cpu().numpy()
    shap_values = shap_values.squeeze(axis=<span class="hljs-number">-1</span>) <span class="hljs-comment"># type: ignore</span>
    shap_df = pd.DataFrame(shap_values, columns=feature_names)

    <span class="hljs-comment"># shap raw data (dvc track)</span>
    RAW_SHAP_OUT_PATH = os.path.join(<span class="hljs-string">'reports'</span>, <span class="hljs-string">f'dfn_raw_shap_values_<span class="hljs-subst">{STOCKCODE}</span>.parquet'</span>)
    os.makedirs(os.path.dirname(RAW_SHAP_OUT_PATH), exist_ok=<span class="hljs-literal">True</span>)
    shap_df.to_parquet(RAW_SHAP_OUT_PATH, index=<span class="hljs-literal">False</span>)
    main_logger.info(<span class="hljs-string">f'... shap values saved to <span class="hljs-subst">{RAW_SHAP_OUT_PATH}</span> ...'</span>)

    <span class="hljs-comment"># bar plot of mean abs shap vals (dvc report)</span>
    mean_abs_shap = shap_df.abs().mean().sort_values(ascending=<span class="hljs-literal">False</span>)
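    <span class="hljs-comment"># take names from the sorted index so each feature stays paired with its own value</span>
    shap_mean_abs_df = pd.DataFrame({<span class="hljs-string">'feature_name'</span>: mean_abs_shap.index.tolist(), <span class="hljs-string">'mean_abs_shap'</span>: mean_abs_shap.values })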
    MEAN_ABS_SHAP_PATH = os.path.join(<span class="hljs-string">'reports'</span>, <span class="hljs-string">f'dfn_shap_mean_abs_<span class="hljs-subst">{STOCKCODE}</span>.json'</span>)
    shap_mean_abs_df.to_json(MEAN_ABS_SHAP_PATH, orient=<span class="hljs-string">'records'</span>, indent=<span class="hljs-number">4</span>)
</code></pre>
<h4 id="heading-outputs-4"><strong>Outputs</strong></h4>
<p>This stage generates five output files:</p>
<ul>
<li><p><code>data/dfn_inference_results_${params.stockcode}.parquet</code>: Stores the predictions, the true targets, and the sensitive feature column (in general, attributes such as gender, age, or income). I’ll use this file for the fairness test in the last stage.</p>
</li>
<li><p><code>metrics/dfn_inf.json</code>: Stores evaluation metrics and tuning results:</p>
</li>
</ul>
<pre><code class="lang-json">{
    <span class="hljs-attr">"stockcode"</span>: <span class="hljs-string">"85123A"</span>,
    <span class="hljs-attr">"mse_inf"</span>: <span class="hljs-number">0.6841545701026917</span>,
    <span class="hljs-attr">"mae_inf"</span>: <span class="hljs-number">11.5866117477417</span>,
    <span class="hljs-attr">"rmsle_inf"</span>: <span class="hljs-number">0.7423332333564758</span>,
    <span class="hljs-attr">"model_version"</span>: <span class="hljs-string">"dfn_85123A_35834"</span>,
    <span class="hljs-attr">"hparams"</span>: {
        <span class="hljs-attr">"num_layers"</span>: <span class="hljs-number">4</span>,
        <span class="hljs-attr">"batch_norm"</span>: <span class="hljs-literal">false</span>,
        <span class="hljs-attr">"dropout_rate_layer_0"</span>: <span class="hljs-number">0.13765888061300502</span>,
        <span class="hljs-attr">"n_units_layer_0"</span>: <span class="hljs-number">184</span>,
        <span class="hljs-attr">"dropout_rate_layer_1"</span>: <span class="hljs-number">0.5509872409359128</span>,
        <span class="hljs-attr">"n_units_layer_1"</span>: <span class="hljs-number">122</span>,
        <span class="hljs-attr">"dropout_rate_layer_2"</span>: <span class="hljs-number">0.2408753527744403</span>,
        <span class="hljs-attr">"n_units_layer_2"</span>: <span class="hljs-number">35</span>,
        <span class="hljs-attr">"dropout_rate_layer_3"</span>: <span class="hljs-number">0.03451842588822594</span>,
        <span class="hljs-attr">"n_units_layer_3"</span>: <span class="hljs-number">224</span>,
        <span class="hljs-attr">"learning_rate"</span>: <span class="hljs-number">0.026240673135104406</span>,
        <span class="hljs-attr">"optimizer"</span>: <span class="hljs-string">"adamax"</span>,
        <span class="hljs-attr">"batch_size"</span>: <span class="hljs-number">64</span>
    },
    <span class="hljs-attr">"optimizer"</span>: <span class="hljs-string">"adamax"</span>,
    <span class="hljs-attr">"batch_size"</span>: <span class="hljs-number">64</span>,
    <span class="hljs-attr">"lr"</span>: <span class="hljs-number">0.026240673135104406</span>,
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:31:12.946405"</span>
}
</code></pre>
<ul>
<li><code>reports/dfn_shap_mean_abs.json</code>: Stores the mean absolute SHAP value of each feature:</li>
</ul>
<pre><code class="lang-json">[
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__invoicedate"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.219255722</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__unitprice"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.1069829418</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__product_avg_quantity_last_month"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.1021453096</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__product_max_price_all_time"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.0855356899</span>
    },
...
]
</code></pre>
<ul>
<li><p><code>reports/dfn_shap_summary.json</code>: Contains the data points necessary to draw the beeswarm/bar plots.</p>
</li>
<li><p><code>reports/dfn_raw_shap_values.parquet</code>: Stores raw SHAP values.</p>
</li>
</ul>
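<p>To make the mean absolute SHAP ranking concrete, here is a minimal, dependency-free sketch of the calculation. The helper name <code>mean_abs_shap</code> and the sample values are hypothetical and not part of the project. The point it illustrates is that names and values are sorted together as pairs, so each feature keeps its own score:</p>

```python
from statistics import mean


def mean_abs_shap(shap_rows, feature_names):
    """Return (feature_name, mean |SHAP|) pairs, largest first.

    shap_rows is a list of per-sample SHAP vectors, one value per feature.
    """
    ranked = []
    for j, name in enumerate(feature_names):
        column = [abs(row[j]) for row in shap_rows]
        ranked.append((name, mean(column)))
    # sort the pairs together so each name stays attached to its own value
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# hypothetical SHAP values: 3 samples x 2 features
rows = [[0.1, -0.4], [-0.3, 0.2], [0.2, -0.6]]
ranking = mean_abs_shap(rows, ["num__unitprice", "num__invoicedate"])
```

<p>In this toy input, <code>num__invoicedate</code> ranks first because its absolute contributions are larger on average, even though its raw SHAP values have mixed signs.</p>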
<h3 id="heading-stage-6-assessing-model-risk-and-fairness">Stage 6: Assessing Model Risk and Fairness</h3>
<p>The last stage is to assess risk and fairness of the final inference results.</p>
<h4 id="heading-the-fairness-testing">The Fairness Testing</h4>
<p>Fairness testing in ML is the process of systematically evaluating a model’s predictions to ensure they are not unfairly biased toward specific groups defined by sensitive attributes like race and gender.</p>
<p>In this project, we’ll use the registration status <code>is_registered</code> column as a sensitive feature and make sure the <strong>Mean Outcome Difference (MOD)</strong> is within the specified threshold of <code>0.1</code>.</p>
<p>The MOD is calculated as the absolute difference between the mean prediction values of the privileged (registered) and unprivileged (unregistered) groups.</p>
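<p>As a quick illustration of the definition above, here is a minimal sketch of the MOD on a handful of made-up predictions. The function name, the records, and the choice to raise an error when a group is empty are assumptions for this example, not the project’s implementation:</p>

```python
from statistics import mean


def mean_outcome_difference(records, sensitive_col, privileged_value=True):
    """Mean prediction of the unprivileged group minus that of the privileged group."""
    priv = [r["y_pred"] for r in records if r[sensitive_col] == privileged_value]
    unpriv = [r["y_pred"] for r in records if r[sensitive_col] != privileged_value]
    if not priv or not unpriv:
        # without both groups there is nothing to compare
        raise ValueError("both groups need at least one prediction")
    return mean(unpriv) - mean(priv)


# hypothetical inference results
records = [
    {"y_pred": 2.0, "is_registered": True},
    {"y_pred": 2.2, "is_registered": True},
    {"y_pred": 2.05, "is_registered": False},
    {"y_pred": 2.25, "is_registered": False},
]
mod = mean_outcome_difference(records, "is_registered")
# acceptable when the absolute difference does not exceed the 0.1 threshold
is_mod_acceptable = not abs(mod) > 0.1
```

<p>Here the two group means are 2.15 and 2.1, so the MOD is roughly 0.05 and passes the threshold. Raising on an empty group is one possible design choice: it avoids reporting a difference against an implicit zero when the test set contains only one group.</p>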
<h4 id="heading-dvc-configuration-5">DVC Configuration</h4>
<p>First, we’ll add the <code>assess_model_risk</code> stage right after the <code>inference_primary_model</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">inference_primary_model:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">assess_model_risk:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/assess_risk_and_fairness.py
      data/dfn_inference_results_${params.stockcode}.parquet
      metrics/dfn_risk_fairness_${params.stockcode}.json
      ${tracking.sensitive_feature_col}
      ${params.stockcode}
      ${tracking.privileged_group}
      ${tracking.mod_threshold}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/assess_risk_and_fairness.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/dfn_inference_results_${params.stockcode}.parquet</span> <span class="hljs-comment"># ensure the result df as dependency</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.sensitive_feature_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.privileged_group</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.mod_threshold</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_risk_fairness_${params.stockcode}.json:</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>
</code></pre>
<p>Then we’ll add default values to the parameters:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>

<span class="hljs-attr">tuning:</span>
  <span class="hljs-attr">n_trials:</span> <span class="hljs-number">100</span>
  <span class="hljs-attr">num_epochs:</span> <span class="hljs-number">3000</span>
  <span class="hljs-attr">should_local_save:</span> <span class="hljs-literal">False</span>
  <span class="hljs-attr">grid:</span> <span class="hljs-literal">False</span>

<span class="hljs-comment"># adding default values to the tracking metrics</span>
<span class="hljs-attr">tracking:</span>
  <span class="hljs-attr">sensitive_feature_col:</span> <span class="hljs-string">"is_registered"</span>
  <span class="hljs-attr">privileged_group:</span> <span class="hljs-number">1</span> <span class="hljs-comment"># member</span>
  <span class="hljs-attr">mod_threshold:</span> <span class="hljs-number">0.1</span>
</code></pre>
<h4 id="heading-python-script">Python Script</h4>
<p>The corresponding Python script contains the <code>calculate_fairness_metrics</code> function which performs the risk and fairness assessment:</p>
<p><code>src/model/torch_model/assess_risk_and_fairness.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_absolute_error, mean_squared_error, root_mean_squared_log_error

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">calculate_fairness_metrics</span>(<span class="hljs-params">
        df: pd.DataFrame,
        sensitive_feature_col: str,
        label_col: str = <span class="hljs-string">'y_true'</span>,
        prediction_col: str = <span class="hljs-string">'y_pred'</span>,
        privileged_group: int = <span class="hljs-number">1</span>,
        mod_threshold: float = <span class="hljs-number">0.1</span>,
    </span>) -&gt; dict:</span>

    metrics = dict()
    unprivileged_group = <span class="hljs-number">0</span> <span class="hljs-keyword">if</span> privileged_group == <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-number">1</span>

    <span class="hljs-comment">## 1. risk assessment - predictive performance metrics by group</span>
    <span class="hljs-keyword">for</span> group, name <span class="hljs-keyword">in</span> zip([unprivileged_group, privileged_group], [<span class="hljs-string">'unprivileged'</span>, <span class="hljs-string">'privileged'</span>]):
        subset = df[df[sensitive_feature_col] == group]
        <span class="hljs-keyword">if</span> len(subset) == <span class="hljs-number">0</span>: <span class="hljs-keyword">continue</span>

        y_true = subset[label_col].values
        y_pred = subset[prediction_col].values

        metrics[<span class="hljs-string">f'mse_<span class="hljs-subst">{name}</span>'</span>] = float(mean_squared_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>
        metrics[<span class="hljs-string">f'mae_<span class="hljs-subst">{name}</span>'</span>] = float(mean_absolute_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>
        metrics[<span class="hljs-string">f'rmsle_<span class="hljs-subst">{name}</span>'</span>] = float(root_mean_squared_log_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>

        <span class="hljs-comment"># mean prediction (outcome disparity component)</span>
        metrics[<span class="hljs-string">f'mean_prediction_<span class="hljs-subst">{name}</span>'</span>] = float(y_pred.mean()) <span class="hljs-comment"># type: ignore</span>

    <span class="hljs-comment">## 2. bias assessment - fairness metrics</span>
    <span class="hljs-comment"># difference in mean absolute error (unprivileged - privileged)</span>
    mae_diff = metrics.get(<span class="hljs-string">'mae_unprivileged'</span>, <span class="hljs-number">0</span>) - metrics.get(<span class="hljs-string">'mae_privileged'</span>, <span class="hljs-number">0</span>)
    metrics[<span class="hljs-string">'mae_diff'</span>] = float(mae_diff)

    <span class="hljs-comment"># mean outcome difference</span>
    mod = metrics.get(<span class="hljs-string">'mean_prediction_unprivileged'</span>, <span class="hljs-number">0</span>) - metrics.get(<span class="hljs-string">'mean_prediction_privileged'</span>, <span class="hljs-number">0</span>)
    metrics[<span class="hljs-string">'mean_outcome_difference'</span>] = float(mod)
    metrics[<span class="hljs-string">'is_mod_acceptable'</span>] = <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> abs(mod) &lt;= mod_threshold <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>

    <span class="hljs-keyword">return</span> metrics


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    parser = argparse.ArgumentParser(description=<span class="hljs-string">'assess bias and fairness metrics on model inference results.'</span>)
    parser.add_argument(<span class="hljs-string">'inference_file_path'</span>, type=str, help=<span class="hljs-string">'parquet file path to the inference results w/ y_true, y_pred, and sensitive feature cols.'</span>)
    parser.add_argument(<span class="hljs-string">'metrics_output_path'</span>, type=str, help=<span class="hljs-string">'json file path to save the metrics output.'</span>)
    parser.add_argument(<span class="hljs-string">'sensitive_feature_col'</span>, type=str, help=<span class="hljs-string">'column name of sensitive features'</span>)
    parser.add_argument(<span class="hljs-string">'stockcode'</span>, type=str)
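    parser.add_argument(<span class="hljs-string">'privileged_group'</span>, type=int, nargs=<span class="hljs-string">'?'</span>, default=<span class="hljs-number">1</span>) <span class="hljs-comment"># nargs='?' lets the default apply when omitted</span>
    parser.add_argument(<span class="hljs-string">'mod_threshold'</span>, type=float, nargs=<span class="hljs-string">'?'</span>, default=<span class="hljs-number">.1</span>)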
    args = parser.parse_args()

    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># load inf df</span>
        df_inference = pd.read_parquet(args.inference_file_path)
        LABEL_COL = <span class="hljs-string">'y_true'</span>
        PREDICTION_COL = <span class="hljs-string">'y_pred'</span>
        SENSITIVE_COL = args.sensitive_feature_col

        <span class="hljs-comment"># compute fairness metrics</span>
        metrics = calculate_fairness_metrics(
            df=df_inference,
            sensitive_feature_col=SENSITIVE_COL,
            label_col=LABEL_COL,
            prediction_col=PREDICTION_COL,
            privileged_group=args.privileged_group,
            mod_threshold=args.mod_threshold,
        )

        <span class="hljs-comment"># add items to metrics</span>
        metrics[<span class="hljs-string">'model_version'</span>] = <span class="hljs-string">f'dfn_<span class="hljs-subst">{args.stockcode}</span>_<span class="hljs-subst">{os.getpid()}</span>'</span>
        metrics[<span class="hljs-string">'sensitive_feature'</span>] = args.sensitive_feature_col
        metrics[<span class="hljs-string">'privileged_group'</span>] = args.privileged_group
        metrics[<span class="hljs-string">'mod_threshold'</span>] = args.mod_threshold
        metrics[<span class="hljs-string">'stockcode'</span>] = args.stockcode
        metrics[<span class="hljs-string">'timestamp'</span>] = datetime.datetime.now().isoformat()

        <span class="hljs-comment"># save metrics (dvc track)</span>
        <span class="hljs-keyword">with</span> open(args.metrics_output_path, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
            json_metrics = { k: (v <span class="hljs-keyword">if</span> pd.notna(v) <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>) <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> metrics.items() }
            json.dump(json_metrics, f, indent=<span class="hljs-number">4</span>)

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        main_logger.error(<span class="hljs-string">f'... an error occurred during risk and fairness assessment: <span class="hljs-subst">{e}</span> ...'</span>)
        exit(<span class="hljs-number">1</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    main()
</code></pre>
<h4 id="heading-outputs-5">Outputs</h4>
<p>The final stage generates a metrics file which contains test results and model version:</p>
<p><code>metrics/dfn_risk_fairness.json</code>:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"mse_unprivileged"</span>: <span class="hljs-number">3.5370739412593575</span>,
    <span class="hljs-attr">"mae_unprivileged"</span>: <span class="hljs-number">1.48263614013523</span>,
    <span class="hljs-attr">"rmsle_unprivileged"</span>: <span class="hljs-number">0.6080000224747837</span>,
    <span class="hljs-attr">"mean_prediction_unprivileged"</span>: <span class="hljs-number">1.8507767915725708</span>,
    <span class="hljs-attr">"mae_diff"</span>: <span class="hljs-number">1.48263614013523</span>,
    <span class="hljs-attr">"mean_outcome_difference"</span>: <span class="hljs-number">1.8507767915725708</span>,
    <span class="hljs-attr">"is_mod_acceptable"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-attr">"model_version"</span>: <span class="hljs-string">"dfn_85123A_35971"</span>,
    <span class="hljs-attr">"sensitive_feature"</span>: <span class="hljs-string">"is_registered"</span>,
    <span class="hljs-attr">"privileged_group"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-attr">"mod_threshold"</span>: <span class="hljs-number">0.1</span>,
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:31:15.998590"</span>
}
</code></pre>
<p>That’s all for the lineage configuration. Now, we’ll test it locally.</p>
<h3 id="heading-test-in-local">Test in Local</h3>
<p>We’ll run the entire ML lineage with this command:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> repro -f
</code></pre>
<p>The <code>-f</code> flag forces DVC to rerun every stage, whether or not its dependencies have changed.</p>
<p>The command will automatically create the <code>dvc.lock</code> file at the root of the project directory:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">schema:</span> <span class="hljs-string">'2.0'</span>
<span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline_full:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">python</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>
    <span class="hljs-attr">deps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">src/_utils/</span>
      <span class="hljs-attr">hash:</span> <span class="hljs-string">md5</span>
      <span class="hljs-attr">md5:</span> <span class="hljs-string">ae41392532188d290395495f6827ed00.dir</span>
      <span class="hljs-attr">size:</span> <span class="hljs-number">15870</span>
      <span class="hljs-attr">nfiles:</span> <span class="hljs-number">10</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-attr">hash:</span> <span class="hljs-string">md5</span>
      <span class="hljs-attr">md5:</span> <span class="hljs-string">a8a61a4b270581a7c387d51e416f4e86.dir</span>
      <span class="hljs-attr">size:</span> <span class="hljs-number">95715</span>
<span class="hljs-string">...</span>
</code></pre>
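<p>The <code>md5</code> entries in the lock file are content hashes: DVC compares the recorded hash with the current one to decide whether a dependency has changed. The sketch below illustrates that idea for a single file using only the standard library. The helper names are hypothetical, and DVC’s own logic (especially for directories, whose hashes end in <code>.dir</code>) is more involved than this:</p>

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Hex md5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_md5):
    """True when the file no longer matches the hash recorded earlier."""
    return file_md5(path) != recorded_md5


with tempfile.TemporaryDirectory() as tmp:
    dep = os.path.join(tmp, "dep.txt")  # hypothetical dependency file
    with open(dep, "w") as f:
        f.write("v1")
    recorded = file_md5(dep)            # what a lock file would store
    unchanged = has_changed(dep, recorded)
    with open(dep, "w") as f:
        f.write("v2")
    changed = has_changed(dep, recorded)
```

<p>Without <code>-f</code>, a stage whose dependencies all match their recorded hashes can be skipped, which is what makes repeated runs cheap.</p>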
<p>The <code>dvc.lock</code> file must be committed to Git so that DVC can resolve the latest versions of the tracked files:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$git</span> add dvc.lock .dvc dvc.yaml params.yaml
<span class="hljs-variable">$git</span> commit -m<span class="hljs-string">'updated dvc config'</span>
<span class="hljs-variable">$git</span> push
</code></pre>
<h2 id="heading-step-3-deploying-the-dvc-project">Step 3: Deploying the DVC Project</h2>
<p>Next, we’ll deploy the DVC project to ensure the AWS Lambda function can access the cached files in production.</p>
<p>We’ll start by configuring the DVC remote where the cached files are stored.</p>
<p>DVC offers <a target="_blank" href="https://dvc.org/doc/user-guide/data-management/remote-storage#supported-storage-types">various storage types</a> like AWS S3 and Google Cloud Storage. We’ll use AWS S3 for this project, but your choice depends on the project ecosystem, your familiarity with the tool, and any resource constraints.</p>
<p>First, we’ll create a new S3 bucket in the selected AWS region:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$aws</span> s3 mb s3://&lt;PROJECT NAME&gt;/&lt;BUCKET NAME&gt;  --region &lt;AWS REGION&gt;
</code></pre>
<p>Make sure the IAM role has the following permissions: <code>s3:ListBucket</code>, <code>s3:GetObject</code>, <code>s3:PutObject</code>, and <code>s3:DeleteObject</code>.</p>
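<p>As a rough sketch of what such a policy could look like, the snippet below builds an IAM policy document as a Python dictionary. The bucket name is a placeholder and the helper name is hypothetical. Note that <code>s3:ListBucket</code> applies to the bucket itself, while the object actions apply to the keys inside it:</p>

```python
import json


def build_dvc_remote_policy(bucket_name):
    """Build a minimal IAM policy document for a DVC remote on S3."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # bucket-level action: listing the bucket contents
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # object-level actions: reading, writing, and deleting cache files
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


policy = build_dvc_remote_policy("my-dvc-remote-bucket")  # placeholder bucket name
policy_json = json.dumps(policy, indent=4)
```

<p>The resulting JSON can be attached to the role used by the pipeline. Adjust the resources if the remote lives under a prefix rather than at the bucket root.</p>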
<p>Then, add the URI of the S3 bucket to the DVC remote:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> remote add -d &lt;DVC REMOTE NAME&gt; s3://&lt;PROJECT NAME&gt;/&lt;BUCKET NAME&gt;
</code></pre>
<p>Next, push the cache files to the DVC remote:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> push
</code></pre>
<p>Now, all cache files are stored in the S3 bucket:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/0*yl9N4P8LNI7d_G_z.png" alt="Figure C. Screenshot of the DVC remote in AWS S3 bucket" width="600" height="400" loading="lazy"></p>
<p><strong>Figure C.</strong> Screenshot of the DVC remote in AWS S3 bucket</p>
<p>As shown in <strong>Figure A,</strong> this deployment step is necessary for the AWS Lambda function to access the DVC cache in production.</p>
<h2 id="heading-step-4-configuring-scheduled-run-with-prefect"><strong>Step 4: Configuring Scheduled Run with Prefect</strong></h2>
<p>The next step is to configure the scheduled run of the entire lineage with Prefect.</p>
<p>Prefect is an open-source workflow orchestration tool for building, scheduling, and monitoring pipelines. It uses a concept called a work pool to effectively decouple the orchestration logic from the execution infrastructure.</p>
<p>The work pool serves as a standardized base configuration: it runs flows in a Docker container image, which guarantees a consistent execution environment for all of them.</p>
<h3 id="heading-configuring-the-docker-image-registry">Configuring the Docker Image Registry</h3>
<p>The first step is to configure the Docker image registry for the Prefect work pool:</p>
<ul>
<li><p>For local deployment: <strong>A container registry on Docker Hub</strong>.</p>
</li>
<li><p>For production deployment: <strong>AWS ECR</strong>.</p>
</li>
</ul>
<p>For local deployment, we’ll first authenticate the Docker client:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> login
</code></pre>
<p>Then, on macOS, grant the current user permission to run Docker commands without <code>sudo</code>:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$sudo</span> dscl . -append /Groups/docker GroupMembership <span class="hljs-variable">$USER</span>
</code></pre>
<p>For production deployment, we’ll create a new ECR repository:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$aws</span> ecr create-repository --repository-name &lt;REPOSITORY NAME&gt; --region &lt;AWS REGION&gt;
</code></pre>
<p>(Make sure the IAM role has access to this new ECR URI.)</p>
<h3 id="heading-configure-prefect-tasks-and-flows">Configure Prefect Tasks and Flows</h3>
<p>Next, we’ll configure the Prefect <code>task</code> and <code>flow</code> in the project:</p>
<ul>
<li><p>The Prefect <code>task</code> executes the <code>dvc repro</code> and <code>dvc push</code> commands.</p>
</li>
<li><p>The Prefect <code>flow</code> runs the Prefect <code>task</code> on a weekly schedule.</p>
</li>
</ul>
<p><code>src/prefect_flows.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> subprocess
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> timedelta, datetime
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> prefect <span class="hljs-keyword">import</span> flow, task
<span class="hljs-keyword">from</span> prefect.schedules <span class="hljs-keyword">import</span> Schedule
<span class="hljs-keyword">from</span> prefect_aws <span class="hljs-keyword">import</span> AwsCredentials

<span class="hljs-comment"># add project root to the python path before importing from src - enabling prefect to find the script</span>
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), <span class="hljs-string">'..'</span>)))

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># define the prefect task</span>
<span class="hljs-meta">@task(retries=3, retry_delay_seconds=30)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_dvc_pipeline</span>():</span>
    <span class="hljs-comment"># execute the dvc pipeline </span>
    result = subprocess.run([<span class="hljs-string">"dvc"</span>, <span class="hljs-string">"repro"</span>], capture_output=<span class="hljs-literal">True</span>, text=<span class="hljs-literal">True</span>, check=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># push the updated data</span>
    subprocess.run([<span class="hljs-string">"dvc"</span>, <span class="hljs-string">"push"</span>], check=<span class="hljs-literal">True</span>)


<span class="hljs-comment"># define the prefect flow</span>
<span class="hljs-meta">@flow(name="Weekly Data Pipeline")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">weekly_data_flow</span>():</span>
    run_dvc_pipeline()

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># docker image registry (either docker hub or aws ecr)</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    ENV = os.getenv(<span class="hljs-string">'ENV'</span>, <span class="hljs-string">'production'</span>)
    DOCKER_HUB_REPO = os.getenv(<span class="hljs-string">'DOCKER_HUB_REPO'</span>)
    ECR_FOR_PREFECT_PATH = os.getenv(<span class="hljs-string">'S3_BUCKET_FOR_PREFECT_PATH'</span>)
    image_repo = <span class="hljs-string">f'<span class="hljs-subst">{DOCKER_HUB_REPO}</span>:ml-sales-pred-data-latest'</span> <span class="hljs-keyword">if</span> ENV == <span class="hljs-string">'local'</span> <span class="hljs-keyword">else</span> <span class="hljs-string">f'<span class="hljs-subst">{ECR_FOR_PREFECT_PATH}</span>:latest'</span>

    <span class="hljs-comment"># define weekly schedule</span>
    weekly_schedule = Schedule(
        interval=timedelta(weeks=<span class="hljs-number">1</span>),
        anchor_date=datetime(<span class="hljs-number">2025</span>, <span class="hljs-number">9</span>, <span class="hljs-number">29</span>, <span class="hljs-number">9</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>),
        active=<span class="hljs-literal">True</span>,
    )

    <span class="hljs-comment"># aws credentials to access ecr</span>
    AwsCredentials(
        aws_access_key_id=os.getenv(<span class="hljs-string">'AWS_ACCESS_KEY_ID'</span>),
        aws_secret_access_key=os.getenv(<span class="hljs-string">'AWS_SECRET_ACCESS_KEY'</span>),
        region_name=os.getenv(<span class="hljs-string">'AWS_REGION_NAME'</span>),
    ).save(<span class="hljs-string">'aws'</span>, overwrite=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># deploy the prefect flow</span>
    weekly_data_flow.deploy(
        name=<span class="hljs-string">'weekly-data-flow'</span>,
        schedule=weekly_schedule, <span class="hljs-comment"># schedule</span>
        work_pool_name=<span class="hljs-string">"wp-ml-sales-pred"</span>, <span class="hljs-comment"># work pool where the docker image (flow) runs</span>
        image=image_repo, <span class="hljs-comment"># create a docker image at docker hub (local) or ecr (production)</span>
        concurrency_limit=<span class="hljs-number">3</span>,
        push=<span class="hljs-literal">True</span> <span class="hljs-comment"># push the docker image to the image_repo</span>
    )
</code></pre>
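<p>To see what the weekly schedule above implies, here is a small standard-library sketch that lists upcoming run times. It assumes runs fall on the anchor date plus whole multiples of the interval, and it ignores time zones and Prefect’s own scheduler internals. The <code>next_runs</code> helper is hypothetical and not part of Prefect.</p>
<pre><code class="lang-python">from datetime import datetime, timedelta

# same anchor and interval as the schedule in the deployment script
ANCHOR = datetime(2025, 9, 29, 9, 0, 0)
INTERVAL = timedelta(weeks=1)


def next_runs(now, count=3):
    """Return the next `count` run times strictly after `now` (hypothetical helper)."""
    # whole intervals elapsed since the anchor (floor division, negative before it)
    elapsed = (now - ANCHOR) // INTERVAL
    first = ANCHOR + (elapsed + 1) * INTERVAL
    return [first + i * INTERVAL for i in range(count)]


for run in next_runs(datetime(2025, 10, 1, 12, 0, 0)):
    print(run.strftime("%A %Y-%m-%d %H:%M"))
</code></pre>
<p>Because September 29, 2025 is a Monday, every run in this sketch lands on a Monday at 09:00.</p>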
<h3 id="heading-test-in-local-1">Test in Local</h3>
<p>Next, we’ll test the workflow locally with the Prefect server:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$uv</span> run prefect server start

<span class="hljs-variable">$export</span> PREFECT_API_URL=<span class="hljs-string">"http://127.0.0.1:4200/api"</span>
</code></pre>
<p>Run the <code>prefect_flows.py</code> script:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$uv</span> run src/prefect_flows.py
</code></pre>
<p>Upon successful execution, the Prefect dashboard shows that the workflow is scheduled to run:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*pUJppTJ4MloU2DVr.png" alt="Figure D. The screenshot of the Prefect dashboard" width="1260" height="586" loading="lazy"></p>
<p><strong>Figure D.</strong> A screenshot of the Prefect dashboard</p>
<h2 id="heading-step-5-deploying-the-application">Step 5: Deploying the Application</h2>
<p>The final step is to deploy the entire application as a containerized Lambda by configuring the <code>Dockerfile</code> and the Flask application scripts.</p>
<p>The specific process in this final deployment step depends on the infrastructure.</p>
<p>But the common point is that DVC eliminates the need to store the large Parquet or CSV files directly in the feature store or model store: the files live in the DVC remote under their content hashes, and the application fetches them by path at runtime.</p>
<p>So, first, we’ll simplify the loading logic of the Flask application script by using the <code>dvc.api</code> module:</p>
<p><code>app.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment">### ... the rest components remain the same  ...</span>

<span class="hljs-keyword">import</span> dvc.api

DVC_REMOTE_NAME=&lt;REMOTE NAME IN .dvc/config file&gt;


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">configure_dvc_for_lambda</span>():</span>
    <span class="hljs-comment"># set dvc directories to /tmp</span>
    os.environ.update({
        <span class="hljs-string">'DVC_CACHE_DIR'</span>: <span class="hljs-string">'/tmp/dvc-cache'</span>,
        <span class="hljs-string">'DVC_DATA_DIR'</span>: <span class="hljs-string">'/tmp/dvc-data'</span>,
        <span class="hljs-string">'DVC_CONFIG_DIR'</span>: <span class="hljs-string">'/tmp/dvc-config'</span>,
        <span class="hljs-string">'DVC_GLOBAL_CONFIG_DIR'</span>: <span class="hljs-string">'/tmp/dvc-global-config'</span>,
        <span class="hljs-string">'DVC_SITE_CACHE_DIR'</span>: <span class="hljs-string">'/tmp/dvc-site-cache'</span>
    })
    <span class="hljs-keyword">for</span> dir_path <span class="hljs-keyword">in</span> [<span class="hljs-string">'/tmp/dvc-cache'</span>, <span class="hljs-string">'/tmp/dvc-data'</span>, <span class="hljs-string">'/tmp/dvc-config'</span>]:
        os.makedirs(dir_path, exist_ok=<span class="hljs-literal">True</span>)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_x_test</span>():</span>
    <span class="hljs-keyword">global</span> X_test
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.environ.get(<span class="hljs-string">'PYTEST_RUN'</span>, <span class="hljs-literal">False</span>):
        main_logger.info(<span class="hljs-string">"... loading x_test ..."</span>)

        <span class="hljs-comment"># config dvc directories</span>
        configure_dvc_for_lambda()
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">with</span> dvc.api.open(X_TEST_PATH, remote=DVC_REMOTE_NAME, mode=<span class="hljs-string">'rb'</span>) <span class="hljs-keyword">as</span> fd:
                X_test = pd.read_parquet(fd)
                main_logger.info(<span class="hljs-string">'✅ successfully loaded x_test via dvc api'</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f'❌ general loading error: <span class="hljs-subst">{e}</span>'</span>, exc_info=<span class="hljs-literal">True</span>)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_preprocessor</span>():</span>
    <span class="hljs-keyword">global</span> preprocessor
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.environ.get(<span class="hljs-string">'PYTEST_RUN'</span>, <span class="hljs-literal">False</span>):
        main_logger.info(<span class="hljs-string">"... loading preprocessor ..."</span>)
        configure_dvc_for_lambda()
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">with</span> dvc.api.open(PREPROCESSOR_PATH, remote=DVC_REMOTE_NAME, mode=<span class="hljs-string">'rb'</span>) <span class="hljs-keyword">as</span> fd:
                preprocessor = joblib.load(fd)
                main_logger.info(<span class="hljs-string">'✅ successfully loaded preprocessor via dvc api'</span>)

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f'❌ general loading error: <span class="hljs-subst">{e}</span>'</span>, exc_info=<span class="hljs-literal">True</span>)

<span class="hljs-comment">### ... the rest components remain the same  ...</span>
</code></pre>
<p>Then, update the Dockerfile to enable Docker to correctly reference the DVC components:</p>
<p><code>Dockerfile.lambda.production</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment"># use an official python runtime</span>
FROM public.ecr.aws/<span class="hljs-keyword">lambda</span>/python:<span class="hljs-number">3.12</span>

<span class="hljs-comment"># set environment variables (adding dvc related env variables)</span>
ENV JOBLIB_MULTIPROCESSING=<span class="hljs-number">0</span>
ENV DVC_HOME=<span class="hljs-string">"/tmp/.dvc"</span>
ENV DVC_CACHE_DIR=<span class="hljs-string">"/tmp/.dvc/cache"</span>
ENV DVC_REMOTE_NAME=<span class="hljs-string">"storage"</span>
ENV DVC_GLOBAL_SITE_CACHE_DIR=<span class="hljs-string">"/tmp/dvc_global"</span>

<span class="hljs-comment"># copy requirements file and install dependencies</span>
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir dvc dvc-s3

<span class="hljs-comment"># setup dvc</span>
RUN dvc init --no-scm
RUN dvc config core.no_scm true

<span class="hljs-comment"># copy the code to the lambda task root</span>
COPY . ${LAMBDA_TASK_ROOT}
CMD [ <span class="hljs-string">"app.handler"</span> ]
</code></pre>
<p>Lastly, ensure the large files are excluded from the Docker container image:</p>
<p><code>.dockerignore</code>:</p>
<pre><code class="lang-bash"><span class="hljs-comment">### ... the rest components remain the same  ...</span>

<span class="hljs-comment"># dvc cache contains large files</span>
.dvc/cache
.dvcignore

<span class="hljs-comment"># add all folders that DVC will track</span>
data/
preprocessors/
models/
reports/
metrics/
</code></pre>
<h3 id="heading-test-in-local-2">Test in Local</h3>
<p>Finally, we’ll build and test the Docker image:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> build -t my-app -f Dockerfile.lambda.local .
<span class="hljs-variable">$docker</span> run -p 5002:5002 -e ENV=<span class="hljs-built_in">local</span> my-app app.py
</code></pre>
<p>If the configuration is correct, the waitress server runs the Flask application.</p>
<p>After confirming the changes, push the code to Git:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$git</span> add .
<span class="hljs-variable">$git</span> commit -m<span class="hljs-string">'updated dockerfiles and flask app scripts'</span>
<span class="hljs-variable">$git</span> push
</code></pre>
<p>This <code>push</code> command triggers the CI/CD pipeline via GitHub Actions, which generates a Docker container image and pushes it to AWS ECR.</p>
<p>After the pipeline completes and the image is verified, we can manually run the deployment workflow from GitHub Actions.</p>
<p>And that’s it!</p>
<p>You can learn more here: <a target="_blank" href="https://medium.com/towards-artificial-intelligence/integrating-ci-cd-pipelines-to-machine-learning-applications-f5657c7fa164">Integrating the infrastructure CI/CD pipeline to an ML application</a></p>
<p>All code is available in <a target="_blank" href="https://github.com/krik8235/ml-sales-prediction">my GitHub repository</a>.</p>
<p>The mock app is available <a target="_blank" href="https://kuriko-iwai.vercel.app/online-commerce-intelligence-hub">here</a>.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Building robust ML applications requires comprehensive ML lineage to ensure reliability and traceability.</p>
<p>In this article, you learned how to build an ML lineage by integrating open-source services like DVC and Prefect.</p>
<p>In practice, initial planning matters. Specifically, defining how metrics are tracked and at which stages leads directly to a cleaner, more maintainable code structure that is easier to extend later.</p>
<p>Moving forward, we can consider adding more stages to the lineage and integrating advanced logic for data drift detection or fairness tests.</p>
<p>This will further ensure continued model performance and data integrity in the production environment.</p>
<p><strong>You can check out my</strong> <a target="_blank" href="https://kuriko-iwai.vercel.app/"><strong>Portfolio</strong></a> <strong>/</strong> <a target="_blank" href="https://github.com/krik8235"><strong>Github</strong></a><strong>.</strong></p>
<p><em>All images, unless otherwise noted, are by the author.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Machine Learning System on Serverless Architecture ]]>
                </title>
                <description>
                    <![CDATA[ Let’s say you’ve built a fantastic machine learning model that performs beautifully in notebooks. But a model isn’t truly valuable until it’s in production, serving real users and solving real problems. In this article, you’ll learn how to ship a pro... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-machine-learning-system-on-serverless-architecture/</link>
                <guid isPermaLink="false">68addf802314e8b22eae4655</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ coding ]]>
                    </category>
                
                    <category>
                        <![CDATA[ serverless ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Tue, 26 Aug 2025 16:23:28 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756225357023/04572f1b-b9a7-43e0-aabc-2842faa2703f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Let’s say you’ve built a fantastic machine learning model that performs beautifully in notebooks.</p>
<p>But a model isn’t truly valuable until it’s in production, serving real users and solving real problems.</p>
<p>In this article, you’ll learn how to ship a production-ready ML application built on serverless architecture.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-were-building">What We’re Building</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-ai-pricing-for-retailers">AI Pricing for Retailers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-models">The Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tuning-and-training">Tuning and Training</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-prediction">The Prediction</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-performance-validation">Performance Validation</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-system-architecture">The System Architecture</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-core-aws-resources-in-the-architecture">Core AWS Resources in the Architecture</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-deployment-workflow-in-action">The Deployment Workflow in Action</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-step-1-draft-python-scripts">Step 1: Draft Python Scripts</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-configure-featuremodel-stores-in-s3">Step 2: Configure Feature/Model Stores in S3</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-create-a-flask-application-with-api-endpoints">Step 3: Create a Flask Application with API Endpoints</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-publish-a-docker-image-to-ecr">Step 4: Publish a Docker Image to ECR</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-create-a-lambda-function">Step 5: Create a Lambda Function</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-configure-aws-resources">Step 6: Configure AWS Resources</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-building-a-client-application-optional">Building a Client Application (Optional)</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-the-react-application">The React Application</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-final-results">Final Results</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>This project requires some basic experience with:</p>
<ul>
<li><p><strong>Machine Learning / Deep Learning:</strong> The full lifecycle, including data handling, model training, tuning, and validation.</p>
</li>
<li><p><strong>Coding:</strong> Proficiency in Python, with experience using major ML libraries such as PyTorch and Scikit-Learn.</p>
</li>
<li><p><strong>Full-stack deployment:</strong> Experience deploying applications using RESTful APIs.</p>
</li>
</ul>
<h2 id="heading-what-were-building">What We’re Building</h2>
<h3 id="heading-ai-pricing-for-retailers">AI Pricing for Retailers</h3>
<p>This project aims to help a mid-sized retailer compete with large players like Amazon.</p>
<p>Smaller companies often can’t afford significant price discounts, so they can face challenges finding optimal price points as they expand their product lines.</p>
<p>Our goal is to leverage AI models to recommend the best price for a selected product to maximize sales for the retailer, and display it on a client-side user interface (UI):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755873936847/ecf696ef-e161-4453-a6ad-e97d92ac1677.png" alt="What the UI will look like" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You can explore the UI from <a target="_blank" href="https://kuriko-iwai.vercel.app/online-commerce-intelligence-hub">here</a>.</p>
<h3 id="heading-the-models">The Models</h3>
<p>I’ll train and tune multiple models so that when the primary model fails, a backup model gets loaded to serve predictions.</p>
<ul>
<li><p><strong>Primary Model</strong>: Multi-layered feedforward network (on the <strong>PyTorch</strong> library)</p>
</li>
<li><p><strong>Backup Models (Backups)</strong>: LightGBM, SVR, and Elastic Net (on the <strong>Scikit-Learn</strong> library)</p>
</li>
</ul>
<p>The backup models are prioritized based on learning capabilities.</p>
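<p>The fallback idea can be sketched as a simple priority chain. This is a minimal illustration, not the project’s actual loading code: the <code>load_first_available</code> helper and the three loader functions are hypothetical, and the backup order simply follows the list above.</p>
<pre><code class="lang-python">def load_first_available(loaders):
    """Try each (name, loader) pair in priority order and return the first that works."""
    errors = {}
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # any failure moves on to the next candidate
            errors[name] = exc
    raise RuntimeError(f"no model could be loaded: {errors}")


# hypothetical loaders standing in for the real deserialization code
def load_dfn():
    raise FileNotFoundError("dfn checkpoint missing")  # simulate a primary failure


def load_gbm():
    return "gbm-model"


def load_svr():
    return "svr-model"


name, model = load_first_available(
    [("dfn", load_dfn), ("gbm", load_gbm), ("svr", load_svr)]
)
print(name)  # gbm
</code></pre>
<p>Keeping the candidates in an ordered list makes the priority explicit and easy to change.</p>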
<h3 id="heading-tuning-and-training">Tuning and Training</h3>
<p>The primary model was trained on a dataset of around 500,000 samples (<a target="_blank" href="https://archive.ics.uci.edu/dataset/352/online+retail">source</a>) and tuned using <code>Optuna</code>'s Bayesian optimization, with grid search available for further refinement.</p>
<p>The backups are also trained on the same samples and tuned using the <code>Scikit-Optimize</code> framework.</p>
<h3 id="heading-the-prediction">The Prediction</h3>
<p>All models serve predictions on <strong>logged quantity values.</strong></p>
<p>Logarithmic transformations of the quantity data make the distribution denser, which helps models learn patterns more effectively. This is because logarithms reduce the impact of extreme values, or outliers, and can help normalize skewed data.</p>
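<p>As a quick illustration of this effect, the sketch below applies a log transform to a handful of made-up quantities. It assumes <code>log1p</code> (that is, log(1 + x)) so that zero quantities stay defined; the article does not state which variant the project uses, and the numbers are hypothetical.</p>
<pre><code class="lang-python">import math

# hypothetical order quantities with one extreme outlier
quantities = [1, 2, 3, 5, 8, 12, 4800]

# log1p(x) = log(1 + x) remains defined when a quantity is zero
logged = [math.log1p(q) for q in quantities]

# expm1 inverts log1p, mapping predictions back to the original scale
restored = [math.expm1(v) for v in logged]

print(max(quantities) / min(quantities))  # spread on the original scale
print(max(logged) / min(logged))          # much smaller spread after the transform
</code></pre>
<p>The inverse step matters at inference time: a model trained on logged values returns logged predictions, which need to be converted back before they are shown to users.</p>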
<h3 id="heading-performance-validation">Performance Validation</h3>
<p>We’ll evaluate model performance using different metrics for the transformed and original data, with a lower value always indicating better performance.</p>
<ul>
<li><p><strong>Logged values</strong>: Mean Squared Error (MSE)</p>
</li>
<li><p><strong>Actual values</strong>: Root Mean Squared Log Error (RMSLE) and Mean Absolute Error (MAE)</p>
</li>
</ul>
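<p>For reference, the three metrics can be written out in a few lines of plain Python. This is a sketch for intuition only; the project itself may compute them with library functions, and the sample values below are made up.</p>
<pre><code class="lang-python">import math


def mse(y_true, y_pred):
    """Mean squared error, applied here to log-transformed values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error on the original scale."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmsle(y_true, y_pred):
    """Root mean squared log error on the original scale (non-negative inputs)."""
    logged_true = [math.log1p(t) for t in y_true]
    logged_pred = [math.log1p(p) for p in y_pred]
    return math.sqrt(mse(logged_true, logged_pred))


# hypothetical quantities, for illustration only
actual = [3.0, 10.0, 250.0]
predicted = [4.0, 8.0, 200.0]

print(mae(actual, predicted))
print(rmsle(actual, predicted))
</code></pre>
<p>RMSLE penalizes relative errors, so being off by 50 units on a 250-unit order weighs less than it does under MAE.</p>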
<h2 id="heading-the-system-architecture">The System Architecture</h2>
<p>We’re going to build a complete ecosystem around an <strong>AWS Lambda function</strong> to create a scalable ML system:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:4680/0*ulcNtwJeU5EOfhTg.png" alt="Fig. The system architecture (Created by Kuriko IWAI)" width="600" height="400" loading="lazy"></p>
<p>Fig. The system architecture (Created by <a target="_blank" href="https://kuriko-iwai.vercel.app/">Kuriko IWAI</a>)</p>
<p><strong>AWS Lambda</strong> is a <strong>serverless compute service</strong> that lets you run an application without managing servers. Once you upload the code, AWS takes responsibility for managing the underlying infrastructure.</p>
<p>In a serverless environment, the code is deployed as <strong>a stateless function</strong> that runs only when it’s triggered by an event such as an HTTP request or a scheduled task.</p>
<p>This event-driven nature makes serverless deployments efficient in resource allocation because:</p>
<ul>
<li><p><strong>There’s no server management</strong>: The cloud provider takes care of operational tasks.</p>
</li>
<li><p><strong>You have automatic scaling</strong>: Serverless applications automatically scale up or down based on demand.</p>
</li>
<li><p><strong>You have pay-per-use billing</strong>: You’re charged only for the compute resources the application actually consumes.</p>
</li>
</ul>
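<p>To make the stateless, event-driven model concrete, here is a minimal sketch of a Python Lambda handler behind an API Gateway proxy integration. It is not the handler used in this project, which wraps a Flask application. The <code>stockcode</code> field is a hypothetical request parameter.</p>
<pre><code class="lang-python">import json


def handler(event, context):
    """Minimal stateless handler: everything it needs arrives in `event`."""
    # with a proxy integration, the request body arrives as a JSON string
    payload = json.loads(event.get("body") or "{}")
    # hypothetical field name; a real app defines its own request schema
    stockcode = payload.get("stockcode", "unknown")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"stockcode": stockcode}),
    }


# local invocation with a fake event; no AWS resources are involved
response = handler({"body": json.dumps({"stockcode": "85123A"})}, None)
print(response["statusCode"], response["body"])
</code></pre>
<p>Nothing is kept between invocations, which is why models and features are loaded from external storage rather than from the function’s own state.</p>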
<p>Note that other cloud ecosystems like Google Cloud Platform (GCP) and Microsoft Azure offer comprehensive alternatives to AWS. Which one you choose depends on your budget, project type, and familiarity with each ecosystem.</p>
<h3 id="heading-core-aws-resources-in-the-architecture">Core AWS Resources in the Architecture</h3>
<p>The system architecture focuses on the following points:</p>
<ul>
<li><p>The application is fully containerized on Docker for universal accessibility.</p>
</li>
<li><p>The container image is stored in AWS Elastic Container Registry (ECR).</p>
</li>
<li><p>The API Gateway’s REST API endpoints trigger an event to invoke the Lambda function.</p>
</li>
<li><p>The Lambda function loads the container image from ECR and performs inference.</p>
</li>
<li><p>Trained models, processors, and input features are stored in AWS S3 buckets.</p>
</li>
<li><p>A Redis client serves cached analytical data and past predictions stored in the ElastiCache.</p>
</li>
</ul>
<p>And to build the system, we’ll use the following AWS resources:</p>
<ul>
<li><p><strong>Lambda</strong>: Runs the function that performs inference.</p>
</li>
<li><p><strong>API Gateway:</strong> Routes API calls to the Lambda function.</p>
</li>
<li><p><strong>S3 Storage</strong>: Serves as the feature store and model store.</p>
</li>
<li><p><strong>ElastiCache:</strong> Stores cached predictions and analytical data.</p>
</li>
<li><p><strong>ECR</strong>: Stores Docker container images to allow Lambda to pull the image.</p>
</li>
</ul>
<p>Each resource requires configuration. I’ll explore those details in the next section.</p>
<h2 id="heading-the-deployment-workflow-in-action"><strong>The Deployment Workflow in Action</strong></h2>
<p>The deployment workflow involves the following steps:</p>
<ol>
<li><p>Draft data preparation, model training, and serialization scripts</p>
</li>
<li><p>Configure designated feature store and model store in S3</p>
</li>
<li><p>Create a Flask application with API endpoints</p>
</li>
<li><p>Publish a Docker image to ECR</p>
</li>
<li><p>Create a Lambda function</p>
</li>
<li><p>Configure related AWS resources</p>
</li>
</ol>
<p>We’ll now walk through each of these steps to help you fully understand the process.</p>
<p>For your reference, here is the repository structure:</p>
<pre><code class="lang-markdown">.
.venv/                  [.gitignore]    # stores uv venv
│
└── data/               [.gitignore]
│     └──raw/                           # stores raw data
│     └──preprocessed/                  # stores processed data after imputation and engineering
│
└── models/             [.gitignore]    # stores serialized model after training and tuning
│     └──dfn/                           # deep feedforward network
│     └──gbm/                           # light gbm
│     └──en/                            # elastic net
│     └──production/                    # models to be stored in S3 for production use
|
└── notebooks/                          # stores experimentation notebooks
│
└── src/                                # core functions
│     └──_utils/                        # utility functions
│     └──data_handling/                 # functions to engineer features
│     └──model/                         # functions to train, tune, validate models
│     │     └── sklearn_model
│     │     └── torch_model
│     │     └── ...
│     └──main.py                        # main script to run the inference locally
│
└──app.py                               # Flask application (API endpoints)
└──pyproject.toml                       # project configuration
└──.env                [.gitignore]     # environment variables
└──uv.lock                              # dependency locking
└──Dockerfile                           # for Docker container image
└──.dockerignore
└──requirements.txt
└──.python-version                      # python version locking (3.12)
</code></pre>
<h3 id="heading-step-1-draft-python-scripts">Step 1: Draft Python Scripts</h3>
<p>The first step is to draft Python scripts for data preparation, model training and tuning.</p>
<p>We’ll run these scripts in a <strong>batch process</strong> because these are resource-intensive and stateful tasks that aren’t suitable for serverless functions optimized for short-lived, stateless, and event-driven tasks.</p>
<p>Serverless functions can also experience <a target="_blank" href="https://www.freecodecamp.org/news/cold-start-problem-in-recommender-systems/"><strong>cold starts</strong></a>. If heavy tasks ran inside the function, API Gateway could time out before the function served predictions.</p>
<p><code>src/main.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> warnings
<span class="hljs-keyword">import</span> pickle
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> lightgbm <span class="hljs-keyword">as</span> lgb
<span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> ElasticNet
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVR
<span class="hljs-keyword">from</span> skopt.space <span class="hljs-keyword">import</span> Real, Integer, Categorical
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">import</span> src.data_handling <span class="hljs-keyword">as</span> data_handling
<span class="hljs-keyword">import</span> src.model.torch_model <span class="hljs-keyword">as</span> t
<span class="hljs-keyword">import</span> src.model.sklearn_model <span class="hljs-keyword">as</span> sk


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>: 
    load_dotenv(override=<span class="hljs-literal">True</span>)
    os.makedirs(PRODUCTION_MODEL_FOLDER_PATH, exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># create train, validation, test datasets</span>
    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = data_handling.main_script()

    <span class="hljs-comment"># store the trained preprocessor in local storage</span>
    joblib.dump(preprocessor, PREPROCESSOR_PATH)

    <span class="hljs-comment"># model tuning and training</span>
    best_dfn_full_trained, checkpoint = t.main_script(X_train, X_val, y_train, y_val)

    <span class="hljs-comment"># serialize the trained model</span>
    torch.save(checkpoint, DFN_FILE_PATH)

    <span class="hljs-comment"># svr</span>
    best_svr_trained, best_hparams_svr = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[<span class="hljs-number">1</span>]
    )
    <span class="hljs-keyword">if</span> best_svr_trained <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">with</span> open(SVR_FILE_PATH, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
            pickle.dump({ <span class="hljs-string">'best_model'</span>: best_svr_trained, <span class="hljs-string">'best_hparams'</span>: best_hparams_svr }, f)

    <span class="hljs-comment"># elastic net</span>
    best_en_trained, best_hparams_en = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[<span class="hljs-number">0</span>]
    )
    <span class="hljs-keyword">if</span> best_en_trained <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">with</span> open(EN_FILE_PATH, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
            pickle.dump({ <span class="hljs-string">'best_model'</span>: best_en_trained, <span class="hljs-string">'best_hparams'</span>: best_hparams_en }, f)

    <span class="hljs-comment"># light gbm</span>
    best_gbm_trained, best_hparams_gbm = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[<span class="hljs-number">2</span>]
    )

    <span class="hljs-keyword">if</span> best_gbm_trained <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">with</span> open(GBM_FILE_PATH, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
            pickle.dump({<span class="hljs-string">'best_model'</span>: best_gbm_trained, <span class="hljs-string">'best_hparams'</span>: best_hparams_gbm }, f)
</code></pre>
<p>Run the script to train and serialize the models using the <code>uv</code> package manager:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$uv</span> venv
<span class="hljs-variable">$source</span> .venv/bin/activate
<span class="hljs-variable">$uv</span> run src/main.py
</code></pre>
<p>The <code>main.py</code> script includes several key components.</p>
<h4 id="heading-scripts-for-data-handling">Scripts for Data Handling</h4>
<p>These scripts load the original data, handle missing values, and engineer the features needed for prediction.</p>
<p><code>src/data_handling/main.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-comment"># load and save the original data frame in parquet</span>
df = scripts.load_original_dataframe()
df.to_parquet(ORIGINAL_DF_PATH, index=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># imputation</span>
df = scripts.structure_missing_values(df=df)

<span class="hljs-comment"># feature engineering</span>
df = scripts.handle_feature_engineering(df=df)

<span class="hljs-comment"># save processed df in csv and parquet</span>
scripts.save_df_to_csv(df=df)
df.to_parquet(PROCESSED_DF_PATH, index=<span class="hljs-literal">False</span>)


<span class="hljs-comment"># for preprocessing, classify numerical and categorical columns</span>
num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
<span class="hljs-keyword">if</span> cat_cols:
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cat_cols: df[col] = df[col].astype(<span class="hljs-string">'string'</span>)

<span class="hljs-comment"># creates training, validation, and test datasets (test dataset is for inference only)</span>
y = df[target_col]
X = df.copy().drop(target_col, axis=<span class="hljs-string">'columns'</span>)
test_size, random_state = <span class="hljs-number">50000</span>, <span class="hljs-number">42</span>
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=test_size, random_state=random_state
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=test_size, random_state=random_state
)

<span class="hljs-comment"># transform the input datasets</span>
X_train, X_val, X_test, preprocessor = scripts.transform_input(
    X_train, X_val, X_test, num_cols=num_cols, cat_cols=cat_cols
)

<span class="hljs-comment"># retrain and serialize the preprocessor</span>
<span class="hljs-keyword">if</span> preprocessor <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>: preprocessor.fit(X)
joblib.dump(preprocessor, PREPROCESSOR_PATH)
</code></pre>
<h4 id="heading-scripts-for-model-training-and-tuning-pytorch-model">Scripts for Model Training and Tuning (PyTorch Model)</h4>
<p>These scripts initialize the model, search for the optimal neural architecture and hyperparameters, and serialize the fully trained model so that the system can load it when performing inference.</p>
<p>Because the primary model is built on PyTorch and the backups use Scikit-Learn, we’re drafting the scripts separately.</p>
<h4 id="heading-1-pytorch-models">1. PyTorch Models</h4>
<p><strong>The training script</strong> trains the model and validates it on a subset of the training data.</p>
<p>It also includes early stopping logic, which ends training when the validation loss has not improved for a given number of consecutive epochs (10 epochs in this project).</p>
<p><code>src/model/torch_model/scripts/training.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> optuna <span class="hljs-comment"># type: ignore</span>
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># device</span>
device_type = device_type <span class="hljs-keyword">if</span> device_type <span class="hljs-keyword">else</span> <span class="hljs-string">'cuda'</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'mps'</span> <span class="hljs-keyword">if</span> torch.backends.mps.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'cpu'</span>
device = torch.device(device_type)

<span class="hljs-comment"># gradient scaler for stability (only applicable for cuda)</span>
scaler = torch.GradScaler(device=device_type) <span class="hljs-keyword">if</span> device_type == <span class="hljs-string">'cuda'</span> <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># start training</span>
best_val_loss = float(<span class="hljs-string">'inf'</span>)
epochs_no_improve = <span class="hljs-number">0</span>
<span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(num_epochs):
    model.train()
    <span class="hljs-keyword">for</span> batch_X, batch_y <span class="hljs-keyword">in</span> train_data_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()

        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># pytorch's AMP system automatically handles the casting of tensors to Float16 or Float32</span>
            <span class="hljs-keyword">with</span> torch.autocast(device_type=device_type):
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)

                <span class="hljs-comment"># break the training loop when models return nan or inf</span>
                <span class="hljs-keyword">if</span> torch.any(torch.isnan(outputs)) <span class="hljs-keyword">or</span> torch.any(torch.isinf(outputs)):
                    main_logger.error(
                        <span class="hljs-string">'pytorch model returns nan or inf. break the training loop.'</span>
                    )
                    <span class="hljs-keyword">break</span>

            <span class="hljs-comment"># create scaled gradients of losses</span>
            <span class="hljs-keyword">if</span> scaler <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
                scaler.scale(loss).backward()
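                scaler.unscale_(optimizer)  <span class="hljs-comment"># unscales the gradients so clipping uses true magnitudes</span>
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=<span class="hljs-number">1.0</span>)  <span class="hljs-comment"># clipping grad</span>
                scaler.step(optimizer)  <span class="hljs-comment"># steps the optimizer (skipped if gradients contain inf/nan)</span>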
                scaler.update()  <span class="hljs-comment"># updates the scale</span>

            <span class="hljs-keyword">else</span>:
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=<span class="hljs-number">1.0</span>) <span class="hljs-comment"># clipping grad</span>
                optimizer.step()

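        <span class="hljs-keyword">except</span> Exception:
            <span class="hljs-comment"># fall back to a plain full-precision step if the mixed-precision step fails</span>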
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()


    <span class="hljs-comment"># run validation on a subset of the training dataset</span>
    model.eval()
    val_loss = <span class="hljs-number">0.0</span>

    <span class="hljs-comment"># switch the torch mode</span>
    <span class="hljs-keyword">with</span> torch.inference_mode():
        <span class="hljs-keyword">for</span> batch_X_val, batch_y_val <span class="hljs-keyword">in</span> val_data_loader:
            batch_X_val, batch_y_val = batch_X_val.to(device), batch_y_val.to(device)
            outputs_val = model(batch_X_val)
            val_loss += criterion(outputs_val, batch_y_val).item()

    val_loss /= len(val_data_loader)

    <span class="hljs-comment"># check if early stop</span>
    <span class="hljs-keyword">if</span> val_loss &lt; best_val_loss - min_delta:
        best_val_loss = val_loss
        epochs_no_improve = <span class="hljs-number">0</span>
    <span class="hljs-keyword">else</span>:
        epochs_no_improve += <span class="hljs-number">1</span>
        <span class="hljs-keyword">if</span> epochs_no_improve &gt;= patience:
            main_logger.info(<span class="hljs-string">f'early stopping at epoch <span class="hljs-subst">{epoch + <span class="hljs-number">1</span>}</span>'</span>)
            <span class="hljs-keyword">break</span>
</code></pre>
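<p>The early stopping rule in the loop above can be isolated into a small helper, which makes it easy to check without a model. The following is a minimal sketch and not part of the project's code: the <code>EarlyStopper</code> class and the <code>epochs_run</code> function are hypothetical names, while <code>patience</code> and <code>min_delta</code> mirror the variables used in the training loop.</p>
<pre><code class="lang-python">class EarlyStopper:
    """Hypothetical helper that mirrors the early stopping check in the training loop."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as an improvement
        self.best_val_loss = float('inf')
        self.epochs_no_improve = 0

    def should_stop(self, val_loss):
        """Record one epoch's validation loss and return True when training should stop."""
        if val_loss &lt; self.best_val_loss - self.min_delta:
            self.best_val_loss = val_loss
            self.epochs_no_improve = 0
            return False
        self.epochs_no_improve += 1
        return self.epochs_no_improve &gt;= self.patience


def epochs_run(val_losses, patience=10, min_delta=0.0):
    """Return how many epochs run before early stopping triggers (or all of them)."""
    stopper = EarlyStopper(patience=patience, min_delta=min_delta)
    for epoch, val_loss in enumerate(val_losses, start=1):
        if stopper.should_stop(val_loss):
            return epoch
    return len(val_losses)
</code></pre>
<p>With a larger <code>min_delta</code>, small fluctuations in the validation loss no longer reset the counter, so training stops sooner on a plateau.</p>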
<p><strong>The tuning script</strong> uses the <code>study</code> component from the <code>Optuna</code> library to run Bayesian optimization.</p>
<p>For each trial, the <code>study</code> component chooses a neural architecture and a hyperparameter set to test from the global search space.</p>
<p>Then, it builds, trains, and validates the model to find the neural architecture that minimizes the loss (MSE in this case).</p>
<p><code>src/model/torch_model/scripts/tuning.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> itertools
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> optuna
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader, TensorDataset
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">from</span> src.model.torch_model.scripts.pretrained_base <span class="hljs-keyword">import</span> DFN
<span class="hljs-keyword">from</span> src.model.torch_model.scripts.training <span class="hljs-keyword">import</span> train_model
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># device</span>
device_type = <span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"mps"</span> <span class="hljs-keyword">if</span> torch.backends.mps.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>
device = torch.device(device_type)

<span class="hljs-comment"># loss function</span>
criterion = nn.MSELoss()

<span class="hljs-comment"># define objective function for optuna</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">objective</span>(<span class="hljs-params">trial</span>):</span>
    <span class="hljs-comment"># model</span>
    num_layers = trial.suggest_int(<span class="hljs-string">'num_layers'</span>, <span class="hljs-number">1</span>, <span class="hljs-number">20</span>)
    batch_norm = trial.suggest_categorical(<span class="hljs-string">'batch_norm'</span>, [<span class="hljs-literal">True</span>, <span class="hljs-literal">False</span>])
    dropout_rates = []
    hidden_units_per_layer = []
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(num_layers):
        dropout_rates.append(trial.suggest_float(<span class="hljs-string">f'dropout_rate_layer_<span class="hljs-subst">{i}</span>'</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.6</span>))
        hidden_units_per_layer.append(trial.suggest_int(<span class="hljs-string">f'n_units_layer_<span class="hljs-subst">{i}</span>'</span>, <span class="hljs-number">8</span>, <span class="hljs-number">256</span>)) <span class="hljs-comment"># hidden units per layer</span>

    model = DFN(
        input_dim=X_train.shape[<span class="hljs-number">1</span>],
        num_layers=num_layers,
        dropout_rates=dropout_rates,
        batch_norm=batch_norm,
        hidden_units_per_layer=hidden_units_per_layer
    ).to(device)

    <span class="hljs-comment"># optimizer</span>
    learning_rate = trial.suggest_float(<span class="hljs-string">'learning_rate'</span>, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1e-1</span>, log=<span class="hljs-literal">True</span>)
    optimizer_name = trial.suggest_categorical(<span class="hljs-string">'optimizer'</span>, [<span class="hljs-string">'adam'</span>, <span class="hljs-string">'rmsprop'</span>, <span class="hljs-string">'sgd'</span>, <span class="hljs-string">'adamw'</span>, <span class="hljs-string">'adamax'</span>, <span class="hljs-string">'adadelta'</span>, <span class="hljs-string">'radam'</span>])
    optimizer = _handle_optimizer(optimizer_name=optimizer_name, model=model, lr=learning_rate)

    <span class="hljs-comment"># data loaders</span>
    batch_size = trial.suggest_categorical(<span class="hljs-string">'batch_size'</span>, [<span class="hljs-number">32</span>, <span class="hljs-number">64</span>, <span class="hljs-number">128</span>, <span class="hljs-number">256</span>])
    test_size = <span class="hljs-number">10000</span> <span class="hljs-keyword">if</span> len(X_train) &gt; <span class="hljs-number">15000</span> <span class="hljs-keyword">else</span> int(len(X_train) * <span class="hljs-number">0.2</span>)
    X_train_search, X_val_search, y_train_search, y_val_search = train_test_split(X_train, y_train, test_size=test_size, random_state=<span class="hljs-number">42</span>)
    train_data_loader = create_torch_data_loader(X=X_train_search, y=y_train_search, batch_size=batch_size)
    val_data_loader = create_torch_data_loader(X=X_val_search, y=y_val_search, batch_size=batch_size)

    <span class="hljs-comment"># training</span>
    num_epochs = <span class="hljs-number">3000</span> <span class="hljs-comment"># ensure enough epochs (early stopping would stop the loop when overfitting)</span>
    _, best_val_loss = train_model(
        train_data_loader=train_data_loader,
        val_data_loader=val_data_loader,
        model=model,
        optimizer=optimizer,
        criterion = criterion,
        num_epochs=num_epochs,
        trial=trial,
    )
    <span class="hljs-keyword">return</span> best_val_loss


<span class="hljs-comment"># start to optimize hyperparameters and architecture</span>
study = optuna.create_study(direction=<span class="hljs-string">'minimize'</span>, sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=<span class="hljs-number">50</span>, timeout=<span class="hljs-number">600</span>)

<span class="hljs-comment"># best </span>
best_trial = study.best_trial
best_hparams = best_trial.params

<span class="hljs-comment"># construct the model based on the tuning results</span>
best_lr = best_hparams[<span class="hljs-string">'learning_rate'</span>]
best_batch_size = best_hparams[<span class="hljs-string">'batch_size'</span>]
input_dim = X_train.shape[<span class="hljs-number">1</span>]
best_model = DFN(
    input_dim=input_dim,
    num_layers=best_hparams[<span class="hljs-string">'num_layers'</span>],
    hidden_units_per_layer=[v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> best_hparams.items() <span class="hljs-keyword">if</span> <span class="hljs-string">'n_units_layer_'</span> <span class="hljs-keyword">in</span> k],
    batch_norm=best_hparams[<span class="hljs-string">'batch_norm'</span>],
    dropout_rates=[v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> best_hparams.items() <span class="hljs-keyword">if</span> <span class="hljs-string">'dropout_rate_layer_'</span> <span class="hljs-keyword">in</span> k],
).to(device)

<span class="hljs-comment"># construct an optimizer based on the tuning results</span>
best_optimizer_name = best_hparams[<span class="hljs-string">'optimizer'</span>]
best_optimizer = _handle_optimizer(
    optimizer_name=best_optimizer_name, model=best_model, lr=best_lr
)

<span class="hljs-comment"># create torch data loaders</span>
train_data_loader = create_torch_data_loader(
    X=X_train, y=y_train, batch_size=best_batch_size
)
val_data_loader = create_torch_data_loader(
    X=X_val, y=y_val, batch_size=best_batch_size
)

<span class="hljs-comment"># retrain the best model with full training dataset applying the optimal batch size and optimizer</span>
best_model, _ = train_model(
    train_data_loader=train_data_loader,
    val_data_loader=val_data_loader,
    model=best_model,
    optimizer=best_optimizer,
    criterion = criterion,
    num_epochs=<span class="hljs-number">1000</span>
)

<span class="hljs-comment"># create a checkpoint for serialization (reconstruct the model using the checkpoint)</span>
checkpoint = {
    <span class="hljs-string">'state_dict'</span>: best_model.state_dict(),
    <span class="hljs-string">'hparams'</span>: best_hparams,
    <span class="hljs-string">'input_dim'</span>: X_train.shape[<span class="hljs-number">1</span>],
    <span class="hljs-string">'optimizer'</span>: best_optimizer,
    <span class="hljs-string">'batch_size'</span>: best_batch_size
}

<span class="hljs-comment"># serialize the model w/ checkpoint</span>
torch.save(checkpoint, FILE_PATH)
</code></pre>
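<p>The list comprehensions that rebuild <code>hidden_units_per_layer</code> and <code>dropout_rates</code> depend on the order of the keys in <code>best_hparams</code>. As a sketch of a more defensive variant, the hypothetical helper below (not part of the project's code) sorts the per-layer entries by the numeric suffix of each key, assuming the key naming used in <code>objective</code>. The example values are made up.</p>
<pre><code class="lang-python">def collect_layer_values(hparams, prefix):
    """Return per-layer values ordered by the layer index at the end of each key."""
    indexed = []
    for key, value in hparams.items():
        if key.startswith(prefix):
            layer_index = int(key[len(prefix):])
            indexed.append((layer_index, value))
    return [value for _, value in sorted(indexed)]


# example dictionary shaped like `study.best_trial.params` (values are made up)
example_hparams = {
    'num_layers': 3,
    'n_units_layer_2': 32,
    'dropout_rate_layer_1': 0.2,
    'n_units_layer_0': 128,
    'dropout_rate_layer_0': 0.1,
    'n_units_layer_1': 64,
    'dropout_rate_layer_2': 0.3,
    'learning_rate': 0.001,
}

hidden_units = collect_layer_values(example_hparams, 'n_units_layer_')
dropout_rates = collect_layer_values(example_hparams, 'dropout_rate_layer_')
</code></pre>
<p>Both lists come back in layer order regardless of how the dictionary was populated, so they can be passed to the model constructor directly.</p>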
<h4 id="heading-2-scikit-learn-models-backups">2. Scikit-Learn Models (Backups)</h4>
<p>For Scikit-Learn models, we’ll run <strong>k-fold cross-validation</strong> during training to reduce the risk of overfitting.</p>
<p>K-fold cross-validation is a technique for evaluating a machine learning model's performance by training and testing it on different subsets of training data.</p>
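<p>To make the idea concrete, here is a minimal plain-Python sketch of how k-fold splitting partitions row indices. It is an illustration only, without shuffling, and it is not how Scikit-Learn's <code>KFold</code> is implemented. The <code>kfold_indices</code> name is hypothetical.</p>
<pre><code class="lang-python">def kfold_indices(n_samples, n_splits=5):
    """Yield (train_indices, val_indices) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=5))
for fold, (train_idx, val_idx) in enumerate(folds):
    print(f'fold {fold}: train={train_idx} val={val_idx}')
</code></pre>
<p>Each fold holds out a different slice for validation and trains on the rest, so every row contributes to the validation score once.</p>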
<p>We define the <code>run_kfold_validation</code> function where the model is trained and validated using <strong>5-fold cross-validation</strong>.</p>
<p><code>src/model/sklearn_model/scripts/tuning.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> KFold
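<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_squared_error

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger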

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_kfold_validation</span>(<span class="hljs-params">
        X_train,
        y_train,
        base_model,
        hparams: dict,
        n_splits: int = <span class="hljs-number">5</span>, <span class="hljs-comment"># the number of folds </span>
        early_stopping_rounds: int = <span class="hljs-number">10</span>,
        max_iters: int = <span class="hljs-number">200</span>
    </span>) -&gt; float:</span>

    mses = <span class="hljs-number">0.0</span>

    <span class="hljs-comment"># create k-fold component</span>
    kf = KFold(n_splits=n_splits, shuffle=<span class="hljs-literal">True</span>, random_state=<span class="hljs-number">42</span>)

    <span class="hljs-keyword">for</span> fold, (train_index, val_index) <span class="hljs-keyword">in</span> enumerate(kf.split(X_train)):
        <span class="hljs-comment"># create a subset of training and validation datasets from the entire training data</span>
        X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
        y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

        <span class="hljs-comment"># reconstruct a model</span>
        model = base_model(**hparams)

        <span class="hljs-comment"># start the cross validation</span>
        best_val_mse = float(<span class="hljs-string">'inf'</span>)
        patience_counter = <span class="hljs-number">0</span>
        best_model_state = <span class="hljs-literal">None</span>
        best_iteration = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> iteration <span class="hljs-keyword">in</span> range(max_iters):
            <span class="hljs-comment"># train on a subset of the training data</span>
            <span class="hljs-keyword">try</span>:
                model.train_one_step(X_train_fold, y_train_fold, iteration)
            <span class="hljs-keyword">except</span>:
                model.fit(X_train_fold, y_train_fold)

            <span class="hljs-comment"># make a prediction on validation data </span>
            y_pred_val_kf = model.predict(X_val_fold)

            <span class="hljs-comment"># compute validation loss (MSE)</span>
            current_val_mse = mean_squared_error(y_val_fold, y_pred_val_kf)

            <span class="hljs-comment"># check if epochs should be stopped (early stopping)</span>
            <span class="hljs-keyword">if</span> current_val_mse &lt; best_val_mse:
                best_val_mse = current_val_mse
                patience_counter = <span class="hljs-number">0</span>
                best_model_state = model.get_params()
                best_iteration = iteration
            <span class="hljs-keyword">else</span>:
                patience_counter += <span class="hljs-number">1</span>

            <span class="hljs-comment"># execute early stopping when patience_counter reaches early_stopping_rounds</span>
            <span class="hljs-keyword">if</span> patience_counter &gt;= early_stopping_rounds:
                main_logger.info(<span class="hljs-string">f"Fold <span class="hljs-subst">{fold}</span>: Early stopping triggered at iteration <span class="hljs-subst">{iteration}</span> (best at <span class="hljs-subst">{best_iteration}</span>). Best MSE: <span class="hljs-subst">{best_val_mse:<span class="hljs-number">.4</span>f}</span>"</span>)
                <span class="hljs-keyword">break</span>


        <span class="hljs-comment"># after training epochs, reconstruct the best performing model </span>
        <span class="hljs-keyword">if</span> best_model_state: model.set_params(**best_model_state)

        <span class="hljs-comment"># make prediction</span>
        y_pred_val_kf = model.predict(X_val_fold)

        <span class="hljs-comment"># add MSEs</span>
        mses += mean_squared_error(y_pred_val_kf, y_val_fold)

    <span class="hljs-comment"># compute the final loss (average of MSEs across folds)</span>
    ave_mse = mses / n_splits
    <span class="hljs-keyword">return</span> ave_mse
</code></pre>
<p>Then, for the <strong>tuning script</strong>, we use the <code>gp_minimize</code> function from the <code>Scikit-Optimize</code> library.</p>
<p>The <code>gp_minimize</code> function is used to tune hyperparameters with Bayesian optimization.</p>
<p>It searches for the hyperparameter set that minimizes the model's error, which is calculated using the <code>run_kfold_validation</code> function defined earlier.</p>
<p>The best-performing hyperparameters are then used to reconstruct and train the final model.</p>
<p><code>src/model/sklearn_model/scripts/tuning.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> functools <span class="hljs-keyword">import</span> partial
<span class="hljs-keyword">from</span> skopt <span class="hljs-keyword">import</span> gp_minimize


<span class="hljs-comment"># define the objective function for Bayesian Optimization using Scikit-Optimize</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">objective</span>(<span class="hljs-params">params, X_train, y_train, base_model, hparam_names</span>):</span>
    hparams = {item: params[i] <span class="hljs-keyword">for</span> i, item <span class="hljs-keyword">in</span> enumerate(hparam_names)}
    ave_mse = run_kfold_validation(X_train=X_train, y_train=y_train, base_model=base_model, hparams=hparams)
    <span class="hljs-keyword">return</span> ave_mse

<span class="hljs-comment"># extract hyperparameter names from the search space and bind the fixed arguments</span>
hparam_names = [s.name <span class="hljs-keyword">for</span> s <span class="hljs-keyword">in</span> space]
objective_partial = partial(objective, X_train=X_train, y_train=y_train, base_model=base_model, hparam_names=hparam_names)

<span class="hljs-comment"># search the optimal hyperparameters</span>
results = gp_minimize(
    func=objective_partial,
    dimensions=space,
    n_calls=n_calls,
    random_state=<span class="hljs-number">42</span>,
    verbose=<span class="hljs-literal">False</span>,
    n_initial_points=<span class="hljs-number">10</span>,
)
<span class="hljs-comment"># results</span>
best_hparams = dict(zip(hparam_names, results.x))
best_mse = results.fun

<span class="hljs-comment"># reconstruct the model with the best hyperparameters</span>
best_model = base_model(**best_hparams)

<span class="hljs-comment"># retrain the model with full training dataset</span>
best_model.fit(X_train, y_train)
</code></pre>
<h3 id="heading-step-2-configure-featuremodel-stores-in-s3">Step 2: Configure Feature/Model Stores in S3</h3>
<p>The trained models and processed data are stored in the S3 bucket: the models as serialized <code>.pth</code> files and the processed data as <strong>Parquet files</strong>.</p>
<p>We’ll draft the <code>s3_upload</code> function where the <strong>Boto3 client</strong>, a low-level interface to an AWS service, initiates the connection to S3:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">s3_upload</span>(<span class="hljs-params">file_path: str</span>):</span>
    <span class="hljs-comment"># initiate the boto3 client</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    S3_BUCKET_NAME = os.environ.get(<span class="hljs-string">'S3_BUCKET_NAME'</span>) <span class="hljs-comment"># the bucket created in s3</span>
    s3_client = boto3.client(<span class="hljs-string">'s3'</span>, region_name=os.environ.get(<span class="hljs-string">'AWS_REGION_NAME'</span>)) <span class="hljs-comment"># your default region</span>

    <span class="hljs-keyword">if</span> s3_client:
        <span class="hljs-comment"># create s3 key and upload the file to the bucket</span>
        s3_key = file_path <span class="hljs-keyword">if</span> file_path[<span class="hljs-number">0</span>] != <span class="hljs-string">'/'</span> <span class="hljs-keyword">else</span> file_path[<span class="hljs-number">1</span>:]
        s3_client.upload_file(file_path, S3_BUCKET_NAME, s3_key)
        main_logger.info(<span class="hljs-string">f"file uploaded to s3://<span class="hljs-subst">{S3_BUCKET_NAME}</span>/<span class="hljs-subst">{s3_key}</span>"</span>)
    <span class="hljs-keyword">else</span>:
        main_logger.error(<span class="hljs-string">'failed to create an S3 client.'</span>)
</code></pre>
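<p>The key derivation inside <code>s3_upload</code> can be pulled out and checked without touching AWS. The sketch below is a hypothetical helper, not part of the project's code. It assumes, as the function above does, that the S3 object key should mirror the local file path without a leading slash, and it adds a guard for an empty path. The example paths are made up.</p>
<pre><code class="lang-python">def build_s3_key(file_path):
    """Derive an S3 object key from a local path by dropping one leading slash."""
    if not file_path:
        raise ValueError('file_path must be a non-empty string')
    return file_path[1:] if file_path.startswith('/') else file_path


# example paths are made up for illustration
print(build_s3_key('/models/dfn_best.pth'))       # models/dfn_best.pth
print(build_s3_key('data/processed_df.parquet'))  # data/processed_df.parquet
</code></pre>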
<h4 id="heading-model-store">Model Store</h4>
<p>Trained PyTorch models are serialized (converted) into <code>.pth</code> files.</p>
<p>Then, these files are uploaded to the S3 bucket, enabling the system to load the trained model when it performs inference in production.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> s3_upload

<span class="hljs-comment"># model serialization, store in local</span>
torch.save(trained_model.state_dict(), MODEL_FILE_PATH)

<span class="hljs-comment"># upload to s3 model store</span>
s3_upload(file_path=MODEL_FILE_PATH)
</code></pre>
<h4 id="heading-feature-store">Feature Store</h4>
<p>The processed data is saved in both CSV and Parquet formats.</p>
<p>Then, the Parquet files are uploaded to the S3 bucket, enabling the system to load the lightweight data when it creates prediction data to perform inference in production.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> s3_upload

<span class="hljs-comment"># store csv and parquet files in local</span>
df.to_csv(file_path, index=<span class="hljs-literal">False</span>)
df.to_parquet(DATA_FILE_PATH, index=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># store in s3 feature store</span>
s3_upload(file_path=DATA_FILE_PATH)

<span class="hljs-comment"># trained preprocessor is also stored to transform the prediction data</span>
s3_upload(file_path=PROCESSOR_PATH)
</code></pre>
<h3 id="heading-step-3-create-a-flask-application-with-api-endpoints">Step 3: Create a Flask Application with API Endpoints</h3>
<p>Next, we’ll create a Flask application with API endpoints.</p>
<p>The Flask application is configured in the <code>app.py</code> file located at the root of the project repository.</p>
<p>As shown in the code snippet below, the <code>app.py</code> file needs to contain the following components, in this order:</p>
<ol>
<li><p>AWS Boto3 client setup,</p>
</li>
<li><p>Flask app configuration and API endpoint setup,</p>
</li>
<li><p>Loading the trained preprocessor, processed input data <code>X_test</code>, and trained models,</p>
</li>
<li><p>Invoking the Lambda function via API Gateway, and</p>
</li>
<li><p>The local test section.</p>
</li>
</ol>
<p>Note that <code>X_test</code> should never be used during model training to avoid data leakage.</p>
<p><code>app.py</code></p>
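<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> flask <span class="hljs-keyword">import</span> Flask, request, jsonify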
<span class="hljs-keyword">from</span> flask_cors <span class="hljs-keyword">import</span> cross_origin
<span class="hljs-keyword">from</span> waitress <span class="hljs-keyword">import</span> serve
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># global variables (will be loaded from the S3 buckets)</span>
_redis_client = <span class="hljs-literal">None</span>
X_test = <span class="hljs-literal">None</span>
preprocessor = <span class="hljs-literal">None</span>
model = <span class="hljs-literal">None</span>
backup_model = <span class="hljs-literal">None</span>

<span class="hljs-comment"># load env if local else skip (lambda refers to env in production)</span>
AWS_LAMBDA_RUNTIME_API = os.environ.get(<span class="hljs-string">'AWS_LAMBDA_RUNTIME_API'</span>, <span class="hljs-literal">None</span>)
<span class="hljs-keyword">if</span> AWS_LAMBDA_RUNTIME_API <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>: load_dotenv(override=<span class="hljs-literal">True</span>)


<span class="hljs-comment">#### &lt;---- 1. AWS BOTO3 CLIENT ----&gt;</span>
<span class="hljs-comment"># boto3 client </span>
S3_BUCKET_NAME = os.environ.get(<span class="hljs-string">'S3_BUCKET_NAME'</span>, <span class="hljs-string">'ml-sales-pred'</span>)
s3_client = boto3.client(<span class="hljs-string">'s3'</span>, region_name=os.environ.get(<span class="hljs-string">'AWS_REGION_NAME'</span>, <span class="hljs-string">'us-east-1'</span>))
<span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># test connection to boto3 client</span>
    sts_client = boto3.client(<span class="hljs-string">'sts'</span>)
    identity = sts_client.get_caller_identity()
    main_logger.info(<span class="hljs-string">f"Lambda is using role: <span class="hljs-subst">{identity[<span class="hljs-string">'Arn'</span>]}</span>"</span>)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    main_logger.error(<span class="hljs-string">f"Lambda credentials/permissions error: <span class="hljs-subst">{e}</span>"</span>)

<span class="hljs-comment">#### &lt;---- 2. FLASK CONFIGURATION &amp; API ENDPOINTS ----&gt;</span>
<span class="hljs-comment"># configure the flask app</span>
app = Flask(__name__)
app.config[<span class="hljs-string">'CORS_HEADERS'</span>] = <span class="hljs-string">'Content-Type'</span>

<span class="hljs-comment"># add a simple API endpoint to serve the prediction by price point to test</span>
<span class="hljs-meta">@app.route('/v1/predict-price/&lt;string:stockcode&gt;', methods=['GET', 'OPTIONS'])</span>
<span class="hljs-meta">@cross_origin(origins=origins, methods=['GET', 'OPTIONS'], supports_credentials=True)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_price</span>(<span class="hljs-params">stockcode</span>):</span>
    df_stockcode = <span class="hljs-literal">None</span>

    <span class="hljs-comment"># fetch request params</span>
    data = request.args.to_dict()

    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># fetch cache</span>
        <span class="hljs-keyword">if</span> _redis_client <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            <span class="hljs-comment"># returns cached prediction results if any without performing inference</span>
            cached_prediction_result = _redis_client.get(cache_key_prediction_result_by_stockcode)
            <span class="hljs-keyword">if</span> cached_prediction_result: 
                <span class="hljs-keyword">return</span> jsonify(json.loads(json.dumps(cached_prediction_result)))

            <span class="hljs-comment"># historical data of the selected product</span>
            cached_df_stockcode = _redis_client.get(cache_key_df_stockcode)
            <span class="hljs-keyword">if</span> cached_df_stockcode: df_stockcode = json.loads(cached_df_stockcode)


        <span class="hljs-comment"># define the price range to make predictions. can be a request param, or historical min/max prices</span>
        min_price = float(data.get(<span class="hljs-string">'unitprice_min'</span>, df_stockcode[<span class="hljs-string">'unitprice_min'</span>][<span class="hljs-number">0</span>]))
        max_price = float(data.get(<span class="hljs-string">'unitprice_max'</span>, df_stockcode[<span class="hljs-string">'unitprice_max'</span>][<span class="hljs-number">0</span>]))

        <span class="hljs-comment"># create bins in the price range. when the number of the bins increase, the prediction becomes more smooth, but requires more computational cost</span>
        NUM_PRICE_BINS = int(data.get(<span class="hljs-string">'num_price_bins'</span>, <span class="hljs-number">100</span>))
        price_range = np.linspace(min_price, max_price, NUM_PRICE_BINS)

        <span class="hljs-comment"># create a prediction dataset by merging X_test (dataset never used in model training) and df_stockcode</span>
        price_range_df = pd.DataFrame({ <span class="hljs-string">'unitprice'</span>: price_range })
        test_sample = X_test.sample(n=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">42</span>)
        test_sample_merged = test_sample.merge(price_range_df, how=<span class="hljs-string">'cross'</span>) <span class="hljs-keyword">if</span> X_test <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">else</span> price_range_df
        test_sample_merged.drop(<span class="hljs-string">'unitprice_x'</span>, axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-literal">True</span>)
        test_sample_merged.rename(columns={<span class="hljs-string">'unitprice_y'</span>: <span class="hljs-string">'unitprice'</span>}, inplace=<span class="hljs-literal">True</span>)

        <span class="hljs-comment"># preprocess the dataset</span>
        X = preprocessor.transform(test_sample_merged) <span class="hljs-keyword">if</span> preprocessor <span class="hljs-keyword">else</span> test_sample_merged

        <span class="hljs-comment"># perform inference</span>
        y_pred_actual = <span class="hljs-literal">None</span>
        epsilon = <span class="hljs-number">0</span>
        <span class="hljs-comment"># try using the primary model</span>
        <span class="hljs-keyword">if</span> model:
            input_tensor = torch.tensor(X, dtype=torch.float32)
            model.eval()
            <span class="hljs-keyword">with</span> torch.inference_mode():
                y_pred = model(input_tensor)
                y_pred = y_pred.cpu().numpy().flatten()
                y_pred_actual = np.exp(y_pred + epsilon)

        <span class="hljs-comment"># if not, use backups</span>
        <span class="hljs-keyword">elif</span> backup_model:
            y_pred = backup_model.predict(X)
            y_pred_actual = np.exp(y_pred + epsilon)


        <span class="hljs-comment"># finalize the outcome for client app</span>
        df_ = test_sample_merged.copy()
        df_[<span class="hljs-string">'quantity'</span>] = np.floor(y_pred_actual) <span class="hljs-comment"># quantity must be an integer</span>
        df_[<span class="hljs-string">'sales'</span>] = df_[<span class="hljs-string">'quantity'</span>] * df_[<span class="hljs-string">'unitprice'</span>] <span class="hljs-comment"># compute sales</span>
        df_ = df_.sort_values(by=<span class="hljs-string">'unitprice'</span>)

        <span class="hljs-comment"># aggregate the results by the unitprice in the price range</span>
        df_results = df_.groupby(<span class="hljs-string">'unitprice'</span>).agg(
            quantity=(<span class="hljs-string">'quantity'</span>, <span class="hljs-string">'median'</span>),
            quantity_min=(<span class="hljs-string">'quantity'</span>, <span class="hljs-string">'min'</span>),
            quantity_max=(<span class="hljs-string">'quantity'</span>, <span class="hljs-string">'max'</span>),
            sales=(<span class="hljs-string">'sales'</span>, <span class="hljs-string">'median'</span>),
        ).reset_index()

        <span class="hljs-comment"># find the optimal price point</span>
        optimal_row = df_results.loc[df_results[<span class="hljs-string">'sales'</span>].idxmax()]
        optimal_price = optimal_row[<span class="hljs-string">'unitprice'</span>]
        optimal_quantity = optimal_row[<span class="hljs-string">'quantity'</span>]
        best_sales = optimal_row[<span class="hljs-string">'sales'</span>]

        all_outputs = []
        <span class="hljs-keyword">for</span> _, row <span class="hljs-keyword">in</span> df_results.iterrows():
            current_output = {
                <span class="hljs-string">"stockcode"</span>: stockcode,
                <span class="hljs-string">"unit_price"</span>: float(row[<span class="hljs-string">'unitprice'</span>]),
                <span class="hljs-string">'quantity'</span>: int(row[<span class="hljs-string">'quantity'</span>]),
                <span class="hljs-string">'quantity_min'</span>: int(row[<span class="hljs-string">'quantity_min'</span>]),
                <span class="hljs-string">'quantity_max'</span>: int(row[<span class="hljs-string">'quantity_max'</span>]),
                <span class="hljs-string">"predicted_sales"</span>: float(row[<span class="hljs-string">'sales'</span>]),
            }
            all_outputs.append(current_output)

        <span class="hljs-comment"># store the prediction results in cache</span>
        <span class="hljs-keyword">if</span> all_outputs <span class="hljs-keyword">and</span> _redis_client <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            serialized_data = json.dumps(all_outputs)
            _redis_client.set(
                cache_key_prediction_result_by_stockcode, 
                serialized_data,
                ex=<span class="hljs-number">3600</span>     <span class="hljs-comment"># expire in an hour</span>
            )

        <span class="hljs-comment"># return a list of all outputs</span>
        <span class="hljs-keyword">return</span> jsonify(all_outputs)

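    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-comment"># log the failure instead of silently returning an empty list</span>
        main_logger.error(<span class="hljs-string">f"prediction failed for stockcode <span class="hljs-subst">{stockcode}</span>: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> jsonify([])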


<span class="hljs-comment"># response header management (for the process from API gateway to the Lambda)</span>
<span class="hljs-meta">@app.after_request</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add_header</span>(<span class="hljs-params">response</span>):</span>
    response.headers[<span class="hljs-string">'Cache-Control'</span>] = <span class="hljs-string">'public, max-age=0'</span>
    response.headers[<span class="hljs-string">'Access-Control-Allow-Origin'</span>] = CLIENT_A
    response.headers[<span class="hljs-string">'Access-Control-Allow-Headers'</span>] = <span class="hljs-string">'Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token,Origin'</span>
    response.headers[<span class="hljs-string">'Access-Control-Allow-Methods'</span>] = <span class="hljs-string">'GET, POST, OPTIONS'</span>
    response.headers[<span class="hljs-string">'Access-Control-Allow-Credentials'</span>] = <span class="hljs-string">'true'</span>
    <span class="hljs-keyword">return</span> response

<span class="hljs-comment">#### &lt;---- 3. LOADING PROCESSOR, DATASET, AND MODELS ----&gt;</span>
load_preprocessor()
load_x_test()
load_model()

<span class="hljs-comment">#### &lt;---- 4. INVOKE LAMBDA ----&gt;</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handler</span>(<span class="hljs-params">event, context</span>):</span>
    main_logger.info(<span class="hljs-string">"lambda handler invoked."</span>)
    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># connecting the redis client after the lambda is invoked</span>
        get_redis_client()
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        main_logger.critical(<span class="hljs-string">f"failed to establish initial Redis connection in handler: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">'statusCode'</span>: <span class="hljs-number">500</span>,
            <span class="hljs-string">'body'</span>: json.dumps({<span class="hljs-string">'error'</span>: <span class="hljs-string">'Failed to initialize Redis client. Check environment variables and network config.'</span>})
        }

    <span class="hljs-comment"># use the awsgi package to convert JSON to WSGI</span>
    <span class="hljs-keyword">return</span> awsgi.response(app, event, context)


<span class="hljs-comment">#### &lt;---- 5. FOR LOCAL TEST ----&gt;</span>
<span class="hljs-comment"># serve the application locally on WSGI server, waitress</span>
<span class="hljs-comment"># lambda will ignore this section.</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:   
    <span class="hljs-keyword">if</span> os.getenv(<span class="hljs-string">'ENV'</span>) == <span class="hljs-string">'local'</span>:
        main_logger.info(<span class="hljs-string">"...start the operation (local)..."</span>)
        serve(app, host=<span class="hljs-string">'0.0.0.0'</span>, port=<span class="hljs-number">5002</span>)
    <span class="hljs-keyword">else</span>:
        app.run(host=<span class="hljs-string">'0.0.0.0'</span>, port=<span class="hljs-number">8080</span>)
</code></pre>
<p>I’ll test the endpoint locally using the <code>uv</code> package manager:</p>
<pre><code class="lang-bash">$ uv run app.py --cache-clear

$ curl http://localhost:5002/v1/predict-price/{STOCKCODE}
</code></pre>
<p>The system provided a list of sales predictions for each price point:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755607075000/e0e8cbcb-8817-4aa5-b3d1-37b76cc684fb.png" alt="Fig. Screenshot of the Flask app local response" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the Flask app local response</p>
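<p>To make the endpoint’s core logic easier to follow, here is a minimal, standard-library-only sketch of the same three steps: build a price grid, aggregate predictions per price with the median, and pick the price with the highest predicted sales. The <code>toy_demand</code> function and the <code>bases</code> values are hypothetical stand-ins for the trained model and the sampled test rows, and <code>linspace</code> only mimics <code>np.linspace</code>.</p>
<pre><code class="lang-python">import math
from statistics import median


def linspace(start, stop, num):
    # evenly spaced grid including both ends (assumes num is at least 2)
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def toy_demand(price, base):
    # hypothetical demand curve standing in for the trained model
    return math.floor(base * math.exp(-0.5 * price))


def best_price(min_price, max_price, num_bins, bases):
    rows = []
    for price in linspace(min_price, max_price, num_bins):
        # one predicted quantity per sampled row, all at the same price
        quantities = [toy_demand(price, base) for base in bases]
        sales = [quantity * price for quantity in quantities]
        rows.append({
            "unit_price": price,
            "quantity": median(quantities),
            "predicted_sales": median(sales),
        })
    # the optimal price is the grid point with the highest median sales
    return max(rows, key=lambda row: row["predicted_sales"]), rows


optimal, rows = best_price(1.0, 5.0, 5, bases=[80, 100, 120])
print(optimal["unit_price"], optimal["predicted_sales"])
</code></pre>
<p>Increasing <code>num_bins</code> gives a finer grid at the cost of more predictions, which is the same trade-off as <code>NUM_PRICE_BINS</code> in the endpoint.</p>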
<h4 id="heading-key-points-on-flask-app-configuration">Key Points on Flask App Configuration</h4>
<p>There are various points you should take into consideration when configuring a Flask application with Lambda. Let’s go over them now:</p>
<h5 id="heading-1-a-few-api-endpoints-per-container"><strong>1. A Few API Endpoints Per Container</strong></h5>
<p>Adding many API endpoints to a single serverless instance can lead to the <strong>monolithic function</strong> problem, where issues in one endpoint impact the others.</p>
<p>In this project, we’ll focus on a single endpoint per container – and if needed, we can add separate Lambda functions to the system.</p>
<h5 id="heading-2-understanding-the-handler-function-and-the-role-of-awsgi"><strong>2. Understanding the</strong> <code>handler</code> <strong>Function and the Role of AWSGI</strong></h5>
<p>The <code>handler</code> function is invoked every time the Lambda function receives a client request from the API Gateway.</p>
<p>The function takes the <code>event</code> argument that includes the request details in a <strong>JSON dictionary</strong> and passes it to the Flask application.</p>
<p><strong>AWSGI</strong> acts as an adapter, translating a Lambda event in JSON format into a WSGI request that a Flask application can understand, and converts the application’s response back into a JSON format that Lambda and API Gateway can process.</p>
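<p>The translation step can be illustrated with a simplified sketch. This is not the actual implementation of the <code>awsgi</code> package: it only shows the idea of mapping an API Gateway proxy event to a WSGI <code>environ</code> dictionary and turning the WSGI response back into a Lambda-style dictionary. The <code>tiny_wsgi_app</code> function is a hypothetical stand-in for the Flask app, and the stock code in the sample event is made up.</p>
<pre><code class="lang-python">import io
import json
from urllib.parse import urlencode


def tiny_wsgi_app(environ, start_response):
    # hypothetical stand-in for the Flask app: echoes the path and query string
    body = json.dumps({"path": environ["PATH_INFO"], "query": environ["QUERY_STRING"]})
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body.encode("utf-8")]


def event_to_environ(event):
    # map the fields of an API Gateway proxy event onto WSGI keys
    return {
        "REQUEST_METHOD": event.get("httpMethod", "GET"),
        "PATH_INFO": event.get("path", "/"),
        "QUERY_STRING": urlencode(event.get("queryStringParameters") or {}),
        "SERVER_NAME": "lambda",
        "SERVER_PORT": "443",
        "SERVER_PROTOCOL": "HTTP/1.1",
        "wsgi.version": (1, 0),
        "wsgi.url_scheme": "https",
        "wsgi.input": io.BytesIO((event.get("body") or "").encode("utf-8")),
        "wsgi.errors": io.StringIO(),
        "wsgi.multithread": False,
        "wsgi.multiprocess": False,
        "wsgi.run_once": False,
    }


def sketch_handler(event, context):
    captured = {}

    def start_response(status, headers, exc_info=None):
        # remember the status line and headers the WSGI app reports
        captured["status"] = status
        captured["headers"] = dict(headers)

    chunks = tiny_wsgi_app(event_to_environ(event), start_response)
    # convert the WSGI response back into the dictionary Lambda returns
    return {
        "statusCode": int(captured["status"].split()[0]),
        "headers": captured["headers"],
        "body": b"".join(chunks).decode("utf-8"),
    }


sample_event = {
    "httpMethod": "GET",
    "path": "/v1/predict-price/85123A",  # hypothetical stock code
    "queryStringParameters": {"num_price_bins": "50"},
}
response = sketch_handler(sample_event, None)
print(response["statusCode"], response["body"])
</code></pre>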
<h5 id="heading-3-using-cache-storage"><strong>3. Using Cache Storage</strong></h5>
<p>The <code>get_redis_client</code> function is called once the <code>handler</code> function is called by the API Gateway. This allows the Flask application to store or fetch a cache from the Redis client:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> redis
<span class="hljs-keyword">import</span> redis.cluster
<span class="hljs-keyword">from</span> redis.cluster <span class="hljs-keyword">import</span> ClusterNode

_redis_client = <span class="hljs-literal">None</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_redis_client</span>():</span>
    <span class="hljs-keyword">global</span> _redis_client
    <span class="hljs-keyword">if</span> _redis_client <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        REDIS_HOST = os.environ.get(<span class="hljs-string">"REDIS_HOST"</span>)
        REDIS_PORT = int(os.environ.get(<span class="hljs-string">"REDIS_PORT"</span>, <span class="hljs-number">6379</span>))
        REDIS_TLS = os.environ.get(<span class="hljs-string">"REDIS_TLS"</span>, <span class="hljs-string">"true"</span>).lower() == <span class="hljs-string">"true"</span>
        <span class="hljs-keyword">try</span>:
            startup_nodes = [ClusterNode(host=REDIS_HOST, port=REDIS_PORT)]
            _redis_client = redis.cluster.RedisCluster(
                startup_nodes=startup_nodes,
                decode_responses=<span class="hljs-literal">True</span>,
                skip_full_coverage_check=<span class="hljs-literal">True</span>,
                ssl=REDIS_TLS,                  <span class="hljs-comment"># elasticache has encryption in transit: enabled -&gt; must be true</span>
                ssl_cert_reqs=<span class="hljs-literal">None</span>,
                socket_connect_timeout=<span class="hljs-number">5</span>,
                socket_timeout=<span class="hljs-number">5</span>,
                health_check_interval=<span class="hljs-number">30</span>,
                retry_on_timeout=<span class="hljs-literal">True</span>,
                retry_on_error=[
                    redis.exceptions.ConnectionError,
                    redis.exceptions.TimeoutError
                ],
                max_connections=<span class="hljs-number">10</span>,            <span class="hljs-comment"># limit connections for Lambda</span>
                max_connections_per_node=<span class="hljs-number">2</span>     <span class="hljs-comment"># limit per node</span>
            )
            _redis_client.ping()
            main_logger.info(<span class="hljs-string">"successfully connected to ElastiCache Redis Cluster (Configuration Endpoint)"</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f"an unexpected error occurred during Redis Cluster connection: <span class="hljs-subst">{e}</span>"</span>, exc_info=<span class="hljs-literal">True</span>)
            _redis_client = <span class="hljs-literal">None</span>
    <span class="hljs-keyword">return</span> _redis_client
</code></pre>
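<p>The caching flow used by the endpoint (check the cache, compute on a miss, then store the result with a time-to-live) can be sketched as follows. <code>FakeRedis</code> is a hypothetical in-memory stand-in that only mirrors the <code>get</code> and <code>set(..., ex=...)</code> calls used above, and the key format and stock code are assumptions for illustration.</p>
<pre><code class="lang-python">import json


class FakeRedis:
    # hypothetical in-memory stand-in for the Redis client (no real expiry)
    def __init__(self):
        self.store = {}
        self.ttl = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value, ex=None):
        self.store[key] = value
        self.ttl[key] = ex
        return True


def cached_predictions(client, stockcode, compute):
    key = f"prediction:{stockcode}"  # hypothetical key format
    if client is not None:
        cached = client.get(key)
        if cached:
            # the cache holds a JSON string, so one json.loads restores the list
            return json.loads(cached), True
    result = compute(stockcode)
    if result and client is not None:
        client.set(key, json.dumps(result), ex=3600)  # expire in an hour
    return result, False


def fake_inference(stockcode):
    # hypothetical stand-in for the model inference step
    return [{"stockcode": stockcode, "unit_price": 2.5, "quantity": 10}]


client = FakeRedis()
first, first_hit = cached_predictions(client, "85123A", fake_inference)
second, second_hit = cached_predictions(client, "85123A", fake_inference)
print(first_hit, second_hit, first == second)
</code></pre>
<p>Passing <code>None</code> as the client skips the cache entirely, which matches how the endpoint behaves when the Redis connection is unavailable.</p>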
<h5 id="heading-4-handling-heavy-tasks-outside-of-the-handler-function"><strong>4. Handling Heavy Tasks Outside of the</strong> <code>handler</code> <strong>Function</strong></h5>
<p>Serverless functions can experience a <strong>cold start</strong>: extra latency on the first request while a new execution environment is initialized.</p>
<p>While a Lambda function can run for up to 15 minutes, its associated API Gateway has a default integration timeout of 29 seconds (29,000 ms) for a REST API.</p>
<p>So, any heavy tasks like loading preprocessors, input data, or models should be performed once outside of the <code>handler</code> function, ensuring they are ready <em>before</em> the API endpoint is called.</p>
<p>Here are the loading functions called in <code>app.py</code>.</p>
<p><code>app.py</code></p>
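<pre><code class="lang-python"><span class="hljs-keyword">import</span> os

<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch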

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> s3_load, s3_load_to_temp_file

preprocessor = <span class="hljs-literal">None</span>
X_test = <span class="hljs-literal">None</span>
model = <span class="hljs-literal">None</span>
backup_model = <span class="hljs-literal">None</span>


<span class="hljs-comment"># load processor</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_preprocessor</span>():</span>
    <span class="hljs-keyword">global</span> preprocessor
    preprocessor_tempfile_path = s3_load_to_temp_file(PREPROCESSOR_PATH)
    preprocessor = joblib.load(preprocessor_tempfile_path)
    os.remove(preprocessor_tempfile_path)


<span class="hljs-comment"># load input data</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_x_test</span>():</span>
    <span class="hljs-keyword">global</span> X_test
    x_test_io = s3_load(file_path=X_TEST_PATH)
    X_test = pd.read_parquet(x_test_io)


<span class="hljs-comment"># load model</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_model</span>():</span>
    <span class="hljs-keyword">global</span> model, backup_model
    <span class="hljs-comment"># try loading &amp; reconstructing the primary model</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># first load io file from the s3 bucket</span>
        model_data_bytes_io_ = s3_load(file_path=DFN_FILE_PATH)
        <span class="hljs-comment"># convert to checkpoint dictionary (containing hyperparameter set)</span>
        checkpoint_ = torch.load(
            model_data_bytes_io_, 
            weights_only=<span class="hljs-literal">False</span>, 
            map_location=device
        )
        <span class="hljs-comment"># reconstruct the model</span>
        model = t.scripts.load_model(checkpoint=checkpoint_, file_path=DFN_FILE_PATH)
        <span class="hljs-comment"># set the model evaluation mode</span>
        model.eval()

    <span class="hljs-comment"># otherwise, fall back to the backup model</span>
    <span class="hljs-keyword">except</span> Exception:
        load_artifacts_backup_model()
</code></pre>
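<p>Why loading at module level helps can be shown with a small sketch. Code at the top level of the module runs once when the execution environment starts, and later invocations of the handler reuse what was loaded. The <code>load_artifacts</code> function and the <code>ARTIFACTS</code> dictionary are hypothetical stand-ins for the preprocessor, dataset, and model.</p>
<pre><code class="lang-python">import time

LOAD_COUNT = 0
ARTIFACTS = None


def load_artifacts():
    # hypothetical stand-in for loading the preprocessor, dataset, and model
    global LOAD_COUNT, ARTIFACTS
    LOAD_COUNT += 1
    time.sleep(0.01)  # pretend this is an expensive S3 download
    ARTIFACTS = {"model": "loaded"}


# runs once per execution environment, when the module is imported (cold start)
load_artifacts()


def handler(event, context):
    # warm invocations reuse ARTIFACTS instead of reloading it
    return {"statusCode": 200, "body": ARTIFACTS["model"]}


results = [handler({}, None) for _ in range(3)]
print(LOAD_COUNT, results[0]["body"])
</code></pre>
<p>Three invocations trigger a single load, so only the cold start pays the loading cost.</p>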
<h3 id="heading-step-4-publish-a-docker-image-to-ecr">Step 4: Publish a Docker Image to ECR</h3>
<p>After configuring the Flask application, we’ll containerize the entire application on <strong>Docker</strong>.</p>
<p>Containerization packages the application, along with its models, dependencies, and configuration, into a single container.</p>
<p>Docker creates a container image based on the instructions defined in a Dockerfile, and the Docker engine uses the image to run the isolated container.</p>
<p>In this project, we’ll upload the Docker container image to ECR, so the Lambda function can access it in production.</p>
<p>Before building the image, we’ll define the <code>.dockerignore</code> file to keep the container image small:</p>
<p><code>.dockerignore</code></p>
<pre><code class="lang-plaintext"># any irrelevant data
__pycache__/
.ruff_cache/
.DS_Store/
.venv/
dist/
.vscode
*.psd
*.pdf
[a-f]*.log
tmp/
awscli-bundle/

# add any experimental models, unnecessary data
dfn_bayesian/
dfn_grid/
data/
notebooks/
</code></pre>
<p><code>Dockerfile</code></p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># serve from aws ecr </span>
<span class="hljs-keyword">FROM</span> public.ecr.aws/lambda/python:<span class="hljs-number">3.12</span>

<span class="hljs-comment"># define a working directory in the container</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>

<span class="hljs-comment"># copy the entire repository (except paths listed in .dockerignore) into the container at /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> . /app/</span>

<span class="hljs-comment"># install dependencies defined in the requirements.txt</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install --no-cache-dir -r requirements.txt</span>

<span class="hljs-comment"># define commands</span>
<span class="hljs-keyword">ENTRYPOINT</span><span class="bash"> [ <span class="hljs-string">"python"</span> ]</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [ <span class="hljs-string">"-m"</span>, <span class="hljs-string">"awslambdaric"</span>, <span class="hljs-string">"app.handler"</span> ]</span>
</code></pre>
<h4 id="heading-test-in-local">Test Locally</h4>
<p>Next, we’ll build the Docker image locally and name it <code>my-app</code>:</p>
<pre><code class="lang-bash">$ docker build -t my-app -f Dockerfile .
</code></pre>
<p>Then, we’ll run the container locally with the <code>waitress</code> server:</p>
<pre><code class="lang-bash">$ docker run -p 5002:5002 -e ENV=<span class="hljs-built_in">local</span> my-app app.py
</code></pre>
<p>The <code>-e ENV=local</code> flag sets the environment variable inside the container, which triggers the <code>waitress.serve()</code> call in <code>app.py</code>.</p>
<p>In the terminal, you’ll find a message saying the following:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*zu8mamgKMKOUxwCA.png" alt="Flask app response" width="600" height="400" loading="lazy"></p>
<p>You can also call the endpoint created to see the results returned:</p>
<pre><code class="lang-bash">$ uv run app.py --cache-clear

$ curl http://localhost:5002/v1/predict-price/{STOCKCODE}
</code></pre>
<h4 id="heading-publish-the-docker-image-to-ecr">Publish the Docker Image to ECR</h4>
<p>To publish the Docker image, we first need to configure the default AWS credentials and region:</p>
<ul>
<li><p>From the AWS account console, create an access key and check the default region.</p>
</li>
<li><p>Store them in the <code>~/.aws/credentials</code> and <code>~/.aws/config</code> files:</p>
</li>
</ul>
<p><code>~/.aws/credentials</code></p>
<pre><code class="lang-plaintext">[default] 
aws_secret_access_key=
aws_access_key_id=
</code></pre>
<p><code>~/.aws/config</code></p>
<pre><code class="lang-plaintext">[default]
region=
</code></pre>
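<p>Both files use an INI-style layout, so a quick sanity check can be sketched with Python’s built-in <code>configparser</code>. The sample contents below use placeholder example values, and <code>missing_keys</code> is a hypothetical helper, not part of any AWS tooling.</p>
<pre><code class="lang-python">import configparser

# placeholder values for illustration only
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
"""

SAMPLE_CONFIG = """
[default]
region = us-east-1
"""


def missing_keys(ini_text, required, profile="default"):
    # return the required keys that are absent or empty in the given profile
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(profile):
        return list(required)
    return [key for key in required if not parser.get(profile, key, fallback="").strip()]


print(missing_keys(SAMPLE_CREDENTIALS, ["aws_access_key_id", "aws_secret_access_key"]))
print(missing_keys(SAMPLE_CONFIG, ["region"]))
</code></pre>
<p>To check real files, read their text from disk and pass it to the same helper.</p>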
<p>After the configuration, we’ll publish the Docker image to ECR.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># authenticate the docker client to ECR</span>
$ aws ecr get-login-password --region &lt;your-aws-region&gt; | docker login --username AWS --password-stdin &lt;your-aws-account-id&gt;.dkr.ecr.&lt;your-aws-region&gt;.amazonaws.com

<span class="hljs-comment"># create repository</span>
$ aws ecr create-repository --repository-name &lt;your-repo-name&gt; --region &lt;your-aws-region&gt;

<span class="hljs-comment"># tag the docker image</span>
$ docker tag &lt;your-repo-name&gt;:&lt;your-app-version&gt; &lt;your-aws-account-id&gt;.dkr.ecr.&lt;your-aws-region&gt;.amazonaws.com/&lt;your-repo-name&gt;:&lt;your-app-version&gt;

<span class="hljs-comment"># push</span>
$ docker push &lt;your-aws-account-id&gt;.dkr.ecr.&lt;your-aws-region&gt;.amazonaws.com/&lt;your-repo-name&gt;:&lt;your-app-version&gt;
</code></pre>
<p>Replace the placeholders as follows:</p>
<ul>
<li><p><code>&lt;your-aws-region&gt;</code>: Your default AWS region (for example, <code>us-east-1</code>).</p>
</li>
<li><p><code>&lt;your-aws-account-id&gt;</code>: 12-digit AWS account ID.</p>
</li>
<li><p><code>&lt;your-repo-name&gt;</code>: Your desired repository name.</p>
</li>
<li><p><code>&lt;your-app-version&gt;</code>: Your desired tag name (for example, <code>v1.0</code>).</p>
</li>
</ul>
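<p>How the placeholders combine into the full image URI can be sketched with a small helper. <code>ecr_image_uri</code> is a hypothetical function written for illustration, and the account ID, repository name, and tag below are example values.</p>
<pre><code class="lang-python">def ecr_image_uri(account_id, region, repo_name, tag):
    # the registry host is derived from the account ID and the region
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repo_name}:{tag}"


uri = ecr_image_uri("123456789012", "us-east-1", "my-app", "v1.0")
print(uri)
</code></pre>
<p>The same URI is used as the target of <code>docker tag</code> and as the argument of <code>docker push</code>.</p>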
<p>Now, the Docker image is stored in ECR with the tag:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*tUQkbDW-uAmrjBfx.png" alt="Fig. Screenshot of the AWS ECR console" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the AWS ECR console</p>
<h3 id="heading-step-5-create-a-lambda-function">Step 5: Create a Lambda Function</h3>
<p>Next, we’ll create a Lambda function.</p>
<p>From the Lambda console, choose:</p>
<ul>
<li><p>The <code>Container Image</code> option,</p>
</li>
<li><p>The container image URL from the drop-down list,</p>
</li>
<li><p>A function name of our choice, and</p>
</li>
<li><p>An architecture type (arm64 is recommended for better price-performance).</p>
</li>
</ul>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*3b-wIEUzRooQcvN_.png" alt="Fig. Screenshot of AWS Lambda function configuration" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of AWS Lambda function configuration</p>
<p>The Lambda function <code>my-app</code> was successfully launched.</p>
<h4 id="heading-connect-the-lambda-function-to-api-gateway">Connect the Lambda Function to API Gateway</h4>
<p>Next, we’ll add API Gateway as an event trigger to the Lambda function.</p>
<p>First, visit the API Gateway console and create <strong>REST API methods</strong> using the ARN of the Lambda function:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*60TP64gdSjhKfiO8.png" alt="Fig. Screenshot of the AWS API Gateway configuration" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the AWS API Gateway configuration</p>
<p>Then, add resources to the API Gateway to create an endpoint:<br><code>API Gateway &gt; APIs &gt; Resources &gt; Create Resource</code></p>
<ul>
<li><p>Align the resource endpoint with the API endpoint defined in <code>app.py</code>.</p>
</li>
<li><p>Configure CORS (for example, accept specific origins).</p>
</li>
<li><p>Deploy the resource to the stage.</p>
</li>
</ul>
<p>Going back to the Lambda console, you’ll find the API Gateway is connected as an event trigger:<br><code>Lambda &gt; Function &gt; my-app (your function name)</code></p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*DlfiEieZArmYlOuT.png" alt="Fig. Screenshot of the AWS Lambda dashboard" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the AWS Lambda dashboard</p>
<h3 id="heading-step-6-configure-aws-resources">Step 6: Configure AWS Resources</h3>
<p>Lastly, we’ll configure the related AWS resources to make the system work in production.</p>
<p>This process involves the following steps:</p>
<h4 id="heading-1-the-iam-role-controls-who-to-access-resources">1. The IAM Role: Controls Who Can Access Resources</h4>
<p>AWS uses <strong>IAM roles</strong> to grant temporary, secure permissions to users and services, mitigating the security risks of long-term credentials like passwords.</p>
<p>An IAM role relies on policies to grant access to selected services. Policies can be managed by AWS or customized by the user as inline policies.</p>
<p>It is important to avoid overly permissive access rights for the IAM role.</p>
<ol>
<li><p>In the Lambda function console, check the execution role:<br> <code>Lambda &gt; Function &gt; &lt;FUNCTION&gt; &gt; Permission &gt; The execution role</code>.</p>
</li>
<li><p>Set up the following policies to allow the Lambda’s IAM role to handle necessary operations:</p>
<ul>
<li><p><strong>Lambda</strong> <code>AWSLambdaExecute</code>: Allows executing the function.</p>
</li>
<li><p><strong>EC2</strong> <code>Inline policy</code>: Allows controlling the security group and the VPC of the Lambda function.</p>
</li>
<li><p><strong>ECR</strong> <code>AmazonElasticContainerRegistryPublicFullAccess</code> + <code>Inline policy</code>: Allows storing and pulling the Docker image.</p>
</li>
<li><p><strong>ElastiCache</strong> <code>AmazonElastiCacheFullAccess</code> + <code>Inline policy</code>: Allows storing and pulling caches.</p>
</li>
<li><p><strong>S3</strong> <code>AmazonS3ReadOnlyAccess</code> + <code>Inline policy</code>: Allows reading and storing contents.</p>
</li>
</ul>
</li>
</ol>
<p>Now, the IAM role can access these resources and perform the allowed actions.</p>
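<p>As an illustration of a narrowly scoped inline policy, the sketch below builds a policy document that only allows reading objects under one prefix of one S3 bucket. The bucket name <code>my-ml-artifacts</code> and the <code>models/</code> prefix are hypothetical, and the actions your function needs may differ.</p>
<pre><code class="lang-python">import json


def s3_read_only_policy(bucket_name, prefix=""):
    # least-privilege sketch: read objects under one prefix of one bucket
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/{prefix}*"],
            }
        ],
    }


policy = s3_read_only_policy("my-ml-artifacts", prefix="models/")
print(json.dumps(policy, indent=2))
</code></pre>
<p>The printed JSON can be pasted into the inline policy editor of the execution role.</p>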
<h4 id="heading-2-the-security-group-controls-network-traffic">2. The Security Group: Controls Network Traffic</h4>
<p>A <strong>security group</strong> is a virtual firewall that controls inbound and outbound network traffic for AWS resources.</p>
<p>It uses stateful, “allow-only” rules based on protocol, port, and IP address: return traffic is allowed automatically, and any traffic that no rule allows is denied.</p>
<p>Create a new security group for the Lambda function:<br><code>EC2 &gt; Security Groups &gt; &lt;YOUR SECURITY GROUP&gt;</code></p>
<p>Now, we’ll set up the inbound and outbound traffic rules.</p>
<p>The inbound rules:</p>
<ul>
<li><p><strong>S3 → Lambda</strong>: <strong>Type</strong>: HTTPS / <strong>Protocol</strong>: TCP / <strong>Port range</strong>: 443 / <strong>Source</strong>: Custom*</p>
</li>
<li><p><strong>ElastiCache → Lambda</strong>: <strong>Type</strong>: Custom TCP / <strong>Port range</strong>: 6379 / <strong>Source</strong>: Custom*</p>
</li>
</ul>
<p>*Choose the created security group for the Lambda function as a custom source.</p>
<p>The outbound rules:</p>
<ul>
<li><p><strong>Lambda → Internet</strong>: <strong>Type</strong>: HTTPS / <strong>Protocol</strong>: TCP / <strong>Port range</strong>: 443 / <strong>Destination</strong>: 0.0.0.0/0</p>
</li>
<li><p><strong>ElastiCache → Internet</strong>: <strong>Type</strong>: All Traffic / <strong>Destination</strong>: 0.0.0.0/0</p>
</li>
</ul>
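<p>The “allow-only” behavior of these rules can be modeled with a short sketch: traffic passes only if some rule matches its protocol, port, and address, and everything else is denied. This is a simplified model for intuition, not how AWS evaluates rules internally, and the CIDR ranges in <code>RULES</code> are illustrative values.</p>
<pre><code class="lang-python">import ipaddress

RULES = [
    # (protocol, first port, last port, allowed CIDR), illustrative values only
    ("tcp", 443, 443, "0.0.0.0/0"),
    ("tcp", 6379, 6379, "10.0.0.0/16"),
]


def is_allowed(protocol, port, address, rules=RULES):
    ip = ipaddress.ip_address(address)
    for rule_protocol, first, last, cidr in rules:
        if (
            protocol == rule_protocol
            and port in range(first, last + 1)
            and ip in ipaddress.ip_network(cidr)
        ):
            return True
    # no matching allow rule means the traffic is denied
    return False


print(is_allowed("tcp", 443, "203.0.113.10"))
print(is_allowed("tcp", 6379, "203.0.113.10"))
</code></pre>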
<h4 id="heading-3-the-virtual-private-cloud-vpc">3. The Virtual Private Cloud (VPC)</h4>
<p>A <strong>Virtual Private Cloud (VPC)</strong> provides a logically isolated private network for the AWS resources, acting as our own private data center within AWS.</p>
<p>AWS can create a <strong>Hyperplane ENI</strong> (Elastic Network Interface) for the Lambda function and its connected resources in the subnets of the VPC.</p>
<p>Though it’s optional, we’ll use the VPC to connect the Lambda function to the S3 storage and ElastiCache.</p>
<p>This process involves:</p>
<ol>
<li><p>Creating a VPC from the VPC console: <code>VPC &gt; Create VPC</code>.</p>
</li>
<li><p>Creating an STS (Security Token Service) endpoint:<br> <code>VPC &gt; PrivateLink and Lattice &gt; Endpoints &gt; Create Endpoint &gt;</code></p>
<ul>
<li><p><strong>Type</strong>: AWS Service</p>
</li>
<li><p><strong>Service name</strong>: com.amazonaws.&lt;YOUR REGION&gt;.sts</p>
</li>
<li><p><strong>Type</strong>: Interface</p>
</li>
<li><p><strong>VPC</strong>: Select the VPC created earlier.</p>
</li>
<li><p><strong>Subnets</strong>: Select all subnets.</p>
</li>
<li><p><strong>Security groups</strong>: Select the security group of the Lambda function.</p>
</li>
<li><p><strong>Policy</strong>: Full access</p>
</li>
<li><p><strong>Enable DNS names</strong></p>
</li>
</ul>
</li>
</ol>
<p>The VPC needs a dedicated STS endpoint so that the Lambda function can obtain temporary credentials from STS.</p>
<ol start="3">
<li><p>Creating an S3 endpoint in the VPC:<br> <code>VPC &gt; PrivateLink and Lattice &gt; Endpoints &gt; Create Endpoint &gt;</code></p>
<ul>
<li><p><strong>Type</strong>: AWS Service</p>
</li>
<li><p><strong>Service name</strong>: com.amazonaws.&lt;YOUR REGION&gt;.s3</p>
</li>
<li><p><strong>Type</strong>: Gateway</p>
</li>
<li><p><strong>VPC</strong>: Select the VPC created earlier.</p>
</li>
<li><p><strong>Subnets</strong>: Select all subnets.</p>
</li>
<li><p><strong>Security groups</strong>: Select the security group of the Lambda function.</p>
</li>
<li><p><strong>Policy</strong>: Full access</p>
</li>
</ul>
</li>
</ol>
<p>Lastly, check the security group of the Lambda function and ensure that its VPC ID points to the VPC created: <code>EC2 &gt; Security Group &gt; &lt;YOUR SECURITY GROUP FOR THE LAMBDA FUNCTION&gt; &gt; VPC ID</code>.</p>
<p>That’s all for the deployment flow.</p>
<p>We can now test the API endpoint in production. Copy the <strong>Invoke URL</strong> of the deployed API endpoint: <code>API Gateway &gt; APIs &gt; Stages &gt; Invoke URL</code>. Then call the API endpoint and check that it returns predictions:</p>
<pre><code class="lang-bash">curl -H <span class="hljs-string">'Authorization: Bearer YOUR_API_TOKEN'</span> -H <span class="hljs-string">'Accept: application/json'</span> \
     <span class="hljs-string">'&lt;INVOKE URL&gt;/&lt;ENDPOINT&gt;'</span>
</code></pre>
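<p>The same check can be scripted. Below is a minimal sketch using only Python’s standard library. The helper names <code>build_prediction_request</code> and <code>fetch_predictions</code>, the example URL, and the endpoint path are hypothetical placeholders, and the sketch assumes the endpoint returns a JSON body.</p>
<pre><code class="lang-python">import json
import urllib.request


def build_prediction_request(invoke_url, endpoint, api_token):
    """Build a GET request carrying the same headers as the curl call."""
    url = invoke_url.rstrip("/") + "/" + endpoint.lstrip("/")
    headers = {
        "Authorization": "Bearer " + api_token,
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_predictions(request, timeout=30):
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# placeholder values: substitute your own Invoke URL, endpoint, and token
request = build_prediction_request(
    invoke_url="https://example.execute-api.us-east-1.amazonaws.com/prod",
    endpoint="/YOUR_ENDPOINT",
    api_token="YOUR_API_TOKEN",
)
print(request.full_url)

# predictions = fetch_predictions(request)  # uncomment once the values are real
</code></pre>
<p>Building the request separately from sending it lets you inspect the URL and headers before making a real call.</p>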
<p>For logging and debugging, we’ll use the Live Tail feature of CloudWatch: <code>CloudWatch &gt; Live Tail</code>.</p>
<h2 id="heading-building-a-client-application-optional">Building a Client Application (Optional)</h2>
<p>For full-stack deployment, we’ll build a simple React application to display the prediction using the <a target="_blank" href="https://recharts.org/en-US">recharts</a> library for visualization.</p>
<p>Other options for quick frontend deployment include <a target="_blank" href="https://streamlit.io/">Streamlit</a> or <a target="_blank" href="https://www.gradio.app/">Gradio</a>.</p>
<h3 id="heading-the-react-application">The React Application</h3>
<p>The React application creates a web page that fetches and visualizes sales predictions from an external API, recommending an optimal price point.</p>
<p>The app uses <code>useState</code> to manage its data and state, including the selected product, the list of sales predictions, and the loading/error status.</p>
<p>When the user initiates a request, a <code>useEffect</code> hook triggers a <code>fetch</code> request to a Flask backend. In the simplified version shown here, the response is parsed as JSON and stored in state. If the request fails or returns nothing, the app falls back to the product’s <code>errorPrices</code>.</p>
<p>The <code>AreaChart</code> from the <code>recharts</code> library then visualizes this data. The X-axis represents the <code>price</code> and the Y-axis represents the <code>sales</code>. The chart re-renders whenever the predictions change. Finally, the app displays the optimal price, which is the price with the highest predicted sales.</p>
<p><code>App.js</code>: (in a separate React app)</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> { useState, useEffect } <span class="hljs-keyword">from</span> <span class="hljs-string">"react"</span>
<span class="hljs-keyword">import</span> { AreaChart, Area, XAxis, YAxis, CartesianGrid, Tooltip, ResponsiveContainer, ReferenceLine } <span class="hljs-keyword">from</span> <span class="hljs-string">'recharts'</span>


<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">App</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-comment">// state</span>
  <span class="hljs-keyword">const</span> [predictions, setPredictions] = useState([])
  <span class="hljs-keyword">const</span> [start, setStart] = useState(<span class="hljs-literal">false</span>)
  <span class="hljs-keyword">const</span> [isLoading, setIsLoading] = useState(<span class="hljs-literal">false</span>)

  <span class="hljs-comment">// product data</span>
  <span class="hljs-keyword">let</span> selectedStockcode = <span class="hljs-string">'85123A'</span>
  <span class="hljs-keyword">let</span> selectedProduct = productOptions.filter(<span class="hljs-function"><span class="hljs-params">item</span> =&gt;</span> item.id === selectedStockcode)[<span class="hljs-number">0</span>]

  <span class="hljs-comment">// api endpoint</span>
  <span class="hljs-keyword">const</span> flaskBackendUrl = <span class="hljs-string">"YOUR FLASK BACKEND URL"</span>

  <span class="hljs-comment">// create chart data to display</span>
  <span class="hljs-keyword">const</span> chartDataSales = predictions &amp;&amp; predictions.length &gt; <span class="hljs-number">0</span>
    ? predictions
      .map(<span class="hljs-function"><span class="hljs-params">item</span> =&gt;</span> ({
        <span class="hljs-attr">price</span>: item.unit_price,
        <span class="hljs-attr">sales</span>: item.predicted_sales,
        <span class="hljs-attr">volume</span>: item.unit_price !== <span class="hljs-number">0</span> ? item.predicted_sales / item.unit_price : <span class="hljs-number">0</span>
      }))
      .sort(<span class="hljs-function">(<span class="hljs-params">a, b</span>) =&gt;</span> a.price - b.price)
    : [...selectedProduct[<span class="hljs-string">'histPrices'</span>]]

  <span class="hljs-comment">// optimal price to display</span>
  <span class="hljs-keyword">const</span> optimalPrice = predictions.length &gt; <span class="hljs-number">0</span>
    ? [...predictions].sort(<span class="hljs-function">(<span class="hljs-params">a, b</span>) =&gt;</span> b.predicted_sales - a.predicted_sales)[<span class="hljs-number">0</span>][<span class="hljs-string">'unit_price'</span>]
    : <span class="hljs-number">0</span>

  <span class="hljs-comment">// fetch prediction results</span>
  useEffect(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-keyword">const</span> handlePrediction = <span class="hljs-keyword">async</span> () =&gt; {
      setIsLoading(<span class="hljs-literal">true</span>)
      setPredictions([])
      <span class="hljs-keyword">const</span> errorPrices = selectedProduct[<span class="hljs-string">'errorPrices'</span>]

      <span class="hljs-keyword">await</span> fetch(flaskBackendUrl)
        .then(<span class="hljs-function"><span class="hljs-params">res</span> =&gt;</span> {
          <span class="hljs-keyword">if</span> (res.status !== <span class="hljs-number">200</span>) { setPredictions(errorPrices); setIsLoading(<span class="hljs-literal">false</span>); setStart(<span class="hljs-literal">false</span>) }
          <span class="hljs-keyword">else</span> <span class="hljs-keyword">return</span> res.json()
        })
        .then(<span class="hljs-function"><span class="hljs-params">res</span> =&gt;</span> {
          <span class="hljs-keyword">if</span> (res &amp;&amp; res.length &gt; <span class="hljs-number">0</span>) setPredictions(res)
          <span class="hljs-keyword">else</span> setPredictions(errorPrices)
          setIsLoading(<span class="hljs-literal">false</span>); setStart(<span class="hljs-literal">false</span>)
        })
        .catch(<span class="hljs-function"><span class="hljs-params">err</span> =&gt;</span> { setPredictions(errorPrices); setIsLoading(<span class="hljs-literal">false</span>); setStart(<span class="hljs-literal">false</span>) })
        .finally(<span class="hljs-function">() =&gt;</span> setStart(<span class="hljs-literal">false</span>))
    }

    <span class="hljs-keyword">if</span> (start) handlePrediction()
    <span class="hljs-keyword">if</span> (predictions &amp;&amp; predictions.length &gt; <span class="hljs-number">0</span>) setStart(<span class="hljs-literal">false</span>)
  }, [flaskBackendUrl, start])


  <span class="hljs-comment">// render</span>
  <span class="hljs-keyword">if</span> (isLoading) <span class="hljs-keyword">return</span> <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">Loading</span> /&gt;</span></span>
  <span class="hljs-keyword">return</span> (
    <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">ResponsiveContainer</span> <span class="hljs-attr">width</span>=<span class="hljs-string">"100%"</span> <span class="hljs-attr">height</span>=<span class="hljs-string">"100%"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">AreaChart</span>
          <span class="hljs-attr">key</span>=<span class="hljs-string">{chartDataSales.length}</span>
          <span class="hljs-attr">data</span>=<span class="hljs-string">{chartDataSales}</span></span>
          margin={{ top: 10, right: 30, left: 0, bottom: 0 }}
        &gt;
          <span class="hljs-tag">&lt;<span class="hljs-name">CartesianGrid</span> <span class="hljs-attr">strokeDasharray</span>=<span class="hljs-string">"3 3"</span> <span class="hljs-attr">strokeOpacity</span>=<span class="hljs-string">{0.6}</span> /&gt;</span>

          <span class="hljs-tag">&lt;<span class="hljs-name">XAxis</span>
            <span class="hljs-attr">dataKey</span>=<span class="hljs-string">"price"</span>
            <span class="hljs-attr">label</span>=<span class="hljs-string">{{</span> <span class="hljs-attr">value:</span> "<span class="hljs-attr">Unit</span> <span class="hljs-attr">Price</span> ($)", <span class="hljs-attr">position:</span> "<span class="hljs-attr">insideBottom</span>", <span class="hljs-attr">offset:</span> <span class="hljs-attr">0</span>, <span class="hljs-attr">fontSize:</span> <span class="hljs-attr">12</span>, <span class="hljs-attr">marginTop:</span> <span class="hljs-attr">10</span> }}
            <span class="hljs-attr">tickFormatter</span>=<span class="hljs-string">{(tick)</span> =&gt;</span> `$${parseFloat(tick).toFixed(2)}`}
            tick={{ fontSize: 12 }}
            padding={{ left: 20, right: 20 }}
          /&gt;

          <span class="hljs-tag">&lt;<span class="hljs-name">YAxis</span>
            <span class="hljs-attr">label</span>=<span class="hljs-string">{{</span> <span class="hljs-attr">value:</span> "<span class="hljs-attr">Predicted</span> <span class="hljs-attr">Sales</span> ($)", <span class="hljs-attr">angle:</span> <span class="hljs-attr">-90</span>, <span class="hljs-attr">position:</span> "<span class="hljs-attr">insideLeft</span>", <span class="hljs-attr">fontSize:</span> <span class="hljs-attr">12</span> }}
            <span class="hljs-attr">tick</span>=<span class="hljs-string">{{</span> <span class="hljs-attr">fontSize:</span> <span class="hljs-attr">12</span> }}
            <span class="hljs-attr">tickFormatter</span>=<span class="hljs-string">{(tick)</span> =&gt;</span> `$${tick.toLocaleString()}`}
          /&gt;

          {/* tooltips with the prediction result data */}
          <span class="hljs-tag">&lt;<span class="hljs-name">Tooltip</span>
            <span class="hljs-attr">contentStyle</span>=<span class="hljs-string">{{</span>
              <span class="hljs-attr">borderRadius:</span> '<span class="hljs-attr">8px</span>',
              <span class="hljs-attr">padding:</span> '<span class="hljs-attr">10px</span>',
              <span class="hljs-attr">boxShadow:</span> '<span class="hljs-attr">0px</span> <span class="hljs-attr">0px</span> <span class="hljs-attr">15px</span> <span class="hljs-attr">rgba</span>(<span class="hljs-attr">0</span>,<span class="hljs-attr">0</span>,<span class="hljs-attr">0</span>,<span class="hljs-attr">0.5</span>)'
            }}
            <span class="hljs-attr">formatter</span>=<span class="hljs-string">{(value,</span> <span class="hljs-attr">name</span>) =&gt;</span> {
              if (name === 'sales') {
                return [`$${value.toFixed(4)}`, 'Predicted Sales']
              }
              if (name === 'volume') {
                return [`${value.toFixed(0)}`, 'Volume']
              }
              return value
            }}
            labelFormatter={(label) =&gt; `Price: $${label.toFixed(2)}`}
          /&gt;

          {/* chart area = sales */}
          <span class="hljs-tag">&lt;<span class="hljs-name">Area</span>
            <span class="hljs-attr">type</span>=<span class="hljs-string">"monotone"</span>
            <span class="hljs-attr">dataKey</span>=<span class="hljs-string">"sales"</span>
            <span class="hljs-attr">fillOpacity</span>=<span class="hljs-string">{1}</span>
            <span class="hljs-attr">fill</span>=<span class="hljs-string">"url(#colorSales)"</span>
          /&gt;</span>

          {/* vertical line for the optimal price */}
          {optimalPrice &gt; 0 &amp;&amp;
            <span class="hljs-tag">&lt;<span class="hljs-name">ReferenceLine</span>
              <span class="hljs-attr">x</span>=<span class="hljs-string">{optimalPrice}</span>
              <span class="hljs-attr">strokeDasharray</span>=<span class="hljs-string">"4 4"</span>
              <span class="hljs-attr">ifOverflow</span>=<span class="hljs-string">"visible"</span>
              <span class="hljs-attr">label</span>=<span class="hljs-string">{{</span>
                <span class="hljs-attr">value:</span> `<span class="hljs-attr">Optimal</span> <span class="hljs-attr">Price:</span> $${<span class="hljs-attr">optimalPrice</span> !== <span class="hljs-string">null</span> &amp;&amp; <span class="hljs-attr">optimalPrice</span> &gt;</span> 0 ? Math.ceil(optimalPrice * 10000) / 10000 : ''}`,
                position: "right",
                fontSize: 12,
                offset: 10
              }}
            /&gt;
          }
        <span class="hljs-tag">&lt;/<span class="hljs-name">AreaChart</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">ResponsiveContainer</span>&gt;</span>

      {optimalPrice &gt; 0 &amp;&amp; <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>Optimal Price: $ {Math.ceil(optimalPrice * 10000) / 10000}<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>}

    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span></span>
  )
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> App
</code></pre>
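<p>The data shaping inside the component is independent of React. As a rough illustration, here is the same logic sketched in Python, assuming each prediction is a dictionary with the <code>unit_price</code> and <code>predicted_sales</code> keys used above. The function names and the sample values are made up for this example.</p>
<pre><code class="lang-python">def to_chart_data(predictions):
    """Map raw predictions to chart rows sorted by price."""
    rows = []
    for item in predictions:
        price = item["unit_price"]
        sales = item["predicted_sales"]
        # volume is derived as sales / price, guarding against a zero price
        volume = sales / price if price != 0 else 0
        rows.append({"price": price, "sales": sales, "volume": volume})
    return sorted(rows, key=lambda row: row["price"])


def optimal_price(predictions):
    """Return the unit price with the highest predicted sales, or 0 if empty."""
    if not predictions:
        return 0
    best = max(predictions, key=lambda item: item["predicted_sales"])
    return best["unit_price"]


# illustrative sample, not real model output
sample = [
    {"unit_price": 4.0, "predicted_sales": 120.0},
    {"unit_price": 2.0, "predicted_sales": 90.0},
    {"unit_price": 3.0, "predicted_sales": 150.0},
]
chart = to_chart_data(sample)
best_price = optimal_price(sample)
print(chart[0], best_price)
</code></pre>
<p>Neither function modifies the input list, which mirrors the rule that React state should not be mutated in place.</p>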
<h2 id="heading-final-results">Final Results</h2>
<p>Now, the application is ready to serve.</p>
<p>You can explore the UI from <a target="_blank" href="https://kuriko-iwai.vercel.app/online-commerce-intelligence-hub">here</a>.</p>
<p>All code (backend) is available in <a target="_blank" href="https://github.com/krik8235/ml-sales-prediction">my GitHub Repo</a>.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building a machine learning system requires thoughtful project scoping and architecture design.</p>
<p>In this article, we built a dynamic pricing system behind a single interface on a containerized serverless architecture.</p>
<p>Moving forward, we’d need to consider potential drawbacks of this minimal architecture:</p>
<ul>
<li><p><strong>Increase in cold start duration</strong>: The WSGI adapter layer (<code>awsgi</code>) adds a small overhead, and a larger container image takes longer to load.</p>
</li>
<li><p><strong>Monolithic function:</strong> Adding endpoints to the Lambda function can lead to a monolithic function where an issue in one endpoint impacts others.</p>
</li>
<li><p><strong>Less granular observability</strong>: AWS CloudWatch cannot provide individual invocation/error metrics per API endpoint without custom instrumentation.</p>
</li>
</ul>
<p>To scale the application effectively, extracting functionality into separate microservices can be a good next step.</p>
<p>I’m Kuriko IWAI, and you can find more of my work and learn more about me here:</p>
<p><a target="_blank" href="https://kuriko-iwai.vercel.app/"><strong>Portfolio</strong></a> <strong>/</strong> <a target="_blank" href="https://www.linkedin.com/in/k-i-i/"><strong>LinkedIn</strong></a> <strong>/</strong> <a target="_blank" href="https://github.com/krik8235"><strong>Github</strong></a></p>
<p><em>All images, unless otherwise noted, are by the author. This application uses a synthetic dataset licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.</em></p>
<p><em>This information about AWS is current as of August 2025 and is subject to change.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code ]]>
                </title>
                <description>
                    <![CDATA[ The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design. In this tutorial, I’ll show you how to build both single layer and multi-layer perceptrons (MLPs) across three frameworks: Custom class... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-multilayer-perceptron-with-examples-and-python-code/</link>
                <guid isPermaLink="false">6839f729798ea464918cffe8</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ neural networks ]]>
                    </category>
                
                    <category>
                        <![CDATA[ binary classification ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MLP (Multi-Layer Perceptrons) ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Fri, 30 May 2025 18:21:29 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748616370600/01903917-4be7-476b-90d1-18295d19edef.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The <strong>perceptron</strong> is a fundamental concept in deep learning, with many algorithms stemming from its original design.</p>
<p>In this tutorial, I’ll show you how to build both single-layer and multi-layer perceptrons (MLPs) using three approaches:</p>
<ul>
<li><p>Custom classifier</p>
</li>
<li><p>Scikit-learn’s MLPClassifier</p>
</li>
<li><p>Keras Sequential classifier using SGD and Adam optimizers.</p>
</li>
</ul>
<p>This will help you learn about their various use cases and how they work.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-a-perceptron">What is a Perceptron?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-a-single-layered-classifier">How to Build a Single-Layered Classifier</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-a-multi-layer-perceptron">What is a Multi-Layer Perceptron?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-multi-layered-perceptrons">How to Build Multi-Layered Perceptrons</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-optimizers">Understanding Optimizers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-an-mlp-classifier-with-sgd-optimizer">How to Build an MLP Classifier with SGD Optimizer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-an-mlp-classifier-with-adam-optimizer">How to Build an MLP Classifier with Adam Optimizer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-results-generalization">Final Results: Generalization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>Mathematics (Calculus, Linear Algebra, Statistics)</p>
</li>
<li><p>Coding in Python</p>
</li>
<li><p>Basic understanding of Machine Learning concepts</p>
</li>
</ul>
<h2 id="heading-what-is-a-perceptron">What is a Perceptron?</h2>
<p>A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.</p>
<p>A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.</p>
<p>But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.</p>
<p>The perceptron consists of four main parts:</p>
<ul>
<li><p><strong>Input layer</strong>: Takes the initial numerical values into the system for further processing.</p>
</li>
<li><p><strong>Weights</strong>: Scale each input value before the values are summed together with a bias term.</p>
</li>
<li><p><strong>Activation function</strong>: Determines whether the neuron should fire based on the threshold value.</p>
</li>
<li><p><strong>Output layer</strong>: Produces the classification result.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748438698612/5b2920db-4ec1-455b-840e-7b5e9d6c2e75.png" alt="Image: Organization of a perceptron. Source: Rosenblatt 1958" class="image--center mx-auto" width="2100" height="746" loading="lazy"></p>
<p>It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.</p>
<p>So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.</p>
<h3 id="heading-applications-of-perceptrons">Applications of Perceptrons</h3>
<p>Perceptrons are applied to tasks such as:</p>
<ul>
<li><p><strong>Image classification:</strong> Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.</p>
</li>
<li><p><strong>Linear regression:</strong> With a linear (identity) activation in place of the threshold, a perceptron can predict continuous outputs from input features, which makes it usable for linear regression problems.</p>
</li>
</ul>
<h3 id="heading-how-the-activation-function-works">How the Activation Function Works</h3>
<p>For a single perceptron used for binary classification, the most common activation function is the <strong>step function</strong> (also known as the threshold function):</p>
<p>$$\phi(z) = \begin{cases} 1 &amp;\text{if } z \geq \theta \\ \\ 0 &amp;\text{if } z &lt; \theta \end{cases}$$</p><p>where:</p>
<ul>
<li><p><code>ϕ(z)</code>: the output of the activation function.</p>
</li>
<li><p><code>z</code>: the weighted sum of the inputs plus the bias:</p>
</li>
</ul>
<p>$$z = \sum_{i=1}^m w_i x_i + b$$</p><p>(x_i: input values, w_i: weight associated with each input, b: bias term, m: number of inputs)</p>
<p><code>θ</code> is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.</p>
<p>In that case, the formula becomes:</p>
<p>$$\phi(z) = \begin{cases} 1 &amp;\text{if } z \geq 0 \\ \\ 0 &amp;\text{if } z &lt; 0 \end{cases}$$</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748439460839/e74f1c1c-4e89-419b-aa9e-24a297d81ff5.png" alt="Image: Step Function (Author)" class="image--center mx-auto" width="1526" height="410" loading="lazy"></p>
<p>When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.</p>
<p>This occurs <strong>when the weighted sum reaches the threshold</strong> (here, z ≥ 0), leading the perceptron to predict that the input belongs to this class.</p>
<p>While the step function is the perceptron’s original activation, it is discontinuous at zero and has a zero gradient everywhere else, so it cannot be used with gradient-based training.</p>
<p>In modern implementations, we can use other activation functions like the <strong>sigmoid</strong> function:</p>
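<p>$$\sigma (z) = \frac {1} {1 + e^{-z}}$$</p><p>Unlike the step function, the sigmoid outputs a continuous value between zero and one, depending on the weighted sum (z). Thresholding that value (for example, at 0.5) yields a class label of zero or one.</p>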
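<p>To make the sigmoid concrete, here is a small NumPy sketch of a single forward pass. The input, weights, bias, and the 0.5 cut-off are arbitrary values chosen for illustration.</p>
<pre><code class="lang-python">import numpy as np


def sigmoid(z):
    """Squash a weighted sum into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through the sigmoid."""
    z = np.dot(x, w) + b
    return sigmoid(z)


# illustrative values: z = 1.0 * 0.5 + 2.0 * (-0.25) + 0.0 = 0.0
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.0

probability = forward(x, w, b)
label = int(probability >= 0.5)  # threshold the probability to get a class
print(probability, label)
</code></pre>
<p>With a weighted sum of exactly zero, the sigmoid returns 0.5, which is the boundary between the two classes.</p>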
<h3 id="heading-how-the-loss-function-works">How the Loss Function Works</h3>
<p>The <strong>loss function</strong> is a crucial concept in machine learning that quantifies the error or discrepancy between the model's predictions and the actual target values.</p>
<p>Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model's parameters in a way that minimizes this error and improves performance.</p>
<p>In a binary classification task, the model may adopt the <strong>hinge loss function</strong> to penalize misclassifications by incurring an additional cost for incorrect predictions:</p>
<p>$$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$</p><p>(h(x): the model’s raw output score, y: true label encoded as −1 or +1)</p>
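<p>As a quick numeric check of the hinge loss, the sketch below evaluates it for three made-up samples. It assumes the labels are encoded as −1 or +1 and that the model output is a raw score rather than a 0/1 label.</p>
<pre><code class="lang-python">import numpy as np


def hinge_loss(y_true, scores):
    """Element-wise hinge loss: max(0, 1 - y * h(x)), with y in {-1, +1}."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.maximum(0.0, 1.0 - y_true * scores)


# illustrative labels and raw scores
labels = [1, -1, 1]
scores = [2.0, -0.5, -1.0]

losses = hinge_loss(labels, scores)
print(losses)  # [0.  0.5 2. ]
</code></pre>
<p>The first sample is correct with a comfortable margin, so it costs nothing. The second is correct but inside the margin, and the third is misclassified, so it gets the largest penalty.</p>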
<h2 id="heading-how-to-build-a-single-layered-classifier">How to Build a Single-Layered Classifier</h2>
<p>Now, let’s build a simple single-layer perceptron for binary classification.</p>
<h3 id="heading-1-custom-classifier">1. Custom Classifier</h3>
<h4 id="heading-initialize-the-classifier">Initialize the classifier</h4>
<p>We’ll first initialize the classifier with a <code>learning_rate</code> and the number of epochs (<code>n_iterations</code>). The <code>weights</code> and <code>bias</code> start as <code>None</code> and are set during training.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
    self.learning_rate = learning_rate
    self.n_iterations = n_iterations
    self.weights = <span class="hljs-literal">None</span>
    self.bias = <span class="hljs-literal">None</span>
</code></pre>
<h4 id="heading-define-the-activation-function">Define the activation function</h4>
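<p>Use a step function that returns zero if input (x) ≤ 0, else 1. By default, the <code>threshold</code> is set to zero. Note that this implementation uses a strict inequality, so an input exactly at the threshold maps to zero, unlike the ≥ convention in the formula above.</p>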
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
    <span class="hljs-keyword">return</span> np.where(x &gt; threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
</code></pre>
<h4 id="heading-train-the-model">Train the model</h4>
<p>Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: <code>weights</code> and <code>bias</code>.</p>
<p>This process is controlled by a specified number of training epochs defined by <code>n_iterations</code>.</p>
<p>In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined <code>learning_rate</code>.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
    n_samples, n_features = X.shape

    self.weights = np.zeros(n_features)
    self.bias = <span class="hljs-number">0</span>

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
            <span class="hljs-comment"># compute weighted sum (z)</span>
            z = np.dot(X[i], self.weights) + self.bias

            <span class="hljs-comment"># apply the activation function</span>
            y_pred = self._step_function(z)

            <span class="hljs-comment"># update weights and bias</span>
            self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
            self.bias += self.learning_rate * (y[i] - y_pred)
</code></pre>
<h4 id="heading-how-the-weights-work-in-the-iteration-loop">How the weights work in the iteration loop</h4>
<p>The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.</p>
<p>Its iterative update in the <code>for</code> loop aims to reduce classification errors such that:</p>
<p>$$\begin{align*} w_j &amp;:= w_j + \Delta w_j \\ &amp;:= w_j + \eta (y_i - \hat y_i)x_{ij} \\ &amp;= \begin{cases} w_j &amp;\text{(a) } y_i - \hat y_i = 0\\ w_j + \eta x_{ij} &amp;\text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} &amp;\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$</p><p>(<code>w_j</code>: j-th weight, <code>η</code>: learning rate, <code>x_ij</code>: j-th feature of the i-th sample, <code>y_i − ŷ_i</code>: error)</p>
<p>This means that:</p>
<ol>
<li><p>When the prediction is <strong>correct</strong>, the error is zero, so the weight is unchanged.</p>
</li>
<li><p>When the prediction is <strong>too low</strong> (y_i = 1 and ŷ_i = 0), each weight is moved in the direction of its input to increase the weighted sum.</p>
</li>
<li><p>When the prediction is <strong>too high</strong> (y_i = 0 and ŷ_i = 1), each weight is moved in the opposite direction to pull the weighted sum lower.</p>
</li>
</ol>
<h4 id="heading-how-the-bias-terms-work-in-the-iteration-loop">How the bias terms work in the iteration loop</h4>
<p>The bias determines the decision boundary’s intercept (position from the origin).</p>
<p>Similar to weights, we adjust the bias terms in each epoch to position the decision boundary:</p>
<p>$$\begin {align*} b &amp;:= b + \Delta b \\ &amp; := b + \eta (y_i - \hat y_i) \\ &amp;= \begin{cases} b &amp;\text{(a) } y_i - \hat y_i = 0\\ b + \eta &amp;\text{(b) } y_i - \hat y_i = 1 \\ b - \eta &amp;\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$</p><p>This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.</p>
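<p>To see the update rules for the weights and bias in action, here is one training step worked through in NumPy. The function name <code>perceptron_update</code>, the sample, and the learning rate are chosen only for illustration. The step function uses the same strict threshold at zero as the classifier above.</p>
<pre><code class="lang-python">import numpy as np


def perceptron_update(w, b, x, y, lr=0.1):
    """Apply one perceptron update for a single sample (x, y)."""
    z = np.dot(x, w) + b           # weighted sum
    y_pred = 1 if z > 0 else 0     # step activation with threshold 0
    error = y - y_pred             # 0, +1 (too low), or -1 (too high)
    w_new = w + lr * error * x     # weights move along x, scaled by the error
    b_new = b + lr * error         # bias moves by the error alone
    return w_new, b_new, error


# start from zero weights, as in fit()
w = np.zeros(2)
b = 0.0
x = np.array([2.0, -1.0])

# case (b): true label is 1, but z = 0 gives a prediction of 0
w1, b1, err = perceptron_update(w, b, x, y=1)
print(w1, b1, err)
</code></pre>
<p>Here the prediction is too low, so the error is +1: the weights move by 0.1 times the input, to roughly [0.2, -0.1], and the bias rises to 0.1.</p>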
<h4 id="heading-make-a-prediction">Make a prediction</h4>
<p>Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
    linear_output = np.dot(X, self.weights) + self.bias
    predictions = self._step_function(linear_output)
    <span class="hljs-keyword">return</span> predictions
</code></pre>
<p><strong>The entire classifier looks like this:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Perceptron</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = <span class="hljs-literal">None</span>
        self.bias = <span class="hljs-literal">None</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
        <span class="hljs-keyword">return</span> np.where(x &gt; threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
                linear_output = np.dot(X[i], self.weights) + self.bias
                y_pred = self._step_function(linear_output)
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
        <span class="hljs-keyword">return</span> self

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
        linear_output = np.dot(X, self.weights) + self.bias
        y_pred = self._step_function(linear_output)
        <span class="hljs-keyword">return</span> y_pred
</code></pre>
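<p>To make the update rule concrete, here is a minimal sketch of a single training step written outside the class. The helper name <code>perceptron_step</code> and the sample values are assumptions chosen for illustration, not part of the classifier above:</p>

```python
import numpy as np

def perceptron_step(weights, bias, x, y, learning_rate=0.1):
    """Apply one perceptron update for a single sample (x, y)."""
    linear_output = np.dot(x, weights) + bias
    # np.heaviside(z, 0.0) returns 1.0 only when z is strictly positive,
    # which matches the step rule used in the class above
    y_pred = np.heaviside(linear_output, 0.0)
    error = y - y_pred
    new_weights = weights + learning_rate * error * x
    new_bias = bias + learning_rate * error
    return new_weights, new_bias

weights, bias = np.zeros(2), 0.0
x, y = np.array([1.0, 1.0]), 1  # a positive sample the untrained model gets wrong
weights, bias = perceptron_step(weights, bias, x, y)
print(weights, bias)
```

<p>Because the untrained model predicts 0 for a sample labelled 1, the weights move in the direction of <code>x</code> and the bias grows by the learning rate. A correctly classified sample produces an error of zero and leaves the parameters unchanged.</p>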
<h4 id="heading-simulate-with-synthetic-datasets">Simulate with synthetic datasets</h4>
<p>First, we generate a synthetic, linearly separable dataset using <code>make_blobs</code>, then train the classifier we created and evaluate its accuracy on the training and test sets.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> make_blobs
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># create a mock dataset</span>
X, y = make_blobs(n_features=<span class="hljs-number">2</span>, centers=<span class="hljs-number">2</span>, n_samples=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">12</span>)

<span class="hljs-comment"># split</span>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># train the model</span>
perceptron = Perceptron(learning_rate=<span class="hljs-number">0.1</span>, n_iterations=<span class="hljs-number">1000</span>).fit(X_train, y_train)

<span class="hljs-comment"># make a prediction</span>
y_pred_train = perceptron.predict(X_train)
y_pred_test = perceptron.predict(X_test)

<span class="hljs-comment"># evaluate the results</span>
acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(<span class="hljs-string">f"Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>}</span> \nAccuracy (Test): <span class="hljs-subst">{acc_test:<span class="hljs-number">.3</span>}</span>"</span>)
</code></pre>
<h4 id="heading-results">Results</h4>
<p>The classifier generated a clear, highly accurate linear decision boundary.</p>
<ul>
<li><p><em>Accuracy (Train): 0.981</em></p>
</li>
<li><p><em>Accuracy (Test): 0.975</em></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440470195/0a01c5ad-124e-4f59-b4d5-9ee5dd5b23ce.png" alt="Decision boundary of single-layer perceptron (Custom classifier)" class="image--center mx-auto" width="1478" height="732" loading="lazy"></p>
<h3 id="heading-2-leverage-sckitlearns-mcp-classifier">2. Leverage Scikit-learn’s MLP Classifier</h3>
<p>For convenience, we’ll use scikit-learn’s built-in classifier (<code>MLPClassifier</code>) to build a similar, yet more robust classifier:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(), <span class="hljs-comment"># intentionally set empty to create a single layer perceptron</span>
    activation=<span class="hljs-string">'logistic'</span>, <span class="hljs-comment"># sigmoid for hidden layers (none here); the binary output layer is logistic as well</span>
    solver=<span class="hljs-string">'sgd'</span>, <span class="hljs-comment"># choosing SGD optimizer</span>
    max_iter=<span class="hljs-number">1000</span>,
    random_state=<span class="hljs-number">42</span>, 
    learning_rate=<span class="hljs-string">'constant'</span>, 
    learning_rate_init=<span class="hljs-number">0.1</span>
).fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(<span class="hljs-string">f"MLPClassifier\nAccuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>}</span> \nAccuracy (Test): <span class="hljs-subst">{acc_test:<span class="hljs-number">.3</span>}</span>"</span>)
</code></pre>
<h4 id="heading-results-1">Results</h4>
<p>The MLP classifier generated a clear linear decision boundary with slightly better accuracy scores.</p>
<ul>
<li><p><em>Accuracy (Train): 0.985</em></p>
</li>
<li><p><em>Accuracy (Test): 0.995</em></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440118956/f5391f47-711a-4948-b956-1a76dbd7ca92.png" alt="Decision boundary of single-layer perceptron (MLP Classifier)" class="image--center mx-auto" width="1720" height="850" loading="lazy"></p>
<h3 id="heading-limitations-of-single-layer-perceptrons">Limitations of Single-Layer Perceptrons</h3>
<p>Now, let’s talk about the key differences between the MLP classifier and our custom single-layer perceptron.</p>
<p>Unlike more general neural networks, single-layer perceptrons use a <strong>step function</strong> as their activation.</p>
<p>Because of its discontinuity at x=0, the step function is not differentiable there, and its derivative is zero everywhere else.</p>
<p>This property rules out <strong>gradient-based optimization algorithms</strong> such as SGD or Adam, because these methods rely on gradients (partial derivatives of the cost function) to decide how to adjust the parameters.</p>
<p>In contrast, most neural networks employ differentiable activation functions (for example, <strong>sigmoid</strong>, <strong>ReLU</strong>) and loss functions (for example, <strong>MSE</strong>, <strong>Cross-Entropy</strong>) for effective optimization.</p>
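<p>The contrast between the two activations can be checked numerically. The following is a minimal sketch that estimates slopes with central differences. The helper names and the sample points are assumptions chosen for illustration:</p>

```python
import numpy as np

def step(z):
    # 1.0 for strictly positive inputs, 0.0 otherwise
    return np.heaviside(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def central_difference(f, z, h=1e-4):
    """Numerical derivative estimate (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2 * h)

z = np.array([-2.0, -0.5, 0.5, 2.0])  # points away from the jump at 0
step_grad = central_difference(step, z)
sigmoid_grad = central_difference(sigmoid, z)
print("step:", step_grad)
print("sigmoid:", sigmoid_grad)
```

<p>Away from zero, the step function’s slope is exactly zero, so a gradient-based optimizer would receive no signal from it, while the sigmoid yields a non-zero slope at every sampled point.</p>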
<p>Other challenges of a single-layer perceptron include:</p>
<ul>
<li><p><strong>Limited to linear separability:</strong> Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.</p>
</li>
<li><p><strong>Lack of depth:</strong> Being single-layered, they cannot learn complex hierarchical representations.</p>
</li>
<li><p><strong>Limited optimizer options:</strong> As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.</p>
</li>
</ul>
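<p>The linear separability limit is easy to observe on the XOR problem. This sketch uses scikit-learn’s <code>Perceptron</code> from <code>sklearn.linear_model</code> as a stand-in for our custom class (an assumption made to keep the example self-contained):</p>

```python
import numpy as np
from sklearn.linear_model import Perceptron as SkPerceptron

# XOR: the label is 1 only when exactly one input is 1
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

clf = SkPerceptron(max_iter=1000, tol=None, random_state=0).fit(X_xor, y_xor)
xor_accuracy = clf.score(X_xor, y_xor)
print(xor_accuracy)
```

<p>No single straight line separates the two XOR classes, so a linear model can classify at most three of the four points correctly, however long it trains.</p>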
<p>In the next section, you’ll learn how multi-layer perceptrons overcome these limitations.</p>
<h2 id="heading-what-is-a-multi-layer-perceptron">What is a Multi-Layer Perceptron?</h2>
<p>An MLP is a type of feedforward artificial neural network that consists of at least <strong>three layers</strong> of nodes:</p>
<ul>
<li><p>an input layer,</p>
</li>
<li><p>one or more hidden layers, and</p>
</li>
<li><p>an output layer.</p>
</li>
</ul>
<p>Except for the input nodes, each node is a neuron that uses a <strong>nonlinear</strong> activation function.</p>
<p>MLPs are used for both classification and regression:</p>
<ul>
<li><p><strong>Classification tasks:</strong> MLPs are a common choice for problems such as handwriting recognition and speech recognition.</p>
</li>
<li><p><strong>Regression analysis:</strong> They are also applied to regression problems where the relationship between input and output is complex.</p>
</li>
</ul>
<h2 id="heading-how-to-build-multi-layered-perceptrons">How to Build Multi-Layered Perceptrons</h2>
<p>Let’s handle a binary classification task using a standard MLP architecture.</p>
<h3 id="heading-outline-of-the-project">Outline of the Project</h3>
<h4 id="heading-objective">Objective</h4>
<ul>
<li>Detect fraudulent transactions</li>
</ul>
<h4 id="heading-evaluation-metrics">Evaluation Metrics</h4>
<ul>
<li><p>Considering the cost of misclassification, we’ll prioritize improving the <strong>Recall</strong> and <strong>Precision</strong> scores.</p>
</li>
<li><p>Then we’ll check overall classification performance with the <strong>Accuracy</strong> score: (TP + TN) / (TP + TN + FP + FN).</p>
</li>
</ul>
<p><strong>Cost of Misclassification (from high to low):</strong></p>
<ul>
<li><p><strong>False Negative (FN):</strong> The model incorrectly identifies a fraudulent transaction as legitimate (missing actual fraud).</p>
</li>
<li><p><strong>False Positive (FP):</strong> The model incorrectly identifies a legitimate transaction as fraudulent (blocking legitimate customers).</p>
</li>
<li><p><strong>True Positive (TP):</strong> The model correctly identifies a fraudulent transaction as fraud.</p>
</li>
<li><p><strong>True Negative (TN):</strong> The model correctly identifies a non-fraudulent transaction as non-fraud.</p>
</li>
</ul>
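<p>The three metrics follow directly from the four counts above. Here is a minimal sketch. The function name and the counts are hypothetical and are not results from this project:</p>

```python
def classification_metrics(tp, tn, fp, fn):
    """Return recall, precision, and accuracy from confusion-matrix counts.

    Assumes the denominators are non-zero.
    """
    recall = tp / (tp + fn)        # share of actual fraud that was caught
    precision = tp / (tp + fp)     # share of fraud alerts that were correct
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, precision, accuracy

# hypothetical counts, chosen for illustration only
recall, precision, accuracy = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(f"Recall: {recall:.3f}, Precision: {precision:.3f}, Accuracy: {accuracy:.3f}")
```

<p>Recall falls as false negatives grow, which is why it is the metric most sensitive to the costliest error in fraud detection.</p>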
<h3 id="heading-planning-an-mlp-architecture">Planning an MLP Architecture</h3>
<p>In the network, 19 input features feed into the first hidden layer’s 30 neurons, which use a ReLU activation function.</p>
<p>Then, their outputs are passed to the second (output) layer, which produces a sigmoid value as the final output.</p>
<p>During the optimization process, we’ll let the optimizers (SGD and Adam) perform forward and backward passes to adjust the parameters.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440761512/37753a4c-f7f8-44bc-bea9-c50360830456.png" alt="Standard MLP Architecture for Binary Classification Tasks" class="image--center mx-auto" width="1384" height="752" loading="lazy"></p>
<p>Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using <a target="_blank" href="https://www.researchgate.net/publication/355148120_SS-MLP_A_Novel_Spectral-Spatial_MLP_Architecture_for_Hyperspectral_Image_Classification">image source</a>)</p>
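<p>The layer sizes described above can be sanity-checked with a few lines of NumPy. This is a sketch under stated assumptions: a single output neuron, randomly initialized weights, and a mock batch of four samples, none of which come from the fraud dataset:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_outputs = 19, 30, 1

# randomly initialized parameters, used for shape checking only
W1, b1 = rng.normal(size=(n_inputs, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_outputs)), np.zeros(n_outputs)

X_demo = rng.normal(size=(4, n_inputs))              # a mock batch of 4 samples
hidden = np.maximum(0.0, X_demo @ W1 + b1)           # ReLU hidden layer
output = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))   # sigmoid output layer

n_params = W1.size + b1.size + W2.size + b2.size
print(hidden.shape, output.shape, n_params)
```

<p>With these sizes the network has 19 × 30 + 30 hidden-layer parameters and 30 × 1 + 1 output-layer parameters, 631 in total.</p>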
<p>Especially in deeper networks, <strong>ReLU</strong> helps mitigate the <a target="_blank" href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem#:~:text=In%20machine%20learning%2C%20the%20vanishing,derivative%20of%20the%20loss%20function">vanishing gradient problem</a>, where gradients become extremely small as they are backpropagated from the output layer toward the earlier layers.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440797954/ba19bf66-cdb9-4bfb-9b92-e1e3f72e9fc7.png" alt="Comparison of major activation functions: From left to right: Sigmoid, Tanh, ReLU" class="image--center mx-auto" width="1694" height="578" loading="lazy"></p>
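<p>A rough calculation shows why the choice of activation matters for depth. The sketch below multiplies the best-case activation slope across a hypothetical stack of 10 layers and ignores the weight terms, so it is a simplification of real backpropagation:</p>

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

n_layers = 10  # hypothetical depth, chosen for illustration
max_sigmoid_slope = sigmoid_derivative(0.0)    # the sigmoid's slope peaks at z = 0
sigmoid_factor = max_sigmoid_slope ** n_layers
relu_factor = 1.0 ** n_layers                  # ReLU's slope is 1 for positive inputs
print(max_sigmoid_slope, sigmoid_factor, relu_factor)
```

<p>Even in the best case, the sigmoid’s slope of 0.25 shrinks the signal to less than one millionth after 10 layers, while ReLU passes it through unchanged for active neurons.</p>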
<p><a target="_blank" href="https://medium.com/data-science-collective/a-comprehensive-guide-on-neural-network-in-deep-learning-442ba9f1f0e5">Learn More: A Comprehensive Guide on Neural Network in Deep Learning</a></p>
<h3 id="heading-preprocessing-the-datasets">Preprocessing the Datasets</h3>
<p>First, we consolidate <a target="_blank" href="https://www.kaggle.com/datasets/computingvictor/transactions-fraud-datasets">three datasets  –  transaction, customer, and credit card</a>  –  into a single DataFrame, independently sanitizing numerical and categorical data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler, OneHotEncoder
<span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
<span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline

<span class="hljs-comment"># download the raw data to local</span>
<span class="hljs-keyword">import</span> kagglehub
path = kagglehub.dataset_download(<span class="hljs-string">"computingvictor/transactions-fraud-datasets"</span>)
dir = <span class="hljs-string">f'<span class="hljs-subst">{path}</span>/gd_card_flaud_demo'</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">sanitize_df</span>(<span class="hljs-params">amount_str</span>):</span>
    <span class="hljs-string">"""Removes '$' and converts the string to a float."""</span>
    <span class="hljs-keyword">if</span> isinstance(amount_str, str):
        <span class="hljs-keyword">return</span> float(amount_str.replace(<span class="hljs-string">'$'</span>, <span class="hljs-string">''</span>))
    <span class="hljs-keyword">return</span> amount_str

<span class="hljs-comment"># load transaction data</span>
trx_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/transactions_data.csv'</span>)

<span class="hljs-comment"># sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)</span>
trx_df = trx_df[trx_df[<span class="hljs-string">'errors'</span>].isna()]
trx_df = trx_df.drop(columns=[<span class="hljs-string">'merchant_city'</span>,<span class="hljs-string">'merchant_state'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'mcc'</span>, <span class="hljs-string">'errors'</span>], axis=<span class="hljs-string">'columns'</span>)
trx_df[<span class="hljs-string">'amount'</span>] = trx_df[<span class="hljs-string">'amount'</span>].apply(sanitize_df)

<span class="hljs-comment"># merge the dataframe with fraud transaction flag.</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/train_fraud_labels.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> fp:
    fraud_labels_json = json.load(fp=fp)

fraud_labels_dict = fraud_labels_json.get(<span class="hljs-string">'target'</span>, {})
fraud_labels_series = pd.Series(fraud_labels_dict, name=<span class="hljs-string">'is_fraud'</span>)
fraud_labels_series.index = fraud_labels_series.index.astype(int) <span class="hljs-comment"># convert the datatype from string to integer</span>
merged_df = pd.merge(trx_df, fraud_labels_series, left_on=<span class="hljs-string">'id'</span>, right_index=<span class="hljs-literal">True</span>, how=<span class="hljs-string">'left'</span>)
merged_df.fillna({<span class="hljs-string">'is_fraud'</span>: <span class="hljs-string">'No'</span>}, inplace=<span class="hljs-literal">True</span>)
merged_df[<span class="hljs-string">'is_fraud'</span>] = merged_df[<span class="hljs-string">'is_fraud'</span>].map({<span class="hljs-string">'Yes'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'No'</span>: <span class="hljs-number">0</span>})

<span class="hljs-comment"># load card data</span>
card_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/cards_data.csv'</span>)
card_df = card_df.drop(columns=[<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'acct_open_date'</span>, <span class="hljs-string">'card_number'</span>, <span class="hljs-string">'expires'</span>, <span class="hljs-string">'cvv'</span>], axis=<span class="hljs-string">'columns'</span>)
card_df[<span class="hljs-string">'credit_limit'</span>] = card_df[<span class="hljs-string">'credit_limit'</span>].apply(sanitize_df)

<span class="hljs-comment"># merge transaction and card data</span>
merged_df = pd.merge(left=merged_df, right=card_df, left_on=<span class="hljs-string">'card_id'</span>, right_on=<span class="hljs-string">'id'</span>, how=<span class="hljs-string">'inner'</span>)
merged_df = merged_df.drop(columns=[<span class="hljs-string">'id_y'</span>, <span class="hljs-string">'card_id'</span>], axis=<span class="hljs-string">'columns'</span>)

<span class="hljs-comment"># converts categorical variables into a new binary column (0 or 1)</span>
categorical_cols = merged_df.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=<span class="hljs-literal">False</span>, dtype=float) 
df = df.dropna().drop([<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'id_x'</span>], axis=<span class="hljs-number">1</span>)
print(<span class="hljs-string">'\nDataFrame: \n'</span>, df.head(n=<span class="hljs-number">3</span>))
</code></pre>
<p>DataFrame:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440856826/ba79bdaf-e0a1-457f-ab19-fda3e0f08141.png" alt="Base DataFrame" class="image--center mx-auto" width="1546" height="810" loading="lazy"></p>
<p>Our DataFrame shows an extremely <strong>skewed data distribution</strong> with:</p>
<ul>
<li><p>Fraud samples: 1,191</p>
</li>
<li><p>Non-fraud samples: 11,477,397</p>
</li>
</ul>
<p>For classification tasks, <strong>it's crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact</strong> on classification model performance, especially regarding the minority class.</p>
<p>For our data, we’ll:</p>
<ol>
<li><p>split the 1,191 fraud samples into training, validation, and test sets,</p>
</li>
<li><p>add an equal number of randomly chosen non-fraud samples from the DataFrame, and</p>
</li>
<li><p>adjust split balances later if generalization challenges arise.</p>
</li>
</ol>
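<pre><code class="lang-python"><span class="hljs-comment"># separate fraud and non-fraud rows from the preprocessed DataFrame</span>
df_fraud = df[df[<span class="hljs-string">'is_fraud'</span>] == <span class="hljs-number">1</span>]
df_non_fraud = df[df[<span class="hljs-string">'is_fraud'</span>] == <span class="hljs-number">0</span>]

<span class="hljs-comment"># define the desired size of the fraud samples for the validation and test sets</span>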
val_size_per_class = <span class="hljs-number">200</span>
test_size_per_class = <span class="hljs-number">200</span>

<span class="hljs-comment"># create test sets</span>
X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)
X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># combine to form the balanced test set</span>
X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_test = X_test[<span class="hljs-string">'is_fraud'</span>]
X_test = X_test.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)

<span class="hljs-comment"># remove sampled rows from the original dataframes to avoid data leakage</span>
df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)


<span class="hljs-comment"># create validation sets</span>
X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)
X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># combine to form the balanced validation set</span>
X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_val = X_val[<span class="hljs-string">'is_fraud'</span>]
X_val = X_val.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)

<span class="hljs-comment"># remove sampled rows from the remaining dataframes</span>
df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)


<span class="hljs-comment"># create training sets</span>
min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))

X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)
X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)

X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_train = X_train[<span class="hljs-string">'is_fraud'</span>]
X_train = X_train.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)


print(<span class="hljs-string">"\n--- Final Dataset Shapes and Distributions ---"</span>)
print(<span class="hljs-string">f"X_train shape: <span class="hljs-subst">{X_train.shape}</span>, y_train distribution: <span class="hljs-subst">{np.unique(y_train, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
print(<span class="hljs-string">f"X_val shape: <span class="hljs-subst">{X_val.shape}</span>, y_val distribution: <span class="hljs-subst">{np.unique(y_val, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
print(<span class="hljs-string">f"X_test shape: <span class="hljs-subst">{X_test.shape}</span>, y_test distribution: <span class="hljs-subst">{np.unique(y_test, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
</code></pre>
<p>After the operation, we secured 1,582 training, 400 validation, and 400 test samples, each dataset maintaining a <strong>50:50 split between fraud and non-fraud transactions</strong>:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*IZtK3l0hSqmkOrm9h_d9Jw.png" alt="X, y datasets shape" width="600" height="400" loading="lazy"></p>
<p>Considering the high-dimensional feature space with 19 input features, we’ll apply <strong>SMOTE</strong> to oversample the training data. SMOTE should not be applied to the validation or test sets, to avoid data leakage:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> imblearn.over_sampling <span class="hljs-keyword">import</span> SMOTE
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter

train_target = <span class="hljs-number">2000</span>

smote_train = SMOTE(
  sampling_strategy={<span class="hljs-number">0</span>: train_target, <span class="hljs-number">1</span>: train_target},  <span class="hljs-comment"># increase sample size to 2,000</span>
  random_state=<span class="hljs-number">12</span>
)
X_train, y_train = smote_train.fit_resample(X_train, y_train)

print(<span class="hljs-string">f"\nAfter SMOTE with custom sampling_strategy (target train: <span class="hljs-subst">{train_target}</span>):"</span>)
print(<span class="hljs-string">f"X_train_oversampled shape: <span class="hljs-subst">{X_train.shape}</span>"</span>)
print(<span class="hljs-string">f"y_train_oversampled distribution: <span class="hljs-subst">{Counter(y_train)}</span>"</span>)
</code></pre>
<p>We’ve secured 4,000 training samples, maintaining a 50:50 split between fraud and non-fraud transactions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440986995/ed079321-3972-4226-b1a8-244010445162.png" alt="Training sample shape after SMOTE" class="image--center mx-auto" width="1578" height="218" loading="lazy"></p>
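<p>At its core, SMOTE creates each synthetic sample by interpolating between a minority-class sample and one of its nearest neighbors of the same class. The following is a simplified sketch of that single step, not imbalanced-learn’s implementation. The helper name and the mock values are assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(12)

def interpolate_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and its neighbor."""
    gap = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + gap * (neighbor - x)

x = np.array([1.0, 2.0])          # a sample from the class being oversampled (mock values)
neighbor = np.array([3.0, 6.0])   # one of its nearest same-class neighbors
synthetic = interpolate_sample(x, neighbor, rng)
print(synthetic)
```

<p>Because every synthetic point lies between two real samples, SMOTE adds variety without simply duplicating rows.</p>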
<p>Lastly, we’ll apply <strong>column transformers</strong> to numerical and categorical features separately.</p>
<p>Column transformers are advantageous in handling datasets with multiple data types, as they can apply different transformations to different subsets of columns while preventing data leakage.</p>
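<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler, OneHotEncoder
<span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer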
<span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline

categorical_features = X_train.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns.tolist()
categorical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'most_frequent'</span>)),(<span class="hljs-string">'onehot'</span>, OneHotEncoder(handle_unknown=<span class="hljs-string">'ignore'</span>))])

numerical_features = X_train.select_dtypes(include=[<span class="hljs-string">'int64'</span>, <span class="hljs-string">'float64'</span>]).columns.tolist()
numerical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)), (<span class="hljs-string">'scaler'</span>, StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        (<span class="hljs-string">'num'</span>, numerical_transformer, numerical_features),
        (<span class="hljs-string">'cat'</span>, categorical_transformer, categorical_features)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)
</code></pre>
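<p>The leakage safeguard comes from fitting on the training data only and reusing those statistics elsewhere. Here is a minimal sketch with <code>StandardScaler</code> on mock values (the numbers are assumptions, not taken from the fraud dataset):</p>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # mock training feature
val = np.array([[10.0]])                  # mock validation feature

scaler = StandardScaler().fit(train)      # statistics come from training data only
val_scaled = scaler.transform(val)        # reuse the training mean and std
print(scaler.mean_, val_scaled)
```

<p>The validation value is scaled with the training mean of 2.0, so nothing about the validation distribution influences the fitted transformer.</p>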
<h2 id="heading-understanding-optimizers">Understanding Optimizers</h2>
<p>In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training. Its primary role is to minimize the model’s loss function, enhancing performance.</p>
<p>Different optimizers employ distinct strategies to converge efficiently toward parameters that improve predictions.</p>
<p>In this article, we’ll use the SGD and Adam optimizers.</p>
<h3 id="heading-1-how-a-sgd-stochastic-gradient-descent-optimizer-works">1. How an SGD (Stochastic Gradient Descent) Optimizer Works</h3>
<p>SGD is a widely used optimization algorithm that computes the gradient (partial derivatives of the cost function) on a small mini-batch of examples at each update step, then moves the parameters against it:</p>
<p>$$\begin{align*} w_j &amp;:= w_j - \eta \frac {\partial J} {\partial w_j} \\ \\ b &amp;:= b - \eta \frac {\partial J} {\partial b} \end{align*}$$</p><p>(w: weight, b: bias, J: cost function, <em>η</em>: learning rate)</p>
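<p>The update rule above can be sketched in a few lines. To keep the example short, it assumes a squared-error cost on a mock linear dataset (the classifiers in this article use cross-entropy instead), and the variable names are chosen for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# mock data generated from y = 2x + 1
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0

w, b, eta, batch_size = 0.0, 0.0, 0.1, 20
for epoch in range(200):
    order = rng.permutation(len(X))             # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]   # indices of one mini-batch
        error = (w * X[idx] + b) - y[idx]
        grad_w = np.mean(error * X[idx])        # dJ/dw for J = mean(error^2) / 2
        grad_b = np.mean(error)                 # dJ/db
        w -= eta * grad_w
        b -= eta * grad_b
print(round(w, 3), round(b, 3))
```

<p>Each update uses only 20 samples, yet the parameters still approach the values that generated the data (w = 2, b = 1).</p>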
<p>In binary classification, the cost function (J) is the binary cross-entropy of the prediction ŷ = σ(z), where σ is the sigmoid function and z is the weighted sum of the inputs plus the bias term:</p>
<p>$$\begin{align*} J(y, \hat y) &amp;= -\left[ y \log(\hat y) + (1-y) \log(1-\hat y) \right] \\ \\ \hat y &amp;= \sigma (z) = \frac {1} {1+e^{-z}} \\ \\ z &amp;= \sum_{i=1}^m w_i x_i + b \end{align*}$$</p><h3 id="heading-2-how-adam-adaptive-moment-estimation-optimizer-works">2. How the Adam (Adaptive Moment Estimation) Optimizer Works</h3>
<p>Adam is an optimization algorithm that computes <strong>individual adaptive learning rates</strong> for different parameters from estimates of first and second moments of the gradients.</p>
<p>The Adam optimizer combines the advantages of <a target="_blank" href="https://keras.io/api/optimizers/rmsprop/"><strong>RMSprop</strong></a> (using squared gradients to scale the learning rate) and <a target="_blank" href="https://optimization.cbe.cornell.edu/index.php?title=Momentum"><strong>Momentum</strong></a> (using past gradients to accelerate convergence):</p>
<p>$$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$</p><p>where:</p>
<ul>
<li><p><code>α</code>: The learning rate (default is 0.001)</p>
</li>
<li><p><code>ϵ</code>: A small positive constant used to avoid division by zero</p>
</li>
<li><p><code>m^</code>: First moment (mean) estimate with a bias correction, leveraging <strong>Momentum</strong>:</p>
</li>
</ul>
<p>$$\begin{align*} \hat m_t &amp;= \frac {m_t} {1 - \beta_1^t} \\ \\ m_t &amp;= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{ \frac {\partial L} {\partial w_t}}_{\text{gradient}} \end{align*}$$</p><p>(<code>β1</code>: <strong>decay rate</strong> for the first moment, typically set to β1=0.9)</p>
<p><code>v^</code>: Second moment (variance) estimate with a bias correction, leveraging <strong>RMSprop</strong>:</p>
<p>$$\begin{align*} \hat v_t &amp;= \frac {v_t} {1 - \beta_2^t} \\ \\ v_t &amp;=\beta_2 v_{t-1} + (1- \beta_2) (\frac {\partial L} {\partial w_t})^2 \end{align*}$$</p><p>(<code>β2</code>: <strong>decay rate</strong> for the second moment, typically set to β2=0.999)</p>
<p>Since both <code>m</code> and <code>v</code> are initialized at zero, Adam computes the bias-corrected estimates to keep them from being biased toward zero during the early steps.</p>
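<p>The formulas above translate into a short update function. This is a minimal sketch for a single scalar parameter. The function name <code>adam_step</code> and the mock gradient are assumptions chosen for illustration:</p>

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the updated parameter and moment estimates after step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (Momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (RMSprop part)
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
grad = 50.0   # a deliberately large mock gradient
w_new, m, v = adam_step(w, grad, m, v, t=1)
print(w - w_new)
```

<p>On the first step, the bias-corrected moments equal the gradient and its square, so the parameter moves by roughly the learning rate (0.001) regardless of how large the gradient is.</p>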
<p>Learn More: <a target="_blank" href="https://medium.com/@kuriko-iwai/a-comprehensive-guide-on-neural-network-in-deep-learning-9c795a1f1648">A Comprehensive Guide on Neural Network in Deep Learning</a></p>
<h2 id="heading-how-to-build-an-mlp-classifier-with-sgd-optimizer">How to Build an MLP Classifier with SGD Optimizer</h2>
<h3 id="heading-custom-classifier">Custom Classifier</h3>
<p>Training involves a <strong>forward pass</strong> and <strong>backpropagation</strong>, during which SGD uses the gradients to update the weights and biases:</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
    <span class="hljs-comment"># SGD processes one mini-batch of the shuffled data at a time</span>
    X_batch = X_shuffled[i : i + self.batch_size]
    y_batch = y_shuffled[i : i + self.batch_size]

    <span class="hljs-comment"># A. forward pass</span>
    activations, zs = self._forward_pass(X_batch)
    y_pred = activations[<span class="hljs-number">-1</span>]  <span class="hljs-comment"># final output of the network</span>

    <span class="hljs-comment"># B. backpropagation</span>
    <span class="hljs-comment"># 1) calculate gradients for the output layer (sigmoid + cross-entropy)</span>
    delta = y_pred - y_batch
    dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
    db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

    <span class="hljs-comment"># propagate the error through the output weights before they are updated</span>
    upstream = np.dot(delta, self.weights[<span class="hljs-number">-1</span>].T)

    <span class="hljs-comment"># 2) update output layer parameters</span>
    self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
    self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db

    <span class="hljs-comment"># 3) iterate backward from last hidden layer to the input layer</span>
    <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
        delta = upstream * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
        dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
        db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
        upstream = np.dot(delta, self.weights[l].T) <span class="hljs-comment"># uses the pre-update weights</span>

        self.weights[l] -= self.learning_rate * dW
        self.biases[l] -= self.learning_rate * db
</code></pre>
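<p>The line <code>delta = y_pred - y_batch</code> relies on a convenient identity: with a sigmoid output and a cross-entropy cost, the derivative of the cost with respect to z simplifies to ŷ − y. The sketch below checks this numerically for one arbitrary label and pre-activation (both values are assumptions chosen for illustration):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, z):
    """Binary cross-entropy as a function of the pre-activation z."""
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y, z, h = 1.0, 0.3, 1e-6   # arbitrary label, pre-activation, and step size
numeric = (bce(y, z + h) - bce(y, z - h)) / (2 * h)   # finite-difference dJ/dz
analytic = sigmoid(z) - y                              # y_pred - y
print(numeric, analytic)
```

<p>The two values agree to several decimal places, which is why the output-layer gradient needs no explicit sigmoid derivative.</p>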
<p>During the forward pass, the network calculates the weighted sum of each layer’s inputs plus a bias (z), applies an activation function (ReLU) to the values in each hidden layer, and then computes the predicted output (y_pred) using a sigmoid function.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
    activations = [X]
    zs = []

    <span class="hljs-comment"># forward through hidden layers</span>
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
        z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
        zs.append(z)
        a = self._relu(z) <span class="hljs-comment"># using ReLU for hidden layers</span>
        activations.append(a)

    <span class="hljs-comment"># forward through output layer</span>
    z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
    zs.append(z_output)

    <span class="hljs-comment"># computes the final output using sigmoid function</span>
    y_pred = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(z_output, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))
    activations.append(y_pred)
    <span class="hljs-keyword">return</span> activations, zs
</code></pre>
<p>So the final classifier looks like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_SGD</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.01</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span></span>):</span>
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.weights = []
        self.biases = []
        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        s = self._sigmoid(x)
        <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> (x &gt; <span class="hljs-number">0</span>).astype(float)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]
        self.weights = []
        self.biases = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
            limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))
            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
        activations = [X]
        zs = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
            z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        <span class="hljs-keyword">return</span> activations, zs

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
        y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
        loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
        <span class="hljs-keyword">return</span> loss

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
        X = np.asarray(X)
        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
        self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
        self.loss_history.append(initial_loss)

        <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
            <span class="hljs-comment"># shuffle datasets</span>
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            <span class="hljs-comment"># mini-batch loop</span>
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[<span class="hljs-number">-1</span>]

                delta = y_pred - y_batch
                dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
                self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
                self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db

                <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                    delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                    dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                    db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                    self.weights[l] -= self.learning_rate * dW
                    self.biases[l] -= self.learning_rate * db

            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(epoch_loss)

            <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
        <span class="hljs-keyword">return</span> self

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
        activations, _ = self._forward_pass(X)
        <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
        probabilities = self.predict_proba(X)
        <span class="hljs-keyword">return</span> (probabilities &gt;= threshold).astype(int).flatten() <span class="hljs-comment"># for 1D output</span>
</code></pre>
<h3 id="heading-training-prediction">Training / Prediction</h3>
<p>Train the model and make a prediction using training and validation datasets:</p>
<pre><code class="lang-python"><span class="hljs-comment"># 1. define the model</span>
mlp_sgd = MLP_SGD(
  hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>, ), <span class="hljs-comment"># 2 hidden layers with 30 neurons each</span>
  learning_rate=<span class="hljs-number">0.001</span>,           <span class="hljs-comment"># a step size</span>
  n_epochs=<span class="hljs-number">1000</span>,                 <span class="hljs-comment"># number of epochs</span>
  batch_size=<span class="hljs-number">32</span>                  <span class="hljs-comment"># mini-batch size</span>
)

<span class="hljs-comment"># 2. train the model</span>
mlp_sgd.fit(X_train_processed, y_train)

<span class="hljs-comment"># 3. make a prediction with training and validation datasets</span>
y_pred_train = mlp_sgd.predict(X_train_processed)
y_pred_val = mlp_sgd.predict(X_val_processed)

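<span class="hljs-comment"># 4. compute evaluation metrics (accuracy on both sets, the rest on the validation set)</span>
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix, precision_score, recall_score, f1_score

conf_matrix = confusion_matrix(y_val, y_pred_val)
acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)
precision = precision_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)
recall = recall_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)
f1 = f1_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)
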
print(<span class="hljs-string">f"\nMLP (Custom SGD) Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"MLP (Custom SGD) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
</code></pre>
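<p>To make the reported metrics concrete, here is a small sketch of how precision, recall, and F1 follow from true-positive, false-positive, and false-negative counts. The function name <code>classification_metrics</code> and the toy labels are hypothetical. In practice, the scikit-learn functions used above do this work.</p>

```python
def classification_metrics(y_true, y_pred, pos_label=1):
    # count the confusion-matrix cells that involve the positive class
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)

    # guard against empty denominators
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# toy labels (hypothetical): 4 positives, 4 negatives
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]
precision_toy, recall_toy, f1_toy = classification_metrics(y_true_toy, y_pred_toy)
print(precision_toy, recall_toy, f1_toy)
```

<p>Recall is the share of actual positives (fraud cases) that the model catches, which is why it is the headline metric in the results that follow.</p>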
<h3 id="heading-results-2">Results</h3>
<ul>
<li><p>Recall: <em>0.7930 — 0.6650 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.7790 — 0.6786 (from training to validation)</em></p>
</li>
</ul>
<p>The model learned the training patterns well, reaching a <strong>Recall of 79.3%</strong> on the training set (it identified roughly 80% of the fraudulent transactions). Recall then dropped by about 13 points, to 66.5%, on the validation set, so the model generalizes only partially to unseen data.</p>
<p><strong>Loss history:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441103897/088deb38-846d-4026-a706-701be93036ca.png" alt="Loss by epoch, weight history, bias history (Source: Kuriko Iwai)" class="image--center mx-auto" width="1770" height="460" loading="lazy"></p>
<p>We visualized the <strong>decision boundary</strong> using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442430297/032ee809-1b7e-4bb1-81c0-8715361658a5.png" alt="Image: Decision Boundary of MLP Classifier with SGD optimizer (Source: Kuriko Iwai)" class="image--center mx-auto" width="1508" height="754" loading="lazy"></p>
<h3 id="heading-leverage-sckitlearns-mcp-classifier">Leverage Scikit-learn’s MLP Classifier</h3>
<p>We can use scikit-learn’s <code>MLPClassifier</code> to define a similar model, incorporating:</p>
<ul>
<li><p><strong>Early stopping</strong> using internal validation to prevent overfitting and</p>
</li>
<li><p><strong>L2 regularization</strong> with a small penalty strength (<code>alpha</code>) and a small optimization tolerance (<code>tol</code>).</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier

<span class="hljs-comment"># define a model</span>
model_sklearn_mlp_sgd = MLPClassifier(
    hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
    activation=<span class="hljs-string">'relu'</span>,
    solver=<span class="hljs-string">'sgd'</span>,
    learning_rate_init=<span class="hljs-number">0.001</span>,
    learning_rate=<span class="hljs-string">'constant'</span>,
    momentum=<span class="hljs-number">0.9</span>,
    nesterovs_momentum=<span class="hljs-literal">True</span>,
    alpha=<span class="hljs-number">0.00001</span>,           <span class="hljs-comment"># L2 regularization strength</span>
    max_iter=<span class="hljs-number">3000</span>,           <span class="hljs-comment"># max epochs (keep it high)</span>
    batch_size=<span class="hljs-number">16</span>,           <span class="hljs-comment"># mini-batch size</span>
    random_state=<span class="hljs-number">42</span>,
    early_stopping=<span class="hljs-literal">True</span>,     <span class="hljs-comment"># apply early stopping</span>
    n_iter_no_change=<span class="hljs-number">50</span>,     <span class="hljs-comment"># stop the iteration if internal validation score doesn't improve for 50 epochs</span>
    validation_fraction=<span class="hljs-number">0.1</span>, <span class="hljs-comment"># proportion of training data for internal validation (default is 0.1)</span>
    tol=<span class="hljs-number">1e-4</span>,                <span class="hljs-comment"># tolerance for optimization</span>
    verbose=<span class="hljs-literal">False</span>,
)

<span class="hljs-comment"># training</span>
model_sklearn_mlp_sgd.fit(X_train_processed, y_train)

<span class="hljs-comment"># make a prediction</span>
y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)
</code></pre>
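<p>The early-stopping rule configured above with <code>n_iter_no_change</code> and <code>tol</code> can be sketched roughly as follows. This is a simplified illustration, not scikit-learn's internal implementation, and its bookkeeping may differ in details. The function name <code>early_stopping_epoch</code> and the toy scores are assumptions.</p>

```python
def early_stopping_epoch(val_scores, n_iter_no_change=50, tol=1e-4):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a meaningful improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            # improvement larger than the tolerance: reset the counter
            best = score
            stale = 0
        else:
            stale += 1
        if stale >= n_iter_no_change:
            return epoch
    return None

# toy validation scores (hypothetical) that plateau after epoch 3
toy_scores = [0.60, 0.65, 0.70, 0.70, 0.69, 0.70, 0.68]
stop_epoch = early_stopping_epoch(toy_scores, n_iter_no_change=3, tol=1e-4)
print(stop_epoch)
```

<p>With a patience of 3, the toy run stops once three consecutive epochs fail to beat the best score by more than <code>tol</code>. A larger patience, such as the 50 used above, tolerates longer plateaus before giving up.</p>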
<h3 id="heading-results-3">Results</h3>
<ul>
<li><p>Recall: <em>0.7830 - 0.6200 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.8208  - 0.6703 (from training to validation)</em></p>
</li>
</ul>
<p>The model showed strong performance during training, achieving a <strong>Recall of 78.30%</strong>. Its performance declined on the validation set, where Recall fell to 62.00%.</p>
<p>This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.</p>
<h3 id="heading-leverage-keras-sequential-classifier">Leverage Keras Sequential Classifier</h3>
<p>With the Keras Sequential model, we can further enhance the classifier by:</p>
<ul>
<li><p>Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (y_train) to address dataset imbalance and promote faster convergence,</p>
</li>
<li><p>Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,</p>
</li>
<li><p>Including Precision and Recall in the model’s compilation metrics to track classification performance during training,</p>
</li>
<li><p>Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and</p>
</li>
<li><p>Utilizing a separate validation dataset for monitoring performance during training to help detect overfitting and guide hyperparameter tuning.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
<span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
<span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> SGD
<span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
<span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight


<span class="hljs-comment"># calculates an initial bias for the output layer </span>
initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])


<span class="hljs-comment"># defines the model</span>
model_keras_sgd = Sequential([
    Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>), <span class="hljs-comment"># 10% of the neurons in that layer randomly dropped out</span>
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, <span class="hljs-comment"># binary classification</span>
          bias_initializer=tf.keras.initializers.Constant(initial_bias)) <span class="hljs-comment"># to address the imbalanced datasets</span>
])



<span class="hljs-comment"># compiles the model with the SGD optimizer</span>
opt = SGD(learning_rate=<span class="hljs-number">0.001</span>)
model_keras_sgd.compile(
    optimizer=opt, 
    loss=<span class="hljs-string">'binary_crossentropy'</span>,
    metrics=[
        <span class="hljs-string">'accuracy'</span>, <span class="hljs-comment"># add several metrics to return</span>
        tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
        tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
        tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
    ]
)


<span class="hljs-comment"># defines early stopping to prevent overfitting</span>
early_stopping_callback = EarlyStopping(
    monitor=<span class="hljs-string">'val_recall'</span>,  <span class="hljs-comment"># monitor recall </span>
    mode=<span class="hljs-string">'max'</span>,         <span class="hljs-comment"># maximize recall</span>
    patience=<span class="hljs-number">50</span>,        <span class="hljs-comment"># stop after 50 epochs without improvement in val_recall</span>
    min_delta=<span class="hljs-number">1e-4</span>,     <span class="hljs-comment"># minimum change to be considered an improvement (tol)</span>
    verbose=<span class="hljs-number">0</span>
)


<span class="hljs-comment"># compute the class weight</span>
class_weights = class_weight.compute_class_weight(
    class_weight=<span class="hljs-string">'balanced'</span>,
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))


<span class="hljs-comment"># train the model</span>
history = model_keras_sgd.fit(
    X_train_processed, y_train,
    epochs=<span class="hljs-number">1000</span>,
    batch_size=<span class="hljs-number">32</span>,
    validation_data=(X_val_processed, y_val), <span class="hljs-comment"># use our external val set</span>
    callbacks=[early_stopping_callback], <span class="hljs-comment"># early stopping to prevent overfitting</span>
    class_weight=class_weights_dict, <span class="hljs-comment"># penalize misclassification of the minority class more heavily</span>
    verbose=<span class="hljs-number">0</span>
)

<span class="hljs-comment"># evaluate</span>
loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Train) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)

loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Validation) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)

<span class="hljs-comment"># display model summary</span>
model_keras_sgd.summary()
</code></pre>
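<p>To see what the initial bias and the balanced class weights used above amount to, here is a small worked sketch in plain Python. The class counts (900 negatives, 100 positives) are made up for illustration, and the helper names are hypothetical. The weight formula mirrors the "balanced" heuristic, <code>n_samples / (n_classes * class_count)</code>.</p>

```python
import math

def initial_output_bias(n_pos, n_neg):
    # log-odds of the positive class, used as the output layer's starting bias
    return math.log(n_pos / n_neg)

def balanced_class_weights(counts):
    # "balanced" heuristic: n_samples / (n_classes * class_count)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

# hypothetical class counts for an imbalanced dataset
toy_counts = {0: 900, 1: 100}

bias = initial_output_bias(n_pos=toy_counts[1], n_neg=toy_counts[0])
prior = 1.0 / (1.0 + math.exp(-bias))   # sigmoid(bias), about 0.1 here
weights = balanced_class_weights(toy_counts)

print(f"bias={bias:.4f}, initial predicted rate={prior:.4f}")
print(weights)
```

<p>With this bias, the untrained network already predicts the base rate of the positive class, so early epochs are not spent learning that positives are rare. The class weights then make each minority-class error count more in the loss.</p>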
<h3 id="heading-results-4">Results</h3>
<ul>
<li><p>Recall: <em>0.7125 — 0.7250 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.7607 — 0.7545 (from training to validation)</em></p>
</li>
</ul>
<p>Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.</p>
<p>It suggests that the regularization techniques are likely effective in preventing significant overfitting.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441165170/4e0528e3-514a-454c-b52a-2a0318ba405a.png" alt="Image: Summary of the Keras Sequential Model with SGD Optimizer" class="image--center mx-auto" width="1668" height="512" loading="lazy"></p>
<h2 id="heading-how-to-build-an-mlp-classifier-with-adam-optimizer">How to Build an MLP Classifier with Adam Optimizer</h2>
<h3 id="heading-custom-classifier-1">Custom Classifier</h3>
<p>With Adam, each parameter update happens inside the mini-batch loop: for every batch, we refresh the first- and second-moment estimates, apply bias correction, and then update the weights and biases:</p>
<pre><code class="lang-python"><span class="hljs-comment"># apply Adam updates for output layer parameters</span>
<span class="hljs-comment"># 1) weights (w)</span>
self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

<span class="hljs-comment"># 2) bias (b)</span>
self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
</code></pre>
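<p>The same update can be traced on a single scalar parameter. The sketch below is a minimal illustration under assumed toy values (a weight of 0.5 and a gradient of 0.2). The function name <code>adam_step</code> is hypothetical and mirrors the formulas above rather than any library API.</p>

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # update the biased first- and second-moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias correction compensates for the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # parameter update scaled by the corrected moments
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# toy values (hypothetical): first step, t = 1
w, m, v = 0.5, 0.0, 0.0
w, m, v = adam_step(w, grad=0.2, m=m, v=v, t=1)
print(w, m, v)
```

<p>On the first step, the bias-corrected ratio <code>m_hat / sqrt(v_hat)</code> is close to 1 in magnitude, so the parameter moves by roughly the learning rate regardless of the gradient's scale. That is one reason Adam is less sensitive to gradient magnitude than plain SGD.</p>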
<p>Following the same forward and backward passes, we construct the final classifier on top of the <code>MLP_SGD</code> architecture, adding the Adam hyperparameters <code>beta1</code>, <code>beta2</code>, and <code>epsilon</code> at initialization:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_Adam</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span>,
                 beta1=<span class="hljs-number">0.9</span>, beta2=<span class="hljs-number">0.999</span>, epsilon=<span class="hljs-number">1e-8</span></span>):</span>
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        self.weights = [] 
        self.biases = []

        <span class="hljs-comment"># Adam optimizer internal states for each parameter (weights and biases)</span>
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        s = self._sigmoid(x)
        <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> (x &gt; <span class="hljs-number">0</span>).astype(float)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]

        self.weights = []
        self.biases = []
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
            limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))

            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))

            self.m_weights.append(np.zeros((fan_in, fan_out)))
            self.v_weights.append(np.zeros((fan_in, fan_out)))
            self.m_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
            self.v_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
        activations = [X]
        zs = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
            z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        <span class="hljs-keyword">return</span> activations, zs

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
        y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
        loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
        <span class="hljs-keyword">return</span> loss

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
        X = np.asarray(X)

        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
        self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
        self.loss_history.append(initial_loss)

        <span class="hljs-comment"># global time step for Adam bias correction</span>
        t = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            <span class="hljs-comment"># Mini-batch loop</span>
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                t += <span class="hljs-number">1</span>

                <span class="hljs-comment"># 1. forward pass</span>
                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[<span class="hljs-number">-1</span>] <span class="hljs-comment"># Output of the network</span>

                <span class="hljs-comment"># 2. backpropagation</span>
                delta = y_pred - y_batch
                grad_w_output = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>] <span class="hljs-comment"># Average over batch</span>
                grad_b_output = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                <span class="hljs-comment"># snapshot the weights so the backward pass uses pre-update values</span>
                weights_prev = [w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights]

                <span class="hljs-comment"># apply Adam updates to weights</span>
                self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
                self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
                m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                <span class="hljs-comment"># apply Adam updates to bias</span>
                self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
                self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
                m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


                <span class="hljs-comment"># Propagate gradients backward through hidden layers</span>
                <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                    delta = np.dot(delta, weights_prev[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                    grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                    grad_b_hidden = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                    <span class="hljs-comment"># apply Adam updates to weights</span>
                    self.m_weights[l] = self.beta1 * self.m_weights[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_hidden
                    self.v_weights[l] = self.beta2 * self.v_weights[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_hidden ** <span class="hljs-number">2</span>)
                    m_w_hat = self.m_weights[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_w_hat = self.v_weights[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                    <span class="hljs-comment"># apply Adam updates to bias</span>
                    self.m_biases[l] = self.beta1 * self.m_biases[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_hidden
                    self.v_biases[l] = self.beta2 * self.v_biases[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_hidden ** <span class="hljs-number">2</span>)
                    m_b_hat = self.m_biases[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_b_hat = self.v_biases[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(epoch_loss)

            <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
        <span class="hljs-keyword">return</span> self


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
        activations, _ = self._forward_pass(X)
        <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
        probabilities = self.predict_proba(X)
        <span class="hljs-keyword">return</span> (probabilities &gt;= threshold).astype(int).flatten()
</code></pre>
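<p>The Adam update that <code>fit</code> repeats for every weight and bias array can be read as one reusable step. The sketch below is a minimal, standalone version of that step. The helper name <code>adam_step</code> and its default hyperparameters (the commonly used Adam defaults) are assumptions for illustration and are not part of the class above:</p>
<pre><code class="lang-python">import numpy as np


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (param, m, v) after one Adam step at 1-based time step t.

    Hypothetical helper: the name and default values are illustrative assumptions.
    """
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


# Toy parameters and gradients, only to show the call
w = np.array([1.0, -2.0])
g = np.array([0.5, -0.25])
w_new, m_new, v_new = adam_step(w, g, np.zeros_like(w), np.zeros_like(w), t=1)
print(w_new)
</code></pre>
<p>On the very first step the bias correction cancels the moving-average factors, so each parameter moves by roughly <code>lr</code> in the direction opposite to its gradient, regardless of the gradient’s magnitude.</p>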
<h3 id="heading-training-prediction-1">Training / Prediction</h3>
<p>Train the model, then generate predictions on the training and validation datasets:</p>
<pre><code class="lang-python">mlp_adam = MLP_Adam(hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">10</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">500</span>, batch_size=<span class="hljs-number">32</span>)
mlp_adam.fit(X_train_processed, y_train)

y_pred_train = mlp_adam.predict(X_train_processed)
y_pred_val = mlp_adam.predict(X_val_processed)

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

print(<span class="hljs-string">f"\nMLP (Custom Adam) Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"MLP (Custom Adam) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-results-5">Results</h3>
<ul>
<li><p>Recall: <em>0.9870–0.6150 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.9811–0.6474 (from training to validation)</em></p>
</li>
</ul>
<p>While the Adam optimizer outperformed SGD, the model exhibited significant overfitting, with Recall falling by about 37 points and Precision by about 33 points between training and validation.</p>
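<p>The training snippet above prints accuracy, while the results are reported as Recall and Precision. One way to compute those two metrics and the train-to-validation gap is sketched below with Scikit-learn’s <code>recall_score</code> and <code>precision_score</code>. The helper name <code>train_val_report</code> and the toy labels are assumptions for illustration, not the data used in this article:</p>
<pre><code class="lang-python">from sklearn.metrics import precision_score, recall_score


def train_val_report(y_train, pred_train, y_val, pred_val):
    """Return recall and precision on both splits plus the train-minus-validation gap."""
    report = {}
    for name, scorer in (("recall", recall_score), ("precision", precision_score)):
        train = scorer(y_train, pred_train)
        val = scorer(y_val, pred_val)
        report[name] = {"train": train, "val": val, "gap": train - val}
    return report


# Toy labels: a perfect fit on "train", two mistakes on "validation"
y_tr, p_tr = [1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0]
y_va, p_va = [1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0]

report = train_val_report(y_tr, p_tr, y_va, p_va)
for metric, values in report.items():
    print(f"{metric}: train={values['train']:.4f}, val={values['val']:.4f}, gap={values['gap']:.4f}")
</code></pre>
<p>With the real data, the same helper could be called with <code>y_train</code>, <code>y_pred_train</code>, <code>y_val</code>, and <code>y_pred_val</code> from the snippet above. A large positive gap is the overfitting signal discussed here.</p>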
<p><strong>Loss History</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442341394/3183a9b1-5df0-4f74-9473-6b5b595dc9c0.png" alt="Left: loss by epoch, middle: weights history by epoch, right: bias history by epoch (source: Kuriko Iwai)" class="image--center mx-auto" width="1676" height="456" loading="lazy"></p>
<p>We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442311514/34f004c9-bf1d-41e5-a0af-08c62802b78c.png" alt="Decision Boundary of MLP with Adam Optimizer (source: Kuriko Iwai)" class="image--center mx-auto" width="1770" height="916" loading="lazy"></p>
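<p>The exact plotting code is not shown here, so the following is only one possible way to draw a decision boundary on the first two principal components: build a grid in the 2D PCA plane, map it back to the original feature space with <code>inverse_transform</code>, and predict on it. The synthetic data, the <code>LogisticRegression</code> stand-in model, and the helper names are all assumptions for illustration. Any fitted classifier with a <code>predict</code> method could take its place:</p>
<pre><code class="lang-python">import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression


def boundary_grid(model, pca, X, steps=50, margin=1.0):
    """Predict on a regular grid laid out in the 2D PCA plane."""
    X_2d = pca.transform(X)
    x_min, y_min = X_2d.min(axis=0) - margin
    x_max, y_max = X_2d.max(axis=0) + margin
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, steps),
                         np.linspace(y_min, y_max, steps))
    grid_2d = np.c_[xx.ravel(), yy.ravel()]
    # Map each 2D grid point back to the original feature space before predicting
    grid_full = pca.inverse_transform(grid_2d)
    zz = model.predict(grid_full).reshape(xx.shape)
    return xx, yy, zz, X_2d


def plot_boundary(xx, yy, zz, X_2d, y):
    """Draw the predicted regions and the projected samples (requires matplotlib)."""
    import matplotlib.pyplot as plt
    plt.contourf(xx, yy, zz, alpha=0.3)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=10)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()


# Synthetic stand-in data and model, not the dataset used in this article
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
clf_demo = LogisticRegression().fit(X_demo, y_demo)
pca_demo = PCA(n_components=2).fit(X_demo)
xx, yy, zz, X_2d = boundary_grid(clf_demo, pca_demo, X_demo)
# plot_boundary(xx, yy, zz, X_2d, y_demo)  # uncomment to render the figure
</code></pre>
<p>Because the grid lives on a 2D plane inside a higher-dimensional feature space, the picture is a slice of the true boundary rather than the whole thing.</p>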
<h3 id="heading-leverage-sckitlearns-mcp-classifier-1">Leverage Scikit-learn’s MLP Classifier</h3>
<p>We’ve switched the optimizer from SGD to Adam, keeping all other settings constant:</p>
<pre><code class="lang-python">model_sklearn_mlp_adam = MLPClassifier(
    hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
    activation=<span class="hljs-string">'relu'</span>,
    solver=<span class="hljs-string">'adam'</span>,             <span class="hljs-comment"># update the optimizer from SGD to Adam</span>
    learning_rate_init=<span class="hljs-number">0.001</span>,
    learning_rate=<span class="hljs-string">'constant'</span>,
    alpha=<span class="hljs-number">0.0001</span>,
    max_iter=<span class="hljs-number">3000</span>,
    batch_size=<span class="hljs-number">16</span>,
    random_state=<span class="hljs-number">42</span>,
    early_stopping=<span class="hljs-literal">True</span>,
    n_iter_no_change=<span class="hljs-number">50</span>,
    validation_fraction=<span class="hljs-number">0.1</span>,
    tol=<span class="hljs-number">1e-4</span>,
    verbose=<span class="hljs-literal">False</span>,
)

model_sklearn_mlp_adam.fit(X_train_processed, y_train)

y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)
</code></pre>
<h3 id="heading-results-6">Results</h3>
<ul>
<li><p><em>Recall: 0.8975–0.6400 (from training to validation)</em></p>
</li>
<li><p><em>Precision: 0.8864–0.6305 (from training to validation)</em></p>
</li>
</ul>
<p>Despite a performance improvement compared to the SGD optimizer, the significant drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation data indicates that the model is still overfitting.</p>
<h3 id="heading-leverage-keras-sequential-classifier-1">Leverage Keras Sequential Classifier</h3>
<p>Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
<span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
<span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> Adam
<span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
<span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight


initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])
model_keras_adam = Sequential([
    Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, 
          bias_initializer=tf.keras.initializers.Constant(initial_bias))
])


optimizer_keras = Adam(learning_rate=<span class="hljs-number">0.001</span>)
model_keras_adam.compile(
    optimizer=optimizer_keras, 
    loss=<span class="hljs-string">'binary_crossentropy'</span>, 
    metrics=[
        <span class="hljs-string">'accuracy'</span>,
        tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
        tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
        tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
    ]
)

early_stopping_callback = EarlyStopping(
    monitor=<span class="hljs-string">'val_recall'</span>,
    mode=<span class="hljs-string">'max'</span>,
    patience=<span class="hljs-number">50</span>,
    min_delta=<span class="hljs-number">1e-4</span>,
    verbose=<span class="hljs-number">0</span>
)

class_weights = class_weight.compute_class_weight(
    class_weight=<span class="hljs-string">'balanced'</span>,
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

model_keras_adam.fit(
    X_train_processed, y_train,
    epochs=<span class="hljs-number">1000</span>,
    batch_size=<span class="hljs-number">32</span>,
    validation_data=(X_val_processed, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict,
    verbose=<span class="hljs-number">0</span>
)


loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Train) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)


loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Validation) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)


model_keras_adam.summary()
</code></pre>
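<p>Two lines in the Keras setup above deal with class imbalance: the output-layer <code>initial_bias</code> and the <code>'balanced'</code> class weights. The NumPy sketch below shows what each one evaluates to. The 9:1 toy label array is an assumption for illustration, not the article’s data:</p>
<pre><code class="lang-python">import numpy as np

# Hypothetical labels with a 9:1 class imbalance
y_toy = np.array([0] * 90 + [1] * 10)

n_pos = np.sum(y_toy == 1)
n_neg = np.sum(y_toy == 0)

# Setting the output bias to log(pos / neg) makes sigmoid(bias) equal the
# positive-class rate, so the untrained model starts from the base rate
b0 = np.log(n_pos / n_neg)
initial_prob = 1.0 / (1.0 + np.exp(-b0))

# 'balanced' class weights follow n_samples / (n_classes * count_per_class)
counts = np.bincount(y_toy)
balanced_weights = len(y_toy) / (len(counts) * counts)

print(f"initial bias: {b0:.4f}, initial predicted probability: {initial_prob:.4f}")
print(f"class weights: {dict(enumerate(balanced_weights))}")
</code></pre>
<p>In this toy case the minority class gets a weight ten times larger than the majority class, so its errors count more in the loss.</p>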
<h3 id="heading-results-7">Results</h3>
<ul>
<li><p><em>Recall: 0.7995–0.7500 (from training to validation)</em></p>
</li>
<li><p><em>Precision: 0.8409–0.8065 (from training to validation)</em></p>
</li>
</ul>
<p>The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).</p>
<p>This indicates good generalization, with only minor performance degradation on unseen data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441767800/fe43f181-4323-461f-b56a-125fc78e9c84.png" alt="Image: Keras Sequential Model with Adam Optimizer (Source: Kuriko Iwai)" class="image--center mx-auto" width="1484" height="542" loading="lazy"></p>
<h2 id="heading-final-results-generalization">Final Results: Generalization</h2>
<p>Finally, we’ll evaluate each model’s performance on the test dataset, which has remained completely separate from all prior training and validation processes.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Custom classifiers</span>
y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

<span class="hljs-comment"># MLPClassifier</span>
y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

<span class="hljs-comment"># Keras Sequential</span>
_, accuracy_test_sgd, precision_test_sgd, recall_test_sgd, auc_test_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=<span class="hljs-number">0</span>)
_, accuracy_test_adam, precision_test_adam, recall_test_adam, auc_test_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=<span class="hljs-number">0</span>)
</code></pre>
<p>Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an <strong>AUPRC (Area Under Precision-Recall Curve) of 0.72.</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748874699534/f0f008c4-9067-4e2a-b070-4bb5cbae8f23.png" alt="Precision-Recall Curves for Six Classifier Models: Comparing Custom, MLP, and Keras Sequential Classifiers with SGD and Adam Optimizers (Source: Kuriko Iwai)" class="image--center mx-auto" width="2160" height="426" loading="lazy"></p>
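<p>A precision-recall curve needs predicted scores or probabilities rather than the hard labels returned by <code>predict</code>. The sketch below shows one way AUPRC could be computed with Scikit-learn, using both <code>average_precision_score</code> and trapezoidal integration of the curve. The toy labels and scores are assumptions for illustration, and the figures reported in this article may have been produced with a different estimator:</p>
<pre><code class="lang-python">import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy ground truth and predicted positive-class scores
y_true_toy = np.array([0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8])

# Average precision: a step-wise summary of the precision-recall curve
ap = average_precision_score(y_true_toy, y_score_toy)

# Alternative estimate: trapezoidal area under the same curve
precision, recall, _ = precision_recall_curve(y_true_toy, y_score_toy)
auprc_trapz = auc(recall, precision)

print(f"Average precision: {ap:.4f}")
print(f"Trapezoidal AUPRC: {auprc_trapz:.4f}")
</code></pre>
<p>The two estimates generally differ slightly, so it helps to state which one is being reported when comparing models.</p>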
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.</p>
<p>Our findings underscore that effective machine learning hinges on three critical factors:</p>
<ol>
<li><p><strong>robust data preprocessing</strong> (tailored to objectives and data distribution),</p>
</li>
<li><p><strong>judicious model selection</strong>, and</p>
</li>
<li><p><strong>strategic framework or library choices</strong>.</p>
</li>
</ol>
<h3 id="heading-choosing-the-right-framework"><strong>Choosing the right framework</strong></h3>
<p>Generally speaking, choose <code>MLPClassifier</code> when:</p>
<ul>
<li><p>You’re primarily working with <strong>tabular data,</strong></p>
</li>
<li><p>You want to prioritize <strong>simplicity, quick iteration, and seamless integration,</strong></p>
</li>
<li><p>You have simple, shallow architectures, and</p>
</li>
<li><p>You have a moderate dataset size (manageable on a CPU).</p>
</li>
</ul>
<p>Choose Keras <code>Sequential</code> when:</p>
<ul>
<li><p>You’re dealing with <strong>image, text, audio, or other sequential data,</strong></p>
</li>
<li><p>You’re building <strong>deep learning models</strong> such as CNNs, RNNs, and LSTMs,</p>
</li>
<li><p>You need <strong>fine-grained control</strong> over the model architecture, training process, or custom components,</p>
</li>
<li><p>You need to leverage <strong>GPU acceleration</strong>,</p>
</li>
<li><p>You’re planning for <strong>production deployment</strong>, and</p>
</li>
<li><p>You want to experiment with more advanced deep learning techniques.</p>
</li>
</ul>
<h3 id="heading-limitation-of-mlps">Limitations of MLPs</h3>
<p>While Multilayer Perceptrons (MLPs) proved valuable, their computational cost and tendency to overfit emerged as key challenges.</p>
<p>Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.</p>
<p>You can find more info about me on my <a target="_blank" href="https://kuriko.vercel.app/">Portfolio</a> / <a target="_blank" href="https://www.linkedin.com/in/k-i-i">LinkedIn</a> / <a target="_blank" href="https://github.com/versionhq/multi-agent-system">GitHub</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
