Kuriko - freeCodeCamp.org

How to Build End-to-End Machine Learning Lineage

Kuriko — Thu, 16 Oct 2025 13:43:13 +0000

Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance.

While many services for tracking ML lineage exist, creating a comprehensive and manageable lineage often proves complicated.

In this article, I’ll walk you through integrating a comprehensive ML lineage solution for an ML application deployed on serverless AWS Lambda, covering the end-to-end pipeline stages:

ETL pipeline
Data drift detection
Preprocessing
Model tuning
Risk and fairness evaluation.

What is Machine Learning Lineage?
What We’ll Build
- The System Architecture - AI Pricing for Retailers
- The ML Lineage
Workflow in Action
Step 1: Initiating a DVC Project
Step 2: The ML Lineage
Step 3: Deploying the DVC Project
Step 4: Configuring Scheduled Run with Prefect
Step 5: Deploying the Application
- Test in Local
Conclusion

Prerequisites:

Knowledge of key Machine Learning / Deep Learning concepts including the full lifecycle: data handling, model training, tuning, and validation.
Proficiency in Python, with experience using major ML libraries.
Basic understanding of DevOps principles.

Tools we’ll use:

Here is a summary of the tools we’re going to use to track the ML lineage:

DVC: An open-source version control system for data. Used to track the ML lineage.
AWS S3: A secure object storage service from AWS. Used as the remote storage.
Evidently AI: An open-source ML and LLM observability framework. Used to detect data drift.
Prefect: A workflow orchestration engine. Used to manage scheduled runs of the lineage.

What is Machine Learning Lineage?

Machine learning (ML) lineage is a framework for tracking and understanding the complete lifecycle of a machine learning model.

It contains information at different levels such as:

Code: The scripts, libraries, and configurations for model training.
Data: The original data, transformations, and features.
Experiments: Training runs, hyperparameter tuning results.
Models: The trained models and their versions.
Predictions: The outputs of deployed models.

ML lineage is essential for multiple reasons:

Reproducibility: Recreate the same model and prediction for validation.
Root cause analysis: Trace back to the data, code, or configuration change when a model fails in production.
Compliance: Some regulated industries require proof of model training to ensure fairness, transparency, and adherence to laws like GDPR and the EU AI Act.

What We’ll Build

In this project, I’ll integrate an ML lineage into this price prediction system built on AWS Lambda architecture using DVC, an open-source version control system for ML applications.

The below diagram illustrates the system architecture and the ML lineage we’ll integrate:

Figure A: A comprehensive ML lineage for an ML application on serverless Lambda (Created by Kuriko IWAI)

The System Architecture: AI Pricing for Retailers

The system operates as a containerized, serverless microservice designed to provide optimal price recommendations to maximize retailer sales.

Its core intelligence comes from AI models trained on historical purchase data to predict the quantity of the product sold at various prices, allowing sellers to determine the best price.

For consistent deployment, the prediction logic and its dependencies are packaged into a Docker container image and stored in AWS ECR (Elastic Container Registry).

The prediction is then served by an AWS Lambda function, which retrieves and runs the container from ECR and exposes the result via AWS API Gateway for the Flask application to consume.

If you want to see how to build this from the ground up, you can follow along with my tutorial How to Build a Machine Learning System on Serverless Architecture.

The ML Lineage

In the system, GitHub handles the code lineage, while DVC captures the lineage of:

Data (blue boxes): ETL and preprocessing.
Experiments (light orange): Hyperparameter tuning and validation.
Models and Prediction (dark orange): Final model artifacts and prediction results.

DVC tracks the lineage through separate stages, from data extraction to fairness testing (yellow rows in Figure A).

For each stage, DVC uses an MD5 or SHA256 hash to track and push metadata like artifacts, metrics, and reports to its remote on AWS S3.

The pipeline incorporates Evidently AI to handle data drift tests, which are essential for identifying shifts in data distributions that could compromise the model's generalization capabilities in production.

Only models that pass both the data drift and fairness tests can serve predictions via the AWS API Gateway (red box in Figure A).

Lastly, this entire lineage process is triggered weekly by the open-source workflow scheduler, Prefect.

Prefect prompts DVC to check for updates in data and scripts, and executes the full lineage process if changes are detected.

Workflow in Action

The building process involves five main steps:

Initiate a DVC project
Define the lineage stages with the DVC script dvc.yaml and corresponding Python script
Deploy the DVC project
Configure scheduled run with Prefect
Deploy the application

Let’s walk through each step together.

Step 1: Initiating a DVC Project

The first step is to initiate a DVC project:

$ dvc init

This command automatically creates a .dvc directory at the root of the project folder:

.
└── .dvc/
    ├── cache/         # [.gitignore] stores the dvc cache (cached copies of the actual data files)
    ├── tmp/           # [.gitignore]
    ├── .gitignore     # ignores cache, tmp, and config.local
    ├── config         # dvc config for production
    └── config.local   # [.gitignore] dvc config for local

DVC maintains a fast, lightweight Git repository by separating the original data in large files from the repository.

The process has four parts: DVC caches the original data in the local .dvc/cache directory, creates a small .dvc metadata file containing an MD5 hash and the path to the original data file, leaves only that small metadata file to be committed to Git, and pushes the original data to the DVC remote.
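
To make the caching idea concrete, here is a minimal sketch of hash-addressed storage in plain Python. It is not DVC’s implementation: the cache_file helper and the two-character directory layout are assumptions for illustration, and the real cache layout differs between DVC versions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_file(path: Path, cache_dir: Path) -> dict:
    """Hypothetical helper: copy a file into a hash-addressed cache and return its metadata."""
    file_hash = md5_of_file(path)
    target = cache_dir / file_hash[:2] / file_hash[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copy2(path, target)
    # this small dict plays the role of the metadata file that goes to Git
    return {"md5": file_hash, "size": path.stat().st_size, "path": path.name}


# demo with a throwaway file standing in for a large dataset
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "original_df.parquet"
data_file.write_bytes(b"hello")
meta = cache_file(data_file, workdir / "cache")
print(meta)
```

Because the cache key is derived from the file content, any change to the data produces a new hash, which is what lets the small metadata file act as a version pointer.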

Step 2: The ML Lineage

Next, we’ll configure the ML lineage with the following stages:

etl_pipeline: Extract, clean, impute the original data and perform feature engineering.
data_drift_check: Run data drift tests. If they fail, the system exits.
preprocess: Create training, validation, and test datasets.
tune_primary_model: Tune hyperparameters and train the model.
inference_primary_model: Perform inference on the test dataset.
assess_model_risk: Run risk and fairness tests.

Each stage requires defining the DVC command and its corresponding Python script.

Let’s get started.

Stage 1: The ETL Pipeline

The first stage is to extract, clean, impute the original data, and perform feature engineering.

DVC Configuration

We’ll create the dvc.yaml file at the root of the project directory and add the etl_pipeline stage:

dvc.yaml:

stages:
  etl_pipeline:
    # the main command dvc will run in this stage
    cmd: python src/data_handling/etl_pipeline.py

    # dependencies necessary to run the main command
    deps:
      - src/data_handling/etl_pipeline.py
      - src/data_handling/
      - src/_utils/

    # output paths for dvc to track
    outs:
      - data/original_df.parquet
      - data/processed_df.parquet

The dvc.yaml file defines a sequence of steps (stages) using sections like:

cmd: The shell command to be executed for that stage
deps: Dependencies that need to run the cmd
params: Default parameters for the cmd defined in the params.yaml file
metrics: The metrics files to track
reports: The report files to track (dvc.yaml has no dedicated reports key, so list them under outs or plots)
plots: The DVC plot files for visualization
outs: The output files produced by the cmd, which DVC will track

The configuration helps DVC ensure reproducibility by explicitly listing dependencies, outputs, and the commands of each stage. It also helps it manage the lineage by establishing a Directed Acyclic Graph (DAG) of the workflow, linking each stage to the next.
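
The DAG idea can be sketched in a few lines of Python. This is an illustration rather than DVC’s own code: the stages dictionary is a simplified stand-in for dvc.yaml, and build_stage_graph is a hypothetical helper. A stage depends on another stage whenever one of its deps is listed in the other stage’s outs.

```python
from graphlib import TopologicalSorter

# simplified stage definitions (assumed for illustration, loosely mirroring dvc.yaml)
stages = {
    "etl_pipeline": {
        "deps": ["src/data_handling/etl_pipeline.py"],
        "outs": ["data/processed_df.parquet"],
    },
    "preprocess": {
        "deps": ["data/processed_df.parquet"],
        "outs": ["data/x_train_processed.parquet"],
    },
    "tune_primary_model": {
        "deps": ["data/x_train_processed.parquet"],
        "outs": ["models/production/dfn_best.pth"],
    },
}


def build_stage_graph(stages: dict) -> dict:
    """Map each stage to the set of stages that produce its dependencies."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    graph = {name: set() for name in stages}
    for name, spec in stages.items():
        for dep in spec["deps"]:
            if dep in producer:
                graph[name].add(producer[dep])
    return graph


graph = build_stage_graph(stages)
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['etl_pipeline', 'preprocess', 'tune_primary_model']
```

A topological order like this is what allows the pipeline to run upstream stages before the stages that consume their outputs.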

Python Scripts

Next, let’s add Python scripts, ensuring the data is stored using the file paths specified in the outs section of the dvc.yaml file:

src/data_handling/etl_pipeline.py:

import os
import argparse

import src.data_handling.scripts as scripts
from src._utils import main_logger

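def etl_pipeline(stockcode: str = '', impute_stockcode: bool = False):
    # NOTE: stockcode and impute_stockcode are accepted to match the CLI call below;
    # how the helper functions consume them is not shown in this excerpt.
    # extract the entire data
    df = scripts.extract_original_dataframe()

    # save the original data as a parquet file
    ORIGINAL_DF_PATH = os.path.join('data', 'original_df.parquet')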
    df.to_parquet(ORIGINAL_DF_PATH, index=False) # dvc tracked

    # transform
    df = scripts.structure_missing_values(df=df)
    df = scripts.handle_feature_engineering(df=df)

    PROCESSED_DF_PATH = os.path.join('data', 'processed_df.parquet')
    df.to_parquet(PROCESSED_DF_PATH, index=False) # dvc tracked
    return df

# for dvc execution
if __name__ == '__main__':  
    parser = argparse.ArgumentParser(description="run etl pipeline")
    parser.add_argument('--stockcode', type=str, default='', help="specific stockcode to process. empty runs full pipeline.")
    parser.add_argument('--impute', action='store_true', help="flag to create imputation values")
    args = parser.parse_args()

    etl_pipeline(stockcode=args.stockcode, impute_stockcode=args.impute)

Outputs

The original and structured data in Pandas’ DataFrames are stored in the DVC cache:

data/original_df.parquet
data/processed_df.parquet

Stage 2: The Data Drift Check

Before jumping into preprocessing, we’ll run data drift tests to ensure there is no notable drift in the data. To do this, we’ll use Evidently AI, an open-source ML and LLM observability framework.

What is Data Drift?

Data drift refers to any change in the statistical properties of the data, such as the mean, variance, or distribution, relative to the data the model was trained on.

There are three main types of data drift:

Covariate Drift (Feature Drift): A change in the input feature distribution.
Prior Probability Drift (Label Drift): A change in the target variable distribution.
Concept Drift: A change in the relationship between the input data and the target variable.

Data drift compromises the model's generalization capabilities over time, making its detection after deployment crucial.
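
As a rough illustration of how covariate drift on a single numeric feature can be measured, the sketch below computes the two-sample Kolmogorov-Smirnov statistic in plain Python. This is only one possible test, and it is not what Evidently AI runs internally. The sample values and the ks_statistic helper are made up for the example.

```python
from bisect import bisect_right


def ks_statistic(reference: list[float], current: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    max_gap = 0.0
    # the largest gap always occurs at one of the observed values
    for x in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


reference_prices = [1.0, 2.0, 3.0, 4.0, 5.0]      # training-time feature values
same_prices = [1.0, 2.0, 3.0, 4.0, 5.0]           # no drift
shifted_prices = [11.0, 12.0, 13.0, 14.0, 15.0]   # fully shifted distribution

print(ks_statistic(reference_prices, same_prices))     # 0.0
print(ks_statistic(reference_prices, shifted_prices))  # 1.0
```

A value near 0 means the two samples look alike, while a value near 1 means their distributions barely overlap. In practice the statistic is turned into a p-value or compared against a threshold chosen for the use case.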

DVC Configuration

We’ll add the data_drift_check stage right after the etl_pipeline stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    # the main command dvc will run in this stage
    cmd: >
      python src/data_handling/report_data_drift.py
      data/processed/processed_df.csv 
      data/processed_df_${params.stockcode}.parquet
      reports/data_drift_report_${params.stockcode}.html
      metrics/data_drift_${params.stockcode}.json
      ${params.stockcode}

    # default values for the parameters (defined in the params.yaml file)
    params:
      - params.stockcode

    # dependencies necessary to run the main command
    deps:
      - src/data_handling/report_data_drift.py
      - src/

    # output file paths for dvc to track
    plots:
      - reports/data_drift_report_${params.stockcode}.html

    metrics:
      - metrics/data_drift_${params.stockcode}.json:
          type: json

Then, add default values to the parameters passed to the DVC command:

params.yaml:

params:
  stockcode: <STOCKCODE OF CHOICE>
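
To see how a ${params.stockcode} placeholder ends up as a concrete path, here is a small sketch of the substitution idea. It is a simplified stand-in and not DVC’s actual templating code: the resolve function is hypothetical, and the stockcode value is just an example.

```python
import re

# stand-in for the parsed content of params.yaml (example value)
params = {"params": {"stockcode": "85123A"}}


def resolve(template: str, values: dict) -> str:
    """Replace each ${a.b} placeholder with the value found at values['a']['b']."""

    def lookup(match: re.Match) -> str:
        node = values
        for key in match.group(1).split("."):
            node = node[key]
        return str(node)

    return re.sub(r"\$\{([^}]+)\}", lookup, template)


metrics_path = resolve("metrics/data_drift_${params.stockcode}.json", params)
print(metrics_path)  # metrics/data_drift_85123A.json
```

Changing the stockcode in params.yaml therefore changes every command argument and tracked path that references it.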

Python Scripts

After generating an API token from the Evidently AI workspace, we’ll add a Python script to detect data drift and store the results in the metrics variable:

src/data_handling/report_data_drift.py:

import os
import sys
import json
import pandas as pd
import datetime
from dotenv import load_dotenv

from evidently import Dataset, DataDefinition, Report
from evidently.presets import DataDriftPreset
from evidently.ui.workspace import CloudWorkspace

import src.data_handling.scripts as scripts
from src._utils import main_logger


if __name__ == '__main__':
    # initiate evidently cloud workspace
    load_dotenv(override=True)
    ws = CloudWorkspace(token=os.getenv('EVENTLY_API_TOKEN'), url='https://app.evidently.cloud')

    # retrieve evidently project
    project = ws.get_project('EVENTLY AI PROJECT ID')

    # retrieve paths from the command line args
    REFERENCE_DATA_PATH = sys.argv[1]
    CURRENT_DATA_PATH = sys.argv[2]
    REPORT_OUTPUT_PATH = sys.argv[3]
    METRICS_OUTPUT_PATH = sys.argv[4]
    STOCKCODE = sys.argv[5]

    # create folders if not exist
    os.makedirs(os.path.dirname(REPORT_OUTPUT_PATH), exist_ok=True)
    os.makedirs(os.path.dirname(METRICS_OUTPUT_PATH), exist_ok=True)

    # extract datasets
    reference_data_full = pd.read_csv(REFERENCE_DATA_PATH)
    reference_data_stockcode = reference_data_full[reference_data_full['stockcode'] == STOCKCODE]
    current_data_stockcode = pd.read_parquet(CURRENT_DATA_PATH)

    # define data schema
    nums, cats = scripts.categorize_num_cat_cols(df=reference_data_stockcode)
    for col in nums:
        current_data_stockcode[col] = pd.to_numeric(current_data_stockcode[col], errors='coerce')

    schema = DataDefinition(numerical_columns=nums, categorical_columns=cats)

    # define evidently datasets w/ the data schema
    eval_data_1 = Dataset.from_pandas(reference_data_stockcode, data_definition=schema)
    eval_data_2 = Dataset.from_pandas(current_data_stockcode, data_definition=schema)

    # execute drift detection
    report = Report(metrics=[DataDriftPreset()])
    data_eval = report.run(reference_data=eval_data_1, current_data=eval_data_2)
    data_eval.save_html(REPORT_OUTPUT_PATH)

    # create metrics for dvc tracking
    report_dict = json.loads(data_eval.json())
    num_drifts = report_dict['metrics'][0]['value']['count']
    shared_drifts = report_dict['metrics'][0]['value']['share']
    metrics = dict(
        drift_detected=bool(num_drifts > 0.0), num_drifts=num_drifts, shared_drifts=shared_drifts,
        num_cols=nums,
        cat_cols=cats,
        stockcode=STOCKCODE,
        timestamp=datetime.datetime.now().isoformat(),
    )

    # write the metrics file
    with open(METRICS_OUTPUT_PATH, 'w') as f:
        json.dump(metrics, f, indent=4)
        main_logger.info(f'... drift metrics saved to {METRICS_OUTPUT_PATH}... ')

    # stop the system if data drift is found
    if num_drifts > 0.0: sys.exit('❌ FATAL: data drift detected. stopping pipeline')

If data drift is found, the script immediately exits using the final sys.exit command.
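
The reason this stops the lineage is the process exit status: calling sys.exit with a string prints the message to stderr and exits with status 1, and DVC treats a non-zero status as a failed stage, so the downstream stages do not run. The sketch below reproduces that behavior with two throwaway child processes. The inline commands are stand-ins for the drift script.

```python
import subprocess
import sys

# a child process that exits the way the drift script does when drift is found
failing = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit('data drift detected')"],
    capture_output=True,
    text=True,
)

# a child process that finishes normally
passing = subprocess.run(
    [sys.executable, "-c", "print('no drift')"],
    capture_output=True,
    text=True,
)

print(failing.returncode, failing.stderr.strip())  # 1 data drift detected
print(passing.returncode, passing.stdout.strip())  # 0 no drift
```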

Outputs

The script generates two files that DVC will track:

reports/data_drift_report.html: The data drift report as an HTML file.
metrics/data_drift.json: The data drift metrics as a JSON file, including the drift results along with the feature columns and a timestamp:

metrics/data_drift.json:

{
    "drift_detected": false,
    "num_drifts": 0.0,
    "shared_drifts": 0.0,
    "num_cols": [
        "invoiceno",
        "invoicedate",
        "unitprice",
        "product_avg_quantity_last_month",
        "product_max_price_all_time",
        "unitprice_vs_max",
        "unitprice_to_avg",
        "unitprice_squared",
        "unitprice_log"
    ],
    "cat_cols": [
        "stockcode",
        "customerid",
        "country",
        "year",
        "year_month",
        "day_of_week",
        "is_registered"
    ],
    "timestamp": "2025-10-07T00:24:29.899495"
}

The drift test results are also available on the Evidently workspace dashboard for further analysis:

Figure B. Screenshot of the Evidently workspace dashboard

Stage 3: Preprocessing

If no data drift is detected, the lineage moves on to the preprocessing stage.

DVC Configuration

We’ll add the preprocess stage right after the data_drift_check stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    cmd: >
      python src/data_handling/preprocess.py --target_col ${params.target_col} --should_scale ${params.should_scale} --verbose ${params.verbose}

    deps:
      - src/data_handling/preprocess.py
      - src/data_handling/
      - src/_utils

    # params from params.yaml
    params:
      - params.target_col
      - params.should_scale
      - params.verbose

    outs:
      # train, val, test datasets
      - data/x_train_df.parquet
      - data/x_val_df.parquet
      - data/x_test_df.parquet
      - data/y_train_df.parquet
      - data/y_val_df.parquet
      - data/y_test_df.parquet

      # preprocessed input datasets
      - data/x_train_processed.parquet
      - data/x_val_processed.parquet
      - data/x_test_processed.parquet

      # trained preprocessor and human readable feature names for shap analysis
      - preprocessors/column_transformer.pkl
      - preprocessors/feature_names.json

And then add default values of the parameters used in the cmd:

params.yaml:

params:
  target_col: "quantity"
  should_scale: True
  verbose: False

Python Scripts

Next, we’ll add a Python script to create training, validation, and test datasets and preprocess input data:

src/data_handling/preprocess.py:

import os
import argparse
import json
import joblib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import src.data_handling.scripts as scripts
from src._utils import main_logger

def preprocess(stockcode: str = '', target_col: str = 'quantity', should_scale: bool = True, verbose: bool = False):
    # initiate metrics to track (dvc)
    # use the function argument rather than the global argparse namespace
    DATA_DRIFT_METRICS_PATH = os.path.join('metrics', f'data_drift_{stockcode}.json')

    if os.path.exists(DATA_DRIFT_METRICS_PATH):
        with open(DATA_DRIFT_METRICS_PATH, 'r') as f:
            metrics = json.load(f)
    else: metrics = dict()

    # load processed df from dvc cache
    PROCESSED_DF_PATH = os.path.join('data', 'processed_df.parquet')
    df = pd.read_parquet(PROCESSED_DF_PATH)

    # categorize num and cat columns
    num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
    if verbose: main_logger.info(f'num_cols: {num_cols} \ncat_cols: {cat_cols}')

    # structure cat cols
    if cat_cols:
        for col in cat_cols: df[col] = df[col].astype('string')

    # initiate preprocessor (either load from the dvc cache or create from scratch)
    PREPROCESSOR_PATH = os.path.join('preprocessors', 'column_transformer.pkl')
    try:
        preprocessor = joblib.load(PREPROCESSOR_PATH)
    except Exception:
        # no usable cached preprocessor, so create one from scratch
        preprocessor = scripts.create_preprocessor(num_cols=num_cols if should_scale else [], cat_cols=cat_cols)

    # creates train, val, test datasets
    y = df[target_col]
    X = df.copy().drop(target_col, axis='columns')

    # split
    test_size, random_state = 50000, 42
    X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, shuffle=False)
    X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, random_state=random_state, shuffle=False)

    # store train, val, test datasets (dvc track)
    X_train.to_parquet('data/x_train_df.parquet', index=False)
    X_val.to_parquet('data/x_val_df.parquet', index=False)
    X_test.to_parquet('data/x_test_df.parquet', index=False)
    y_train.to_frame(name=target_col).to_parquet('data/y_train_df.parquet', index=False)
    y_val.to_frame(name=target_col).to_parquet('data/y_val_df.parquet', index=False)
    y_test.to_frame(name=target_col).to_parquet('data/y_test_df.parquet', index=False)

    # preprocess
    X_train = preprocessor.fit_transform(X_train)
    X_val = preprocessor.transform(X_val)
    X_test = preprocessor.transform(X_test)

    # store preprocessed input data (dvc track)
    pd.DataFrame(X_train).to_parquet(f'data/x_train_processed.parquet', index=False)
    pd.DataFrame(X_val).to_parquet(f'data/x_val_processed.parquet', index=False)
    pd.DataFrame(X_test).to_parquet(f'data/x_test_processed.parquet', index=False)

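    # save feature names (dvc track) for shap
    os.makedirs('preprocessors', exist_ok=True)
    with open('preprocessors/feature_names.json', 'w') as f:
        feature_names = preprocessor.get_feature_names_out()
        json.dump(feature_names.tolist(), f)

    # save the fitted preprocessor (dvc track), since dvc.yaml lists it under outs
    joblib.dump(preprocessor, PREPROCESSOR_PATH)

    return X_train, X_val, X_test, y_train, y_val, y_test, preprocessor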


def str_to_bool(value: str) -> bool:
    # argparse's type=bool treats any non-empty string (including 'False') as True
    return str(value).strip().lower() in ('true', '1', 'yes')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='run data preprocessing')
    parser.add_argument('--stockcode', type=str, default='', help='specific stockcode')
    parser.add_argument('--target_col', type=str, default='quantity', help='the target column name')
    parser.add_argument('--should_scale', type=str_to_bool, default=True, help='flag to scale numerical features')
    parser.add_argument('--verbose', type=str_to_bool, default=False, help='flag for verbose logging')
    args = parser.parse_args()

    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = preprocess(
        target_col=args.target_col,
        should_scale=args.should_scale,
        verbose=args.verbose,
        stockcode=args.stockcode,
    )
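
Because both train_test_split calls use shuffle=False with an integer test_size, the row order is preserved and the held-out sets are taken from the end of the data. The sketch below shows the same slicing in plain Python. The chronological_split helper and the ten-row toy data are assumptions for illustration.

```python
def chronological_split(rows: list, test_size: int) -> tuple[list, list, list]:
    """Hold out the last `test_size` rows for test and the `test_size` rows before them for validation."""
    if len(rows) <= 2 * test_size:
        raise ValueError("not enough rows for the requested split")
    train_val, test = rows[:-test_size], rows[-test_size:]
    train, val = train_val[:-test_size], train_val[-test_size:]
    return train, val, test


rows = list(range(10))  # stand-in for rows ordered by invoice date
train, val, test = chronological_split(rows, test_size=2)
print(train, val, test)  # [0, 1, 2, 3, 4, 5] [6, 7] [8, 9]
```

Keeping the most recent rows for validation and testing avoids leaking future purchases into the training set.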

Outputs

This stage generates the necessary datasets for both model training and inference:

Input features:

data/x_train_df.parquet
data/x_val_df.parquet
data/x_test_df.parquet

Preprocessed input features:

data/x_train_processed.parquet
data/x_val_processed.parquet
data/x_test_processed.parquet

Target variables:

data/y_train_df.parquet
data/y_val_df.parquet
data/y_test_df.parquet

The preprocessor and human-readable feature names are also stored in cache for inference and SHAP feature impact analysis later:

preprocessors/column_transformer.pkl
preprocessors/feature_names.json

Lastly, the preprocess stage adds preprocess_status, x_train_processed_path, and preprocessor_path to the data summary metrics file data.json created in Stage 2, so that the file tracks the end-to-end process of Stages 2 and 3:

metrics/data.json:

{
    "drift_detected": false,
    "num_drifts": 0.0,
    "shared_drifts": 0.0,
    "num_cols": [
        "invoiceno",
        "invoicedate",
        "unitprice",
        "product_avg_quantity_last_month",
        "product_max_price_all_time",
        "unitprice_vs_max",
        "unitprice_to_avg",
        "unitprice_squared",
        "unitprice_log"
    ],
    "cat_cols": [
        "stockcode",
        "customerid",
        "country",
        "year",
        "year_month",
        "day_of_week",
        "is_registered"
    ],
    "timestamp": "2025-10-07T00:24:29.899495",

    # updates
    "preprocess_status": "completed",
    "x_train_processed_path": "data/x_train_processed_85123A.parquet",
    "preprocessor_path": "preprocessors/column_transformer.pkl"
}

Next, let’s move on to the model/experiment lineage.

Stage 4: Tuning the Model

Now that we’ve created the datasets, we’ll tune and train the primary model, a multi-layered feedforward network built with PyTorch, using the training and validation datasets created in the preprocess stage.

DVC Configuration

First, we’ll add the tune_primary_model stage right after the preprocess stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    ### 
  tune_primary_model:
    cmd: >
      python src/model/torch_model/main.py
      data/x_train_processed_${params.stockcode}.parquet
      data/x_val_processed_${params.stockcode}.parquet
      data/y_train_df_${params.stockcode}.parquet
      data/y_val_df_${params.stockcode}.parquet
      ${tuning.should_local_save}
      ${tuning.grid}
      ${tuning.n_trials}
      ${tuning.num_epochs}
      ${params.stockcode}

    deps:
      - src/model/torch_model/main.py
      - src/data_handling/
      - src/model/
      - src/_utils/

    params:
      - params.stockcode
      - tuning.n_trials
      - tuning.num_epochs
      - tuning.grid
      - tuning.should_local_save

    outs:
      - models/production/dfn_best_${params.stockcode}.pth # dvc track

    metrics:
      - metrics/dfn_val_${params.stockcode}.json # dvc track

Then we’ll add default values to the parameters:

params.yaml:

params:
  target_col: "quantity"
  should_scale: True
  verbose: False

tuning:
  n_trials: 100
  num_epochs: 3000
  should_local_save: False
  grid: False

Python Scripts

Next, we’ll add the Python scripts to tune the model using Bayesian optimization and then train the optimal model on the complete X_train and y_train datasets created in the preprocess stage.

src/model/torch_model/main.py:

import os
import sys
import json
import datetime
import pandas as pd
import torch
import torch.nn as nn

import src.model.torch_model.scripts as scripts
from src._utils import main_logger


def tune_and_train(
        X_train, X_val, y_train, y_val,
        stockcode: str = '',
        should_local_save: bool = True,
        grid: bool = False,
        n_trials: int = 50,
        num_epochs: int = 3000
    ) -> tuple[nn.Module, dict]:

    # perform bayesian optimization
    best_dfn, best_optimizer, best_batch_size, best_checkpoint = scripts.bayesian_optimization(
        X_train, X_val, y_train, y_val, n_trials=n_trials, num_epochs=num_epochs
    )

    # save the model artifact (dvc track)
    DFN_FILE_PATH = os.path.join('models', 'production', f'dfn_best_{stockcode}.pth' if stockcode else 'dfn_best.pth')
    os.makedirs(os.path.dirname(DFN_FILE_PATH), exist_ok=True)
    torch.save(best_checkpoint, DFN_FILE_PATH)

    return best_dfn, best_checkpoint



def track_metrics_by_stockcode(X_val, y_val, best_model, checkpoint: dict, stockcode: str):
    MODEL_VAL_METRICS_PATH = os.path.join('metrics', f'dfn_val_{stockcode}.json')
    os.makedirs(os.path.dirname(MODEL_VAL_METRICS_PATH), exist_ok=True)

    # validate the tuned model
    _, mse, exp_mae, rmsle = scripts.perform_inference(model=best_model, X=X_val, y=y_val)
    model_version = f"dfn_{stockcode}_{os.getpid()}"
    metrics = dict(
        stockcode=stockcode,
        mse_val=mse,
        mae_val=exp_mae,
        rmsle_val=rmsle,
        model_version=model_version,
        hparams=checkpoint['hparams'],
        optimizer=checkpoint['optimizer_name'],
        batch_size=checkpoint['batch_size'],
        lr=checkpoint['lr'],
        timestamp=datetime.datetime.now().isoformat()
    )
    # store the validation results (dvc track)
    with open(MODEL_VAL_METRICS_PATH, 'w') as f:
        json.dump(metrics, f, indent=4)
        main_logger.info(f'... validation metrics saved to {MODEL_VAL_METRICS_PATH} ...')


if __name__ == '__main__':
    # fetch command arg values
    X_TRAIN_PATH = sys.argv[1]
    X_VAL_PATH = sys.argv[2]
    Y_TRAIN_PATH = sys.argv[3]
    Y_VAL_PATH = sys.argv[4]
    # compare case-insensitively so both 'True' and 'true' are accepted
    SHOULD_LOCAL_SAVE = sys.argv[5].lower() == 'true'
    GRID = sys.argv[6].lower() == 'true'
    N_TRIALS = int(sys.argv[7])
    NUM_EPOCHS = int(sys.argv[8])
    STOCKCODE = str(sys.argv[9])

    # extract training and validation datasets from dvc cache
    X_train, X_val = pd.read_parquet(X_TRAIN_PATH), pd.read_parquet(X_VAL_PATH)
    y_train, y_val = pd.read_parquet(Y_TRAIN_PATH), pd.read_parquet(Y_VAL_PATH)

    # tuning
    best_model, checkpoint = tune_and_train(
        X_train, X_val, y_train, y_val,
        stockcode=STOCKCODE, should_local_save=SHOULD_LOCAL_SAVE, grid=GRID, n_trials=N_TRIALS, num_epochs=NUM_EPOCHS
    )

    # metrics tracking
    track_metrics_by_stockcode(X_val, y_val, best_model=best_model, checkpoint=checkpoint, stockcode=STOCKCODE)

Outputs

The stage generates two files:

models/production/dfn_best.pth: Includes the model artifacts and checkpoint data, such as the optimal hyperparameter set.
metrics/dfn_val.json: Contains tuning results, model version, timestamp, and validation results for MSE, MAE, and RMSLE:

metrics/dfn_val.json:

{
    "stockcode": "85123A",
    "mse_val": 0.6137686967849731,
    "mae_val": 9.092489242553711,
    "rmsle_val": 0.6953299045562744,
    "model_version": "dfn_85123A_35604",
    "hparams": {
        "num_layers": 4,
        "batch_norm": false,
        "dropout_rate_layer_0": 0.13765888061300502,
        "n_units_layer_0": 184,
        "dropout_rate_layer_1": 0.5509872409359128,
        "n_units_layer_1": 122,
        "dropout_rate_layer_2": 0.2408753527744403,
        "n_units_layer_2": 35,
        "dropout_rate_layer_3": 0.03451842588822594,
        "n_units_layer_3": 224,
        "learning_rate": 0.026240673135104406,
        "optimizer": "adamax",
        "batch_size": 64
    },
    "optimizer": "adamax",
    "batch_size": 64,
    "lr": 0.026240673135104406,
    "timestamp": "2025-10-07T00:31:08.700294"
}
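
For reference, the three error metrics reported in this file follow standard definitions, sketched below in plain Python. The author’s scripts.perform_inference helper is not shown in this article, so this is not its implementation, and the sample values are made up. The exp_mae variable name suggests the MAE is computed after mapping predictions back from a log scale, but that is an assumption.

```python
import math


def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmsle(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared logarithmic error (inputs must be greater than -1)."""
    squared_log_errors = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_log_errors) / len(y_true))


y_true = [3.0, 5.0, 2.0]  # made-up quantities sold
y_pred = [2.0, 5.0, 4.0]  # made-up predictions

print(mse(y_true, y_pred))
print(mae(y_true, y_pred))  # 1.0
print(rmsle(y_true, y_pred))
```

RMSLE penalizes relative rather than absolute errors, which makes it a common choice for skewed targets such as sales quantities.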

Stage 5: Performing Inference

After the model tuning phase is complete, we’ll configure the test inference for a final evaluation.

The final evaluation uses the MSE, MAE, and RMSLE metrics, as well as SHAP for feature impact and interpretability analysis.

SHAP (SHapley Additive exPlanations) is a framework for quantifying how much each feature contributes to a model’s prediction by using the concept of Shapley values from game theory.

The SHAP values are leveraged for future EDA and feature engineering.
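
To make the Shapley idea tangible, here is a brute-force sketch that computes exact Shapley values for a tiny toy model by averaging each feature’s marginal contribution over every feature ordering. It is not how the shap library’s DeepExplainer works, since that uses an approximation suited to neural networks. The toy_model function, its coefficients, and the baseline are invented for the example.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x: list[float], baseline: list[float]) -> list[float]:
    """Exact Shapley values: average marginal contribution of each feature over all orderings."""
    n = len(x)
    contributions = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)  # start from the baseline input
        prev = predict(current)
        for i in order:
            current[i] = x[i]  # switch feature i from its baseline to its actual value
            new = predict(current)
            contributions[i] += new - prev
            prev = new
    return [c / factorial(n) for c in contributions]


def toy_model(features: list[float]) -> float:
    # hypothetical demand model: quantity falls with price, rises for registered customers
    unitprice, is_registered = features
    return 10.0 - 2.0 * unitprice + 3.0 * is_registered


x, baseline = [4.0, 1.0], [1.0, 0.0]
values = shapley_values(toy_model, x, baseline)
print(values)  # [-6.0, 3.0]
```

The values sum to the difference between the prediction for x and the prediction for the baseline (here 5.0 - 8.0 = -3.0), which is the additivity property that gives SHAP its name.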

DVC Configuration

First, we’ll add the inference_primary_model stage to the DVC configuration.

This stage has a plots section where DVC will track and version the generated visualization files of the SHAP values.

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    ### 
  tune_primary_model:
    ### 
  inference_primary_model:
    cmd: >
      python src/model/torch_model/inference.py
      data/x_test_processed_${params.stockcode}.parquet
      data/y_test_df_${params.stockcode}.parquet
      models/production/dfn_best_${params.stockcode}.pth
      ${params.stockcode}
      ${tracking.sensitive_feature_col}
      ${tracking.privileged_group}

    deps:
      - src/model/torch_model/inference.py
      - models/production/
      - src/

    params:
      - params.stockcode
      - tracking.sensitive_feature_col
      - tracking.privileged_group

    metrics:
      - metrics/dfn_inf_${params.stockcode}.json: # dvc track
          type: json

    plots:
      # shap summary / beeswarm plot for global interpretability
      - reports/dfn_shap_summary_${params.stockcode}.json:
          template: simple
          x: shap_value
          y: feature_name
          title: SHAP Beeswarm Plot

      # shap mean absolute vals - feature importance bar plot
      - reports/dfn_shap_mean_abs_${params.stockcode}.json:
          template: bar
          x: mean_abs_shap
          y: feature_name
          title: Mean Absolute SHAP Importance

    outs:
      - data/dfn_inference_results_${params.stockcode}.parquet
      - reports/dfn_raw_shap_values_${params.stockcode}.parquet # save raw shap vals for detailed analysis later

Python Scripts

Next, we’ll add scripts where the trained model performs inference:

src/model/torch_model/inference.py:

import os
import sys
import json
import datetime
import numpy as np
import pandas as pd
import torch
import shap

import src.model.torch_model.scripts as scripts
from src._utils import main_logger


if __name__ == '__main__':
    # load test dataset
    X_TEST_PATH = sys.argv[1]
    Y_TEST_PATH = sys.argv[2]
    X_test, y_test = pd.read_parquet(X_TEST_PATH), pd.read_parquet(Y_TEST_PATH)

    # create X_test w/ column names for shap analysis and sensitive feature tracking
    X_test_with_col_names = X_test.copy()
    FEATURE_NAMES_PATH = os.path.join('preprocessors', 'feature_names.json')
    try:
        with open(FEATURE_NAMES_PATH, 'r') as f: feature_names = json.load(f)
    except FileNotFoundError: feature_names = X_test.columns.tolist()
    if len(X_test_with_col_names.columns) == len(feature_names): X_test_with_col_names.columns = feature_names

    # reconstruct the optimal model tuned in the previous stage
    MODEL_PATH = sys.argv[3]
    checkpoint = torch.load(MODEL_PATH)
    model = scripts.load_model(checkpoint=checkpoint)

    # perform inference
    y_pred, mse, exp_mae, rmsle = scripts.perform_inference(model=model, X=X_test, y=y_test, batch_size=checkpoint['batch_size'])

    # create result df w/ y_pred, y_true, and sensitive features
    STOCKCODE = sys.argv[4]
    SENSITIVE_FEATURE = sys.argv[5]
    PRIVILEGED_GROUP = sys.argv[6]
    inference_df = pd.DataFrame(y_pred.cpu().numpy().flatten(), columns=['y_pred'])
    inference_df['y_true'] = y_test
    inference_df[SENSITIVE_FEATURE] = X_test_with_col_names[f'cat__{SENSITIVE_FEATURE}_{str(PRIVILEGED_GROUP)}'].astype(bool)
    inference_df.to_parquet(path=os.path.join('data', f'dfn_inference_results_{STOCKCODE}.parquet'))

    # record inference metrics
    MODEL_INF_METRICS_PATH = os.path.join('metrics', f'dfn_inf_{STOCKCODE}.json')
    os.makedirs(os.path.dirname(MODEL_INF_METRICS_PATH), exist_ok=True)
    model_version = f"dfn_{STOCKCODE}_{os.getpid()}"
    inf_metrics = dict(
        stockcode=STOCKCODE,
        mse_inf=mse,
        mae_inf=exp_mae,
        rmsle_inf=rmsle,
        model_version=model_version,
        hparams=checkpoint['hparams'],
        optimizer=checkpoint['optimizer_name'],
        batch_size=checkpoint['batch_size'],
        lr=checkpoint['lr'],
        timestamp=datetime.datetime.now().isoformat()
    )
    with open(MODEL_INF_METRICS_PATH, 'w') as f: # dvc track
        json.dump(inf_metrics, f, indent=4)
        main_logger.info(f'... inference metrics saved to {MODEL_INF_METRICS_PATH} ...')


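    ## shap analysis
    # switch the model to evaluation mode before computing shap vals
    model.eval()

    # use the device the model's parameters live on
    device_type = next(model.parameters()).device

    # prepare background data
    X_test_tensor = torch.from_numpy(X_test.values.astype(np.float32)).to(device_type)

    # take a small random sample (up to 100 rows) from x_test as background
    n_background = min(100, X_test_tensor.shape[0])
    background = X_test_tensor[np.random.choice(X_test_tensor.shape[0], n_background, replace=False)].to(device_type)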

    # define deepexplainer
    explainer = shap.DeepExplainer(model, background)

    # compute shap vals
    shap_values = explainer.shap_values(X_test_tensor) # outputs = numpy array or tensor

    # convert shap array to pandas df
    if isinstance(shap_values, list): shap_values = shap_values[0]
    if isinstance(shap_values, torch.Tensor): shap_values = shap_values.cpu().numpy()
    shap_values = shap_values.squeeze(axis=-1) # type: ignore
    shap_df = pd.DataFrame(shap_values, columns=feature_names)

    # shap raw data (dvc track)
    RAW_SHAP_OUT_PATH = os.path.join('reports', f'dfn_raw_shap_values_{STOCKCODE}.parquet')
    os.makedirs(os.path.dirname(RAW_SHAP_OUT_PATH), exist_ok=True)
    shap_df.to_parquet(RAW_SHAP_OUT_PATH, index=False)
    main_logger.info(f'... shap values saved to {RAW_SHAP_OUT_PATH} ...')

    # bar plot of mean abs shap vals (dvc report)
    mean_abs_shap = shap_df.abs().mean().sort_values(ascending=False)
    # use the sorted index so each feature name stays paired with its own value
    shap_mean_abs_df = pd.DataFrame({'feature_name': mean_abs_shap.index, 'mean_abs_shap': mean_abs_shap.values })
    MEAN_ABS_SHAP_PATH = os.path.join('reports', f'dfn_shap_mean_abs_{STOCKCODE}.json')
    shap_mean_abs_df.to_json(MEAN_ABS_SHAP_PATH, orient='records', indent=4)

Outputs

This stage generates five output files:

data/dfn_inference_results_${params.stockcode}.parquet: Stores prediction results, labeled targets, and any columns with sensitive features like gender, age, income, and more. I’ll use this file for the fairness test in the last stage.
metrics/dfn_inf.json: Stores evaluation metrics and tuning results:

{
    "stockcode": "85123A",
    "mse_inf": 0.6841545701026917,
    "mae_inf": 11.5866117477417,
    "rmsle_inf": 0.7423332333564758,
    "model_version": "dfn_85123A_35834",
    "hparams": {
        "num_layers": 4,
        "batch_norm": false,
        "dropout_rate_layer_0": 0.13765888061300502,
        "n_units_layer_0": 184,
        "dropout_rate_layer_1": 0.5509872409359128,
        "n_units_layer_1": 122,
        "dropout_rate_layer_2": 0.2408753527744403,
        "n_units_layer_2": 35,
        "dropout_rate_layer_3": 0.03451842588822594,
        "n_units_layer_3": 224,
        "learning_rate": 0.026240673135104406,
        "optimizer": "adamax",
        "batch_size": 64
    },
    "optimizer": "adamax",
    "batch_size": 64,
    "lr": 0.026240673135104406,
    "timestamp": "2025-10-07T00:31:12.946405"
}

reports/dfn_shap_mean_abs.json: Stores the mean absolute SHAP value of each feature, sorted from most to least important:

[
    {
        "feature_name":"num__invoicedate",
        "mean_abs_shap":0.219255722
    },
    {
        "feature_name":"num__unitprice",
        "mean_abs_shap":0.1069829418
    },
    {
        "feature_name":"num__product_avg_quantity_last_month",
        "mean_abs_shap":0.1021453096
    },
    {
        "feature_name":"num__product_max_price_all_time",
        "mean_abs_shap":0.0855356899
    },
...
]

reports/dfn_shap_summary.json: Contains the data points necessary to draw the beeswarm/bar plots.
reports/dfn_raw_shap_values.parquet: Stores raw SHAP values.
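
To make the ranking in reports/dfn_shap_mean_abs.json concrete, here is a small, dependency-free sketch of how mean absolute SHAP importance can be computed: take the absolute value of every per-sample SHAP value, average per feature, and sort in descending order. The function name mean_abs_importance and the toy numbers are illustrative assumptions, not part of the project code.

```python
from statistics import mean

def mean_abs_importance(shap_rows, feature_names):
    """Rank features by the mean of their absolute SHAP values.

    shap_rows: list of per-sample lists, one SHAP value per feature.
    Returns a list of {'feature_name', 'mean_abs_shap'} dicts, largest first.
    """
    columns = list(zip(*shap_rows))  # transpose: one tuple per feature
    ranked = [
        {'feature_name': name, 'mean_abs_shap': mean(abs(v) for v in col)}
        for name, col in zip(feature_names, columns)
    ]
    # sort name/value pairs together so they never get out of sync
    return sorted(ranked, key=lambda r: r['mean_abs_shap'], reverse=True)

# toy SHAP values for three samples and two features (made-up numbers)
toy_rows = [[0.5, -1.0], [-0.5, 2.0], [0.5, -3.0]]
ranking = mean_abs_importance(toy_rows, ['num__unitprice', 'num__invoicedate'])
```

Sorting the name/value pairs as a unit is the important design choice here: if the values are sorted on their own, the feature names have to be reordered to match.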

Stage 6: Assessing Model Risk and Fairness

The last stage is to assess risk and fairness of the final inference results.

The Fairness Testing

Fairness testing in ML is the process of systematically evaluating a model’s predictions to ensure they are not unfairly biased toward specific groups defined by sensitive attributes like race and gender.

In this project, we’ll use the registration status column, is_registered, as the sensitive feature and check that the Mean Outcome Difference (MOD) stays within the specified threshold of 0.1.

The MOD is calculated as the absolute difference between the mean prediction values of the privileged (registered) and unprivileged (unregistered) groups.
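
As a quick illustration of that definition, the sketch below computes the MOD for a handful of made-up predictions using only the standard library. The helper name mean_outcome_difference and the sample values are assumptions for this example. Unlike a fallback to zero, this version raises an error when one of the groups is empty, because the metric is undefined in that case.

```python
from statistics import mean

def mean_outcome_difference(predictions, is_privileged):
    """Absolute gap between the mean predictions of the two groups."""
    privileged = [p for p, flag in zip(predictions, is_privileged) if flag]
    unprivileged = [p for p, flag in zip(predictions, is_privileged) if not flag]
    if not privileged or not unprivileged:
        # the metric is undefined unless both groups are present
        raise ValueError('both groups need at least one prediction')
    return abs(mean(privileged) - mean(unprivileged))

# made-up predictions: True = registered (privileged), False = unregistered
preds = [2.0, 2.5, 1.5, 2.25]
registered = [True, True, False, False]

mod = mean_outcome_difference(preds, registered)
is_acceptable = mod <= 0.1  # threshold used in this project
```

With these toy values the privileged mean is 2.25 and the unprivileged mean is 1.875, so the MOD exceeds the 0.1 threshold and the check fails.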

DVC Configuration

First, we’ll add the assess_model_risk stage right after the inference_primary_model stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    ### 
  tune_primary_model:
    ### 
  inference_primary_model:
    ###
  assess_model_risk:
    cmd: >
      python src/model/torch_model/assess_risk_and_fairness.py
      data/dfn_inference_results_${params.stockcode}.parquet
      metrics/dfn_risk_fairness_${params.stockcode}.json
      ${tracking.sensitive_feature_col}
      ${params.stockcode}
      ${tracking.privileged_group}
      ${tracking.mod_threshold}

    deps:
      - src/model/torch_model/assess_risk_and_fairness.py
      - src/_utils/
      - data/dfn_inference_results_${params.stockcode}.parquet # ensure the result df as dependency

    params:
      - params.stockcode
      - tracking.sensitive_feature_col
      - tracking.privileged_group
      - tracking.mod_threshold

    metrics:
      - metrics/dfn_risk_fairness_${params.stockcode}.json:
          type: json

Then we’ll add default values to the parameters:

params.yaml:

params:
  target_col: "quantity"
  should_scale: True
  verbose: False

tuning:
  n_trials: 100
  num_epochs: 3000
  should_local_save: False
  grid: False

# adding default values to the tracking metrics
tracking:
  sensitive_feature_col: "is_registered"
  privileged_group: 1 # member
  mod_threshold: 0.1

Python Script

The corresponding Python script contains the calculate_fairness_metrics function which performs the risk and fairness assessment:

src/model/torch_model/assess_risk_and_fairness.py:

import os
import json
import datetime
import argparse
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_log_error

from src._utils import main_logger


def calculate_fairness_metrics(
        df: pd.DataFrame,
        sensitive_feature_col: str,
        label_col: str = 'y_true',
        prediction_col: str = 'y_pred',
        privileged_group: int = 1,
        mod_threshold: float = 0.1,
    ) -> dict:

    metrics = dict()
    unprivileged_group = 0 if privileged_group == 1 else 1

    ## 1. risk assessment - predictive performance metrics by group
    for group, name in zip([unprivileged_group, privileged_group], ['unprivileged', 'privileged']):
        subset = df[df[sensitive_feature_col] == group]
        if len(subset) == 0: continue

        y_true = subset[label_col].values
        y_pred = subset[prediction_col].values

        metrics[f'mse_{name}'] = float(mean_squared_error(y_true, y_pred)) # type: ignore
        metrics[f'mae_{name}'] = float(mean_absolute_error(y_true, y_pred)) # type: ignore
        metrics[f'rmsle_{name}'] = float(root_mean_squared_log_error(y_true, y_pred)) # type: ignore

        # mean prediction (outcome disparity component)
        metrics[f'mean_prediction_{name}'] = float(y_pred.mean()) # type: ignore

    ## 2. bias assessment - fairness metrics
    # mean absolute error difference between groups (unprivileged minus privileged)
    mae_diff = metrics.get('mae_unprivileged', 0) - metrics.get('mae_privileged', 0)
    metrics['mae_diff'] = float(mae_diff)

    # mean outcome difference
    mod = metrics.get('mean_prediction_unprivileged', 0) - metrics.get('mean_prediction_privileged', 0)
    metrics['mean_outcome_difference'] = float(mod)
    metrics['is_mod_acceptable'] = 1 if abs(mod) <= mod_threshold else 0

    return metrics


def main():
    parser = argparse.ArgumentParser(description='assess bias and fairness metrics on model inference results.')
    parser.add_argument('inference_file_path', type=str, help='parquet file path to the inference results w/ y_true, y_pred, and sensitive feature cols.')
    parser.add_argument('metrics_output_path', type=str, help='json file path to save the metrics output.')
    parser.add_argument('sensitive_feature_col', type=str, help='column name of sensitive features')
    parser.add_argument('stockcode', type=str)
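    # nargs='?' makes these positionals optional so that the defaults actually apply
    parser.add_argument('privileged_group', type=int, nargs='?', default=1)
    parser.add_argument('mod_threshold', type=float, nargs='?', default=.1)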
    args = parser.parse_args()

    try:
        # load inf df
        df_inference = pd.read_parquet(args.inference_file_path)
        LABEL_COL = 'y_true'
        PREDICTION_COL = 'y_pred'
        SENSITIVE_COL = args.sensitive_feature_col

        # compute fairness metrics
        metrics = calculate_fairness_metrics(
            df=df_inference,
            sensitive_feature_col=SENSITIVE_COL,
            label_col=LABEL_COL,
            prediction_col=PREDICTION_COL,
            privileged_group=args.privileged_group,
            mod_threshold=args.mod_threshold,
        )

        # add items to metrics
        metrics['model_version'] = f'dfn_{args.stockcode}_{os.getpid()}'
        metrics['sensitive_feature'] = args.sensitive_feature_col
        metrics['privileged_group'] = args.privileged_group
        metrics['mod_threshold'] = args.mod_threshold
        metrics['stockcode'] = args.stockcode
        metrics['timestamp'] = datetime.datetime.now().isoformat()

        # save metrics (dvc track)
        with open(args.metrics_output_path, 'w') as f:
            json_metrics = { k: (v if pd.notna(v) else None) for k, v in metrics.items() }
            json.dump(json_metrics, f, indent=4)

    except Exception as e:
        main_logger.error(f'... an error occurred during risk and fairness assessment: {e} ...')
        exit(1)

if __name__ == '__main__':
    main()

Outputs

The final stage generates a metrics file which contains test results and model version:

metrics/dfn_risk_fairness.json:

{
    "mse_unprivileged": 3.5370739412593575,
    "mae_unprivileged": 1.48263614013523,
    "rmsle_unprivileged": 0.6080000224747837,
    "mean_prediction_unprivileged": 1.8507767915725708,
    "mae_diff": 1.48263614013523,
    "mean_outcome_difference": 1.8507767915725708,
    "is_mod_acceptable": 1,
    "model_version": "dfn_85123A_35971",
    "sensitive_feature": "is_registered",
    "privileged_group": 1,
    "mod_threshold": 0.1,
    "timestamp": "2025-10-07T00:31:15.998590"
}

Note that this sample output contains no privileged-group metrics. When a group has no rows, the loop skips it and the script falls back to 0 for its values, so mae_diff and mean_outcome_difference in this run equal the unprivileged values rather than a true between-group difference.

That’s all for the lineage configuration. Now, we’ll test it locally.

Test in Local

We’ll run the entire ML lineage with this command:

$dvc repro -f

The -f flag forces DVC to rerun all stages, whether or not anything has changed.

The command will automatically create the dvc.lock file at the root of the project directory:

schema: '2.0'
stages:
  etl_pipeline_full:
    cmd: python src/data_handling/etl_pipeline.py
    deps:
    - path: src/_utils/
      hash: md5
      md5: ae41392532188d290395495f6827ed00.dir
      size: 15870
      nfiles: 10
    - path: src/data_handling/
      hash: md5
      md5: a8a61a4b270581a7c387d51e416f4e86.dir
      size: 95715
...
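
The md5 entries in dvc.lock are what let DVC skip stages whose inputs haven’t changed when dvc repro runs without -f. The following is a simplified, standard-library illustration of that idea, not DVC’s actual implementation (which also handles directories, outputs, and parameters). The helper names md5_of and stage_is_stale are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def stage_is_stale(dep_paths, locked_hashes):
    """A stage needs to rerun if any dependency's hash differs from the recorded one."""
    return any(md5_of(p) != locked_hashes.get(str(p)) for p in dep_paths)

# demo with a throwaway file standing in for a stage dependency
with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / 'etl_pipeline.py'
    dep.write_text('print("v1")')
    lock = {str(dep): md5_of(dep)}           # what a lock file would record
    unchanged = stage_is_stale([dep], lock)  # hash still matches the lock
    dep.write_text('print("v2")')
    changed = stage_is_stale([dep], lock)    # content changed after locking
```

In this demo, the first check finds the stage up to date and the second flags it as stale, which mirrors why committing an up-to-date dvc.lock matters.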

The dvc.lock file must be committed to Git so that DVC loads the latest versions of the tracked files:

$git add dvc.lock .dvc dvc.yaml params.yaml
$git commit -m'updated dvc config'
$git push

Step 3: Deploying the DVC Project

Next, we’ll deploy the DVC project to ensure the AWS Lambda function can access the cached files in production.

We’ll start by configuring the DVC remote where the cached files are stored.

DVC supports various storage types like AWS S3 and Google Cloud Storage. We’ll use AWS S3 for this project, but your choice depends on the project ecosystem, your familiarity with the tool, and any resource constraints.

First, we’ll create a new S3 bucket in the selected AWS region (YOUR_BUCKET_NAME and YOUR_AWS_REGION are placeholders):

$aws s3 mb s3://YOUR_BUCKET_NAME --region YOUR_AWS_REGION

Make sure the IAM role has the following permissions: s3:ListBucket, s3:GetObject, s3:PutObject, and s3:DeleteObject.
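
For reference, a policy granting exactly those four permissions could look like the sketch below, built as a Python dictionary and serialized to JSON. The bucket name is a placeholder and the helper build_dvc_remote_policy is hypothetical. Adapt the resources to your own account before attaching anything to a role.

```python
import json

def build_dvc_remote_policy(bucket_name):
    """Build an IAM policy document covering the permissions a DVC remote needs."""
    return {
        'Version': '2012-10-17',
        'Statement': [
            {   # listing applies to the bucket itself
                'Effect': 'Allow',
                'Action': ['s3:ListBucket'],
                'Resource': f'arn:aws:s3:::{bucket_name}',
            },
            {   # object-level actions apply to keys inside the bucket
                'Effect': 'Allow',
                'Action': ['s3:GetObject', 's3:PutObject', 's3:DeleteObject'],
                'Resource': f'arn:aws:s3:::{bucket_name}/*',
            },
        ],
    }

# 'my-dvc-remote-bucket' is a placeholder bucket name
policy_json = json.dumps(build_dvc_remote_policy('my-dvc-remote-bucket'), indent=2)
```

Splitting the statement in two reflects how S3 scopes permissions: s3:ListBucket targets the bucket ARN, while the object actions target the keys under it.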

Then, add the URI of the S3 bucket to the DVC remote, substituting your own remote name, bucket, and path:

$dvc remote add -d YOUR_REMOTE_NAME s3://YOUR_BUCKET_NAME/YOUR_PATH

Next, push the cache files to the DVC remote:

$dvc push

Now, all cache files are stored in the S3 bucket:

Figure C. Screenshot of the DVC remote in AWS S3 bucket

As shown in Figure A, this deployment step is necessary for the AWS Lambda function to access the DVC cache in production.

Step 4: Configuring Scheduled Run with Prefect

The next step is to configure the scheduled run of the entire lineage with Prefect.

Prefect is an open-source workflow orchestration tool for building, scheduling, and monitoring pipelines. It uses a concept called a work pool to effectively decouple the orchestration logic from the execution infrastructure.

The work pool serves as a standardized base configuration: it runs each flow in a Docker container image, which guarantees a consistent execution environment for all flows.

Configuring the Docker Image Registry

The first step is to configure the Docker image registry for the Prefect work pool:

For local deployment: A container registry on Docker Hub.
For production deployment: AWS ECR.

For local deployment, we’ll first authenticate the Docker client:

$docker login

Then, on macOS, grant your user permission to run Docker commands without sudo:

$sudo dscl . -append /Groups/docker GroupMembership $USER

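For production deployment, we’ll create a new ECR repository (YOUR_REPOSITORY_NAME and YOUR_AWS_REGION are placeholders):

$aws ecr create-repository --repository-name YOUR_REPOSITORY_NAME --region YOUR_AWS_REGION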

(Make sure the IAM role has access to this new ECR URI.)

Configure Prefect Tasks and Flows

Next, we’ll configure the Prefect task and flow in the project:

The Prefect task executes the dvc repro and dvc push commands.
The Prefect flow executes the Prefect task weekly.

src/prefect_flows.py:

import os
import sys
import subprocess
from datetime import timedelta, datetime
from dotenv import load_dotenv
from prefect import flow, task
from prefect.schedules import Schedule
from prefect_aws import AwsCredentials

# add project root to the python path before importing from src - enabling prefect to find the script
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

from src._utils import main_logger

# define the prefect task
@task(retries=3, retry_delay_seconds=30)
def run_dvc_pipeline():
    # execute the dvc pipeline (check=True raises on a non-zero exit, which triggers a retry)
    result = subprocess.run(["dvc", "repro"], capture_output=True, text=True, check=True)
    main_logger.info(result.stdout)

    # push the updated data
    subprocess.run(["dvc", "push"], check=True)


# define the prefect flow
@flow(name="Weekly Data Pipeline")
def weekly_data_flow():
    run_dvc_pipeline()

if __name__ == '__main__':
    # docker image registry (either docker hub or aws ecr)
    load_dotenv(override=True)
    ENV = os.getenv('ENV', 'production')
    DOCKER_HUB_REPO = os.getenv('DOCKER_HUB_REPO')
    ECR_FOR_PREFECT_PATH = os.getenv('S3_BUCKET_FOR_PREFECT_PATH')
    image_repo = f'{DOCKER_HUB_REPO}:ml-sales-pred-data-latest' if ENV == 'local' else f'{ECR_FOR_PREFECT_PATH}:latest'

    # define weekly schedule
    weekly_schedule = Schedule(
        interval=timedelta(weeks=1),
        anchor_date=datetime(2025, 9, 29, 9, 0, 0),
        active=True,
    )

    # aws credentials to access ecr
    AwsCredentials(
        aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
        aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
        region_name=os.getenv('AWS_REGION_NAME'),
    ).save('aws', overwrite=True)

    # deploy the prefect flow
    weekly_data_flow.deploy(
        name='weekly-data-flow',
        schedule=weekly_schedule, # schedule
        work_pool_name="wp-ml-sales-pred", # work pool where the docker image (flow) runs
        image=image_repo, # create a docker image at docker hub (local) or ecr (production)
        concurrency_limit=3,
        push=True # push the docker image to the image_repo
    )
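
The weekly schedule above combines an interval with an anchor date. To show what that combination means in practice, here is a plain-Python sketch of run times laid out on a grid of anchor plus whole multiples of the interval. Prefect computes this internally; the helper next_runs and the "now" timestamp are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

def next_runs(anchor, interval, now, count=3):
    """Return the next `count` run times on the grid anchor + k * interval after `now`."""
    if now < anchor:
        first = anchor
    else:
        elapsed_intervals = (now - anchor) // interval  # whole intervals already passed
        first = anchor + (elapsed_intervals + 1) * interval
    return [first + i * interval for i in range(count)]

anchor = datetime(2025, 9, 29, 9, 0, 0)  # same anchor as the flow above
runs = next_runs(anchor, timedelta(weeks=1), now=datetime(2025, 10, 16, 13, 0, 0))
```

Because the anchor falls on a Monday at 09:00, every run in this sketch lands on a Monday at 09:00, regardless of when the deployment itself is created.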

Test in Local

Next, we’ll test the workflow locally with the Prefect server:

$uv run prefect server start

$export PREFECT_API_URL="http://127.0.0.1:4200/api"

Run the prefect_flows.py script:

$uv run src/prefect_flows.py

Upon successful execution, the Prefect dashboard indicates the workflow is scheduled to run:

Figure D. A screenshot of the Prefect dashboard

Step 5: Deploying the Application

The final step is to deploy the entire application as a containerized Lambda by configuring the Dockerfile and the Flask application scripts.

The specific process in this final deployment step depends on the infrastructure.

But the common point is that DVC eliminates the need to store the large Parquet or CSV files directly in the feature store or model store because it caches them as lightweight hashed files.

So, first, we’ll simplify the loading logic of the Flask application script by using the dvc.api module:

app.py:

### ... the rest components remain the same  ...

import dvc.api

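# name of the DVC remote, read from the environment (the Dockerfile sets DVC_REMOTE_NAME)
DVC_REMOTE_NAME = os.environ.get('DVC_REMOTE_NAME')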


def configure_dvc_for_lambda():
    # set dvc directories to /tmp
    os.environ.update({
        'DVC_CACHE_DIR': '/tmp/dvc-cache',
        'DVC_DATA_DIR': '/tmp/dvc-data',
        'DVC_CONFIG_DIR': '/tmp/dvc-config',
        'DVC_GLOBAL_CONFIG_DIR': '/tmp/dvc-global-config',
        'DVC_SITE_CACHE_DIR': '/tmp/dvc-site-cache'
    })
    for dir_path in ['/tmp/dvc-cache', '/tmp/dvc-data', '/tmp/dvc-config']:
        os.makedirs(dir_path, exist_ok=True)


def load_x_test():
    global X_test
    if not os.environ.get('PYTEST_RUN', False):
        main_logger.info("... loading x_test ...")

        # config dvc directories
        configure_dvc_for_lambda()
        try:
            with dvc.api.open(X_TEST_PATH, remote=DVC_REMOTE_NAME, mode='rb') as fd:
                X_test = pd.read_parquet(fd)
                main_logger.info('✅ successfully loaded x_test via dvc api')
        except Exception as e:
            main_logger.error(f'❌ general loading error: {e}', exc_info=True)


def load_preprocessor():
    global preprocessor
    if not os.environ.get('PYTEST_RUN', False):
        main_logger.info("... loading preprocessor ...")
        configure_dvc_for_lambda()
        try:
            with dvc.api.open(PREPROCESSOR_PATH, remote=DVC_REMOTE_NAME, mode='rb') as fd:
                preprocessor = joblib.load(fd)
                main_logger.info('✅ successfully loaded preprocessor via dvc api')

        except Exception as e:
            main_logger.error(f'❌ general loading error: {e}', exc_info=True)

### ... the rest components remain the same  ...

Then, update the Dockerfile to enable Docker to correctly reference the DVC components:

Dockerfile.lambda.production:

# use an official python runtime
FROM public.ecr.aws/lambda/python:3.12

# set environment variables (adding dvc related env variables)
ENV JOBLIB_MULTIPROCESSING=0
ENV DVC_HOME="/tmp/.dvc"
ENV DVC_CACHE_DIR="/tmp/.dvc/cache"
ENV DVC_REMOTE_NAME="storage"
ENV DVC_GLOBAL_SITE_CACHE_DIR="/tmp/dvc_global"

# copy requirements file and install dependencies
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir dvc dvc-s3

# setup dvc
RUN dvc init --no-scm
RUN dvc config core.no_scm true

# copy the code to the lambda task root
COPY . ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]

Lastly, ensure the large files are excluded from the Docker container image:

.dockerignore:

### ... the rest components remain the same  ...

# dvc cache contains large files
.dvc/cache
.dvcignore

# add all folders that DVC will track
data/
preprocessors/
models/
reports/
metrics/

Test in Local

Finally, we’ll build and test the Docker image:

$docker build -t my-app -f Dockerfile.lambda.local .
$docker run -p 5002:5002 -e ENV=local my-app app.py

If the configuration is correct, the waitress server will run the Flask application.

After confirming the changes, push the code to Git:

$git add .
$git commit -m'updated dockerfiles and flask app scripts'
$git push

This push command triggers the CI/CD pipeline via GitHub Actions, which generates a Docker container image and pushes it to AWS ECR.

After the pipeline completes successfully and the image is verified, we can manually run the deployment workflow using GitHub Actions.

And that’s it!

You can learn more here: Integrating the infrastructure CI/CD pipeline to an ML application

All code is available in my GitHub repository.

The mock app is available here.

Conclusion

Building robust ML applications requires comprehensive ML lineage to ensure reliability and traceability.

In this article, you learned how to build an ML lineage by integrating open-source services like DVC and Prefect.

In practice, initial planning matters. Specifically, deciding up front which metrics are tracked, and at which stages, leads directly to a cleaner, more maintainable code structure that is easier to extend in the future.

Moving forward, we can consider adding more stages to the lineage and integrating advanced logic for data drift detection or fairness tests.

This will further ensure continued model performance and data integrity in the production environment.

You can check out my Portfolio / Github.

All images, unless otherwise noted, are by the author.

How to Build a Machine Learning System on Serverless Architecture

Kuriko — Tue, 26 Aug 2025 16:23:28 +0000

Let’s say you’ve built a fantastic machine learning model that performs beautifully in notebooks.

But a model isn’t truly valuable until it’s in production, serving real users and solving real problems.

In this article, you’ll learn how to ship a production-ready ML application built on serverless architecture.

Prerequisites
What We’re Building
The System Architecture
- Core AWS Resources in the Architecture
The Deployment Workflow in Action
Building a Client Application (Optional)
- The React Application
Final Results
Conclusion

Prerequisites

This project requires some basic experience with:

Machine Learning / Deep Learning: The full lifecycle, including data handling, model training, tuning, and validation.
Coding: Proficiency in Python, with experience using major ML libraries such as PyTorch and Scikit-Learn.
Full-stack deployment: Experience deploying applications using RESTful APIs.

What We’re Building

AI Pricing for Retailers

This project aims to help a mid-sized retailer compete with large players like Amazon.

Smaller companies often can’t afford significant price discounts, so they can face challenges finding optimal price points as they expand their product lines.

Our goal is to leverage AI models to recommend the best price for a selected product to maximize sales for the retailer, and display it on a client-side user interface (UI):

You can explore the UI from here.

The Models

I’ll train and tune multiple models so that when the primary model fails, a backup model gets loaded to serve predictions.

Primary Model: Multi-layered feedforward network (on the PyTorch library)
Backup Models (Backups): LightGBM, SVR, and Elastic Net (on the Scikit-Learn library)

The backup models are prioritized based on learning capabilities.
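
The fallback idea can be sketched in a few lines: try each model in priority order and serve the first prediction that succeeds. The function predict_with_fallback and the stand-in predictors below are hypothetical and only illustrate the pattern. In the real application, each entry would wrap a loaded model.

```python
def predict_with_fallback(models, features):
    """Try each (name, predict_fn) pair in priority order and return the first success."""
    errors = {}
    for name, predict_fn in models:
        try:
            return name, predict_fn(features)
        except Exception as exc:  # broad on purpose: any failure moves to the next model
            errors[name] = repr(exc)
    raise RuntimeError(f'all models failed: {errors}')

# stand-in predictors (hypothetical; real ones would wrap the loaded models)
def broken_dfn(features):
    raise RuntimeError('checkpoint could not be loaded')

def lightgbm_stub(features):
    return [sum(features)]

served_by, prediction = predict_with_fallback(
    [('dfn', broken_dfn), ('lightgbm', lightgbm_stub)], [1.0, 2.0]
)
```

Returning the name of the model that served the request alongside the prediction makes it easy to log how often the backups are actually used.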

Tuning and Training

The primary model was trained on a dataset of around 500,000 samples (source) and tuned using Optuna’s Bayesian optimization, with grid search available for further refinement.

The backups are also trained on the same samples and tuned using the Scikit-Optimize framework.

The Prediction

All models serve predictions on log-transformed quantity values.

Logarithmic transformations of the quantity data make the distribution denser, which helps models learn patterns more effectively. This is because logarithms reduce the impact of extreme values, or outliers, and can help normalize skewed data.
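
Here is a tiny example of that effect using the standard library. It assumes the transform is log1p (the logarithm of 1 + x), which is a common choice because it stays defined at zero. The article does not state which variant the project uses, and the quantities are made up.

```python
import math

quantities = [1, 2, 3, 5, 8, 4800]  # made-up order quantities with one extreme value

# log1p(x) = log(1 + x) stays defined at x = 0
logged = [math.log1p(q) for q in quantities]

# the inverse transform recovers the original scale for reporting
recovered = [math.expm1(v) for v in logged]

raw_ratio = max(quantities) / min(quantities)  # spread on the original scale
log_ratio = max(logged) / min(logged)          # spread after the transform
```

On the original scale the largest value is thousands of times the smallest, while on the log scale the ratio shrinks to roughly a dozen, so a single outlier no longer dominates the loss.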

Performance Validation

We’ll evaluate model performance using different metrics for the transformed and original data, with a lower value always indicating better performance.

Logged values: Mean Squared Error (MSE)
Actual values: Root Mean Squared Log Error (RMSLE) and Mean Absolute Error (MAE)
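
For readers who want to see how the three metrics relate, the following sketch implements them in plain Python on made-up values. It again assumes log1p as the transform, which matches how RMSLE is usually defined. The function names are illustrative and are not the project’s own helpers.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmsle(y_true, y_pred):
    # root of the mean squared difference between log1p-transformed values
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

actual = [3.0, 5.0, 10.0]     # made-up quantities on the original scale
predicted = [2.0, 5.0, 14.0]

mae_actual = mae(actual, predicted)
rmsle_actual = rmsle(actual, predicted)
# MSE is reported on the log scale, so transform first
mse_logged = mse([math.log1p(v) for v in actual], [math.log1p(v) for v in predicted])
```

Under this assumption, RMSLE on the actual values is simply the square root of the MSE on the logged values, which is why the two are reported on different scales.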

The System Architecture

We’re going to build a complete ecosystem around an AWS Lambda function to create a scalable ML system:

Fig. The system architecture (Created by Kuriko IWAI)

AWS Lambda is a serverless compute service that lets a service provider run an application without managing servers. Once they upload the code, AWS takes on the responsibility of managing the underlying infrastructure.

In a serverless environment, the code is deployed as a stateless function that runs only when it’s triggered by an event such as an HTTP request or a scheduled task.

This event-driven nature makes serverless computing extremely efficient in resource allocation because:

There’s no server management: The cloud provider takes care of operational tasks.
You have automatic scaling: Serverless applications automatically scale up or down based on demand.
You have pay-per-use billing: You’re charged only for the compute resources the application actually consumes.
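
To make "stateless function triggered by an event" concrete, here is a minimal sketch of a Lambda entry point that responds to an API Gateway proxy-style event. It is not this project’s handler (the application here is a Flask app), and the stockcode query parameter is an assumption chosen for illustration.

```python
import json

def handler(event, context):
    """Minimal Lambda entry point for an API Gateway proxy-style event."""
    params = event.get('queryStringParameters') or {}
    stockcode = params.get('stockcode')
    if stockcode is None:
        return {'statusCode': 400, 'body': json.dumps({'error': 'stockcode is required'})}
    # a real function would run model inference here; this returns a placeholder
    return {'statusCode': 200, 'body': json.dumps({'stockcode': stockcode, 'status': 'ok'})}

# local invocation with a hand-written event; Lambda supplies both arguments in production
response = handler({'queryStringParameters': {'stockcode': '85123A'}}, None)
```

The function keeps no state between calls: everything it needs arrives in the event, which is what allows the platform to scale instances up and down freely.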

Note that other cloud ecosystems like Google Cloud Platform (GCP) and Microsoft Azure offer comprehensive alternatives to AWS. Which one you choose depends on your budget, project type, and familiarity with each ecosystem.

Core AWS Resources in the Architecture

The system architecture focuses on the following points:

The application is fully containerized on Docker for universal accessibility.
The container image is stored in AWS Elastic Container Registry (ECR).
The API Gateway’s REST API endpoints trigger an event to invoke the Lambda function.
The Lambda function loads the container image from ECR and performs inference.
Trained models, processors, and input features are stored in AWS S3 buckets.
A Redis client serves cached analytical data and past predictions stored in ElastiCache.

And to build the system, we’ll use the following AWS resources:

Lambda: Serves a function to perform inference.
API Gateway: Routes API calls to the Lambda function.
S3 Storage: Serves as the feature store and model store.
ElastiCache: Stores cached predictions and analytical data.
ECR: Stores Docker container images to allow Lambda to pull the image.

Each resource requires configuration. I’ll explore those details in the next section.

The Deployment Workflow in Action

The deployment workflow involves the following steps:

Draft data preparation, model training, and serialization scripts
Configure designated feature store and model store in S3
Create a Flask application with API endpoints
Publish a Docker image to ECR
Create a Lambda function
Configure related AWS resources

We’ll now walk through each of these steps to help you fully understand the process.

For your reference, here is the repository structure:

.
.venv/                  [.gitignore]    # stores uv venv
│
└── data/               [.gitignore]
│     └──raw/                           # stores raw data
│     └──preprocessed/                  # stores processed data after imputation and engineering
│
└── models/             [.gitignore]    # stores serialized model after training and tuning
│     └──dfn/                           # deep feedforward network
│     └──gbm/                           # light gbm
│     └──en/                            # elastic net
│     └──production/                    # models to be stored in S3 for production use
│
└── notebooks/                          # stores experimentation notebooks
│
└── src/                                # core functions
│     └──_utils/                        # utility functions
│     └──data_handling/                 # functions to engineer features
│     └──model/                         # functions to train, tune, validate models
│     │     └── sklearn_model
│     │     └── torch_model
│     │     └── ...
│     └──main.py                        # main script to run the inference locally
│
└──app.py                               # Flask application (API endpoints)
└──pyproject.toml                       # project configuration
└──.env                [.gitignore]     # environment variables
└──uv.lock                              # dependency locking
└──Dockerfile                           # for Docker container image
└──.dockerignore
└──requirements.txt
└──.python-version                      # python version locking (3.12)

Step 1: Draft Python Scripts

The first step is to draft Python scripts for data preparation, model training and tuning.

We’ll run these scripts in a batch process because these are resource-intensive and stateful tasks that aren’t suitable for serverless functions optimized for short-lived, stateless, and event-driven tasks.

Serverless functions can also experience cold starts. With heavy tasks in the function, API Gateway could time out before the function serves predictions.

src/main.py

import os
import torch
import warnings
import pickle
import joblib
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR
from skopt.space import Real, Integer, Categorical
from dotenv import load_dotenv

import src.data_handling as data_handling
import src.model.torch_model as t
import src.model.sklearn_model as sk


if __name__ == '__main__': 
    load_dotenv(override=True)
    os.makedirs(PRODUCTION_MODEL_FOLDER_PATH, exist_ok=True)

    # create train, validation, test datasets
    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = data_handling.main_script()

    # store the trained preprocessor in local storage
    joblib.dump(preprocessor, PREPROCESSOR_PATH)

    # model tuning and training
    best_dfn_full_trained, checkpoint = t.main_script(X_train, X_val, y_train, y_val)

    # serialize the trained model
    torch.save(checkpoint, DFN_FILE_PATH)

    # svr
    best_svr_trained, best_hparams_svr = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[1]
    )
    if best_svr_trained is not None:
        with open(SVR_FILE_PATH, 'wb') as f:
            pickle.dump({ 'best_model': best_svr_trained, 'best_hparams': best_hparams_svr }, f)

    # elastic net
    best_en_trained, best_hparams_en = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[0]
    )
    if best_en_trained is not None:
        with open(EN_FILE_PATH, 'wb') as f:
            pickle.dump({ 'best_model': best_en_trained, 'best_hparams': best_hparams_en }, f)

    # light gbm
    best_gbm_trained, best_hparams_gbm = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[2]
    )

    if best_gbm_trained is not None:
        with open(GBM_FILE_PATH, 'wb') as f:
            pickle.dump({'best_model': best_gbm_trained, 'best_hparams': best_hparams_gbm }, f)

Run the script to train and serialize the models using the uv package manager:

$uv venv
$source .venv/bin/activate
$uv run src/main.py

The main.py script includes several key components.

Scripts for Data Handling

These scripts load the original data, impute missing values, and engineer the features needed for prediction.

src/data_handling/main.py

import os
import joblib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

import src.data_handling.scripts as scripts
from src._utils import main_logger


# load and save the original data frame in parquet
df = scripts.load_original_dataframe()
df.to_parquet(ORIGINAL_DF_PATH, index=False)

# imputation
df = scripts.structure_missing_values(df=df)

# feature engineering
df = scripts.handle_feature_engineering(df=df)

# save processed df in csv and parquet
scripts.save_df_to_csv(df=df)
df.to_parquet(PROCESSED_DF_PATH, index=False)


# for preprocessing, classify numerical and categorical columns
num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
if cat_cols:
    for col in cat_cols: df[col] = df[col].astype('string')

# creates training, validation, and test datasets (test dataset is for inference only)
y = df[target_col]
X = df.copy().drop(target_col, axis='columns')
test_size, random_state = 50000, 42
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=test_size, random_state=random_state
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=test_size, random_state=random_state
)

# transform the input datasets
X_train, X_val, X_test, preprocessor = scripts.transform_input(
    X_train, X_val, X_test, num_cols=num_cols, cat_cols=cat_cols
)

# retrain and serialize the preprocessor
if preprocessor is not None: preprocessor.fit(X)
joblib.dump(preprocessor, PREPROCESSOR_PATH)

Scripts for Model Training and Tuning (PyTorch Model)

The scripts initialize the model, search for the optimal neural architecture and hyperparameters, and serialize the fully trained model so that the system can load it when performing inference.

Because the primary model is built on PyTorch and the backups use Scikit-Learn, we’re drafting the scripts separately.

1. PyTorch Models

The training script trains the model and validates it on a subset of the training data.

It also includes early stopping logic, which halts training when the validation loss has not improved for a given number of consecutive epochs (for example, 10 epochs).

src/model/torch_model/scripts/training.py

import torch
import torch.nn as nn
import optuna # type: ignore
from sklearn.model_selection import train_test_split

from src._utils import main_logger

# device
device_type = device_type if device_type else 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
device = torch.device(device_type)

# gradient scaler for stability (only applicable for cuda)
scaler = torch.GradScaler(device=device_type) if device_type == 'cuda' else None

# start training
best_val_loss = float('inf')
epochs_no_improve = 0
for epoch in range(num_epochs):
    model.train()
    for batch_X, batch_y in train_data_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()

        try:
            # pytorch's AMP system automatically handles the casting of tensors to Float16 or Float32
            with torch.autocast(device_type=device_type):
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)

                # break the training loop when models return nan or inf
                if torch.any(torch.isnan(outputs)) or torch.any(torch.isinf(outputs)):
                    main_logger.error(
                        'pytorch model returns nan or inf. break the training loop.'
                    )
                    break

            # scale the loss before backprop to avoid float16 gradient underflow
            if scaler is not None:
                scaler.scale(loss).backward()
                scaler.unscale_(optimizer)  # unscale the gradients before clipping
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradients
                scaler.step(optimizer)  # skips the step if gradients contain inf/nan
                scaler.update()  # updates the scale for the next iteration

            else:
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradients
                optimizer.step()

        except Exception:
            # fall back to a plain full-precision step if the mixed-precision path fails
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()


    # run validation on a subset of the training dataset
    model.eval()
    val_loss = 0.0

    # switch the torch mode
    with torch.inference_mode():
        for batch_X_val, batch_y_val in val_data_loader:
            batch_X_val, batch_y_val = batch_X_val.to(device), batch_y_val.to(device)
            outputs_val = model(batch_X_val)
            val_loss += criterion(outputs_val, batch_y_val).item()

    val_loss /= len(val_data_loader)

    # check if early stop
    if val_loss < best_val_loss - min_delta:
        best_val_loss = val_loss
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            main_logger.info(f'early stopping at epoch {epoch + 1}')
            break
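
The early stopping rule in the loop above can be isolated into a small helper. The following is a minimal sketch, not the article's actual train_model implementation: the EarlyStopper class and the epochs_run function are hypothetical names, and the loss values are passed in as a plain list instead of being computed from a data loader.

```python
class EarlyStopper:
    """Tracks the validation loss and signals when training should stop."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience          # allowed consecutive epochs without improvement
        self.min_delta = min_delta        # minimum decrease that counts as an improvement
        self.best_loss = float("inf")
        self.epochs_no_improve = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_no_improve = 0
        else:
            self.epochs_no_improve += 1
        return self.epochs_no_improve >= self.patience


def epochs_run(val_losses, patience: int = 10, min_delta: float = 0.0):
    """Return (number of epochs executed, best validation loss) for a loss history."""
    stopper = EarlyStopper(patience=patience, min_delta=min_delta)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch, stopper.best_loss
    return len(val_losses), stopper.best_loss
```

Keeping the counter in one object makes it easier to reuse the same rule in both the tuning trials and the final retraining run.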

The tuning script uses the study component from the Optuna library to run the Bayesian Optimization.

The study component chooses a neural architecture and hyperparameter set to test from the global search space.

Then, it builds, trains, and validates the model to find the optimal neural architecture that can minimize the loss (MSE, for instance).

src/model/torch_model/scripts/tuning.py

import itertools
import pandas as pd
import numpy as np
import optuna
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split

from src.model.torch_model.scripts.pretrained_base import DFN
from src.model.torch_model.scripts.training import train_model
from src._utils import main_logger

# device
device_type = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
device = torch.device(device_type)

# loss function
criterion = nn.MSELoss()

# define objective function for optuna
def objective(trial):
    # model
    num_layers = trial.suggest_int('num_layers', 1, 20)
    batch_norm = trial.suggest_categorical('batch_norm', [True, False])
    dropout_rates = []
    hidden_units_per_layer = []
    for i in range(num_layers):
        dropout_rates.append(trial.suggest_float(f'dropout_rate_layer_{i}', 0.0, 0.6))
        hidden_units_per_layer.append(trial.suggest_int(f'n_units_layer_{i}', 8, 256)) # hidden units per layer

    model = DFN(
        input_dim=X_train.shape[1],
        num_layers=num_layers,
        dropout_rates=dropout_rates,
        batch_norm=batch_norm,
        hidden_units_per_layer=hidden_units_per_layer
    ).to(device)

    # optimizer
    learning_rate = trial.suggest_float('learning_rate', 1e-10, 1e-1, log=True)
    optimizer_name = trial.suggest_categorical('optimizer', ['adam', 'rmsprop', 'sgd', 'adamw', 'adamax', 'adadelta', 'radam'])
    optimizer = _handle_optimizer(optimizer_name=optimizer_name, model=model, lr=learning_rate)

    # data loaders
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128, 256])
    test_size = 10000 if len(X_train) > 15000 else int(len(X_train) * 0.2)
    X_train_search, X_val_search, y_train_search, y_val_search = train_test_split(X_train, y_train, test_size=test_size, random_state=42)
    train_data_loader = create_torch_data_loader(X=X_train_search, y=y_train_search, batch_size=batch_size)
    val_data_loader = create_torch_data_loader(X=X_val_search, y=y_val_search, batch_size=batch_size)

    # training
    num_epochs = 3000 # ensure enough epochs (early stopping would stop the loop when overfitting)
    _, best_val_loss = train_model(
        train_data_loader=train_data_loader,
        val_data_loader=val_data_loader,
        model=model,
        optimizer=optimizer,
        criterion = criterion,
        num_epochs=num_epochs,
        trial=trial,
    )
    return best_val_loss


# start to optimize hyperparameters and architecture
study = optuna.create_study(direction='minimize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50, timeout=600)

# best 
best_trial = study.best_trial
best_hparams = best_trial.params

# construct the model based on the tuning results
best_lr = best_hparams['learning_rate']
best_batch_size = best_hparams['batch_size']
input_dim = X_train.shape[1]
best_model = DFN(
    input_dim=input_dim,
    num_layers=best_hparams['num_layers'],
    hidden_units_per_layer=[v for k, v in best_hparams.items() if 'n_units_layer_' in k],
    batch_norm=best_hparams['batch_norm'],
    dropout_rates=[v for k, v in best_hparams.items() if 'dropout_rate_layer_' in k],
).to(device)

# construct an optimizer based on the tuning results
best_optimizer_name = best_hparams['optimizer']
best_optimizer = _handle_optimizer(
    optimizer_name=best_optimizer_name, model=best_model, lr=best_lr
)

# create torch data loaders
train_data_loader = create_torch_data_loader(
    X=X_train, y=y_train, batch_size=best_batch_size
)
val_data_loader = create_torch_data_loader(
    X=X_val, y=y_val, batch_size=best_batch_size
)

# retrain the best model with full training dataset applying the optimal batch size and optimizer
best_model, _ = train_model(
    train_data_loader=train_data_loader,
    val_data_loader=val_data_loader,
    model=best_model,
    optimizer=best_optimizer,
    criterion = criterion,
    num_epochs=1000
)

# create a checkpoint for serialization (reconstruct the model using the checkpoint)
checkpoint = {
    'state_dict': best_model.state_dict(),
    'hparams': best_hparams,
    'input_dim': X_train.shape[1],
    'optimizer': best_optimizer,
    'batch_size': best_batch_size
}

# serialize the model w/ checkpoint
torch.save(checkpoint, FILE_PATH)
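
When rebuilding the best model, the list comprehensions above rely on the parameter dictionary keeping the per-layer keys in layer order. As a hedged alternative, the sketch below sorts the values explicitly by the layer index embedded in each key. The layer_values helper is a hypothetical name, and the sample dictionary is made up for illustration.

```python
def layer_values(params: dict, prefix: str) -> list:
    """Collect values whose keys look like '<prefix><index>', ordered by index."""
    indexed = []
    for key, value in params.items():
        suffix = key[len(prefix):]
        if key.startswith(prefix) and suffix.isdigit():
            indexed.append((int(suffix), value))
    indexed.sort(key=lambda pair: pair[0])
    return [value for _, value in indexed]


# made-up tuning result whose keys are deliberately out of order
sample_hparams = {
    "num_layers": 3,
    "n_units_layer_2": 32,
    "dropout_rate_layer_1": 0.2,
    "n_units_layer_0": 128,
    "dropout_rate_layer_0": 0.1,
    "n_units_layer_1": 64,
    "dropout_rate_layer_2": 0.3,
    "learning_rate": 0.001,
}

hidden_units = layer_values(sample_hparams, "n_units_layer_")
dropout_rates = layer_values(sample_hparams, "dropout_rate_layer_")
```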

2. Scikit-Learn Models (Backups)

For Scikit-Learn models, we’ll run k-fold cross validation during training to reduce the risk of overfitting to a single validation split.

K-fold cross-validation is a technique for evaluating a machine learning model's performance by training and testing it on different subsets of training data.

We define the run_kfold_validation function where the model is trained and validated using 5-fold cross-validation.

src/model/sklearn_model/scripts/tuning.py

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

from src._utils import main_logger

def run_kfold_validation(
        X_train,
        y_train,
        base_model,
        hparams: dict,
        n_splits: int = 5, # the number of folds 
        early_stopping_rounds: int = 10,
        max_iters: int = 200
    ) -> float:

    mses = 0.0

    # create k-fold component
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    for fold, (train_index, val_index) in enumerate(kf.split(X_train)):
        # create a subset of training and validation datasets from the entire training data
        X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
        y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

        # reconstruct a model
        model = base_model(**hparams)

        # start the cross validation
        best_val_mse = float('inf')
        patience_counter = 0
        best_model_state = None
        best_iteration = 0

        for iteration in range(max_iters):
            # train on a subset of the training data
            try:
                model.train_one_step(X_train_fold, y_train_fold, iteration)
            except Exception:
                # models without an incremental training step are fit in full
                model.fit(X_train_fold, y_train_fold)

            # make a prediction on validation data
            y_pred_val_kf = model.predict(X_val_fold)

            # compute validation loss (MSE)
            current_val_mse = mean_squared_error(y_val_fold, y_pred_val_kf)

            # check if the iterations should be stopped (early stopping)
            if current_val_mse < best_val_mse:
                best_val_mse = current_val_mse
                patience_counter = 0
                best_model_state = model.get_params()
                best_iteration = iteration
            else:
                patience_counter += 1

            # execute early stopping when patience_counter reaches early_stopping_rounds
            if patience_counter >= early_stopping_rounds:
                main_logger.info(f"Fold {fold}: Early stopping triggered at iteration {iteration} (best at {best_iteration}). Best MSE: {best_val_mse:.4f}")
                break


        # after training epochs, reconstruct the best performing model 
        if best_model_state: model.set_params(**best_model_state)

        # make prediction
        y_pred_val_kf = model.predict(X_val_fold)

        # add MSEs
        mses += mean_squared_error(y_pred_val_kf, y_val_fold)

    # compute the final loss (average of MSEs across folds)
    ave_mse = mses / n_splits
    return ave_mse
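
To make the fold mechanics concrete, here is a minimal pure-Python sketch of how the row indices can be partitioned into folds. It is not scikit-learn's implementation: kfold_indices is a hypothetical helper, it does not shuffle, and it only follows the fold-size rule documented for KFold (the first n_samples % n_splits folds get one extra sample).

```python
def kfold_indices(n_samples: int, n_splits: int = 5):
    """Yield (train_idx, val_idx) index lists for each fold, without shuffling."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        # every row outside the validation window is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=3))
```

Each row lands in exactly one validation fold, so every sample is used for validation once and for training in the remaining folds.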

Then, for the tuning script, we use the gp_minimize function from the Scikit-Optimize library.

The gp_minimize function is used to tune hyperparameters with Bayesian optimization.

This function searches for the hyperparameter set that minimizes the model's error, which is calculated using the run_kfold_validation function defined earlier.

The best-performing hyperparameters are then used to reconstruct and train the final model.

src/model/sklearn_model/scripts/tuning.py

from functools import partial
from skopt import gp_minimize


# define the objective function for Bayesian Optimization using Scikit-Optimize
def objective(params, X_train, y_train, base_model, hparam_names):
    hparams = {item: params[i] for i, item in enumerate(hparam_names)}
    ave_mse = run_kfold_validation(X_train=X_train, y_train=y_train, base_model=base_model, hparams=hparams)
    return ave_mse

# create the search space
hparam_names = [s.name for s in space]
objective_partial = partial(objective, X_train=X_train, y_train=y_train, base_model=base_model, hparam_names=hparam_names)

# search the optimal hyperparameters
results = gp_minimize(
    func=objective_partial,
    dimensions=space,
    n_calls=n_calls,
    random_state=42,
    verbose=False,
    n_initial_points=10,
)
# results
best_hparams = dict(zip(hparam_names, results.x))
best_mse = results.fun

# reconstruct the model with the best hyperparameters
best_model = base_model(**best_hparams)

# retrain the model with full training dataset
best_model.fit(X_train, y_train)
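
The optimizer passes the objective a plain list of values, one per search dimension, which is why the script maps positions back to hyperparameter names and binds the remaining arguments with functools.partial. The sketch below illustrates only that wiring. The toy_score function is a made-up stand-in for run_kfold_validation, and the brute-force min over three candidates stands in for the optimizer; it is not Bayesian optimization.

```python
from functools import partial


def objective(params, hparam_names, score_fn):
    """Map positional values back to named hyperparameters, then score them."""
    hparams = dict(zip(hparam_names, params))
    return score_fn(**hparams)


def toy_score(alpha, l1_ratio):
    # made-up error surface with its minimum at alpha=0.3, l1_ratio=0.5
    return (alpha - 0.3) ** 2 + (l1_ratio - 0.5) ** 2


# bind everything except the positional parameter list
objective_partial = partial(
    objective, hparam_names=["alpha", "l1_ratio"], score_fn=toy_score
)

candidates = [[0.1, 0.1], [0.3, 0.5], [1.0, 0.9]]
best_params = min(candidates, key=objective_partial)
```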

Step 2: Configure Feature/Model Stores in S3

The trained models and processed data are stored in an S3 bucket: the models as serialized model files and the processed data as Parquet files.

We’ll draft the s3_upload function where the Boto3 client, a low-level interface to an AWS service, initiates the connection to S3:

import os
import boto3
from dotenv import load_dotenv

from src._utils import main_logger

def s3_upload(file_path: str):
    # initiate the boto3 client
    load_dotenv(override=True)
    S3_BUCKET_NAME = os.environ.get('S3_BUCKET_NAME') # the bucket created in s3
    s3_client = boto3.client('s3', region_name=os.environ.get('AWS_REGION_NAME')) # your default region

    if s3_client:
        # create s3 key and upload the file to the bucket
        s3_key = file_path if file_path[0] != '/' else file_path[1:]
        s3_client.upload_file(file_path, S3_BUCKET_NAME, s3_key)
        main_logger.info(f"file uploaded to s3://{S3_BUCKET_NAME}/{s3_key}")
    else:
        main_logger.error('failed to create an S3 client.')

Model Store

Trained PyTorch models are serialized (converted) into .pth files.

Then, these files are uploaded to the S3 bucket, enabling the system to load the trained model when it performs inference in production.

import torch

from src._utils import s3_upload

# model serialization, store in local
torch.save(trained_model.state_dict(), MODEL_FILE_PATH)

# upload to s3 model store
s3_upload(file_path=MODEL_FILE_PATH)

Feature Store

The processed data is saved in both CSV and Parquet file formats.

Then, the Parquet files are uploaded to the S3 bucket, enabling the system to load the lightweight data when it creates prediction data to perform inference in production.

from src._utils import s3_upload

# store csv and parquet files in local
df.to_csv(file_path, index=False)
df.to_parquet(DATA_FILE_PATH, index=False)

# store in s3 feature store
s3_upload(file_path=DATA_FILE_PATH)

# trained preprocessor is also stored to transform the prediction data
s3_upload(file_path=PREPROCESSOR_PATH)

Step 3: Create a Flask Application with API Endpoints

Next, we’ll create a Flask application with API endpoints.

The Flask application is configured in the app.py file located at the root of the project repository.

As shown in the code snippet below, the app.py file needs to contain the following components, in this order:

AWS Boto3 client setup,
Flask app configuration and API endpoint setup,
Loading the trained preprocessor, processed input data X_test, and trained models,
The handler that Lambda invokes via API Gateway, and
The local test section.

Note that X_test should never be used during model training to avoid data leakage.

app.py

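import os
import json

import awsgi
import boto3
import numpy as np
import pandas as pd
import torch
from flask import Flask, request, jsonify
from flask_cors import cross_origin
from waitress import serve
from dotenv import load_dotenv

from src._utils import main_logger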

# global variables (will be loaded from the S3 buckets)
_redis_client = None
X_test = None
preprocessor = None
model = None
backup_model = None

# load env if local else skip (lambda refers to env in production)
AWS_LAMBDA_RUNTIME_API = os.environ.get('AWS_LAMBDA_RUNTIME_API', None)
if AWS_LAMBDA_RUNTIME_API is None: load_dotenv(override=True)


#### <---- 1. AWS BOTO3 CLIENT ---->
# boto3 client 
S3_BUCKET_NAME = os.environ.get('S3_BUCKET_NAME', 'ml-sales-pred')
s3_client = boto3.client('s3', region_name=os.environ.get('AWS_REGION_NAME', 'us-east-1'))
try:
    # test connection to boto3 client
    sts_client = boto3.client('sts')
    identity = sts_client.get_caller_identity()
    main_logger.info(f"Lambda is using role: {identity['Arn']}")
except Exception as e:
    main_logger.error(f"Lambda credentials/permissions error: {e}")

#### <---- 2. FLASK CONFIGURATION & API ENDPOINTS ---->
# configure the flask app
app = Flask(__name__)
app.config['CORS_HEADERS'] = 'Content-Type'

# add a simple API endpoint to serve the prediction by price point to test
@app.route('/v1/predict-price/<stockcode>', methods=['GET', 'OPTIONS'])
@cross_origin(origins=origins, methods=['GET', 'OPTIONS'], supports_credentials=True)
def predict_price(stockcode):
    df_stockcode = None

    # fetch request params
    data = request.args.to_dict()

    try:
        # fetch cache
        if _redis_client is not None:
            # returns cached prediction results if any without performing inference
            cached_prediction_result = _redis_client.get(cache_key_prediction_result_by_stockcode)
            if cached_prediction_result:
                # cached values are stored as JSON strings, so decode them before returning
                return jsonify(json.loads(cached_prediction_result))

            # historical data of the selected product
            cached_df_stockcode = _redis_client.get(cache_key_df_stockcode)
            if cached_df_stockcode: df_stockcode = json.loads(cached_df_stockcode)


        # define the price range to make predictions. can be a request param, or historical min/max prices
        min_price = float(data.get('unitprice_min', df_stockcode['unitprice_min'][0]))
        max_price = float(data.get('unitprice_max', df_stockcode['unitprice_max'][0]))

        # create bins in the price range. when the number of the bins increase, the prediction becomes more smooth, but requires more computational cost
        NUM_PRICE_BINS = int(data.get('num_price_bins', 100))
        price_range = np.linspace(min_price, max_price, NUM_PRICE_BINS)

        # create a prediction dataset by merging X_test (dataset never used in model training) and df_stockcode
        price_range_df = pd.DataFrame({ 'unitprice': price_range })
        test_sample = X_test.sample(n=1000, random_state=42)
        test_sample_merged = test_sample.merge(price_range_df, how='cross') if X_test is not None else price_range_df
        test_sample_merged.drop('unitprice_x', axis=1, inplace=True)
        test_sample_merged.rename(columns={'unitprice_y': 'unitprice'}, inplace=True)

        # preprocess the dataset
        X = preprocessor.transform(test_sample_merged) if preprocessor else test_sample_merged

        # perform inference
        y_pred_actual = None
        epsilon = 0
        # try using the primary model
        if model:
            input_tensor = torch.tensor(X, dtype=torch.float32)
            model.eval()
            with torch.inference_mode():
                y_pred = model(input_tensor)
                y_pred = y_pred.cpu().numpy().flatten()
                y_pred_actual = np.exp(y_pred + epsilon)

        # if not, use backups
        elif backup_model:
            y_pred = backup_model.predict(X)
            y_pred_actual = np.exp(y_pred + epsilon)


        # finalize the outcome for client app
        df_ = test_sample_merged.copy()
        df_['quantity'] = np.floor(y_pred_actual) # quantity must be an integer
        df_['sales'] = df_['quantity'] * df_['unitprice'] # compute sales
        df_ = df_.sort_values(by='unitprice')

        # aggregate the results by the unitprice in the price range
        df_results = df_.groupby('unitprice').agg(
            quantity=('quantity', 'median'),
            quantity_min=('quantity', 'min'),
            quantity_max=('quantity', 'max'),
            sales=('sales', 'median'),
        ).reset_index()

        # find the optimal price point
        optimal_row = df_results.loc[df_results['sales'].idxmax()]
        optimal_price = optimal_row['unitprice']
        optimal_quantity = optimal_row['quantity']
        best_sales = optimal_row['sales']

        all_outputs = []
        for _, row in df_results.iterrows():
            current_output = {
                "stockcode": stockcode,
                "unit_price": float(row['unitprice']),
                'quantity': int(row['quantity']),
                'quantity_min': int(row['quantity_min']),
                'quantity_max': int(row['quantity_max']),
                "predicted_sales": float(row['sales']),
            }
            all_outputs.append(current_output)

        # store the prediction results in cache
        if all_outputs and _redis_client is not None:
            serialized_data = json.dumps(all_outputs)
            _redis_client.set(
                cache_key_prediction_result_by_stockcode,
                serialized_data,
                ex=3600     # expire in an hour
            )

        # return a list of all outputs
        return jsonify(all_outputs)

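    except Exception as e:
        # log the failure instead of silently returning an empty result
        main_logger.error(f'failed to serve the prediction: {e}')
        return jsonify([])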


# request header management (for the process from API gateway to the Lambda)
@app.after_request
def add_header(response):
    response.headers['Cache-Control'] = 'public, max-age=0'
    response.headers['Access-Control-Allow-Origin'] = CLIENT_A
    response.headers['Access-Control-Allow-Headers'] = 'Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token,Origin'
    response.headers['Access-Control-Allow-Methods'] = 'GET, POST, OPTIONS'
    response.headers['Access-Control-Allow-Credentials'] = 'true'
    return response

#### <---- 3. LOADING PROCESSOR, DATASET, AND MODELS ---->
load_preprocessor()
load_x_test()
load_model()

#### <---- 4. INVOKE LAMBDA ---->
def handler(event, context):
    main_logger.info("lambda handler invoked.")
    try:
        # connecting the redis client after the lambda is invoked
        get_redis_client()
    except Exception as e:
        main_logger.critical(f"failed to establish initial Redis connection in handler: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps({'error': 'Failed to initialize Redis client. Check environment variables and network config.'})
        }

    # use the awsgi package to convert JSON to WSGI
    return awsgi.response(app, event, context)


#### <---- 5. FOR LOCAL TEST ---->
# serve the application locally on WSGI server, waitress
# lambda will ignore this section.
if __name__ == '__main__':   
    if os.getenv('ENV') == 'local':
        main_logger.info("...start the operation (local)...")
        serve(app, host='0.0.0.0', port=5002)
    else:
        app.run(host='0.0.0.0', port=8080)

I’ll test the endpoint locally using the uv package manager:

$uv run app.py --cache-clear

$curl http://localhost:5002/v1/predict-price/{STOCKCODE}

The system provided a list of sales predictions for each price point:

Fig. Screenshot of the Flask app local response
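
The aggregation and price selection inside the endpoint can be illustrated without pandas. The following is a simplified sketch under stated assumptions: summarize_by_price and optimal_row are hypothetical helpers, and the (price, quantity) pairs are made-up numbers, not model output.

```python
import math
from statistics import median


def summarize_by_price(predictions):
    """Group (unit_price, predicted_quantity) pairs by price and aggregate them."""
    grouped = {}
    for price, raw_quantity in predictions:
        # quantity must be an integer, so round down as the endpoint does
        grouped.setdefault(price, []).append(math.floor(raw_quantity))

    rows = []
    for price in sorted(grouped):
        quantities = grouped[price]
        rows.append({
            "unit_price": price,
            "quantity": median(quantities),
            "quantity_min": min(quantities),
            "quantity_max": max(quantities),
            "predicted_sales": median(q * price for q in quantities),
        })
    return rows


def optimal_row(rows):
    """Pick the price point with the highest median predicted sales."""
    return max(rows, key=lambda row: row["predicted_sales"])


# made-up predictions: three sampled rows for each of three price points
predictions = [
    (1.0, 10.9), (1.0, 12.2), (1.0, 8.0),
    (2.0, 6.5), (2.0, 7.1), (2.0, 5.9),
    (3.0, 2.0), (3.0, 3.9), (3.0, 1.2),
]
rows = summarize_by_price(predictions)
best = optimal_row(rows)
```

Using the median per price point keeps a few extreme sampled rows from dominating the recommended price.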

Key Points on Flask App Configuration

There are various points you should take into consideration when configuring a Flask application with Lambda. Let’s go over them now:

1. A Few API Endpoints Per Container

Adding many API endpoints to a single serverless instance can turn it into a monolithic function, where an issue in one endpoint affects the others.

In this project, we’ll focus on a single endpoint per container – and if needed, we can add separate Lambda functions to the system.

2. Understanding the `handler` Function and the role of AWSGI

The handler function is invoked every time the Lambda function receives a client request from the API Gateway.

The function takes the event argument that includes the request details in a JSON dictionary and passes it to the Flask application.

AWSGI acts as an adapter, translating a Lambda event in JSON format into a WSGI request that a Flask application can understand, and converts the application’s response back into a JSON format that Lambda and API Gateway can process.
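
To show what such an adapter does conceptually, here is a heavily simplified sketch using only the standard library. It is not the awsgi package's implementation: event_to_environ, call_wsgi, and demo_app are hypothetical names, only a few fields of an API Gateway REST proxy event are handled, and base64-encoded bodies and multi-value headers are ignored.

```python
import io
import json
from urllib.parse import urlencode


def event_to_environ(event: dict) -> dict:
    """Build a minimal WSGI environ from an API Gateway proxy event."""
    query = event.get("queryStringParameters") or {}
    body = (event.get("body") or "").encode("utf-8")
    return {
        "REQUEST_METHOD": event.get("httpMethod", "GET"),
        "PATH_INFO": event.get("path", "/"),
        "QUERY_STRING": urlencode(query),
        "SERVER_NAME": "lambda",
        "SERVER_PORT": "443",
        "SERVER_PROTOCOL": "HTTP/1.1",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.version": (1, 0),
        "wsgi.url_scheme": "https",
        "wsgi.input": io.BytesIO(body),
        "wsgi.errors": io.StringIO(),
        "wsgi.multithread": False,
        "wsgi.multiprocess": False,
        "wsgi.run_once": False,
    }


def call_wsgi(app, event: dict) -> dict:
    """Run a WSGI app for one event and return a Lambda proxy style response."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    chunks = app(event_to_environ(event), start_response)
    return {
        "statusCode": int(captured["status"].split()[0]),
        "headers": dict(captured["headers"]),
        "body": b"".join(chunks).decode("utf-8"),
    }


def demo_app(environ, start_response):
    """Tiny WSGI app standing in for the Flask application."""
    payload = json.dumps({"path": environ["PATH_INFO"], "query": environ["QUERY_STRING"]})
    start_response("200 OK", [("Content-Type", "application/json")])
    return [payload.encode("utf-8")]
```

In the real application, Flask plays the role of demo_app and awsgi.response performs the translation in both directions.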

3. Using Cache Storage

The get_redis_client function is called when the handler function is invoked by the API Gateway. This allows the Flask application to store or fetch cached data through the Redis client:

import os

import redis
import redis.cluster
from redis.cluster import ClusterNode

from src._utils import main_logger

_redis_client = None

def get_redis_client():
    global _redis_client
    if _redis_client is None:
        REDIS_HOST = os.environ.get("REDIS_HOST")
        REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
        REDIS_TLS = os.environ.get("REDIS_TLS", "true").lower() == "true"
        try:
            startup_nodes = [ClusterNode(host=REDIS_HOST, port=REDIS_PORT)]
            _redis_client = redis.cluster.RedisCluster(
                startup_nodes=startup_nodes,
                decode_responses=True,
                skip_full_coverage_check=True,
                ssl=REDIS_TLS,                  # elasticache has encryption in transit: enabled -> must be true
                ssl_cert_reqs=None,
                socket_connect_timeout=5,
                socket_timeout=5,
                health_check_interval=30,
                retry_on_timeout=True,
                retry_on_error=[
                    redis.exceptions.ConnectionError,
                    redis.exceptions.TimeoutError
                ],
                max_connections=10,            # limit connections for Lambda
                max_connections_per_node=2     # limit per node
            )
            _redis_client.ping()
            main_logger.info("successfully connected to ElastiCache Redis Cluster (Configuration Endpoint)")
        except Exception as e:
            main_logger.error(f"an unexpected error occurred during Redis Cluster connection: {e}", exc_info=True)
            _redis_client = None
    return _redis_client
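
The endpoint uses this client in a read-through pattern: return the cached JSON if present, otherwise compute the result and store it with an expiry. The sketch below isolates that pattern. FakeRedis is an in-memory stand-in that mimics only the get and set(ex=...) calls used here, and cached_json is a hypothetical helper, not part of the article's code.

```python
import json
import time


class FakeRedis:
    """In-memory stand-in exposing the get / set(ex=...) subset used below."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires_at = self._store.get(key, (None, None))
        if expires_at is not None and time.monotonic() >= expires_at:
            self._store.pop(key, None)  # drop expired entries lazily
            return None
        return value

    def set(self, key, value, ex=None):
        expires_at = time.monotonic() + ex if ex is not None else None
        self._store[key] = (value, expires_at)


def cached_json(client, key, compute, ttl_seconds=3600):
    """Return the cached JSON value for key, computing and caching it on a miss."""
    if client is not None:
        cached = client.get(key)
        if cached is not None:
            return json.loads(cached)
    result = compute()
    # only cache non-empty results, mirroring the endpoint's behavior
    if client is not None and result:
        client.set(key, json.dumps(result), ex=ttl_seconds)
    return result
```

Passing None as the client skips the cache entirely, which matches how the endpoint behaves when the Redis connection could not be established.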

4. Handling Heavy Tasks Outside of the `handler` Function

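Serverless functions can experience cold starts: the first request to a new execution environment takes longer because the container and the application code must be initialized first.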

While a Lambda function can run for up to 15 minutes, its associated API Gateway has a timeout of 29 seconds (29,000 ms) for a RESTful API.

So, any heavy tasks like loading preprocessors, input data, or models should be performed once outside of the handler function, ensuring they are ready before the API endpoint is called.

Here are the loading functions called in app.py.

app.py

import os
import joblib
import pandas as pd
import torch

from src._utils import s3_load, s3_load_to_temp_file

preprocessor = None
X_test = None
model = None
backup_model = None


# load processor
def load_preprocessor():
    global preprocessor
    preprocessor_tempfile_path = s3_load_to_temp_file(PREPROCESSOR_PATH)
    preprocessor = joblib.load(preprocessor_tempfile_path)
    os.remove(preprocessor_tempfile_path)


# load input data
def load_x_test():
    global X_test
    x_test_io = s3_load(file_path=X_TEST_PATH)
    X_test = pd.read_parquet(x_test_io)


# load model
def load_model():
    global model, backup_model
    # try loading & reconstructing the primary model
    try:
        # first load io file from the s3 bucket
        model_data_bytes_io_ = s3_load(file_path=DFN_FILE_PATH)
        # convert to checkpoint dictionary (containing hyperparameter set)
        checkpoint_ = torch.load(
            model_data_bytes_io_, 
            weights_only=False, 
            map_location=device
        )
        # reconstruct the model
        model = t.scripts.load_model(checkpoint=checkpoint_, file_path=DFN_FILE_PATH)
        # set the model evaluation mode
        model.eval()

    # else, backup model
    except Exception:
        load_artifacts_backup_model()

Step 4: Publish a Docker Image to ECR

After configuring the Flask application, we’ll containerize the entire application with Docker.

Containerization packages the application together with its dependencies and configuration (and, in a machine learning context, its models) into a container.

Docker creates a container image based on the instructions defined in a Dockerfile, and the Docker engine uses the image to run the isolated container.

In this project, we’ll upload the Docker container image to ECR, so the Lambda function can access it in production.

Before building the image, we’ll define the .dockerignore file to optimize the container image:

.dockerignore

# any irrelevant data
__pycache__/
.ruff_cache/
.DS_Store/
.venv/
dist/
.vscode
*.psd
*.pdf
[a-f]*.log
tmp/
awscli-bundle/

# add any experimental models, unnecessary data
dfn_bayesian/
dfn_grid/
data/
notebooks/

Dockerfile

# serve from aws ecr 
FROM public.ecr.aws/lambda/python:3.12

# define a working directory in the container
WORKDIR /app

# copy the entire repository (except the files listed in .dockerignore) into the container at /app
COPY . /app/

# install dependencies defined in the requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# define commands
ENTRYPOINT [ "python" ]
CMD [ "-m", "awslambdaric", "app.handler" ]

Test in Local

Next, we’ll test the Docker image by building it locally with the name my-app:

$docker build -t my-app -f Dockerfile .

Then, we’ll run the container locally with the waitress server:

$docker run -p 5002:5002 -e ENV=local my-app app.py

The -e ENV=local flag sets the environment variable inside the container, which will trigger the waitress.serve() call in the app.py.

In the terminal, you’ll find the log message that app.py emits on startup ("...start the operation (local)...").

You can also call the endpoint created to see the results returned:

$uv run app.py --cache-clear

$curl http://localhost:5002/v1/predict-price/{STOCKCODE}

Publish the Docker Image to ECR

To publish the Docker image, we first need to configure the default AWS credentials and region:

From the AWS account console, issue an access key and check the default region.
Store them in the ~/.aws/credentials and ~/.aws/config files:

~/.aws/credentials

[default]
aws_secret_access_key=
aws_access_key_id=

~/.aws/config

[default]
region=

After the configuration, we’ll publish the Docker image to ECR.

# authenticate the docker client to ECR
$aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com

# create repository
$aws ecr create-repository --repository-name <repository-name> --region <region>

# tag the docker image
$docker tag <repository-name>:<tag> <account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:<tag>

# push
$docker push <account-id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:<tag>

Here’s what’s going on:

<region>: Your default AWS region (for example, us-east-1).
<account-id>: Your 12-digit AWS account ID.
<repository-name>: Your desired repository name.
<tag>: Your desired tag name (for example, v1.0).
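
The image address used by the tag and push commands is assembled from the account ID, the region, the repository name, and the tag. As a small illustration, the sketch below builds that address in Python. ecr_image_uri is a hypothetical helper, and the account ID in the example is a dummy value.

```python
import re


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Build the full ECR image address for a private repository."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("account_id must be a 12-digit string")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# dummy account ID, for illustration only
image_uri = ecr_image_uri("123456789012", "us-east-1", "my-app", "v1.0")
```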

Now, the Docker image is stored in ECR with the tag:

Fig. Screenshot of the AWS ECR console

Step 5: Create a Lambda Function

Next, we’ll create a Lambda function.

From the Lambda console, choose:

The Container Image option,
The container image URI from the drop-down list,
A function name of our choice, and
An architecture type (arm64 is recommended for a better price-performance).

Fig. Screenshot of AWS Lambda function configuration

The Lambda function my-app was successfully launched.

Connect the Lambda function to API Gateway

Next, we’ll add API gateway as an event trigger to the Lambda function.

First, visit the API Gateway console and create REST API methods using the ARN of the Lambda function:

Fig. Screenshot of the AWS API Gateway configuration

Then, add resources to the created API gateway to create an endpoint:
API Gateway > APIs > Resources > Create Resource

Align the resource endpoint with the API endpoint defined in the app.py.
Configure CORS (for example, accept specific origins).
Deploy the resource to the stage.

Going back to the Lambda console, you’ll find the API Gateway is connected as an event trigger:
Lambda > Function > my-app (your function name)

Fig. Screenshot of the AWS Lambda dashboard

Step 6: Configure AWS Resources

Lastly, we’ll configure the related AWS resources to make the system work in production.

This process involves the following steps:

1. The IAM Role: Controls Who Can Access Resources

AWS uses IAM roles to grant temporary, secure permissions to users and services, mitigating the security risks of long-term credentials like passwords.

An IAM role relies on policies to grant access to selected services. Policies can be managed by AWS or customized by the user as inline policies.

It is important to avoid overly permissive access rights for the IAM role.

In the Lambda function console, check the execution role:
Lambda > Function > <your function name> > Permission > The execution role.
Set up the following policies to allow the Lambda’s IAM role to handle necessary operations:
- Lambda AWSLambdaExecute: Allows executing the function.
- EC2 Inline policy: Allows controlling the security group and the VPC of the Lambda function.
- ECR AmazonElasticContainerRegistryPublicFullAccess + Inline policy: Allows storing and pulling the Docker image.
- ElastiCache AmazonElastiCacheFullAccess + Inline policy: Allows storing and pulling caches.
- S3 AmazonS3ReadOnlyAccess + Inline policy: Allows reading and storing contents.

Now, the IAM role can access these resources and perform the allowed actions.

2. The Security Group: Controls Network Traffic

A security group is a virtual firewall that controls inbound and outbound network traffic for AWS resources.

It uses stateful (allowing return traffic automatically) “allow-only” rules based on protocol, port, and IP address, where it denies all traffic by default.
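As a rough mental model of this allow-only behavior, here is a toy sketch in Python. The AllowRule class and is_allowed function are hypothetical names for illustration and are not part of any AWS SDK. The sketch also leaves out statefulness (automatic return traffic) entirely.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AllowRule:
    protocol: str   # for example, "tcp"
    port_from: int  # first port in the allowed range
    port_to: int    # last port in the allowed range
    cidr: str       # allowed address range, for example "0.0.0.0/0"


def is_allowed(rules, protocol, port, address):
    """Return True if any allow rule matches; otherwise the traffic is denied."""
    for rule in rules:
        if (
            rule.protocol == protocol
            and rule.port_from <= port <= rule.port_to
            and ip_address(address) in ip_network(rule.cidr)
        ):
            return True
    return False  # default deny: no rule matched


# One outbound rule: HTTPS to anywhere (addresses below are example values)
outbound = [AllowRule("tcp", 443, 443, "0.0.0.0/0")]
print(is_allowed(outbound, "tcp", 443, "203.0.113.10"))  # True
print(is_allowed(outbound, "tcp", 6379, "10.0.1.25"))    # False
```

There is no "deny" rule type in this model: traffic is blocked simply because nothing allows it, which is why each required path needs its own rule.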

Create a new security group for the Lambda function:
EC2 > Security Groups >

Now, we’ll set up the inbound and outbound traffic rules.

The inbound rules:

S3 → Lambda: Type: HTTPS / Protocol: TCP / Port range: 443 / Source: Custom*
ElastiCache → Lambda: Type: Custom TCP / Port range: 6379 / Source: Custom*

*Choose the created security group for the Lambda function as a custom source.

The outbound rules:

Lambda → Internet: Type: HTTPS / Protocol: TCP / Port range: 443 / Destination: 0.0.0.0/0
ElastiCache → Internet: Type: All Traffic / Destination: 0.0.0.0/0

3. The Virtual Private Cloud (VPC)

A Virtual Private Cloud (VPC) provides a logically isolated private network for the AWS resources, acting as our own private data center within AWS.

AWS can create a Hyperplane ENI (Elastic Network Interface) for the Lambda function and its connected resources in the subnets of the VPC.

Though it’s optional, we’ll use the VPC to connect the Lambda function to the S3 storage and ElastiCache.

This process involves:

Creating a VPC from the VPC console: VPC > Create VPC.
Creating an STS (Security Token Service) endpoint:
VPC > PrivateLink and Lattice > Endpoints > Create Endpoint >
- Type: AWS Service
- Service name: com.amazonaws.<region>.sts
- Type: Interface
- VPC: Select the VPC created earlier.
- Subnets: Select all subnets.
- Security groups: Select the security group of the Lambda function.
- Policy: Full access
- Enable DNS names

The VPC must have a dedicated endpoint for STS to receive temporary credentials from STS.

Creating an S3 endpoint in the VPC:
VPC > PrivateLink and Lattice > Endpoints > Create Endpoint >
- Type: AWS Service
- Service name: com.amazonaws.<region>.s3
- Type: Gateway
- VPC: Select the VPC created earlier.
- Subnets: Select all subnets.
- Security groups: Select the security group of the Lambda function.
- Policy: Full access

Lastly, check the security group of the Lambda function and ensure that its VPC ID points to the VPC you created: EC2 > Security Groups > <your security group> > VPC ID.

That’s all for the deployment flow.

We can now test the API endpoint in production. Copy the Invoke URL of the deployed API endpoint: API Gateway > APIs > Stages > Invoke URL. Then call the API endpoint and check that it returns predictions:

$curl -H 'Authorization: Bearer YOUR_API_TOKEN' -H 'Accept: application/json' \
     '<INVOKE_URL>/<ENDPOINT>'
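The same call can also be made from Python with only the standard library, which is handy for smoke tests. This is a sketch under assumptions: the host below is a made-up placeholder rather than a real Invoke URL, the helper names are hypothetical, and the response is assumed to be JSON.

```python
import json
import urllib.request


def build_prediction_request(invoke_url: str, path: str, api_token: str) -> urllib.request.Request:
    """Create a GET request carrying the same headers as the curl command."""
    url = f"{invoke_url.rstrip('/')}/{path.lstrip('/')}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_predictions(request: urllib.request.Request, timeout: float = 30.0):
    """Send the request and decode the JSON body (assumes a JSON response)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder Invoke URL and token; replace them with your own values
request = build_prediction_request(
    "https://example.execute-api.us-east-1.amazonaws.com/prod",
    "/v1/predict-price/85123A",
    "YOUR_API_TOKEN",
)
print(request.full_url)
# predictions = fetch_predictions(request)  # uncomment once the URL is real
```

Separating request construction from sending makes it possible to check the URL and headers without touching the network.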

For logging and debugging, we’ll use the LiveTail of CloudWatch: CloudWatch > LiveTail.

Building a Client Application (Optional)

For full-stack deployment, we’ll build a simple React application to display the prediction using the recharts library for visualization.

Other options for quick frontend deployment include Streamlit or Gradio.

The React Application

The React application creates a web page that fetches and visualizes sales predictions from an external API, recommending an optimal price point.

The app uses useState to manage its data and state, including the selected product, the list of sales predictions, and the loading/error status.

When the user initiates a request, a useEffect hook triggers a fetch request to a Flask backend. It handles the API response as a data stream, processing it line by line to progressively update the predictions.

The AreaChart from the recharts library then visualizes this data. The X-axis represents the price and the Y-axis represents the sales. The chart updates in real-time as the data streams in. Finally, the app displays the optimal price once all the predictions are received.

App.js: (in a separate React app)

import { useState, useEffect } from "react"
import { AreaChart, Area, XAxis, YAxis, CartesianGrid, Tooltip, ResponsiveContainer, ReferenceLine } from 'recharts'


function App() {
  // state
  const [predictions, setPredictions] = useState([])
  const [start, setStart] = useState(false)
  const [isLoading, setIsLoading] = useState(false)

  // product data (productOptions and the Loading component are assumed to be defined elsewhere in the app)
  let selectedStockcode = '85123A'
  let selectedProduct = productOptions.filter(item => item.id === selectedStockcode)[0]

  // api endpoint
  const flaskBackendUrl = "YOUR FLASK BACKEND URL"

  // create chart data to display
  const chartDataSales = predictions && predictions.length > 0
    ? predictions
      .map(item => ({
        price: item.unit_price,
        sales: item.predicted_sales,
        volume: item.unit_price !== 0 ? item.predicted_sales / item.unit_price : 0
      }))
      .sort((a, b) => a.price - b.price)
    : [...selectedProduct['histPrices']]

  // optimal price to display (copy before sorting so the state array is not mutated)
  const optimalPrice = predictions.length > 0
    ? [...predictions].sort((a, b) => b.predicted_sales - a.predicted_sales)[0]['unit_price']
    : 0

  // fetch prediction results
  useEffect(() => {
    const handlePrediction = async () => {
      setIsLoading(true)
      setPredictions([])
      const errorPrices = selectedProduct['errorPrices']

      await fetch(flaskBackendUrl)
        .then(res => {
          if (res.status !== 200) { setPredictions(errorPrices); setIsLoading(false); setStart(false) }
          else return Promise.resolve(res.clone().json())
        })
        .then(res => {
          if (res && res.length > 0) setPredictions(res)
          else setPredictions(errorPrices)
          setIsLoading(false); setStart(false)
        })
        .catch(err => { setPredictions(errorPrices); setIsLoading(false); setStart(false) })
        .finally(() => setStart(false))
    }

    if (start) handlePrediction()
    if (predictions && predictions.length > 0) setStart(false)
  }, [flaskBackendUrl, start])


  // render
  if (isLoading) return <Loading />
  return (
    <div>
      <ResponsiveContainer width="100%" height="100%">
        <AreaChart
          key={chartDataSales.length}
          data={chartDataSales}
          margin={{ top: 10, right: 30, left: 0, bottom: 0 }}
        >
          <CartesianGrid strokeDasharray="3 3" strokeOpacity={0.6} />

          <XAxis
            dataKey="price"
            label={{ value: "Unit Price ($)", position: "insideBottom", offset: 0, fontSize: 12, marginTop: 10 }}
            tickFormatter={(tick) => `$${parseFloat(tick).toFixed(2)}`}
            tick={{ fontSize: 12 }}
            padding={{ left: 20, right: 20 }}
          />

          <YAxis
            label={{ value: "Predicted Sales ($)", angle: -90, position: "insideLeft", fontSize: 12 }}
            tick={{ fontSize: 12 }}
            tickFormatter={(tick) => `$${tick.toLocaleString()}`}
          />

          {/* tooltips with the prediction result data */}
          <Tooltip
            contentStyle={{
              borderRadius: '8px',
              padding: '10px',
              boxShadow: '0px 0px 15px rgba(0,0,0,0.5)'
            }}
            formatter={(value, name) => {
              if (name === 'sales') {
                return [`$${value.toFixed(4)}`, 'Predicted Sales']
              }
              if (name === 'volume') {
                return [`${value.toFixed(0)}`, 'Volume']
              }
              return value
            }}
            labelFormatter={(label) => `Price: $${label.toFixed(2)}`}
          />

          {/* chart area = sales */}
          <Area
            type="monotone"
            dataKey="sales"
            fillOpacity={1}
            fill="url(#colorSales)"
          />

          {/* vertical line for the optimal price */}
          {optimalPrice &&
            <ReferenceLine
              x={optimalPrice}
              strokeDasharray="4 4"
              ifOverflow="visible"
              label={{
                value: `Optimal Price: $${optimalPrice !== null && optimalPrice > 0 ? Math.ceil(optimalPrice * 10000) / 10000 : ''}`,
                position: "right",
                fontSize: 12,
                offset: 10
              }}
            />
          }
        </AreaChart>
      </ResponsiveContainer>

      {optimalPrice > 0 && <p>Optimal Price: $ {Math.ceil(optimalPrice * 10000) / 10000}</p>}

    </div>
  )
}

export default App

Final Results

Now, the application is ready to serve.

You can explore the UI from here.

All code (backend) is available in my Github Repo.

Conclusion

Building a machine learning system requires thoughtful project scoping and architecture design.

In this article, we built a dynamic pricing system as a simple single interface on containerized serverless architecture.

Moving forward, we’d need to consider potential drawbacks of this minimal architecture:

Longer cold starts: The awsgi WSGI adapter layer adds a small overhead, and larger container images take longer to load.
Monolithic function: Adding endpoints to the Lambda function can lead to a monolithic function where an issue in one endpoint impacts others.
Less granular observability: AWS CloudWatch cannot provide individual invocation/error metrics per API endpoint without custom instrumentation.

To scale the application effectively, extracting functionality into separate microservices is a good next step.

I’m Kuriko IWAI, and you can find more of my work and learn more about me here:

Portfolio / LinkedIn / Github

All images, unless otherwise noted, are by the author. This application uses a synthetic dataset licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This information about AWS is current as of August 2025 and is subject to change.

Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code

Kuriko — Fri, 30 May 2025 18:21:29 +0000

The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design.

In this tutorial, I’ll show you how to build both single layer and multi-layer perceptrons (MLPs) across three frameworks:

Custom classifier
Scikit-learn’s MLPClassifier
Keras Sequential classifier using SGD and Adam optimizers.

This will help you learn about their various use cases and how they work.

What is a Perceptron?
How to Build a Single-Layered Classifier
What is a Multi-Layer Perceptron?
How to Build Multi-Layered Perceptrons
Understanding Optimizers
How to Build an MLP Classifier with SGD Optimizer
How to Build an MLP Classifier with Adam Optimizer
Final Results: Generalization
Conclusion

Prerequisites

Mathematics (Calculus, Linear Algebra, Statistics)
Coding in Python
Basic understanding of Machine Learning concepts

What is a Perceptron?

A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.

A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.

But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.

The perceptron consists of four main parts:

Input layer: Takes the initial numerical values into the system for further processing.
Weights: Combines input values with weights (and bias terms).
Activation function: Determines whether the neuron should fire based on the threshold value.
Output layer: Produces classification result.

It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.

So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.

Applications of Perceptrons

Perceptrons are applied to tasks such as:

Image classification: Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.
Linear regression: Perceptrons can predict continuous outputs based on input features. This makes them useful for solving linear regression problems.

How the Activation Function Works

For a single perceptron used for binary classification, the most common activation function is the step function (also known as the threshold function):

$$\phi(z) = \begin{cases} 1 &\text{if } z \geq \theta \\ \\ 0 &\text{if } z < \theta \end{cases}$$

where:

ϕ(z): the output of the activation function.
z: the weighted sum of the inputs plus the bias:

$$z = \sum_{i=1}^m w_i x_i + b$$

(x_i: input values, w_i: weight associated with each input, b: bias term)

θ is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.

In that case, the formula becomes:

$$\phi(z) = \begin{cases} 1 &\text{if } z \geq 0 \\ \\ 0 &\text{if } z < 0 \end{cases}$$

When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.

This occurs when the weighted sum is greater than or equal to zero, leading the perceptron to predict that the input belongs to this class.
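To make the formulas concrete, here is a minimal pure-Python sketch of the weighted sum and the step activation. The function names and the example numbers are illustrative assumptions, and the boundary convention follows the formula above (z ≥ θ maps to 1).

```python
def weighted_sum(inputs, weights, bias):
    """z = sum_i(w_i * x_i) + b"""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def step(z, theta=0.0):
    """Step activation: 1 if z >= theta, else 0."""
    return 1 if z >= theta else 0


x = [1.0, 2.0]     # input values (made up)
w = [0.5, -0.25]   # weights (made up)

# 0.5 * 1.0 + (-0.25) * 2.0 = 0.0, so the bias alone decides the class here
print(step(weighted_sum(x, w, bias=0.1)))   # 1
print(step(weighted_sum(x, w, bias=-0.1)))  # 0
```

The two calls differ only in the bias, which shows how the bias shifts the activation threshold when θ is fixed at zero.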

While the step function is conceptually the original activation for a perceptron, its discontinuity at zero causes computational challenges.

In modern implementations, we can use other activation functions like the sigmoid function:

$$\sigma (z) = \frac {1} {1 + e^{-z}}$$

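The sigmoid function outputs a continuous value between zero and one depending on the weighted sum (z). Applying a cutoff (for example, 0.5) to that value yields a class label of zero or one.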
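Here is a small sketch of the sigmoid in plain Python. Note that the raw output is a continuous value strictly between zero and one; the 0.5 cutoff used to turn it into a class label is a common convention and an assumption here, not something the formula itself dictates.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable sigmoid: 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # for negative z, rewrite the formula to avoid overflow in exp(-z)
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_label(z: float, cutoff: float = 0.5) -> int:
    """Turn the sigmoid output into a 0/1 label using an assumed cutoff."""
    return 1 if sigmoid(z) >= cutoff else 0


for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 4), predict_label(z))
```

Unlike the step function, the sigmoid changes smoothly around z = 0, which is what makes it usable with gradient-based training.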

How the Loss Function Works

The loss function is a crucial concept in machine learning that quantifies the error or discrepancy between the model's predictions and the actual target values.

Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model's parameters in a way that minimizes this error and improves performance.

In a binary classification task, the model may adopt the hinge loss function to penalize misclassifications by incurring an additional cost for incorrect predictions:

$$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$

(h(x): the model's raw prediction score, y: the true label, encoded as −1 or +1)
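A short sketch of the hinge loss in Python follows. It assumes the standard convention in which the true label is encoded as −1 or +1 and h(x) is a raw score, so labels stored as 0/1 are converted first. The helper names are illustrative.

```python
def to_signed_label(y01: int) -> int:
    """Map a 0/1 label to the -1/+1 encoding that the hinge loss expects."""
    return 2 * y01 - 1


def hinge_loss(y_signed: int, score: float) -> float:
    """L(y, h(x)) = max(0, 1 - y * h(x))"""
    return max(0.0, 1.0 - y_signed * score)


# confident and correct: no penalty
print(hinge_loss(to_signed_label(1), 2.0))   # 0.0
# correct side of the boundary but inside the margin: small penalty
print(hinge_loss(to_signed_label(1), 0.5))   # 0.5
# wrong side of the boundary: larger penalty
print(hinge_loss(to_signed_label(0), 0.5))   # 1.5
```

The loss is zero only when the score has the right sign and a magnitude of at least one, so predictions that are correct but not confident are still penalized.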

How to Build a Single-Layered Classifier

Now, let’s build a simple single-layer perceptron for binary classification.

1. Custom Classifier

Initialize the classifier

We’ll first initialize the classifier with weights, bias, number of epochs (n_iterations), and learning rate (learning_rate).

def __init__(self, learning_rate=0.01, n_iterations=1000):
    self.learning_rate = learning_rate
    self.n_iterations = n_iterations
    self.weights = None
    self.bias = None

Define the activation function

Use a step function that returns zero if input (x) ≤ 0, else 1. By default, the threshold is set to zero.

def _step_function(self, x, threshold: int = 0):
     return np.where(x > threshold, 1, 0)

Train the model

Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: weights and bias.

This process is controlled by a specified number of training epochs defined by n_iterations.

In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined learning_rate.

def fit(self, X, y):
    n_samples, n_features = X.shape

    self.weights = np.zeros(n_features)
    self.bias = 0

    for _ in range(self.n_iterations):
        for i in range(n_samples):
            # compute weighted sum (z)
            z = np.dot(X[i], self.weights) + self.bias

            # apply the activation function
            y_pred = self._step_function(z)

            # update weights and bias
            self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
            self.bias += self.learning_rate * (y[i] - y_pred)

How the weights work in the iteration loop

The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.

Its iterative update in the for loop aims to reduce classification errors such that:

$$\begin{align*} w_j &:= w_j + \Delta w_j \\ &:= w_j + \eta (y_i - \hat y_i)x_{ij} \\ &= \begin{cases} w_j &\text{(a) } y_i - \hat y_i = 0\\ w_j + \eta x_{ij} &\text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} &\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$

(w_j: j-th weight, η: learning rate, x_ij: j-th feature of the i-th sample, (y_i − ŷ_i): error)

This means that:

When the prediction is correct, the error is zero, so the weight is unchanged.
When the prediction is too low (y_i = 1 and ŷ_i = 0), the weight moves in the direction of the input, which increases the weighted sum.
When the prediction is too high (y_i = 0 and ŷ_i = 1), the weight moves in the opposite direction, which pulls the weighted sum lower.
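The three cases can be traced by hand with a single update step. The sketch below uses plain Python lists and made-up numbers; perceptron_update is an illustrative helper, not part of the classifier above, though it uses the same z > 0 convention as _step_function and updates the bias with the same error term.

```python
def perceptron_update(weights, bias, x, y_true, learning_rate):
    """Apply one perceptron update for a single sample and return the new state."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    y_pred = 1 if z > 0 else 0
    error = y_true - y_pred  # 0, +1, or -1
    new_weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    new_bias = bias + learning_rate * error
    return new_weights, new_bias, error


x = [2.0, 1.0]

# case (b): prediction too low, weights move toward the input
w, b, err = perceptron_update([0.0, 0.0], 0.0, x, y_true=1, learning_rate=0.5)
print(err, w, b)  # 1 [1.0, 0.5] 0.5

# case (a): prediction correct, nothing changes
print(perceptron_update(w, b, x, y_true=1, learning_rate=0.5))

# case (c): prediction too high, weights move away from the input
print(perceptron_update(w, b, x, y_true=0, learning_rate=0.5))
```

Because the error is always 0, +1, or −1, each update either leaves the parameters alone or shifts them by exactly η times the input.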

How the bias terms work in the iteration loop

The bias determines the decision boundary’s intercept (position from the origin).

Similar to weights, we adjust the bias terms in each epoch to position the decision boundary:

$$\begin {align*} b &:= b + \Delta b \\ & := b + \eta (y_i - \hat y_i) \\ &= \begin{cases} b &\text{(a) } y_i - \hat y_i = 0\\ b + \eta &\text{(b) } y_i - \hat y_i = 1 \\ b - \eta &\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$

This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.

Make a prediction

Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):

def predict(self, X):
      linear_output = np.dot(X, self.weights) + self.bias
      predictions = self._step_function(linear_output)
      return predictions

The entire classifier looks like this:

import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    def _step_function(self, x, threshold: int = 0):
        return np.where(x > threshold, 1, 0)

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iterations):
            for i in range(n_samples):
                linear_output = np.dot(X[i], self.weights) + self.bias
                y_pred = self._step_function(linear_output)
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
        return self

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        y_pred = self._step_function(linear_output)
        return y_pred

Simulate with synthetic datasets

First, we generate a synthetic linearly separable dataset using make_blobs, then train the classifier we created and compute its decision boundary.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import numpy as np

# create a mock dataset
X, y = make_blobs(n_features=2, centers=2, n_samples=1000, random_state=12)

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train the model
perceptron = Perceptron(learning_rate=0.1, n_iterations=1000).fit(X_train, y_train)

# make a prediction
y_pred_train = perceptron.predict(X_train)
y_pred_test = perceptron.predict(X_test)

# evaluate the results
acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(f"Accuracy (Train): {acc_train:.3} \nAccuracy (Test): {acc_test:.3}")

Results

The classifier generated a clear, highly accurate linear decision boundary.

Accuracy (Train): 0.981
Accuracy (Test): 0.975

2. Leverage Scikit-learn’s MLPClassifier

For convenience, we’ll use scikit-learn’s built-in classifier (MLPClassifier) to build a similar, yet more robust classifier:

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(), # intentionally set empty to create a single layer perceptron
    activation='logistic', # choosing a sigmoid function as an activation function
    solver='sgd', # choosing SGD optimizer
    max_iter=1000,
    random_state=42, 
    learning_rate='constant', 
    learning_rate_init=0.1
).fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(f"MLPClassifier\nAccuracy (Train): {acc_train:.3} \nAccuracy (Test): {acc_test:.3}")

Results

The MLPClassifier generated a clear linear decision boundary with slightly better accuracy scores.

Accuracy (Train): 0.985
Accuracy (Test): 0.995

Limitations of Single-Layer Perceptrons

Now, let’s talk about the key differences between the MLPClassifier and our custom single-layer perceptron.

Unlike more general neural networks, single-layer perceptrons use a step function as their activation.

Due to its discontinuity at x=0, the step function is not differentiable over its entire domain (−∞ to ∞).

This fundamental property precludes the use of gradient-based optimization algorithms such as SGD or Adam, as these methods depend on computing gradients (the partial derivatives of the cost function).

In contrast, most neural networks employ differentiable activation functions (for example, sigmoid, ReLU) and loss functions (for example, MSE, Cross-Entropy) for effective optimization.

Other challenges of a single-layer perceptron include:

Limited to linear separability: Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.
Lack of depth: Being single-layered, they cannot learn complex hierarchical representations.
Limited optimizer options: As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.

So, in the next section, you’ll learn how multi-layer perceptrons overcome these limitations.

What is a Multi-Layer Perceptron?

An MLP is a class of feedforward artificial neural network that consists of at least three layers of nodes:

an input layer,
one or more hidden layers, and
an output layer.

Except for the input nodes, each node is a neuron that uses a nonlinear activation function.

MLPs are widely used for classification problems as well as regression:

Classification tasks: MLPs are widely used for classification problems, such as handwriting recognition and speech recognition.
Regression analysis: They are also applied in regression problems where the relationship between input and output is complex.

How to Build Multi-Layered Perceptrons

Let’s handle a binary classification task using a standard MLP architecture.

Outline of the Project

Objective

Detect fraudulent transactions

Evaluation Metrics

Considering the cost of misclassification, we’ll prioritize improving Recall and Precision scores
Then check the accuracy of classification with the Accuracy Score ((TP + TN) / (TP + TN + FP + FN))
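The three metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name is illustrative and the counts in the example are made up, not results from this project.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        # share of all predictions that are correct
        "accuracy": (tp + tn) / total if total else 0.0,
        # share of flagged transactions that are truly fraudulent
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # share of fraudulent transactions that the model caught
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts for a balanced 400-sample evaluation set
metrics = classification_metrics(tp=180, tn=190, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall is driven by false negatives and precision by false positives, which is why both are tracked alongside accuracy when missed fraud is the most costly error.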

Cost of Misclassification (from high to low):

False Negative (FN): The model incorrectly identifies a fraudulent transaction as legitimate (Missing actual fraud)
False Positive (FP): The model incorrectly identifies a legitimate transaction as fraudulent (Blocking legitimate customers.)
True Positive (TP): The model correctly identifies a fraudulent transaction as fraud.
True Negative (TN): The model correctly identifies a non-fraudulent transaction as non-fraud.

Planning an MLP Architecture

In the network, 19 input features feed into the first hidden layer’s 30 neurons, which use a ReLU activation function.

Then, their outputs are passed to the second layer, culminating in sigmoid values as the final output.

During the optimization process, we’ll let the optimizer (SGD and Adam) perform forward and backward passes to adjust parameters.

Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using image source)

ReLU is especially advantageous in deeper networks, where it helps prevent the vanishing gradient problem, in which gradients become extremely small as they are backpropagated from the output layer.
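A simplified numeric sketch shows why this matters. During backpropagation, the gradient is multiplied by each layer's activation derivative. The code below multiplies only those derivatives across ten layers and ignores the weight matrices entirely, which is a deliberate simplification; the pre-activation value of 1.0 is also an arbitrary assumption.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: s * (1 - s), never larger than 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_gradient(grad_fn, pre_activations):
    """Multiply the activation derivatives along a chain of layers."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product


layers = [1.0] * 10  # one assumed pre-activation value per layer
print("sigmoid:", chained_gradient(sigmoid_grad, layers))
print("relu:   ", chained_gradient(relu_grad, layers))
```

With sigmoid, the product shrinks geometrically with depth, while ReLU passes the gradient through unchanged for active (positive) units.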

Learn More: A Comprehensive Guide on Neural Network in Deep Learning

Preprocessing the Datasets

First, we consolidate three datasets – transaction, customer, and credit card – into a single DataFrame, independently sanitizing numerical and categorical data:

import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# download the raw data to local
import kagglehub
path = kagglehub.dataset_download("computingvictor/transactions-fraud-datasets")
dir = f'{path}/gd_card_flaud_demo'

def sanitize_df(amount_str):
    """Removes '$' and converts the string to a float."""
    if isinstance(amount_str, str):
        return float(amount_str.replace('$', ''))
    return amount_str

# load transaction data
trx_df = pd.read_csv(f'{dir}/transactions_data.csv')

# sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)
trx_df = trx_df[trx_df['errors'].isna()]
trx_df = trx_df.drop(columns=['merchant_city','merchant_state', 'date', 'mcc', 'errors'], axis='columns')
trx_df['amount'] = trx_df['amount'].apply(sanitize_df)

# merge the dataframe with fraud transaction flag.
with open(f'{dir}/train_fraud_labels.json', 'r') as fp:
    fraud_labels_json = json.load(fp=fp)

fraud_labels_dict = fraud_labels_json.get('target', {})
fraud_labels_series = pd.Series(fraud_labels_dict, name='is_fraud')
fraud_labels_series.index = fraud_labels_series.index.astype(int) # convert the datatype from string to integer
merged_df = pd.merge(trx_df, fraud_labels_series, left_on='id', right_index=True, how='left')
merged_df.fillna({'is_fraud': 'No'}, inplace=True)
merged_df['is_fraud'] = merged_df['is_fraud'].map({'Yes': 1, 'No': 0})

# load card data
card_df = pd.read_csv(f'{dir}/cards_data.csv')
card_df = card_df.drop(columns=['client_id', 'acct_open_date', 'card_number', 'expires', 'cvv'], axis='columns')
card_df['credit_limit'] = card_df['credit_limit'].apply(sanitize_df)

# merge transaction and card data
merged_df = pd.merge(left=merged_df, right=card_df, left_on='card_id', right_on='id', how='inner')
merged_df = merged_df.drop(columns=['id_y', 'card_id'], axis='columns')

# converts categorical variables into a new binary column (0 or 1)
categorical_cols = merged_df.select_dtypes(include=['object']).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=False, dtype=float) 
df = df.dropna().drop(['client_id', 'id_x'], axis=1)
print('\nDataFrame: \n', df.head(n=3))

DataFrame:

Our DataFrame shows an extremely skewed data distribution with:

Fraud samples: 1,191
Non-fraud samples: 11,477,397

For classification tasks, it's crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact on classification model performance, especially regarding the minority class.

For our data, we’ll:

split the 1,191 fraud samples into training, validation, and test sets,
add an equal number of randomly chosen non-fraud samples from the DataFrame, and
adjust split balances later if generalization challenges arise.

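# separate the fraud and non-fraud samples
df_fraud = df[df['is_fraud'] == 1]
df_non_fraud = df[df['is_fraud'] == 0]

# define the desired size of the fraud samples for the validation and test sets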
val_size_per_class = 200
test_size_per_class = 200

# create test sets
X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=42)
X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=42)

# combine to form the balanced test set
X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_test = X_test['is_fraud']
X_test = X_test.drop('is_fraud', axis=1)

# remove sampled rows from the original dataframes to avoid data leakage
df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)


# create validation sets
X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=42)
X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=42)

# combine to form the balanced validation set
X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_val = X_val['is_fraud']
X_val = X_val.drop('is_fraud', axis=1)

# remove sampled rows from the remaining dataframes
df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)


# create training sets
min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))

X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=42)
X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=42)

X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_train = X_train['is_fraud']
X_train = X_train.drop('is_fraud', axis=1)


print("\n--- Final Dataset Shapes and Distributions ---")
print(f"X_train shape: {X_train.shape}, y_train distribution: {np.unique(y_train, return_counts=True)}")
print(f"X_val shape: {X_val.shape}, y_val distribution: {np.unique(y_val, return_counts=True)}")
print(f"X_test shape: {X_test.shape}, y_test distribution: {np.unique(y_test, return_counts=True)}")

After the split, we have 1,582 training, 400 validation, and 400 test samples, each dataset maintaining a 50:50 split between fraud and non-fraud transactions.

Considering the high-dimensional feature space with 19 input features, we’ll apply SMOTE to oversample the training data. SMOTE should be applied only to the training set, never to the validation or test sets, to avoid data leakage:

from imblearn.over_sampling import SMOTE
from collections import Counter

train_target = 2000

smote_train = SMOTE(
  sampling_strategy={0: train_target, 1: train_target},  # increase sample size to 2,000
  random_state=12
)
X_train, y_train = smote_train.fit_resample(X_train, y_train)

print(f"\nAfter SMOTE with custom sampling_strategy (target train: {train_target}):")
print(f"X_train_oversampled shape: {X_train.shape}")
print(f"y_train_oversampled distribution: {Counter(y_train)}")

This gives us 4,000 training samples (2,000 per class), maintaining a 50:50 split between fraud and non-fraud transactions.
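To make the resampling step less of a black box, here is a minimal NumPy sketch of the interpolation idea behind SMOTE: each synthetic row is placed on the line segment between a real row and one of its nearest same-class neighbors. The function name `smote_like_oversample` and the toy data are hypothetical, and this is not imblearn's implementation, which differs in its details.

```python
import numpy as np

def smote_like_oversample(X_class, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between same-class neighbors."""
    rng = np.random.default_rng(seed)
    X_class = np.asarray(X_class, dtype=float)
    n_rows = len(X_class)

    # pairwise squared Euclidean distances; a row is never its own neighbor
    diffs = X_class[:, None, :] - X_class[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbors per row

    base = rng.integers(0, n_rows, size=n_new)                 # starting rows
    picked = neighbors[base, rng.integers(0, k, size=n_new)]   # one neighbor each
    lam = rng.random((n_new, 1))                               # position on the segment
    return X_class[base] + lam * (X_class[picked] - X_class[base])

# toy rows for a single class (assumed data, numeric features only)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_like_oversample(X_toy, n_new=10)
print(X_new.shape)  # (10, 2)
```

Because every synthetic row is a blend of two real rows, it stays inside the range of the original class, which is also why the technique must only ever see training rows.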

Lastly, we’ll apply column transformers to numerical and categorical features separately.

Column transformers are advantageous in handling datasets with multiple data types, as they can apply different transformations to different subsets of columns while preventing data leakage.

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),('onehot', OneHotEncoder(handle_unknown='ignore'))])

numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)
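The leakage point comes down to where the statistics are computed. As a rough NumPy-only sketch of what the scaling step does, the mean and standard deviation below are learned from the training split and then reused unchanged on the validation split. The arrays are synthetic stand-ins, not the article's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 1))  # assumed toy feature
val = rng.normal(loc=55.0, scale=10.0, size=(50, 1))

# "fit": statistics come from the training split only
mean_ = train.mean(axis=0)
std_ = train.std(axis=0)

# "transform": the same statistics are reused for every split
train_scaled = (train - mean_) / std_
val_scaled = (val - mean_) / std_

print(train_scaled.mean(), train_scaled.std())  # close to 0 and 1
print(val_scaled.mean())                        # not forced to 0
```

The validation mean is allowed to differ from zero: the model sees validation data exactly as it would see new production data.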

Understanding Optimizers

In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training. Its primary role is to minimize the model’s loss function, enhancing performance.

Different optimization algorithms, known as optimizers, use different strategies to converge efficiently toward parameters that yield better predictions.

In this article, we’ll use the SGD Optimizer and Adam Optimizer.

1. How a SGD (Stochastic Gradient Descent) Optimizer Works

SGD is a widely used optimization algorithm that computes the gradient (the partial derivatives of the cost function) on a small mini-batch of examples and updates the parameters at each step:

$$\begin{align*} w_j &:= w_j - \eta \frac {\partial J} {\partial w_j} \\ \\ b &:= b - \eta \frac {\partial J} {\partial b} \end{align*}$$

(w: weight, b: bias, J: cost function, η: learning rate)
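The update rule can be tried on a toy problem. The sketch below assumes a noise-free linear target (w = 2, b = 1) and a squared-error cost rather than the classification loss used later; the learning rate and batch size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=128)
y = 2.0 * X + 1.0          # toy target: w = 2, b = 1

w, b = 0.0, 0.0
eta = 0.1                  # learning rate
batch_size = 16

for epoch in range(200):
    order = rng.permutation(len(X))          # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = (w * X[idx] + b) - y[idx]
        grad_w = np.mean(error * X[idx])     # dJ/dw for J = mean(error^2) / 2
        grad_b = np.mean(error)              # dJ/db
        w -= eta * grad_w                    # w := w - eta * dJ/dw
        b -= eta * grad_b                    # b := b - eta * dJ/db

print(w, b)  # approaches 2.0 and 1.0
```

Each pass through the inner loop is one parameter update, so an epoch with 128 rows and a batch size of 16 performs eight updates.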

In binary classification, the cost function (J) is the binary cross-entropy of the prediction ŷ = σ(z), where σ is the sigmoid function and z is the weighted sum of the inputs plus the bias term:

$$\begin{align*} J(y, \hat y) &= -\left[ y \log(\hat y) + (1-y) \log(1-\hat y) \right] \\ \\ \hat y &= \sigma (z) = \frac {1} {1+e^{-z}} \\ \\ z &= \sum_{i=1}^m w_i x_i + b \end{align*}$$
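A useful consequence of pairing the sigmoid with this cost function is that the gradient with respect to z simplifies to ŷ − y. The following sketch checks that identity numerically with a central finite difference; the values of z and y are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, y_hat):
    # binary cross-entropy for a single example
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def loss_from_z(z, y):
    return bce(y, sigmoid(z))

z, y = 0.7, 1.0
eps = 1e-6
numeric = (loss_from_z(z + eps, y) - loss_from_z(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y   # dJ/dz = y_hat - y

print(numeric, analytic)    # the two values agree closely
```

This is the reason a backpropagation step for a sigmoid output with this loss can start from the simple difference between predictions and labels.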

2. How Adam (Adaptive Moment Estimation) Optimizer Works

Adam is an optimization algorithm that computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

Adam optimizer combines the advantages of RMSprop (using squared gradients to scale the learning rate) and Momentum (using past gradients to accelerate convergence):

$$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$

where:

α: The learning rate (default is 0.001)
ϵ: A small positive constant used to avoid division by zero
m^: First moment (mean) estimate with a bias correction, leveraging Momentum:

$$\begin{align*} \hat m_t &= \frac {m_t} {1 - \beta_1^t} \\ \\ m_t &= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{ \frac {\partial L} {\partial w_t}}_{\text{gradient}} \end{align*}$$

(β1: Decay rate of the first moment, typically set to β1=0.9)

v^: Second moment (variance) estimate with a bias correction, leveraging RMSprop:

$$\begin{align*} \hat v_t &= \frac {v_t} {1 - \beta_2^t} \\ \\ v_t &=\beta_2 v_{t-1} + (1- \beta_2) (\frac {\partial L} {\partial w_t})^2 \end {align*}$$

(β2: Decay rate of the second moment, typically set to β2=0.999)

Since both m and v are initialized at zero, Adam computes the bias-corrected estimates to keep them from being biased toward zero, especially during the first steps of training.
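The effect of the bias correction is easiest to see on a single parameter. This is a minimal sketch on an assumed toy objective J(w) = (w − 3)², with the function name `adam_scalar` chosen for illustration.

```python
import math

def adam_scalar(grad_fn, w, steps, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0                      # both moments start at zero
    history = []
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
        w -= alpha * m_hat / (math.sqrt(v_hat) + eps)
        history.append((t, m, m_hat, w))
    return history

# toy objective J(w) = (w - 3)^2, so dJ/dw = 2 * (w - 3)
history = adam_scalar(lambda w: 2 * (w - 3), w=0.0, steps=3)
for t, m, m_hat, w in history:
    print(t, round(m, 4), round(m_hat, 4), round(w, 6))
```

At t = 1 the raw moment m is only 10% of the gradient (−0.6 versus −6), while the corrected m̂ recovers the full gradient. The first step therefore has a size close to α regardless of the gradient's magnitude.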

Learn More: A Comprehensive Guide on Neural Network in Deep Learning

How to Build an MLP Classifier with SGD Optimizer

Custom Classifier

Each training step involves a forward pass followed by backpropagation, after which SGD updates the weights and biases using the gradients:

for i in range(0, n_samples, self.batch_size):
    # take the next mini-batch from the data shuffled for this epoch
    X_batch = X_shuffled[i : i + self.batch_size]
    y_batch = y_shuffled[i : i + self.batch_size]

    # A. forward pass
    activations, zs = self._forward_pass(X_batch)
    y_pred = activations[-1]  # final output of the network

    # B. backpropagation
    # 1) calculate gradients for the output layer
    delta = y_pred - y_batch
    dW = np.dot(activations[-2].T, delta) / X_batch.shape[0]
    db = np.sum(delta, axis=0) / X_batch.shape[0]

    # 2) update output layer parameters
    self.weights[-1] -= self.learning_rate * dW
    self.biases[-1] -= self.learning_rate * db

    # 3) iterate backward from last hidden layer to the input layer
    for l in range(len(self.weights) - 2, -1, -1):
        delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
        dW = np.dot(activations[l].T, delta) / X_batch.shape[0]
        db = np.sum(delta, axis=0) / X_batch.shape[0]

        self.weights[l] -= self.learning_rate * dW
        self.biases[l] -= self.learning_rate * db

During the forward pass, the network computes the weighted sum of the inputs plus the bias (z) in each layer, applies an activation function (ReLU) to the values in each hidden layer, and then computes the predicted output (y_pred) using a sigmoid function.

def _forward_pass(self, X):
    activations = [X]
    zs = []

    # forward through hidden layers
    for i in range(len(self.weights) - 1):
        z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
        zs.append(z)
        a = self._relu(z) # using ReLU for hidden layers
        activations.append(a)

    # forward through output layer
    z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
    zs.append(z_output)

    # compute the final output by applying the sigmoid function to z_output
    y_pred = 1 / (1 + np.exp(-np.clip(z_output, -500, 500)))
    activations.append(y_pred)
    return activations, zs
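A common sanity check for hand-written backpropagation is to compare the analytic gradients against finite differences. The sketch below does this for a tiny network with one ReLU hidden layer and a sigmoid output; the network size, data, and the `params` dictionary are assumptions made for the example, not part of the classifier above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
params = {
    "W1": rng.normal(scale=0.5, size=(3, 4)), "b1": np.zeros((1, 4)),
    "W2": rng.normal(scale=0.5, size=(4, 1)), "b2": np.zeros((1, 1)),
}

def forward(p):
    z1 = X @ p["W1"] + p["b1"]
    a1 = np.maximum(0, z1)                                # ReLU hidden layer
    y_hat = 1 / (1 + np.exp(-(a1 @ p["W2"] + p["b2"])))   # sigmoid output
    return z1, a1, y_hat

def loss(p):
    y_hat = np.clip(forward(p)[2], 1e-10, 1 - 1e-10)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# analytic gradients, using the same formulas as the mini-batch loop
z1, a1, y_hat = forward(params)
delta2 = y_hat - y
grads = {"W2": a1.T @ delta2 / len(X)}
delta1 = (delta2 @ params["W2"].T) * (z1 > 0)             # ReLU derivative
grads["W1"] = X.T @ delta1 / len(X)

def numeric_grad(name, eps=1e-6):
    """Central finite-difference gradient of the loss for one parameter array."""
    g = np.zeros_like(params[name])
    for idx in np.ndindex(g.shape):
        original = params[name][idx]
        params[name][idx] = original + eps
        loss_plus = loss(params)
        params[name][idx] = original - eps
        loss_minus = loss(params)
        params[name][idx] = original                      # restore the value
        g[idx] = (loss_plus - loss_minus) / (2 * eps)
    return g

for name in ("W1", "W2"):
    print(name, np.max(np.abs(grads[name] - numeric_grad(name))))
```

The printed differences should be tiny. A large value would point to a mistake in the delta or gradient formulas.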

So the final classifier looks like this:

import numpy as np
from sklearn.metrics import accuracy_score

class MLP_SGD:
    def __init__(self, hidden_layer_sizes=(10,), learning_rate=0.01, n_epochs=1000, batch_size=32):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.weights = []
        self.biases = []
        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def _sigmoid_derivative(self, x):
        s = self._sigmoid(x)
        return s * (1 - s)

    def _relu(self, x):
        return np.maximum(0, x)

    def _relu_derivative(self, x):
        return (x > 0).astype(float)

    def _initialize_parameters(self, n_features):
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [1]
        self.weights = []
        self.biases = []

        for i in range(len(layer_sizes) - 1):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+1]
            limit = np.sqrt(6 / (fan_in + fan_out))
            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((1, fan_out)))

    def _forward_pass(self, X):
        activations = [X]
        zs = []

        for i in range(len(self.weights) - 1):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        return activations, zs

    def _compute_loss(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(-1, 1)
        X = np.asarray(X)
        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() for w in self.weights])
        self.biases_history.append([b.copy() for b in self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[-1])
        self.loss_history.append(initial_loss)

        for epoch in range(self.n_epochs):
            # shuffle datasets
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            # mini-batch loop
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[-1]

                delta = y_pred - y_batch
                dW = np.dot(activations[-2].T, delta) / X_batch.shape[0]
                db = np.sum(delta, axis=0) / X_batch.shape[0]
                self.weights[-1] -= self.learning_rate * dW
                self.biases[-1] -= self.learning_rate * db

                for l in range(len(self.weights) - 2, -1, -1):
                    delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
                    dW = np.dot(activations[l].T, delta) / X_batch.shape[0]
                    db = np.sum(delta, axis=0) / X_batch.shape[0]

                    self.weights[l] -= self.learning_rate * dW
                    self.biases[l] -= self.learning_rate * db

            self.weights_history.append([w.copy() for w in self.weights])
            self.biases_history.append([b.copy() for b in self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[-1])
            self.loss_history.append(epoch_loss)

            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, Loss: {epoch_loss:.4f}")
        return self

    def predict_proba(self, X):
        activations, _ = self._forward_pass(X)
        return activations[-1]

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).flatten() # for 1D output

Training / Prediction

Train the model and make a prediction using training and validation datasets:

# 1. define the model
mlp_sgd = MLP_SGD(
  hidden_layer_sizes=(30, 30, ), # 2 hidden layers with 30 neurons each
  learning_rate=0.001,           # a step size
  n_epochs=1000,                 # number of epochs
  batch_size=32                  # mini-batch size
)

# 2. train the model
mlp_sgd.fit(X_train_processed, y_train)

# 3. make a prediction with training and validation datasets
y_pred_train = mlp_sgd.predict(X_train_processed)
y_pred_val = mlp_sgd.predict(X_val_processed)

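# 4. compute evaluation metrics on the training and validation sets
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

conf_matrix_val = confusion_matrix(y_val, y_pred_val)
acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)
precision_train = precision_score(y_train, y_pred_train, pos_label=1)
precision_val = precision_score(y_val, y_pred_val, pos_label=1)
recall_train = recall_score(y_train, y_pred_train, pos_label=1)
recall_val = recall_score(y_val, y_pred_val, pos_label=1)
f1_val = f1_score(y_val, y_pred_val, pos_label=1)

print(f"\nMLP (Custom SGD) Accuracy (Train): {acc_train:.3f}")
print(f"MLP (Custom SGD) Accuracy (Validation): {acc_val:.3f}")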

Results

Recall: 0.7930 to 0.6650 (from training to validation)
Precision: 0.7790 to 0.6786 (from training to validation)

The model learned the patterns in the training data, achieving a training Recall of 79.3% (it correctly flagged roughly 80% of the fraudulent transactions). Recall fell by about 13 points, to 66.5%, on the validation set.
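For readers who want to see what Recall and Precision measure, this small sketch computes both from raw label lists. The helper name `precision_recall` and the eight toy labels are made up for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # caught fraud
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # missed fraud
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.75
```

Recall answers "what share of the fraud did we catch?", while Precision answers "what share of our fraud alerts were real?". Accuracy is a different metric that also counts correctly classified non-fraud transactions.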

Loss history:

We visualized the decision boundary using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.

Leverage Scikit-learn’s MLPClassifier

We can use Scikit-learn’s MLPClassifier to define a similar model, incorporating:

Early stopping on an internal validation split to prevent overfitting, and
L2 regularization with a small penalty strength (alpha), together with a small tolerance (tol) for the stopping criterion.

from sklearn.neural_network import MLPClassifier

# define a model
model_sklearn_mlp_sgd = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    activation='relu',
    solver='sgd',
    learning_rate_init=0.001,
    learning_rate='constant',
    momentum=0.9,
    nesterovs_momentum=True,
    alpha=0.00001,           # L2 regularization strength
    max_iter=3000,           # max epochs (keep it high)
    batch_size=16,           # mini-batch size
    random_state=42,
    early_stopping=True,     # apply early stopping
    n_iter_no_change=50,     # stop the iteration if internal validation score doesn't improve for 50 epochs
    validation_fraction=0.1, # proportion of training data for internal validation (default is 0.1)
    tol=1e-4,                # tolerance for optimization
    verbose=False,
)

# training
model_sklearn_mlp_sgd.fit(X_train_processed, y_train)

# make a prediction
y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)
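The interplay between a patience window (such as n_iter_no_change) and a tolerance (such as tol) can be sketched in a few lines of plain Python. This is a simplified illustration of patience-based early stopping, not Scikit-learn's exact internal logic; the function name and the score sequence are hypothetical.

```python
def early_stopping_epoch(val_scores, patience, tol=1e-4):
    """Return (epoch index at which training stops, best score seen so far)."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score > best + tol:               # improvement must exceed the tolerance
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch, best
    return len(val_scores) - 1, best         # ran out of epochs without stopping

scores = [0.60, 0.65, 0.66, 0.66, 0.655, 0.66, 0.70]
stop_epoch, best_score = early_stopping_epoch(scores, patience=3)
print(stop_epoch, best_score)  # 5 0.66
```

In this toy sequence, training stops at epoch 5 and never reaches the later score of 0.70. That is the trade-off a short patience window makes, and it is why the configurations here use a generous value of 50.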

Results

Recall: 0.7830 to 0.6200 (from training to validation)
Precision: 0.8208 to 0.6703 (from training to validation)

The model showed strong performance during training, achieving a Recall of 78.30%. Its Recall declined to 62.00% on the validation set.

This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.

Leverage Keras Sequential Classifier

For the sequential classifier, we can further enhance the classifier by:

Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (y_train) to address dataset imbalance and promote faster convergence,
Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,
Including Precision and Recall in the model’s compilation metrics to track classification performance during training,
Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and
Utilizing a separate validation dataset for monitoring performance during training to help detect overfitting and guide hyperparameter tuning.

import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping
from sklearn.utils import class_weight


# calculates an initial bias for the output layer 
initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])


# defines the model
model_keras_sgd = Sequential([
    Input(shape=(X_train_processed.shape[1],)), 
    Dense(30, activation='relu'),
    Dropout(0.1), # 10% of the neurons in that layer randomly dropped out
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid', # binary classification
          bias_initializer=tf.keras.initializers.Constant(initial_bias)) # to address the imbalanced datasets
])



# compiles the model with the SGD optimizer
opt = SGD(learning_rate=0.001)
model_keras_sgd.compile(
    optimizer=opt, 
    loss='binary_crossentropy',
    metrics=[
        'accuracy', # add several metrics to return
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc') 
    ]
)


# defines early stopping to prevent overfitting
early_stopping_callback = EarlyStopping(
    monitor='val_recall',  # monitor recall 
    mode='max',         # maximize recall
    patience=50,        # stop after 50 epochs without improvement in val_recall
    min_delta=1e-4,     # minimum change to be considered an improvement (tol)
    verbose=0
)


# compute the class weight
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))


# train the model
history = model_keras_sgd.fit(
    X_train_processed, y_train,
    epochs=1000,
    batch_size=32,
    validation_data=(X_val_processed, y_val), # use our external val set
    callbacks=[early_stopping_callback], # early stopping to prevent overfitting
    class_weight=class_weights_dict, # penalize misclassification of the minority class more heavily
    verbose=0
)

# evaluate
loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=0)
print(f"\n--- Keras Model Accuracy (Train) ---")
print(f"Loss: {loss_train:.4f}")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"AUC: {auc_train:.4f}")

loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=0)
print(f"\n--- Keras Model Accuracy (Validation) ---")
print(f"Loss: {loss_val:.4f}")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"AUC: {auc_val:.4f}")

# display model summary
model_keras_sgd.summary()
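Two of the ingredients above, the log-odds initial bias and the "balanced" class weights, are simple enough to compute by hand. The sketch below uses the standard balanced-weight formula n_samples / (n_classes * class_count) on an assumed 900:100 label list, then on a 2,000:2,000 list that mirrors the resampled training set. The helper names are hypothetical.

```python
import math
from collections import Counter

def log_odds_bias(y):
    counts = Counter(y)
    return math.log(counts[1] / counts[0])   # log(positives / negatives)

def balanced_class_weights(y):
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

y_imbalanced = [0] * 900 + [1] * 100          # assumed toy labels
bias = log_odds_bias(y_imbalanced)
print(round(bias, 4))                         # -2.1972
print(round(1 / (1 + math.exp(-bias)), 4))    # 0.1, the positive rate
print(balanced_class_weights(y_imbalanced))   # minority class gets weight 5.0

y_balanced = [0] * 2000 + [1] * 2000          # same shape as the resampled training labels
print(log_odds_bias(y_balanced), balanced_class_weights(y_balanced))  # 0.0 {0: 1.0, 1: 1.0}
```

With a 50:50 training set, both techniques reduce to neutral values: a bias of zero and equal weights of 1.0. They become active only when the training labels are imbalanced.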

Results

Recall: 0.7125 to 0.7250 (from training to validation)
Precision: 0.7607 to 0.7545 (from training to validation)

Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.

It suggests that the regularization techniques are likely effective in preventing significant overfitting.
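As one example of these regularization techniques, here is a minimal NumPy sketch of inverted dropout, the scheme in which surviving activations are scaled up by 1 / (1 − rate) during training so that nothing needs to change at inference time. The function name and the all-ones input are assumptions for illustration.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    if not training or rate == 0.0:
        return activations                       # inference: pass through unchanged
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)  # rescale the survivors

rng = np.random.default_rng(0)
a = np.ones((1000, 30))
a_train = dropout(a, rate=0.1, training=True, rng=rng)
a_infer = dropout(a, rate=0.1, training=False, rng=rng)

print((a_train == 0).mean())   # roughly 0.1 of the units are dropped
print(a_train.mean())          # roughly 1.0, so the expected activation is preserved
```

Because a different random subset of units is silenced on every batch, the network cannot rely on any single neuron, which tends to narrow the gap between training and validation scores.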

How to Build an MLP Classifier with Adam Optimizer

Custom Classifier

These parameter updates run inside the mini-batch loop, so the weights and biases are updated after every mini-batch:

# apply Adam updates for output layer parameters
# 1) weights (w)
self.m_weights[-1] = self.beta1 * self.m_weights[-1] + (1 - self.beta1) * grad_w_output
self.v_weights[-1] = self.beta2 * self.v_weights[-1] + (1 - self.beta2) * (grad_w_output ** 2)
m_w_hat = self.m_weights[-1] / (1 - self.beta1**t)
v_w_hat = self.v_weights[-1] / (1 - self.beta2**t)
self.weights[-1] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

# 2) bias (b)
self.m_biases[-1] = self.beta1 * self.m_biases[-1] + (1 - self.beta1) * grad_b_output
self.v_biases[-1] = self.beta2 * self.v_biases[-1] + (1 - self.beta2) * (grad_b_output ** 2)
m_b_hat = self.m_biases[-1] / (1 - self.beta1**t)
v_b_hat = self.v_biases[-1] / (1 - self.beta2**t)
self.biases[-1] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
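Since the same five lines are repeated for every weight and bias array, one option is to factor them into a helper. The following is a sketch of such a refactor, with `adam_step` as a hypothetical name; it updates the arrays in place and is not part of the original classifier.

```python
import numpy as np

def adam_step(param, grad, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Update param, m, and v in place for time step t (t starts at 1)."""
    m *= beta1
    m += (1 - beta1) * grad                  # first moment
    v *= beta2
    v += (1 - beta2) * grad ** 2             # second moment
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    param -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

W = np.array([[0.5, -0.5], [1.0, 2.0]])
m_W, v_W = np.zeros_like(W), np.zeros_like(W)
grad_W = np.array([[0.2, -4.0], [10.0, -0.01]])  # gradients of very different sizes

W_before = W.copy()
adam_step(W, grad_W, m_W, v_W, t=1)
print(W - W_before)  # every entry moves by about 0.001, opposite to its gradient sign
```

The printout illustrates the adaptive part of Adam: on the first step, a gradient of 10.0 and a gradient of −0.01 produce updates of nearly the same size.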

Following the same forward and backward passes as MLP_SGD, we construct the final classifier by adding the Adam hyperparameters (beta1, beta2, and epsilon) and the per-parameter moment estimates:

class MLP_Adam:
    def __init__(self, hidden_layer_sizes=(10,), learning_rate=0.001, n_epochs=1000, batch_size=32,
                 beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        self.weights = [] 
        self.biases = []

        # Adam optimizer internal states for each parameter (weights and biases)
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def _sigmoid_derivative(self, x):
        s = self._sigmoid(x)
        return s * (1 - s)

    def _relu(self, x):
        return np.maximum(0, x)

    def _relu_derivative(self, x):
        return (x > 0).astype(float)

    def _initialize_parameters(self, n_features):
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [1]

        self.weights = []
        self.biases = []
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        for i in range(len(layer_sizes) - 1):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+1]
            limit = np.sqrt(6 / (fan_in + fan_out))

            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((1, fan_out)))

            self.m_weights.append(np.zeros((fan_in, fan_out)))
            self.v_weights.append(np.zeros((fan_in, fan_out)))
            self.m_biases.append(np.zeros((1, fan_out)))
            self.v_biases.append(np.zeros((1, fan_out)))


    def _forward_pass(self, X):
        activations = [X]
        zs = []

        for i in range(len(self.weights) - 1):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        return activations, zs

    def _compute_loss(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(-1, 1)
        X = np.asarray(X)

        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() for w in self.weights])
        self.biases_history.append([b.copy() for b in self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[-1])
        self.loss_history.append(initial_loss)

        # global time step for Adam bias correction
        t = 0

        for epoch in range(self.n_epochs):
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            # Mini-batch loop
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                t += 1

                # 1. forward pass
                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[-1] # Output of the network

                # 2. backpropagation
                delta = y_pred - y_batch
                grad_w_output = np.dot(activations[-2].T, delta) / X_batch.shape[0] # Average over batch
                grad_b_output = np.sum(delta, axis=0) / X_batch.shape[0]

                # apply Adam updates to weights
                self.m_weights[-1] = self.beta1 * self.m_weights[-1] + (1 - self.beta1) * grad_w_output
                self.v_weights[-1] = self.beta2 * self.v_weights[-1] + (1 - self.beta2) * (grad_w_output ** 2)
                m_w_hat = self.m_weights[-1] / (1 - self.beta1**t)
                v_w_hat = self.v_weights[-1] / (1 - self.beta2**t)
                self.weights[-1] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                # apply Adam updates to bias
                self.m_biases[-1] = self.beta1 * self.m_biases[-1] + (1 - self.beta1) * grad_b_output
                self.v_biases[-1] = self.beta2 * self.v_biases[-1] + (1 - self.beta2) * (grad_b_output ** 2)
                m_b_hat = self.m_biases[-1] / (1 - self.beta1**t)
                v_b_hat = self.v_biases[-1] / (1 - self.beta2**t)
                self.biases[-1] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


                # Propagate gradients backward through hidden layers
                for l in range(len(self.weights) - 2, -1, -1):
                    delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
                    grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[0]
                    grad_b_hidden = np.sum(delta, axis=0) / X_batch.shape[0]

                    # apply Adam updates to weights
                    self.m_weights[l] = self.beta1 * self.m_weights[l] + (1 - self.beta1) * grad_w_hidden
                    self.v_weights[l] = self.beta2 * self.v_weights[l] + (1 - self.beta2) * (grad_w_hidden ** 2)
                    m_w_hat = self.m_weights[l] / (1 - self.beta1**t)
                    v_w_hat = self.v_weights[l] / (1 - self.beta2**t)
                    self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                    # apply Adam updates to bias
                    self.m_biases[l] = self.beta1 * self.m_biases[l] + (1 - self.beta1) * grad_b_hidden
                    self.v_biases[l] = self.beta2 * self.v_biases[l] + (1 - self.beta2) * (grad_b_hidden ** 2)
                    m_b_hat = self.m_biases[l] / (1 - self.beta1**t)
                    v_b_hat = self.v_biases[l] / (1 - self.beta2**t)
                    self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


            self.weights_history.append([w.copy() for w in self.weights])
            self.biases_history.append([b.copy() for b in self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[-1])
            self.loss_history.append(epoch_loss)

            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, Loss: {epoch_loss:.4f}")
        return self


    def predict_proba(self, X):
        activations, _ = self._forward_pass(X)
        return activations[-1]

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).flatten()

Training / Prediction

Train the model and make a prediction using training and validation datasets:

mlp_adam = MLP_Adam(hidden_layer_sizes=(30, 10), learning_rate=0.001, n_epochs=500, batch_size=32)
mlp_adam.fit(X_train_processed, y_train)

y_pred_train = mlp_adam.predict(X_train_processed)
y_pred_val = mlp_adam.predict(X_val_processed)

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

print(f"\nMLP (Custom Adam) Accuracy (Train): {acc_train:.3f}")
print(f"MLP (Custom Adam) Accuracy (Validation): {acc_val:.3f}")

Results

Recall: 0.9870 to 0.6150 (from training to validation)
Precision: 0.9811 to 0.6474 (from training to validation)

While the Adam optimizer fit the training data far better than SGD, the model exhibited significant overfitting: Recall fell by about 37 points and Precision by about 33 points between training and validation.

Loss History

We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.

Leverage Scikit-learn’s MLPClassifier

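We’ve switched the solver from SGD to Adam and kept the architecture, batch size, and early-stopping settings the same. The momentum arguments, which apply only to the SGD solver, are dropped, and alpha is 0.0001 here rather than 0.00001: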

model_sklearn_mlp_adam = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    activation='relu',
    solver='adam',             # update the optimizer from SGD to Adam
    learning_rate_init=0.001,
    learning_rate='constant',
    alpha=0.0001,
    max_iter=3000,
    batch_size=16,
    random_state=42,
    early_stopping=True,
    n_iter_no_change=50,
    validation_fraction=0.1,
    tol=1e-4,
    verbose=False,
)

model_sklearn_mlp_adam.fit(X_train_processed, y_train)

y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)

Results

Recall: 0.8975 to 0.6400 (from training to validation)
Precision: 0.8864 to 0.6305 (from training to validation)

Despite a performance improvement compared to the SGD optimizer, the significant drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation data indicates that the model is still overfitting.

Leverage Keras Sequential Classifier

Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:

import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from sklearn.utils import class_weight


initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])
model_keras_adam = Sequential([
    Input(shape=(X_train_processed.shape[1],)), 
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid', 
          bias_initializer=tf.keras.initializers.Constant(initial_bias))
])


optimizer_keras = Adam(learning_rate=0.001)
model_keras_adam.compile(
    optimizer=optimizer_keras, 
    loss='binary_crossentropy', 
    metrics=[
        'accuracy',
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc') 
    ]
)

early_stopping_callback = EarlyStopping(
    monitor='val_recall',
    mode='max',
    patience=50,
    min_delta=1e-4,
    verbose=0
)

class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

model_keras_adam.fit(
    X_train_processed, y_train,
    epochs=1000,
    batch_size=32,
    validation_data=(X_val_processed, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict,
    verbose=0
)


loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=0)
print(f"\n--- Keras Model Accuracy (Train) ---")
print(f"Loss: {loss_train:.4f}")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"AUC: {auc_train:.4f}")


loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=0)
print(f"\n--- Keras Model Accuracy (Validation) ---")
print(f"Loss: {loss_val:.4f}")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"AUC: {auc_val:.4f}")


model_keras_adam.summary()

Results

Recall: 0.7995 to 0.7500 (from training to validation)
Precision: 0.8409 to 0.8065 (from training to validation)

The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).

This indicates good generalization, with only minor performance degradation on unseen data.
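To compare the six experiments side by side, the reported training and validation scores can be turned into generalization gaps. The numbers below are copied from the Results sections above; only the dictionary layout and labels are introduced here.

```python
# (training, validation) scores as reported in the Results sections
reported = {
    "Custom MLP, SGD":        {"recall": (0.7930, 0.6650), "precision": (0.7790, 0.6786)},
    "MLPClassifier, SGD":     {"recall": (0.7830, 0.6200), "precision": (0.8208, 0.6703)},
    "Keras Sequential, SGD":  {"recall": (0.7125, 0.7250), "precision": (0.7607, 0.7545)},
    "Custom MLP, Adam":       {"recall": (0.9870, 0.6150), "precision": (0.9811, 0.6474)},
    "MLPClassifier, Adam":    {"recall": (0.8975, 0.6400), "precision": (0.8864, 0.6305)},
    "Keras Sequential, Adam": {"recall": (0.7995, 0.7500), "precision": (0.8409, 0.8065)},
}

# gap = training score minus validation score; larger means more overfitting
gaps = {}
for name, metrics in reported.items():
    gaps[name] = {metric: round(train - val, 4) for metric, (train, val) in metrics.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]["recall"]):
    print(f"{name:<24} recall gap {gap['recall']:+.4f}  precision gap {gap['precision']:+.4f}")
```

Sorted this way, both Keras models sit at the top with the smallest gaps, and the custom Adam classifier sits at the bottom with a Recall gap of about 37 points.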

Final Results: Generalization

Finally, we’ll evaluate the model’s ultimate performance on the test dataset, which has remained completely separate from all prior training and validation processes.

# Custom classifiers
y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

# MLPClassifier
y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

# Keras Sequential (already trained above; evaluate on the held-out test set)
_, accuracy_test_sgd, precision_test_sgd, recall_test_sgd, auc_test_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=0)
_, accuracy_test_adam, precision_test_adam, recall_test_adam, auc_test_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=0)

Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an AUPRC (Area Under Precision-Recall Curve) of 0.72.
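
The snippets in this section don't show how the AUPRC is computed. One common estimate is average precision: the mean of precision@k over the ranks at which the positive samples appear. Below is a minimal standard-library sketch under the assumption of binary labels and distinct scores. The function name and the example data are hypothetical, and in practice scikit-learn's `average_precision_score` would be the usual choice:

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels (0/1).

    Samples are ranked by descending score, and precision@k is averaged over
    the ranks k that hold a positive label. Assumes scores are distinct;
    tied scores are not given special treatment.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")

    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Hypothetical labels and predicted probabilities
y_true_example = [1, 0, 1, 0]
y_score_example = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(y_true_example, y_score_example)
print(f"AUPRC (average precision): {ap:.4f}")
```

Unlike accuracy, this metric ignores true negatives, which makes it a reasonable summary when the positive class is rare.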

Conclusion

In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.

Our findings underscore that effective machine learning hinges on three critical factors:

robust data preprocessing (tailored to objectives and data distribution),
judicious model selection, and
strategic framework or library choices.

Choosing the right framework

Generally speaking, choose MLPClassifier when:

You’re primarily working with tabular data,
You want to prioritize simplicity, quick iteration, and seamless integration,
You have simple, shallow architectures, and
You have a moderate dataset size (manageable on a CPU).

Choose Keras Sequential when:

You’re dealing with image, text, audio, or other sequential data,
You’re building deep learning models such as CNNs, RNNs, LSTMs,
You need fine-grained control over the model architecture, training process, or custom components,
You need to leverage GPU acceleration,
You’re planning for production deployment, and
You want to experiment with more advanced deep learning techniques.

Limitations of MLPs

While Multilayer Perceptrons (MLPs) proved valuable, their computational cost and tendency to overfit emerged as key challenges.

Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.

You can find more info about me on my Portfolio / LinkedIn / Github.

How to Build a Machine Learning System on Serverless Architecture

Table of Contents

Prerequisites

What We’re Building

AI Pricing for Retailers

The Models

Tuning and Training

The Prediction

Performance Validation

The System Architecture

Core AWS Resources in the Architecture

The Deployment Workflow in Action

Step 1: Draft Python Scripts

Scripts for Data Handling

Scripts for Model Training and Tuning (PyTorch Model)

1. PyTorch Models

2. Scikit-Learn Models (Backups)

Step 2: Configure Feature/Model Stores in S3

Model Store

Feature Store

Step 3: Create a Flask Application with API Endpoints

Key Points on Flask App Configuration

1. A Few API Endpoints Per Container

2. Understanding the handler Function and the role of AWSGI

3. Using Cache Storage

4. Handling Heavy Tasks Outside of the handler Function

Step 4: Publish a Docker Image to ECR

Test in Local

Publish the Docker Image to ECR

Step 5: Create a Lambda Function

Connect the Lambda function to API Gateway

Step 6: Configure AWS Resources

1. The IAM Role: Controls Who to Access Resources
