Deep Learning - freeCodeCamp.org

AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)

Mohammed Fahd Abrah — Wed, 06 May 2026 18:13:01 +0000

We use AI tools all the time, whether it’s asking questions, generating images, or getting help with everyday tasks. But most of these tools didn’t appear out of nowhere. They grew out of research papers where the original ideas were first proposed and tested.

Now, not everyone enjoys reading research papers or has the time to comb through and digest all that (sometimes very dense) info. So I decided to do the hard work for you and share the key insights in a series of AI paper reviews.

The goal isn’t to turn this into a heavy academic discussion, but to explain the main ideas in a clear and practical way. You'll learn what problem the paper was trying to solve, what approach it introduced, and why it mattered.

In each article, you’ll get a simple breakdown of the paper, how it works, and what you should take away from it. By the end, you should understand the key idea without needing to go through the full research paper yourself.

Paper Overview

The first paper I'll be reviewing is "Improving Language Understanding by Generative Pre-Training", by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.

Here's the actual paper if you want to read it yourself: Read the paper.

And here's a little infographic of what we'll cover here:

Executive Summary
Goals of the Paper
Methodology
Transformer vs. BERT vs. GPT
Model Architecture
Key Techniques
Key Findings
Conclusions
Limitations
Related Work & Context
Final Insight
Resources

Prerequisites

To get the most out of this breakdown, it helps to be familiar with a few basic ideas:

A general understanding of natural language processing (NLP) and how machines work with text
A high-level idea of what a Transformer model is (you don’t need deep details, just the concept)
The difference between supervised and unsupervised learning
Basic machine learning concepts like training data and models

If you’re not fully comfortable with all of these, that’s okay, you can still follow along. The goal here is to keep things clear and intuitive.

Executive Summary

Before models like GPT became what we know today, there was a key limitation: AI systems were good at specific tasks, but struggled with general understanding.

In this paper, the authors introduce a simple but powerful idea. Instead of training a model separately for each task, they first train it on a large amount of unlabeled text to learn the structure of language. Then, they adapt it to specific tasks using smaller labeled datasets.

According to the authors, this two-step approach (pre-training followed by fine-tuning) allows a single model to handle many different tasks with minimal changes.

In practice, this marked a major shift: rather than building a new model for every problem, we can train one general model that learns language itself and then reuse it across tasks.

Goals of the Paper

To understand the motivation behind this work, it helps to look at the main limitations in NLP at the time.

Most models depended heavily on large labeled datasets, which weren’t always available. Many tasks simply didn’t have enough labeled data to train effective systems. On top of that, existing models were usually designed for a single task, making them hard to reuse or adapt.

Because of this, the authors aimed to reduce the reliance on labeled data and move toward a more general approach. Their goal was to build a language model that could learn from large amounts of raw text and then be applied across different tasks.

According to the paper, they also wanted to enable transfer learning: the ability to take knowledge learned from one task and apply it to others. They also wanted to improve performance without needing to redesign a new model each time.

Methodology

To understand how the authors approached this problem, let’s look at the core idea behind their method.

Pre-Training

At the heart of the paper is a simple but powerful approach built in two stages. The first stage is pre-training, where the model learns directly from raw text.

According to the authors, the model is trained on a large corpus of unlabeled text using a language modeling objective: predicting the next word in a sequence from the words that came before it. Breaking the probability of a whole sequence into a chain of next-word predictions avoids having to model a high-dimensional joint distribution directly. Through this process, the model gradually learns important aspects of language, such as grammar, context, structure, and general patterns.
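Written as a formula, the pre-training objective in the paper is the log-likelihood of each token given the tokens before it. In the notation below, the corpus is a sequence of tokens u_1, ..., u_n, k is the size of the context window, and Θ stands for the model parameters:

```latex
L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```

Training maximizes this sum, so the model is rewarded whenever it assigns high probability to the word that actually comes next.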

The paper highlights that datasets like BooksCorpus are used in this stage because they contain long, continuous text. This is important, since it helps the model understand relationships across sentences rather than just short fragments.

Fine-Tuning (Adapting to Tasks)

Once the model has learned general language patterns, the next step is fine-tuning, where it is adapted to specific tasks using labeled data.

According to the authors, this includes tasks like question answering, text classification, natural language inference, and semantic similarity. Instead of building a new model for each task, the same pre-trained model is reused with only small adjustments.

In practice, this is what makes the approach powerful: the model already understands language at a general level, so it can quickly adapt to different tasks without needing to be redesigned from scratch.

Transformer vs. BERT vs. GPT

Before diving into GPT-1, it helps to understand how modern language models are structured. Most of them are based on the Transformer architecture, but they use it in different ways: encoder-only models (like BERT), decoder-only models (like GPT), or full encoder–decoder models.

The original encoder–decoder Transformer was mainly used for tasks like machine translation. Encoder-only models are typically used for understanding tasks such as text classification and sentiment analysis, while decoder-only models are designed for generation tasks like text creation, powering systems such as ChatGPT, Gemini, and Claude.

Illustration comparing Transformer, GPT, and BERT architectures, adapted from Comparing Large Language Models: GPT vs. BERT vs. T5 showing encoder-decoder, decoder-only, and encoder-only designs

Transformer vs BERT vs GPT: Key Differences

Aspect	Transformer (Original)	BERT	GPT
Paper	Attention Is All You Need (2017)	BERT (2018)	GPT (2018–2019)
Architecture Type	Encoder + Decoder	Encoder-only	Decoder-only
Primary Goal	Sequence-to-sequence tasks (for example, translation)	Language understanding	Language generation
Training Objective	Predict next token (seq2seq setup)	Masked language modeling (fill in blanks)	Predict next token (autoregressive)
Directionality	Bidirectional (encoder) + left-to-right (decoder)	Fully bidirectional	Left-to-right only
Context Understanding	Strong (via attention)	Very strong (full bidirectional context)	Strong (but only past context)
Input/Output Style	Input → Output sequence	Input → Representation	Input → Generated text
Fine-tuning	Required for each task	Required for each task	Optional (GPT-2+ supports zero-shot)
Typical Tasks	Translation, summarization	Classification, QA, NLI	Text generation, QA, chat
Strength	Flexible architecture foundation	Deep understanding of text	General-purpose generation
Limitation	Not directly usable without adaptation	Cannot generate text naturally	Limited bidirectional context
Key Innovation	Self-attention mechanism	Deep bidirectional encoding	Scaled generative pre-training
Evolution Role	Foundation of all modern LLMs	Specialized understanding models	Path to general-purpose AI

Model Architecture

To support this pre-training and fine-tuning approach, the GPT-1 model is built on a Transformer (decoder) architecture.

According to the authors, this choice is important for a few reasons. Unlike older models such as LSTMs, Transformers handle long-range dependencies more effectively, meaning they can better understand relationships between words that are far apart in a sentence.

They also rely on self-attention, a mechanism that allows the model to focus on the most relevant parts of the text when processing each word. This helps the model capture context more accurately.
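To make the idea concrete, here is a minimal sketch of causal (left-to-right) self-attention in plain Python. It is an illustration under simplifying assumptions, not the paper's implementation: it uses toy 2-dimensional vectors, a single head, and skips the learned query/key/value projections a real Transformer applies. The function names are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees only positions 0..i."""
    d = len(queries[0])
    n = len(queries)
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score the current position against itself and every earlier position.
        scores = [
            sum(qc * kc for qc, kc in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors that are visible to position i.
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so each row shows that future positions get no weight.
        all_weights.append(weights + [0.0] * (n - i - 1))
    return outputs, all_weights

# Toy "token" vectors, invented for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is the set of attention weights for one position. The zeros to the right of the diagonal are what make the model a decoder: a word can only attend to what came before it.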

Another key advantage is that Transformers make transfer learning more effective, since the same learned representations can be reused across different tasks with minimal changes.

The paper highlights that, in these transfer learning scenarios, Transformers outperform LSTM-based models.

Figure 1 from “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018), showing the Transformer architecture and task-specific input transformations.

Key Techniques

Along with the main approach, the authors introduce a few practical techniques that make the model more flexible across tasks.

According to the paper, different tasks are handled by converting them into text-based formats, so they can all be processed in a similar way. This makes it easier to use the same model across multiple problems without redesigning it each time.
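As a rough sketch of what these input transformations look like, the snippet below turns a few task types into a single token sequence. The marker strings `<s>`, `$`, and `<e>` are placeholders for the start, delimiter, and end tokens, the function names are hypothetical, and splitting on whitespace is a simplification (the paper works with subword tokens).

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"  # placeholder marker tokens

def format_classification(text):
    # Single text: start marker, the text, end marker.
    return [START, *text.split(), EXTRACT]

def format_entailment(premise, hypothesis):
    # Premise and hypothesis joined into one sequence by a delimiter.
    return [START, *premise.split(), DELIM, *hypothesis.split(), EXTRACT]

def format_similarity(text_a, text_b):
    # Similarity has no natural ordering, so both orderings are produced.
    return [format_entailment(text_a, text_b), format_entailment(text_b, text_a)]

def format_multiple_choice(context, answers):
    # One sequence per candidate answer, each paired with the same context.
    return [format_entailment(context, answer) for answer in answers]

print(format_entailment("A man is sleeping", "A person rests"))
```

Whatever the task, the model ends up reading one flat sequence, which is why the same pre-trained network can be reused with only a small output layer on top.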

Another important point is that the model requires only minimal architectural changes when switching between tasks. Most of the knowledge learned during pre-training is reused as-is.

The authors also include an auxiliary language modeling objective during fine-tuning, which helps the model retain its general understanding of language while adapting to specific tasks.
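In formula form, the fine-tuning setup combines two terms. Here C is the labeled dataset, each example is a token sequence x^1, ..., x^m with label y, L_1 is the next-token (language modeling) log-likelihood computed on the same data, and λ is a weight on the auxiliary term (the paper reports using 0.5):

```latex
\begin{aligned}
L_2(\mathcal{C}) &= \sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right) \\
L_3(\mathcal{C}) &= L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
\end{aligned}
```

The first line is the ordinary supervised objective. The second adds the language modeling term so that the model keeps practicing next-word prediction while it learns the task.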

Key Findings

After training and evaluation, the results weren't just strong – they were surprisingly competitive.

According to the authors, the model improved on the state of the art in 9 out of the 12 tasks studied. The gains include absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test) and 5.7% on question answering (RACE).

Another important observation is that the model performed well across datasets of different sizes, although performance was weaker on some smaller datasets.

This suggests that the pre-training step helped it generalize better, even when labeled data was limited.

In practice, what makes these results significant is that a single model was able to compete with specialized systems that were specifically designed for each individual task.

Figure 2 from “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018), illustrating performance gains from layer transfer and zero-shot learning behavior.

Conclusions

To wrap things up, this paper introduced a major shift in how AI systems are built.

According to the authors, instead of training a new model from scratch for every task, we can first teach a model the structure of language through pre-training, and then adapt it to specific tasks through fine-tuning. This simple idea turns out to be highly effective.

The key takeaway is that language models can develop a general understanding of text, especially when combined with Transformer architectures and large-scale data. This makes transfer learning practical across many different tasks.

In my view, this is what makes the paper so impactful. It doesn’t just improve performance on a few benchmarks. It changes the overall approach to building AI systems.

This idea later became the foundation for models like GPT-2, GPT-3, and ChatGPT, and continues to shape modern large language models today.

Limitations

Like any approach, this method comes with its own limitations.

According to the paper, one of the main challenges is the need for large amounts of unlabeled data during the pre-training stage, which may not always be easy to get. The model’s performance also depends heavily on how well the fine-tuning step is done.

The authors also note that multi-task learning was not fully explored in this work, leaving some open questions about how well the model can handle multiple tasks at the same time.

In practice, another limitation is that performance can be weaker when working with very small datasets, especially if the fine-tuning process is not carefully handled.

Related Work & Context

To better understand where this paper fits, it helps to look at the ideas it builds on.

According to the authors, earlier approaches such as word embeddings (like Word2Vec and GloVe), LSTM-based language models, and semi-supervised learning had already made progress in understanding language. But these methods were often limited to learning representations at the word level or required more task-specific design.

What this paper does differently is move beyond that. Instead of focusing only on individual words, it learns broader language representations that capture context and meaning across entire sequences. This shift is what enables the model to generalize better across different tasks.

Final Insight

If there’s one idea to take away from this paper, it’s this: you don’t need to teach an AI system every task separately.

According to the authors, once a model learns the structure of language, it can adapt to a wide range of tasks with minimal changes. That shift – from task-specific models to general language understanding – is what makes this work so important.

In my view, this is the moment where things really changed. What started here with GPT-1 became the foundation for the systems we use today, including ChatGPT and other modern language models.

Resources:

How Neural Networks Work – Explained Using the Straight Line Equation y = ax + b

Samyukta Hegde — Thu, 08 Jan 2026 00:02:44 +0000

Did you know that every data scientist who builds a complex neural network starts with a fundamental question: “How does the output change when the input changes?”

A straight line equation y = ax+b answers it in the simplest way possible. y can increase, decrease, or stay the same when x changes.

On the other hand, a deep neural network tries to answer it in a flexible way. That flexibility comes from multiple layers of straight-line calculations stacked on top of one another, along with non-linear adjustments that help the network adapt and produce the desired result.

Since a straight line is the essence of neural networks, I think it’s time we try to understand the subtle details of y = ax+b, which I refer to as the magical equation. We’ll also go through the basics of linear regression and classification, which should help you understand the progression of a simple straight line to a complex deep neural network.

Prerequisites
y=ax+b
Linear Regression
Linear Classification
Comparison
Key Additions to Help Build Deep Neural Networks
Modelling a Deep Neural Network
Final Thoughts

Prerequisites

A basic understanding of linear algebra, particularly y=ax+b.
General idea about linear regression and classification.
Familiarity with the concept of deep neural networks.

y=ax+b

A straight line simply means that the output changes steadily as the input changes. There are no surprises (that is, no non-linearity). Let’s analyze it properly.

y => Output variable
x => Input variable
a => Amount by which y changes when x changes (slope)
b => Value of y when x is 0 (y intercept)

We can take an example and model it in the same form to understand it better.

Ms. Poly is a math teacher who wants to formulate a study plan for her students to excel in an upcoming final exam. For simplicity, she creates a rule of thumb using only one factor: the number of hours studied per week. It has a direct impact on the marks scored by a student.

Before beginning, she makes certain assumptions:

Every student is capable of scoring at least 30 without studying.
For every hour a student studies, an additional 3 marks can be scored.

She then comes up with the following equation based on her ideas: y = 3x+30

y => Marks scored.
x => Number of hours studied.
a=3 => Increase in marks for every hour studied
b=30 => Minimum marks

In the above graph, she plots the points based on the results of the equation. As expected, it is a straight line. If she needs the marks scored for 9 hours of study, she can get it by just substituting x=9 in y=3x+30. Note that the data (x and y) are just based on her hunch and aren’t real.
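The substitution described above can be written as a tiny function. This is just a sketch of the rule of thumb; the name `predicted_marks` is made up for illustration.

```python
def predicted_marks(hours, slope=3, intercept=30):
    # y = a*x + b with a = 3 marks per hour and b = 30 baseline marks.
    return slope * hours + intercept

print(predicted_marks(9))  # 3 * 9 + 30 = 57
```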

But Ms. Poly wants to guide her students on how to prepare for the final exam based on actual data. So she conducts a pop quiz and grades it. In order to formulate a study plan, she interviews her students and collects information on how many hours they study math per week. She creates a table with two columns: number of hours studied (x) per week and marks scored (y). She tries her old formula y=3x+30, but it doesn’t seem to work. Thus, she doesn’t have any sensible equation describing the relation between x and y.

Let’s assume that a new student who hasn’t attended any exam (no y available) joins the class the next day, and Ms. Poly only knows the number of hours dedicated per week (x). How can she answer the question below?

If the new student studies for a certain number of hours (x), what can be the marks scored (y) in the exam?

It’s impossible unless there’s an equation defining the sample data. So, her task is to find one that fits the given points. This process is called curve fitting or regression.

Linear Regression

The core idea of linear regression is to find a straight line that captures the trend of the existing data so that it can be used to make predictions for new input data. Now, let’s dive straight into the example to understand the concept better.

Ms. Poly is determined to arrive at a solution. She plots the collected data on a graph to get a better picture.

She has absolutely no idea how x and y are related. So, she must figure out a formula, by trial and error, that roughly fits the points. She has to start with an intuitive guess, try to improve it in the subsequent steps and then arrive at the best possible solution.

Trial 1: Ms. Poly begins with her previous straight line equation.

y = 3x+30

She substitutes different values of x and plots it alongside the collected input data. This way she can get a clear picture of the differences in her assumption and reality.

Trial 2: She observes that the line needs a little more slope. This simply means that, in reality, more marks are being scored for every additional hour of study. By changing it from 3 to 4, the equation becomes:

y = 4x+30

The following graph depicts the new line alongside the sample data:

Trial 3: It looks better but she feels there is a need to shift the whole line upwards. This means that higher marks are being scored even if a student doesn’t dedicate any time for math in a week. She decides to retain the previous slope but changes the starting marks by 10, thus arriving at:

y = 4x+40

This particular line covers most of the points and can be considered the best possible solution.

Now, if she wishes to ascertain the marks scored by the new student who studied for 3.5 hours, she pins the value inside the formula and calculates the answer: y = 4*(3.5)+40=54

We saw how Ms. Poly arrived at a straight line equation to predict the output for an unknown input. Now she can chalk out a study plan for her class based on the equation.

Here, an expression is formulated to ascertain the change in output when the input changes. It looks like Ms. Poly is thinking like a data scientist. She has in fact modelled a very simple neural network for regression. The equation y=4x+40 can be considered as the only neuron (processing unit) within it. She’s adjusted the parameters a (weight) and b (bias) to arrive at the final formula which covers most of the points (thus minimizing the loss).
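One way to put a number on “covers most of the points” is to measure the average squared gap between each line and the data. The sketch below does that for the three trial lines. The quiz data here is hypothetical, invented only for illustration, since the original table is not reproduced in the text. Mean squared error is one common choice of loss, assumed here rather than taken from the article.

```python
# Hypothetical quiz results (hours studied per week, marks scored).
hours = [1, 2, 3, 4, 5, 6]
marks = [45, 47, 52, 57, 59, 65]

def mean_squared_error(slope, intercept):
    # Average of the squared differences between predicted and actual marks.
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(hours, marks)]
    return sum(squared) / len(squared)

trials = {
    "Trial 1: y = 3x + 30": (3, 30),
    "Trial 2: y = 4x + 30": (4, 30),
    "Trial 3: y = 4x + 40": (4, 40),
}

for name, (a, b) in trials.items():
    print(f"{name} -> loss {mean_squared_error(a, b):.2f}")
```

On this made-up data the loss shrinks with every trial, which matches the intuition of nudging the slope and then the intercept until the line sits on the points.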

Here’s a breakdown of the y = 4x+40 equation:

y => Marks scored.
x => Number of hours studied.
a=4 => Increase in marks for every hour studied
b=40 => Minimum marks

At present, it is a rudimentary neural network which has no layering or non-linearity.

Now let’s shift our attention to a completely different scenario. Ms. Poly, being a teacher, wants to ensure that all her students pass the exam. Assume that, this time, she’s not interested in predicting the marks scored. She just wants to know:

If a student studies for a certain number of hours (x), will the student pass/fail(y) the exam?

This leads her to the process of classification.

Linear Classification

The linear classification process uses a simple straight line to divide the data into categories or classes. The line acts as a boundary so that the classes fall on either side of it. First, Ms. Poly defines the boundary condition for pass and fail.

If marks scored>=50, pass

If marks scored<50, fail

According to the data table, x=3 corresponds to y=52 (boundary condition). Therefore, she takes x=3 as the classification line.

x=3 seems to segregate the points into the categories properly. She tries to confirm it by substituting another value. Thus, if a student studied for 9 hours, the score would lie towards the right side of x=3. So, they’d pass as per the classification equation.

Again, she’s arrived at an expression to ascertain the change in output when the input changes. But here, she has modelled a basic neural network for classification. The equation x=3 is the only neuron within it. It can be considered to be having two parts as explained below.

Pre-Activation Part: This portion of the neuron computes an intermediate value which is helpful in further processing. She’s figured out the parameters a (weight) and b (bias) to arrive at the following formula: z = x-3
```
 z => Intermediate Value.
 x => Number of hours studied.
 a=1 => Influence of the number of hours studied on the marks scored
 b=-3 => Minimum number of hours to study to pass the exam = 3
```
Activation Part: This portion triggers the neuron to make decisions based on a threshold value. The following equation segregates the points into two classes.
```
 y = 1 (Pass) if z>=0
 y = 0 (Fail) if z<0
```

This is a very plain neural network that has no layering or non-linearity, but does have pre-activation and activation parts inside its single neuron.
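The two parts of this neuron can be sketched in a few lines of Python. The function names are made up for illustration; the weight, bias, and threshold come from the example above.

```python
def pre_activation(hours, weight=1, bias=-3):
    # z = a*x + b with a = 1 and b = -3, that is z = x - 3.
    return weight * hours + bias

def step_activation(z):
    # 1 means pass, 0 means fail.
    return 1 if z >= 0 else 0

def passes(hours):
    return step_activation(pre_activation(hours))

for h in [2, 3, 9]:
    print(h, "hours ->", "Pass" if passes(h) else "Fail")
```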

Comparison

We looked at the examples of both linear regression and classification used by Ms. Poly. Regression helps in predicting a value while Classification helps in decision making. Let’s draw a small table to summarize the differences.

Upon careful observation, we notice that both answer the question of how a change in input affects the output, but at a slightly higher level of complexity than a plain straight line. That’s because in both regression and classification, we have to figure out the equation parameters by trial and error.

Here, since the requirements are simple, Ms. Poly just uses a straight line to solve both. A simple linear equation can handle only one steady trend. But in real life, problems that need solving are far more challenging and unpredictable. Some examples are:

Image Classification: An output label is produced based on the input images.

Text Translation: An English sentence can be given as an input to be translated to say, Spanish.

Chatbots: A text prompt is typed in by a user and a meaningful and relevant output is generated.

She would probably have to use a deep neural network if both the data and the task were complex. That presents another question: How does one build a deep neural network?

We will explore it further by extending the same example to a more realistic version.

Key Additions to Help Build Deep Neural Networks

In the above sections, we noted that Ms. Poly was interested in predicting the exam results of a student using just one factor - number of hours studied. However, in practice, is that one factor sufficient in determining the marks scored or whether the student passes the exam?

No. It’s not enough. She needs to take into account a lot of aspects like:

Number of hours studied
Number of hours of sleep/rest
Burnout due to over-studying
Difficulty level of topics in math
Pattern of the exam, and so on.

None of these factors acts independently, and none has a simple linear relation with the marks scored. So, she has to solve this problem by stacking the contributing factors one above the other in layers and also adding the element of non-linearity. Let’s take a look at each in detail.

Layering

Burnout leads to a lower score, whereas good sleep increases it. But burnout can be reduced if the student is well rested. So, the impact on the final score when these two factors interact should be taken into account. This is possible only when the system solves it in layers. The first layer can deal with how they independently influence the score, and the next layer can explore the interaction between them.

Non-Linearity

If the number of hours studied increases, the score might increase but when burnout overpowers the effect of study hours, the score reduces. The combined effect results in a non-linear graph. There is a rise and then dip in the score based on number of hours studied. It’s evident that the relationship is not straightforward as in a straight line. That’s where it becomes necessary to add non-linearity in the calculations. It helps the system to respond differently according to the conditions, allowing for flexibility in dealing with real world data and conditions.
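A short derivation shows why layering alone cannot produce this rise-and-dip shape. Suppose two straight-line steps are stacked with nothing in between, using illustrative slopes a_1, a_2 and intercepts b_1, b_2:

```latex
a_2 (a_1 x + b_1) + b_2 = (a_2 a_1)\, x + (a_2 b_1 + b_2)
```

The result is still a single straight line with a new slope and intercept, so the curve only becomes possible once a non-linear step sits between the layers.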

Thus, Ms. Poly would have to extend the idea of linear regression/classification by including layering and non-linearity to build a fully functional neural network to help build a practical study plan.

Modelling a Deep Neural Network

Ms. Poly should start the work on modelling a deep neural network by following the steps mentioned below:

Step #1 - Define the Problem Clearly

The following factors should be considered before she begins the process of modelling:

What are the input features?
What are the output features?
What type of problem is it (regression/classification)?

Step #2 - Define the Input Layer

The input features form the first layer. There is no computation in this stage. They are represented as:

x1: Number of hours studied
x2: Number of hours of sleep/rest
x3: Burnout due to over-studying
x4: Difficulty level of topics in Maths
x5: Pattern of the exam

Step #3 - Define the First Hidden Layer

This step consists of two parts:

Apply Linear Transformation: The actual learning begins here. A straight line equation is used to understand the combined effect of the inputs. The general formula is z=Wx+b.

z: Intermediate value or Pre-activation
W: Weight matrix consisting of values that correspond to the impact of each input feature
x: Vector of input features, [x1, x2, x3, x4, x5]
b: Bias, which represents the teacher’s initial assumptions (when x=0)

It looks similar to a linear regression/classification equation. At first, W and b are initialized to random values. Then, in the subsequent steps, they are adjusted as was done in the earlier examples. We can consider the following combinations, assuming we have two neurons in this layer:

Neuron 1: It can focus on study hours, burnout, and rest, with other features contributing less significantly.

Neuron 2: It can place more emphasis on the difficulty level of the topic and the exam type than on the other inputs.

It’s important to note that this layer doesn’t capture interactions between the features. Each neuron only computes a weighted sum, so the individual contributions of the inputs are simply added together. We don’t know how one input feature influences another. For example, we know sleep increases score and burnout reduces score, but what we don’t know at this stage is if sleep reduces burnout, which in turn can influence the final score.
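Here is a small sketch of the linear step z = Wx + b for a layer with two neurons and five inputs. All the numbers are hypothetical and chosen only to mirror the two neurons described above; in a real network they would be learned during training.

```python
def linear_layer(weights, inputs, biases):
    # One weighted sum per neuron: z_j = sum_i W[j][i] * x[i] + b[j].
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

# Hypothetical inputs: study hours, sleep hours, burnout, difficulty, exam pattern.
x = [6.0, 7.0, 0.3, 0.5, 0.2]

# Hypothetical weights. Row 1 leans on study, sleep, and burnout;
# row 2 leans on topic difficulty and exam pattern.
W = [
    [0.8, 0.5, -0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, -0.7, -0.6],
]
b = [0.5, 1.0]

z = linear_layer(W, x, b)
print(z)
```

Each entry of `z` is just a sum of independent contributions, which is why this step by itself cannot express how one feature changes the effect of another.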

Add Non-Linearity: This step, also called activation, helps in capturing the complexities in different combinations of the features. Less study results in low marks, and too much burnout also results in low marks. It means there is a curve in the score graph which can’t be represented by a linear equation. The activation function is applied to the intermediate value and can be expressed as:

a = g(z)

a: Activation output
g: Activation function
z: Intermediate value or Pre-activation

For example: ReLU is an activation function which outputs z only if z is positive, else 0.

a = ReLU(z) = max(0, z)

We can see that its slope is not constant (0 for negative z, 1 for positive z), so it is a non-linear activation function. It can suit this scenario as it lets the value pass through to the next layer only if the combined effect of features is greater than 0. Neuron 1 will let its output go to the next layer only if the intermediate value (z) that results from study hours, burnout and rest is large enough to influence the final decision. Otherwise, it’s ignored. There are multiple options for non-linear activation functions that one can choose from.
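In code, ReLU is a one-liner. The sample values below are arbitrary and only show the two regimes:

```python
def relu(z):
    # Pass positive values through unchanged, clamp everything else to 0.
    return max(0.0, z)

print([relu(z) for z in [-2.0, 0.0, 3.5]])  # [0.0, 0.0, 3.5]
```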

Step #4 - Stack Layers One Above the Other

This step helps in learning the mutual interactions between the inferences learned from the first hidden layer. The network attempts to understand the intricate details of the influencing factors and build a stable system. It is here that details such as whether sleep reduces burnout are figured out. Every layer consists of linear and non-linear transformations applied to its input, which is the set of values obtained from the previous layer. In the same way, multiple layers can be stacked one over the other. In this example, for representation, we have taken two hidden layers with two neurons each. The number of layers and neurons can vary based on requirements.

Step #5 - Define the Output Feature(s)

This is the final stage in a deep neural network. Ms. Poly can decide what she wants for output: predict the marks scored by a student or predict if the student passes/fails the exam. If she wants the final marks scored, she just has to apply a linear transformation in the neuron in the final layer to produce the output. If she wants pass/fail status, she has to apply both linear and non-linear transformations to achieve the desired results.
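Putting the steps together, the sketch below runs one forward pass through a network with five inputs, two hidden layers of two neurons each, and a single output neuron. Everything numeric is hypothetical. Using a sigmoid as the non-linear step for the pass/fail output is an assumption made here for illustration, since the text does not name a specific function.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    # Squashes any value into the range 0..1, read here as a pass probability.
    return 1.0 / (1.0 + math.exp(-z))

def dense(weights, biases, inputs, activation=None):
    # Linear step (weighted sum plus bias), then an optional non-linear step.
    outputs = []
    for row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs

def forward(x, params, task="regression"):
    h1 = dense(params["W1"], params["b1"], x, relu)   # first hidden layer
    h2 = dense(params["W2"], params["b2"], h1, relu)  # second hidden layer
    final_activation = sigmoid if task == "classification" else None
    return dense(params["W3"], params["b3"], h2, final_activation)[0]

# Hypothetical, untrained parameters.
params = {
    "W1": [[0.8, 0.5, -0.9, 0.1, 0.1], [0.1, 0.1, 0.1, -0.7, -0.6]],
    "b1": [0.5, 1.0],
    "W2": [[0.6, -0.2], [0.3, 0.4]],
    "b2": [0.0, 0.1],
    "W3": [[0.5, -0.5]],
    "b3": [0.0],
}
x = [6.0, 7.0, 0.3, 0.5, 0.2]  # hypothetical feature values

print("regression output:", forward(x, params, "regression"))
print("classification output:", forward(x, params, "classification"))
```

With untrained parameters the outputs are meaningless. The point is only the shape of the computation: the same linear-then-non-linear pattern repeated layer after layer, with the final layer chosen to match the task.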

The diagram below shows an abstract representation of the deep neural network.

The next steps are:

Training the model: The network is trained in the following way:

Random weights and biases are assigned to the linear transformation portions of the network.
Then the network makes a prediction which is compared with the expected result.
If there are gaps between the actual result and the predicted result, corrections are made in weights and biases (this step is similar to what was done in linear regression and classification).
The steps above are repeated until the results improve.
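The loop above can be sketched for the simplest possible case, a single neuron y = ax + b. The text does not say how the corrections are computed, so this example assumes gradient descent on mean squared error, a common choice. The data is the same kind of hypothetical quiz table used for illustration, and the learning rate and step count are arbitrary settings that happen to work for it.

```python
import random

# Hypothetical quiz results (hours studied per week, marks scored).
hours = [1, 2, 3, 4, 5, 6]
marks = [45, 47, 52, 57, 59, 65]

def train(xs, ys, learning_rate=0.02, steps=5000, seed=0):
    rng = random.Random(seed)
    # Step 1: start from random weight and bias.
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    for _ in range(steps):
        # Step 2: predict and compare with the expected result.
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        # Step 3: correct a and b in the direction that reduces the squared error.
        grad_a = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    # Step 4: the loop repeats until the step budget is used up.
    return a, b

a, b = train(hours, marks)
print(f"learned line: y = {a:.2f}x + {b:.2f}")
```

On this toy data the learned line ends up very close to y = 4x + 40, the same answer Ms. Poly reached by hand. A deep network follows the same loop, only with many more weights and biases to correct.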

Using the model: After the model has been trained, it is capable of yielding results for new input values.

Final Thoughts

In this article, we began with the basics of a straight line equation. Then we gradually navigated through slightly more elaborate concepts like linear regression and classification. They laid the groundwork for delving into the seemingly mysterious deep neural networks. But they are in fact built by stacking layers of linear transformations and non-linear activations, which help understand sophisticated real world patterns.

Despite all the complexities and layers, we can see that the straight line remains the foundation upon which neural networks are built. As we saw earlier, the equation that a deep neural network begins with is our magical equation: y = ax+b.

How to Set Up CUDA and WSL2 for Windows 11 (including PyTorch and TensorFlow GPU)

Md. Fahim Bin Amin — Wed, 03 Dec 2025 20:20:46 +0000

If you’re working on complex Machine Learning projects, you’ll need a good Graphics Processing Unit (or GPU) to power everything. And Nvidia is a popular option these days, as it has great compatibility and widespread support.

If you’re new to Machine Learning and are just getting started, then a free Kaggle or Colab notebook might be enough for you. But that won’t be the case when you want to go deeper. You’ll need a GPU, which can get costly if you’re continuously using it on the cloud.

But there’s some good news: you can utilize your computer’s Nvidia GPU (GTX/RTX) quite easily and perform machine learning-related tasks right on your local machine. The cool thing is, it won’t cost you anything other than the electricity it uses!

When you’re running Machine Learning models on your local machines, the most suitable operating system is a Linux-based one, like Ubuntu. But Windows has improved a lot for this purpose. If you’re using the latest Windows 11, you can leverage Windows Subsystem for Linux (WSL) and use your GPU directly for Machine Learning-related workflows.

This process can be quite tricky, though, as can making two popular Machine Learning frameworks, TensorFlow and PyTorch, compatible with your system GPU in Windows 11. That’s why I have written this comprehensive guide to ease your pain.

In it, I’ll help you set up CUDA on Windows Subsystem for Linux 2 (WSL2) so you can leverage your Nvidia GPU for machine learning tasks.

By following these steps, you’ll be able to run ML frameworks like TensorFlow and PyTorch with GPU acceleration on Windows 11.

Keep in mind that this guide assumes you have a compatible Nvidia GPU. Make sure to check Nvidia's official compatibility list before proceeding.

I have also prepared a video for you that’ll help you follow proper guidelines throughout this article.

Also, if this tutorial helps you, then don’t forget to add a star to the GitHub repository CUDA-WSL2-Ubuntu-v2. If you face any issues or have any suggestions/improvements, then please raise an issue in the GitHub repository. Currently, the live website is available at ml-win11-v2.fahimbinamin.com.

Prerequisites
Windows Terminal
Windows PowerShell (Latest & Greatest)
Configure Windows Terminal
Configuration of my computer
CPU Virtualization
Install WSL2
Install Latest LTS Ubuntu via WSL2
Update & Upgrade Ubuntu Packages
Install and Configure Miniconda
Install Jupyter & Ipykernel
Nvidia Driver
Install CUDA dependencies
CUDA Toolkit
Add Path to Shell Profile for CUDA
nvcc Version
cuDNN SDK
TensorFlow GPU
- Check TensorFlow GPU
PyTorch GPU
- Check PyTorch GPU
- Check PyTorch & TensorFlow GPU inside Jupyter Notebook
Conclusion

Prerequisites

Before you begin, make sure you have the following requirements met:

Windows 11 operating system
Nvidia GPU (GTX/RTX series)
Administrator access to your PC
At least 30 GB of free disk space
Internet connection for downloads
Latest Nvidia drivers installed

Windows Terminal

First, you’ll need to ensure that you have Windows Terminal installed properly in your operating system. It is the newest terminal application for users of command-line tools and shells like Command Prompt, PowerShell, and WSL. You can download it from the Microsoft Store.

After ensuring that it’s installed properly, you can proceed to the next steps.

Windows PowerShell (Latest & Greatest)

PowerShell (version 7 and later) is the modern, updated command-line shell from Microsoft. It is a separate install from the older Windows PowerShell that ships with Windows. You can use some Linux-specific commands directly in it, and it comes with built-in command suggestions. You can download it from the official GitHub page.

Download the latest x64 installer and install it. After ensuring that it is installed properly, you can proceed to the next steps.

Configure Windows Terminal

Now you’ll need to configure your Windows Terminal to use PowerShell as the default shell. It’s optional and you might skip this step. But I recommend doing it for a better experience.

Open Windows Terminal. Click on the down arrow icon in the title bar and select "Settings".

In the Settings tab, under "Startup", find the "Default profile" dropdown menu. Select "PowerShell" from the list.

Now for the "Default terminal application", select "Windows Terminal".

By default, PowerShell prints its version banner every time it starts. If you want to disable it, select the "PowerShell" profile from the left sidebar. Click on the "Command Line" field and add a --nologo argument at the end of the command. After this, the line becomes "C:\Program Files\PowerShell\7\pwsh.exe" --nologo.

If you don’t use other shells frequently and want to hide them in the dropdown, then you’ll need to select those profiles one by one from the left sidebar. Scroll down to the bottom and find the "Hide profile from dropdown" toggle and enable it. It will hide that specific shell from the dropdown menu.

For example, I am hiding the Azure Cloud Shell profile as I don't use it frequently:

Now click on the "Save" button at the bottom right corner to apply the changes. Close the Windows Terminal for now.

Configuration of My Computer

I figured it’d be helpful to share my current computer’s configuration so you can have a clear idea of which setup I’m using in this guide. Here are the details:

Component	Specification
Processor	AMD Ryzen 7 7700 8-Core Processor (8 Core 16 Threads)
RAM	64GB DDR5 6000MHz
Storage	1 TB Samsung 980 NVMe SSD, 4 TB HDD, 2 TB SATA SSD
GPU	NVIDIA GeForce RTX 3060 12GB GDDR6
Operating System	Windows 11 Pro Version 25H2

Now that you have an idea about my computer’s configuration, we can proceed to the next steps.

CPU Virtualization

As we are going to use WSL2, we’ll need to make sure that the CPU virtualization is enabled. To check whether virtualization is enabled or not from Windows, simply open the Windows Task Manager. Go to the Performance tab and select CPU from the left sidebar. In the bottom right corner, you will see the Virtualization status. If it shows "Enabled", then you are good to go. If it shows "Disabled", then you need to enable it from the BIOS.

⚠️ You have to ensure that CPU Virtualization is enabled in your BIOS settings. Different manufacturers have different ways to access the BIOS. Usually, you can access the BIOS by pressing the Delete or F2 key during the boot process. Once in BIOS, look for settings related to "Virtualization Technology" or "Intel VT-x"/"AMD-V" and make sure it is enabled. Save the changes and exit the BIOS.

Install WSL2

Open the Windows Terminal or Windows PowerShell as an administrator. Run the following command to install WSL2 along with the latest Ubuntu LTS distribution:

wsl.exe --install

It will install Windows Subsystem for Linux 2 (WSL2). After the installation is complete, you will be prompted to restart your computer. Do so to finalize the installation.

⚠️ If you encounter any issues during installation, refer to the official Microsoft documentation for troubleshooting WSL installation problems.

Install Latest LTS Ubuntu via WSL2

Open the Windows Terminal or Windows PowerShell again with the administrator privileges. If you want to check the available Linux distributions to install via WSL, run the following command:

wsl.exe --list --online

For installing any specific distribution, run the following command:

wsl.exe --install <DistributionName>

We are going to install the latest LTS Ubuntu distribution. As of now, the latest LTS version is Ubuntu 24.04. But I prefer to install the plain Ubuntu distribution name, as it always points to the latest LTS version. So, run the following command:

wsl.exe --install Ubuntu

You need to give it a default user account name. For me, I am going with fahim.

It also comes with a nice GUI management tool for WSL.

You can configure a lot from the settings GUI window, including limits on CPU cores, RAM, and disk space.

Update & Upgrade Ubuntu Packages

Open your Ubuntu terminal from Windows Terminal. First, we need to update and upgrade the existing packages to their latest versions.

To update the Ubuntu system, simply use the following command:

sudo apt update -y

To upgrade all the packages at once, simply use the following command:

sudo apt upgrade -y

⚠️ Make sure that you have a stable internet connection during the update and upgrade process to avoid any interruptions.

Install and Configure Miniconda

In Machine Learning, we need to manage multiple environments with different package versions. Conda is a popular package and environment management system that makes it easy to create and manage isolated environments for different projects. We will install Miniconda, a minimal installer for Conda, to manage our Python environments. But if you prefer Anaconda, you can install it instead.

Go to the official website of Miniconda. Currently the Miniconda installer is inside Anaconda here. If the official website gets updated, you can always search for "Miniconda installer" on Google to find the latest version. Also, you can create an issue in the official GitHub repository of this project to notify me about it.

As we are installing it inside WSL, we have to select the macOS/Linux Installation. Then select Linux Terminal Installer and choose Linux x86 for downloading the installer.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

It will download the installer to your WSL directory. Then use the following command to install it properly:

bash ~/Miniconda3-latest-Linux-x86_64.sh

⚠️ Make sure that you are in the correct directory where the installer is downloaded. If you downloaded it to a different location, adjust the path accordingly. Also, replace bash with zsh or sh if you are using a different shell.

Make sure to choose the initialization option properly. I prefer to keep the conda env active whenever I open a new shell. Therefore, I chose "Yes".

Make sure that the installation succeeds without any errors.

For the changes to take effect, you can close and reopen the current shell. But you can also do that without closing and reopening the shell by applying the command below.

source ~/.bashrc

⚠️ If you’re using a different shell like zsh or fish, make sure to source the appropriate configuration file (e.g., ~/.zshrc for zsh).

Install Jupyter & Ipykernel

I prefer to use Jupyter Notebook for running my machine learning experiments. It provides an interactive environment for coding and data analysis. We’ll install Jupyter Notebook and Ipykernel to run Jupyter notebooks in our conda environment. We will do that in all conda environments starting with the base environment. It also helps us to keep the conda environment kernel inside Jupyter Notebook.

First, make sure that you are in the base conda environment. You will see (base) on the left side of the terminal.

Now install Jupyter and Ipykernel both by applying the following command:

conda install jupyter ipykernel -y

Make sure that you accept the terms of service of Conda.

Now, I will create a separate conda environment to hold both TensorFlow and PyTorch with GPU support. You can also install them directly in the base environment or in any other environment as per your preference. I am not specifying a Python version while creating the environment, so conda will choose a suitable Python version when the first packages that need it are installed.

conda create --name ml -y

To activate any specific conda environment, you have to use the following command:

conda activate <environment_name>

For example, if I want to activate my newly created ml environment, I will use this command:

conda activate ml

If you’re not sure which conda environments are installed in your system, you can check all available and installed conda environments in your system by running the following command:

conda env list

Nvidia Driver

Ensure that you have the latest Nvidia drivers installed on Windows. WSL2 uses the Windows driver, so no separate driver installation is needed in Ubuntu. You can download the latest drivers from the official Nvidia website.

If you are just installing the latest GPU driver, then after installing the drivers, restart your computer to ensure the changes take effect. You can either use the GeForce Game Ready Driver or the NVIDIA Studio Driver. But I recommend using the Studio Driver for better stability with creative and ML applications.

Install CUDA Dependencies

You might face some issues if you do not have the CUDA dependencies installed properly. I recommend that you install the required dependencies before proceeding further:

sudo apt install gcc g++ build-essential

After installing the dependencies, you can then verify the CUDA installation if you had any issues earlier.

CUDA Toolkit

TensorFlow GPU is very picky about the CUDA version. So we need to install a specific version of CUDA Toolkit that is compatible with the TensorFlow version we are going to install.

To understand exactly which CUDA version is compatible with which TensorFlow version, you can check the official TensorFlow GPU support matrix here.

At the time I’m writing this article, the TensorFlow GPU documentation says that we should have CUDA Toolkit 12.3. So I will ensure that I install exactly that version. You can simply click on that version link in the official docs and it will redirect you to the official Nvidia CUDA Toolkit download page. But if the link gets updated in the future, you can always search for "Nvidia CUDA Toolkit" on Google to find the latest version.

As TensorFlow GPU is asking for exact Version 12.3, I will select version 12.3.0 exactly.

On the CUDA Toolkit download page, make sure to choose the Operating System as Linux, Architecture as x86_64, Distribution as WSL-Ubuntu, Version as 2.0, and the Installer Type as runfile (local).

⚠️ As we are using Ubuntu in our WSL2, you can also choose Ubuntu as your operating system. But I prefer to choose WSL-Ubuntu for better compatibility.

After selecting those, it will give you the download commands. You have to apply them sequentially. Make sure that you uncheck "Kernel Objects" while installing CUDA.

⚠️ Make sure to copy and paste the commands one by one in your WSL Ubuntu terminal to download and install the CUDA Toolkit properly. If you face any issues related to CUDA dependency, then quickly go through the Install CUDA dependencies section, where I have explained how to install the CUDA dependencies properly.

Add Path to Shell Profile for CUDA

After installing CUDA Toolkit, we need to add the CUDA binaries to our shell profile for easy access. This will allow us to run CUDA commands from any directory in the terminal.

Note that, depending on the shell you are using (bash, zsh, and so on), you need to add the CUDA path to the appropriate configuration file. Make sure to replace .bashrc with .zshrc or other configuration files if you are using a different shell.

To add the CUDA binary path, follow the command below:

echo 'export PATH=/usr/local/cuda-12.3/bin:$PATH' >> ~/.bashrc

You have to use the updated path where you installed it. Your terminal will show it after installing the CUDA:

Now, you need to add the path inside the Library path. Just use the exact path where you installed CUDA. Your terminal will list the path properly.

echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc

After adding those paths, you need to source the shell profile for the changes to take effect. You can do that by running the following command:

source ~/.bashrc
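To confirm that both exports took effect, you could run a small check like the one below. It is only a sketch: the helper name `missing_cuda_entries` is made up for this example, and the default directory assumes the same /usr/local/cuda-12.3 install location used in the commands above.

```python
import os


def missing_cuda_entries(env, cuda_home="/usr/local/cuda-12.3"):
    """Return the environment variables that do not yet contain the CUDA directories."""
    expected = {
        "PATH": f"{cuda_home}/bin",
        "LD_LIBRARY_PATH": f"{cuda_home}/lib64",
    }
    missing = []
    for var, directory in expected.items():
        # both variables are colon-separated lists of directories on Linux
        entries = env.get(var, "").split(":")
        if directory not in entries:
            missing.append(var)
    return missing


# an empty list means both paths are present in the current shell session
print(missing_cuda_entries(os.environ))
```

If the list is not empty, re-check the two export lines in your shell profile and source it again.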

nvcc Version

NVCC stands for Nvidia CUDA Compiler. It is basically a compiler driver for the CUDA platform that allows developers to write parallel programs to run on Nvidia GPUs. As we have already installed the CUDA toolkit, we need to see whether the compiler is also properly activated. To check that, we need to verify the version.

Verify that CUDA is properly installed by checking the version:

nvcc --version

If the output shows the correct CUDA version, then you have successfully installed CUDA Toolkit in your WSL2 Ubuntu environment.
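If you would rather check the version from a script, a rough sketch is shown below. It assumes the nvcc output contains a line of the form "release 12.3", and the sample string is only an illustrative example of that line. The function names are hypothetical helpers, not part of any CUDA tooling.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Extract (major, minor) from text containing 'release X.Y', or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_nvcc_release():
    """Run nvcc --version if nvcc is on PATH and parse the result."""
    if shutil.which("nvcc") is None:
        return None  # nvcc is not reachable, so the PATH export is probably missing
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


# illustrative sample of the relevant output line
sample = "Cuda compilation tools, release 12.3, V12.3.52"
print(parse_nvcc_release(sample))
print(installed_nvcc_release())
```

A result of None from `installed_nvcc_release()` suggests that the shell cannot find nvcc, which usually points back to the PATH step.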

cuDNN SDK

The cuDNN (CUDA Deep Neural Network) SDK is a GPU accelerated library of primitives for deep neural networks, developed by Nvidia. It provides highly optimized building blocks for common deep learning operations, significantly speeding up the training and inference processes of AI models on Nvidia GPUs.

Note: Even though TensorFlow GPU suggests a specific cuDNN version, it’s often compatible with multiple versions. Because of this, I recommend downloading the latest cuDNN version that is compatible with your installed CUDA version. You can find the cuDNN download page here.

Select the Operating System as Linux, Architecture as x86_64, Distribution as Ubuntu, Version as 24.04, Installer Type as deb (local), Configuration as FULL. After selecting those, it will give you the download commands. You have to apply them sequentially.

⚠️ Make sure to copy and paste the commands one by one in your WSL Ubuntu terminal to download and install the cuDNN SDK properly. If you face any issues related to CUDA dependency, then quickly go through the Install CUDA dependencies section, where I have explained how to install the CUDA dependencies properly.

TensorFlow GPU

Now, we are going to install TensorFlow GPU in our conda environment. Make sure that you have activated the conda environment where you want to install it. I’m going to install it in my previously created ml environment. To activate it, I’ll use the following command:

conda activate ml

⚠️ Make sure that you have activated the correct conda environment before installing TensorFlow GPU. You will see the environment name in the terminal prompt.

I will install ipykernel and jupyter in this new environment.

conda install jupyter ipykernel -y

Now, to install TensorFlow GPU, I will simply use the following command:

pip install "tensorflow[and-cuda]"

It might take a couple of minutes depending on the internet speed you have. Just have patience and wait for it to finish the installation.

Check TensorFlow GPU

After installing TensorFlow GPU, we need to verify that it is working properly with GPU support. Run the following command in your Ubuntu terminal:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If the output shows a list of available GPU devices, then TensorFlow GPU is successfully installed and working properly.

PyTorch GPU

Now, we’re going to install PyTorch GPU in our conda environment. Make sure that you have activated the conda environment where you want to install it. I’m going to install it in my previously created ml environment. To activate it, I will use the following command:

conda activate ml

Installing PyTorch GPU is very straightforward. You can use the official PyTorch installation command generator here.

Make sure to select the latest Stable PyTorch Build, Linux as Your OS, Pip as the Package, and Python as the Language. For the Compute Platform, select the CUDA version that matches your installed CUDA Toolkit. For me, it is CUDA 12.3. But if you cannot find the exact one, choose the closest. As CUDA 12.3 is not available for me now, I am choosing CUDA 12.6.

After selecting those, it will give you the installation command. You have to apply it in your WSL Ubuntu terminal.

It might take a couple of minutes depending on the internet speed you have. Just have patience and wait for it to finish the installation.

Check PyTorch GPU

After installing PyTorch GPU, verify that it is working properly with GPU support. Run the following in your Ubuntu terminal:

python3 - << 'EOF'
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.device(0))
print(torch.cuda.get_device_name(0))
EOF

The output should look similar to the screenshot, showing:

True: GPU is available for PyTorch
1: Number of detected CUDA devices
0: Index of the current active CUDA device
A device object representation
NVIDIA GeForce RTX 3060 (or your GPU name)

Check PyTorch & TensorFlow GPU inside Jupyter Notebook

Now that the environment is fully configured, we will verify GPU support directly inside Jupyter Notebook. This ensures both PyTorch and TensorFlow can successfully detect and use your GPU.

1. Test PyTorch GPU

Create a new Jupyter Notebook and run the following commands one by one:

import torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.device(0))
print(torch.cuda.get_device_name(0))

If everything is configured correctly, you will see your GPU (for example NVIDIA GeForce RTX 3060) detected properly:

2. Test TensorFlow GPU

Next, run the following code to check whether TensorFlow detects your GPU:

import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))

You can also check the number of GPUs detected:

print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

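Finally, run TensorFlow’s built-in GPU validation (warnings are normal, including a deprecation warning for tf.test.is_gpu_available(), which newer TensorFlow versions replace with tf.config.list_physical_devices('GPU')):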

import tensorflow as tf

assert tf.test.is_gpu_available()
assert tf.test.is_built_with_cuda()

If TensorFlow logs show your GPU model (such as RTX 3060), then TensorFlow GPU is successfully installed and fully working inside Jupyter Notebook.
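If you want a single notebook cell that covers both frameworks, something like the sketch below may help. The helper `gpu_report` is a hypothetical name for this example. It only uses the same torch.cuda.is_available() and tf.config.list_physical_devices('GPU') calls shown above, and it reports None for a framework that is not installed in the active environment.

```python
def gpu_report():
    """Return {framework: True/False/None}, where None means 'not installed'."""
    report = {"torch": None, "tensorflow": None}

    try:
        import torch
        report["torch"] = bool(torch.cuda.is_available())
    except ImportError:
        pass  # PyTorch is not installed in this environment

    try:
        import tensorflow as tf
        report["tensorflow"] = len(tf.config.list_physical_devices("GPU")) > 0
    except ImportError:
        pass  # TensorFlow is not installed in this environment

    return report


report = gpu_report()
for name, status in report.items():
    label = {None: "not installed", True: "GPU detected", False: "no GPU detected"}[status]
    print(f"{name}: {label}")
```

Seeing "not installed" here usually means the notebook kernel is not the ml environment, so check the selected kernel first.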

Conclusion

Thank you so much for reading all the way through. I hope you have been able to configure your Windows 11 computer properly for running almost any kind of Machine Learning-based experiments.

To get more content like this, you can follow me on LinkedIn and X. You can also check my website and follow me on GitHub if you are into open source and development.

How to Build End-to-End Machine Learning Lineage

Kuriko — Thu, 16 Oct 2025 13:43:13 +0000

Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance.

While many services for tracking ML lineage exist, creating a comprehensive and manageable lineage often proves complicated.

In this article, I’ll walk you through integrating a comprehensive ML lineage solution for an ML application deployed on serverless AWS Lambda, covering the end-to-end pipeline stages:

ETL pipeline
Data drift detection
Preprocessing
Model tuning
Risk and fairness evaluation.

What is Machine Learning Lineage?
What We’ll Build
- The System Architecture - AI Pricing for Retailers
- The ML Lineage
Workflow in Action
Step 1: Initiating a DVC Project
Step 2: The ML Lineage
Step 3: Deploying the DVC Project
Step 4: Configuring Scheduled Run with Prefect
Step 5: Deploying the Application
- Test in Local
Conclusion

Prerequisites:

Knowledge of key Machine Learning / Deep Learning concepts including the full lifecycle: data handling, model training, tuning, and validation.
Proficiency in Python, with experience using major ML libraries.
Basic understanding of DevOps principles.

Tools we’ll use:

Here is a summary of the tools we’re going to use to track the ML lineage:

DVC: An open-source version control system for data. Used to track the ML lineage.
AWS S3: A secure object storage service from AWS. Used as remote storage.
Evidently AI: An open-source ML and LLM observability framework. Used to detect data drift.
Prefect: A workflow orchestration engine. Used to manage the scheduled runs of the lineage.

What is Machine Learning Lineage?

Machine learning (ML) lineage is a framework for tracking and understanding the complete lifecycle of a machine learning model.

It contains information at different levels such as:

Code: The scripts, libraries, and configurations for model training.
Data: The original data, transformations, and features.
Experiments: Training runs, hyperparameter tuning results.
Models: The trained models and their versions.
Predictions: The outputs of deployed models.

ML lineage is essential for multiple reasons:

Reproducibility: Recreate the same model and prediction for validation.
Root cause analysis: Trace back to the data, code, or configuration change when a model fails in production.
Compliance: Some regulated industries require proof of model training to ensure fairness, transparency, and adherence to laws like GDPR and the EU AI Act.

What We’ll Build

In this project, I’ll integrate an ML lineage into this price prediction system built on AWS Lambda architecture using DVC, an open-source version control system for ML applications.

The below diagram illustrates the system architecture and the ML lineage we’ll integrate:

Figure A: A comprehensive ML lineage for an ML application on serverless Lambda (Created by Kuriko IWAI)

The System Architecture: AI Pricing for Retailers

The system operates as a containerized, serverless microservice designed to provide optimal price recommendations to maximize retailer sales.

Its core intelligence comes from AI models trained on historical purchase data to predict the quantity of the product sold at various prices, allowing sellers to determine the best price.

For consistent deployment, the prediction logic and its dependencies are packaged into a Docker container image and stored in AWS ECR (Elastic Container Registry).

The prediction is then served by an AWS Lambda function, which retrieves and runs the container from ECR and exposes the result via AWS API Gateway for the Flask application to consume.

If you want to see how to build this from the ground up, you can follow along with my tutorial How to Build a Machine Learning System on Serverless Architecture.

The ML Lineage

In the system, GitHub handles the code lineage, while DVC captures the lineage of:

Data (blue boxes): ETL and preprocessing.
Experiments (light orange): Hyperparameter tuning and validation.
Models and Prediction (dark orange): Final model artifacts and prediction results.

DVC tracks the lineage through separate stages, from data extraction to fairness testing (yellow rows in Figure A).

For each stage, DVC uses an MD5 or SHA256 hash to track and push metadata like artifacts, metrics, and reports to its remote on AWS S3.

The pipeline incorporates Evidently AI to handle data drift tests, which are essential for identifying shifts in data distributions that could compromise the model's generalization capabilities in production.

Only models that successfully pass both the data drift and fairness tests can serve predictions via AWS API Gateway (red box in Figure A).

Lastly, this entire lineage process is triggered weekly by the open-source workflow scheduler, Prefect.

Prefect prompts DVC to check for updates in data and scripts, and executes the full lineage process if changes are detected.

Workflow in Action

The building process involves five main steps:

Initiate a DVC project
Define the lineage stages with the DVC script dvc.yaml and corresponding Python script
Deploy the DVC project
Configure scheduled run with Prefect
Deploy the application

Let’s walk through each step together.

Step 1: Initiating a DVC Project

The first step is to initiate a DVC project:

dvc init

This command automatically creates a .dvc directory at the root of the project folder:

.
└── .dvc/
    ├── cache/         # [.gitignore] store dvc caches (cached actual data files)
    ├── tmp/           # [.gitignore]
    ├── .gitignore     # gitignore cache, tmp, and config.local
    ├── config         # dvc config for production
    └── config.local   # [.gitignore] dvc config for local

DVC keeps the Git repository fast and lightweight by storing large data files outside of it.

The process involves caching the original data in the local .dvc/cache directory, creating a small .dvc metadata file which contains an MD5 hash and a link to the original data file path, pushing only the small metadata files to Git, and pushing the original data to the DVC remote.
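The idea of identifying a data file by its MD5 hash can be illustrated with the standard library alone. This is a simplified sketch, not DVC's actual implementation: the helper names are made up, and the cache layout shown (first two hash characters as a folder) is an assumption that may differ between DVC versions.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_location(md5_hex, cache_dir=".dvc/cache"):
    """Build a content-addressed path from the hash (illustrative layout only)."""
    return f"{cache_dir}/{md5_hex[:2]}/{md5_hex[2:]}"


# write a tiny stand-in for a data file and hash it
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.csv")
    with open(data_path, "wb") as f:
        f.write(b"hello")
    file_hash = md5_of_file(data_path)

print(file_hash)
print(cache_location(file_hash))
```

Because the path is derived from the content, two identical files map to the same cache entry, and any change to the data produces a new hash that the small metadata file in Git can point to.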

Step 2: The ML Lineage

Next, we’ll configure the ML lineage with the following stages:

etl_pipeline: Extract, clean, impute the original data and perform feature engineering.
data_drift_check: Run data drift tests. If they fail, the system exits.
preprocess: Create training, validation, and test datasets.
tune_primary_model: Tune hyperparameters and train the model.
inference_primary_model: Perform inference on the test dataset.
assess_model_risk: Run risk and fairness tests.

Each stage requires defining the DVC command and its corresponding Python script.

Let’s get started.

Stage 1: The ETL Pipeline

The first stage is to extract, clean, impute the original data, and perform feature engineering.

DVC Configuration

We’ll create the dvc.yaml file at the root of the project directory and add the etl_pipeline stage:

dvc.yaml

stages:
  etl_pipeline:
    # the main command dvc will run in this stage
    cmd: python src/data_handling/etl_pipeline.py

    # dependencies necessary to run the main command
    deps:
      - src/data_handling/etl_pipeline.py
      - src/data_handling/
      - src/_utils/

    # output paths for dvc to track
    outs:
      - data/original_df.parquet
      - data/processed_df.parquet

The dvc.yaml file defines a sequence of steps (stages) using sections like:

cmd: The shell command to be executed for that stage
deps: Dependencies needed to run the cmd
params: Default parameters for the cmd defined in the params.yaml file
metrics: The metrics files to track
reports: The report files to track
plots: The DVC plot files for visualization
outs: The output files produced by the cmd, which DVC will track

The configuration helps DVC ensure reproducibility by explicitly listing dependencies, outputs, and the commands of each stage. It also helps it manage the lineage by establishing a Directed Acyclic Graph (DAG) of the workflow, linking each stage to the next.
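To see how listing deps and outs is enough to derive an execution order, here is a small sketch using Python's standard graphlib module (Python 3.9+). It is not how DVC is implemented internally. The stage names follow this article, while the file paths for the later stages are hypothetical.

```python
from graphlib import TopologicalSorter

# stages deliberately listed out of order; only deps and outs connect them
stages = {
    "tune_primary_model": {
        "deps": ["data/x_train.parquet"],          # hypothetical path
        "outs": ["models/primary_model.pth"],      # hypothetical path
    },
    "etl_pipeline": {
        "deps": ["src/data_handling/etl_pipeline.py"],
        "outs": ["data/processed_df.parquet"],
    },
    "preprocess": {
        "deps": ["data/processed_df.parquet"],
        "outs": ["data/x_train.parquet"],          # hypothetical path
    },
}

# map every output file to the stage that produces it
producer = {out: name for name, spec in stages.items() for out in spec["outs"]}

# a stage depends on whichever stages produce the files it reads
graph = {
    name: {producer[dep] for dep in spec["deps"] if dep in producer}
    for name, spec in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)
```

A stage whose dependencies are unchanged can be skipped, which is why an accurate deps list matters for both speed and reproducibility.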

Python Scripts

Next, let’s add Python scripts, ensuring the data is stored using the file paths specified in the outs section of the dvc.yaml file:

src/data_handling/etl_pipeline.py:

import os
import argparse

import src.data_handling.scripts as scripts
from src._utils import main_logger

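def etl_pipeline(stockcode: str = '', impute_stockcode: bool = False):
    # stockcode and impute_stockcode are accepted so the CLI call below works;
    # this simplified version always processes the full dataset

    # extract the entire data
    df = scripts.extract_original_dataframe()

    # save the original data as a parquet file
    ORIGINAL_DF_PATH = os.path.join('data', 'original_df.parquet')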
    df.to_parquet(ORIGINAL_DF_PATH, index=False) # dvc tracked

    # transform
    df = scripts.structure_missing_values(df=df)
    df = scripts.handle_feature_engineering(df=df)

    PROCESSED_DF_PATH = os.path.join('data', 'processed_df.parquet')
    df.to_parquet(PROCESSED_DF_PATH, index=False) # dvc tracked
    return df

# for dvc execution
if __name__ == '__main__':  
    parser = argparse.ArgumentParser(description="run etl pipeline")
    parser.add_argument('--stockcode', type=str, default='', help="specific stockcode to process. empty runs full pipeline.")
    parser.add_argument('--impute', action='store_true', help="flag to create imputation values")
    args = parser.parse_args()

    etl_pipeline(stockcode=args.stockcode, impute_stockcode=args.impute)

Outputs

The original and processed pandas DataFrames are saved as Parquet files, which DVC tracks and stores in its cache:

data/original_df.parquet
data/processed_df.parquet

Stage 2: The Data Drift Check

Before jumping into preprocessing, we’ll run data drift tests to check whether there is any notable drift in the data. To do this, we’ll use Evidently AI, an open-source ML and LLM observability framework.

What is Data Drift?

Data drift refers to any change in the statistical properties (like the mean, variance, or distribution) of the data a model sees in production, relative to the data it was trained on.

There are three main types of data drift:

Covariate Drift (Feature Drift): A change in the input feature distribution.
Prior Probability Drift (Label Drift): A change in the target variable distribution.
Concept Drift: A change in the relationship between the input data and the target variable.

Data drift compromises the model's generalization capabilities over time, making its detection after deployment crucial.
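As a rough illustration of covariate drift detection, the sketch below compares a reference sample and a current sample of one numeric feature with a two-sample Kolmogorov-Smirnov statistic, written by hand in plain Python. It is not the Evidently AI implementation, and the 0.5 threshold and the sample values are arbitrary assumptions for the example.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = fully separated)."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    for x in sorted(set(ref) | set(cur)):
        # fraction of each sample that is less than or equal to x
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


# illustrative unit prices: training-time sample vs. two production samples
reference_prices = [1.0, 2.0, 3.0, 4.0, 5.0]
similar_prices = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted_prices = [10.0, 11.0, 12.0, 13.0, 14.0]

DRIFT_THRESHOLD = 0.5  # arbitrary cut-off for this sketch

for label, sample in [("similar", similar_prices), ("shifted", shifted_prices)]:
    score = ks_statistic(reference_prices, sample)
    print(label, score, "drift" if score > DRIFT_THRESHOLD else "no drift")
```

This only covers a change in one input feature. Label drift and concept drift need the target variable as well, which is one reason a dedicated framework is used in the pipeline.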

DVC Configuration

We’ll add the data_drift_check stage right after the etl_pipeline stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    # the main command dvc will run in this stage
    cmd: >
      python src/data_handling/report_data_drift.py
      data/processed/processed_df.csv 
      data/processed_df_${params.stockcode}.parquet
      reports/data_drift_report_${params.stockcode}.html
      metrics/data_drift_${params.stockcode}.json
      ${params.stockcode}

    # default values to the parameters (defined in the params.yaml file)
    params:
      - params.stockcode

    # dependencies necessary to run the main command
    deps:
      - src/data_handling/report_data_drift.py
      - src/

    # output file paths for dvc to track
    plots:
      - reports/data_drift_report_${params.stockcode}.html:

    metrics:
      - metrics/data_drift_${params.stockcode}.json:
          type: json

Then, add default values to the parameters passed to the DVC command:

params.yaml:

params:
  stockcode: <STOCKCODE OF CHOICE>

Python Scripts

After generating an API token from the Evidently AI workspace, we’ll add a Python script that detects data drift and stores the results in the metrics variable:

src/data_handling/report_data_drift.py:

import os
import sys
import json
import pandas as pd
import datetime
from dotenv import load_dotenv

from evidently import Dataset, DataDefinition, Report
from evidently.presets import DataDriftPreset
from evidently.ui.workspace import CloudWorkspace

import src.data_handling.scripts as scripts
from src._utils import main_logger


if __name__ == '__main__':
    # initiate evidently cloud workspace
    load_dotenv(override=True)
    ws = CloudWorkspace(token=os.getenv('EVENTLY_API_TOKEN'), url='https://app.evidently.cloud')

    # retrieve evidently project
    project = ws.get_project('EVENTLY AI PROJECT ID')

    # retrieve paths from the command line args
    REFERENCE_DATA_PATH = sys.argv[1]
    CURRENT_DATA_PATH = sys.argv[2]
    REPORT_OUTPUT_PATH = sys.argv[3]
    METRICS_OUTPUT_PATH = sys.argv[4]
    STOCKCODE = sys.argv[5]

    # create folders if not exist
    os.makedirs(os.path.dirname(REPORT_OUTPUT_PATH), exist_ok=True)
    os.makedirs(os.path.dirname(METRICS_OUTPUT_PATH), exist_ok=True)

    # extract datasets
    reference_data_full = pd.read_csv(REFERENCE_DATA_PATH)
    reference_data_stockcode = reference_data_full[reference_data_full['stockcode'] == STOCKCODE]
    current_data_stockcode = pd.read_parquet(CURRENT_DATA_PATH)

    # define data schema
    nums, cats = scripts.categorize_num_cat_cols(df=reference_data_stockcode)
    for col in nums: current_data_stockcode[col] = pd.to_numeric(current_data_stockcode[col], errors='coerce')

    schema = DataDefinition(numerical_columns=nums, categorical_columns=cats)

    # define evidently datasets w/ the data schema
    eval_data_1 = Dataset.from_pandas(reference_data_stockcode, data_definition=schema)
    eval_data_2 = Dataset.from_pandas(current_data_stockcode, data_definition=schema)

    # execute drift detection
    report = Report(metrics=[DataDriftPreset()])
    data_eval = report.run(reference_data=eval_data_1, current_data=eval_data_2)
    data_eval.save_html(REPORT_OUTPUT_PATH)

    # create metrics for dvc tracking
    report_dict = json.loads(data_eval.json())
    num_drifts = report_dict['metrics'][0]['value']['count']
    shared_drifts = report_dict['metrics'][0]['value']['share']
    metrics = dict(
        drift_detected=bool(num_drifts > 0.0), num_drifts=num_drifts, shared_drifts=shared_drifts,
        num_cols=nums,
        cat_cols=cats,
        stockcode=STOCKCODE,
        timestamp=datetime.datetime.now().isoformat(),
    )

    # write metrics file (dvc track)
    with open(METRICS_OUTPUT_PATH, 'w') as f:
        json.dump(metrics, f, indent=4)
        main_logger.info(f'... drift metrics saved to {METRICS_OUTPUT_PATH}... ')

    # stop the system if data drift is found
    if num_drifts > 0.0: sys.exit('❌ FATAL: data drift detected. stopping pipeline')

If data drift is found, the script immediately exits using the final sys.exit command.
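
Calling sys.exit with a string prints the message to stderr and ends the process with exit status 1. That non-zero status is what makes the DVC stage fail and stops the stages after it. The script above halts on any drifted column. As a hedged alternative, the sketch below shows a gate that tolerates a small share of drifted columns. The function name drift_exit_code and the max_drift_share parameter are hypothetical and not part of the original project.

```python
import json


def drift_exit_code(metrics: dict, max_drift_share: float = 0.0) -> int:
    """Return 1 (fail the stage) when the share of drifted columns exceeds the tolerance."""
    share = float(metrics.get("shared_drifts", 0.0))
    return 1 if share > max_drift_share else 0


# a trimmed-down version of the metrics file this stage writes
sample_metrics = json.loads(
    '{"drift_detected": false, "num_drifts": 0.0, "shared_drifts": 0.0}'
)

print(drift_exit_code(sample_metrics))                               # no drift
print(drift_exit_code({"shared_drifts": 0.25}))                      # strict gate
print(drift_exit_code({"shared_drifts": 0.25}, max_drift_share=0.3)) # tolerant gate
```

With the default tolerance of 0.0, the gate behaves like the strict check in the script. A script could pass the returned value to sys.exit to keep the same stop-the-pipeline behavior.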

Outputs

The script generates two files that DVC will track:

reports/data_drift_report.html: The data drift report in an HTML file.
metrics/data_drift.json: The data drift metrics in a JSON file, including the drift results along with the feature columns and a timestamp:

metrics/data_drift.json:

{
    "drift_detected": false,
    "num_drifts": 0.0,
    "shared_drifts": 0.0,
    "num_cols": [
        "invoiceno",
        "invoicedate",
        "unitprice",
        "product_avg_quantity_last_month",
        "product_max_price_all_time",
        "unitprice_vs_max",
        "unitprice_to_avg",
        "unitprice_squared",
        "unitprice_log"
    ],
    "cat_cols": [
        "stockcode",
        "customerid",
        "country",
        "year",
        "year_month",
        "day_of_week",
        "is_registered"
    ],
    "timestamp": "2025-10-07T00:24:29.899495"
}

The drift test results are also available on the Evidently workspace dashboard for further analysis:

Figure B. Screenshot of the Evidently workspace dashboard

Stage 3: Preprocessing

If no data drift is detected, the lineage moves on to the preprocessing stage.

DVC Configuration

We’ll add the preprocess stage right after the data_drift_check stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    cmd: >
      python src/data_handling/preprocess.py --target_col ${params.target_col} --should_scale ${params.should_scale} --verbose ${params.verbose}

    deps:
      - src/data_handling/preprocess.py
      - src/data_handling/
      - src/_utils

    # params from params.yaml
    params:
      - params.target_col
      - params.should_scale
      - params.verbose

    outs:
      # train, val, test datasets
      - data/x_train_df.parquet
      - data/x_val_df.parquet
      - data/x_test_df.parquet
      - data/y_train_df.parquet
      - data/y_val_df.parquet
      - data/y_test_df.parquet

      # preprocessed input datasets
      - data/x_train_processed.parquet
      - data/x_val_processed.parquet
      - data/x_test_processed.parquet

      # trained preprocessor and human readable feature names for shap analysis
      - preprocessors/column_transformer.pkl
      - preprocessors/feature_names.json

And then add default values of the parameters used in the cmd:

params.yaml:

params:
  target_col: "quantity"
  should_scale: True
  verbose: False

Python Scripts

Next, we’ll add a Python script to create training, validation, and test datasets and preprocess input data:

import os
import argparse
import json
import joblib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import src.data_handling.scripts as scripts
from src._utils import main_logger

def preprocess(stockcode: str = '', target_col: str = 'quantity', should_scale: bool = True, verbose: bool = False):
    # initiate metrics to track (dvc)
    DATA_DRIFT_METRICS_PATH = os.path.join('metrics', f'data_drift_{stockcode}.json')

    if os.path.exists(DATA_DRIFT_METRICS_PATH):
        with open(DATA_DRIFT_METRICS_PATH, 'r') as f:
            metrics = json.load(f)
    else: metrics = dict()

    # load processed df from dvc cache
    PROCESSED_DF_PATH = os.path.join('data', 'processed_df.parquet')
    df = pd.read_parquet(PROCESSED_DF_PATH)

    # categorize num and cat columns
    num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
    if verbose: main_logger.info(f'num_cols: {num_cols} \ncat_cols: {cat_cols}')

    # structure cat cols
    if cat_cols:
        for col in cat_cols: df[col] = df[col].astype('string')

    # initiate preprocessor (either load from the dvc cache or create from scratch)
    PREPROCESSOR_PATH = os.path.join('preprocessors', 'column_transformer.pkl')
    try:
        preprocessor = joblib.load(PREPROCESSOR_PATH)
    except Exception:
        preprocessor = scripts.create_preprocessor(num_cols=num_cols if should_scale else [], cat_cols=cat_cols)

    # creates train, val, test datasets
    y = df[target_col]
    X = df.copy().drop(target_col, axis='columns')

    # split
    test_size, random_state = 50000, 42
    X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, shuffle=False)
    X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, random_state=random_state, shuffle=False)

    # store train, val, test datasets (dvc track)
    X_train.to_parquet('data/x_train_df.parquet', index=False)
    X_val.to_parquet('data/x_val_df.parquet', index=False)
    X_test.to_parquet('data/x_test_df.parquet', index=False)
    y_train.to_frame(name=target_col).to_parquet('data/y_train_df.parquet', index=False)
    y_val.to_frame(name=target_col).to_parquet('data/y_val_df.parquet', index=False)
    y_test.to_frame(name=target_col).to_parquet('data/y_test_df.parquet', index=False)

    # preprocess
    X_train = preprocessor.fit_transform(X_train)
    X_val = preprocessor.transform(X_val)
    X_test = preprocessor.transform(X_test)

    # store preprocessed input data (dvc track)
    pd.DataFrame(X_train).to_parquet(f'data/x_train_processed.parquet', index=False)
    pd.DataFrame(X_val).to_parquet(f'data/x_val_processed.parquet', index=False)
    pd.DataFrame(X_test).to_parquet(f'data/x_test_processed.parquet', index=False)

    # save feature names (dvc track) for shap
    with open('preprocessors/feature_names.json', 'w') as f:
        feature_names = preprocessor.get_feature_names_out()
        json.dump(feature_names.tolist(), f)

    return  X_train, X_val, X_test, y_train, y_val, y_test, preprocessor


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='run data preprocessing')
    parser.add_argument('--stockcode', type=str, default='', help='specific stockcode')
    parser.add_argument('--target_col', type=str, default='quantity', help='the target column name')
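    # type=bool would treat any non-empty string (including 'False') as True, so parse the text explicitly
    parser.add_argument('--should_scale', type=lambda s: s.lower() == 'true', default=True, help='flag to scale numerical features')
    parser.add_argument('--verbose', type=lambda s: s.lower() == 'true', default=False, help='flag for verbose logging')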
    args = parser.parse_args()

    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = preprocess(
        target_col=args.target_col,
        should_scale=args.should_scale,
        verbose=args.verbose,
        stockcode=args.stockcode,
    )
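
The split in the script relies on two facts about scikit-learn's train_test_split. An integer test_size is an absolute row count, and with shuffle=False the test rows are taken from the end of the data. Calling it twice therefore carves the last 50,000 rows into the test set and the 50,000 rows before them into the validation set, which keeps the split chronological. Here is a minimal plain-Python sketch of the same slicing. The helper name chronological_split is hypothetical.

```python
def chronological_split(rows, holdout_size):
    """Split ordered rows into train / val / test, taking both holdouts from the tail."""
    if holdout_size <= 0:
        raise ValueError("holdout_size must be positive")
    if len(rows) <= 2 * holdout_size:
        raise ValueError("not enough rows for train, val, and test")

    test = rows[-holdout_size:]                     # most recent rows
    val = rows[-2 * holdout_size:-holdout_size]     # rows just before the test set
    train = rows[:-2 * holdout_size]                # everything earlier
    return train, val, test


rows = list(range(10))  # stand-in for rows already sorted by invoice date
train, val, test = chronological_split(rows, holdout_size=2)
print(train, val, test)
```

Because no shuffling happens, the model is always validated and tested on rows that come after its training rows, which mirrors how it will be used on future data.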

Outputs

This stage generates the necessary datasets for both model training and inference:

Input features:

data/x_train_df.parquet
data/x_val_df.parquet
data/x_test_df.parquet

Preprocessed input features:

data/x_train_processed.parquet
data/x_val_processed.parquet
data/x_test_processed.parquet

Target variables:

data/y_train_df.parquet
data/y_val_df.parquet
data/y_test_df.parquet

The preprocessor and human-readable feature names are also stored in cache for inference and SHAP feature impact analysis later:

preprocessors/column_transformer.pkl
preprocessors/feature_names.json

Lastly, DVC adds preprocess_status, x_train_processed_path, and preprocessor_path to the data summary metrics file data.json created in Stage 2, which tracks the end-to-end process of Stages 2 and 3:

metrics/data.json:

{
    "drift_detected": false,
    "num_drifts": 0.0,
    "shared_drifts": 0.0,
    "num_cols": [
        "invoiceno",
        "invoicedate",
        "unitprice",
        "product_avg_quantity_last_month",
        "product_max_price_all_time",
        "unitprice_vs_max",
        "unitprice_to_avg",
        "unitprice_squared",
        "unitprice_log"
    ],
    "cat_cols": [
        "stockcode",
        "customerid",
        "country",
        "year",
        "year_month",
        "day_of_week",
        "is_registered"
    ],
    "timestamp": "2025-10-07T00:24:29.899495",

    # updates
    "preprocess_status": "completed",
    "x_train_processed_path": "data/x_train_processed_85123A.parquet",
    "preprocessor_path": "preprocessors/column_transformer.pkl"
}

Next, let’s move on to the model/experiment lineage.

Stage 4: Tuning the Model

Now that we’ve created the datasets, we’ll tune and train the primary model. It’s a multi-layered feedforward network built with PyTorch and trained on the training and validation datasets created in the preprocess stage.

DVC Configuration

First, we’ll add the tune_primary_model stage right after the preprocess stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    ### 
  tune_primary_model:
    cmd: >
      python src/model/torch_model/main.py
      data/x_train_processed_${params.stockcode}.parquet
      data/x_val_processed_${params.stockcode}.parquet
      data/y_train_df_${params.stockcode}.parquet
      data/y_val_df_${params.stockcode}.parquet
      ${tuning.should_local_save}
      ${tuning.grid}
      ${tuning.n_trials}
      ${tuning.num_epochs}
      ${params.stockcode}

    deps:
      - src/model/torch_model/main.py
      - src/data_handling/
      - src/model/
      - src/_utils/

    params:
      - params.stockcode
      - tuning.n_trials
      - tuning.grid
      - tuning.should_local_save

    outs:
      - models/production/dfn_best_${params.stockcode}.pth # dvc track

    metrics:
      - metrics/dfn_val_${params.stockcode}.json: # dvc track

Then we’ll add default values to the parameters:

params.yaml:

params:
  target_col: "quantity"
  should_scale: True
  verbose: False

tuning:
  n_trials: 100
  num_epochs: 3000
  should_local_save: False
  grid: False

Python Scripts

Next, we’ll add the Python scripts to tune the model using Bayesian optimization and then train the optimal model on the complete X_train and y_train datasets created in the preprocess stage.

src/model/torch_model/main.py:

import os
import sys
import json
import datetime
import pandas as pd
import torch
import torch.nn as nn

import src.model.torch_model.scripts as scripts
from src._utils import main_logger


def tune_and_train(
        X_train, X_val, y_train, y_val,
        stockcode: str = '',
        should_local_save: bool = True,
        grid: bool = False,
        n_trials: int = 50,
        num_epochs: int = 3000
    ) -> tuple[nn.Module, dict]:

    # perform bayesian optimization
    best_dfn, best_optimizer, best_batch_size, best_checkpoint = scripts.bayesian_optimization(
        X_train, X_val, y_train, y_val, n_trials=n_trials, num_epochs=num_epochs
    )

    # save the model artifact (dvc track)
    DFN_FILE_PATH = os.path.join('models', 'production', f'dfn_best_{stockcode}.pth' if stockcode else 'dfn_best.pth')
    os.makedirs(os.path.dirname(DFN_FILE_PATH), exist_ok=True)
    torch.save(best_checkpoint, DFN_FILE_PATH)

    return best_dfn, best_checkpoint



def track_metrics_by_stockcode(X_val, y_val, best_model, checkpoint: dict, stockcode: str):
    MODEL_VAL_METRICS_PATH = os.path.join('metrics', f'dfn_val_{stockcode}.json')
    os.makedirs(os.path.dirname(MODEL_VAL_METRICS_PATH), exist_ok=True)

    # validate the tuned model
    _, mse, exp_mae, rmsle = scripts.perform_inference(model=best_model, X=X_val, y=y_val)
    model_version = f"dfn_{stockcode}_{os.getpid()}"
    metrics = dict(
        stockcode=stockcode,
        mse_val=mse,
        mae_val=exp_mae,
        rmsle_val=rmsle,
        model_version=model_version,
        hparams=checkpoint['hparams'],
        optimizer=checkpoint['optimizer_name'],
        batch_size=checkpoint['batch_size'],
        lr=checkpoint['lr'],
        timestamp=datetime.datetime.now().isoformat()
    )
    # store the validation results (dvc track)
    with open(MODEL_VAL_METRICS_PATH, 'w') as f:
        json.dump(metrics, f, indent=4)
        main_logger.info(f'... validation metrics saved to {MODEL_VAL_METRICS_PATH} ...')


if __name__ == '__main__':
    # fetch command arg values
    X_TRAIN_PATH = sys.argv[1]
    X_VAL_PATH = sys.argv[2]
    Y_TRAIN_PATH = sys.argv[3]
    Y_VAL_PATH = sys.argv[4]
    SHOULD_LOCAL_SAVE = sys.argv[5] == 'True'
    GRID = sys.argv[6] == 'True'
    N_TRIALS = int(sys.argv[7])
    NUM_EPOCHS = int(sys.argv[8])
    STOCKCODE = str(sys.argv[9])

    # extract training and validation datasets from dvc cache
    X_train, X_val = pd.read_parquet(X_TRAIN_PATH), pd.read_parquet(X_VAL_PATH)
    y_train, y_val = pd.read_parquet(Y_TRAIN_PATH), pd.read_parquet(Y_VAL_PATH)

    # tuning
    best_model, checkpoint = tune_and_train(
        X_train, X_val, y_train, y_val,
        stockcode=STOCKCODE, should_local_save=SHOULD_LOCAL_SAVE, grid=GRID, n_trials=N_TRIALS, num_epochs=NUM_EPOCHS
    )

    # metrics tracking
    track_metrics_by_stockcode(X_val, y_val, best_model=best_model, checkpoint=checkpoint, stockcode=STOCKCODE)
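
The scripts.bayesian_optimization helper isn't shown in this article, so the following is only a sketch of the general shape of a tuning loop: sample a configuration, score it, and keep the best one. To stay dependency-free, it swaps in plain random search, which is not Bayesian optimization, since it does not use past trials to decide where to sample next. The search ranges, the function names, and the toy objective are all assumptions made for illustration.

```python
import random


def sample_config(rng: random.Random, max_layers: int = 4) -> dict:
    """Draw one hypothetical configuration with per-layer keys like those in hparams."""
    config = {"num_layers": rng.randint(1, max_layers), "batch_norm": rng.choice([True, False])}
    for i in range(config["num_layers"]):
        config[f"n_units_layer_{i}"] = rng.randint(8, 256)
        config[f"dropout_rate_layer_{i}"] = rng.uniform(0.0, 0.6)
    config["learning_rate"] = 10 ** rng.uniform(-4, -1)  # log-uniform sampling
    config["optimizer"] = rng.choice(["adam", "adamax", "sgd"])
    config["batch_size"] = rng.choice([32, 64, 128])
    return config


def random_search(objective, n_trials: int, seed: int = 42):
    """Keep the configuration with the lowest objective value across n_trials samples."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


def toy_objective(config: dict) -> float:
    # stand-in for "train the network, then return the validation loss"
    return abs(config["learning_rate"] - 0.01) + 0.001 * config["num_layers"]


best_config, best_loss = random_search(toy_objective, n_trials=50)
print(best_config["num_layers"], round(best_loss, 5))
```

In a real run, the objective would train the network for num_epochs and return a validation metric. A Bayesian optimizer would replace sample_config with a sampler informed by earlier trials.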

Outputs

The stage generates two files:

models/production/dfn_best.pth: Contains the model artifact and its checkpoint, including the optimal hyperparameter set.
metrics/dfn_val.json: Contains the tuning results, model version, timestamp, and validation results for MSE, MAE, and RMSLE:

metrics/dfn_val.json:

{
    "stockcode": "85123A",
    "mse_val": 0.6137686967849731,
    "mae_val": 9.092489242553711,
    "rmsle_val": 0.6953299045562744,
    "model_version": "dfn_85123A_35604",
    "hparams": {
        "num_layers": 4,
        "batch_norm": false,
        "dropout_rate_layer_0": 0.13765888061300502,
        "n_units_layer_0": 184,
        "dropout_rate_layer_1": 0.5509872409359128,
        "n_units_layer_1": 122,
        "dropout_rate_layer_2": 0.2408753527744403,
        "n_units_layer_2": 35,
        "dropout_rate_layer_3": 0.03451842588822594,
        "n_units_layer_3": 224,
        "learning_rate": 0.026240673135104406,
        "optimizer": "adamax",
        "batch_size": 64
    },
    "optimizer": "adamax",
    "batch_size": 64,
    "lr": 0.026240673135104406,
    "timestamp": "2025-10-07T00:31:08.700294"
}

Stage 5: Performing Inference

After the model tuning phase is complete, we’ll configure the test inference for a final evaluation.

The final evaluation uses the MSE, MAE, and RMSLE metrics, as well as SHAP for feature impact and interpretability analysis.
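
For reference, here is a minimal sketch of the three metrics using their standard textbook definitions. The project's own scripts.perform_inference helper isn't shown here and may compute them on a transformed scale, so treat these functions as illustrations rather than the project's implementation.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average size of the error in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmsle(y_true, y_pred):
    """Root mean squared log error: compares log(1 + y), so it measures relative error."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(y_true))


y_true = [1, 2]
y_pred = [2, 4]
print(mse(y_true, y_pred), mae(y_true, y_pred), rmsle(y_true, y_pred))
```

RMSLE requires non-negative targets and predictions, which suits a quantity target, and it treats predicting 20 for a true value of 10 about the same as predicting 200 for a true value of 100.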

SHAP (SHapley Additive exPlanations) is a framework for quantifying how much each feature contributes to a model’s prediction by using the concept of Shapley values from game theory.

The SHAP values are leveraged for future EDA and feature engineering.
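
To show what a Shapley value measures, here is a small sketch that computes exact Shapley values for a toy function by averaging each feature's marginal contribution over every possible ordering of the features. This brute-force enumeration is only feasible for a handful of features. shap.DeepExplainer, used later in this stage, approximates the values for a neural network instead. The function name exact_shapley, the toy model, and the all-zero baseline are assumptions for the example.

```python
from itertools import permutations
from math import factorial


def exact_shapley(predict, x, baseline):
    """Average marginal contribution of each feature over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)          # start with every feature "switched off"
        previous = predict(current)
        for i in order:
            current[i] = x[i]             # switch feature i on
            updated = predict(current)
            phi[i] += updated - previous  # credit the change to feature i
            previous = updated
    return [value / factorial(n) for value in phi]


def toy_model(v):
    # features 0 and 1 interact; feature 2 contributes on its own
    return v[0] * v[1] + v[2]


x, baseline = [2, 3, 1], [0, 0, 0]
phi = exact_shapley(toy_model, x, baseline)
print(phi)
```

The interaction term is shared equally between features 0 and 1, and the contributions add up to the difference between the prediction for x and the prediction for the baseline.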

DVC Configuration

First, we’ll add the inference_primary_model stage to the DVC configuration.

This stage includes a plots section, where DVC tracks and versions the visualization files generated from the SHAP values.

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    ### 
  tune_primary_model:
    ### 
  inference_primary_model:
    cmd: >
      python src/model/torch_model/inference.py
      data/x_test_processed_${params.stockcode}.parquet
      data/y_test_df_${params.stockcode}.parquet
      models/production/dfn_best_${params.stockcode}.pth
      ${params.stockcode}
      ${tracking.sensitive_feature_col}
      ${tracking.privileged_group}

    deps:
      - src/model/torch_model/inference.py
      - models/production/
      - src/

    params:
      - params.stockcode
      - tracking.sensitive_feature_col
      - tracking.privileged_group

    metrics:
      - metrics/dfn_inf_${params.stockcode}.json: # dvc track
          type: json

    plots:
      # shap summary / beeswarm plot for global interpretability
      - reports/dfn_shap_summary_${params.stockcode}.json:
          template: simple
          x: shap_value
          y: feature_name
          title: SHAP Beeswarm Plot

      # shap mean absolute vals - feature importance bar plot
      - reports/dfn_shap_mean_abs_${params.stockcode}.json:
          template: bar
          x: mean_abs_shap
          y: feature_name
          title: Mean Absolute SHAP Importance

    outs:
      - data/dfn_inference_results_${params.stockcode}.parquet
      - reports/dfn_raw_shap_values_${params.stockcode}.parquet # save raw shap vals for detailed analysis later

Python Scripts

Next, we’ll add scripts where the trained model performs inference:

src/model/torch_model/inference.py:

import os
import sys
import json
import datetime
import numpy as np
import pandas as pd
import torch
import shap

import src.model.torch_model.scripts as scripts
from src._utils import main_logger


if __name__ == '__main__':
    # load test dataset
    X_TEST_PATH = sys.argv[1]
    Y_TEST_PATH = sys.argv[2]
    X_test, y_test = pd.read_parquet(X_TEST_PATH), pd.read_parquet(Y_TEST_PATH)

    # create X_test w/ column names for shap analysis and sensitive feature tracking
    X_test_with_col_names = X_test.copy()
    FEATURE_NAMES_PATH = os.path.join('preprocessors', 'feature_names.json')
    try:
        with open(FEATURE_NAMES_PATH, 'r') as f: feature_names = json.load(f)
    except FileNotFoundError: feature_names = X_test.columns.tolist()
    if len(X_test_with_col_names.columns) == len(feature_names): X_test_with_col_names.columns = feature_names

    # reconstruct the optimal model tuned in the previous stage
    MODEL_PATH = sys.argv[3]
    checkpoint = torch.load(MODEL_PATH)
    model = scripts.load_model(checkpoint=checkpoint)

    # perform inference
    y_pred, mse, exp_mae, rmsle = scripts.perform_inference(model=model, X=X_test, y=y_test, batch_size=checkpoint['batch_size'])

    # create result df w/ y_pred, y_true, and sensitive features
    STOCKCODE = sys.argv[4]
    SENSITIVE_FEATURE = sys.argv[5]
    PRIVILEGED_GROUP = sys.argv[6]
    inference_df = pd.DataFrame(y_pred.cpu().numpy().flatten(), columns=['y_pred'])
    inference_df['y_true'] = y_test
    inference_df[SENSITIVE_FEATURE] = X_test_with_col_names[f'cat__{SENSITIVE_FEATURE}_{str(PRIVILEGED_GROUP)}'].astype(bool)
    inference_df.to_parquet(path=os.path.join('data', f'dfn_inference_results_{STOCKCODE}.parquet'))

    # record inference metrics
    MODEL_INF_METRICS_PATH = os.path.join('metrics', f'dfn_inf_{STOCKCODE}.json')
    os.makedirs(os.path.dirname(MODEL_INF_METRICS_PATH), exist_ok=True)
    model_version = f"dfn_{STOCKCODE}_{os.getpid()}"
    inf_metrics = dict(
        stockcode=STOCKCODE,
        mse_inf=mse,
        mae_inf=exp_mae,
        rmsle_inf=rmsle,
        model_version=model_version,
        hparams=checkpoint['hparams'],
        optimizer=checkpoint['optimizer_name'],
        batch_size=checkpoint['batch_size'],
        lr=checkpoint['lr'],
        timestamp=datetime.datetime.now().isoformat()
    )
    with open(MODEL_INF_METRICS_PATH, 'w') as f: # dvc track
        json.dump(inf_metrics, f, indent=4)
        main_logger.info(f'... inference metrics saved to {MODEL_INF_METRICS_PATH} ...')


    ## shap analysis
    # compute shap vals
    model.eval()

    # prepare background data
    # assumption: run on whichever device the loaded model's parameters live on
    device_type = next(model.parameters()).device
    X_test_tensor = torch.from_numpy(X_test.values.astype(np.float32)).to(device_type)

    # take a small random sample from x_test as background
    background = X_test_tensor[np.random.choice(X_test_tensor.shape[0], 100, replace=False)]

    # define deepexplainer
    explainer = shap.DeepExplainer(model, background)

    # compute shap vals
    shap_values = explainer.shap_values(X_test_tensor) # outputs = numpy array or tensor

    # convert shap array to pandas df
    if isinstance(shap_values, list): shap_values = shap_values[0]
    if isinstance(shap_values, torch.Tensor): shap_values = shap_values.cpu().numpy()
    shap_values = shap_values.squeeze(axis=-1) # type: ignore
    shap_df = pd.DataFrame(shap_values, columns=feature_names)

    # shap raw data (dvc track)
    RAW_SHAP_OUT_PATH = os.path.join('reports', f'dfn_raw_shap_values_{STOCKCODE}.parquet')
    os.makedirs(os.path.dirname(RAW_SHAP_OUT_PATH), exist_ok=True)
    shap_df.to_parquet(RAW_SHAP_OUT_PATH, index=False)
    main_logger.info(f'... shap values saved to {RAW_SHAP_OUT_PATH} ...')

    # bar plot of mean abs shap vals (dvc report)
    mean_abs_shap = shap_df.abs().mean().sort_values(ascending=False)
    # use the sorted index so each feature name stays paired with its own value
    shap_mean_abs_df = pd.DataFrame({'feature_name': mean_abs_shap.index, 'mean_abs_shap': mean_abs_shap.values})
    MEAN_ABS_SHAP_PATH = os.path.join('reports', f'dfn_shap_mean_abs_{STOCKCODE}.json')
    shap_mean_abs_df.to_json(MEAN_ABS_SHAP_PATH, orient='records', indent=4)

Outputs

This stage generates five output files:

data/dfn_inference_results_${params.stockcode}.parquet: Stores the predictions, the true target values, and any sensitive feature columns (such as gender, age, or income). We’ll use this file for the fairness test in the last stage.
metrics/dfn_inf.json: Stores evaluation metrics and tuning results:

{
    "stockcode": "85123A",
    "mse_inf": 0.6841545701026917,
    "mae_inf": 11.5866117477417,
    "rmsle_inf": 0.7423332333564758,
    "model_version": "dfn_85123A_35834",
    "hparams": {
        "num_layers": 4,
        "batch_norm": false,
        "dropout_rate_layer_0": 0.13765888061300502,
        "n_units_layer_0": 184,
        "dropout_rate_layer_1": 0.5509872409359128,
        "n_units_layer_1": 122,
        "dropout_rate_layer_2": 0.2408753527744403,
        "n_units_layer_2": 35,
        "dropout_rate_layer_3": 0.03451842588822594,
        "n_units_layer_3": 224,
        "learning_rate": 0.026240673135104406,
        "optimizer": "adamax",
        "batch_size": 64
    },
    "optimizer": "adamax",
    "batch_size": 64,
    "lr": 0.026240673135104406,
    "timestamp": "2025-10-07T00:31:12.946405"
}

reports/dfn_shap_mean_abs.json: Stores the mean absolute SHAP values:

[
    {
        "feature_name":"num__invoicedate",
        "mean_abs_shap":0.219255722
    },
    {
        "feature_name":"num__unitprice",
        "mean_abs_shap":0.1069829418
    },
    {
        "feature_name":"num__product_avg_quantity_last_month",
        "mean_abs_shap":0.1021453096
    },
    {
        "feature_name":"num__product_max_price_all_time",
        "mean_abs_shap":0.0855356899
    },
...
]

reports/dfn_shap_summary.json: Contains the data points necessary to draw the beeswarm/bar plots.
reports/dfn_raw_shap_values.parquet: Stores raw SHAP values.

Stage 6: Assessing Model Risk and Fairness

The last stage assesses the risk and fairness of the final inference results.

The Fairness Testing

Fairness testing in ML is the process of systematically evaluating a model’s predictions to ensure they are not unfairly biased toward specific groups defined by sensitive attributes like race and gender.

In this project, we’ll use the registration status column, is_registered, as the sensitive feature and check that the Mean Outcome Difference (MOD) is within the specified threshold of 0.1.

The MOD is calculated as the absolute difference between the mean prediction values of the privileged (registered) and unprivileged (unregistered) groups.
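
As a quick worked example of that definition, the sketch below computes the MOD for a handful of made-up predictions and compares its absolute value against the threshold. The function name mean_outcome_difference and the sample numbers are hypothetical.

```python
def mean_outcome_difference(predictions, groups, privileged_group=1):
    """Mean prediction of the unprivileged group minus that of the privileged group."""
    privileged = [p for p, g in zip(predictions, groups) if g == privileged_group]
    unprivileged = [p for p, g in zip(predictions, groups) if g != privileged_group]
    if not privileged or not unprivileged:
        raise ValueError("both groups need at least one prediction")
    return sum(unprivileged) / len(unprivileged) - sum(privileged) / len(privileged)


# made-up predicted quantities; 1 = registered (privileged), 0 = unregistered
predictions = [1.0, 2.0, 3.0, 4.0]
groups = [1, 1, 0, 0]

mod = mean_outcome_difference(predictions, groups)
is_mod_acceptable = abs(mod) <= 0.1
print(mod, is_mod_acceptable)
```

This sketch raises an error when either group has no rows instead of treating the missing group's mean as 0. That is a deliberate design choice here: a MOD computed from only one group is not a comparison between groups.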

DVC Configuration

First, we’ll add the assess_model_risk stage right after the inference_primary_model stage:

dvc.yaml:

stages:
  etl_pipeline:
    ###
  data_drift_check:
    ### 
  preprocess:
    ### 
  tune_primary_model:
    ### 
  inference_primary_model:
    ###
  assess_model_risk:
    cmd: >
      python src/model/torch_model/assess_risk_and_fairness.py
      data/dfn_inference_results_${params.stockcode}.parquet
      metrics/dfn_risk_fairness_${params.stockcode}.json
      ${tracking.sensitive_feature_col}
      ${params.stockcode}
      ${tracking.privileged_group}
      ${tracking.mod_threshold}

    deps:
      - src/model/torch_model/assess_risk_and_fairness.py
      - src/_utils/
      - data/dfn_inference_results_${params.stockcode}.parquet # ensure the result df as dependency

    params:
      - params.stockcode
      - tracking.sensitive_feature_col
      - tracking.privileged_group
      - tracking.mod_threshold

    metrics:
      - metrics/dfn_risk_fairness_${params.stockcode}.json:
          type: json

Then we’ll add default values to the parameters:

params.yaml:

params:
  target_col: "quantity"
  should_scale: True
  verbose: False

tuning:
  n_trials: 100
  num_epochs: 3000
  should_local_save: False
  grid: False

# adding default values to the tracking metrics
tracking:
  sensitive_feature_col: "is_registered"
  privileged_group: 1 # member
  mod_threshold: 0.1

Python Script

The corresponding Python script contains the calculate_fairness_metrics function which performs the risk and fairness assessment:

src/model/torch_model/assess_risk_and_fairness.py:

import os
import json
import datetime
import argparse
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_log_error

from src._utils import main_logger


def calculate_fairness_metrics(
        df: pd.DataFrame,
        sensitive_feature_col: str,
        label_col: str = 'y_true',
        prediction_col: str = 'y_pred',
        privileged_group: int = 1,
        mod_threshold: float = 0.1,
    ) -> dict:

    metrics = dict()
    unprivileged_group = 0 if privileged_group == 1 else 1

    ## 1. risk assessment - predictive performance metrics by group
    for group, name in zip([unprivileged_group, privileged_group], ['unprivileged', 'privileged']):
        subset = df[df[sensitive_feature_col] == group]
        if len(subset) == 0: continue

        y_true = subset[label_col].values
        y_pred = subset[prediction_col].values

        metrics[f'mse_{name}'] = float(mean_squared_error(y_true, y_pred)) # type: ignore
        metrics[f'mae_{name}'] = float(mean_absolute_error(y_true, y_pred)) # type: ignore
        metrics[f'rmsle_{name}'] = float(root_mean_squared_log_error(y_true, y_pred)) # type: ignore

        # mean prediction (outcome disparity component)
        metrics[f'mean_prediction_{name}'] = float(y_pred.mean()) # type: ignore

    ## 2. bias assessment - fairness metrics
    # mean absolute error difference
    mae_diff = metrics.get('mae_unprivileged', 0) - metrics.get('mae_privileged', 0)
    metrics['mae_diff'] = float(mae_diff)

    # mean outcome difference
    mod = metrics.get('mean_prediction_unprivileged', 0) - metrics.get('mean_prediction_privileged', 0)
    metrics['mean_outcome_difference'] = float(mod)
    metrics['is_mod_acceptable'] = 1 if abs(mod) <= mod_threshold else 0

    return metrics


def main():
    parser = argparse.ArgumentParser(description='assess bias and fairness metrics on model inference results.')
    parser.add_argument('inference_file_path', type=str, help='parquet file path to the inference results w/ y_true, y_pred, and sensitive feature cols.')
    parser.add_argument('metrics_output_path', type=str, help='json file path to save the metrics output.')
    parser.add_argument('sensitive_feature_col', type=str, help='column name of sensitive features')
    parser.add_argument('stockcode', type=str)
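    # nargs='?' makes these trailing positionals optional so that their defaults take effect
    parser.add_argument('privileged_group', type=int, nargs='?', default=1)
    parser.add_argument('mod_threshold', type=float, nargs='?', default=.1)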
    args = parser.parse_args()

    try:
        # load inf df
        df_inference = pd.read_parquet(args.inference_file_path)
        LABEL_COL = 'y_true'
        PREDICTION_COL = 'y_pred'
        SENSITIVE_COL = args.sensitive_feature_col

        # compute fairness metrics
        metrics = calculate_fairness_metrics(
            df=df_inference,
            sensitive_feature_col=SENSITIVE_COL,
            label_col=LABEL_COL,
            prediction_col=PREDICTION_COL,
            privileged_group=args.privileged_group,
            mod_threshold=args.mod_threshold,
        )

        # add items to metrics
        metrics['model_version'] = f'dfn_{args.stockcode}_{os.getpid()}'
        metrics['sensitive_feature'] = args.sensitive_feature_col
        metrics['privileged_group'] = args.privileged_group
        metrics['mod_threshold'] = args.mod_threshold
        metrics['stockcode'] = args.stockcode
        metrics['timestamp'] = datetime.datetime.now().isoformat()

        # write metrics (dvc track)
        with open(args.metrics_output_path, 'w') as f:
            json_metrics = { k: (v if pd.notna(v) else None) for k, v in metrics.items() }
            json.dump(json_metrics, f, indent=4)

    except Exception as e:
        main_logger.error(f'... an error occurred during risk and fairness assessment: {e} ...')
        raise SystemExit(1)

if __name__ == '__main__':
    main()

Outputs

The final stage generates a metrics file which contains test results and model version:

metrics/dfn_risk_fairness.json:

{
    "mse_unprivileged": 3.5370739412593575,
    "mae_unprivileged": 1.48263614013523,
    "rmsle_unprivileged": 0.6080000224747837,
    "mean_prediction_unprivileged": 1.8507767915725708,
    "mae_diff": 1.48263614013523,
    "mean_outcome_difference": 1.8507767915725708,
    "is_mod_acceptable": 1,
    "model_version": "dfn_85123A_35971",
    "sensitive_feature": "is_registered",
    "privileged_group": 1,
    "mod_threshold": 0.1,
    "timestamp": "2025-10-07T00:31:15.998590"
}

That’s all for the lineage configuration. Now, we’ll test it locally.

Test in Local

We’ll run the entire ML lineage with this command:

$dvc repro -f

The -f flag forces DVC to rerun all the stages, whether or not their dependencies have changed.

The command will automatically create the dvc.lock file at the root of the project directory:

schema: '2.0'
stages:
  etl_pipeline_full:
    cmd: python src/data_handling/etl_pipeline.py
    deps:
    - path: src/_utils/
      hash: md5
      md5: ae41392532188d290395495f6827ed00.dir
      size: 15870
      nfiles: 10
    - path: src/data_handling/
      hash: md5
      md5: a8a61a4b270581a7c387d51e416f4e86.dir
      size: 95715
...
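
The md5 entries in dvc.lock are content hashes. On later runs, DVC compares the recorded hash with the current one to decide whether a stage is stale. The sketch below illustrates that comparison for a single file. It is a simplified illustration under that assumption, not DVC's actual implementation (directory hashes ending in .dir are computed differently), and the function names are hypothetical:

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_md5):
    """Return True when the file content no longer matches the recorded hash."""
    return md5_of_file(path) != recorded_md5


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "etl_pipeline.py")
    with open(path, "w") as f:
        f.write("print('v1')\n")
    recorded = md5_of_file(path)           # what a lock file would store
    unchanged = has_changed(path, recorded)

    with open(path, "w") as f:
        f.write("print('v2')\n")           # edit the dependency
    changed = has_changed(path, recorded)

print(unchanged, changed)
```

This is why the lock file belongs in version control: it is the record that the hashes are compared against.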

The dvc.lock file must be committed to Git so that DVC can resolve the latest versions of the tracked files:

$git add dvc.lock .dvc dvc.yaml params.yaml
$git commit -m'updated dvc config'
$git push

Step 3: Deploying the DVC Project

Next, we’ll deploy the DVC project to ensure the AWS Lambda function can access the cached files in production.

We’ll start by configuring the DVC remote where the cached files are stored.

DVC supports various storage types, such as AWS S3 and Google Cloud Storage. We’ll use AWS S3 for this project, but your choice depends on the project ecosystem, your familiarity with the tool, and any resource constraints.

First, we’ll create a new S3 bucket in the selected AWS region:

$aws s3 mb s3:///  --region

Make sure the IAM role has the following permissions: s3:ListBucket, s3:GetObject, s3:PutObject, and s3:DeleteObject.
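
As a rough sketch, the four permissions above could be expressed as an IAM policy document like the one built below. The bucket name is a hypothetical placeholder. Note that s3:ListBucket applies to the bucket itself, while the object-level actions apply to the keys inside it:

```python
import json

BUCKET = "my-dvc-remote-bucket"  # hypothetical bucket name, replace with yours

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # listing is a bucket-level action
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            # reading, writing, and deleting are object-level actions
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```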

Then, add the URI of the S3 bucket to the DVC remote:

$dvc remote add -d  s3:///

Next, push the cache files to the DVC remote:

$dvc push

Now, all cache files are stored in the S3 bucket:

Figure C. Screenshot of the DVC remote in AWS S3 bucket

As shown in Figure A, this deployment step is necessary for the AWS Lambda function to access the DVC cache in production.

Step 4: Configuring Scheduled Run with Prefect

The next step is to configure the scheduled run of the entire lineage with Prefect.

Prefect is an open-source workflow orchestration tool for building, scheduling, and monitoring pipelines. It uses a concept called a work pool to effectively decouple the orchestration logic from the execution infrastructure.

The work pool then serves as a standardized base configuration: it runs each flow in a Docker container image, which guarantees a consistent execution environment for all flows.

Configuring the Docker Image Registry

The first step is to configure the Docker image registry for the Prefect work pool:

For local deployment: a container registry on Docker Hub.
For production deployment: AWS ECR.

For local deployment, we’ll first authenticate the Docker client:

$docker login

And grant a user permission to run Docker commands without sudo:

$sudo dscl . -append /Groups/docker GroupMembership $USER

For production deployment, we’ll create a new ECR repository:

$aws ecr create-repository --repository-name  --region

(Make sure the IAM role has access to this new ECR URI.)

Configure Prefect Tasks and Flows

Next, we’ll configure the Prefect task and flow in the project:

The Prefect task executes the dvc repro and dvc push commands.
The Prefect flow executes the Prefect task weekly.

src/prefect_flows.py:

import os
import sys
import subprocess
from datetime import timedelta, datetime
from dotenv import load_dotenv
from prefect import flow, task
from prefect.schedules import Schedule
from prefect_aws import AwsCredentials

# add project root to the python path - enabling prefect to find the script
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

# imported after the path update so that the `src` package resolves
from src._utils import main_logger

# define the prefect task
@task(retries=3, retry_delay_seconds=30)
def run_dvc_pipeline():
    # execute the dvc pipeline 
    result = subprocess.run(["dvc", "repro"], capture_output=True, text=True, check=True)

    # push the updated data
    subprocess.run(["dvc", "push"], check=True)


# define the prefect flow
@flow(name="Weekly Data Pipeline")
def weekly_data_flow():
    run_dvc_pipeline()

if __name__ == '__main__':
    # docker image registry (either docker hub or aws ecr)
    load_dotenv(override=True)
    ENV = os.getenv('ENV', 'production')
    DOCKER_HUB_REPO = os.getenv('DOCKER_HUB_REPO')
    ECR_FOR_PREFECT_PATH = os.getenv('S3_BUCKET_FOR_PREFECT_PATH')
    image_repo = f'{DOCKER_HUB_REPO}:ml-sales-pred-data-latest' if ENV == 'local' else f'{ECR_FOR_PREFECT_PATH}:latest'

    # define weekly schedule
    weekly_schedule = Schedule(
        interval=timedelta(weeks=1),
        anchor_date=datetime(2025, 9, 29, 9, 0, 0),
        active=True,
    )

    # aws credentials to access ecr
    AwsCredentials(
        aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
        aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
        region_name=os.getenv('AWS_REGION_NAME'),
    ).save('aws', overwrite=True)

    # deploy the prefect flow
    weekly_data_flow.deploy(
        name='weekly-data-flow',
        schedule=weekly_schedule, # schedule
        work_pool_name="wp-ml-sales-pred", # work pool where the docker image (flow) runs
        image=image_repo, # create a docker image at docker hub (local) or ecr (production)
        concurrency_limit=3,
        push=True # push the docker image to the image_repo
    )
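
To make the weekly schedule concrete, here is a small sketch of the arithmetic behind an interval schedule with an anchor date: run times fall on the anchor plus whole multiples of the interval. This only illustrates the idea and assumes naive datetimes. Prefect's own scheduler handles time zones and other details, and the function name next_run is hypothetical:

```python
import math
from datetime import datetime, timedelta


def next_run(anchor, interval, now):
    """Return the first scheduled time at or after `now`."""
    if now <= anchor:
        return anchor
    # number of whole intervals needed to reach or pass `now`
    steps = math.ceil((now - anchor) / interval)
    return anchor + steps * interval


anchor = datetime(2025, 9, 29, 9, 0, 0)   # same anchor as the script above
weekly = timedelta(weeks=1)

print(next_run(anchor, weekly, datetime(2025, 10, 7, 0, 31)))
```

With the anchor from the script, a check made on October 7 lands on the following Monday at 09:00, two full weeks after the anchor.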

Test in Local

Next, we’ll test the workflow locally with the Prefect server:

$uv run prefect server start

$export PREFECT_API_URL="http://127.0.0.1:4200/api"

Run the prefect_flows.py script:

$uv run src/prefect_flows.py

Upon successful execution, the Prefect dashboard shows that the workflow is scheduled to run:

Figure D. Screenshot of the Prefect dashboard

Step 5: Deploying the Application

The final step is to deploy the entire application as a containerized Lambda by configuring the Dockerfile and the Flask application scripts.

The specific process in this final deployment step depends on the infrastructure.

But the common point is that DVC eliminates the need to store the large Parquet or CSV files directly in the feature store or model store because it caches them as lightweight hashed files.

So, first, we’ll simplify the loading logic of the Flask application script by using the dvc.api module:

app.py:

### ... the rest components remain the same  ...

import dvc.api

DVC_REMOTE_NAME=


def configure_dvc_for_lambda():
    # set dvc directories to /tmp
    os.environ.update({
        'DVC_CACHE_DIR': '/tmp/dvc-cache',
        'DVC_DATA_DIR': '/tmp/dvc-data',
        'DVC_CONFIG_DIR': '/tmp/dvc-config',
        'DVC_GLOBAL_CONFIG_DIR': '/tmp/dvc-global-config',
        'DVC_SITE_CACHE_DIR': '/tmp/dvc-site-cache'
    })
    for dir_path in ['/tmp/dvc-cache', '/tmp/dvc-data', '/tmp/dvc-config']:
        os.makedirs(dir_path, exist_ok=True)


def load_x_test():
    global X_test
    if not os.environ.get('PYTEST_RUN', False):
        main_logger.info("... loading x_test ...")

        # config dvc directories
        configure_dvc_for_lambda()
        try:
            with dvc.api.open(X_TEST_PATH, remote=DVC_REMOTE_NAME, mode='rb') as fd:
                X_test = pd.read_parquet(fd)
                main_logger.info('✅ successfully loaded x_test via dvc api')
        except Exception as e:
            main_logger.error(f'❌ general loading error: {e}', exc_info=True)


def load_preprocessor():
    global preprocessor
    if not os.environ.get('PYTEST_RUN', False):
        main_logger.info("... loading preprocessor ...")
        configure_dvc_for_lambda()
        try:
            with dvc.api.open(PREPROCESSOR_PATH, remote=DVC_REMOTE_NAME, mode='rb') as fd:
                preprocessor = joblib.load(fd)
                main_logger.info('✅ successfully loaded preprocessor via dvc api')

        except Exception as e:
            main_logger.error(f'❌ general loading error: {e}', exc_info=True)

### ... the rest components remain the same  ...
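
AWS Lambda only allows writes under /tmp, which is why the script points every DVC directory there. One detail worth noting: the loop above creates three of the five configured directories. The sketch below derives the directory list from the mapping itself, so no path can be missed. The environment variable names are copied from the script above (whether DVC reads each of them is not verified here), and the function name and base parameter are hypothetical:

```python
import os
import tempfile


def configure_tmp_dirs(base="/tmp"):
    """Point the DVC-related variables at `base` and create every directory."""
    env = {
        "DVC_CACHE_DIR": os.path.join(base, "dvc-cache"),
        "DVC_DATA_DIR": os.path.join(base, "dvc-data"),
        "DVC_CONFIG_DIR": os.path.join(base, "dvc-config"),
        "DVC_GLOBAL_CONFIG_DIR": os.path.join(base, "dvc-global-config"),
        "DVC_SITE_CACHE_DIR": os.path.join(base, "dvc-site-cache"),
    }
    os.environ.update(env)
    for path in env.values():
        os.makedirs(path, exist_ok=True)  # safe to call on every invocation
    return env


# Demo against a throwaway directory instead of the real /tmp layout.
demo_base = tempfile.mkdtemp()
created = configure_tmp_dirs(demo_base)
print(sorted(os.listdir(demo_base)))
```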

Then, update the Dockerfile to enable Docker to correctly reference the DVC components:

Dockerfile.lambda.production:

# use an official python runtime
FROM public.ecr.aws/lambda/python:3.12

# set environment variables (adding dvc related env variables)
ENV JOBLIB_MULTIPROCESSING=0
ENV DVC_HOME="/tmp/.dvc"
ENV DVC_CACHE_DIR="/tmp/.dvc/cache"
ENV DVC_REMOTE_NAME="storage"
ENV DVC_GLOBAL_SITE_CACHE_DIR="/tmp/dvc_global"

# copy requirements file and install dependencies
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir dvc dvc-s3

# setup dvc
RUN dvc init --no-scm
RUN dvc config core.no_scm true

# copy the code to the lambda task root
COPY . ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]

Lastly, make sure the large files are excluded from the Docker container image:

.dockerignore:

### ... the rest components remain the same  ...

# dvc cache contains large files
.dvc/cache
.dvcignore

# add all folders that DVC will track
data/
preprocessors/
models/
reports/
metrics/

Test in Local

Finally, we’ll build and test the Docker image:

$docker build -t my-app -f Dockerfile.lambda.local .
$docker run -p 5002:5002 -e ENV=local my-app app.py

If the configuration is correct, the waitress server will run the Flask application.

After confirming the changes, push the code to Git:

$git add .
$git commit -m'updated dockerfiles and flask app scripts'
$git push

This push command triggers the CI/CD pipeline via GitHub Actions, which generates a Docker container image and pushes it to AWS ECR.

Once the pipeline has run successfully and the result has been verified, we can manually trigger the deployment workflow using GitHub Actions.

And that’s it!

You can learn more here: Integrating the infrastructure CI/CD pipeline to an ML application

All code is available in my GitHub repository.

The mock app is available here.

Conclusion

Building robust ML applications requires comprehensive ML lineage to ensure reliability and traceability.

In this article, you learned how to build an ML lineage by integrating open-source services like DVC and Prefect.

In practice, initial planning matters. Specifically, defining how metrics are tracked and at which stages leads directly to a cleaner, more maintainable code structure that is easier to extend in the future.

Moving forward, we can consider adding more stages to the lineage and integrating advanced logic for data drift detection or fairness tests.

This will further ensure continued model performance and data integrity in the production environment.

You can check out my Portfolio / GitHub.

All images, unless otherwise noted, are by the author.

How to Use Transformers for Real-Time Gesture Recognition

OMOTAYO OMOYEMI — Mon, 06 Oct 2025 13:39:30 +0000

Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are not static images. Rather, they unfold over time. To build more robust, real-time systems, we need models that can capture both spatial details and temporal context.

This is where Transformers come in. Originally built for language, they’ve become state-of-the-art in vision tasks thanks to models like the Vision Transformer (ViT) and video-focused variants such as TimeSformer.

In this tutorial, we’ll use a Transformer backbone to create a lightweight real-time gesture recognition tool, optimized for small datasets and deployable on a regular laptop webcam.

Why Transformers for Gestures?
What You’ll Learn
Prerequisites
Project Setup
Generate a Gesture Dataset
Option 1: Generate a Synthetic Dataset
Training Script: train.py
Export the Model to ONNX
Evaluate Accuracy + Latency
Option 2: Use Small Samples from Public Gesture Datasets
Accessibility Notes & Ethical Limits
Next Steps
Conclusion

Why Transformers for Gestures?

Transformers are powerful because they use self-attention to model relationships across a sequence. For gestures, this means the model doesn’t just see isolated frames, but also learns how movements evolve over time. A wave, for example, looks different from a raised hand only when viewed as a sequence.

Vision Transformers process images as patches, while video Transformers extend this to multiple frames with temporal attention. Even a simple approach, like applying ViT to each frame and pooling across time, can outperform traditional CNN-based methods for small datasets.
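
To make these two ideas concrete, here is a tiny dependency-free sketch. It counts the patches a ViT sees for a given image and patch size, and averages per-frame embeddings into one clip embedding. Plain Python lists stand in for tensors, and the function names are hypothetical:

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def mean_pool_over_time(frame_embeddings):
    """Average a list of per-frame embedding vectors into one clip vector."""
    n_frames = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / n_frames for d in range(dim)]


# A 224x224 frame with 16x16 patches, as in a patch16_224 ViT.
print(num_patches(224, 16))

# Two frames with 2-dimensional embeddings, pooled into one vector.
print(mean_pool_over_time([[1.0, 2.0], [3.0, 4.0]]))
```

Mean pooling discards frame order, so it is the simplest possible temporal model. Video Transformers with temporal attention keep that order at a higher compute cost.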

Combined with Hugging Face’s pre-trained models and ONNX Runtime for optimization, Transformers make it possible to train on a modest dataset and still achieve smooth real-time recognition.

What You’ll Learn

In this tutorial, you’ll build a gesture recognition system using Transformers. By the end, you’ll know how to:

Create (or record) a tiny gesture dataset
Train a Vision Transformer (ViT) with temporal pooling
Export the model to ONNX for faster inference
Build a real-time Gradio app that classifies gestures from your webcam
Evaluate your model’s accuracy and latency with simple scripts
Understand the accessibility potential and ethical limits of gesture recognition

Prerequisites

To follow along, you should have:

Basic Python knowledge (functions, scripts, virtual environments)
Familiarity with PyTorch (tensors, datasets, training loops) – helpful but not required
Python 3.8+ installed on your system
A webcam (for the live demo in Gradio)
Optionally: GPU access (training on CPU works, but is slower)

Project Setup

Create a new project folder and install the required libraries.

# Create a new project directory and navigate into it
mkdir transformer-gesture && cd transformer-gesture

# Set up a Python virtual environment
python -m venv .venv

# Activate the virtual environment
# Windows PowerShell
.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

The provided code snippet is a set of commands for setting up a new Python project with a virtual environment. Here's a breakdown of each part:

mkdir transformer-gesture && cd transformer-gesture: This command creates a new directory named "transformer-gesture" and then navigates into it.
python -m venv .venv: This command creates a new virtual environment in the current directory. The virtual environment is stored in a folder named ".venv".
Activating the virtual environment:
- For Windows PowerShell, you can use .venv\Scripts\Activate.ps1 to activate the virtual environment.
- For macOS/Linux, use source .venv/bin/activate to activate the virtual environment.

Activating a virtual environment ensures that the Python interpreter and any packages you install are isolated to this specific project, preventing conflicts with other projects or system-wide packages.

Create a requirements.txt file:

torch>=2.0
torchvision
torchaudio
timm
huggingface_hub

onnx
onnxruntime

gradio

numpy
opencv-python
pillow

matplotlib
seaborn
scikit-learn

The list provided is a set of package dependencies typically found in a requirements.txt file for a Python project. Here's a brief explanation of each package:

torch>=2.0: PyTorch is a popular open-source deep learning framework that provides a flexible and efficient platform for building and training neural networks. Version 2.0 and above includes improvements in performance and new features.
torchvision: This library is part of the PyTorch ecosystem and provides tools for computer vision tasks, including datasets, model architectures, and image transformations.
torchaudio: Also part of the PyTorch ecosystem, Torchaudio provides audio processing tools and datasets, making it easier to work with audio data in deep learning projects.
timm: The PyTorch Image Models (timm) library offers a collection of pre-trained models and utilities for computer vision tasks, facilitating quick experimentation and deployment.
huggingface_hub: This library allows easy access to models and datasets hosted on the Hugging Face Hub, a platform for sharing and collaborating on machine learning models and datasets.
onnx: The Open Neural Network Exchange (ONNX) format is used to represent machine learning models, enabling interoperability between different frameworks.
onnxruntime: This is a high-performance runtime for executing ONNX models, allowing for efficient deployment across various platforms.
gradio: Gradio is a library for creating user interfaces for machine learning models, making them accessible through a web interface for easy interaction and testing.
numpy: A fundamental package for numerical computing in Python, providing support for arrays and a wide range of mathematical functions.
opencv-python: OpenCV is a library for computer vision and image processing tasks, widely used for real-time applications.
pillow: A Python Imaging Library (PIL) fork, Pillow provides tools for opening, manipulating, and saving many different image file formats.
matplotlib: A plotting library for Python, Matplotlib is used for creating static, interactive, and animated visualizations in Python.
seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.
scikit-learn: A machine learning library in Python that provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction.

Install dependencies:

pip install -r requirements.txt

The command pip install -r requirements.txt is used to install all the Python packages listed in a file named requirements.txt. This file typically contains a list of package dependencies required for a Python project, each specified with a package name and optionally a version number.

By running this command, pip, which is the Python package installer, reads the file and installs each package listed, ensuring that the project has all the necessary dependencies to run properly. This is a common practice in Python projects to manage and share dependencies easily.

Generate a Gesture Dataset

To train our Transformer-based gesture recognizer, we need some data. Instead of downloading a huge dataset, we’ll start with a tiny synthetic dataset you can generate in seconds. This makes the tutorial lightweight and ensures that everyone can follow along without dealing with multi-gigabyte downloads.

Option 1: Generate a Synthetic Dataset

We’ll use a small Python script that creates short .mp4 clips of a moving (or still) coloured box. Each class represents a gesture:

swipe_left – box moves from right to left
swipe_right – box moves from left to right
stop – box stays still in the center

Save this script as generate_synthetic_gestures.py in your project root:

import os, cv2, numpy as np, random, argparse

def ensure_dir(p): os.makedirs(p, exist_ok=True)

def make_clip(mode, out_path, seconds=1.5, fps=16, size=224, box_size=60, seed=0, codec="mp4v"):
    rng = random.Random(seed)
    frames = int(seconds * fps)
    H = W = size

    # background + box color
    bg_val = rng.randint(160, 220)
    bg = np.full((H, W, 3), bg_val, dtype=np.uint8)
    color = (rng.randint(20, 80), rng.randint(20, 80), rng.randint(20, 80))

    # path of motion
    y = rng.randint(40, H - 40 - box_size)
    if mode == "swipe_left":
        x_start, x_end = W - 20 - box_size, 20
    elif mode == "swipe_right":
        x_start, x_end = 20, W - 20 - box_size
    elif mode == "stop":
        x_start = x_end = (W - box_size) // 2
    else:
        raise ValueError(f"Unknown mode: {mode}")

    fourcc = cv2.VideoWriter_fourcc(*codec)
    vw = cv2.VideoWriter(out_path, fourcc, fps, (W, H))
    if not vw.isOpened():
        raise RuntimeError(
            f"Could not open VideoWriter with codec '{codec}'. "
            "Try --codec XVID and use .avi extension, e.g. out.avi"
        )

    for t in range(frames):
        alpha = t / max(1, frames - 1)
        x = int((1 - alpha) * x_start + alpha * x_end)
        # small jitter to avoid being too synthetic
        jitter_x, jitter_y = rng.randint(-2, 2), rng.randint(-2, 2)
        frame = bg.copy()
        cv2.rectangle(frame, (x + jitter_x, y + jitter_y),
                      (x + jitter_x + box_size, y + jitter_y + box_size),
                      color, thickness=-1)
        # overlay text
        cv2.putText(frame, mode, (8, 24), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 0), 2, cv2.LINE_AA)
        cv2.putText(frame, mode, (8, 24), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 1, cv2.LINE_AA)
        vw.write(frame)

    vw.release()

def write_labels(labels, out_dir):
    with open(os.path.join(out_dir, "labels.txt"), "w", encoding="utf-8") as f:
        for c in labels:
            f.write(c + "\n")

def main():
    ap = argparse.ArgumentParser(description="Generate a tiny synthetic gesture dataset.")
    ap.add_argument("--out", default="data", help="Output directory (default: data)")
    ap.add_argument("--classes", nargs="+",
                    default=["swipe_left", "swipe_right", "stop"],
                    help="Class names (default: swipe_left swipe_right stop)")
    ap.add_argument("--clips", type=int, default=16, help="Clips per class (default: 16)")
    ap.add_argument("--seconds", type=float, default=1.5, help="Seconds per clip (default: 1.5)")
    ap.add_argument("--fps", type=int, default=16, help="Frames per second (default: 16)")
    ap.add_argument("--size", type=int, default=224, help="Frame size WxH (default: 224)")
    ap.add_argument("--box", type=int, default=60, help="Box size (default: 60)")
    ap.add_argument("--codec", default="mp4v", help="Codec fourcc (mp4v or XVID)")
    ap.add_argument("--ext", default=".mp4", help="File extension (.mp4 or .avi)")
    args = ap.parse_args()

    ensure_dir(args.out)
    write_labels(args.classes, ".")  # writes labels.txt to project root

    print(f"Generating synthetic dataset -> {args.out}")
    for cls in args.classes:
        cls_dir = os.path.join(args.out, cls)
        ensure_dir(cls_dir)
        mode = "stop" if cls == "stop" else ("swipe_left" if "left" in cls else ("swipe_right" if "right" in cls else "stop"))
        for i in range(args.clips):
            filename = os.path.join(cls_dir, f"{cls}_{i+1:03d}{args.ext}")
            make_clip(
                mode=mode,
                out_path=filename,
                seconds=args.seconds,
                fps=args.fps,
                size=args.size,
                box_size=args.box,
                seed=i + 1,
                codec=args.codec
            )
        print(f"  {cls}: {args.clips} clips")

    print("Done. You can now run: python train.py, python export_onnx.py, python app.py")

if __name__ == "__main__":
    main()

The script generates a synthetic gesture dataset by creating video clips of a moving or stationary coloured box, simulating gestures like "swipe left," "swipe right," and "stop," and saves them in a specified output directory.
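
The motion itself is plain linear interpolation between a start and an end position. The sketch below isolates that step from make_clip (without the random jitter) so you can see which x coordinates the box takes. The function name box_positions is hypothetical:

```python
def box_positions(x_start, x_end, seconds=1.5, fps=16):
    """Return the box's x coordinate for every frame of a clip."""
    frames = int(seconds * fps)
    positions = []
    for t in range(frames):
        alpha = t / max(1, frames - 1)  # goes from 0.0 to 1.0 across the clip
        positions.append(int((1 - alpha) * x_start + alpha * x_end))
    return positions


# swipe_left with the script's defaults: size=224, box_size=60
W, box_size = 224, 60
swipe_left = box_positions(W - 20 - box_size, 20)
print(len(swipe_left), swipe_left[0], swipe_left[-1])
```

With the default 1.5 seconds at 16 fps, each clip has 24 frames, and the box travels from x=144 to x=20 for swipe_left.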

Now run it inside your virtual environment:

python generate_synthetic_gestures.py --out data --clips 16 --seconds 1.5

The command above runs a Python script named generate_synthetic_gestures.py, which generates a synthetic gesture dataset with 16 clips per gesture, each lasting 1.5 seconds, and saves the output in a directory named "data".

This creates a dataset like:

data/
  swipe_left/*.mp4
  swipe_right/*.mp4
  stop/*.mp4
labels.txt

Each folder contains short clips of a moving (or still) box that simulate gestures. This is perfect for testing the pipeline.

Training Script: train.py

Now that we have our dataset, let’s fine-tune a Vision Transformer with temporal pooling. This model applies ViT frame-by-frame, averages embeddings across time, and trains a classification head on your gestures.

Here’s the full training script:

# train.py
import torch, torch.nn as nn, torch.optim as optim
from torch.utils.data import DataLoader
import timm
from dataset import GestureClips, read_labels

class ViTTemporal(nn.Module):
    """Frame-wise ViT encoder -> mean pool over time -> linear head."""
    def __init__(self, num_classes, vit_name="vit_tiny_patch16_224"):
        super().__init__()
        self.vit = timm.create_model(vit_name, pretrained=True, num_classes=0, global_pool="avg")
        feat_dim = self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):  # x: (B,T,C,H,W)
        B, T, C, H, W = x.shape
        x = x.view(B * T, C, H, W)
        feats = self.vit(x)                  # (B*T, D)
        feats = feats.view(B, T, -1).mean(dim=1)  # (B, D)
        return self.head(feats)

def train():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    labels, _ = read_labels("labels.txt")
    n_classes = len(labels)

    train_ds = GestureClips(train=True)
    val_ds   = GestureClips(train=False)
    print(f"Train clips: {len(train_ds)} | Val clips: {len(val_ds)}")

    # Windows/CPU friendly
    train_dl = DataLoader(train_ds, batch_size=2, shuffle=True,  num_workers=0, pin_memory=False)
    val_dl   = DataLoader(val_ds,   batch_size=2, shuffle=False, num_workers=0, pin_memory=False)

    model = ViTTemporal(num_classes=n_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

    best_acc = 0.0
    epochs = 5
    for epoch in range(1, epochs + 1):
        # ---- Train ----
        model.train()
        total, correct, loss_sum = 0, 0, 0.0
        for x, y in train_dl:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()

            loss_sum += loss.item() * x.size(0)
            correct += (logits.argmax(1) == y).sum().item()
            total += x.size(0)

        train_acc = correct / total if total else 0.0
        train_loss = loss_sum / total if total else 0.0

        # ---- Validate ----
        model.eval()
        vtotal, vcorrect = 0, 0
        with torch.no_grad():
            for x, y in val_dl:
                x, y = x.to(device), y.to(device)
                vcorrect += (model(x).argmax(1) == y).sum().item()
                vtotal += x.size(0)
        val_acc = vcorrect / vtotal if vtotal else 0.0

        print(f"Epoch {epoch:02d} | train_loss {train_loss:.4f} "
              f"| train_acc {train_acc:.3f} | val_acc {val_acc:.3f}")

        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), "vit_temporal_best.pt")

    print("Best val acc:", best_acc)

if __name__ == "__main__":
    train()

Running the command python train.py initiates the training process for your gesture recognition model. Here's a breakdown of what happens:

Load your dataset from data/: The script will access and load the gesture dataset stored in the "data" directory. This dataset is used to train the model.
Fine-tune a pre-trained Vision Transformer: The training script will take a Vision Transformer model that has been pre-trained on a larger dataset and fine-tune it using your specific gesture dataset. Fine-tuning helps the model adapt to the nuances of your data, improving its performance on the specific task of gesture recognition.
Save the best checkpoint as vit_temporal_best.pt: During training, the script will evaluate the model's performance on a validation set. The best-performing version of the model (based on some metric like accuracy) will be saved as a checkpoint file named "vit_temporal_best.pt". This file can later be used for inference or further training.
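
Two bookkeeping details in the training loop are easy to miss: the loss is weighted by batch size before averaging, and a checkpoint is saved only when validation accuracy strictly improves. The sketch below reproduces both with plain numbers instead of tensors. The function names and the sample values are hypothetical:

```python
def epoch_metrics(batches):
    """batches: list of (mean_batch_loss, n_correct, batch_size) tuples."""
    total, correct, loss_sum = 0, 0, 0.0
    for mean_loss, n_correct, batch_size in batches:
        loss_sum += mean_loss * batch_size  # undo the per-batch averaging
        correct += n_correct
        total += batch_size
    if total == 0:
        return 0.0, 0.0
    return loss_sum / total, correct / total


def best_epoch(val_accs):
    """Return (epoch, accuracy) of the checkpoint the loop would keep."""
    best_acc, best_idx = 0.0, None
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best_acc:  # strict comparison: ties keep the earlier epoch
            best_acc, best_idx = acc, epoch
    return best_idx, best_acc


loss, acc = epoch_metrics([(1.0, 1, 2), (0.5, 1, 1)])
print(round(loss, 4), round(acc, 4))
print(best_epoch([0.2, 0.2, 0.5, 0.5]))
```

Because the comparison starts from 0.0 and is strict, a run whose validation accuracy stays at exactly zero never writes a checkpoint file.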

What Training Looks Like

You should see logs similar to this:

Train clips: 38 | Val clips: 10
Epoch 01 | train_loss 1.4508 | train_acc 0.395 | val_acc 0.200
Epoch 02 | train_loss 1.2466 | train_acc 0.263 | val_acc 0.200
Epoch 03 | train_loss 1.1361 | train_acc 0.368 | val_acc 0.200
Best val acc: 0.200

Don’t worry if your accuracy is low at first. That’s normal with such a small synthetic dataset. The key is proving that the Transformer pipeline works. You can boost results later by:

Adding more clips per class
Training for more epochs
Switching to real recorded gestures

Figure 1. Example training logs from train.py, where the Vision Transformer with temporal pooling is fine-tuned on a tiny synthetic dataset.

Export the Model to ONNX

To make our model easier to run in real time (and lighter on CPU), we’ll export it to the ONNX format.

Note: ONNX, which stands for Open Neural Network Exchange, is an open-source format designed to facilitate the interchange of deep learning models between different frameworks. It lets you train a model in one framework, such as PyTorch or TensorFlow, and then deploy it in another, like Caffe2 or MXNet, without needing to completely rewrite the model. This interoperability is achieved by providing a standardized representation of the model's architecture and parameters.

ONNX supports a wide range of operators and is continually updated to include new features, making it a versatile choice for deploying machine learning models across various platforms and devices.

Create a file called export_onnx.py:

import torch
from train import ViTTemporal
from dataset import read_labels

labels, _ = read_labels("labels.txt")
n_classes = len(labels)

# Load trained model
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load("vit_temporal_best.pt", map_location="cpu"))
model.eval()

# Dummy input: batch=1, 16 frames, 3x224x224
dummy = torch.randn(1, 16, 3, 224, 224)

# Export
torch.onnx.export(
    model, dummy, "vit_temporal.onnx",
    input_names=["video"], output_names=["logits"],
    dynamic_axes={"video": {0: "batch"}},
    opset_version=13
)

print("Exported vit_temporal.onnx")

Run it with python export_onnx.py.

This generates a file named vit_temporal.onnx in your project folder. With the model in ONNX format, we can run it with onnxruntime, which is often faster than eager PyTorch for CPU inference.

Create a file called app.py:

import os, tempfile, cv2, torch, onnxruntime, numpy as np
import gradio as gr
from dataset import read_labels

T = 16
SIZE = 224
MODEL_PATH = "vit_temporal.onnx"

labels, _ = read_labels("labels.txt")

# --- ONNX session + auto-detect names ---
ort_session = onnxruntime.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
# detect first input and first output names to avoid mismatches
INPUT_NAME = ort_session.get_inputs()[0].name   # e.g. "input" or "video"
OUTPUT_NAME = ort_session.get_outputs()[0].name # e.g. "logits" or something else

def preprocess_clip(frames_rgb):
    if len(frames_rgb) == 0:
        frames_rgb = [np.zeros((SIZE, SIZE, 3), dtype=np.uint8)]
    if len(frames_rgb) < T:
        frames_rgb = frames_rgb + [frames_rgb[-1]] * (T - len(frames_rgb))
    frames_rgb = frames_rgb[:T]
    clip = [cv2.resize(f, (SIZE, SIZE), interpolation=cv2.INTER_AREA) for f in frames_rgb]
    clip = np.stack(clip, axis=0)                                    # (T,H,W,3)
    clip = np.transpose(clip, (0, 3, 1, 2)).astype(np.float32) / 255 # (T,3,H,W)
    clip = (clip - 0.5) / 0.5
    clip = np.expand_dims(clip, 0)                                   # (1,T,3,H,W)
    return clip

def _extract_path_from_gradio_video(inp):
    if isinstance(inp, str) and os.path.exists(inp):
        return inp
    if isinstance(inp, dict):
        for key in ("video", "name", "path", "filepath"):
            v = inp.get(key)
            if isinstance(v, str) and os.path.exists(v):
                return v
        for key in ("data", "video"):
            v = inp.get(key)
            if isinstance(v, (bytes, bytearray)):
                tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4")
                tmp.write(v); tmp.flush(); tmp.close()
                return tmp.name
    if isinstance(inp, (list, tuple)) and inp and isinstance(inp[0], str) and os.path.exists(inp[0]):
        return inp[0]
    return None

def _read_uniform_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) or 1
    idxs = np.linspace(0, total - 1, max(T, 1)).astype(int)
    want = set(int(i) for i in idxs.tolist())
    j = 0
    while True:
        ok, bgr = cap.read()
        if not ok: break
        if j in want:
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(rgb)
        j += 1
    cap.release()
    return frames

def predict_from_video(gradio_video):
    video_path = _extract_path_from_gradio_video(gradio_video)
    if not video_path or not os.path.exists(video_path):
        return {}
    frames = _read_uniform_frames(video_path)

    # If OpenCV choked on the codec (common with recorded webm), re-encode once:
    if len(frames) == 0:
        tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4"); tmp_name = tmp.name; tmp.close()
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) or 640
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) or 480
        out = cv2.VideoWriter(tmp_name, fourcc, 20.0, (w, h))
        while True:
            ok, frame = cap.read()
            if not ok: break
            out.write(frame)
        cap.release(); out.release()
        frames = _read_uniform_frames(tmp_name)

    clip = preprocess_clip(frames)
    # >>> use the detected ONNX input/output names <<<
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[0]  # (1, C)
    probs = torch.softmax(torch.from_numpy(logits), dim=1)[0].numpy().tolist()
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

def predict_from_image(image):
    if image is None:
        return {}
    clip = preprocess_clip([image] * T)
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[0]
    probs = torch.softmax(torch.from_numpy(logits), dim=1)[0].numpy().tolist()
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

with gr.Blocks() as demo:
    gr.Markdown("# Gesture Classifier (ONNX)\nRecord or upload a short video, then click **Classify Video**.")
    with gr.Tab("Video (record or upload)"):
        vid_in = gr.Video(label="Record from webcam or upload a short clip")
        vid_out = gr.Label(num_top_classes=3, label="Prediction")
        gr.Button("Classify Video").click(fn=predict_from_video, inputs=vid_in, outputs=vid_out)
    with gr.Tab("Single Image (fallback)"):
        img_in = gr.Image(label="Upload an image frame", type="numpy")
        img_out = gr.Label(num_top_classes=3, label="Prediction")
        gr.Button("Classify Image").click(fn=predict_from_image, inputs=img_in, outputs=img_out)

if __name__ == "__main__":
    demo.launch()

Running the command python app.py starts a local Gradio server and prints a URL that you can open in your web browser. Here's what happens:

Record or upload a clip: The Video tab lets you record a short clip from your webcam or upload an existing video file.
Classification runs on demand: When you click Classify Video, the app samples 16 evenly spaced frames from the clip and runs them through the ONNX model. Predictions are computed once per clip, not continuously on a live stream.
Top 3 gesture classes displayed with probabilities: The application displays the top three predicted gesture classes along with their probabilities, giving you an idea of the model's confidence in its predictions.
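
To make the probability step concrete, here is a minimal, dependency-free sketch of what happens between the raw logits and the top-3 list. The app itself uses torch.softmax and Gradio's Label component for this. The labels and logit values below are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, logits, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical labels and logits, for illustration only
example_labels = ["swipe_left", "swipe_right", "stop", "wave"]
example_logits = [0.2, -1.0, 2.5, 0.1]

for name, prob in top_k(example_labels, example_logits):
    print(f"{name}: {prob:.3f}")
```

Because softmax preserves the ordering of the logits, the top class is the same either way. The probabilities simply make the model's confidence easier to read.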

When you open the app in your browser, you'll find two tabs. In the Video tab, you can record a short clip of your gesture from the webcam, typically lasting 2–4 seconds. After recording, click Classify Video. The app then runs the sampled frames through the Transformer model and displays the predicted gesture probabilities. The Single Image tab is a fallback that classifies one uploaded frame by repeating it 16 times. This setup allows for interactive testing and demonstration of the gesture recognition system.

Here’s an example where I raised my hand for a stop gesture, and the model predicts “stop” as the top class:

Figure 2. The Gradio app running locally. After recording a short clip, the Transformer model predicts the gesture with class probabilities.

Evaluate Accuracy + Latency

Now that the model runs in a demo app, let’s check how well it performs. There are two sides to this:

Accuracy: does the model predict the right gesture class?
Latency: how fast does it respond, especially on CPU vs GPU?

1. Quick Accuracy Check

Save this as eval.py in the same folder as your other scripts:

import torch
from dataset import GestureClips, read_labels
from train import ViTTemporal

labels, _ = read_labels("labels.txt")
n_classes = len(labels)

# Load validation data
val_ds = GestureClips(train=False)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=2, shuffle=False)

# Load trained model
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load("vit_temporal_best.pt", map_location="cpu"))
model.eval()

correct, total = 0, 0
all_preds, all_labels = [], []

with torch.no_grad():
    for x, y in val_dl:
        logits = model(x)
        preds = logits.argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.size(0)
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

print(f"Validation accuracy: {correct/total:.2%}")

2. Confusion Matrix

Let’s also visualize which gestures are confused. Add this snippet at the bottom of eval.py:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(all_labels, all_preds)

plt.figure(figsize=(6,6))
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.show()

When you run python eval.py, a heatmap like this will pop up:

Figure 3. Confusion matrix on the validation set. Correct predictions appear along the diagonal. Off-diagonal counts show gesture confusions.
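
If you want numbers to go with the heatmap, the sketch below shows how the matrix is built and how to read per-class recall off its rows. It uses plain Python rather than scikit-learn, and the toy labels are invented for illustration. As in the heatmap, rows are true classes and columns are predicted classes.

```python
def confusion_counts(true_ids, pred_ids, n_classes):
    """Build an n_classes x n_classes table: rows = true class, columns = predicted class."""
    table = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_ids, pred_ids):
        table[t][p] += 1
    return table

def per_class_recall(table):
    """Fraction of each true class that was predicted correctly (diagonal / row sum)."""
    recalls = []
    for i, row in enumerate(table):
        row_total = sum(row)
        recalls.append(row[i] / row_total if row_total else 0.0)
    return recalls

# Toy example with 3 classes (made-up predictions)
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_cm = confusion_counts(toy_true, toy_pred, n_classes=3)
toy_recall = per_class_recall(toy_cm)
print(toy_cm)
print(toy_recall)
```

A class with low recall and a large off-diagonal count in its row tells you which gesture it is being mistaken for, which is often more actionable than overall accuracy.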

3. Latency Benchmark

Finally, let’s see how fast inference runs. Save the following as benchmark.py:

import time, numpy as np, onnxruntime
from dataset import read_labels

labels, _ = read_labels("labels.txt")

ort = onnxruntime.InferenceSession("vit_temporal.onnx", providers=["CPUExecutionProvider"])
INPUT_NAME = ort.get_inputs()[0].name
OUTPUT_NAME = ort.get_outputs()[0].name

dummy = np.random.randn(1, 16, 3, 224, 224).astype(np.float32)

# Warmup
for _ in range(3):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})

# Benchmark (perf_counter is a monotonic, high-resolution timer)
t0 = time.perf_counter()
for _ in range(50):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
t1 = time.perf_counter()

print(f"Average latency: {(t1 - t0)/50:.3f} seconds per clip")

Run: python benchmark.py

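On CPU, you might see ~0.05–0.15s per clip, depending on your hardware. The script above uses the CPUExecutionProvider. With a GPU build of onnxruntime and the CUDAExecutionProvider, inference is typically much faster.

Note: If latency is high, you can quantize the model with ONNX Runtime's quantization tools (for example, to INT8) to shrink it and speed up inference, usually at a small cost in accuracy.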
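
An average can hide occasional slow runs, so it can help to look at percentiles too. The sketch below is one way to do that with only the standard library. The fake_inference function is a stand-in I made up so the example runs anywhere. In practice you would pass a function that calls your ONNX session's run method instead.

```python
import statistics
import time

def measure_latency(fn, warmup=3, runs=50):
    """Time repeated calls to fn and summarize them (all values in seconds)."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    # Simple nearest-rank style approximation of the 95th percentile
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean": statistics.mean(samples),
        "p50": statistics.median(samples),
        "p95": samples[p95_index],
    }

def fake_inference():
    """Placeholder workload standing in for a model call."""
    sum(i * i for i in range(10_000))

stats = measure_latency(fake_inference, warmup=2, runs=20)
print({name: f"{value * 1000:.2f} ms" for name, value in stats.items()})
```

If p95 is much larger than p50, users will notice occasional stalls even when the average looks fine.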

Option 2: Use Small Samples from Public Gesture Datasets

If you’d prefer to see your model trained on real gesture clips instead of synthetic moving boxes, you can grab a handful of videos from open datasets. You don’t need to download the entire dataset (which can be several GB). Just a few .mp4 samples are enough to follow along.

Recommended sources

20BN Jester Dataset: Contains short clips of hand gestures like swiping, clapping, and pointing.
WLASL: A large-scale dataset of isolated sign language words.

Both projects provide small .mp4 videos you can use as realistic training examples. I’ve linked them below.

Setting up your dataset folder

Once you download a few clips, place them in the data/ folder under subfolders named after each gesture class. For example:

data/
├── swipe_left/
│   ├── clip1.mp4
│   └── clip2.mp4
├── swipe_right/
│   ├── clip1.mp4
│   └── clip2.mp4
└── stop/
    ├── clip1.mp4
    └── clip2.mp4

And update labels.txt to match the folder names:

swipe_left
swipe_right
stop

Now your dataset is ready, and the same training scripts from earlier (train.py, eval.py) will work without modification.
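
Before training, it's worth checking that labels.txt and the folder names really agree, since a typo in either place is easy to miss. The helper below is a small sketch of such a check, not part of the original scripts. The function name check_labels is my own, and the demo builds a throwaway folder so the example is self-contained.

```python
import tempfile
from pathlib import Path

def check_labels(data_dir, labels_path):
    """Compare class names in labels.txt with the subfolders of data_dir."""
    listed = [ln.strip() for ln in Path(labels_path).read_text().splitlines() if ln.strip()]
    folders = {p.name for p in Path(data_dir).iterdir() if p.is_dir()}
    return {
        # labels with no matching folder
        "missing_folders": sorted(set(listed) - folders),
        # folders that labels.txt does not mention
        "unlisted_folders": sorted(folders - set(listed)),
        # folders that contain no .mp4 clips
        "empty_classes": sorted(
            name for name in folders
            if not any(f.suffix.lower() == ".mp4" for f in (Path(data_dir) / name).iterdir())
        ),
    }

# Demo on a temporary folder: one class has a clip, one is empty, one is missing
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    for name in ("swipe_left", "stop"):
        (root / name).mkdir(parents=True)
    (root / "swipe_left" / "clip1.mp4").touch()
    labels_file = Path(tmp) / "labels.txt"
    labels_file.write_text("swipe_left\nswipe_right\nstop\n")
    report = check_labels(root, labels_file)

print(report)
```

On your own project you would call check_labels("data", "labels.txt") and expect all three lists to be empty.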

Why choose this option?

Gives more realistic results than synthetic coloured boxes
Lets you see how the model handles actual human hand movements
The trade-off is a bit more effort (downloading clips and trimming them if needed)

Tip: If downloading from these datasets feels too heavy, you can also record your own short gestures using your laptop webcam. Just save them as .mp4 files and organize them in the same folder structure.

Accessibility Notes & Ethical Limits

While this project shows the technical workflow for gesture recognition with Transformers, it’s important to step back and consider the human context:

Accessibility first: Tools like this can help students with speech or motor difficulties, but they should always be co-designed with the people who will use them. Don’t assume one-size-fits-all.
Dataset sensitivity: Using publicly available sign or gesture datasets is fine for prototyping, but deploying such a system requires careful consideration of consent and representation.
Error tolerance: Even small misclassifications can have big consequences in accessibility contexts (for example, confusing stop with go). Always plan for fallback options (like manual input or confirmation).
Bias and inclusivity: Models trained on narrow datasets may fail for different skin tones, lighting conditions, or cultural gesture variations. Broad and diverse training data is essential for fairness.

In other words: this demo is a teaching scaffold, not a production-ready accessibility tool. Responsible deployment requires collaboration with educators, therapists, and end users.

Next Steps

If you’d like to push this project further, here are some directions to explore:

Better models: Try video-focused Transformers like TimeSformer or VideoMAE for stronger temporal reasoning.
Larger vocabularies: Add more gesture classes, build your own dataset, or use portions of public datasets like 20BN Jester or WLASL.
Pose fusion: Combine gesture video with human pose keypoints from MediaPipe or OpenPose for more robust predictions.
Real-time smoothing: Implement temporal smoothing or debounce logic in the app so predictions are more stable during live use.
Quantization + edge devices: Convert your ONNX model to an INT8 quantized version and deploy it on a Raspberry Pi or Jetson Nano for classroom-ready prototypes.
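
As a starting point for the real-time smoothing idea, here is a minimal sketch of a majority-vote smoother. It is not part of the app above. The class name, the window size, and the 60% threshold are all assumptions chosen for illustration.

```python
from collections import Counter, deque

class PredictionSmoother:
    """Report a label only once it dominates the last `window` predictions."""

    def __init__(self, window=5, min_share=0.6):
        self.history = deque(maxlen=window)  # oldest predictions fall off automatically
        self.min_share = min_share
        self.stable = None  # last label that passed the threshold

    def update(self, label):
        self.history.append(label)
        winner, count = Counter(self.history).most_common(1)[0]
        if count / len(self.history) >= self.min_share:
            self.stable = winner
        return self.stable

# Simulated per-clip predictions with a one-off glitch ("wave" at index 2)
smoother = PredictionSmoother(window=5, min_share=0.6)
stream = ["stop", "stop", "wave", "stop", "stop", "wave", "wave", "wave", "wave"]
smoothed = [smoother.update(label) for label in stream]
print(smoothed)
```

The single stray "wave" is ignored, while a sustained change eventually switches the output. A larger window gives steadier output at the cost of slower reaction.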

Conclusion

In this tutorial, you learned how to create a gesture recognition system using Transformer models. You prepared a small dataset, trained a Vision Transformer with temporal pooling, exported the model to ONNX for efficient inference, and deployed an interactive Gradio app that classifies recorded or uploaded clips. You also evaluated the system's accuracy and latency to see how well and how quickly it responds.

This project illustrates how you can leverage advanced ML methods to enhance accessibility and communication, paving the way for more inclusive learning environments.

Remember: while this demo works with small datasets, real-world applications need larger, more diverse data and careful consideration of accessibility, inclusivity, and ethics.

Here’s the GitHub repo for full source code: transformer-gesture.

Machine Learning vs Deep Learning vs Generative AI - What are the Differences?

Nitheesh Poojary — Thu, 02 Oct 2025 15:22:13 +0000

When I started using LLMs for work and personal use, I picked up on some technical terms, such as "machine learning" and "deep learning," which are the main technologies behind these LLMs. I've always been interested in learning about the differences between these technologies. Most companies in the industry are now developing their own AI tools, which makes MLOps necessary for managing and utilizing them.

Before I began learning about MLOps, I tried to understand the technologies behind LLMs and how they work. In this article, I’ll share my understanding of machine learning, deep learning, and generative AI, along with their potential applications.

Artificial Intelligence (AI)
Machine Learning (ML): The Foundation
Deep Learning: Adding Complexity
Generative AI: Write New
Summary of Differences Between Machine Learning vs Deep Learning vs Generative AI
Conclusion

Artificial Intelligence (AI)

Artificial Intelligence (AI) is a form of technology that lets machines solve problems in ways that resemble how people do it. It helps businesses make better decisions on a large scale by helping them recognize images, create content, and make predictions based on data. Artificial intelligence includes machine learning, deep learning, and generative AI.

Machine Learning (ML): The Foundation

When we give computers many examples, they learn how to make their own decisions or guesses. It's like teaching a kid to tell the difference between animals. You show them a lot of pictures of cats and dogs and say things like "This is a cat" and "This is a dog." In the end, they learn to tell the difference between cats and dogs on their own. Machine learning is similar in that you give a computer a lot of data with examples, and it learns how to make predictions about new data.

How Does Machine Learning Work?

Machine Learning (ML) is the process of teaching computers to find patterns in data and make decisions or predictions without being explicitly programmed with rules. There are usually six main steps in this process:

Data Collection: Gather many examples, like thousands of emails, photos, or sales records. More high-quality training data generally leads to more accurate predictions.

Data Preparation: At this stage, you clean the data by fixing or removing errors and filling in missing labels.

Selecting Algorithm (Models): It's like choosing the right tools for the job. Models can find patterns in data or make predictions. You can find machine learning models for your data here.

Training Phase: After you pick the right model for your cleaned-up data, you teach it. This is like getting ready for a test.

Evaluation: Use the test data to assess the model's performance and see if it can make accurate predictions on unseen data.

Deployment: Put the trained model to work in the real world.

Here's how these steps play out when predicting house prices:

Training Phase: Show the computer 10,000 house sales, each with details like size (2,000 sq ft), number of bedrooms (3), location (downtown), and sale price ($300,000).

Learning: The algorithm finds patterns, such as the fact that bigger houses cost more, houses in the city center cost more, and more bedrooms make a house worth more.

Prediction: Given a new house with 1,800 square feet, two bedrooms, and a location in the suburbs, the model estimates a price based on what it has learned.
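
To see the train, learn, and predict cycle in code, here is a minimal sketch that fits a straight line to house sizes and prices. It uses only one feature (size) and four made-up sales, so it is far simpler than a real model that would also use bedrooms and location.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training data: made-up sales (size in sq ft, price in dollars)
sizes = [1000, 1500, 2000, 2500]
prices = [160_000, 230_000, 300_000, 370_000]

# Learning: find the line that best fits the examples
slope, intercept = fit_line(sizes, prices)

# Prediction: estimate the price of a house the model has never seen
predicted_price = slope * 1800 + intercept
print(f"price per extra sq ft: {slope:.0f}")
print(f"estimated price for 1,800 sq ft: {predicted_price:,.0f}")
```

The "pattern" the model learned is just two numbers: how much each extra square foot adds, and a base price.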

Types of Machine Learning

Supervised Learning: You give the algorithm labeled training data, where each example pairs an input with the expected output, and it learns the patterns that connect them. For instance, millions of X-ray reports would first need to be tagged as healthy or sick. Then, machine learning programs could use this training data to guess if a new X-ray shows signs of illness.
Unsupervised Learning: Algorithms that use unsupervised learning learn from data that doesn't have labels. The algorithm must find patterns in untagged data without outside help. For instance, finding groups of people on Facebook or Twitter who have similar interests.
Reinforcement Learning: This technique is a kind of machine learning in which an agent learns how to make choices by interacting with the world around it. The agent receives points for doing things right and loses points for doing things wrong. Its goal is to get as many points as possible. For instance, cars learn how to drive safely by making mistakes in simulations. They get rewards for staying in their lane, following traffic rules, and not hitting other cars.
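
Reinforcement learning is the least intuitive of the three, so here is a toy sketch. An agent lives in a corridor of five cells and earns a reward for reaching the last one. It uses the Q-learning update rule, but as a simplification it sweeps over every state and action instead of exploring randomly the way a real agent would. The corridor, rewards, and parameters are all made up for illustration.

```python
N_STATES = 5
GOAL = N_STATES - 1          # reaching the last cell ends the episode
ACTIONS = (-1, +1)           # index 0 = move left, index 1 = move right

def step(state, action):
    """Move within the corridor; +1 point at the goal, a small penalty otherwise."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.01
    return next_state, reward

def train(sweeps=200, alpha=0.5, gamma=0.9):
    """Apply the Q-learning update to every (state, action) pair, many times."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for state in range(GOAL):
            for a_idx, action in enumerate(ACTIONS):
                next_state, reward = step(state, action)
                future = 0.0 if next_state == GOAL else max(q[next_state])
                q[state][a_idx] += alpha * (reward + gamma * future - q[state][a_idx])
    return q

q_table = train()
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:GOAL]]
print(policy)
```

Nobody tells the agent that "right" is correct. It works that out from the points it collects.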

Machine Learning—Real-World Examples

Email Spam Detection

You can show the computer thousands of emails that say "spam" or "not spam." It learns patterns, like how emails with "FREE MONEY" are usually spam. It can now automatically sort your inbox.

Photo Recognition

Give the computer millions of pictures with labels that say what's in them. It learns that apples are likely to be round and have stems. Your phone can now tell what things are in your pictures.

Movie Recommendations

Netflix keeps track of the movies you've seen and rated. It finds people who like the same things you do, then suggests movies that those people enjoyed.

Deep Learning: Adding Complexity

Deep learning is a subset of machine learning. It helps computers process data in ways loosely inspired by how humans do. Deep learning can identify complex patterns in images, text, sound, and other data to make accurate predictions. It uses artificial neural networks, which are loosely modeled on the human brain. Neural networks are made of connected nodes that process information.

How Does Deep Learning Work?

Artificial neural networks are used in deep learning to learn from data. These networks consist of interconnected layers of nodes. Each node learns a different thing about the data.

For instance, when you show a computer a picture of a cat, the picture goes through a lot of steps. The first layer looks for shapes and edges. The second layer puts these shapes together to make ears, eyes, and whiskers. The last layers say things like "This picture looks like a cat." Deep learning can make a lot of mistakes when learning, but it gets better and better after each piece of feedback.
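
The "gets better after each piece of feedback" idea can be shown with a single artificial neuron, the building block that deep networks stack into layers. This sketch trains one neuron to compute a logical OR. The learning rate and number of passes are arbitrary choices for illustration, and a real deep learning model would have many layers of such units.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, epochs=2000, lr=0.5):
    """Nudge the weights a little after every example, based on the error."""
    w1 = w2 = bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            prediction = sigmoid(w1 * x1 + w2 * x2 + bias)
            error = prediction - target  # feedback: how wrong was the guess?
            w1 -= lr * error * x1
            w2 -= lr * error * x2
            bias -= lr * error
    return w1, w2, bias

# Logical OR: output is 1 if either input is 1
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w1, w2, bias = train_neuron(OR_DATA)
outputs = [round(sigmoid(w1 * x1 + w2 * x2 + bias)) for (x1, x2), _ in OR_DATA]
print(outputs)
```

The neuron starts out guessing 0.5 for everything and ends up reproducing the OR table, purely from repeated small corrections.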

Deep Learning—Real-World Examples

Tesla Autopilot: Processes eight cameras simultaneously to navigate roads, recognize traffic signs, and avoid obstacles.
Google's DeepMind: Detects over fifty eye diseases from retinal scans with 94% accuracy.
ChatGPT: Helps with writing, coding, and problem-solving.

Generative AI: Write New

Generative AI is a subset of deep learning that makes new things, like stories, pictures, music, or code, instead of just looking at or sorting through things that are already there. Generative AI systems learn patterns from a lot of training data and then use those patterns to make new content.

Real-World Examples

Chatbots that help institutions give better customer service by making product suggestions and answering questions.
Tools that automatically generate technical documents from source code.
Tools that auto-generate quizzes, practice problems, and explanations.

Summary of Differences Between Machine Learning vs Deep Learning vs Generative AI

Feature	Machine Learning (ML)	Deep Learning (DL)	Generative AI (GenAI)
Definition	Subset of AI where machines learn from data to make predictions or decisions.	Subset of AI using artificial neural networks with multiple layers to model complex patterns	Subset of Deep learning that can create new content (text, images, code, etc.) similar to human-created content
Data Requirements	Small-to-medium datasets.	Large amounts of data (structured and unstructured)	Massive datasets for training, varying amounts for generation
Computational Power	Works on CPUs, moderate hardware.	Needs GPUs/TPUs for training.	Requires large-scale GPU/TPU clusters.
Use Cases	Predictions and classification.	Recognize complex data like speech, images, and language.	Generate new, original content.
When NOT to Use	Data is very complex or unstructured; accuracy is critical (medical, legal); you need to handle images, audio, or video.	The dataset is small (<1000 samples), and computational resources are limited.	Copyright/IP restrictions apply.
Cost Comparison	Low ($1K-$10K, standard servers)	Medium ($10K-$100K)	High ($100K-$1M+)
Real-World Examples	Netflix recommendations, fraud detection, spam filters.	Face recognition, self-driving cars, Siri/Alexa.	Original creative outputs (text, images, code, video).

Conclusion

To sum it up, anyone who is keen to learn more about artificial intelligence needs to know the differences between machine learning, deep learning, and generative AI.

Machine learning is the basis for this because it lets computers learn from data and make predictions. Deep learning takes this a step further by using neural networks to process complicated data patterns in a way that is similar to how humans understand things.

Generative AI goes a step further by making new things, which shows how creative AI can be. As these technologies get better, they open up a lot of new opportunities in many fields, such as improving customer service, making medical diagnoses more accurate, and making new content. To maximize AI's benefits in your life, stay current on new developments.

How to Build a Machine Learning System on Serverless Architecture

Kuriko — Tue, 26 Aug 2025 16:23:28 +0000

Let’s say you’ve built a fantastic machine learning model that performs beautifully in notebooks.

But a model isn’t truly valuable until it’s in production, serving real users and solving real problems.

In this article, you’ll learn how to ship a production-ready ML application built on serverless architecture.

Prerequisites
What We’re Building
The System Architecture
- Core AWS Resources in the Architecture
The Deployment Workflow in Action
Building a Client Application (Optional)
- The React Application
Final Results
Conclusion

Prerequisites

This project requires some basic experience with:

Machine Learning / Deep Learning: The full lifecycle, including data handling, model training, tuning, and validation.
Coding: Proficiency in Python, with experience using major ML libraries such as PyTorch and Scikit-Learn.
Full-stack deployment: Experience deploying applications using RESTful APIs.

What We’re Building

AI Pricing for Retailers

This project aims to help a mid-sized retailer compete with large players like Amazon.

Smaller companies often can’t afford significant price discounts, so they can face challenges finding optimal price points as they expand their product lines.

Our goal is to leverage AI models to recommend the best price for a selected product to maximize sales for the retailer, and display it on a client-side user interface (UI):

You can explore the UI from here.

The Models

I’ll train and tune multiple models so that when the primary model fails, a backup model gets loaded to serve predictions.

Primary Model: Multi-layered feedforward network (on the PyTorch library)
Backup Models (Backups): LightGBM, SVR, and Elastic Net (on the Scikit-Learn library)

The backup models are prioritized based on learning capabilities.
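
The fallback idea can be sketched in a few lines. This is not the project's actual loading code. The function names and the two stand-in models below are hypothetical, and only show the pattern of trying models in priority order until one succeeds.

```python
def predict_with_fallback(models, features):
    """Try each (name, predict_fn) in priority order and return the first success."""
    errors = []
    for name, predict_fn in models:
        try:
            return name, predict_fn(features)
        except Exception as exc:  # any failure moves on to the next model
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))

def broken_primary(features):
    """Stand-in for a primary model that cannot be loaded."""
    raise RuntimeError("model file could not be loaded")

def backup_model(features):
    """Stand-in for a backup model; returns a dummy prediction."""
    return sum(features) / len(features)

# Priority order: primary first, then backups
MODELS = [("dfn", broken_primary), ("gbm", backup_model)]

used_model, prediction = predict_with_fallback(MODELS, [1.0, 2.0, 3.0])
print(used_model, prediction)
```

Returning the name of the model that answered makes it easy to log how often the system is running on a backup.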

Tuning and Training

The primary model was trained on a dataset of around 500,000 samples (source) and tuned using Optuna's Bayesian Optimization, with grid search available for further refinement.

The backups are also trained on the same samples and tuned using the Scikit-Optimize framework.

The Prediction

All models predict log-transformed quantity values.

Logarithmic transformations of the quantity data make the distribution denser, which helps models learn patterns more effectively. This is because logarithms reduce the impact of extreme values, or outliers, and can help normalize skewed data.
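
Here is a small sketch of that effect on made-up quantities with one extreme value. I'm assuming a log1p transform (the log of 1 + x), which handles zero quantities safely. The article doesn't state which variant the project uses.

```python
import math

# Made-up order quantities with one extreme outlier
quantities = [1, 3, 5, 8, 12, 20, 2500]

# Forward transform: compress the scale before training
logged = [math.log1p(q) for q in quantities]

# Inverse transform: convert model outputs back to real quantities
restored = [math.expm1(v) for v in logged]

raw_spread = max(quantities) / min(quantities)
log_spread = max(logged) / min(logged)
print(f"largest/smallest before: {raw_spread:.0f}x, after: {log_spread:.1f}x")
```

Because the model predicts logged values, remember to apply the inverse transform before showing a quantity to users.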

Performance Validation

We’ll evaluate model performance using different metrics for the transformed and original data, with a lower value always indicating better performance.

Logged values: Mean Squared Error (MSE)
Actual values: Root Mean Squared Log Error (RMSLE) and Mean Absolute Error (MAE)
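
For reference, the three metrics can be written out in plain Python as below. This is a sketch of the standard definitions rather than the project's evaluation code, and the sample values are invented.

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the miss, in the original units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error: penalizes relative rather than absolute misses."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(y_true))

# Made-up actual vs. predicted quantities
actual = [10, 20, 30]
predicted = [12, 18, 33]

print(f"MSE:   {mse(actual, predicted):.3f}")
print(f"MAE:   {mae(actual, predicted):.3f}")
print(f"RMSLE: {rmsle(actual, predicted):.3f}")
```

All three are zero for perfect predictions and grow as predictions get worse, which is why lower is always better here.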

The System Architecture

We’re going to build a complete ecosystem around an AWS Lambda function to create a scalable ML system:

Fig. The system architecture (Created by Kuriko IWAI)

AWS Lambda is a serverless compute service that lets you run an application without managing servers. Once you upload the code, AWS takes on the responsibility of managing the underlying infrastructure.

In a serverless environment, the code is deployed as a stateless function that runs only when it’s triggered by an event such as an HTTP request or a scheduled task.

This event-driven nature makes serverless deployments efficient in resource allocation because:

There’s no server management: The cloud provider takes care of operational tasks.
You have automatic scaling: Serverless applications automatically scale up or down based on demand.
You have pay-per-use billing: You're charged only for the compute resources the application actually consumes.
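
To illustrate what "a stateless function triggered by an event" looks like, here is a minimal sketch of a Python Lambda handler for an HTTP request arriving through an API Gateway proxy integration. It is not this project's handler, which runs a containerized Flask app. The price parameter and the placeholder formula are made up.

```python
import json

def lambda_handler(event, context):
    """Entry point Lambda calls for each event; nothing is kept between calls."""
    params = event.get("queryStringParameters") or {}
    try:
        price = float(params["price"])
    except (KeyError, TypeError, ValueError):
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "a numeric 'price' query parameter is required"}),
        }

    # Placeholder for real model inference (hypothetical formula)
    predicted_quantity = max(0.0, 100.0 - 2.0 * price)

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"price": price, "predicted_quantity": predicted_quantity}),
    }

# Local smoke test with a hand-written event (no AWS needed)
response = lambda_handler({"queryStringParameters": {"price": "20"}}, None)
print(response["statusCode"], response["body"])
```

Because the function keeps no state, AWS can run as many copies in parallel as the traffic requires, which is what makes automatic scaling possible.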

Note that other cloud ecosystems like Google Cloud Platform (GCP) and Microsoft Azure offer comprehensive alternatives to AWS. Which one you choose depends on your budget, project type, and familiarity with each ecosystem.

Core AWS Resources in the Architecture

The system architecture focuses on the following points:

The application is fully containerized on Docker for universal accessibility.
The container image is stored in AWS Elastic Container Registry (ECR).
The API Gateway’s REST API endpoints trigger an event to invoke the Lambda function.
The Lambda function loads the container image from ECR and performs inference.
Trained models, processors, and input features are stored in AWS S3 buckets.
A Redis client serves cached analytical data and past predictions stored in the ElastiCache.

And to build the system, we’ll use the following AWS resources:

Lambda: Serves a function to perform inference.
API Gateway: Routes API calls to the Lambda function.
S3 Storage: Serves as the feature store and model store.
ElastiCache: Stores cached predictions and analytical data.
ECR: Stores Docker container images to allow Lambda to pull the image.
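
The role of the cache is easiest to see in code. The sketch below shows the cache-aside pattern: look in the cache first, and only run the model on a miss. To keep it runnable without AWS, a small in-memory class stands in for the Redis client. The key format and function names are assumptions for illustration.

```python
import json

class DictCache:
    """In-memory stand-in for a Redis client (only get and set)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

def cached_predict(cache, product_id, price, predict_fn):
    """Return (prediction, was_cache_hit), computing the prediction only on a miss."""
    key = f"pred:{product_id}:{price:.2f}"  # hypothetical key format
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit), True
    value = predict_fn(product_id, price)
    cache.set(key, json.dumps(value))
    return value, False

calls = []

def slow_model(product_id, price):
    """Stand-in for real inference; records each call so we can count them."""
    calls.append((product_id, price))
    return {"product_id": product_id, "price": price, "quantity": 42.0}

cache = DictCache()
first, first_hit = cached_predict(cache, "sku-1", 9.99, slow_model)
second, second_hit = cached_predict(cache, "sku-1", 9.99, slow_model)
print(first_hit, second_hit, len(calls))
```

The second request never reaches the model, which is what keeps repeat requests fast and cheap.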

Each resource requires configuration. I’ll explore those details in the next section.

The Deployment Workflow in Action

The deployment workflow involves the following steps:

Draft data preparation, model training, and serialization scripts
Configure designated feature store and model store in S3
Create a Flask application with API endpoints
Publish a Docker image to ECR
Create a Lambda function
Configure related AWS resources

We’ll now walk through each of these steps to help you fully understand the process.

For your reference, here is the repository structure:

.
.venv/                  [.gitignore]    # stores uv venv
│
└── data/               [.gitignore]
│     └──raw/                           # stores raw data
│     └──preprocessed/                  # stores processed data after imputation and engineering
│
└── models/             [.gitignore]    # stores serialized model after training and tuning
│     └──dfn/                           # deep feedforward network
│     └──gbm/                           # light gbm
│     └──en/                            # elastic net
│     └──production/                    # models to be stored in S3 for production use
│
└── notebooks/                          # stores experimentation notebooks
│
└── src/                                # core functions
│     └──_utils/                        # utility functions
│     └──data_handling/                 # functions to engineer features
│     └──model/                         # functions to train, tune, validate models
│     │     └── sklearn_model
│     │     └── torch_model
│     │     └── ...
│     └──main.py                        # main script to run the inference locally
│
└──app.py                               # Flask application (API endpoints)
└──pyproject.toml                       # project configuration
└──.env                [.gitignore]     # environment variables
└──uv.lock                              # dependency locking
└──Dockerfile                           # for Docker container image
└──.dockerignore
└──requirements.txt
└──.python-version                      # python version locking (3.12)

Step 1: Draft Python Scripts

The first step is to draft Python scripts for data preparation, model training and tuning.

We’ll run these scripts as a batch process because they are resource-intensive, stateful tasks. That makes them a poor fit for serverless functions, which are optimized for short-lived, stateless, event-driven work.

Serverless functions can also experience cold starts. If heavy tasks ran inside the function, API Gateway could time out before predictions are served.

src/main.py

import os
import torch
import warnings
import pickle
import joblib
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR
from skopt.space import Real, Integer, Categorical
from dotenv import load_dotenv

import src.data_handling as data_handling
import src.model.torch_model as t
import src.model.sklearn_model as sk


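# Note: the path constants (PRODUCTION_MODEL_FOLDER_PATH, PREPROCESSOR_PATH, DFN_FILE_PATH, ...)
# and the sklearn_models config list are not shown in this excerpt.
if __name__ == '__main__':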
    load_dotenv(override=True)
    os.makedirs(PRODUCTION_MODEL_FOLDER_PATH, exist_ok=True)

    # create train, validation, test datasets
    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = data_handling.main_script()

    # store the trained preprocessor in local storage
    joblib.dump(preprocessor, PREPROCESSOR_PATH)

    # model tuning and training
    best_dfn_full_trained, checkpoint = t.main_script(X_train, X_val, y_train, y_val)

    # serialize the trained model
    torch.save(checkpoint, DFN_FILE_PATH)

    # svr
    best_svr_trained, best_hparams_svr = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[1]
    )
    if best_svr_trained is not None:
        with open(SVR_FILE_PATH, 'wb') as f:
            pickle.dump({ 'best_model': best_svr_trained, 'best_hparams': best_hparams_svr }, f)

    # elastic net
    best_en_trained, best_hparams_en = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[0]
    )
    if best_en_trained is not None:
        with open(EN_FILE_PATH, 'wb') as f:
            pickle.dump({ 'best_model': best_en_trained, 'best_hparams': best_hparams_en }, f)

    # light gbm
    best_gbm_trained, best_hparams_gbm = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[2]
    )

    if best_gbm_trained is not None:
        with open(GBM_FILE_PATH, 'wb') as f:
            pickle.dump({'best_model': best_gbm_trained, 'best_hparams': best_hparams_gbm }, f)

Run the script to train and serialize the models using the uv package manager:

$ uv venv
$ source .venv/bin/activate
$ uv run src/main.py

The main.py script includes several key components.

Scripts for Data Handling

These scripts load the original data, handle missing values, and engineer the features needed for prediction.

src/data_handling/main.py

import os
import joblib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

import src.data_handling.scripts as scripts
from src._utils import main_logger


# load and save the original data frame in parquet
df = scripts.load_original_dataframe()
df.to_parquet(ORIGINAL_DF_PATH, index=False)

# imputation
df = scripts.structure_missing_values(df=df)

# feature engineering
df = scripts.handle_feature_engineering(df=df)

# save processed df in csv and parquet
scripts.save_df_to_csv(df=df)
df.to_parquet(PROCESSED_DF_PATH, index=False)


# for preprocessing, classify numerical and categorical columns
num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
if cat_cols:
    for col in cat_cols: df[col] = df[col].astype('string')

# creates training, validation, and test datasets (test dataset is for inference only)
y = df[target_col]
X = df.copy().drop(target_col, axis='columns')
test_size, random_state = 50000, 42
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=test_size, random_state=random_state
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=test_size, random_state=random_state
)

# transform the input datasets
X_train, X_val, X_test, preprocessor = scripts.transform_input(
    X_train, X_val, X_test, num_cols=num_cols, cat_cols=cat_cols
)

# retrain and serialize the preprocessor
if preprocessor is not None: preprocessor.fit(X)
joblib.dump(preprocessor, PREPROCESSOR_PATH)

Scripts for Model Training and Tuning (PyTorch Model)

The scripts initialize the model, search for the optimal neural architecture and hyperparameters, and serialize the fully trained model so that the system can load it when performing inference.

Because the primary model is built on PyTorch and the backups use Scikit-Learn, we’re drafting the scripts separately.

1. PyTorch Models

The training script trains the model and validates it on a subset of the training data.

It also includes early stopping logic that halts training when the validation loss has not improved for a given number of consecutive epochs (for example, 10 epochs).

src/model/torch_model/scripts/training.py

import torch
import torch.nn as nn
import optuna # type: ignore
from sklearn.model_selection import train_test_split

from src._utils import main_logger

# device
device_type = device_type if device_type else 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
device = torch.device(device_type)

# gradient scaler for stability (only applicable for cuda)
scaler = torch.GradScaler(device=device_type) if device_type == 'cuda' else None

# start training
best_val_loss = float('inf')
epochs_no_improve = 0
for epoch in range(num_epochs):
    model.train()
    for batch_X, batch_y in train_data_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()

        try:
            # pytorch's AMP system automatically handles the casting of tensors to Float16 or Float32
            with torch.autocast(device_type=device_type):
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)

                # break the training loop when models return nan or inf
                if torch.any(torch.isnan(outputs)) or torch.any(torch.isinf(outputs)):
                    main_logger.error(
                        'pytorch model returns nan or inf. break the training loop.'
                    )
                    break

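            # backpropagate, clip the gradients, and update the weights
            if scaler is not None:
                scaler.scale(loss).backward()  # backward pass on the scaled loss
                scaler.unscale_(optimizer)  # unscale the gradients before clipping
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                scaler.step(optimizer)  # skips the update if the gradients contain inf or nan
                scaler.update()  # updates the scale

            else:
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) # clipping grad
                optimizer.step()

        except Exception:
            # fall back to a full-precision step if the mixed-precision step fails
            optimizer.zero_grad()  # discard any partially accumulated gradients
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()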


    # run validation on a subset of the training dataset
    model.eval()
    val_loss = 0.0

    # switch the torch mode
    with torch.inference_mode():
        for batch_X_val, batch_y_val in val_data_loader:
            batch_X_val, batch_y_val = batch_X_val.to(device), batch_y_val.to(device)
            outputs_val = model(batch_X_val)
            val_loss += criterion(outputs_val, batch_y_val).item()

    val_loss /= len(val_data_loader)

    # check if early stop
    if val_loss < best_val_loss - min_delta:
        best_val_loss = val_loss
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1
        if epochs_no_improve >= patience:
            main_logger.info(f'early stopping at epoch {epoch + 1}')
            break
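
The early stopping rule at the end of the loop above can be isolated from PyTorch entirely. The sketch below mirrors the same patience and min_delta comparison. The EarlyStopper class name and the loss values are made up for illustration:

```python
class EarlyStopper:
    """Tracks the best validation loss and signals when training should stop."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_no_improve = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_no_improve = 0
        else:
            self.epochs_no_improve += 1
        return self.epochs_no_improve >= self.patience


# made-up loss history: improves for three epochs, then plateaus
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73]
stopper = EarlyStopper(patience=3, min_delta=0.0)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

Note that an epoch that only ties the best loss does not reset the counter, because the comparison is strict.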

The tuning script uses the study component from the Optuna library to run Bayesian optimization.

The study component chooses a neural architecture and hyperparameter set to test from the global search space.

Then, it builds, trains, and validates the model to find the neural architecture that minimizes the loss (MSE, for instance).

src/model/torch_model/scripts/tuning.py

import itertools
import pandas as pd
import numpy as np
import optuna
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split

from src.model.torch_model.scripts.pretrained_base import DFN
from src.model.torch_model.scripts.training import train_model
from src._utils import main_logger

# device
device_type = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
device = torch.device(device_type)

# loss function
criterion = nn.MSELoss()

# define objective function for optuna
def objective(trial):
    # model
    num_layers = trial.suggest_int('num_layers', 1, 20)
    batch_norm = trial.suggest_categorical('batch_norm', [True, False])
    dropout_rates = []
    hidden_units_per_layer = []
    for i in range(num_layers):
        dropout_rates.append(trial.suggest_float(f'dropout_rate_layer_{i}', 0.0, 0.6))
        hidden_units_per_layer.append(trial.suggest_int(f'n_units_layer_{i}', 8, 256)) # hidden units per layer

    model = DFN(
        input_dim=X_train.shape[1],
        num_layers=num_layers,
        dropout_rates=dropout_rates,
        batch_norm=batch_norm,
        hidden_units_per_layer=hidden_units_per_layer
    ).to(device)

    # optimizer
    learning_rate = trial.suggest_float('learning_rate', 1e-10, 1e-1, log=True)
    optimizer_name = trial.suggest_categorical('optimizer', ['adam', 'rmsprop', 'sgd', 'adamw', 'adamax', 'adadelta', 'radam'])
    optimizer = _handle_optimizer(optimizer_name=optimizer_name, model=model, lr=learning_rate)

    # data loaders
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128, 256])
    test_size = 10000 if len(X_train) > 15000 else int(len(X_train) * 0.2)
    X_train_search, X_val_search, y_train_search, y_val_search = train_test_split(X_train, y_train, test_size=test_size, random_state=42)
    train_data_loader = create_torch_data_loader(X=X_train_search, y=y_train_search, batch_size=batch_size)
    val_data_loader = create_torch_data_loader(X=X_val_search, y=y_val_search, batch_size=batch_size)

    # training
    num_epochs = 3000 # ensure enough epochs (early stopping would stop the loop when overfitting)
    _, best_val_loss = train_model(
        train_data_loader=train_data_loader,
        val_data_loader=val_data_loader,
        model=model,
        optimizer=optimizer,
        criterion = criterion,
        num_epochs=num_epochs,
        trial=trial,
    )
    return best_val_loss


# start to optimize hyperparameters and architecture
study = optuna.create_study(direction='minimize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50, timeout=600)

# best 
best_trial = study.best_trial
best_hparams = best_trial.params

# construct the model based on the tuning results
best_lr = best_hparams['learning_rate']
best_batch_size = best_hparams['batch_size']
input_dim = X_train.shape[1]
best_model = DFN(
    input_dim=input_dim,
    num_layers=best_hparams['num_layers'],
    hidden_units_per_layer=[v for k, v in best_hparams.items() if 'n_units_layer_' in k],
    batch_norm=best_hparams['batch_norm'],
    dropout_rates=[v for k, v in best_hparams.items() if 'dropout_rate_layer_' in k],
).to(device)

# construct an optimizer based on the tuning results
best_optimizer_name = best_hparams['optimizer']
best_optimizer = _handle_optimizer(
    optimizer_name=best_optimizer_name, model=best_model, lr=best_lr
)

# create torch data loaders
train_data_loader = create_torch_data_loader(
    X=X_train, y=y_train, batch_size=best_batch_size
)
val_data_loader = create_torch_data_loader(
    X=X_val, y=y_val, batch_size=best_batch_size
)

# retrain the best model with full training dataset applying the optimal batch size and optimizer
best_model, _ = train_model(
    train_data_loader=train_data_loader,
    val_data_loader=val_data_loader,
    model=best_model,
    optimizer=best_optimizer,
    criterion = criterion,
    num_epochs=1000
)

# create a checkpoint for serialization (reconstruct the model using the checkpoint)
checkpoint = {
    'state_dict': best_model.state_dict(),
    'hparams': best_hparams,
    'input_dim': X_train.shape[1],
    'optimizer': best_optimizer,
    'batch_size': best_batch_size
}

# serialize the model w/ checkpoint
torch.save(checkpoint, FILE_PATH)
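
The script above rebuilds the per-layer lists by filtering the flat parameter dictionary on key substrings, which depends on the dictionary's insertion order. As a hedged alternative, the sketch below orders the values by the layer index embedded in each key. The collect_layer_params helper and the example values are hypothetical:

```python
import re


def collect_layer_params(params: dict, prefix: str) -> list:
    """Return values whose key is `<prefix><index>`, ordered by the integer index."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    indexed = []
    for key, value in params.items():
        match = pattern.match(key)
        if match:
            indexed.append((int(match.group(1)), value))
    return [value for _, value in sorted(indexed, key=lambda pair: pair[0])]


# made-up tuning result, deliberately listed out of order
example_params = {
    "num_layers": 3,
    "n_units_layer_2": 32,
    "dropout_rate_layer_1": 0.2,
    "n_units_layer_0": 128,
    "dropout_rate_layer_0": 0.1,
    "n_units_layer_1": 64,
    "dropout_rate_layer_2": 0.3,
    "learning_rate": 0.001,
}

hidden_units = collect_layer_params(example_params, "n_units_layer_")
dropout_rates = collect_layer_params(example_params, "dropout_rate_layer_")
```

Sorting on the parsed integer also keeps layer 10 after layer 9, which a plain string sort would not.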

2. Scikit-Learn Models (Backups)

For Scikit-Learn models, we’ll run k-fold cross-validation during training to reduce the risk of overfitting.

K-fold cross-validation is a technique for evaluating a machine learning model's performance by training and testing it on different subsets of training data.

We define the run_kfold_validation function where the model is trained and validated using 5-fold cross-validation.

src/model/sklearn_model/scripts/tuning.py

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

from src._utils import main_logger

def run_kfold_validation(
        X_train,
        y_train,
        base_model,
        hparams: dict,
        n_splits: int = 5, # the number of folds 
        early_stopping_rounds: int = 10,
        max_iters: int = 200
    ) -> float:

    mses = 0.0

    # create k-fold component
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    for fold, (train_index, val_index) in enumerate(kf.split(X_train)):
        # create a subset of training and validation datasets from the entire training data
        X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
        y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

        # reconstruct a model
        model = base_model(**hparams)

        # start the cross validation
        best_val_mse = float('inf')
        patience_counter = 0
        best_model_state = None
        best_iteration = 0

        for iteration in range(max_iters):
            # train on a subset of the training data
            try:
                model.train_one_step(X_train_fold, y_train_fold, iteration)
            except Exception:
                # models without incremental training are refit from scratch
                model.fit(X_train_fold, y_train_fold)

            # make a prediction on validation data
            y_pred_val_kf = model.predict(X_val_fold)

            # compute validation loss (MSE)
            current_val_mse = mean_squared_error(y_val_fold, y_pred_val_kf)

            # check if epochs should be stopped (early stopping)
            if current_val_mse < best_val_mse:
                best_val_mse = current_val_mse
                patience_counter = 0
                best_model_state = model.get_params()
                best_iteration = iteration
            else:
                patience_counter += 1

            # execute early stopping when patience_counter reaches early_stopping_rounds
            if patience_counter >= early_stopping_rounds:
                main_logger.info(f"Fold {fold}: Early stopping triggered at iteration {iteration} (best at {best_iteration}). Best MSE: {best_val_mse:.4f}")
                break


        # after training epochs, reconstruct the best performing model 
        if best_model_state: model.set_params(**best_model_state)

        # make prediction
        y_pred_val_kf = model.predict(X_val_fold)

        # add MSEs
        mses += mean_squared_error(y_pred_val_kf, y_val_fold)

    # compute the final loss (average of MSEs across folds)
    ave_mse = mses / n_splits
    return ave_mse
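
To make the fold mechanics concrete, here is a minimal standard-library sketch of how k-fold splitting partitions row indices so that every row is used for validation exactly once. It illustrates the idea only: the kfold_indices helper is hypothetical, and its shuffling will not reproduce the exact folds that Scikit-Learn's KFold generates:

```python
import random


def kfold_indices(n_samples: int, n_splits: int = 5, seed: int = 42):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one validation fold."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # spread the remainder over the first folds so fold sizes differ by at most one
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=12, n_splits=5))
for fold, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {fold}: train={len(train_idx)} rows, val={len(val_idx)} rows")
```

Averaging the validation error over all folds, as run_kfold_validation does, gives an estimate that depends less on any single train/validation split.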

Then, for the tuning script, we use the gp_minimize function from the Scikit-Optimize library.

The gp_minimize function is used to tune hyperparameters with Bayesian optimization.

This function searches for the hyperparameter set that minimizes the model's error, which is calculated using the run_kfold_validation function defined earlier.

The best-performing hyperparameters are then used to reconstruct and train the final model.

src/model/sklearn_model/scripts/tuning.py

from functools import partial
from skopt import gp_minimize


# define the objective function for Bayesian Optimization using Scikit-Optimize
def objective(params, X_train, y_train, base_model, hparam_names):
    hparams = {item: params[i] for i, item in enumerate(hparam_names)}
    ave_mse = run_kfold_validation(X_train=X_train, y_train=y_train, base_model=base_model, hparams=hparams)
    return ave_mse

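# collect the hyperparameter names from the search space (space is assumed to be a list of named skopt dimensions defined elsewhere)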
hparam_names = [s.name for s in space]
objective_partial = partial(objective, X_train=X_train, y_train=y_train, base_model=base_model, hparam_names=hparam_names)

# search the optimal hyperparameters
results = gp_minimize(
    func=objective_partial,
    dimensions=space,
    n_calls=n_calls,
    random_state=42,
    verbose=False,
    n_initial_points=10,
)
# results
best_hparams = dict(zip(hparam_names, results.x))
best_mse = results.fun

# reconstruct the model with the best hyperparameters
best_model = base_model(**best_hparams)

# retrain the model with full training dataset
best_model.fit(X_train, y_train)
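
The optimizer calls the objective with a plain list of values, so the script uses functools.partial to bind the fixed arguments and maps the list back to hyperparameter names. The sketch below shows that wiring on its own. It swaps gp_minimize for a simple random search (random_minimize), which is a stand-in and not Bayesian optimization, and the toy_score function and its parameter names are made up:

```python
import random
from functools import partial


def objective(params, hparam_names, score_fn):
    """Map the optimizer's positional values to named hyperparameters, then score them."""
    hparams = dict(zip(hparam_names, params))
    return score_fn(**hparams)


def toy_score(alpha, depth):
    """Made-up loss surface with its minimum at alpha=0.3, depth=4."""
    return (alpha - 0.3) ** 2 + (depth - 4) ** 2


def random_minimize(func, bounds, n_calls, seed=42):
    """Stand-in optimizer: samples uniformly within the bounds and keeps the best point."""
    rng = random.Random(seed)
    best_x, best_fun = None, float("inf")
    for _ in range(n_calls):
        x = [rng.uniform(low, high) for low, high in bounds]
        value = func(x)
        if value < best_fun:
            best_x, best_fun = x, value
    return best_x, best_fun


hparam_names = ["alpha", "depth"]
# bind everything except the positional parameter list
objective_partial = partial(objective, hparam_names=hparam_names, score_fn=toy_score)
best_x, best_fun = random_minimize(
    objective_partial, bounds=[(0.0, 1.0), (1.0, 8.0)], n_calls=200
)
best_hparams = dict(zip(hparam_names, best_x))
```

The order of hparam_names must match the order of the search dimensions, since the mapping is purely positional.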

Step 2: Configure Feature/Model Stores in S3

The trained models and processed data are stored in an S3 bucket: the models as serialized .pth files and the data as Parquet files.

We’ll draft the s3_upload function where the Boto3 client, a low-level interface to an AWS service, initiates the connection to S3:

import os
import boto3
from dotenv import load_dotenv

from src._utils import main_logger

def s3_upload(file_path: str):
    # initiate the boto3 client
    load_dotenv(override=True)
    S3_BUCKET_NAME = os.environ.get('S3_BUCKET_NAME') # the bucket created in s3
    s3_client = boto3.client('s3', region_name=os.environ.get('AWS_REGION_NAME')) # your default region

    if s3_client:
        # create s3 key and upload the file to the bucket
        s3_key = file_path if file_path[0] != '/' else file_path[1:]
        s3_client.upload_file(file_path, S3_BUCKET_NAME, s3_key)
        main_logger.info(f"file uploaded to s3://{S3_BUCKET_NAME}/{s3_key}")
    else:
        main_logger.error('failed to create an S3 client.')
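
The upload function derives the S3 object key from the local path by dropping a leading slash. A small sketch of that mapping follows. The helper names and file paths are hypothetical, and the bucket name reuses the default that appears later in app.py:

```python
def to_s3_key(file_path: str) -> str:
    """Derive an S3 object key from a local path by dropping one leading slash."""
    return file_path[1:] if file_path.startswith("/") else file_path


def to_s3_uri(bucket: str, file_path: str) -> str:
    """Format the location where the uploaded object will live."""
    return f"s3://{bucket}/{to_s3_key(file_path)}"


# hypothetical paths, for illustration only
print(to_s3_uri("ml-sales-pred", "/models/dfn.pth"))
print(to_s3_uri("ml-sales-pred", "data/processed.parquet"))
```

Using startswith instead of indexing the first character also avoids an IndexError when the path is an empty string.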

Model Store

Trained PyTorch models are serialized (converted) into .pth files.

Then, these files are uploaded to the S3 bucket, enabling the system to load the trained model when it performs inference in production.

import torch

from src._utils import s3_upload

# model serialization, store in local
torch.save(trained_model.state_dict(), MODEL_FILE_PATH)

# upload to s3 model store
s3_upload(file_path=MODEL_FILE_PATH)

Feature Store

The processed data is converted into CSV and Parquet file formats.

Then, the Parquet files are uploaded to the S3 bucket, enabling the system to load the lightweight data when it creates prediction data to perform inference in production.

from src._utils import s3_upload

# store csv and parquet files in local
df.to_csv(file_path, index=False)
df.to_parquet(DATA_FILE_PATH, index=False)

# store in s3 feature store
s3_upload(file_path=DATA_FILE_PATH)

# the trained preprocessor is also stored to transform the prediction data
s3_upload(file_path=PREPROCESSOR_PATH)

Step 3: Create a Flask Application with API Endpoints

Next, we’ll create a Flask application with API endpoints.

The Flask application is configured in the app.py file located at the root of the project repository.

As shown in the code snippet below, the app.py file contains the following components, in order:

AWS Boto3 client setup,
Flask app configuration and API endpoint setup,
Loading of the trained preprocessor, processed input data X_test, and trained models,
Invocation of the Lambda function via API Gateway, and
The local test section.

Note that X_test should never be used during model training to avoid data leakage.

app.py

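import os
import json

import awsgi
import boto3
import numpy as np
import pandas as pd
import torch
from flask import Flask, jsonify, request
from flask_cors import cross_origin
from waitress import serve
from dotenv import load_dotenv

from src._utils import main_logger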

# global variables (will be loaded from the S3 buckets)
_redis_client = None
X_test = None
preprocessor = None
model = None
backup_model = None

# load env if local else skip (lambda refers to env in production)
AWS_LAMBDA_RUNTIME_API = os.environ.get('AWS_LAMBDA_RUNTIME_API', None)
if AWS_LAMBDA_RUNTIME_API is None: load_dotenv(override=True)


#### <---- 1. AWS BOTO3 CLIENT ---->
# boto3 client 
S3_BUCKET_NAME = os.environ.get('S3_BUCKET_NAME', 'ml-sales-pred')
s3_client = boto3.client('s3', region_name=os.environ.get('AWS_REGION_NAME', 'us-east-1'))
try:
    # test connection to boto3 client
    sts_client = boto3.client('sts')
    identity = sts_client.get_caller_identity()
    main_logger.info(f"Lambda is using role: {identity['Arn']}")
except Exception as e:
    main_logger.error(f"Lambda credentials/permissions error: {e}")

#### <---- 2. FLASK CONFIGURATION & API ENDPOINTS ---->
# configure the flask app
app = Flask(__name__)
app.config['CORS_HEADERS'] = 'Content-Type'

# add a simple API endpoint to serve the prediction by price point to test
@app.route('/v1/predict-price/<stockcode>', methods=['GET', 'OPTIONS'])
@cross_origin(origins=origins, methods=['GET', 'OPTIONS'], supports_credentials=True)
def predict_price(stockcode):
    df_stockcode = None

    # fetch request params
    data = request.args.to_dict()

    try:
        # fetch cache
        if _redis_client is not None:
            # returns cached prediction results if any without performing inference
            cached_prediction_result = _redis_client.get(cache_key_prediction_result_by_stockcode)
            if cached_prediction_result:
                # the cache holds a JSON string, so decode it before responding
                return jsonify(json.loads(cached_prediction_result))

            # historical data of the selected product
            cached_df_stockcode = _redis_client.get(cache_key_df_stockcode)
            if cached_df_stockcode: df_stockcode = json.loads(cached_df_stockcode)


        # define the price range to make predictions. can be a request param, or historical min/max prices
        min_price = float(data.get('unitprice_min', df_stockcode['unitprice_min'][0]))
        max_price = float(data.get('unitprice_max', df_stockcode['unitprice_max'][0]))

        # create bins in the price range. when the number of the bins increase, the prediction becomes more smooth, but requires more computational cost
        NUM_PRICE_BINS = int(data.get('num_price_bins', 100))
        price_range = np.linspace(min_price, max_price, NUM_PRICE_BINS)

        # create a prediction dataset by merging X_test (dataset never used in model training) and df_stockcode
        price_range_df = pd.DataFrame({ 'unitprice': price_range })
        if X_test is not None:
            test_sample = X_test.sample(n=1000, random_state=42)
            test_sample_merged = test_sample.merge(price_range_df, how='cross')
            # keep the price grid (unitprice_y) and drop the sampled rows' original price (unitprice_x)
            test_sample_merged.drop('unitprice_x', axis=1, inplace=True)
            test_sample_merged.rename(columns={'unitprice_y': 'unitprice'}, inplace=True)
        else:
            test_sample_merged = price_range_df

        # preprocess the dataset
        X = preprocessor.transform(test_sample_merged) if preprocessor else test_sample_merged

        # perform inference
        y_pred_actual = None
        epsilon = 0
        # try using the primary model
        if model:
            input_tensor = torch.tensor(X, dtype=torch.float32)
            model.eval()
            with torch.inference_mode():
                y_pred = model(input_tensor)
                y_pred = y_pred.cpu().numpy().flatten()
                y_pred_actual = np.exp(y_pred + epsilon)

        # if not, use backups
        elif backup_model:
            y_pred = backup_model.predict(X)
            y_pred_actual = np.exp(y_pred + epsilon)


        # finalize the outcome for client app
        df_ = test_sample_merged.copy()
        df_['quantity'] = np.floor(y_pred_actual) # quantity must be an integer
        df_['sales'] = df_['quantity'] * df_['unitprice'] # compute sales
        df_ = df_.sort_values(by='unitprice')

        # aggregate the results by the unitprice in the price range
        df_results = df_.groupby('unitprice').agg(
            quantity=('quantity', 'median'),
            quantity_min=('quantity', 'min'),
            quantity_max=('quantity', 'max'),
            sales=('sales', 'median'),
        ).reset_index()

        # find the optimal price point
        optimal_row = df_results.loc[df_results['sales'].idxmax()]
        optimal_price = optimal_row['unitprice']
        optimal_quantity = optimal_row['quantity']
        best_sales = optimal_row['sales']

        all_outputs = []
        for _, row in df_results.iterrows():
            current_output = {
                "stockcode": stockcode,
                "unit_price": float(row['unitprice']),
                'quantity': int(row['quantity']),
                'quantity_min': int(row['quantity_min']),
                'quantity_max': int(row['quantity_max']),
                "predicted_sales": float(row['sales']),
            }
            all_outputs.append(current_output)

        # store the prediction results in cache
        if all_outputs and _redis_client is not None:
            serialized_data = json.dumps(all_outputs)
            _redis_client.set(
                cache_key_prediction_result_by_stockcode,
                serialized_data,
                ex=3600     # expire in an hour
            )

        # return a list of all outputs
        return jsonify(all_outputs)

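    except Exception as e:
        # log the failure before returning an empty result to the client
        main_logger.error(f'failed to serve the prediction: {e}')
        return jsonify([])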


# request header management (for the process from API gateway to the Lambda)
@app.after_request
def add_header(response):
    response.headers['Cache-Control'] = 'public, max-age=0'
    response.headers['Access-Control-Allow-Origin'] = CLIENT_A
    response.headers['Access-Control-Allow-Headers'] = 'Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token,Origin'
    response.headers['Access-Control-Allow-Methods'] = 'GET, POST, OPTIONS'
    response.headers['Access-Control-Allow-Credentials'] = 'true'
    return response

#### <---- 3. LOADING PROCESSOR, DATASET, AND MODELS ---->
load_preprocessor()
load_x_test()
load_model()

#### <---- 4. INVOKE LAMBDA ---->
def handler(event, context):
    main_logger.info("lambda handler invoked.")
    try:
        # connecting the redis client after the lambda is invoked
        get_redis_client()
    except Exception as e:
        main_logger.critical(f"failed to establish initial Redis connection in handler: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps({'error': 'Failed to initialize Redis client. Check environment variables and network config.'})
        }

    # use the awsgi package to convert JSON to WSGI
    return awsgi.response(app, event, context)


#### <---- 5. FOR LOCAL TEST ---->
# serve the application locally on WSGI server, waitress
# lambda will ignore this section.
if __name__ == '__main__':   
    if os.getenv('ENV') == 'local':
        main_logger.info("...start the operation (local)...")
        serve(app, host='0.0.0.0', port=5002)
    else:
        app.run(host='0.0.0.0', port=8080)

I’ll test the endpoint locally using the uv package manager:

$uv run app.py --cache-clear

$curl http://localhost:5002/v1/predict-price/{STOCKCODE}

The system provided a list of sales predictions for each price point:

Fig. Screenshot of the Flask app local response
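
The core of predict_price, building a grid of candidate prices and picking the one with the highest predicted sales, can be sketched without the model or the data stores. In the sketch below, toy_demand is a made-up demand curve standing in for the trained model, and linspace is a small standard-library substitute for np.linspace:

```python
import math


def linspace(start: float, stop: float, num: int) -> list:
    """Evenly spaced values from start to stop, both ends included."""
    if num < 2:
        return [float(start)]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def toy_demand(price: float) -> float:
    """Made-up demand curve: quantity falls linearly as the price rises."""
    return max(0.0, 100.0 - 8.0 * price)


def best_price(min_price, max_price, num_bins, predict_quantity):
    """Score every candidate price and return the best row plus all rows."""
    rows = []
    for price in linspace(min_price, max_price, num_bins):
        quantity = math.floor(predict_quantity(price))  # quantity must be an integer
        rows.append({
            "unit_price": price,
            "quantity": quantity,
            "predicted_sales": quantity * price,
        })
    return max(rows, key=lambda row: row["predicted_sales"]), rows


optimal, rows = best_price(2.0, 10.0, 5, toy_demand)
print(optimal)
```

Raising the number of bins gives a finer price grid at the cost of more inference calls, which is the same trade-off the endpoint exposes through num_price_bins.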

Key Points on Flask App Configuration

There are various points you should take into consideration when configuring a Flask application with Lambda. Let’s go over them now:

1. A Few API Endpoints Per Container

Adding many API endpoints to a single serverless instance can lead to a monolithic function, where an issue in one endpoint impacts the others.

In this project, we’ll focus on a single endpoint per container – and if needed, we can add separate Lambda functions to the system.

2. Understanding the `handler` Function and the Role of AWSGI

The handler function is invoked every time the Lambda function receives a client request from the API Gateway.

The function takes the event argument that includes the request details in a JSON dictionary and passes it to the Flask application.

AWSGI acts as an adapter, translating a Lambda event in JSON format into a WSGI request that a Flask application can understand, and converts the application’s response back into a JSON format that Lambda and API Gateway can process.
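
To illustrate the kind of translation such an adapter performs, here is a simplified sketch that converts an API Gateway REST proxy event into a minimal WSGI environ and converts a WSGI response back into a Lambda response. It is not the awsgi package's actual implementation: both function names are hypothetical, the stock code in the sample path is made up, and details such as base64 bodies and multi-value headers are left out:

```python
import io
from urllib.parse import urlencode


def event_to_environ(event: dict) -> dict:
    """Build a minimal WSGI environ from an API Gateway REST proxy event."""
    body = (event.get("body") or "").encode("utf-8")
    query = urlencode(event.get("queryStringParameters") or {})
    environ = {
        "REQUEST_METHOD": event["httpMethod"],
        "PATH_INFO": event["path"],
        "QUERY_STRING": query,
        "CONTENT_LENGTH": str(len(body)),
        "SERVER_NAME": "lambda",
        "SERVER_PORT": "443",
        "SERVER_PROTOCOL": "HTTP/1.1",
        "wsgi.version": (1, 0),
        "wsgi.url_scheme": "https",
        "wsgi.input": io.BytesIO(body),
        "wsgi.errors": io.StringIO(),
        "wsgi.multithread": False,
        "wsgi.multiprocess": False,
        "wsgi.run_once": False,
    }
    # WSGI exposes request headers as HTTP_* keys
    for name, value in (event.get("headers") or {}).items():
        environ["HTTP_" + name.upper().replace("-", "_")] = value
    return environ


def wsgi_to_lambda_response(status: str, headers: list, body: bytes) -> dict:
    """Convert a WSGI status line, header list, and body into a proxy-style response."""
    return {
        "statusCode": int(status.split(" ", 1)[0]),
        "headers": dict(headers),
        "body": body.decode("utf-8"),
    }


sample_event = {
    "httpMethod": "GET",
    "path": "/v1/predict-price/85123A",  # made-up stock code
    "queryStringParameters": {"num_price_bins": "50"},
    "headers": {"Accept": "application/json"},
    "body": None,
}
environ = event_to_environ(sample_event)
response = wsgi_to_lambda_response("200 OK", [("Content-Type", "application/json")], b"[]")
```

Flask only ever sees the environ, which is why the same app.py can run behind waitress locally and behind Lambda in production.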

3. Using Cache Storage

The get_redis_client function is called once the handler function is called by the API Gateway. This allows the Flask application to store or fetch a cache from the Redis client:

import os

import redis
import redis.cluster
from redis.cluster import ClusterNode

from src._utils import main_logger

_redis_client = None

def get_redis_client():
    global _redis_client
    if _redis_client is None:
        REDIS_HOST = os.environ.get("REDIS_HOST")
        REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
        REDIS_TLS = os.environ.get("REDIS_TLS", "true").lower() == "true"
        try:
            startup_nodes = [ClusterNode(host=REDIS_HOST, port=REDIS_PORT)]
            _redis_client = redis.cluster.RedisCluster(
                startup_nodes=startup_nodes,
                decode_responses=True,
                skip_full_coverage_check=True,
                ssl=REDIS_TLS,                  # elasticache has encryption in transit: enabled -> must be true
                ssl_cert_reqs=None,
                socket_connect_timeout=5,
                socket_timeout=5,
                health_check_interval=30,
                retry_on_timeout=True,
                retry_on_error=[
                    redis.exceptions.ConnectionError,
                    redis.exceptions.TimeoutError
                ],
                max_connections=10,            # limit connections for Lambda
                max_connections_per_node=2     # limit per node
            )
            _redis_client.ping()
            main_logger.info("successfully connected to ElastiCache Redis Cluster (Configuration Endpoint)")
        except Exception as e:
            main_logger.error(f"an unexpected error occurred during Redis Cluster connection: {e}", exc_info=True)
            _redis_client = None
    return _redis_client
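
The endpoint uses the client in a cache-aside pattern: look up the key first, run inference only on a miss, then store the result with an expiry. The sketch below shows that flow with an in-memory stand-in, so it runs without a Redis server. InMemoryCache, cached_predictions, and the cache key are hypothetical names, and the prediction values are made up:

```python
import json
import time


class InMemoryCache:
    """Hypothetical stand-in for a Redis client that supports get and set with expiry."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, ex=None):
        expires_at = time.monotonic() + ex if ex is not None else None
        self._store[key] = (value, expires_at)

    def get(self, key):
        value, expires_at = self._store.get(key, (None, None))
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._store[key]  # drop the expired entry
            return None
        return value


def cached_predictions(cache, key, compute, ttl_seconds=3600):
    """Return cached results when present; otherwise compute, store, and return them."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cached values are stored as JSON strings
    results = compute()
    cache.set(key, json.dumps(results), ex=ttl_seconds)
    return results


calls = []


def expensive_inference():
    """Stand-in for model inference; records how often it runs."""
    calls.append(1)
    return [{"unit_price": 2.5, "predicted_sales": 100.0}]


cache = InMemoryCache()
first = cached_predictions(cache, "prediction:demo", expensive_inference)
second = cached_predictions(cache, "prediction:demo", expensive_inference)
```

The second call is served from the cache, so the stand-in inference function runs only once.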

4. Handling Heavy Tasks Outside of the `handler` Function

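Serverless functions can experience cold starts, where the first request after a period of inactivity waits while a new execution environment is initialized.

While a Lambda function can run for up to 15 minutes, its associated API Gateway has a default integration timeout of 29 seconds (29,000 ms) for a REST API.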

So, any heavy tasks like loading preprocessors, input data, or models should be performed once outside of the handler function, ensuring they are ready before the API endpoint is called.

Here are the loading functions called in app.py.

app.py

import os

import joblib
import pandas as pd
import torch

from src._utils import s3_load, s3_load_to_temp_file

preprocessor = None
X_test = None
model = None
backup_model = None


# load processor
def load_preprocessor():
    global preprocessor
    preprocessor_tempfile_path = s3_load_to_temp_file(PREPROCESSOR_PATH)
    preprocessor = joblib.load(preprocessor_tempfile_path)
    os.remove(preprocessor_tempfile_path)


# load input data
def load_x_test():
    global X_test
    x_test_io = s3_load(file_path=X_TEST_PATH)
    X_test = pd.read_parquet(x_test_io)


# load model
def load_model():
    global model, backup_model
    # try loading & reconstructing the primary model
    try:
        # first load io file from the s3 bucket
        model_data_bytes_io_ = s3_load(file_path=DFN_FILE_PATH)
        # convert to checkpoint dictionary (containing hyperparameter set)
        checkpoint_ = torch.load(
            model_data_bytes_io_, 
            weights_only=False, 
            map_location=device
        )
        # reconstruct the model
        model = t.scripts.load_model(checkpoint=checkpoint_, file_path=DFN_FILE_PATH)
        # set the model evaluation mode
        model.eval()

    # else, backup model
    except Exception:
        load_artifacts_backup_model()
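
The reason for loading at module level is that Lambda imports the module once per execution environment and then reuses it for later invocations. A minimal sketch of that behavior follows, with a counter standing in for the expensive download. The load_heavy_artifact function and its return value are made up:

```python
load_count = 0


def load_heavy_artifact():
    """Stand-in for downloading and deserializing a model or preprocessor."""
    global load_count
    load_count += 1
    return {"weights": [0.1, 0.2, 0.3]}


# runs once, when the execution environment imports the module (cold start)
ARTIFACT = load_heavy_artifact()


def handler(event, context):
    """Runs on every invocation and reuses the artifact that is already loaded."""
    return {"statusCode": 200, "body": str(len(ARTIFACT["weights"]))}


# simulate three warm invocations against the same environment
responses = [handler({}, None) for _ in range(3)]
print(load_count)
```

If the load happened inside handler instead, every request would pay the loading cost and risk exceeding the API Gateway timeout.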

Step 4: Publish a Docker Image to ECR

After configuring the Flask application, we’ll containerize the entire application on Docker.

Containerization packages the application, together with its models, dependencies, and configuration, into a container.

Docker creates a container image based on the instructions defined in a Dockerfile, and the Docker engine uses the image to run the isolated container.

In this project, we’ll upload the Docker container image to ECR, so the Lambda function can access it in production.

After this, we’ll define the .dockerignore file to optimize the container image:

.dockerignore

# any irrelevant data
__pycache__/
.ruff_cache/
.DS_Store/
.venv/
dist/
.vscode
*.psd
*.pdf
[a-f]*.log
tmp/
awscli-bundle/

# add any experimental models, unnecessary data
dfn_bayesian/
dfn_grid/
data/
notebooks/

Dockerfile

# serve from aws ecr 
FROM public.ecr.aws/lambda/python:3.12

# define a working directory in the container
WORKDIR /app

# copy the entire repository (except the paths listed in .dockerignore) into the container at /app
COPY . /app/

# install dependencies defined in the requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# define commands
ENTRYPOINT [ "python" ]
CMD [ "-m", "awslambdaric", "app.handler" ]

Test in Local

Next, we’ll test the setup by building the Docker image named my-app locally:

$docker build -t my-app -f Dockerfile .

Then, we’ll run the container with the waitress server in local:

$docker run -p 5002:5002 -e ENV=local my-app app.py

The -e ENV=local flag sets the environment variable inside the container, which will trigger the waitress.serve() call in the app.py.

In the terminal, you’ll find the startup message that app.py logs when ENV is set to local: ...start the operation (local)...

You can also call the endpoint created to see the results returned:

$uv run app.py --cache-clear

$curl http://localhost:5002/v1/predict-price/{STOCKCODE}

Publish the Docker Image to ECR

To publish the Docker image, we first need to configure the default AWS credentials and region:

From the AWS account console, issue an access key and check the default region.
Store them in the ~/.aws/credentials and ~/.aws/config files:

~/.aws/credentials

[default]
aws_secret_access_key=
aws_access_key_id=

~/.aws/config

[default]
region=

After the configuration, we’ll publish the Docker image to ECR.

# authenticate the docker client to ECR
$aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com

# create repository
$aws ecr create-repository --repository-name <repository-name> --region <region>

# tag the docker image
$docker tag <repository-name>:<tag> <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:<tag>

# push
$docker push <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:<tag>

Here’s what’s going on:

<region>: Your default AWS region (for example, us-east-1).
<aws_account_id>: 12-digit AWS account ID.
<repository-name>: Your desired repository name.
<tag>: Your desired tag name (for example, v1.0).
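
The registry address used in the tag and push commands follows a fixed pattern built from those four values. Here is a small sketch that composes it. The ecr_image_uri helper is hypothetical and the account ID is a placeholder, not a real account:

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Compose the registry-qualified image name used by docker tag and docker push."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# placeholder values for illustration
uri = ecr_image_uri("123456789012", "us-east-1", "my-app", "v1.0")
print(uri)
```

Building the string in one place helps keep the tag and push commands pointed at the same repository and tag.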

Now, the Docker image is stored in ECR with the tag:

Fig. Screenshot of the AWS ECR console

Step 5: Create a Lambda Function

Next, we’ll create a Lambda function.

From the Lambda console, choose:

The Container Image option,
The container image URL from the pull down list,
A function name of our choice, and
An architecture type (arm64 is recommended for a better price-performance).

Fig. Screenshot of AWS Lambda function configuration

The Lambda function my-app was successfully launched.

Connect the Lambda function to API Gateway

Next, we’ll add API Gateway as an event trigger to the Lambda function.

First, visit the API Gateway console and create REST API methods using the ARN of the Lambda function:

Fig. Screenshot of the AWS API Gateway configuration

Then, add resources to the created API gateway to create an endpoint:
API Gateway > APIs > Resources > Create Resource

Align the resource endpoint with the API endpoint defined in the app.py.
Configure CORS (for example, accept specific origins).
Deploy the resource to the stage.

Going back to the Lambda console, you’ll find the API Gateway is connected as an event trigger:
Lambda > Function > my-app (your function name)

Fig. Screenshot of the AWS Lambda dashboard

Step 6: Configure AWS Resources

Lastly, we’ll configure the related AWS resources to make the system work in production.

This process involves the following steps:

1. The IAM Role: Controls Who Can Access Resources

AWS uses IAM roles to grant temporary, secure permissions to users and services, mitigating the security risks of long-term credentials like passwords.

The IAM role leverages policies to grant access to the selected services. Policies can be managed by AWS or customized by the user as inline policies.

It is important to avoid overly permissive access rights for the IAM role.

In the Lambda function console, check the execution role:
Lambda > Function > > Permission > The execution role.
Set up the following policies to allow the Lambda’s IAM role to handle necessary operations:
- Lambda AWSLambdaExecute: Allows executing the function.
- EC2 Inline policy: Allows controlling the security group and the VPC of the Lambda function.
- ECR AmazonElasticContainerRegistryPublicFullAccess + Inline policy: Allows storing and pulling the Docker image.
- ElastiCache AmazonElastiCacheFullAccess + Inline policy: Allows storing and pulling caches.
- S3: AmazonS3ReadOnlyAccess + Inline policy: Allows reading and storing contents.

Now, the IAM role can access these resources and perform the allowed actions.
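To make the point about avoiding overly permissive access concrete, here is a minimal sketch of an inline policy built as a Python dictionary. The bucket name `my-model-artifacts` and the helper `build_s3_read_policy` are hypothetical and not taken from the project. Only the policy grammar (`Version`, `Statement`, `Effect`, `Action`, `Resource`) is the standard IAM JSON format.

```python
import json


def build_s3_read_policy(bucket_name: str) -> dict:
    """Return an inline IAM policy that only allows reading one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # read-only actions, scoped to a single bucket
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",    # bucket itself (ListBucket)
                    f"arn:aws:s3:::{bucket_name}/*",  # objects inside (GetObject)
                ],
            }
        ],
    }


# hypothetical bucket name, used for illustration only
policy = build_s3_read_policy("my-model-artifacts")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The printed JSON could be pasted into the inline policy editor of the execution role. Scoping `Resource` to one bucket, instead of `*`, is what keeps the role from being overly permissive.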

2. The Security Group: Controls Network Traffic

A security group is a virtual firewall that controls inbound and outbound network traffic for AWS resources.

Its rules are stateful (return traffic is allowed automatically) and “allow-only”: they match on protocol, port, and IP address, and any traffic not matched by a rule is denied.

Create a new security group for the Lambda function:
EC2 > Security Groups >

Now, we’ll set up the inbound and outbound traffic rules.

The inbound rules:

S3 → Lambda: Type: HTTPS / Protocol: TCP / Port range: 443 / Source: Custom*
ElastiCache → Lambda: Type: Custom TCP / Port range: 6379 / Source: Custom*

*Choose the created security group for the Lambda function as a custom source.

The outbound rules:

Lambda → Internet: Type: HTTPS / Protocol: TCP / Port range: 443 / Destination: 0.0.0.0/0
ElastiCache → Internet: Type: All Traffic / Destination: 0.0.0.0/0
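The rules above can also be written down as data. The sketch below builds them in the dictionary shape that boto3 accepts for its `IpPermissions` parameter. The security group ID and the `tcp_rule` helper are hypothetical placeholders, and no AWS call is made here.

```python
from typing import Optional


def tcp_rule(port: int,
             source_group_id: Optional[str] = None,
             cidr: Optional[str] = None) -> dict:
    """Build one TCP rule in the shape boto3 expects for `IpPermissions`."""
    rule = {"IpProtocol": "tcp", "FromPort": port, "ToPort": port}
    if source_group_id:
        # allow traffic from members of a security group
        rule["UserIdGroupPairs"] = [{"GroupId": source_group_id}]
    if cidr:
        # allow traffic to or from an IP range
        rule["IpRanges"] = [{"CidrIp": cidr}]
    return rule


LAMBDA_SG_ID = "sg-0123456789abcdef0"  # hypothetical placeholder ID

inbound_rules = [
    tcp_rule(443, source_group_id=LAMBDA_SG_ID),   # S3 -> Lambda over HTTPS
    tcp_rule(6379, source_group_id=LAMBDA_SG_ID),  # ElastiCache -> Lambda
]
outbound_rules = [
    tcp_rule(443, cidr="0.0.0.0/0"),  # Lambda -> Internet over HTTPS
    # "-1" means all protocols and ports (the "All Traffic" rule)
    {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]
```

With boto3 installed and credentials configured, these lists could be passed as `IpPermissions` to the EC2 client's `authorize_security_group_ingress` and `authorize_security_group_egress` calls instead of clicking through the console.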

3. The Virtual Private Cloud (VPC)

A Virtual Private Cloud (VPC) provides a logically isolated private network for the AWS resources, acting as our own private data center within AWS.

AWS can create a Hyperplane ENI (Elastic Network Interface) for the Lambda function and its connected resources in the subnets of the VPC.

Though it’s optional, we’ll use the VPC to connect the Lambda function to the S3 storage and ElastiCache.

This process involves:

Creating a VPC from the VPC console: VPC > Create VPC.
Creating an STS (Security Token Service) endpoint:
VPC > PrivateLink and Lattice > Endpoints > Create Endpoint >
- Type: AWS Service
- Service name: com.amazonaws..sts
- Type: Interface
- VPC: Select the VPC created earlier.
- Subnets: Select all subnets.
- Security groups: Select the security group of the Lambda function.
- Policy: Full access
- Enable DNS names

The VPC must have a dedicated STS endpoint so that the Lambda function can receive temporary credentials from STS.

Create an S3 endpoint in the VPC:
VPC > PrivateLink and Lattice > Endpoints > Create Endpoint >
- Type: AWS Service
- Service name: com.amazonaws..s3
- Type: Gateway
- VPC: Select the VPC created earlier.
- Subnets: Select all subnets.
- Security groups: Select the security group of the Lambda function.
- Policy: Full access

Lastly, check the security group of the Lambda function and ensure that its VPC ID points to the VPC created earlier: EC2 > Security Group > > VPC ID.

That’s all for the deployment flow.

We can now test the API endpoint in production. Copy the Invoke URL of the deployed API endpoint: API Gateway > APIs > Stages > Invoke URL. Then call the API endpoint and check that it returns predictions:

$curl -H 'Authorization: Bearer YOUR_API_TOKEN' -H 'Accept: application/json' \
     '/'
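The same check can be scripted. This is a minimal sketch using only the Python standard library. The URL `https://example.invalid/prod/` is a placeholder for your own Invoke URL, the helper names are hypothetical, and the response is assumed to be JSON.

```python
import json
import urllib.request


def build_request(invoke_url: str, api_token: str) -> urllib.request.Request:
    """Build a GET request with the same headers as the curl command."""
    return urllib.request.Request(
        invoke_url,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_predictions(invoke_url: str, api_token: str, timeout: float = 10.0):
    """Call the endpoint and decode the JSON body (performs a network call)."""
    request = build_request(invoke_url, api_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# placeholder values; replace with your Invoke URL and token before calling
request = build_request("https://example.invalid/prod/", "YOUR_API_TOKEN")
print(request.get_method(), request.full_url)
```

Calling `fetch_predictions(...)` with real values sends the request. Building the request separately makes it easy to inspect the headers before anything goes over the network.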

For logging and debugging, we’ll use the LiveTail of CloudWatch: CloudWatch > LiveTail.

Building a Client Application (Optional)

For full-stack deployment, we’ll build a simple React application to display the prediction using the recharts library for visualization.

Other options for quick frontend deployment include Streamlit or Gradio.

The React Application

The React application creates a web page that fetches and visualizes sales predictions from an external API, recommending an optimal price point.

The app uses useState to manage its data and state, including the selected product, the list of sales predictions, and the loading/error status.

When the user initiates a request, a useEffect hook triggers a fetch request to a Flask backend. It handles the API response as a data stream, processing it line by line to progressively update the predictions.

The AreaChart from the recharts library then visualizes this data. The X-axis represents the price and the Y-axis represents the sales. The chart updates in real-time as the data streams in. Finally, the app displays the optimal price once all the predictions are received.

App.js: (in a separate React app)

import { useState, useEffect } from "react"
import { AreaChart, Area, XAxis, YAxis, CartesianGrid, Tooltip, ResponsiveContainer, ReferenceLine } from 'recharts'


function App() {
  // state
  const [predictions, setPredictions] = useState([])
  const [start, setStart] = useState(false)
  const [isLoading, setIsLoading] = useState(false)

  // product data (productOptions is assumed to be defined or imported elsewhere in the app)
  const selectedStockcode = '85123A'
  const selectedProduct = productOptions.find(item => item.id === selectedStockcode)

  // api endpoint
  const flaskBackendUrl = "YOUR FLASK BACKEND URL"

  // create chart data to display
  const chartDataSales = predictions && predictions.length > 0
    ? predictions
      .map(item => ({
        price: item.unit_price,
        sales: item.predicted_sales,
        volume: item.unit_price !== 0 ? item.predicted_sales / item.unit_price : 0
      }))
      .sort((a, b) => a.price - b.price)
    : [...selectedProduct['histPrices']]

  // optimal price to display
  const optimalPrice = predictions.length > 0
    ? [...predictions].sort((a, b) => b.predicted_sales - a.predicted_sales)[0]['unit_price'] // copy first so state is not mutated
    : 0

  // fetch prediction results
  useEffect(() => {
    const handlePrediction = async () => {
      setIsLoading(true)
      setPredictions([])
      const errorPrices = selectedProduct['errorPrices']

      await fetch(flaskBackendUrl)
        .then(res => {
          // treat any non-200 response as a failure; the catch below falls back to errorPrices
          if (res.status !== 200) throw new Error(`Request failed with status ${res.status}`)
          return res.json()
        })
        .then(data => {
          if (data && data.length > 0) setPredictions(data)
          else setPredictions(errorPrices)
        })
        .catch(() => setPredictions(errorPrices))
        // always reset the flags, whether the request succeeded or failed
        .finally(() => { setIsLoading(false); setStart(false) })
    }

    if (start) handlePrediction()
    if (predictions && predictions.length > 0) setStart(false)
  }, [flaskBackendUrl, start])


  // render
  if (isLoading) return <Loading />
  return (
    <div>
      <ResponsiveContainer width="100%" height="100%">
        <AreaChart
          key={chartDataSales.length}
          data={[...chartDataSales].sort((a, b) => a.price - b.price)}
          margin={{ top: 10, right: 30, left: 0, bottom: 0 }}
        >
          <CartesianGrid strokeDasharray="3 3" strokeOpacity={0.6} />

          <XAxis
            dataKey="price"
            label={{ value: "Unit Price ($)", position: "insideBottom", offset: 0, fontSize: 12, marginTop: 10 }}
            tickFormatter={(tick) => `$${parseFloat(tick).toFixed(2)}`}
            tick={{ fontSize: 12 }}
            padding={{ left: 20, right: 20 }}
          />

          <YAxis
            label={{ value: "Predicted Sales ($)", angle: -90, position: "insideLeft", fontSize: 12 }}
            tick={{ fontSize: 12 }}
            tickFormatter={(tick) => `$${tick.toLocaleString()}`}
          />

          {/* tooltips with the prediction result data */}
          <Tooltip
            contentStyle={{
              borderRadius: '8px',
              padding: '10px',
              boxShadow: '0px 0px 15px rgba(0,0,0,0.5)'
            }}
            formatter={(value, name) => {
              if (name === 'sales') {
                return [`$${value.toFixed(4)}`, 'Predicted Sales']
              }
              if (name === 'volume') {
                return [`${value.toFixed(0)}`, 'Volume']
              }
              return value
            }}
            labelFormatter={(label) => `Price: $${label.toFixed(2)}`}
          />

          {/* chart area = sales */}
          <Area
            type="monotone"
            dataKey="sales"
            fillOpacity={1}
            fill="url(#colorSales)"
          />

          {/* vertical line for the optimal price */}
          {optimalPrice > 0 &&
            <ReferenceLine
              x={optimalPrice}
              strokeDasharray="4 4"
              ifOverflow="visible"
              label={{
                value: `Optimal Price: $${optimalPrice !== null && optimalPrice > 0 ? Math.ceil(optimalPrice * 10000) / 10000 : ''}`,
                position: "right",
                fontSize: 12,
                offset: 10
              }}
            />
          }
        </AreaChart>
      </ResponsiveContainer>

      {optimalPrice > 0 && <p>Optimal Price: $ {Math.ceil(optimalPrice * 10000) / 10000}</p>}

    </div>
  )
}

export default App
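The data shaping inside the component is easy to lose among the JSX, so here is the same logic as a small Python sketch: map each prediction to a chart row, sort by price, and pick the price with the highest predicted sales. The field names `unit_price` and `predicted_sales` come from the component above. The sample numbers are made up for illustration.

```python
def to_chart_data(predictions):
    """Mirror of chartDataSales: one row per prediction, sorted by price."""
    rows = [
        {
            "price": p["unit_price"],
            "sales": p["predicted_sales"],
            # volume = sales / price, guarding against a zero price
            "volume": p["predicted_sales"] / p["unit_price"] if p["unit_price"] != 0 else 0,
        }
        for p in predictions
    ]
    return sorted(rows, key=lambda row: row["price"])


def optimal_price(predictions):
    """Mirror of optimalPrice: the unit price with the highest predicted sales."""
    if not predictions:
        return 0
    best = max(predictions, key=lambda p: p["predicted_sales"])
    return best["unit_price"]


# made-up sample data, for illustration only
sample = [
    {"unit_price": 3.0, "predicted_sales": 120.0},
    {"unit_price": 2.0, "predicted_sales": 150.0},
    {"unit_price": 4.0, "predicted_sales": 90.0},
]
chart_rows = to_chart_data(sample)
best_price = optimal_price(sample)
print(chart_rows[0], best_price)
```

Neither function modifies its input, which is the same reason the component should sort a copy of `predictions` instead of the state array itself.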

Final Results

Now, the application is ready to serve.

You can explore the UI from here.

All code (backend) is available in my Github Repo.

Conclusion

Building a machine learning system requires thoughtful project scoping and architecture design.

In this article, we built a dynamic pricing system as a simple single interface on containerized serverless architecture.

Moving forward, we’d need to consider potential drawbacks of this minimal architecture:

Increase in cold start duration: The WSGI adapter awsgi layer adds a small overhead, and loading a larger container image takes longer.
Monolithic function: Adding endpoints to the Lambda function can lead to a monolithic function where an issue in one endpoint impacts others.
Less granular observability: AWS CloudWatch cannot provide individual invocation/error metrics per API endpoint without custom instrumentation.

To scale the application effectively, extracting functionality into a new microservice can be a good next step.

I’m Kuriko IWAI, and you can find more of my work and learn more about me here:

Portfolio / LinkedIn / Github

All images, unless otherwise noted, are by the author. This application uses a synthetic dataset licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This information about AWS is current as of August 2025 and is subject to change.

Deep Reinforcement Learning in Natural Language Understanding

Oyedele Tioluwani — Fri, 15 Aug 2025 15:00:27 +0000

Language is messy, subtle, and full of meaning that shifts with context. Teaching machines to truly understand it is one of the hardest problems in artificial intelligence.

That challenge is what natural language understanding (NLU) sets out to solve. From voice assistants that follow instructions to support systems that interpret user intent, NLU sits at the core of many real-world AI applications.

Most systems today are trained using labeled data and supervised techniques. But there's growing interest in something more adaptive: deep reinforcement learning (DRL). Instead of learning from fixed examples, DRL allows a model to improve through trial, error, and feedback, much like a person learning through experience.

This article looks at where DRL fits into the modern NLU landscape. We'll explore how it's being used to fine-tune responses, guide conversation flow, and align models with human values.

What we’ll cover:

Overview of Deep Reinforcement Learning
What is Natural Language Understanding (NLU)?
Challenges in NLU and How to Address Them
Where DRL Adds Value in NLU
Modern Architectures in NLU from BERT to Claude
The Niche Role of DRL in Modern NLU
Reinforcement Learning from Human Feedback (RLHF)
Ecosystem and Tools for DRL in NLP
Hands-On Demo: Simulating DRL Feedback in NLU
Case Studies of DRL in NLU
Wrapping Up

Overview of Deep Reinforcement Learning

Reinforcement learning is a subfield of machine learning inspired by behavioral psychology: an agent learns to maximize cumulative reward by taking actions in a given environment.

Traditionally, reinforcement learning techniques have been used to solve simple problems with discrete state and action spaces. But the development of deep learning has opened the door to applying these techniques to more complicated, high-dimensional environments, like computer vision, natural language processing (NLP), and robotics.

DRL uses deep neural networks to approximate complex functions that translate observations into actions, allowing agents to learn from raw sensory data. Deep neural networks, which represent knowledge in numerous layers of abstraction, can capture detailed patterns and relationships in data, allowing for more effective decision-making.

Imagine you’re playing a video game where you’re controlling a character, and your goal is to get the highest score possible. Now, when you first start playing, you might not know the best way to play, right? You might try different things like jumping, running, or shooting, and you see what works and what doesn’t.

We can think of DRL as a technique that enables computers or robots to learn how to play video games over time. The computer learns from its environment through its own experiences and mistakes. Like the player, it tries different actions and receives feedback based on its performance: a reward if it performs well, a penalty if it fails.

The computer’s job is to figure out the best possible actions to take in different situations to maximize rewards. Rather than relying on trial and error alone, DRL uses deep neural networks, which are like super-smart brains that can recognize patterns in vast amounts of data. These neural networks help the computer make better decisions in the future, and over time, it can become even better at playing the game – sometimes even better than humans.
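To see the reward loop in code, here is a minimal sketch of tabular Q-learning on a made-up five-cell corridor, where the agent starts on the left and is rewarded for reaching the rightmost cell. The environment, the reward values, and all names are assumptions chosen for illustration. A DRL system would replace the small table `q` with a deep neural network, but the update rule follows the same idea.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # index 0 moves left, index 1 moves right


def step(state, action_index):
    """Move in the corridor; reaching the goal pays +1, every other step costs 0.01."""
    next_state = min(max(state + ACTIONS[action_index], 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.01
    return next_state, reward, next_state == GOAL


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap the episode length
            # explore with probability epsilon, otherwise act greedily
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = max(range(2), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate toward reward + discounted best next value
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

After training, the greedy policy lists the preferred action for each non-goal cell. The penalty on every step is what pushes the agent toward the shortest path to the goal.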


What is Natural Language Understanding (NLU)?

NLU is a subfield of artificial intelligence (AI), and its aim is to help computers understand, interpret, and respond to human language in meaningful ways. It involves creating algorithms and models that can process and analyze text to extract meaningful information, determine the intent behind it, and provide appropriate replies.

NLU is a basic part of many AI applications, such as chatbots, virtual assistants, and personalized recommendation systems, which require the ability to interpret and respond to human language.

Its key components include:

Text processing: NLU systems must be able to process and interpret text, which includes tokenization (breaking it into words or phrases), part-of-speech tagging, and named entity recognition.
Sentiment analysis: Identifying the sentiment communicated in a piece of text (positive, negative, or neutral) is a common task in NLU.
Intent recognition: Identifying the goal or objective of a user’s input, such as booking a flight or requesting weather forecasts.
Language generation (technically part of Natural Language Generation, or NLG): While NLU focuses on understanding text, NLG is about producing coherent, contextually appropriate text. Many AI systems combine both, first interpreting the input through NLU, then generating an appropriate response using NLG.
Entity extraction: Identifying and categorizing essential details in the text, such as dates, locations, and people.
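As a toy illustration of three of these components (tokenization, intent recognition, and entity extraction), here is a rule-based sketch using only the standard library. The intent names, keyword lists, and date pattern are assumptions made up for this example. Real NLU systems learn these mappings from data instead of hard-coding them.

```python
import re

# hypothetical intents and trigger words, for illustration only
INTENT_KEYWORDS = {
    "book_flight": {"flight", "fly", "ticket"},
    "get_weather": {"weather", "forecast", "rain"},
}


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def recognize_intent(tokens):
    """Pick the intent whose keywords overlap most with the tokens."""
    token_set = set(tokens)
    scores = {
        intent: len(keywords & token_set)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best_intent = max(scores, key=scores.get)
    return best_intent if scores[best_intent] > 0 else "unknown"


def extract_dates(text):
    """Extract ISO-style dates (YYYY-MM-DD) as a stand-in for entity extraction."""
    return re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)


utterance = "Book me a flight to Lagos on 2025-09-01"
tokens = tokenize(utterance)
intent = recognize_intent(tokens)
dates = extract_dates(utterance)
print(tokens, intent, dates)
```

Keyword matching breaks down quickly on ambiguous or unseen phrasing, which is exactly the gap that the learned models discussed in the rest of this article are meant to close.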

Challenges in NLU and How to Address Them

NLU aims to help machines interpret, understand, and respond to human language in ways that make sense. While it has made great progress, there are still challenges that limit how well it works in practice.

Below are some of these challenges and how Deep Reinforcement Learning (DRL) can play a supportive role. DRL is not a replacement for large-scale pretraining or instruction tuning, but it can complement them by helping models adapt through interaction and feedback.

Ambiguity

Naturally, words can have more than one meaning, and a single sentence or phrase might be understood in different ways. This makes it hard for NLU systems to always pinpoint what the speaker or writer intends.

DRL can help reduce ambiguity by allowing models to learn from feedback. If a certain interpretation gets positive results, the model can prioritize it. If not, it can try a different approach. While this does not remove ambiguity entirely, it can improve a model’s ability to make better choices over time, especially when combined with a strong pretrained foundation.

Contextual understanding

Understanding language often depends on context such as cultural references, sarcasm, or the tone behind certain words. These are straightforward for people but challenging for machines to recognize.

By learning from interaction signals such as whether a user is satisfied with a response, DRL can help a model adapt to context more effectively. However, the core ability to understand context still comes from large-scale pretraining. DRL mainly fine-tunes and adjusts this behavior during use.

Language variation

Human language comes in many forms including different dialects, slang, colloquialisms, and regional expressions. This variety can challenge NLU systems that have not seen enough examples of these patterns during training.

With DRL, models can adapt to new language styles when exposed to them repeatedly in real-world use. This makes them more flexible and responsive, although their base understanding still relies on the diversity of the data used during pretraining.

Scalability

As text data continues to grow, NLU systems must be able to process large volumes quickly and efficiently, especially in real-time applications such as chatbots and virtual assistants.

DRL can contribute by helping models optimize certain processing steps through trial and feedback. While it will not replace architectural or infrastructure improvements, it can help fine-tune performance for specific high-traffic tasks.

Computational complexity

Training advanced NLU models is resource-intensive, which can be a challenge for mobile devices, edge computing, or other resource-limited environments.

DRL can make the learning process more efficient by reusing past experiences through techniques such as off-policy learning and reward modeling. Combined with smaller, distilled model architectures, this can make it easier to deploy capable NLU systems even with limited computing power.

Where DRL Adds Value in NLU

DRL is not a primary training method for most NLU models. Its main value comes when interaction, feedback, or rewards can be used to improve how a system behaves after it has already been pretrained. When applied selectively, DRL can help refine and personalize model performance in ways that matter for specific use cases.

Here are some areas where DRL has shown potential:

Dialogue systems
DRL can help chatbots and virtual assistants manage conversations more smoothly. It can be used to refine turn-taking, handle vague questions in a better way, or adjust responses to improve user satisfaction during longer conversations.
Text summarization
Most summarization models rely on supervised learning. DRL can be added as a fine-tuning step to focus on factors such as relevance or fluency, especially when custom reward signals are linked to specific goals or user preferences.
Response generation and language modeling
DRL can guide language generation toward outputs that are more useful, aligned with user intent, or better suited to certain tone and safety requirements.
Reward-based optimization in parsing or classification
In certain cases, DRL has been used to improve outputs based on downstream objectives such as increasing label confidence or enhancing the quality of supporting explanations, alongside accuracy.
Interactive machine translation
DRL can help translation systems adapt over time by learning from reinforcement signals like human corrections or post-editing feedback, leading to gradual improvements in quality.

In short, DRL works best as a targeted enhancement. It is not used to build general-purpose NLU systems from scratch, but it can make existing systems more adaptable, aligned, and responsive when feedback loops are part of the application.

Modern Architectures in NLU from BERT to Claude

Early NLU systems used Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), but most modern systems use transformers.

These models use a mechanism called self-attention to capture long-range dependencies. Self-attention allows each word to “attend” to every other word in the input, assigning weights that determine relevance for understanding the current word. Long-range dependencies occur when the meaning of one word depends on another far away in the text (like linking “he” to “the president” from earlier sentences). This helps maintain context over large spans of text.
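The attention computation described above can be sketched in a few lines of plain Python. This is a simplified version that assumes identity projections, meaning each token vector is used directly as its own query, key, and value. Real transformers apply learned projection matrices and use several attention heads, both of which are left out here. The two-dimensional embeddings are made up.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections."""
    dim = len(vectors[0])
    outputs, attention = [], []
    for query in vectors:
        # similarity of this token to every token, scaled by sqrt(dim)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)  # relevance weights that sum to 1
        # weighted average of all token vectors
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        attention.append(weights)
        outputs.append(mixed)
    return outputs, attention


# made-up embeddings: tokens 0 and 1 are similar, token 2 is different
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, attention = self_attention(embeddings)
for row in attention:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the input. Token 0 gives more weight to the similar token 1 than to token 2, which is the "relevance" the paragraph above describes.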

Here’s how the main types of transformer models are used today:

Encoder-only models

Examples: BERT, RoBERTa, ALBERT, DeBERTa

These models process text input and create rich contextual representations without generating new text. They are excellent for classification, entity extraction, and tasks that require understanding rather than producing language. The encoder reads the whole input and encodes it into a vector representation, which is then used by a task-specific head for predictions.

They're often fine-tuned for specific tasks and perform especially well in structured language understanding.

Encoder-decoder models

Examples: T5, FLAN-T5

These models have two components: an encoder that reads and encodes the input text, and a decoder that generates an output sequence based on that encoded representation. They are ideal for sequence-to-sequence tasks such as summarization, translation, and instruction following. The encoder captures the meaning of the input, while the decoder produces coherent output in the target form.

They’re flexible and particularly useful in multi-task learning setups.

Decoder-only models

Examples: GPT-4, Claude 3, Gemini

These models generate text one token at a time, predicting the next token based on all previous tokens in the sequence. They excel in open-ended text generation, creative writing, and reasoning tasks. Because they are trained to predict the next word given any context, they can perform many tasks simply by being prompted, without additional training.
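The one-token-at-a-time loop can be illustrated with a deliberately tiny stand-in model. The sketch below uses bigram counts, so it only looks at the single previous token, whereas a decoder-only transformer conditions on all previous tokens. The corpus and function names are made up, and only the shape of the generation loop carries over.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count which token follows which in the training sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def generate(counts, start, max_new_tokens=5):
    """Greedy decoding: repeatedly append the most likely next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        candidates = counts.get(tokens[-1])
        if not candidates:
            break  # no known continuation
        tokens.append(candidates.most_common(1)[0][0])
    return tokens


# made-up training text, for illustration only
corpus = ["the cat sat on the mat", "the cat ate the fish"]
model = train_bigram(corpus)
generated = generate(model, "the")
print(" ".join(generated))
```

Production models usually sample from the predicted distribution (controlled by settings such as temperature) instead of always taking the most likely token, which is why their output varies between runs.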

They’re typically aligned with human preferences using techniques like Reinforcement Learning from Human Feedback (RLHF).

These models are now widely used in real-world applications, such as chatbots, enterprise tools, and multilingual digital assistants, and many can handle new tasks with just a prompt, requiring no additional training.

The Niche Role of DRL in Modern NLU

DRL is not a general-purpose solution for most NLU challenges, such as handling ambiguity or understanding context. These problems are typically addressed using large-scale pretraining and supervised or instruction-based fine-tuning.

That said, DRL still plays a valuable role in specific areas where feedback and long-term optimization are useful. It is commonly applied in:

Improving dialogue strategy: DRL helps conversational agents manage turn-taking, adjust tone, and adapt to user preferences across multiple interactions.
Aligning model behavior using RLHF: Reinforcement learning from human feedback (RLHF – more on this below) uses DRL to train models that respond in ways people find more helpful, safe, or contextually appropriate.
Reward modeling for alignment and safety: DRL enables the training of reward models that guide language systems toward ethical, culturally aware, or domain-specific behavior.

Looking ahead, DRL is likely to grow in importance for applications that involve real-time interaction, long-horizon reasoning, or agent-driven workflows. For now, it serves as a targeted enhancement alongside more widely used training methods.

Reinforcement Learning from Human Feedback (RLHF)

Let’s talk a bit more about RLHF, as it’s pretty important here. It’s also currently the primary way DRL is applied in large-scale language models such as GPT‑4, Claude, and Gemini.

It works in three main steps:

Reward model training – Human annotators rank model outputs for the same prompt. These rankings are used to train a reward model that scores outputs based on how helpful, safe, or relevant they are.
Policy optimization – Using algorithms such as PPO (Proximal Policy Optimization), the base language model is fine-tuned to maximize the reward model’s score.
Iteration and safety – RLHF loops are often combined with safety-focused reward modeling, constitutional AI (following explicit guidelines for safe behavior), refusal strategies for harmful requests, and red‑teaming to probe weaknesses.
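The reward model step can be made concrete with the pairwise loss that is commonly used for learning from rankings. This is a sketch of a Bradley-Terry style formulation, loss = -log(sigmoid(r_chosen - r_rejected)), and not a description of how any particular production model is trained. The reward values below are made up.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human ranking: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # numerically stable evaluation of -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# made-up reward scores for three situations
agree = pairwise_preference_loss(2.0, 0.0)      # model agrees with the annotator
undecided = pairwise_preference_loss(0.0, 0.0)  # model scores both outputs equally
disagree = pairwise_preference_loss(0.0, 2.0)   # model prefers the rejected output
print(agree, undecided, disagree)
```

The loss is small when the reward model scores the human-preferred output higher and grows when it disagrees. Minimizing it over many ranked pairs is what teaches the reward model to imitate annotator preferences, and the policy optimization step then maximizes that learned score.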

Data‑efficient variants are increasingly common, such as offline RL, replay buffers, and leveraging implicit feedback like click‑through logs.

In practice, RLHF has significantly improved the ability of models to follow instructions, avoid harmful outputs, and align with human values.

Ecosystem and Tools for DRL in NLP

If you're looking to explore DRL in NLU, you don't have to start from scratch. There’s a solid ecosystem of tools that make it easier to test ideas, build prototypes, and fine-tune models using rewards and feedback.

Here are a few go-to libraries:

trl by Hugging Face: A lightweight framework built specifically for applying reinforcement learning to transformer models. It's widely used for RLHF, reward modeling, and steering model outputs based on human preferences.
Stable-Baselines3: A simple, well-documented library for classic DRL algorithms like PPO, A2C, and DQN. It’s great for testing DRL setups in smaller or custom environments.
RLlib (part of Ray): Designed for scaling up. If you're working on distributed training or combining DRL with larger pipelines, RLlib helps manage the complexity.

These libraries pair well with open-source large language models like LLaMA, Mistral, Gemma, and Command R+. Together, they give you everything you need to experiment with DRL-backed language systems, whether you're tuning responses in a chatbot or building a reward model for alignment.

Hands-On Demo: Simulating DRL Feedback in NLU

You don’t need a full reinforcement learning pipeline to understand reward signals. This notebook demonstrates how you can simulate preference-based feedback using GPT-3.5. Users interact with the model, provide binary feedback (good or bad), and the system logs each interaction with a corresponding reward. It mirrors the principles behind techniques like RLHF.

Setup and Authentication

First, you’ll need to install the required packages and set up your API key.

pip install openai ipywidgets pandas matplotlib

import openai
import os
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output
import matplotlib.pyplot as plt

# Load your OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY") or input("Enter your OpenAI API key: ")

What this does:

Installs and loads required libraries
Reads your OpenAI key from an environment variable or prompts for it interactively

Step 1: Generate a GPT-3.5 Response

Now, try sending a prompt and seeing what response you get:

def get_gpt_response(prompt):
    try:
        # Uses the module-level client of openai>=1.0; the older
        # openai.ChatCompletion interface was removed in that release.
        response = openai.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Error: {e}"

What this does:

Uses OpenAI’s GPT-3.5 to generate a response
Handles errors if the API call fails

Step 2: Store Feedback History

You can now track user responses and simulated reward signals like this:

history = []

This code initializes a list to store logs of each interaction.

Step 3: Run Feedback Interaction

Now you can capture the prompt, display the response, and accept feedback.

#  Main interaction logic
def run_interaction(prompt):
    clear_output(wait=True)
    response = get_gpt_response(prompt)
    display(Markdown(f"### Prompt\n`{prompt}`"))
    display(Markdown(f"### GPT-3.5 Response\n> {response}"))

    # Feedback buttons
    good_btn = widgets.Button(description="👍 Good", button_style='success')
    bad_btn = widgets.Button(description="👎 Bad", button_style='danger')

    def on_feedback(feedback):
        reward = 1 if feedback == 'good' else -1
        history.append({
            "prompt": prompt,
            "response": response,
            "feedback": feedback,
            "reward": reward
        })
        display(Markdown(
            f"**Feedback Recorded:** `{feedback}` — Reward = `{reward}`"
        ))
        display(Markdown("---"))
        display(Markdown("### Reward History"))
        df = pd.DataFrame(history)
        display(df.tail(5))
        plot_rewards()

    def on_good(_): on_feedback('good')
    def on_bad(_): on_feedback('bad')

    display(widgets.HBox([good_btn, bad_btn]))
    good_btn.on_click(on_good)
    bad_btn.on_click(on_bad)

What this does:

Shows GPT-3.5’s response to the user’s prompt
Displays feedback buttons
Logs reward and shows feedback history

Step 4: Plot Reward History

You can also visualize reward trends:

def plot_rewards():
    df = pd.DataFrame(history)
    plt.figure(figsize=(6,3))
    plt.plot(df['reward'], marker='o')
    plt.title("Reward Over Time")
    plt.xlabel("Interaction")
    plt.ylabel("Reward")
    plt.grid(True)
    plt.show()

This plots the user’s reward signals over time to simulate policy shaping.

Step 5: Build Input Interface

You can also allow users to type and submit prompts.

prompt_input = widgets.Textarea(
    placeholder="Ask something...",
    description="Prompt:",
    layout=widgets.Layout(width='100%', height='80px'),
    style={'description_width': 'initial'}
)

generate_btn = widgets.Button(
    description="Generate Response", button_style='primary'
)

output_area = widgets.Output()

def on_generate_click(_):
    with output_area:
        run_interaction(prompt_input.value)

generate_btn.on_click(on_generate_click)

display(prompt_input)
display(generate_btn)
display(output_area)

This sets up a simple form to collect prompts and connects the generate button to the main interaction logic.

This gives the output:

This demo captures the fundamentals of preference-based learning using GPT-3.5. It doesn’t update model weights but shows how feedback can be structured as a reward signal. This is the foundation of reinforcement learning in modern LLM pipelines.

Note: This demo only logs feedback. In true RLHF, a second phase fine-tunes the model weights based on it.
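As a rough idea of what a next step toward that second phase could look like, the sketch below turns logged feedback into (chosen, rejected) preference pairs, the format reward models are typically trained on. It assumes records shaped like the entries appended in Step 3 (`prompt`, `response`, `feedback`). The helper name and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import product


def build_preference_pairs(records):
    """Pair every 'good' response with every 'bad' response for the same prompt."""
    by_prompt = defaultdict(lambda: {"good": [], "bad": []})
    for record in records:
        by_prompt[record["prompt"]][record["feedback"]].append(record["response"])

    pairs = []
    for prompt, groups in by_prompt.items():
        for chosen, rejected in product(groups["good"], groups["bad"]):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# made-up log entries, for illustration only
example_history = [
    {"prompt": "p1", "response": "A", "feedback": "good", "reward": 1},
    {"prompt": "p1", "response": "B", "feedback": "bad", "reward": -1},
    {"prompt": "p2", "response": "C", "feedback": "good", "reward": 1},
]
pairs = build_preference_pairs(example_history)
print(pairs)
```

Prompts with only one kind of feedback produce no pairs, so collecting several responses per prompt matters if the logs are meant to feed a reward model later.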

A real-world example of this is InstructGPT. This is a version of OpenAI’s GPT models that’s trained to follow instructions written by people. Instead of just predicting the next word, it tries to really figure out and then do what you’ve asked, the way you asked it.

Despite having over 100× fewer parameters than the 175B GPT-3, the 1.3B InstructGPT model produced outputs that human labelers preferred, and the 175B InstructGPT model was preferred over GPT-3 in about 85% of comparisons. One of the key reasons was that it uses RLHF. This made it safer, more truthful, and better at following complex instructions, showing how reward signals like the one simulated here can greatly improve real-world model performance.

Case Studies of DRL in NLU

While DRL is not the default approach for most NLU tasks, it has shown promising results in targeted use cases, especially where learning from interaction or adapting over time adds value. Below are a few examples that illustrate how DRL can enhance language understanding in practice:

1. Welocalize & Global E-Commerce Giant – DRL-Powered Multilingual NLU

A global e-commerce platform partnered with Welocalize to launch a DRL-powered multilingual NLU system capable of interpreting customer intent across 30+ languages and domains. This system used reinforcement learning to adapt to cultural nuances and refine predictions through user interaction. The project delivered over 13 million high-quality utterances to support culturally adaptive, accurate customer support and product recommendations.

2. Reinforcement Learning with Label-Sensitive Reward (ACL 2024)

Researchers introduced a framework called RLLR (Reinforcement Learning with Label-Sensitive Reward) to improve NLU tasks like sentiment classification, topic labeling, and intent detection. By incorporating label-sensitive reward signals and optimizing via Proximal Policy Optimization (PPO), the model aligned its predictions with both rationale quality and true label accuracy.

These examples show how DRL, when paired with specific feedback signals or interactive goals, can be a useful layer on top of traditional NLU systems. Though still niche, the approach continues to evolve through research and industry experimentation.

Wrapping Up

The integration of DRL with NLU has shown promising results in niche but growing areas. Adaptive learning through various interactions and feedback allows DRL to enhance NLU models’ ability to handle ambiguity, context, and linguistic differences.

As research progresses, the link between DRL and NLU is expected to drive advancements in AI-powered language applications, making them more efficient, scalable, and context-aware.

I hope this was helpful!

Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code

Kuriko — Fri, 30 May 2025 18:21:29 +0000

The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design.

In this tutorial, I’ll show you how to build both single-layer and multi-layer perceptrons (MLPs) across three frameworks:

Custom classifier
Scikit-learn’s MLPClassifier
Keras Sequential classifier using SGD and Adam optimizers.

This will help you learn about their various use cases and how they work.

What is a Perceptron?
How to Build a Single-Layered Classifier
What is a Multi-Layer Perceptron?
How to Build Multi-Layered Perceptrons
Understanding Optimizers
How to Build an MLP Classifier with SGD Optimizer
How to Build an MLP Classifier with Adam Optimizer
Final Results: Generalization
Conclusion

Prerequisites

Mathematics (Calculus, Linear Algebra, Statistics)
Coding in Python
Basic understanding of Machine Learning concepts

What is a Perceptron?

A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.

A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.

But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.

The perceptron consists of four main parts:

Input layer: Takes the initial numerical values into the system for further processing.
Weights: Combines input values with weights (and bias terms).
Activation function: Determines whether the neuron should fire based on the threshold value.
Output layer: Produces classification result.

It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.

So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.
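To make the computation concrete, here is a minimal sketch of a single perceptron with a sigmoid activation. The inputs, weights, and bias are hypothetical values chosen only for illustration, and the 0.5 cut-off is an assumed convention for turning the probability into a yes/no decision.

```python
import numpy as np

# Hypothetical inputs, weights, and bias (illustrative values only)
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1

# Weighted sum of the inputs plus the bias
z = np.dot(w, x) + b

# Sigmoid squashes z into the open interval (0, 1)
probability = 1.0 / (1.0 + np.exp(-z))

# Assumed convention: threshold the probability at 0.5 for a yes/no decision
decision = int(probability >= 0.5)

print(f"z = {z:.2f}, probability = {probability:.3f}, decision = {decision}")
```

The weighted sum here is slightly positive, so the probability lands just above 0.5 and the perceptron says "yes".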

Applications of Perceptrons

Perceptrons are applied to tasks such as:

Image classification: Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.
Linear regression: Perceptrons can predict continuous outputs based on input features. This makes them useful for solving linear regression problems.

How the Activation Function Works

For a single perceptron used for binary classification, the most common activation function is the step function (also known as the threshold function):

$$\phi(z) = \begin{cases} 1 &\text{if } z \geq \theta \\ \\ 0 &\text{if } z < \theta \end{cases}$$

where:

ϕ(z): the output of the activation function.
z: the weighted sum of the inputs plus the bias:

$$z = \sum_{i=1}^m w_i x_i + b$$

(x_i: input values, w_i: weight associated with each input, b: bias term)

θ is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.

In that case, the formula becomes:

$$\phi(z) = \begin{cases} 1 &\text{if } z \geq 0 \\ \\ 0 &\text{if } z < 0 \end{cases}$$

When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.

This occurs when the weighted sum is greater than or equal to zero, leading the perceptron to predict that the input belongs to this class.
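The claim that the bias can take over the role of the threshold can be checked with a short sketch. The weights, inputs, and threshold value below are hypothetical. The only assumption is that the bias is set to b = −θ.

```python
import numpy as np

def step(z, theta=0.0):
    # Returns 1 when z reaches the threshold theta, otherwise 0
    return np.where(z >= theta, 1, 0)

# Hypothetical weights, inputs, and threshold (illustrative values only)
w = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0], [1.0, 1.0], [2.0, 0.5], [0.0, 0.0]])
theta = 0.5

# Version 1: no bias, explicit threshold theta
out_threshold = step(X @ w, theta=theta)

# Version 2: threshold fixed at zero, with the bias b = -theta absorbing it
b = -theta
out_bias = step(X @ w + b, theta=0.0)

print(out_threshold, out_bias)
```

Both versions produce the same labels, which is why implementations usually fix the threshold at zero and learn only the bias.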

While the step function is conceptually the original activation for a perceptron, its discontinuity at zero causes computational challenges.

In modern implementations, we can use other activation functions like the sigmoid function:

$$\sigma (z) = \frac {1} {1 + e^{-z}}$$

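The sigmoid function outputs a value between zero and one depending on the weighted sum (z). That value can be read as a probability and thresholded (for example, at 0.5) to obtain a class label.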

How the Loss Function Works

The loss function is a crucial concept in machine learning that quantifies the error or discrepancy between the model's predictions and the actual target values.

Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model's parameters in a way that minimizes this error and improves performance.

In a binary classification task, the model may adopt the hinge loss function to penalize misclassifications by incurring an additional cost for incorrect predictions:

$$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$

(h(x): the model's raw output score, y: true label encoded as −1 or +1)
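As a small worked example, the sketch below evaluates the hinge loss on four hypothetical samples. It assumes the usual convention for this loss: labels are encoded as −1 or +1 and h(x) is the raw score rather than a 0/1 label.

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Mean hinge loss; y_true must be encoded as -1 or +1."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

# Hypothetical labels and raw scores (illustrative values only)
y_true = np.array([1, -1, 1, -1])
scores = np.array([2.0, -0.5, -1.0, 3.0])

# Loss per sample: zero only when the score is on the correct side with margin >= 1
per_sample = np.maximum(0.0, 1.0 - y_true * scores)
mean_loss = hinge_loss(y_true, scores)

print(per_sample, mean_loss)
```

The first sample is correct with a comfortable margin and costs nothing. The second is correct but inside the margin, so it still pays a small penalty. The last two are misclassified and pay the most.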

How to Build a Single-Layered Classifier

Now, let’s build a simple single-layer perceptron for binary classification.

1. Custom Classifier

Initialize the classifier

We’ll first initialize the classifier with its weights, bias, number of epochs (n_iterations), and learning rate (learning_rate).

def __init__(self, learning_rate=0.01, n_iterations=1000):
    self.learning_rate = learning_rate
    self.n_iterations = n_iterations
    self.weights = None
    self.bias = None

Define the activation function

Use a step function that returns zero if input (x) ≤ 0, else 1. By default, the threshold is set to zero.

def _step_function(self, x, threshold: int = 0):
    return np.where(x > threshold, 1, 0)

Train the model

Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: weights and bias.

This process is controlled by a specified number of training epochs defined by n_iterations.

In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined learning_rate.

def fit(self, X, y):
    n_samples, n_features = X.shape

    self.weights = np.zeros(n_features)
    self.bias = 0

    for _ in range(self.n_iterations):
        for i in range(n_samples):
            # compute weighted sum (z)
            z = np.dot(X[i], self.weights) + self.bias

            # apply the activation function
            y_pred = self._step_function(z)

            # update weights and bias
            self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
            self.bias += self.learning_rate * (y[i] - y_pred)

How the weights work in the iteration loop

The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.

Its iterative update in the for loop aims to reduce classification errors such that:

$$\begin{align*} w_j &:= w_j + \Delta w_j \\ &:= w_j + \eta (y_i - \hat y_i)x_{ij} \\ &= \begin{cases} w_j &\text{(a) } y_i - \hat y_i = 0\\ w_j + \eta x_{ij} &\text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} &\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$

(w_j: j-th weight, η: learning rate, (yi−y^i): error)

This means that:

When the prediction is correct, the error is zero, so the weight is unchanged.
When the prediction is too low (y_i = 1 and ŷ_i = 0), the weight moves in the direction of the input (w_j + η x_ij), which increases the weighted sum.
When the prediction is too high (y_i = 0 and ŷ_i = 1), the weight moves in the opposite direction (w_j − η x_ij), which pulls the weighted sum lower.
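The three cases can be traced numerically. The following sketch applies the update rule to one hypothetical sample. The learning rate, sample, and starting weights are made-up values, and updated_weights is a helper introduced only for this illustration.

```python
import numpy as np

eta = 0.1                      # learning rate (hypothetical value)
x_i = np.array([2.0, -1.0])    # one training sample (hypothetical)
w = np.array([0.5, 0.5])       # current weights (hypothetical)

def updated_weights(w, x_i, y_true, y_pred, eta):
    # Perceptron rule: w := w + eta * (y - y_hat) * x
    return w + eta * (y_true - y_pred) * x_i

w_correct = updated_weights(w, x_i, y_true=1, y_pred=1, eta=eta)   # case (a)
w_too_low = updated_weights(w, x_i, y_true=1, y_pred=0, eta=eta)   # case (b)
w_too_high = updated_weights(w, x_i, y_true=0, y_pred=1, eta=eta)  # case (c)

# Weighted sums for the same sample before and after each update
print(np.dot(w, x_i), np.dot(w_too_low, x_i), np.dot(w_too_high, x_i))
```

In case (b) the weighted sum for this sample rises, and in case (c) it falls, by η times the squared norm of the input in both directions.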

How the bias terms work in the iteration loop

The bias determines the decision boundary’s intercept (position from the origin).

Similar to the weights, we adjust the bias after each training sample to position the decision boundary:

$$\begin {align*} b &:= b + \Delta b \\ & := b + \eta (y_i - \hat y_i) \\ &= \begin{cases} b &\text{(a) } y_i - \hat y_i = 0\\ b + \eta &\text{(b) } y_i - \hat y_i = 1 \\ b - \eta &\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$

This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.

Make a prediction

Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):

def predict(self, X):
    linear_output = np.dot(X, self.weights) + self.bias
    predictions = self._step_function(linear_output)
    return predictions

The entire classifier looks like this:

import numpy as np

class Perceptron:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    def _step_function(self, x, threshold: int = 0):
        return np.where(x > threshold, 1, 0)

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.n_iterations):
            for i in range(n_samples):
                linear_output = np.dot(X[i], self.weights) + self.bias
                y_pred = self._step_function(linear_output)
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
        return self

    def predict(self, X):
        linear_output = np.dot(X, self.weights) + self.bias
        y_pred = self._step_function(linear_output)
        return y_pred

Simulate with synthetic datasets

First, we generate a synthetic, linearly separable dataset using make_blobs, then train the classifier we created and compute its decision boundary.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import numpy as np

# create a mock dataset
X, y = make_blobs(n_features=2, centers=2, n_samples=1000, random_state=12)

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train the model
perceptron = Perceptron(learning_rate=0.1, n_iterations=1000).fit(X_train, y_train)

# make a prediction
y_pred_train = perceptron.predict(X_train)
y_pred_test = perceptron.predict(X_test)

# evaluate the results
acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(f"Accuracy (Train): {acc_train:.3} \nAccuracy (Test): {acc_test:.3}")

Results

The classifier generated a clear, highly accurate linear decision boundary.

Accuracy (Train): 0.981
Accuracy (Test): 0.975
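For two input features, the decision boundary is the line where the weighted sum equals zero. The sketch below shows one way to recover that line from learned parameters. The weights and bias are hypothetical stand-ins for perceptron.weights and perceptron.bias, and boundary_x2 is a helper name introduced only for this illustration.

```python
import numpy as np

def boundary_x2(x1, weights, bias):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (requires w2 != 0)."""
    w1, w2 = weights
    return -(w1 * x1 + bias) / w2

# Hypothetical parameters standing in for perceptron.weights / perceptron.bias
weights = np.array([1.0, 2.0])
bias = -4.0

x1 = np.array([0.0, 2.0, 4.0])
x2 = boundary_x2(x1, weights, bias)

# Every point on the boundary has a weighted sum of exactly zero
z_on_boundary = weights[0] * x1 + weights[1] * x2 + bias

print(x2, z_on_boundary)
```

Plotting x1 against x2 on top of the scatter plot of the samples gives the separating line.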

2. Leverage Scikit-learn’s MLPClassifier

For convenience, we’ll use scikit-learn’s built-in classifier (MLPClassifier) to build a similar, yet more robust classifier:

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(), # intentionally set empty to create a single layer perceptron
    activation='logistic', # choosing a sigmoid function as an activation function
    solver='sgd', # choosing SGD optimizer
    max_iter=1000,
    random_state=42, 
    learning_rate='constant', 
    learning_rate_init=0.1
).fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(f"MLPClassifier\nAccuracy (Train): {acc_train:.3} \nAccuracy (Test): {acc_test:.3}")

Results

The MLPClassifier generated a clear linear decision boundary with slightly better accuracy scores.

Accuracy (Train): 0.985
Accuracy (Test): 0.995

Limitations of Single-Layer Perceptrons

Now, let’s talk about the key differences between the MLPClassifier and our custom single-layer perceptron.

Unlike more general neural networks, single-layer perceptrons use a step function as their activation.

Due to its discontinuity at x=0, the step function is not differentiable over its entire domain (−∞ to ∞).

This fundamental property precludes the use of gradient-based optimization algorithms such as SGD or Adam, as these methods depend on computing gradients (the partial derivatives of the cost function).

In contrast, most neural networks employ differentiable activation functions (for example, sigmoid, ReLU) and loss functions (for example, MSE, Cross-Entropy) for effective optimization.
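The difference can be seen numerically. The sketch below estimates the derivative of both activations with a central finite difference. This approximation is used here only for illustration and is not how the optimizers compute gradients. Away from the jump at zero, the step function's derivative is zero, so it gives a gradient-based optimizer no signal to follow.

```python
import numpy as np

def step(z):
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_derivative(f, z, h=1e-6):
    # Central finite-difference approximation of f'(z)
    return (f(z + h) - f(z - h)) / (2 * h)

# Evaluation points away from the jump at z = 0
z = np.array([-2.0, -0.5, 0.5, 2.0])

step_grad = numerical_derivative(step, z)
sigmoid_grad = numerical_derivative(sigmoid, z)

print("step   :", step_grad)
print("sigmoid:", sigmoid_grad)
```

The sigmoid's derivative is positive everywhere, which is what lets gradient descent nudge the weights in a useful direction.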

Other challenges of a single-layer perceptron include:

Limited to linear separability: Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.
Lack of depth: Being single-layered, they cannot learn complex hierarchical representations.
Limited optimizer options: As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.
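The linear-separability limit is easy to reproduce with the classic AND and XOR truth tables. This is a minimal sketch that re-implements the same update rule as the custom classifier in compact form. The function names and hyperparameters are chosen only for this illustration.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=0.1, n_iterations=100):
    """Single-layer perceptron with a step activation."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(n_iterations):
        for x_i, y_i in zip(X, y):
            y_pred = 1 if np.dot(x_i, weights) + bias > 0 else 0
            weights += learning_rate * (y_i - y_pred) * x_i
            bias += learning_rate * (y_i - y_pred)
    return weights, bias

def accuracy(X, y, weights, bias):
    y_pred = np.where(X @ weights + bias > 0, 1, 0)
    return np.mean(y_pred == y)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

acc_and = accuracy(X, y_and, *train_perceptron(X, y_and))
acc_xor = accuracy(X, y_xor, *train_perceptron(X, y_xor))

print(f"AND accuracy: {acc_and:.2f}, XOR accuracy: {acc_xor:.2f}")
```

The perceptron learns AND perfectly, but no single straight line can separate the XOR points, so its accuracy on XOR stays below 100% no matter how long it trains.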

So, in the next section, you’ll learn about multi-layered perceptrons to overcome the disadvantages.

What is a Multi-Layer Perceptron?

An MLP is a class of feedforward artificial neural network that consists of at least three layers of nodes:

an input layer,
one or more hidden layers, and
an output layer.

Except for the input nodes, each node is a neuron that uses a nonlinear activation function.

MLPs are used for both classification and regression:

Classification tasks: MLPs are widely used for problems such as handwriting recognition and speech recognition.
Regression analysis: They are also applied where the relationship between input and output is complex.

How to Build Multi-Layered Perceptrons

Let’s handle a binary classification task using a standard MLP architecture.

Outline of the Project

Objective

Detect fraudulent transactions

Evaluation Metrics

Considering the cost of misclassification, we’ll prioritize improving Recall and Precision scores
Then check the accuracy of classification with Accuracy Score ((TP + TN) / (TP + TN + FP + FN))

Cost of Misclassification (from high to low):

False Negative (FN): The model incorrectly identifies a fraudulent transaction as legitimate (Missing actual fraud)
False Positive (FP): The model incorrectly identifies a legitimate transaction as fraudulent (Blocking legitimate customers.)
True Positive (TP): The model correctly identifies a fraudulent transaction as fraud.
True Negative (TN): The model correctly identifies a non-fraudulent transaction as non-fraud.
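To show how these four outcomes feed the evaluation metrics, here is a short sketch using hypothetical confusion-matrix counts. The numbers are invented for illustration and are not results from this project.

```python
# Hypothetical confusion-matrix counts (illustrative values only)
tp, tn, fp, fn = 80, 90, 10, 20

# Of the transactions flagged as fraud, how many really were fraud
precision = tp / (tp + fp)

# Of the actual fraud cases, how many the model caught
recall = tp / (tp + fn)

# Overall share of correct predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, Accuracy: {accuracy:.3f}")
```

Because false negatives are the costliest error here, recall is the metric that most directly reflects missed fraud.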

Planning an MLP Architecture

In the network, 19 input features feed into the first hidden layer’s 30 neurons, which use a ReLU activation function.

Then, their outputs are passed to the second layer, culminating in sigmoid values as the final output.

During the optimization process, we’ll let the optimizer (SGD and Adam) perform forward and backward passes to adjust parameters.

Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using image source)

Especially in deeper networks, ReLU helps mitigate the vanishing gradient problem, where gradients become extremely small as they are backpropagated from the output layer toward the earlier layers.
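A rough back-of-the-envelope sketch shows why. It looks only at the activation derivatives that get multiplied together during backpropagation and deliberately ignores the weight matrices, so treat it as a simplification rather than a full analysis. The depth of 10 layers is an arbitrary choice.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_derivative(z):
    return float(z > 0)

n_layers = 10  # arbitrary depth for illustration

# Best case for sigmoid: its derivative peaks at 0.25 when z = 0
sigmoid_factor = sigmoid_derivative(0.0) ** n_layers

# An active ReLU unit (z > 0) passes the gradient through with a factor of 1
relu_factor = relu_derivative(1.0) ** n_layers

print(f"sigmoid: {sigmoid_factor:.2e}, ReLU: {relu_factor:.1f}")
```

Even in its best case, the sigmoid chain shrinks the gradient by several orders of magnitude across ten layers, while active ReLU units leave it unchanged.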

Learn More: A Comprehensive Guide on Neural Network in Deep Learning

Preprocessing the Datasets

First, we consolidate three datasets – transaction, customer, and credit card – into a single DataFrame, independently sanitizing numerical and categorical data:

import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# download the raw data to local
import kagglehub
path = kagglehub.dataset_download("computingvictor/transactions-fraud-datasets")
dir = f'{path}/gd_card_flaud_demo'

def sanitize_df(amount_str):
    """Removes '$' and converts the string to a float."""
    if isinstance(amount_str, str):
        return float(amount_str.replace('$', ''))
    return amount_str

# load transaction data
trx_df = pd.read_csv(f'{dir}/transactions_data.csv')

# sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)
trx_df = trx_df[trx_df['errors'].isna()]
trx_df = trx_df.drop(columns=['merchant_city','merchant_state', 'date', 'mcc', 'errors'], axis='columns')
trx_df['amount'] = trx_df['amount'].apply(sanitize_df)

# merge the dataframe with fraud transaction flag.
with open(f'{dir}/train_fraud_labels.json', 'r') as fp:
    fraud_labels_json = json.load(fp=fp)

fraud_labels_dict = fraud_labels_json.get('target', {})
fraud_labels_series = pd.Series(fraud_labels_dict, name='is_fraud')
fraud_labels_series.index = fraud_labels_series.index.astype(int) # convert the datatype from string to integer
merged_df = pd.merge(trx_df, fraud_labels_series, left_on='id', right_index=True, how='left')
merged_df.fillna({'is_fraud': 'No'}, inplace=True)
merged_df['is_fraud'] = merged_df['is_fraud'].map({'Yes': 1, 'No': 0})

# load card data
card_df = pd.read_csv(f'{dir}/cards_data.csv')
card_df = card_df.drop(columns=['client_id', 'acct_open_date', 'card_number', 'expires', 'cvv'], axis='columns')
card_df['credit_limit'] = card_df['credit_limit'].apply(sanitize_df)

# merge transaction and card data
merged_df = pd.merge(left=merged_df, right=card_df, left_on='card_id', right_on='id', how='inner')
merged_df = merged_df.drop(columns=['id_y', 'card_id'], axis='columns')

# converts categorical variables into a new binary column (0 or 1)
categorical_cols = merged_df.select_dtypes(include=['object']).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=False, dtype=float) 
df = df.dropna().drop(['client_id', 'id_x'], axis=1)
print('\nDataFrame: \n', df.head(n=3))

DataFrame:

Our DataFrame shows an extremely skewed data distribution with:

Fraud samples: 1,191
Non-fraud samples: 11,477,397

For classification tasks, it's crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact on classification model performance, especially regarding the minority class.

For our data, we’ll:

split the 1,191 fraud samples into training, validation, and test sets,
add an equal number of randomly chosen non-fraud samples from the DataFrame, and
adjust split balances later if generalization challenges arise.

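# separate the two classes so each can be sampled independently
df_fraud = df[df['is_fraud'] == 1]
df_non_fraud = df[df['is_fraud'] == 0]

# define the desired size of the fraud samples for the validation and test sets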
val_size_per_class = 200
test_size_per_class = 200

# create test sets
X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=42)
X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=42)

# combine to form the balanced test set
X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_test = X_test['is_fraud']
X_test = X_test.drop('is_fraud', axis=1)

# remove sampled rows from the original dataframes to avoid data leakage
df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)


# create validation sets
X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=42)
X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=42)

# combine to form the balanced validation set
X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_val = X_val['is_fraud']
X_val = X_val.drop('is_fraud', axis=1)

# remove sampled rows from the remaining dataframes
df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)


# create training sets
min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))

X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=42)
X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=42)

X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=1, random_state=42).reset_index(drop=True)
y_train = X_train['is_fraud']
X_train = X_train.drop('is_fraud', axis=1)


print("\n--- Final Dataset Shapes and Distributions ---")
print(f"X_train shape: {X_train.shape}, y_train distribution: {np.unique(y_train, return_counts=True)}")
print(f"X_val shape: {X_val.shape}, y_val distribution: {np.unique(y_val, return_counts=True)}")
print(f"X_test shape: {X_test.shape}, y_test distribution: {np.unique(y_test, return_counts=True)}")

After the operation, we secured 1,582 training, 400 validation, and 400 test samples, each dataset maintaining a 50:50 split between fraud and non-fraud transactions:

Considering the high dimensional feature space with 19 input features, we’ll apply SMOTE to resample the training data (SMOTE should not be applied to validation or test sets to avoid data leakage):

from imblearn.over_sampling import SMOTE
from collections import Counter

train_target = 2000

smote_train = SMOTE(
  sampling_strategy={0: train_target, 1: train_target},  # increase sample size to 2,000
  random_state=12
)
X_train, y_train = smote_train.fit_resample(X_train, y_train)

print(f"\nAfter SMOTE with custom sampling_strategy (target train: {train_target}):")
print(f"X_train_oversampled shape: {X_train.shape}")
print(f"y_train_oversampled distribution: {Counter(y_train)}")

We’ve secured 4,000 training samples, maintaining a 50:50 split between fraud and non-fraud transactions:

Lastly, we’ll apply column transformers to numerical and categorical features separately.

Column transformers are advantageous in handling datasets with multiple data types, as they can apply different transformations to different subsets of columns while preventing data leakage.

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),('onehot', OneHotEncoder(handle_unknown='ignore'))])

numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)

Understanding Optimizers

In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training. Its primary role is to minimize the model’s loss function, enhancing performance.

Different optimization algorithms (optimizers) use distinct strategies to converge efficiently toward parameters that improve predictions.

In this article, we’ll use the SGD Optimizer and Adam Optimizer.

1. How a SGD (Stochastic Gradient Descent) Optimizer Works

SGD is a major optimization algorithm that computes the gradient (partial derivative of the cost function) using a small mini-batch of examples at each update step, rather than the full dataset:

$$\begin{align*} w_j &:= w_j - \eta \frac {\partial J} {\partial w_j} \\ \\ b &:= b - \eta \frac {\partial J} {\partial b} \end{align*}$$

(w: weight, b: bias, J: cost function, η: learning rate)

In binary classification, the cost function (J) is the binary cross-entropy, where the prediction ŷ is the sigmoid (σ(z)) of z, the weighted sum of the inputs plus the bias term:

$$\begin{align*} J(y, \hat y) &= -[y \log(\hat y) + (1-y)\log(1-\hat y)] \\ \\ \hat y &= \sigma (z) = \frac {1} {1+e^{-z}} \\ \\ z &= \sum_{i=1}^m w_i x_i + b \end{align*}$$
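As a minimal sketch of a single SGD update on this cost function, the code below runs one step on a hypothetical mini-batch. It relies on the standard result that, for a sigmoid output with binary cross-entropy, the gradient with respect to the weights is the batch average of (ŷ − y)·x. The data, learning rate, and zero initialization are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, y_hat):
    # Binary cross-entropy averaged over the mini-batch
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical mini-batch (illustrative values only)
X_batch = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5], [0.5, 0.5]])
y_batch = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(2)
b = 0.0
eta = 0.1  # learning rate

# Forward pass and loss before the update
y_hat = sigmoid(X_batch @ w + b)
loss_before = bce(y_batch, y_hat)

# Gradients of J with respect to w and b
error = y_hat - y_batch
grad_w = X_batch.T @ error / len(y_batch)
grad_b = np.mean(error)

# SGD update: move against the gradient
w = w - eta * grad_w
b = b - eta * grad_b

loss_after = bce(y_batch, sigmoid(X_batch @ w + b))
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

A single small step already lowers the loss on this batch. Repeating it over many shuffled mini-batches is what the training loop does.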

2. How Adam (Adaptive Moment Estimation) Optimizer Works

Adam is an optimization algorithm that computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

Adam optimizer combines the advantages of RMSprop (using squared gradients to scale the learning rate) and Momentum (using past gradients to accelerate convergence):

$$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$

where:

α: The learning rate (default is 0.001)
ϵ: A small positive constant used to avoid division by zero
m̂: First moment (mean) estimate with a bias correction, leveraging Momentum:

$$\begin{align*} \hat m_t &= \frac {m_t} {1 - \beta_1^t} \\ \\ m_t &= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{ \frac {\partial L} {\partial w_t}}_{\text{gradient}} \end{align*}$$

(β1: Decay rates, typically set to β1=0.9)

v̂: Second moment (variance) estimate with a bias correction, leveraging RMSprop:

$$\begin{align*} \hat v_t &= \frac {v_t} {1 - \beta_2^t} \\ \\ v_t &=\beta_2 v_{t-1} + (1- \beta_2) (\frac {\partial L} {\partial w_t})^2 \end {align*}$$

(β2: Decay rates, typically set to β2=0.999)

Since both m and v are initialized at zero, Adam computes the bias-corrected estimates to prevent them being biased toward zero.
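The update rule above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the code used by Keras or scikit-learn. The adam_step helper, the toy objective f(w) = Σw², and the ε value of 1e-8 are choices made only for this example.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (Momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (RMSprop part)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective f(w) = sum(w^2), whose gradient is 2w
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)

# First step: thanks to bias correction, each weight moves by about alpha
w1, m1, v1 = adam_step(w, 2 * w, m, v, t=1)
print("after one step:", w1)

# A few hundred steps with a larger (assumed) learning rate
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, alpha=0.01)
print("objective after 200 steps:", np.sum(w ** 2))
```

On the first step the corrected moments cancel the gradient's magnitude, so both weights move by roughly α regardless of how large their gradients are. This is the per-parameter adaptive scaling described above.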

Learn More: A Comprehensive Guide on Neural Network in Deep Learning

How to Build an MLP Classifier with SGD Optimizer

Custom Classifier

This process involves a forward pass and backpropagation, during which SGD computes optimal weights and biases using gradients:

for i in range(0, n_samples, self.batch_size):
    # SGD starts with randomly selected mini-batch for the epoch
    X_batch = X_shuffled[i : i + self.batch_size]
    y_batch = y_shuffled[i : i + self.batch_size]

    # A. forward pass
    activations, zs = self._forward_pass(X_batch)
    y_pred = activations[-1]  # final output of the network

    # B. backpropagation
    # 1) error signal at the output layer (sigmoid + cross-entropy)
    delta = y_pred - y_batch

    # 2) iterate backward from the output layer to the first hidden layer
    for l in range(len(self.weights) - 1, -1, -1):
        dW = np.dot(activations[l].T, delta) / X_batch.shape[0]
        db = np.sum(delta, axis=0) / X_batch.shape[0]

        # propagate the error through layer l's weights before updating them
        if l > 0:
            delta = np.dot(delta, self.weights[l].T) * self._relu_derivative(zs[l - 1])

        # 3) update this layer's parameters
        self.weights[l] -= self.learning_rate * dW
        self.biases[l] -= self.learning_rate * db

During the forward pass, the network calculates the weighted sum of the inputs plus the bias (z), applies an activation function (ReLU) to the values in each hidden layer, and then computes the predicted output (y_pred) using a sigmoid function.

def _forward_pass(self, X):
    activations = [X]
    zs = []

    # forward through hidden layers
    for i in range(len(self.weights) - 1):
        z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
        zs.append(z)
        a = self._relu(z) # using ReLU for hidden layers
        activations.append(a)

    # forward through output layer
    z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
    zs.append(z_output)

    # compute the final output using the sigmoid function (clipped for numerical stability)
    y_pred = 1 / (1 + np.exp(-np.clip(z_output, -500, 500)))
    activations.append(y_pred)
    return activations, zs

So the final classifier looks like this:

from sklearn.metrics import accuracy_score

class MLP_SGD:
    def __init__(self, hidden_layer_sizes=(10,), learning_rate=0.01, n_epochs=1000, batch_size=32):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.weights = []
        self.biases = []
        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def _sigmoid_derivative(self, x):
        s = self._sigmoid(x)
        return s * (1 - s)

    def _relu(self, x):
        return np.maximum(0, x)

    def _relu_derivative(self, x):
        return (x > 0).astype(float)

    def _initialize_parameters(self, n_features):
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [1]
        self.weights = []
        self.biases = []

        for i in range(len(layer_sizes) - 1):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+1]
            limit = np.sqrt(6 / (fan_in + fan_out))
            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((1, fan_out)))

    def _forward_pass(self, X):
        activations = [X]
        zs = []

        for i in range(len(self.weights) - 1):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        return activations, zs

    def _compute_loss(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(-1, 1)
        X = np.asarray(X)
        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() for w in self.weights])
        self.biases_history.append([b.copy() for b in self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[-1])
        self.loss_history.append(initial_loss)

        for epoch in range(self.n_epochs):
            # shuffle datasets
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            # mini-batch loop
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[-1]

                # backpropagate from the output layer to the first hidden layer
                delta = y_pred - y_batch
                for l in range(len(self.weights) - 1, -1, -1):
                    dW = np.dot(activations[l].T, delta) / X_batch.shape[0]
                    db = np.sum(delta, axis=0) / X_batch.shape[0]

                    # propagate the error through layer l's weights before updating them
                    if l > 0:
                        delta = np.dot(delta, self.weights[l].T) * self._relu_derivative(zs[l - 1])

                    self.weights[l] -= self.learning_rate * dW
                    self.biases[l] -= self.learning_rate * db

            self.weights_history.append([w.copy() for w in self.weights])
            self.biases_history.append([b.copy() for b in self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[-1])
            self.loss_history.append(epoch_loss)

            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, Loss: {epoch_loss:.4f}")
        return self

    def predict_proba(self, X):
        activations, _ = self._forward_pass(X)
        return activations[-1]

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).flatten() # for 1D output
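
The backward pass above starts from delta = y_pred - y_batch. As a quick sanity check of that starting point, here is a minimal sketch (with made-up logits and labels, and helper names that are ours rather than part of the class) comparing the analytic gradient of sigmoid plus binary cross-entropy against a finite-difference estimate:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(z, y):
    """Binary cross-entropy of sigmoid(z) against labels y, averaged over samples."""
    p = np.clip(sigmoid(z), 1e-10, 1 - 1e-10)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


def analytic_grad(z, y):
    """Gradient of the mean loss w.r.t. the logits: (sigmoid(z) - y) / n."""
    return (sigmoid(z) - y) / z.shape[0]


def numeric_grad(z, y, eps=1e-6):
    """Central finite differences, perturbing one logit at a time."""
    grad = np.zeros_like(z)
    for i in range(z.shape[0]):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (bce_loss(z + step, y) - bce_loss(z - step, y)) / (2 * eps)
    return grad


# Illustrative values only (not taken from the fraud dataset)
z = np.array([[-2.0], [0.5], [3.0]])
y = np.array([[0.0], [1.0], [1.0]])

max_gap = np.max(np.abs(analytic_grad(z, y) - numeric_grad(z, y)))
print(f"largest difference between analytic and numeric gradient: {max_gap:.2e}")
```

If the two gradients agree to within numerical noise, the y_pred - y shortcut is consistent with the loss defined in _compute_loss.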

Training / Prediction

Train the model and make a prediction using training and validation datasets:

# 1. define the model
mlp_sgd = MLP_SGD(
  hidden_layer_sizes=(30, 30, ), # 2 hidden layers with 30 neurons each
  learning_rate=0.001,           # a step size
  n_epochs=1000,                 # number of epochs
  batch_size=32                  # mini-batch size
)

# 2. train the model
mlp_sgd.fit(X_train_processed, y_train)

# 3. make a prediction with training and validation datasets
y_pred_train = mlp_sgd.predict(X_train_processed)
y_pred_val = mlp_sgd.predict(X_val_processed)

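# 4. compute evaluation metrics (validation set, plus accuracy on both splits)
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

conf_matrix = confusion_matrix(y_val, y_pred_val)
acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)
precision = precision_score(y_val, y_pred_val, pos_label=1)
recall = recall_score(y_val, y_pred_val, pos_label=1)
f1 = f1_score(y_val, y_pred_val, pos_label=1)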


print(f"\nMLP (Custom SGD) Accuracy (Train): {acc_train:.3f}")
print(f"MLP (Custom SGD) Accuracy (Validation): {acc_val:.3f}")
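
To make the metrics above more concrete, here is a small sketch that computes Precision, Recall, and F1 by hand from true-positive, false-positive, and false-negative counts. The function name and the demo labels are hypothetical and only meant for illustration:

```python
def precision_recall_f1(y_true, y_pred, pos_label=1):
    """Compute precision, recall and F1 for the positive class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 4 fraud cases, 3 of them caught, plus 1 false alarm
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

p_demo, r_demo, f1_demo = precision_recall_f1(y_true_demo, y_pred_demo)
print(p_demo, r_demo, f1_demo)
```

Recall answers "how many of the real fraud cases did we catch?", while Precision answers "how many of our fraud alerts were real?".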

Results

Recall: 0.7930 to 0.6650 (from training to validation)
Precision: 0.7790 to 0.6786 (from training to validation)

The model learned the training patterns, reaching a training Recall of 79.3% (it caught roughly 80% of the fraudulent transactions), but Recall fell by about 13 points to 66.5% on the validation set.

Loss history:

We visualized the decision boundary using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.

Leverage Scikit-learn’s MLPClassifier

We can use Scikit-learn’s MLPClassifier to define a similar model, incorporating:

Early stopping using internal validation to prevent overfitting and
L2 regularization with a small tolerance.

from sklearn.neural_network import MLPClassifier

# define a model
model_sklearn_mlp_sgd = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    activation='relu',
    solver='sgd',
    learning_rate_init=0.001,
    learning_rate='constant',
    momentum=0.9,
    nesterovs_momentum=True,
    alpha=0.00001,           # L2 regularization strength
    max_iter=3000,           # max epochs (keep it high)
    batch_size=16,           # mini-batch size
    random_state=42,
    early_stopping=True,     # apply early stopping
    n_iter_no_change=50,     # stop the iteration if internal validation score doesn't improve for 50 epochs
    validation_fraction=0.1, # proportion of training data for internal validation (default is 0.1)
    tol=1e-4,                # tolerance for optimization
    verbose=False,
)

# training
model_sklearn_mlp_sgd.fit(X_train_processed, y_train)

# make a prediction
y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)
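
The early stopping settings used above (n_iter_no_change and tol) follow a patience pattern. The sketch below shows that general idea in plain Python. It is a simplified illustration with a hypothetical function name and made-up scores, not Scikit-learn's exact internal logic:

```python
def early_stopping_epoch(val_scores, patience=50, tol=1e-4):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Training stops once the validation score has failed to beat the best
    score by more than `tol` for `patience` consecutive epochs.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return None


# Made-up validation scores that plateau after the third epoch
demo_scores = [0.60, 0.65, 0.70, 0.70, 0.69, 0.70]
stop_at = early_stopping_epoch(demo_scores, patience=3)
print(stop_at)
```

A larger patience gives the optimizer more room to escape a plateau, at the cost of extra epochs when the model has already stopped improving.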

Results

Recall: 0.7830 to 0.6200 (from training to validation)
Precision: 0.8208 to 0.6703 (from training to validation)

The model showed strong performance during training, achieving a Recall of 78.30%. Its performance declined on the validation set.

This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.

Leverage Keras Sequential Classifier

For the sequential classifier, we can further enhance the classifier by:

Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (y_train) to address dataset imbalance and promote faster convergence,
Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,
Including Precision and Recall in the model’s compilation metrics to optimize for classification performance,
Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and
Utilizing a separate validation dataset for monitoring performance during training to help detect overfitting and guide hyperparameter tuning.

import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping
from sklearn.utils import class_weight


# calculates an initial bias for the output layer 
initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])


# defines the model
model_keras_sgd = Sequential([
    Input(shape=(X_train_processed.shape[1],)), 
    Dense(30, activation='relu'),
    Dropout(0.1), # 10% of the neurons in that layer randomly dropped out
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid', # binary classification
          bias_initializer=tf.keras.initializers.Constant(initial_bias)) # to address the imbalanced datasets
])



# compiles the model with the SGD optimizer
opt = SGD(learning_rate=0.001)
model_keras_sgd.compile(
    optimizer=opt, 
    loss='binary_crossentropy',
    metrics=[
        'accuracy', # add several metrics to return
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc') 
    ]
)


# defines early stopping to prevent overfitting
early_stopping_callback = EarlyStopping(
    monitor='val_recall',  # monitor recall 
    mode='max',         # maximize recall
    patience=50,        # stop after 50 epochs without improvement in val_recall
    min_delta=1e-4,     # minimum change to be considered an improvement (tol)
    verbose=0
)


# compute the class weight
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))


# train the model
history = model_keras_sgd.fit(
    X_train_processed, y_train,
    epochs=1000,
    batch_size=32,
    validation_data=(X_val_processed, y_val), # use our external val set
    callbacks=[early_stopping_callback], # early stopping to prevent overfitting
    class_weight=class_weights_dict, # penalize misclassification of the minority class more heavily
    verbose=0
)

# evaluate
loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=0)
print(f"\n--- Keras Model Accuracy (Train) ---")
print(f"Loss: {loss_train:.4f}")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"AUC: {auc_train:.4f}")

loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=0)
print(f"\n--- Keras Model Accuracy (Validation) ---")
print(f"Loss: {loss_val:.4f}")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"AUC: {auc_val:.4f}")

# display model summary
model_keras_sgd.summary()
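
Two of the ideas used above, the log-odds bias and the balanced class weights, are easy to check by hand. Here is a minimal NumPy sketch on a made-up label vector with 5% positives (the y_demo array is hypothetical, not the fraud dataset). It shows that a sigmoid output initialized with the log-odds predicts the base rate before any training, and it reproduces the 'balanced' weighting formula n_samples / (n_classes * class_count):

```python
import numpy as np

# Made-up imbalanced labels: 95 negatives, 5 positives
y_demo = np.array([0] * 95 + [1] * 5)

# 1) log-odds initial bias, as in the model definition above
n_pos = np.sum(y_demo == 1)
n_neg = np.sum(y_demo == 0)
initial_bias = np.log(n_pos / n_neg)

# With all weights at zero, the sigmoid output depends only on the bias
start_prob = 1.0 / (1.0 + np.exp(-initial_bias))
print(f"initial predicted probability: {start_prob:.4f}")

# 2) 'balanced' class weights: n_samples / (n_classes * count_per_class)
counts = np.bincount(y_demo)
balanced = len(y_demo) / (len(counts) * counts)
print(dict(enumerate(balanced)))
```

Starting at the base rate means the first epochs are not spent just learning that positives are rare, while the class weights make each minority-class error count for more in the loss.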

Results

Recall: 0.7125 to 0.7250 (from training to validation)
Precision: 0.7607 to 0.7545 (from training to validation)

Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.

It suggests that the regularization techniques are likely effective in preventing significant overfitting.

How to Build an MLP Classifier with Adam Optimizer

Custom Classifier

The Adam update runs inside the mini-batch loop, so the weights and biases are updated after every mini-batch:

# apply Adam updates for output layer parameters
# 1) weights (w)
self.m_weights[-1] = self.beta1 * self.m_weights[-1] + (1 - self.beta1) * grad_w_output
self.v_weights[-1] = self.beta2 * self.v_weights[-1] + (1 - self.beta2) * (grad_w_output ** 2)
m_w_hat = self.m_weights[-1] / (1 - self.beta1**t)
v_w_hat = self.v_weights[-1] / (1 - self.beta2**t)
self.weights[-1] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

# 2) bias (b)
self.m_biases[-1] = self.beta1 * self.m_biases[-1] + (1 - self.beta1) * grad_b_output
self.v_biases[-1] = self.beta2 * self.v_biases[-1] + (1 - self.beta2) * (grad_b_output ** 2)
m_b_hat = self.m_biases[-1] / (1 - self.beta1**t)
v_b_hat = self.v_biases[-1] / (1 - self.beta2**t)
self.biases[-1] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
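
To see the same update rule in isolation, here is a standalone sketch that applies it to a simple quadratic instead of a network. The adam_minimize helper, the target vector, and the step count are our own illustrative choices:

```python
import numpy as np


def adam_minimize(grad_fn, w0, learning_rate=0.001, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, n_steps=5000):
    """Run Adam for n_steps using the same moment and bias-correction formulas."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)  # first moment (running mean of gradients)
    v = np.zeros_like(w)  # second moment (running mean of squared gradients)
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        w -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return w


# Minimize f(w) = ||w - target||^2, whose gradient is 2 * (w - target)
target = np.array([3.0, -1.0])
w_final = adam_minimize(lambda w: 2 * (w - target), w0=[0.0, 0.0],
                        learning_rate=0.05, n_steps=2000)
print(w_final)  # should end up close to target
```

Note how the time step t starts at 1 and increases with every update, which matches the global counter used for bias correction in the classifier.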

Following the principles of forward and backward passes, we construct the final classifier by initializing it with beta1 and beta2, built upon an MLP_SGD architecture:

class MLP_Adam:
    def __init__(self, hidden_layer_sizes=(10,), learning_rate=0.001, n_epochs=1000, batch_size=32,
                 beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        self.weights = [] 
        self.biases = []

        # Adam optimizer internal states for each parameter (weights and biases)
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def _sigmoid_derivative(self, x):
        s = self._sigmoid(x)
        return s * (1 - s)

    def _relu(self, x):
        return np.maximum(0, x)

    def _relu_derivative(self, x):
        return (x > 0).astype(float)

    def _initialize_parameters(self, n_features):
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [1]

        self.weights = []
        self.biases = []
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        for i in range(len(layer_sizes) - 1):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+1]
            limit = np.sqrt(6 / (fan_in + fan_out))

            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((1, fan_out)))

            self.m_weights.append(np.zeros((fan_in, fan_out)))
            self.v_weights.append(np.zeros((fan_in, fan_out)))
            self.m_biases.append(np.zeros((1, fan_out)))
            self.v_biases.append(np.zeros((1, fan_out)))


    def _forward_pass(self, X):
        activations = [X]
        zs = []

        for i in range(len(self.weights) - 1):
            z = np.dot(activations[-1], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[-1], self.weights[-1]) + self.biases[-1]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        return activations, zs

    def _compute_loss(self, y_true, y_pred):
        y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(-1, 1)
        X = np.asarray(X)

        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() for w in self.weights])
        self.biases_history.append([b.copy() for b in self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[-1])
        self.loss_history.append(initial_loss)

        # global time step for Adam bias correction
        t = 0

        for epoch in range(self.n_epochs):
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            # Mini-batch loop
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                t += 1

                # 1. forward pass
                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[-1] # Output of the network

                # 2. backpropagation
                delta = y_pred - y_batch
                grad_w_output = np.dot(activations[-2].T, delta) / X_batch.shape[0] # Average over batch
                grad_b_output = np.sum(delta, axis=0) / X_batch.shape[0]

                # apply Adam updates to weights
                self.m_weights[-1] = self.beta1 * self.m_weights[-1] + (1 - self.beta1) * grad_w_output
                self.v_weights[-1] = self.beta2 * self.v_weights[-1] + (1 - self.beta2) * (grad_w_output ** 2)
                m_w_hat = self.m_weights[-1] / (1 - self.beta1**t)
                v_w_hat = self.v_weights[-1] / (1 - self.beta2**t)
                self.weights[-1] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                # apply Adam updates to bias
                self.m_biases[-1] = self.beta1 * self.m_biases[-1] + (1 - self.beta1) * grad_b_output
                self.v_biases[-1] = self.beta2 * self.v_biases[-1] + (1 - self.beta2) * (grad_b_output ** 2)
                m_b_hat = self.m_biases[-1] / (1 - self.beta1**t)
                v_b_hat = self.v_biases[-1] / (1 - self.beta2**t)
                self.biases[-1] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


                # Propagate gradients backward through hidden layers
                for l in range(len(self.weights) - 2, -1, -1):
                    delta = np.dot(delta, self.weights[l+1].T) * self._relu_derivative(zs[l]) # d_activation(z)
                    grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[0]
                    grad_b_hidden = np.sum(delta, axis=0) / X_batch.shape[0]

                    # apply Adam updates to weights
                    self.m_weights[l] = self.beta1 * self.m_weights[l] + (1 - self.beta1) * grad_w_hidden
                    self.v_weights[l] = self.beta2 * self.v_weights[l] + (1 - self.beta2) * (grad_w_hidden ** 2)
                    m_w_hat = self.m_weights[l] / (1 - self.beta1**t)
                    v_w_hat = self.v_weights[l] / (1 - self.beta2**t)
                    self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                    # apply Adam updates to bias
                    self.m_biases[l] = self.beta1 * self.m_biases[l] + (1 - self.beta1) * grad_b_hidden
                    self.v_biases[l] = self.beta2 * self.v_biases[l] + (1 - self.beta2) * (grad_b_hidden ** 2)
                    m_b_hat = self.m_biases[l] / (1 - self.beta1**t)
                    v_b_hat = self.v_biases[l] / (1 - self.beta2**t)
                    self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


            self.weights_history.append([w.copy() for w in self.weights])
            self.biases_history.append([b.copy() for b in self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[-1])
            self.loss_history.append(epoch_loss)

            if (epoch + 1) % 100 == 0:
                print(f"Epoch {epoch+1}/{self.n_epochs}, Loss: {epoch_loss:.4f}")
        return self


    def predict_proba(self, X):
        activations, _ = self._forward_pass(X)
        return activations[-1]

    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int).flatten()

Training / Prediction

Train the model and make a prediction using training and validation datasets:

mlp_adam = MLP_Adam(hidden_layer_sizes=(30, 10), learning_rate=0.001, n_epochs=500, batch_size=32)
mlp_adam.fit(X_train_processed, y_train)

y_pred_train = mlp_adam.predict(X_train_processed)
y_pred_val = mlp_adam.predict(X_val_processed)

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

print(f"\nMLP (Custom Adam) Accuracy (Train): {acc_train:.3f}")
print(f"MLP (Custom Adam) Accuracy (Validation): {acc_val:.3f}")

Results

Recall: 0.9870 to 0.6150 (from training to validation)
Precision: 0.9811 to 0.6474 (from training to validation)

While the Adam optimizer fit the training data much more closely than SGD, the model exhibited significant overfitting: Recall fell by about 37 points and Precision by about 33 points between training and validation.

Loss History

We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.

Leverage Scikit-learn’s MLPClassifier

We’ve switched the optimizer from SGD to Adam, keeping all other settings constant:

model_sklearn_mlp_adam = MLPClassifier(
    hidden_layer_sizes=(30, 30),
    activation='relu',
    solver='adam',             # update the optimizer from SGD to Adam
    learning_rate_init=0.001,
    learning_rate='constant',
    alpha=0.0001,
    max_iter=3000,
    batch_size=16,
    random_state=42,
    early_stopping=True,
    n_iter_no_change=50,
    validation_fraction=0.1,
    tol=1e-4,
    verbose=False,
)

model_sklearn_mlp_adam.fit(X_train_processed, y_train)

y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)

Results

Recall: 0.8975 to 0.6400 (from training to validation)
Precision: 0.8864 to 0.6305 (from training to validation)

Despite a performance improvement compared to the SGD optimizer, the significant drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation data indicates that the model is still overfitting.

Leverage Keras Sequential Classifier

Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:

import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Input
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from sklearn.utils import class_weight


initial_bias = np.log([np.sum(y_train == 1) / np.sum(y_train == 0)])
model_keras_adam = Sequential([
    Input(shape=(X_train_processed.shape[1],)), 
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(30, activation='relu'),
    Dropout(0.1),
    Dense(1, activation='sigmoid', 
          bias_initializer=tf.keras.initializers.Constant(initial_bias))
])


optimizer_keras = Adam(learning_rate=0.001)
model_keras_adam.compile(
    optimizer=optimizer_keras, 
    loss='binary_crossentropy', 
    metrics=[
        'accuracy',
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.AUC(name='auc') 
    ]
)

early_stopping_callback = EarlyStopping(
    monitor='val_recall',
    mode='max',
    patience=50,
    min_delta=1e-4,
    verbose=0
)

class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

model_keras_adam.fit(
    X_train_processed, y_train,
    epochs=1000,
    batch_size=32,
    validation_data=(X_val_processed, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict,
    verbose=0
)


loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=0)
print(f"\n--- Keras Model Accuracy (Train) ---")
print(f"Loss: {loss_train:.4f}")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"AUC: {auc_train:.4f}")


loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=0)
print(f"\n--- Keras Model Accuracy (Validation) ---")
print(f"Loss: {loss_val:.4f}")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"AUC: {auc_val:.4f}")


model_keras_adam.summary()

Results

Recall: 0.7995 to 0.7500 (from training to validation)
Precision: 0.8409 to 0.8065 (from training to validation)

The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).

This indicates good generalization, with only minor performance degradation on unseen data.

Final Results: Generalization

Finally, we’ll evaluate the model’s ultimate performance on the test dataset, which has remained completely separate from all prior training and validation processes.

# Custom classifiers
y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

# MLPClassifier
y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

# Keras Sequential (metrics on the test set)
_, accuracy_test_sgd, precision_test_sgd, recall_test_sgd, auc_test_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=0)
_, accuracy_test_adam, precision_test_adam, recall_test_adam, auc_test_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=0)

Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an AUPRC (Area Under Precision-Recall Curve) of 0.72.
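
For readers who have not met AUPRC before, here is a minimal sketch of average precision, a common step-wise estimate of the area under the Precision-Recall curve. The function name and the demo scores are made up, and the sketch assumes no tied scores:

```python
def average_precision(y_true, scores):
    """Average precision: mean of the precision values at each true positive,
    taken in order of decreasing predicted score (assumes no tied scores)."""
    n_pos = sum(1 for label in y_true if label == 1)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive samples")

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this cut-off
    return total / n_pos


# Made-up labels and predicted probabilities
y_true_demo = [1, 0, 1, 0, 0]
scores_demo = [0.9, 0.8, 0.7, 0.4, 0.2]

ap_demo = average_precision(y_true_demo, scores_demo)
print(f"average precision: {ap_demo:.4f}")
```

Unlike accuracy, this metric ignores the many easy true negatives, which is why it is often preferred for heavily imbalanced problems such as fraud detection.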

Conclusion

In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.

Our findings underscore that effective machine learning hinges on three critical factors:

robust data preprocessing (tailored to objectives and data distribution),
judicious model selection, and
strategic framework or library choices.

Choosing the right framework

Generally speaking, choose MLPClassifier when:

You’re primarily working with tabular data,
You want to prioritize simplicity, quick iteration, and seamless integration,
You have simple, shallow architectures, and
You have a moderate dataset size (manageable on a CPU).

Choose Keras Sequential when:

You’re dealing with image, text, audio, or other sequential data,
You’re building deep learning models such as CNNs, RNNs, LSTMs,
You need fine-grained control over the model architecture, training process, or custom components,
You need to leverage GPU acceleration,
You’re planning for production deployment, and
You want to experiment with more advanced deep learning techniques.

Limitations of MLPs

While Multilayer Perceptrons (MLPs) proved valuable, their susceptibility to computational complexity and overfitting emerged as key challenges.

Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.

You can find more info about me on my Portfolio / LinkedIn / Github.

How to Create a DeepSeek R1 API in R with Plumber

Adejumo Ridwan Suleiman — Thu, 20 Feb 2025 23:22:01 +0000

To create an AI chatbot and integrate it with another platform, you have to communicate with a large language model using an API. This API receives prompts from the client and sends them to the model to generate answers.

In this tutorial, you will learn how to create such an API using the DeepSeek R1 large language model so external applications can call it. We will use the DeepSeek R1 model, available on HuggingFace, and the Plumber R package to deploy it as an API.

HuggingFace is an open source platform for building, training, and deploying machine learning models, while Plumber is an R package that exposes R code as RESTful APIs accessible to other applications through HTTP requests.

With this API, you can:

Build AI applications
Connect to external data and extract meaningful insights
Integrate into existing applications to provide customer support, create documentation, and so on.

What is the DeepSeek R1 Model?

DeepSeek R1 is the latest large language model from the Chinese company DeepSeek. It was designed to enhance the problem-solving and analytic capabilities of AI systems.

DeepSeek-R1 uses reinforcement learning and supervised fine-tuning to handle complex reasoning tasks. Unlike proprietary models, DeepSeek R1 is open-source and free to use.

Prerequisites

Sign up for a HuggingFace account if you don’t already have one
Install R and R Studio.
Install the plumber R package to build the API endpoint
Install the httr2 R package to work with HTTP requests and interact with the Hugging Face API

Step 1: Create Your Project Repository

You need to create an R project to create an API application in R. This ensures that all the files needed to keep your API working are kept together under the same directory. R Studio already has a template provided for API projects, so you can follow the steps below to create yours.

In your R Studio IDE, click on the File menu and go to New Project to open the New Project Wizard. Once in the wizard, select New Directory, then click New Plumber API Project. Inside the directory name field, give it a name (for example DeepSeek-R1 API), and then click on Create Project.

You will see a file called plumber.R with a sample API template. This is where you’ll write the code to connect to the DeepSeek R1 model on HuggingFace. Make sure that you clear this template before proceeding.

Next, go to your terminal and create a .env file. This is where you will store the Hugging Face API key.

touch .env

Create a .gitignore file and add the .env file to it. This ensures that sensitive information like access tokens and API keys are not pushed to your Git repository.

Step 2: Create a Hugging Face Access Token

We need to create an access token to connect to Hugging Face models. Go to your profile, click Settings, and click Create New Token to create your access token for the Hugging Face repository.

Copy the access token and paste it into your .env file, and give it the name HUGGINGFACE_ACCESS_TOKEN.

HUGGINGFACE_ACCESS_TOKEN=""

Next, install the dotenv package and paste the following code at the top of your plumber.R file:

# Load environment variables from .env
dotenv::load_dot_env()

dotenv::load_dot_env() loads all environment variables in the .env file, making them available to the plumber.R script.

Step 3: Build the DeepSeek API Endpoint

Now that we have our project environment set up and API token ready, we’ll write the code to build the API application by connecting to the DeepSeek R1 model on HuggingFace.

Go to the plumber.R file and load the following libraries:

library(plumber)
library(httr2)

Copy and paste the following code into plumber.R:


api_key <- Sys.getenv("HUGGINGFACE_ACCESS_TOKEN")



#* @post /deepseek_chat
function(prompt) {
  url <- "https://huggingface.co/api/inference-proxy/together/v1/chat/completions"

  # Create a request object
  req <- request(url) |>
    req_auth_bearer_token(api_key) |>
    req_body_json(list(
      model = "deepseek-ai/DeepSeek-R1",
      messages = list(
        list(role = "user", content = prompt)
      ),
      max_tokens = 500,
      stream = FALSE
    ))

  # Perform the request and capture the response
  res <- req_perform(req)

  # Parse the JSON response
  parsed_data <- res |>
    resp_body_json()

  # Extract the content from the response
  content <- parsed_data$choices
  return(content)
}

Here’s what’s going on in the above code:

Sys.getenv gets the HuggingFace access token and stores it in the variable api_key.
The url variable contains the API link to access the DeepSeek model on HuggingFace. You can get this by searching for the model name deepseek-ai/DeepSeek-R1 on HuggingFace. Go to the View Code button, and under the cURL tab, copy the API URL.
#* @post /deepseek_chat means that the endpoint accepts POST requests at the path /deepseek_chat.
This endpoint takes one argument, prompt: the text or question the user supplies.
The req object is built as a chain of operations: request() targets the url, req_auth_bearer_token() attaches the api_key, and req_body_json() passes the model properties such as the model name, role, prompt, and max_tokens.
No separate headers variable is needed, because req_auth_bearer_token() adds the authorization header required to make a request to the HuggingFace API.
The request is performed and captured in a response object res using the req_perform() function.
The res object holds a JSON body, which is parsed into an R list using the resp_body_json() function.
The choices element of parsed_data is then returned, so the application calling the API can extract the information it needs.
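
On the client side, the returned choices still need to be unpacked. The sketch below shows one way to do that from Python. It rests on two assumptions that you should verify against a real response: that the payload follows the common chat-completions shape (choices[0].message.content), and that Plumber's default JSON serializer may wrap scalars in one-element arrays. The sample payload and function names are hypothetical:

```python
def _unbox(value):
    """Unwrap a one-element list, which is how boxed JSON scalars may arrive."""
    if isinstance(value, list) and len(value) == 1:
        return value[0]
    return value


def first_reply_text(choices):
    """Return the assistant's text from the first entry in `choices`."""
    if not choices:
        raise ValueError("the response contained no choices")
    message = choices[0]["message"]
    return _unbox(message["content"])


# Hypothetical payloads, with and without boxed scalars
boxed_sample = [
    {"index": [0], "message": {"role": ["assistant"], "content": ["Hello from R1!"]}}
]
plain_sample = [
    {"index": 0, "message": {"role": "assistant", "content": "Hello from R1!"}}
]

print(first_reply_text(boxed_sample))
print(first_reply_text(plain_sample))
```

Handling both shapes keeps the client working whether or not the API is later switched to an unboxed JSON serializer.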

Step 4: Test the API Endpoint

Let’s run the API endpoint to see how the application performs. Click on Run API. This will automatically open the API documentation in your browser at a URL like http://127.0.0.1:8634/__docs__/ (the port may differ on your machine).

Click on the API endpoint dropdown, provide a prompt, and click the Execute button. You should receive a reply in a few minutes.
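
You can also call the endpoint from another program instead of the browser. Here is a minimal sketch of a Python client that uses only the standard library. It assumes the API is running locally on the port shown above and that Plumber reads prompt from a JSON request body. The function names and the timeout value are our own choices:

```python
import json
import urllib.request

# Assumed local address; replace the port with the one Plumber reports
API_URL = "http://127.0.0.1:8634/deepseek_chat"


def build_chat_request(prompt, url=API_URL):
    """Build (but do not send) a POST request carrying the prompt as JSON."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, url=API_URL, timeout=300):
    """Send the prompt to the running API and return the decoded JSON reply."""
    request = build_chat_request(prompt, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not need the server to be running
demo_request = build_chat_request("What is R?")
print(demo_request.get_method(), demo_request.full_url)

# With the API running, you could then call:
# reply = ask("What is R?")
```

The generous timeout is deliberate, since reasoning models like DeepSeek R1 can take a while to respond.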

Conclusion

With your API, you can make inferences to the Hugging Face model and build AI applications in R or other programming languages. You need to host your API to make it accessible to clients online. There are various ways of hosting an R Plumber application: you can use Docker or host it on DigitalOcean using the plumberDeploy R package. However, the simplest and easiest way is to use Posit Connect.

You can use the same approach used in this tutorial to try out other HuggingFace models, build an API to generate images or translate different languages. R Plumber is easy to use, and the documentation provides many resources.

If you are interested in model deployment using R Plumber, you can check out this article on how to deploy a Time Series model built on Prophet using R Plumber.

If you find this article interesting, please check my other articles on learndata.xyz.

Understanding Deep Learning Research Tutorial - Theory, Code, and Math

Beau Carnes — Thu, 16 Jan 2025 15:12:48 +0000

Understanding deep learning research can often feel like unraveling a dense and intricate puzzle. From decoding mathematical notation to navigating complex code bases, the process can be daunting, especially for newcomers. But with the right guidance, you can build the skills necessary to break down cutting-edge AI research and make it accessible.

We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to read, understand, and implement deep learning research. Taught by Yacine, a published researcher and machine learning practitioner, this tutorial provides a step-by-step approach to mastering essential skills like interpreting technical papers, understanding advanced mathematics, and navigating research codebases. With practical examples and a focus on recent AI papers, this course empowers you to confidently engage with the latest developments in machine learning.

What You’ll Learn in This Course

The course is structured to address the key challenges that aspiring researchers and practitioners face when diving into deep learning:

1. How to Read Research Papers

This section provides a comprehensive framework for effectively breaking down research papers:

Learn how to get external context and perform an initial casual read to grasp the paper’s main idea.
Dive deeper into filling knowledge gaps and achieving conceptual understanding.
Explore how to conduct a code deep dive and meticulously analyze the paper’s methods and results.
Develop strategies to identify and address weird gaps or inconsistencies in the paper.

2. Understanding Deep Learning Math

Many papers rely heavily on mathematical notation, which can be intimidating. Yacine simplifies this process by teaching:

Techniques to relax and approach formulas systematically.
How to translate symbols into meaning and build intuition around complex equations (e.g., the QHAdam optimizer).
Methods to summarize mathematical insights for practical application.

3. Learning Math Efficiently

Mastering the mathematics behind deep learning doesn’t have to be overwhelming. This section focuses on:

Selecting the right subfields of math to study based on your goals.
Leveraging exercise-rich resources to reinforce learning.
Using the Green-Yellow-Red method to identify strengths and weaknesses.
Fixing gaps in understanding through targeted study of theory.

4. Navigating Deep Learning Codebases

Research codebases are often sprawling and complex. Yacine walks you through:

How to map the structure of a codebase after reading the related research paper.
Strategies to run, debug, and understand the code.
Methods to elucidate unclear components and take detailed notes for clarity.

5. Segment Anything Model (SAM) Deep Dive

The course culminates in an in-depth exploration of the Segment Anything Model (SAM), a groundbreaking approach to segmentation in computer vision. You’ll learn about:

The task and testing process for SAM.
Its theoretical underpinnings and key model components, including the image encoder, prompt encoder, and mask decoder.
How the data pipeline and engine are structured.
Insights into SAM’s zero-shot results and limitations.

Why This Course

Whether you're a beginner curious about deep learning or an experienced developer aiming to engage with AI research, this course equips you with practical tools and methodologies to demystify deep learning research. By combining theory, hands-on practice, and real-world examples, Yacine ensures that you’ll walk away with actionable insights and confidence in your ability to tackle even the most complex papers.

Check out the Deep Learning Research Tutorial now on the freeCodeCamp.org YouTube channel (2-hour course).

How Do Generative Models Work in Deep Learning? Generative Models For Data Augmentation Explained

Oyedele Tioluwani — Fri, 26 Jul 2024 12:22:23 +0000

Data is at the heart of model training in the world of deep learning. The quantity and quality of training data determine the effectiveness of machine learning algorithms.

On the other hand, obtaining massive amounts of precisely categorized data is a difficult and resource-intensive operation. This is where data augmentation comes into play as an appealing solution, with the innovative potential of generative models at its forefront.

In this article, we'll look into the fundamental relevance of generative models in data augmentation for deep learning, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

What are Generative Models?

Generative models are a type of machine learning model that create new data samples that are similar to those in a given dataset. They discover hidden trends and structures in the data, allowing them to generate synthetic data points that are similar to the actual data.

These models are used in a variety of applications, such as image generation, text generation, data augmentation, and others. For example, in an image generation project, a generative model could be trained on images of cats and dogs to learn how to generate new images of cats and dogs.

They learn patterns and styles from existing data and apply that information to create similar things. It’s like your computer having a creative engine that generates fresh ideas after studying the tactics utilized in prior ones.

What is Data Augmentation?

Data augmentation is a machine learning and deep learning technique that uses various transformations and adjustments to existing data to improve the quality and quantity of a training dataset. This entails generating new data samples from existing ones to expand the size and diversity of a dataset.

The basic purpose of data augmentation is to increase a machine learning model's performance, generalization, and robustness, notably in computer vision tasks and other data-driven areas.

Data augmentation can be used to improve datasets for a wide range of machine-learning applications, such as image classification, object detection, and natural language processing. Data augmentation, for example, can be used to create synthetic photos of faces, which can then be used to train a deep-learning model to detect faces in real-world images.

Data augmentation is an important method in the data world because it addresses the underlying concerns of data quantity and quality. Access to large amounts of diverse, well-labeled data is required for building strong and accurate models in many machine learning and deep learning applications.

Data augmentation is a beneficial method for expanding limited datasets by creating new samples, which improves model generalization and performance. Furthermore, it improves the ability of machine learning algorithms to manage real-world fluctuations, resulting in more trustworthy and flexible AI systems.
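To make the idea of "transformations and adjustments to existing data" concrete, here is a minimal sketch of classic (non-generative) augmentation using NumPy. The function name `augment_image`, the 4x4 random array standing in for an image, and the noise level are all assumptions made for illustration, not part of any particular library.

```python
import numpy as np

def augment_image(image, rng, noise_std=0.05):
    """Return three simple augmented variants of one image (H x W array in [0, 1])."""
    flipped = image[:, ::-1]        # mirror the image left to right
    rotated = np.rot90(image)       # rotate by 90 degrees
    # Add small Gaussian noise, then clip back into the valid pixel range
    noisy = np.clip(image + rng.normal(0.0, noise_std, image.shape), 0.0, 1.0)
    return [flipped, rotated, noisy]

rng = np.random.default_rng(0)
image = rng.random((4, 4))          # placeholder "image", not real data
variants = augment_image(image, rng)
print(len(variants), variants[0].shape)
```

Each original sample yields three extra training samples here. Generative models go one step further by producing entirely new samples instead of transformed copies.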

Why Use Generative Models for Data Augmentation?

There are several reasons why generative models are employed for data augmentation in machine learning:

Increased Data Diversity: Generative models can help boost dataset variety, making machine learning models more resilient to real-world fluctuations. A generative model could be used to generate synthetic images of faces with various expressions, ages, and ethnicities. This could help a machine learning model learn to detect faces more reliably in a wide range of real-world scenarios.
Improved Model Generalization: Using generative models to augment data exposes machine learning models to a broader collection of data variables during training. This procedure improves the model’s ability to generalize to new, previously unknown data and its overall performance. This is particularly relevant for deep learning models, which require vast volumes of data to adequately train.
Overcoming Data Scarcity: Obtaining a large and diverse labeled dataset can be a substantial issue in many machine learning applications. By developing synthetic data, generative models can assist in managing data scarcity by lowering reliance on limited real data.
Reduction of Bias: By generating new data samples for underrepresented categories, generative models can help reduce bias in training data, improving balance in AI applications.

Generative Models for Data Augmentation

Two main types of generative models can be used for data augmentation:

Generative Adversarial Networks (GANs)
Variational AutoEncoders (VAEs)

Generative Adversarial Networks (GANs)

GANs are neural network designs that are used to create fresh data samples that are comparable to the training data. They are learning models that can construct new items that appear to be drawn from a certain dataset. GANs, for example, can be trained on a group of photos and then used to produce new images that look like they came from the original set.

Here’s a short explanation of how GANs work:

A new data sample is generated by the generator. The discriminator is provided with both new and real data samples.
The discriminator attempts to determine which samples are real and which are fabricated.
The output of the discriminator is used to update both the generator and the discriminator.

The generator creates a synthetic image by taking noisy data as input. The discriminator tries to correctly categorize both the generator’s fake image and an actual image from the training set.

The generator tries to improve its variables to produce a more convincing false image that can mislead the discriminator. The discriminator seeks to improve by adjusting its variables to distinguish between actual and fraudulent images. The two networks continue to compete and improve until the generator produces data that is similar to real data.
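The generator/discriminator loop described above can be sketched in a few lines of NumPy. This is a deliberately tiny toy, assuming one-dimensional "data" drawn from a normal distribution, a linear generator, and a logistic-regression discriminator with hand-written gradients. Real GANs use deep networks and a framework with automatic differentiation. All names (`train_toy_gan`, `a`, `b`, `w`, `c`) are hypothetical.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_toy_gan(steps=500, batch=64, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    a, b = 1.0, 0.0   # generator parameters: x_fake = a * z + b
    w, c = 0.0, 0.0   # discriminator parameters: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        x_real = rng.normal(4.0, 1.0, batch)   # toy "real" data
        z = rng.normal(0.0, 1.0, batch)        # noise fed to the generator
        x_fake = a * z + b                     # generator creates samples

        # Discriminator step: push D(real) toward 1 and D(fake) toward 0
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = np.mean((d_real - 1.0) * x_real) + np.mean(d_fake * x_fake)
        grad_c = np.mean(d_real - 1.0) + np.mean(d_fake)
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: push D(fake) toward 1 (loss = -log D(fake))
        d_fake = sigmoid(w * x_fake + c)
        grad_x = (d_fake - 1.0) * w            # gradient of the loss w.r.t. each fake sample
        a -= lr * np.mean(grad_x * z)
        b -= lr * np.mean(grad_x)
    return a, b, w, c

a, b, w, c = train_toy_gan()
print(f"generator: x = {a:.2f} * z + {b:.2f}")
```

The two updates alternate on every step, which is the "competition" described above: the discriminator's output is used to update both networks. Training like this can oscillate instead of settling, so treat the printed parameters as illustrative only.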

GANs are suitable for data augmentation because they can generate synthetic data that closely resembles genuine data samples. This is significant because machine learning algorithms learn from data, and more training data generally leads to better performance. On the other hand, collecting enough real-world data to train a machine learning model may be costly and time-consuming.

GANs can help to reduce the cost and time required to collect data by producing synthetic data that is similar to real-world data. This is especially beneficial for applications when collecting real-world data is difficult or expensive, such as medical imaging or video surveillance data.

GANs are also useful because of the variety they offer: they can produce data samples that did not exist in the original dataset, which can help make machine learning models more robust to real-world variations.

Variational AutoEncoders (VAEs)

VAEs are a type of generative model and a variation of autoencoders used in machine learning and deep learning. They can generate fresh data samples that are comparable to the data on which they were trained.

VAEs are a sort of Bayesian model, which implies that they employ probability distributions to represent the uncertainty in the data. This probabilistic treatment helps VAEs generate plausible, varied data samples.

VAEs work by learning about data representation in latent space. The latent space is a compressed representation of data that captures the data’s most relevant qualities. By sampling from the latent space and decoding the samples back into the original data space, VAEs can then be utilized to produce new data samples.

Here’s a simple illustration of how a VAE works:

As input, the encoder receives a data sample, such as an image of an animal.
The encoder generates a latent space representation of the data, which is a compressed version of the image that captures the animal's most relevant characteristics, such as shape, size, and fur color.
The latent space representation is fed into the decoder.
The decoder generates a reconstructed data sample, which is a new image of an animal that resembles the original image.

The encoder and decoder are taught to reduce the difference between the reconstructed and original images. This is accomplished by employing a loss function that compares the similarity of the two photos.
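The encode, sample, decode, and compare steps above can be sketched as a single forward pass in NumPy. This is only a sketch: the weights are random and untrained, the layers are single matrix multiplications, and every name (`encode`, `sample_latent`, `decode`, `vae_loss`) is hypothetical. Note that the standard VAE objective adds a KL-divergence term to the reconstruction loss, which keeps the latent space well structured. The paragraph above only mentions the reconstruction part.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 16, 2

# Random, untrained weights standing in for learned encoder/decoder layers
W_mu = rng.normal(0, 0.1, (input_dim, latent_dim))
W_logvar = rng.normal(0, 0.1, (input_dim, latent_dim))
W_dec = rng.normal(0, 0.1, (latent_dim, input_dim))

def encode(x):
    """Map an input to the mean and log-variance of its latent distribution."""
    return x @ W_mu, x @ W_logvar

def sample_latent(mu, logvar, rng):
    """Draw z = mu + sigma * eps (the reparameterization trick)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Map a latent point back to input space, squashed into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))

def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction error plus KL divergence to a standard normal prior."""
    recon = np.sum((x - x_hat) ** 2)
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon + kl, recon, kl

x = rng.random(input_dim)            # placeholder for a flattened image
mu, logvar = encode(x)
z = sample_latent(mu, logvar, rng)
x_hat = decode(z)
total, recon, kl = vae_loss(x, x_hat, mu, logvar)
print(f"reconstruction={recon:.3f}, kl={kl:.3f}")
```

Training would adjust the three weight matrices to lower `total`. To generate a brand-new sample afterwards, you would skip the encoder and call `decode` on a point drawn from the prior.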

VAEs are a strong generative modeling tool that can be used for image production, text generation, data compression, and data denoising. They provide a probabilistic framework for modeling and producing complex data distributions while preserving a structured latent space for data production and interpolation.

The ability to generate data that is similar to real-world data also makes VAEs well suited to data augmentation. When a VAE is well trained, the augmented data it produces stays aligned with the underlying data distribution, which is required for effective data augmentation.

Each point in the structured latent space of VAEs represents a meaningful data variation. This enables controlled data creation. Users can build new data instances with specific attributes or variants by sampling different places in the latent space, making it suited for targeted data augmentation.

VAEs can address data scarcity issues by generating synthetic data when real data is limited. This is particularly valuable in scenarios where collecting more real data is impractical or expensive.

As VAEs continue to improve, they will likely play an increasingly important role in training machine learning models.

Conclusion

Generative models have played a significant part in the practice of data augmentation in the machine-learning field.

For instance, GANs have been used to generate synthetic images of faces, which have been used to train machine learning models to detect faces in real-world images.

VAEs were also utilized to create synthetic images of automobiles that were then used to train machine-learning models to recognize autos in real-world photographs.

These are all real-life applications of generative models in data augmentation.

I hope this article was helpful.

How to Build an Interpretable Artificial Intelligence Model – Simple Python Code Example

Tiago Capelo Monteiro — Tue, 23 Jul 2024 22:11:31 +0000

Artificial Intelligence is being used everywhere these days. And many of the groundbreaking applications come from Machine Learning, a subfield of AI.

Within Machine Learning, a field called Deep Learning represents one of the main areas of research. It is from Deep Learning that most new, truly effective AI systems are born.

But typically, the AI systems born from Deep Learning are quite narrow, focused systems. They can outperform humans in one very specific area for which they were made.

Because of this, many new developments in AI come from specialized systems or a combination of systems working together.

One of the bigger problems in the field of Deep Learning models is their lack of interpretability. Interpretability means understanding how decisions are made.

This is a big problem that has its own field, called explainable AI. This is the field within AI that focuses on making an AI model's decisions more easily understandable.

Here's what we'll cover in this article:

Artificial Intelligence and the Rise of Deep Learning
A big problem in deep learning: Lack of interpretability
A solution to interpretability: Glass Box models
Code example: Solving the problem with Explainable AI
Conclusion: KAN (Kolmogorov–Arnold Networks)

This article won't cover dropout or other regularization techniques, hyperparameter optimization, complex architectures like CNNs, or detailed differences in gradient descent variants.

We'll just discuss the basics of deep learning, the lack of interpretability problem, and a code example.

Artificial Intelligence and the Rise of Deep Learning

Photo by Tara Winstead

What is Deep Learning in Artificial Intelligence?

Deep Learning is a subfield of artificial intelligence. It uses neural networks to process complex patterns, just like the strategies a sports team uses to win a match.

The bigger the neural network, the more capable it is of doing awesome things – like ChatGPT, for example, which uses natural language processing to answer questions and interact with users.

To truly understand the basics of neural networks – what every single AI model has in common that enables it to work – we need to understand activation layers.

Deep Learning = Training Neural Networks

Simple neural network

At the core of deep learning is the training of neural networks.

That means basically using data to get the right values of each neuron to be able to predict what we want.

Neural networks are made of neurons organized in layers. Each layer extracts unique features from the data.

This layered structure allows deep learning models to analyze and interpret complex data.
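To show what "using data to get the right values of each neuron" means in practice, here is a minimal sketch that trains a single neuron with gradient descent in NumPy. The task (learning logical OR), the learning rate, and the number of steps are assumptions chosen to keep the example tiny. A real network stacks many such neurons in layers.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def log_loss(p, y):
    eps = 1e-12  # avoids log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Four training examples: the neuron should learn logical OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

w = np.zeros(2)   # one weight per input
b = 0.0           # bias
lr = 0.5          # learning rate

initial_loss = log_loss(sigmoid(X @ w + b), y)

for _ in range(2000):
    p = sigmoid(X @ w + b)           # forward pass: current predictions
    error = p - y                    # gradient of the loss w.r.t. the pre-activation
    w -= lr * (X.T @ error) / len(y) # nudge weights against the gradient
    b -= lr * error.mean()

final_loss = log_loss(sigmoid(X @ w + b), y)
predictions = (sigmoid(X @ w + b) > 0.5).astype(int)
print(initial_loss, final_loss, predictions)
```

The loop is the whole idea of training: predict, measure the error, and adjust the weights slightly in the direction that reduces it.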

A Big Problem in Deep Learning: Lack of Interpretability

Photo by Koshevaya_k

Deep Learning has revolutionized many fields by achieving great results in very complex tasks.

However, there is a big problem: the lack of interpretability.

While it is true that neural networks can perform very well, we don't understand internally how they achieve such great results.

In other words, we know they do very well with the tasks we give them, but not how they do them in detail.

It is important to know how the model thinks in fields such as healthcare and autonomous driving.

By understanding how a model thinks, we can be more confident in its reliability in certain critical areas.

So models that work in fields with strict regulations are more transparent to the law and build more trust when they're interpretable.

Models that allow interpretability are called glass box models. On the other hand, models that do not have this capability (that is, most of them) are called black box models.

A Solution to Interpretability: Glass Box Models

Glass Box Models

Photo by Pixabay: https://www.pexels.com/photo/fluid-pouring-in-pint-glass-416528/

Glass box models are machine learning models designed to be easily understood by humans.

Glass box models provide clear insights into how they make their decisions.

This transparency in the decision-making process is important for trust, compliance, and improvement.

Below, we'll see a code example of an AI model trained on a breast cancer dataset that achieves an accuracy of 97%.

We'll also find, based on the characteristics of the data, which were of greater importance in predicting the cancer.

Black Box Models

In addition to glass box models, there are also black box models.

These models are essentially different neural network architectures, each suited to different kinds of data. Some examples are:

CNN (Convolutional Neural Networks): Designed specifically for image classification and interpretation.
RNN (Recurrent Neural Networks) and LSTM (Long Short Term Memory): Primarily used for sequential data such as text and time series. In 2017, the paper "Attention Is All You Need" introduced the transformer architecture, which has since largely replaced them for text.
Transformer-based architectures: Revolutionized AI in 2017 due to their ability to handle sequential data more efficiently. RNN and LSTM have limited capabilities in this regard.

Nowadays, most models that process text are transformer-based models.

For instance, in ChatGPT, GPT stands for Generative Pre-trained Transformer, indicating a transformer neural network architecture that generates text.

All these models—CNN, RNN, LSTM and Transformers—are examples of narrow artificial intelligence (AI).

Achieving general intelligence, in my view, involves combining many of these narrow AI models to mimic human behavior.

Code Example: Solving the Problem with Explainable AI

Photo by Chokniti Khongchum: https://www.pexels.com/photo/person-holding-laboratory-flask-2280571/

In this code example, we will create an interpretable AI model based on 30 characteristics.

We'll also learn which five characteristics matter most in the detection of breast cancer, based on this dataset.

We will use a machine learning glass box model called the Explainable Boosting Machine.

Here is the full code, which we'll then go through block by block:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from interpret.glassbox import ExplainableBoostingClassifier
import matplotlib.pyplot as plt
import numpy as np

# Load a sample dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an EBM model
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# Make predictions
y_pred = ebm.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

# Interpret the model
ebm_global = ebm.explain_global(name='EBM')

# Extract feature importances
feature_names = ebm_global.data()['names']
importances = ebm_global.data()['scores']

# Sort features by importance
sorted_idx = np.argsort(importances)
sorted_feature_names = np.array(feature_names)[sorted_idx]
sorted_importances = np.array(importances)[sorted_idx]

# Increase spacing between the feature names
y_positions = np.arange(len(sorted_feature_names)) * 1.5  # Increase multiplier for more space

# Plot feature importances
plt.figure(figsize=(12, 14))  # Increase figure height if necessary
plt.barh(y_positions, sorted_importances, color='skyblue', align='center')
plt.yticks(y_positions, sorted_feature_names)
plt.xlabel('Importance')
plt.title('Feature Importances from Explainable Boosting Classifier')
plt.gca().invert_yaxis()

# Adjust spacing
plt.subplots_adjust(left=0.3, right=0.95, top=0.95, bottom=0.08)  # Fine-tune the margins if needed

plt.show()

Full Code

Alright, now let's break it down.

Importing Libraries

First, we'll import the libraries we need for our example. You can do that with the following code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from interpret.glassbox import ExplainableBoostingClassifier
import matplotlib.pyplot as plt
import numpy as np

Importing libraries

These are the libraries we are going to use:

Pandas: This is a Python library used for data manipulation and analysis.
sklearn: The scikit-learn library is used to implement machine learning algorithms. We're importing it to load and split the data and to evaluate the model.
Interpret: The InterpretML Python library (imported as interpret) provides the Explainable Boosting Machine model we'll use.
Matplotlib: A Python library used to make graphs in Python.
Numpy: Used for very fast numerical computations.

Loading, Preparing the Dataset, and Splitting the Data

# Load a sample dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Loading, Preparing the Dataset, and Splitting the Data

First, we load a sample dataset: We import the breast cancer dataset that ships with scikit-learn, using load_breast_cancer.

Next, we prepare the data: The features (data points) from the dataset are organized into a table format, where each column is labeled with a specific feature name. The target outcomes (labels) from the dataset are stored separately.

Then we split the data into training and testing sets: The data is divided into two parts: one for training the model and one for testing the model. 80% of the data is used for training, while 20% is reserved for testing.

A specific random seed is set to ensure that the data split is consistent every time the code is run.
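To see why a fixed seed makes the split reproducible, here is a small sketch of what a seeded 80/20 split does conceptually. The helper `seeded_split` is hypothetical and is not how scikit-learn implements train_test_split. It only illustrates the idea of shuffling indices with a seeded random generator.

```python
import numpy as np

def seeded_split(n_samples, test_size=0.2, seed=42):
    """Shuffle sample indices with a fixed seed and cut off a test portion."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    return shuffled[n_test:], shuffled[:n_test]   # (train indices, test indices)

train_idx, test_idx = seeded_split(10)
train_again, test_again = seeded_split(10)

print(train_idx, test_idx)
print(np.array_equal(test_idx, test_again))  # same seed, same split
```

Changing the seed gives a different, but equally valid, split. Keeping it fixed means the accuracy you report can be reproduced by someone else running the same code.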

Quick note: In real life, the dataset is pre-processed with data manipulation techniques to make the AI model faster and to make it smaller.

Training the Model, Making Predictions, and Evaluating the Model

# Train an EBM model
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# Make predictions
y_pred = ebm.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Training the Model, Making Predictions and Evaluating the Model

First, we train an EBM model: We initialize an Explainable Boosting Machine model and then train it using the training data. In this step, with the data we have, we create the model.

This way, with one line of code, we create the AI model based on the dataset that will predict breast cancer.

Then we make our predictions: The trained EBM model is used to make predictions on the test data. Next, we calculate and print the accuracy of the model's predictions.

Interpreting the Model, Extracting, and Sorting Feature Importances

# Interpret the model
ebm_global = ebm.explain_global(name='EBM')

# Extract feature importances
feature_names = ebm_global.data()['names']
importances = ebm_global.data()['scores']

# Sort features by importance
sorted_idx = np.argsort(importances)
sorted_feature_names = np.array(feature_names)[sorted_idx]
sorted_importances = np.array(importances)[sorted_idx]

Interpreting the Model, Extracting and Sorting Feature Importances

At this point, we need to interpret the model: The global explanation of the trained Explainable Boosting Machine (EBM) model is obtained, providing an overview of how the model makes decisions.

For this model, the printed accuracy is approximately 0.9737, which means the model classified about 97% of the test samples correctly.

Of course, this only applies to the breast cancer data from this dataset – not for every single case of breast cancer detection. Since this is a sample, the dataset does not represent the full population of people seeking to detect breast cancer.

Quick note: In the real world, for classification, we'd often report the F1 score rather than accuracy alone when evaluating a model, because it takes both precision and recall into account.
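To illustrate the difference, here is a minimal sketch that computes accuracy, precision, recall, and F1 by hand on a small made-up set of labels. The labels are invented for illustration and are not from the breast cancer dataset. In practice you would likely call f1_score from sklearn.metrics rather than write this yourself.

```python
def classification_scores(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced toy labels: only two positives, and the model misses one of them
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

scores = classification_scores(y_true, y_pred)
print(scores)
```

Here accuracy is 0.9 even though the model found only half of the positive cases, while the F1 score of about 0.67 reflects that miss. This is why F1 is often preferred when one class is rare.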

Next, we extract feature importances: We extract the names and corresponding importance scores of the features used by the model from the global explanation.

Then we sort the features by importance: The features are sorted based on their importance scores, resulting in a list of feature names and their respective importance scores ordered from least to most important.
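If you only want the few most important features instead of the full sorted list, one option is to reverse the sorted order and keep the first k entries. The sketch below uses made-up feature names and scores as placeholders. With the real model, you would pass in the names and scores extracted from the global explanation.

```python
import numpy as np

def top_k_features(names, scores, k=5):
    """Return the k (name, score) pairs with the highest importance, best first."""
    order = np.argsort(scores)[::-1][:k]   # indices from most to least important
    return [(names[i], float(scores[i])) for i in order]

# Placeholder values, not results from the breast cancer model
names = ["feature_a", "feature_b", "feature_c", "feature_d"]
scores = [0.10, 0.45, 0.05, 0.30]

top_two = top_k_features(names, scores, k=2)
print(top_two)  # [('feature_b', 0.45), ('feature_d', 0.3)]
```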

Plotting Feature Importances

# Increase spacing between the feature names
y_positions = np.arange(len(sorted_feature_names)) * 1.5  # Increase multiplier for more space

# Plot feature importances
plt.figure(figsize=(12, 14))  # Increase figure height if necessary
plt.barh(y_positions, sorted_importances, color='skyblue', align='center')
plt.yticks(y_positions, sorted_feature_names)
plt.xlabel('Importance')
plt.title('Feature Importances from Explainable Boosting Classifier')
plt.gca().invert_yaxis()

# Adjust spacing
plt.subplots_adjust(left=0.3, right=0.95, top=0.95, bottom=0.08)  # Fine-tune the margins if needed

plt.show()

Plotting Feature Importances

Now we need to increase the spacing between feature names: The positions of the feature names on the y-axis are adjusted to increase the spacing between them.

Then we plot feature importances: A horizontal bar plot is created to visualize the feature importances. The plot's size is set to ensure it is clear and readable.

The bars represent the importance scores of the features, and the feature names are displayed along the y-axis.

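The plot's x-axis is labeled "Importance," and the title "Feature Importances from Explainable Boosting Classifier" is added. The y-axis is then inverted. Note that because the features were sorted in ascending order, inverting the axis places the least important feature at the top and the most important at the bottom. To put the most important features at the top, sort in descending order instead (or drop the inversion).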

Then we adjust the spacing: The margins around the plot are fine-tuned to ensure proper spacing and a neat appearance.

Finally, we display the plot: The plot is displayed to visualize the feature importances effectively.

The final result should look like this:

Features importance graph

So, using an interpretable artificial intelligence model with an accuracy of 97%, we can conclude that the five most important factors in detecting breast tumors are:

Worst concave points
Worst texture
Worst area
Mean concave points
Area error & worst concavity

Again, this is according to the provided dataset.

So according to the population that this sample dataset represents, we can conclude in a data-driven way that these factors are key indicators for breast cancer tumor detection.

In other words, because the model is interpretable, it gives us clear insight into which features matter most for its predictions.

Conclusion: KAN (Kolmogorov–Arnold Networks)

Thanks to explainable AI, we can study populations using new data-driven methods.

Instead of only using traditional statistics, surveys, and manual data analysis, we can draw conclusions more accurately using an AI programming library and a database or Excel file.

But this is not the only way to have models built with explainable AI.

In April 2024, a paper called KAN: Kolmogorov–Arnold Networks was published that might shake up the field even more.

Kolmogorov–Arnold Networks (KANs) promise to be more accurate and easier to interpret than traditional neural networks.

They are also easier to visualize and interact with. So we'll see what happens with them.

You can find the full code here:

https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code

How to Build a Quantum Artificial Intelligence Model – With Python Code Examples

Tiago Capelo Monteiro — Tue, 23 Jul 2024 18:28:43 +0000

Machine learning (ML) is one of the most important subareas of AI used in building great AI systems.

In ML, deep learning is a narrow area focused solely on neural networks. Through the field of deep learning, systems like ChatGPT and many other AI models can be created. In other words, ChatGPT is just a giant system based on neural networks.

However, there is a big problem with deep learning: computational efficiency. Creating big and effective AI systems with neural networks often requires a lot of energy, which is expensive.

So, the more efficient the hardware is, the better. There are many solutions to solve this problem, one of which is quantum computing.

This article hopes to show, in plain English, the connection between quantum computing and artificial intelligence.

We'll talk about these:

Artificial Intelligence and the Rise of Deep Learning
A Big Problem in Deep Learning: Computational Efficiency
A Solution: Quantum Computing
Code Example: A Quantum AI Model for Quantum Chemistry
Conclusion: Limitations of Quantum Computing and Development

Artificial Intelligence and the Rise of Deep Learning

What is Deep Learning in Artificial Intelligence?

Deep learning is a subfield of artificial intelligence. It uses neural networks to process complex patterns, just like the strategies a sports team uses to win a match.

The bigger the neural network, the more capable it is of doing awesome things – like ChatGPT, for example, which uses natural language processing to answer questions and interact with users.

To truly understand the basics of neural networks – what every single AI model has in common that enables it to work – we need to understand activation layers.

Deep Learning = Training Neural Networks

Simple neural network

At the core of deep learning is the training of neural networks. That means using data to get the right values for each neuron to be able to predict what we want.

Neural networks are made of neurons organized in layers. Each layer extracts unique features from the data.

This layered structure allows deep learning models to analyze and interpret complex data.

A Big Problem in Deep learning: Computational Efficiency

Photo by Brett Sayles: https://www.pexels.com/photo/black-hardwares-on-data-server-room-4597280/

Deep learning powers much of the transformation AI is bringing to society. However, it comes with a big problem: computational efficiency.

Training deep learning AI systems requires massive amounts of data and computational power. This can take minutes to weeks and in the process, it consumes a lot of energy and computational resources.

There are many solutions to this problem, such as better algorithmic efficiency.

In large language models, this has been the focus of much AI research: making smaller models perform as well as larger ones.

Another solution, besides algorithmic efficiency, is better computational efficiency. Quantum computing is one of the solutions related to better computational efficiency.

A Solution: Quantum Computing

Quantum computing is a promising solution to the computational efficiency problem in deep learning.

While normal computers work with bits (either 0 or 1), quantum computers work with qubits, which can be 0 and 1 at the same time.

With the qubits representing 0 and 1 at the same time, it is possible to process many possibilities simultaneously, thanks to a property called superposition in quantum physics.
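Superposition can be illustrated on a normal computer by simulating a single qubit as a two-element vector. This is a minimal NumPy sketch, assuming the standard convention that the state |0⟩ is the vector [1, 0] and that a Hadamard gate creates an equal superposition. It simulates the math only and does not run on quantum hardware.

```python
import numpy as np

ket0 = np.array([1.0, 0.0])                     # qubit that is definitely 0
hadamard = np.array([[1.0, 1.0],
                     [1.0, -1.0]]) / np.sqrt(2) # gate that creates superposition

state = hadamard @ ket0                         # qubit is now "0 and 1 at once"
probabilities = np.abs(state) ** 2              # chance of measuring 0 or 1

print(state)
print(probabilities)

# A register of n qubits needs 2**n amplitudes to describe classically
for n in (1, 10, 30):
    print(n, "qubits ->", 2 ** n, "amplitudes")
```

Measuring the qubit gives 0 or 1 with equal probability. The last loop hints at why simulating many qubits on a classical machine gets expensive quickly: the number of amplitudes doubles with every added qubit.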

This makes the quantum computers, for certain tasks, far more efficient than normal computers.

This also makes it possible to design quantum algorithms that are more efficient than their classical counterparts for those tasks, which could reduce the energy consumed when creating AI models.

Why Are Quantum Computers Not So Widely Used?

The problem with quantum computation is that there isn't a good, cheap physical representation of qubits.

Bits are created and managed with logic gates made from tiny transistors, which can be easily created by the billions.

Qubits can be built from superconducting circuits, trapped ions, or topological materials, all of which are very expensive.

This is the biggest problem in quantum computation. However, IBM, Amazon, and other cloud providers let people run code on their quantum computers.

Code Example: A Quantum AI Model for Quantum Chemistry

Photo by Pixabay: https://www.pexels.com/photo/two-clear-glass-jars-beside-several-flasks-248152/

In this code example, we'll solve a quantum chemistry problem:

What is the lowest energy level of the H₂ molecule using quantum computing?

Before understanding the problem at hand, let's discuss quantum chemistry.

What is Quantum Chemistry?

Quantum chemistry is a field of science that looks at how electrons behave in atoms and molecules.

It is about using quantum physics to understand how electrons, atoms, molecules and many more tiny particles interact and form different chemical substances.

The Problem We Want to Solve

We want to find the "ground state energy" of the H₂ molecule.

H₂ is molecular hydrogen, the form hydrogen takes as a gas. Hydrogen is present in:

Water
Organic compounds
Stars

Actually, life on Earth would not be possible without it.

By finding the "ground state energy," which is the lowest possible energy that the molecule can have, we can know its most stable form and properties.

This allows scientists to better understand chemical reactions related to H₂.

With classical computers, this problem can be very complex due to a huge number of possibilities and intricate interactions.

With quantum computers, qubits are a natural representation of electrons, so they can directly simulate the behavior of electrons in molecules.

Approximating with the Variational Quantum Eigensolver (VQE)

The Variational Quantum Eigensolver (VQE) is a hybrid algorithm that leverages both quantum and classical computing.

In this example, the VQE algorithm is used to find the ground state energy of a simple H₂ molecule.

The code is designed to run on a quantum simulator (which is a classical computer running a quantum algorithm).

However, it can be adapted to run on actual quantum hardware through a cloud-based quantum computing service.

This would involve using both quantum and classical resources in practice. Let’s go through the code step by step!

import pennylane as qml
from pennylane import numpy as np  # PennyLane's wrapped NumPy; plain NumPy rejects requires_grad
import matplotlib.pyplot as plt

# Define the molecule (H2). PennyLane reads coordinates in Bohr by default,
# so the intended 0.74 Å bond length would be entered as about 1.398.
symbols = ["H", "H"]
coordinates = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.74])

# Generate the Hamiltonian for the molecule
hamiltonian, qubits = qml.qchem.molecular_hamiltonian(
    symbols, coordinates
)

# Define the quantum device
dev = qml.device("default.qubit", wires=qubits)

# Define the ansatz (variational quantum circuit)
def ansatz(params, wires):
    qml.BasisState(np.array([0] * qubits), wires=wires)
    for i in range(qubits):
        qml.RY(params[i], wires=wires[i])
    for i in range(qubits - 1):
        qml.CNOT(wires=[wires[i], wires[i + 1]])

# Define the cost function
@qml.qnode(dev)
def cost_fn(params):
    ansatz(params, wires=range(qubits))
    return qml.expval(hamiltonian)

# Set a fixed seed for reproducibility
np.random.seed(42)

# Set the initial parameters
params = np.random.random(qubits, requires_grad=True)

# Choose an optimizer
optimizer = qml.GradientDescentOptimizer(stepsize=0.4)

# Number of optimization steps
max_iterations = 100
conv_tol = 1e-06

# Optimization loop
energies = []

for n in range(max_iterations):
    params, prev_energy = optimizer.step_and_cost(cost_fn, params)

    energy = cost_fn(params)
    energies.append(energy)
    if np.abs(energy - prev_energy) < conv_tol:
        break

    print(f"Step = {n}, Energy = {energy:.8f} Ha")

print(f"Final ground state energy = {energy:.8f} Ha")

# Visualize the results
iterations = range(len(energies))

plt.plot(iterations, energies)
plt.xlabel('Iteration')
plt.ylabel('Energy (Ha)')
plt.title('Convergence of VQE for H2 Molecule')
plt.show()

Importing Libraries

import pennylane as qml
from pennylane import numpy as np  # PennyLane's wrapped NumPy; plain NumPy rejects requires_grad
import matplotlib.pyplot as plt

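pennylane: A library for quantum computing that provides tools for creating and optimizing quantum circuits, and for running quantum machine learning algorithms.
numpy: A library for numerical operations in Python, used here for handling arrays and mathematical computations. PennyLane provides a wrapped version (pennylane.numpy) whose arrays accept requires_grad=True, which the parameter initialization later in the code relies on.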
matplotlib: A library for creating visualizations and plots in Python, used here to graph the convergence of the VQE algorithm.

Defining the Molecule and Generating the Hamiltonian

# Define the molecule (H2). PennyLane reads coordinates in Bohr by default,
# so the intended 0.74 Å bond length would be entered as about 1.398.
symbols = ["H", "H"]
coordinates = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.74])

# Generate the Hamiltonian for the molecule
hamiltonian, qubits = qml.qchem.molecular_hamiltonian(
    symbols, coordinates
)

Defining the Molecule:

We define a hydrogen molecule (H₂).
symbols = ["H", "H"]: This means the molecule consists of two hydrogen (H) atoms.
coordinates = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.74]): This gives the positions of the two hydrogen atoms. The first hydrogen atom is at the origin (0.0, 0.0, 0.0), and the second is at (0.0, 0.0, 0.74), so it sits 0.74 units from the first atom along the z-axis. Note that PennyLane interprets these coordinates in atomic units (Bohr) by default, so reproducing the 0.74 Å equilibrium bond length would require a z-value of about 1.398.

Generating the Hamiltonian:

hamiltonian, qubits = qml.qchem.molecular_hamiltonian(symbols, coordinates): This line generates the Hamiltonian for the hydrogen molecule. The Hamiltonian is a mathematical object used to describe the energy of the molecule.
hamiltonian: Represents the energy operator for the molecule.
qubits: Represents the number of quantum bits (qubits) needed to simulate the molecule on a quantum computer.
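
Bond lengths are usually quoted in angstroms, while PennyLane's qchem module expects atomic units (Bohr) by default, so a quick conversion is handy. The sketch below is plain Python. The constant and the helper name angstrom_to_bohr are my own and are not part of PennyLane.

```python
BOHR_IN_ANGSTROM = 0.529177  # length of 1 Bohr in angstroms (rounded)

def angstrom_to_bohr(length_angstrom):
    """Convert a length from angstroms to atomic units (Bohr)."""
    return length_angstrom / BOHR_IN_ANGSTROM

bond_bohr = angstrom_to_bohr(0.74)
print(f"0.74 angstrom = {bond_bohr:.4f} Bohr")
```

The converted value is what you would place in the coordinates array to describe H₂ at its 0.74 Å bond length.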

Defining the Quantum Device and Ansatz (Variational Quantum Circuit)

# Define the quantum device
dev = qml.device("default.qubit", wires=qubits)

# Define the ansatz (variational quantum circuit)
def ansatz(params, wires):
    qml.BasisState(np.array([0] * qubits), wires=wires)
    for i in range(qubits):
        qml.RY(params[i], wires=wires[i])
    for i in range(qubits - 1):
        qml.CNOT(wires=[wires[i], wires[i + 1]])

Defining the Quantum Device:

dev = qml.device("default.qubit", wires=qubits): This line sets up a quantum computing device to simulate our molecule.
"default.qubit": This specifies the type of quantum simulator we are using (a default qubit-based simulator).
wires=qubits: This tells the simulator how many qubits (quantum bits) it needs to use, based on the number of qubits we determined earlier.

Defining the Ansatz (Variational Quantum Circuit):

def ansatz(params, wires): This defines a function named ansatz which describes the variational quantum circuit. This circuit will be used to find the ground state energy of the molecule.
qml.BasisState(np.array([0] * qubits), wires=wires): This initializes the qubits in the state 0. The np.array([0] * qubits) creates an array with zeros, one for each qubit.
for i in range(qubits): qml.RY(params[i], wires=wires[i]): This loop applies a rotation around the Y-axis to each qubit. params[i] provides the angle for each rotation.
for i in range(qubits - 1): qml.CNOT(wires=[wires[i], wires[i + 1]]): This loop applies Controlled-NOT (CNOT) gates between consecutive qubits, entangling them.
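
To see what "rotate, then entangle" does to a state, here is a hand-written two-qubit version of the same pattern in plain Python. It is only a sketch for intuition: the function names are hypothetical, and it tracks the four amplitudes directly instead of using PennyLane.

```python
import math

def ry_on_zero(theta):
    """Amplitudes (for 0 and 1) after applying RY(theta) to the 0 state."""
    return (math.cos(theta / 2), math.sin(theta / 2))

def two_qubit_ansatz(theta0, theta1):
    """RY on each qubit, then CNOT (control: qubit 0, target: qubit 1).

    Returns amplitudes ordered as |00>, |01>, |10>, |11>.
    """
    c0, s0 = ry_on_zero(theta0)
    c1, s1 = ry_on_zero(theta1)
    a00, a01, a10, a11 = c0 * c1, c0 * s1, s0 * c1, s0 * s1
    # CNOT flips the target when the control is 1: swap |10> and |11>
    return (a00, a01, a11, a10)

state = two_qubit_ansatz(math.pi / 2, 0.0)
print([round(a, 4) for a in state])
```

With these angles, only |00> and |11> have non-zero amplitude, so the two qubits are entangled: measuring one tells you the value of the other.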

Defining the Cost Function, Setting Initial Parameters and Optimizer

# Define the cost function
@qml.qnode(dev)
def cost_fn(params):
    ansatz(params, wires=range(qubits))
    return qml.expval(hamiltonian)

# Set a fixed seed for reproducibility
np.random.seed(42)

# Set the initial parameters
params = np.random.random(qubits, requires_grad=True)

# Choose an optimizer
optimizer = qml.GradientDescentOptimizer(stepsize=0.4)

Defining the Cost Function:

@qml.qnode(dev): This line is a decorator that transforms the cost_fn function into a quantum node, allowing it to run on the quantum device dev.
def cost_fn(params): This defines a function named cost_fn that takes some parameters (params) as input.
ansatz(params, wires=range(qubits)): Inside this function, we call the previously defined ansatz function, passing in the parameters and specifying that it should use all the qubits.
return qml.expval(hamiltonian): This line returns the expectation value of the Hamiltonian, which represents the energy of the molecule for the current parameters. The cost function is what we aim to minimize to find the ground state energy.
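
The idea behind minimizing an expectation value can be checked by hand on a tiny example. The sketch below uses a made-up 2x2 Hamiltonian (the numbers are arbitrary, not related to H₂) and a one-parameter trial state, and shows that the lowest energy found by scanning the parameter matches the lowest eigenvalue.

```python
import math

H = [[1.0, 0.5],
     [0.5, -1.0]]  # toy 2x2 Hamiltonian (arbitrary real symmetric matrix)

def expectation(theta):
    """<psi|H|psi> for the real trial state psi = (cos(theta/2), sin(theta/2))."""
    psi = (math.cos(theta / 2), math.sin(theta / 2))
    h_psi = [sum(H[i][j] * psi[j] for j in range(2)) for i in range(2)]
    return sum(psi[i] * h_psi[i] for i in range(2))

# Lowest eigenvalue of this traceless 2x2 matrix: -sqrt(1^2 + 0.5^2)
exact_ground = -math.sqrt(1.0 ** 2 + 0.5 ** 2)

# Scan theta and keep the lowest energy found
thetas = [k * 2 * math.pi / 1000 for k in range(1000)]
best = min(expectation(t) for t in thetas)
print(f"best scanned energy = {best:.4f}, exact = {exact_ground:.4f}")
```

No trial state ever gives an energy below the exact ground state energy. That is why minimizing the cost function is a sensible way to approximate it.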

Setting a Fixed Seed for Reproducibility:

np.random.seed(42): This line sets a fixed seed for the random number generator. This ensures that the random numbers generated will be the same each time the code is run, making the results reproducible.

Setting the Initial Parameters:

params = np.random.random(qubits, requires_grad=True): This line initializes the parameters for the ansatz with random values. The number of parameters is equal to the number of qubits. The requires_grad=True part indicates that these parameters can be adjusted during optimization.

Choosing an Optimizer:

optimizer = qml.GradientDescentOptimizer(stepsize=0.4): This line creates an optimizer that will adjust the parameters to minimize the cost function. Specifically, it uses gradient descent with a step size of 0.4.

Optimization Loop

# Number of optimization steps
max_iterations = 100
conv_tol = 1e-06

# Optimization loop
energies = []

for n in range(max_iterations):
    params, prev_energy = optimizer.step_and_cost(cost_fn, params)

    energy = cost_fn(params)
    energies.append(energy)
    if np.abs(energy - prev_energy) < conv_tol:
        break

    print(f"Step = {n}, Energy = {energy:.8f} Ha")

print(f"Final ground state energy = {energy:.8f} Ha")

Setting the Number of Optimization Steps:

max_iterations = 100: This sets the maximum number of steps the optimization will take. In this case, it is 100 steps.
conv_tol = 1e-06: This defines the convergence tolerance. If the change in energy between steps is less than this value, the optimization will stop.

Optimization Loop:

energies = []: This initializes an empty list to store the energies calculated at each step.

Looping Through Optimization Steps:

for n in range(max_iterations):: This starts a loop that will run up to max_iterations times.
params, prev_energy = optimizer.step_and_cost(cost_fn, params): This line performs one step of optimization. It returns the updated parameters along with the energy evaluated before the update.
energy = cost_fn(params): This calculates the current energy using the updated parameters.
energies.append(energy): This adds the current energy to the energies list.
if np.abs(energy - prev_energy) < conv_tol: break: This checks if the absolute difference between the current energy and the previous energy is less than the convergence tolerance. If it is, the loop stops early because the optimization has converged.
print(f"Step = {n}, Energy = {energy:.8f} Ha"): This prints the current step number and the energy in Hartree (Ha) to eight decimal places.

Printing the Final Energy:

print(f"Final ground state energy = {energy:.8f} Ha"): After the loop, this prints the final ground state energy.
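
The same loop structure can be tried without any quantum library. The sketch below is a toy stand-in, not the H₂ problem: it assumes a single qubit prepared with RY(theta) and measured in Z, for which the energy is simply cos(theta). The step size and tolerance mirror the values used above, and the starting angle is arbitrary.

```python
import math

def energy_fn(theta):
    # Toy cost: for a single qubit prepared with RY(theta), <Z> = cos(theta)
    return math.cos(theta)

def gradient(theta):
    # Analytic derivative of the toy cost
    return -math.sin(theta)

stepsize = 0.4
max_iterations = 100
conv_tol = 1e-06

theta = 0.5      # arbitrary starting angle
energies = []

for n in range(max_iterations):
    prev_energy = energy_fn(theta)               # cost before the update
    theta = theta - stepsize * gradient(theta)   # one gradient descent step
    energy = energy_fn(theta)                    # cost after the update
    energies.append(energy)
    if abs(energy - prev_energy) < conv_tol:
        break

print(f"Stopped after {len(energies)} steps, energy = {energy:.6f}")
```

Here the loop stops well before max_iterations because the energy stops changing once theta gets close to pi, where cos(theta) reaches its minimum of -1.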

Visualizing the Results

# Visualize the results
iterations = range(len(energies))

plt.plot(iterations, energies)
plt.xlabel('Iteration')
plt.ylabel('Energy (Ha)')
plt.title('Convergence of VQE for H2 Molecule')
plt.show()

Setting Up the Data for Visualization:

iterations = range(len(energies)): This creates a range object representing the number of iterations (steps) the optimization went through. len(energies) gives the number of energy values recorded.

Plotting the Results:

plt.plot(iterations, energies): This line creates a plot with the iteration numbers on the x-axis and the corresponding energy values on the y-axis.
plt.xlabel('Iteration'): This sets the label for the x-axis to "Iteration".
plt.ylabel('Energy (Ha)'): This sets the label for the y-axis to "Energy (Ha)", where "Ha" stands for Hartree, a unit of energy.
plt.title('Convergence of VQE for H2 Molecule'): This sets the title of the plot to "Convergence of VQE for H2 Molecule".
plt.show(): This displays the plot.

The graph titled "Convergence of VQE for H2 Molecule" shows the energy (in Hartree, Ha) of the H2 molecule plotted against the number of iterations of the Variational Quantum Eigensolver (VQE) algorithm.

X-Axis (Iteration): Number of VQE iterations.
Y-Axis (Energy (Ha)): Energy of the H2 molecule in Hartree.

Key Points:

Initial Energy: Approximately 1.4 Ha at iteration 0.
Rapid Decrease: Energy quickly drops within the first 20 iterations.
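Plateau: Energy stabilizes around 0.4 Ha after 20 iterations, indicating that the optimizer has converged. Note that this is far above the accepted H₂ ground state energy at equilibrium, about -1.14 Ha, so treat the value as the output of a simple demonstration and not as an accurate ground state energy.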

Conclusion: Limitations of Quantum Computing and Development

Photo by Richa Sharma: https://www.pexels.com/photo/ceramic-mug-on-black-laptop-on-table-in-office-4247412/

Besides potentially making AI algorithms more computationally efficient, quantum computing could transform many fields like:

Drug discovery
Material science
Cryptography
Financial modeling
Optimization problems
Climate modeling
Machine learning

However, for all of us to use quantum computing, a way to physically represent qubits small enough to fit on our laptops is needed. That will take years.

The full code can be found here:

https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code

How Does Knowledge Distillation Work in Deep Learning Models?

Oyedele Tioluwani — Tue, 09 Jul 2024 13:35:16 +0000

Deep learning models have transformed several industries, including computer vision and natural language processing. However, the rising complexity and resource requirements of these models have motivated researchers to look into ways to condense their knowledge into more compact and efficient forms.

Knowledge distillation, a strategy for transferring knowledge from a complicated model to a simpler one, has emerged as an effective instrument for accomplishing this goal. In this article, we’ll look at the notion of knowledge distillation in deep learning models and its applications.

Concept of Knowledge Distillation

Knowledge distillation is a deep learning process in which knowledge is transferred from a complicated, well-trained model, known as the “teacher,” to a simpler and lighter model, known as the “student.”

The basic purpose of knowledge distillation is to produce a more efficient model that retains the important information and performance of the bigger model while being computationally less demanding.

The process consists of two steps:

1. Training the “Teacher” Model

The teacher model is trained on labeled data to discover patterns and correlations within it.
The teacher model’s large capacity allows it to capture minute details, resulting in superior performance on the assigned task.
The teacher model’s predictions on the training data provide a source of knowledge that the student model seeks to imitate.

2. Transferring Knowledge to the “Student” Model

The student model is then trained using the same data as the teacher but with a difference.
Instead of typical hard labels (a data point’s final class assignment), the student model is trained with soft labels (a significantly richer representation of the data), which are probability distributions over the classes supplied by the teacher model.
Using soft labels, the student learns not just to copy the teacher’s final judgments, but also to understand the uncertainty and logic behind those predictions.
The goal is for the student model to generalize and approximate the knowledge encoded in the teacher model, resulting in a more compact representation of the data.

Knowledge distillation uses the teacher model’s soft targets to convey not just the predicted class, but also the probability distribution across all possible classes. These soft targets carry subtle signals, such as which wrong classes the teacher considers plausible, that hard labels discard. By incorporating these cues into its training, the student learns not only to replicate the teacher model’s outputs but also to recognize the broader patterns and correlations in the data.

The soft labels give a smoother gradient during training, allowing the student model to benefit more from the teacher’s knowledge. This procedure helps the student model generalize better and frequently results in a smaller model that retains a considerable percentage of the teacher’s performance.

The temperature parameter used in the softmax function during the knowledge distillation process influences the sharpness of the probability distributions. Higher temperatures cause softer probability distributions, emphasizing information transfer, whereas lower temperatures produce sharper distributions, favoring precise predictions.
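
The effect of temperature is easy to see numerically. Here is a minimal sketch in plain Python: the logits are made-up values for three classes, and the function name softmax_with_temperature is illustrative, not taken from any framework.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits divided by a temperature T > 0."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 1.0, 0.2]           # made-up logits for three classes
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=5.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

At the higher temperature the top class still wins, but more probability is spread over the other classes, which is the extra information the student gets to learn from.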

Overall, knowledge distillation is the process of transferring gained knowledge from a powerful and complicated model to a smaller one, making it more suitable for use in circumstances with limited computational resources.

Relevance of Knowledge Distillation in Deep Learning

Knowledge distillation is important in deep learning for a variety of reasons, and its applications encompass multiple fields. Here are some major factors that demonstrate the importance of knowledge distillation in the field of deep learning:

Model Compression: Model compression is a fundamental motivator for knowledge distillation. Deep learning models, particularly those with millions of parameters, can be computationally expensive and resource-consuming. Knowledge distillation allows for the production of smaller, more lightweight models that retain a significant fraction of the performance of their larger counterparts.
Model Pruning: Knowledge distillation can be used to find and eliminate redundant or irrelevant neurons and connections in a deep learning model. Training a student model to emulate the behavior of a teacher model allows the student model to learn which aspects of the teacher model are most important and which can be safely removed.
Enhanced Generalization: Knowledge distillation frequently produces student models with increased generalization capabilities. By learning not only the final predictions but also the logic and uncertainty from the teacher model, the student may better generalize to previously unseen data, making it a powerful strategy for boosting model resilience.
Transfer Learning: Knowledge distillation can be used to transfer knowledge from a pre-trained deep learning model to a new model trained on a separate but related problem. By training a student model to imitate the behavior of a pre-trained teacher model, the student model can learn the broad characteristics and patterns common to both tasks, allowing it to perform effectively on the new task with less data and computational resources.
Scalability and Accessibility: Knowledge distillation helps to make complex artificial intelligence technology more accessible to a wider audience. Smaller models demand fewer computational resources, making it easier for researchers, developers, and businesses to implement and incorporate deep learning technologies into their applications.
Performance Improvement: In special cases, knowledge distillation can even result in enhanced performance on specific tasks, particularly when data is scarce. The student model benefits from the teacher’s deeper understanding of data distribution, resulting in improved generalization and robustness.

Applications of Knowledge Distillation

Knowledge distillation can be applied in a variety of fields in deep learning, providing advantages such as model compression, enhanced generalization, and efficient deployment. Here are some notable applications for knowledge distillation:

Computer Vision: In object detection, knowledge distillation is used to compress large and complicated detection models, making them suitable for deployment on devices with limited processing resources, such as security cameras and drones.
Natural Language Processing (NLP): Knowledge distillation is used to generate compact models for text classification, sentiment analysis, and other NLP applications. These models are more suitable for real-time applications and can be deployed on platforms such as chatbots and mobile devices. Distilled models in NLP are also used for language translation, enabling effective language processing across multiple platforms.
Recommendation Systems: Knowledge distillation is used in recommendation systems to build efficient models capable of providing individualized recommendations depending on user behavior. These models are better suited for distribution across several platforms.
Edge Computing: Knowledge-distilled models enable the deployment of deep learning models on edge devices with low resources. This is critical for applications such as real-time video analysis, edge-based image processing, and IoT devices.
Anomaly Detection: In cybersecurity and anomaly detection, knowledge distillation is used to generate lightweight models for detecting unexpected patterns in network traffic or user behavior. These models help to detect threats quickly and efficiently.
Quantum Computing: In the growing field of quantum computing, knowledge distillation is being investigated to create more compact quantum models that can run efficiently on quantum hardware.
Transfer Learning: Knowledge distillation enhances transfer learning, allowing pre-trained models to quickly apply their knowledge to new tasks. This is useful in cases where labeled data for the target task is limited.

There are numerous case studies demonstrating the effectiveness of knowledge distillation across different domains. Examples include:

In the healthcare industry, knowledge distillation is being used to train smaller, faster models for medical image analysis and disease detection. Early research indicates that lowering model size while retaining diagnostic accuracy is a promising approach.
Knowledge distillation has been used to increase speech recognition models’ accuracy and robustness, particularly for low-resource languages with limited data. Baidu and Google have shown considerable improvements in word error rate (WER) by distilling knowledge from large pre-trained models.
Knowledge distillation can be used to train robotic grippers to handle a variety of objects efficiently. By distilling knowledge from a pre-trained model that has grasped a variety of items, a smaller model can learn efficient grasping strategies with less training data and processing resources.
Knowledge distillation can help train AI models for resource-constrained IoT devices. A smaller model can run on low-power devices while still performing important activities like sensor data analysis and anomaly detection.

These examples demonstrate knowledge distillation’s adaptability beyond its conventional use in vision and language tasks. Its capacity to bridge the gap between model accuracy and efficiency has major real-world applications, allowing AI solutions to function effectively in diverse and resource-constrained situations.

Techniques and Methods for Knowledge Distillation

To ensure effective knowledge distillation, a variety of strategies and tactics are used. Here are some important strategies for knowledge distillation:

1. Soft Target Labels: This technique involves using probability distributions, known as soft labels, instead of standard hard labels when training a student model. These soft labels are created by applying a softmax function to the output logits of the more advanced teacher model. The temperature parameter in the softmax function controls the smoothness of the probability distributions.

By training the student model to match these soft target labels, it learns not only the teacher’s final predictions but also the level of confidence and uncertainty in each prediction. This refined approach improves the student model’s capacity to generalize and capture the complex knowledge embedded in the teacher model, yielding a more efficient and compact model.
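
One common way to turn soft target labels into a training signal is to combine a soft-label term with the usual hard-label term. The sketch below is a plain Python illustration of that idea, not the method of any specific library. The temperature and alpha values are hypothetical hyperparameters, and the logits are made up.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of a soft-label term and a hard-label term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    # Standard cross-entropy against the hard label (temperature 1)
    ce = -math.log(softmax(student_logits)[true_class])
    # Scaling by T^2 keeps the soft term comparable as the temperature changes
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

teacher = [5.0, 2.0, 0.5]
good_student = [4.5, 1.8, 0.4]   # close to the teacher
poor_student = [0.5, 2.0, 5.0]   # disagrees with the teacher
loss_good = distillation_loss(good_student, teacher, true_class=0)
loss_poor = distillation_loss(poor_student, teacher, true_class=0)
print(loss_good, loss_poor)
```

A student whose outputs resemble the teacher’s gets a much lower loss than one that disagrees, so minimizing this loss pulls the student toward the teacher’s full probability distribution and not just its top prediction.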

2. Feature Mimicry: Feature mimicry is a knowledge distillation technique in which a simpler student model is trained to replicate the intermediate feature representations of a more complex teacher model.

Rather than just reproducing the teacher’s final predictions, the student model is instructed to match its internal feature maps at various layers with those of the teacher.

This method tries to convey both the high-level information embodied in the teacher’s predictions and the deep hierarchical features learned throughout the network. By including feature mimicry, the student model can capture deeper information and linkages in the teacher’s representations, resulting in better generalization and performance.
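
A simple way to express feature mimicry as a loss is the mean squared error between the teacher’s intermediate features and the student’s. The sketch below assumes the student has fewer features than the teacher, so a small linear projection (hypothetical here, and normally learned during training) maps the student’s features to the teacher’s size. All numbers are made up.

```python
def project(features, weights):
    """Linear map from the student feature size to the teacher feature size."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def feature_mimicry_loss(student_features, teacher_features, weights):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_features, weights)
    diffs = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(diffs) / len(diffs)

teacher_features = [0.9, -0.3, 0.5, 0.1]   # made-up 4-dimensional teacher activations
student_features = [0.8, -0.2]             # made-up 2-dimensional student activations
# Hypothetical 4x2 projection; in practice this would be trained with the student
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [0.5, 0.0],
           [0.0, 0.0]]
loss = feature_mimicry_loss(student_features, teacher_features, weights)
print(round(loss, 4))
```

In training, this term is typically added to the student’s main loss so that its internal representations, and not only its outputs, move toward the teacher’s.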

3. Self-Distillation: This is a knowledge distillation technique in which a model transfers its knowledge to a simplified version of itself. The teacher and student models share the same architecture. This process can be iterative, with the distilled student serving as the teacher for the subsequent round of distillation.

Self-distillation uses the model’s inherent complexity to guide the learning of a more compact version, allowing for a gradual refining of understanding. This strategy is especially beneficial when a model needs to adapt and reduce its information into a smaller form, resulting in a balance of model size and performance.

4. Multi-Teacher Distillation: Multi-teacher distillation is a method for transferring knowledge from many teacher models to a single student model. Each teacher model brings a distinct viewpoint or skill to the task at hand.

The student model learns from the combined knowledge of these varied teachers, aiming to capture a more complete understanding of the data.

This method frequently improves the robustness and generalization of the student model by combining information from different sources. Multi-teacher distillation is especially useful when the task involves complicated and diverse patterns that can be better grasped from multiple perspectives.
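
One straightforward way to combine several teachers is to average their soft labels before training the student. This is only one possible choice, sketched here in plain Python with made-up probabilities. The function name average_soft_labels and the optional weights are illustrative.

```python
def average_soft_labels(teacher_probs, weights=None):
    """Weighted average of several teachers' class probabilities."""
    n = len(teacher_probs)
    if weights is None:
        weights = [1.0 / n] * n            # trust every teacher equally
    num_classes = len(teacher_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, teacher_probs))
            for c in range(num_classes)]

teachers = [
    [0.7, 0.2, 0.1],   # made-up soft labels from teacher 1
    [0.5, 0.4, 0.1],   # teacher 2
    [0.6, 0.1, 0.3],   # teacher 3
]
combined = average_soft_labels(teachers)
print([round(p, 3) for p in combined])
```

The combined distribution can then be used as the soft target for the student, and the weights give a simple way to trust some teachers more than others.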

5. Attention Transfer: Attention transfer is a knowledge distillation technique that trains a simpler student model to emulate the attention mechanisms of a more complicated teacher model.

Attention mechanisms highlight relevant portions of the input data, allowing the model to concentrate on key elements. In this strategy, the student model learns not only to imitate the teacher’s final predictions but also to emulate attention patterns.

This improves the student model’s interpretability and performance by capturing the selective focus and reasoning used by the teacher model during decision-making.
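
For models without explicit attention layers, one formulation derives an attention map from the activations themselves: square each channel’s activations, sum over channels, and normalize. The sketch below follows that idea in plain Python. The feature maps are made-up numbers, the function names are illustrative, and this is one possible formulation and not the only way attention transfer is done.

```python
import math

def attention_map(feature_map):
    """Collapse channels into one normalized spatial attention vector.

    feature_map[c][i] is the activation of channel c at spatial position i.
    """
    positions = len(feature_map[0])
    raw = [sum(channel[i] ** 2 for channel in feature_map)
           for i in range(positions)]
    norm = math.sqrt(sum(v ** 2 for v in raw)) or 1.0
    return [v / norm for v in raw]

def attention_transfer_loss(student_map, teacher_map):
    """Mean squared difference between student and teacher attention maps."""
    a_s = attention_map(student_map)
    a_t = attention_map(teacher_map)
    return sum((s - t) ** 2 for s, t in zip(a_s, a_t)) / len(a_s)

# Made-up activations: teacher has 3 channels, students have 2, all over 4 positions
teacher_fm = [[0.1, 0.9, 0.2, 0.0],
              [0.0, 0.8, 0.1, 0.1],
              [0.2, 0.7, 0.0, 0.1]]
student_fm = [[0.1, 0.6, 0.1, 0.0],
              [0.0, 0.5, 0.1, 0.1]]
mismatched_fm = [[0.9, 0.1, 0.0, 0.2],
                 [0.8, 0.0, 0.1, 0.1]]
loss_close = attention_transfer_loss(student_fm, teacher_fm)
loss_far = attention_transfer_loss(mismatched_fm, teacher_fm)
print(loss_close, loss_far)
```

Because the channels are summed out, the student and teacher can have different channel counts as long as they share the same spatial positions. The student that focuses on the same position as the teacher gets the smaller loss.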

Challenges and Limitations of Knowledge Distillation

While knowledge distillation is a strong process with many benefits, it also has its drawbacks and limitations. Understanding these difficulties is critical for professionals hoping to use knowledge distillation effectively. Here are some obstacles and constraints related to knowledge distillation:

Computational Overhead: Knowledge distillation necessitates training both a teacher and a student model, potentially increasing the overall computational burden. The technique requires more steps than training a single model, which may make it less suitable for resource-constrained applications.
Finding the Optimal Teacher-Student Pair: It is critical to select a teacher model whose characteristics complement the student’s. A mismatch might result in poor performance or overfitting to the teacher’s biases.
Hyperparameter Tuning: The performance of knowledge distillation depends on the hyperparameters used, such as the temperature parameter in soft label generation. Finding the ideal balance can be difficult and may necessitate significant experimentation.
Risk of Overfitting to Teacher’s Biases: If the teacher model has biases or was trained on biased data, the student model may inherit them throughout the distillation process. Care must be taken to address and reduce any potential biases in the teacher model.
Sensitivity to Noisy Labels: Knowledge distillation can be vulnerable to noisy labels in training data, potentially resulting in the transmission of incorrect or unreliable information from the teacher to the student.

Despite these obstacles and limits, knowledge distillation is nevertheless an effective method for moving knowledge from a large, complicated model to a smaller, simpler model.

With careful consideration and modification, knowledge distillation can improve the performance of machine learning models in a variety of applications.

Conclusion

Knowledge distillation is a powerful technique in the field of deep learning, providing a road to more efficient, compact, and flexible models.

Knowledge distillation addresses model size, computational efficiency, and generalization issues by transferring knowledge from large teacher models to simpler student models.

The distilled models not only preserve much of their teachers’ predictive capability, but they can sometimes perform better, have faster inference times, and are more adaptable.

I hope this article was helpful!

Deep Learning - freeCodeCamp.org

AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)

Paper Overview

Table of Contents

Prerequisites

Executive Summary

Goals of the Paper

Methodology

Pre-Training

Fine-Tuning (Adapting to Tasks)

Transformer vs. BERT vs. GPT

Transformer vs BERT vs GPT: Key Differences

Model Architecture

Key Techniques

Key Findings

Conclusions

Limitations

Related Work & Context

Final Insight

Resources:

Contact Me

How Neural Networks Work – Explained Using the Straight Line Equation y = ax + b

Table of Contents

Prerequisites

y=ax+b

Linear Regression

Linear Classification

Comparison

Key Additions to Help Build Deep Neural Networks

Layering

Non-Linearity

Modelling a Deep Neural Network

Step #1 - Define the Problem Clearly

Step #2 - Define the Input Layer

Step #3 - Define the First Hidden Layer

Step #4 - Stack Layers One Above the Other

Step #5 - Define the Output Feature(s)

Final Thoughts

How to Set Up CUDA and WSL2 for Windows 11 (including PyTorch and TensorFlow GPU)

Table of Contents

Prerequisites

Windows Terminal

Windows PowerShell (Latest & Greatest)

Configure Windows Terminal

Configuration of My Computer

CPU Virtualization

Install WSL2

Install Latest LTS Ubuntu via WSL2

Update & Upgrade Ubuntu Packages

Install and Configure Miniconda

Install Jupyter & Ipykernel

Nvidia Driver

Install CUDA Dependencies

CUDA Toolkit

Add Path to Shell Profile for CUDA

nvcc Version

cuDNN SDK

TensorFlow GPU

Check TensorFlow GPU

PyTorch GPU

Check PyTorch GPU

Check PyTorch & TensorFlow GPU inside Jupyter Notebook

1. Test PyTorch GPU

2. Test TensorFlow GPU

Conclusion

How to Build End-to-End Machine Learning Lineage

Table of Contents

Prerequisites:

Tools we’ll use:

What is Machine Learning Lineage?

What We’ll Build

The System Architecture: AI Pricing for Retailers

The ML Lineage

Workflow in Action

Step 1: Initiating a DVC Project

Step 2: The ML Lineage

Stage 1: The ETL Pipeline

DVC Configuration

Python Scripts

Outputs

Training Script: `train.py`