<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Deep Learning - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Deep Learning - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sat, 16 May 2026 22:22:32 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/deep-learning/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
<![CDATA[ AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1) ]]>
                </title>
                <description>
                    <![CDATA[ We use AI tools all the time, whether it’s asking questions, generating images, or getting help with everyday tasks. But most of these tools didn’t appear out of nowhere. They were developed based on  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/ai-paper-review-improving-language-understanding-by-generative-pre-training-gpt-1/</link>
                <guid isPermaLink="false">69fb84ad50ecad45335e5367</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ academic writing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ transformers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mohammed Fahd Abrah ]]>
                </dc:creator>
                <pubDate>Wed, 06 May 2026 18:13:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/0998e844-4017-49b9-a68d-2d6c73fceb78.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>We use AI tools all the time, whether it’s asking questions, generating images, or getting help with everyday tasks. But most of these tools didn’t appear out of nowhere. They were developed based on research papers where the original ideas were developed and tested.</p>
<p>Now, not everyone enjoys reading research papers or has the time to comb through and digest all that (sometimes very dense) info. So I decided to do the hard work for you and share the key insights in a series of AI paper reviews.</p>
<p>The goal isn’t to turn this into a heavy academic discussion, but to explain the main ideas in a clear and practical way. You'll learn what problem the paper was trying to solve, what approach it introduced, and why it mattered.</p>
<p>In each article, you’ll get a simple breakdown of the paper, how it works, and what you should take away from it. By the end, you should understand the key idea without needing to go through the full research paper yourself.</p>
<h2 id="heading-paper-overview">Paper Overview</h2>
<p>The first paper I'll be reviewing is "Improving Language Understanding by Generative Pre-Training", by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.</p>
<p>Here's the actual paper if you want to read it yourself: <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">Read the paper</a>.</p>
<p>And here's a little infographic of what we'll cover here:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/0466e09f-c2a3-41fa-939d-f67d53f900e1.png" alt="0466e09f-c2a3-41fa-939d-f67d53f900e1" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a href="#heading-executive-summary">Executive Summary</a></p>
</li>
<li><p><a href="#heading-goals-of-the-paper">Goals of the Paper</a></p>
</li>
<li><p><a href="#heading-methodology">Methodology</a></p>
</li>
<li><p><a href="#heading-transformer-vs-bert-vs-gpt">Transformer vs. BERT vs. GPT</a></p>
</li>
<li><p><a href="#heading-model-architecture">Model Architecture</a></p>
</li>
<li><p><a href="#heading-key-techniques">Key Techniques</a></p>
</li>
<li><p><a href="#heading-key-findings">Key Findings</a></p>
</li>
<li><p><a href="#heading-conclusions">Conclusions</a></p>
</li>
<li><p><a href="#heading-limitations">Limitations</a></p>
</li>
<li><p><a href="#heading-related-work-amp-context">Related Work &amp; Context</a></p>
</li>
<li><p><a href="#heading-final-insight">Final Insight</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this breakdown, it helps to be familiar with a few basic ideas:</p>
<ul>
<li><p>A general understanding of natural language processing (NLP) and how machines work with text</p>
</li>
<li><p>A high-level idea of what a Transformer model is (you don’t need deep details, just the concept)</p>
</li>
<li><p>The difference between supervised and unsupervised learning</p>
</li>
<li><p>Basic machine learning concepts like training data and models</p>
</li>
</ul>
<p>If you’re not fully comfortable with all of these, that’s okay; you can still follow along. The goal here is to keep things clear and intuitive.</p>
<h2 id="heading-executive-summary">Executive Summary</h2>
<p>Before models like GPT became what we know today, there was a key limitation: AI systems were good at specific tasks, but struggled with general understanding.</p>
<p>In this paper, the authors introduce a simple but powerful idea. Instead of training a model separately for each task, they first train it on a large amount of unlabeled text to learn the structure of language. Then, they adapt it to specific tasks using smaller labeled datasets.</p>
<p>According to the authors, this two-step approach (pre-training followed by fine-tuning) allows a single model to handle many different tasks with minimal changes.</p>
<p>In practice, this marked a major shift: rather than building a new model for every problem, we can train one general model that learns language itself and then reuse it across tasks.</p>
<h2 id="heading-goals-of-the-paper">Goals of the Paper</h2>
<p>To understand the motivation behind this work, it helps to look at the main limitations in NLP at the time.</p>
<p>Most models depended heavily on large labeled datasets, which weren’t always available. Many tasks simply didn’t have enough labeled data to train effective systems. On top of that, existing models were usually designed for a single task, making them hard to reuse or adapt.</p>
<p>Because of this, the authors aimed to reduce the reliance on labeled data and move toward a more general approach. Their goal was to build a language model that could learn from large amounts of raw text and then be applied across different tasks.</p>
<p>According to the paper, they also wanted to enable transfer learning: the ability to take knowledge learned from one task and apply it to others, and to improve performance without having to design a new model from scratch each time.</p>
<h2 id="heading-methodology">Methodology</h2>
<p>To understand how the authors approached this problem, let’s look at the core idea behind their method.</p>
<h3 id="heading-pre-training">Pre-Training</h3>
<p>At the heart of the paper is a simple but powerful approach built in two stages. The first stage is pre-training, where the model learns directly from raw text.</p>
<p>According to the authors, the model is trained on a large corpus of unlabeled text using a language modeling objective: predicting the next word in a sequence based on the words that came before it. Factoring the problem this way sidesteps the intractable task of modeling <a href="https://en.wikipedia.org/wiki/High-dimensional_statistics">high-dimensional</a> joint probabilities over entire sequences directly. Through this process, the model gradually learns important aspects of language, such as grammar, context, structure, and general patterns.</p>
<p>The paper highlights that datasets like BooksCorpus are used in this stage because they contain long, continuous text. This is important, since it helps the model understand relationships across sentences rather than just short fragments.</p>
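<p>To make the objective concrete, here's a toy sketch (my own illustration, not code from the paper) of what language modeling asks of a model: it is scored by the log-probability it assigns to each next word given the words before it. The bigram probabilities below are made-up numbers standing in for a real model's predictions.</p>

```python
import math

# Hypothetical next-word probabilities P(next | previous), standing in for
# a real language model's output. The values are invented for illustration.
bigram_probs = {
    ("the", "cat"): 0.5,
    ("cat", "sat"): 0.4,
    ("sat", "down"): 0.6,
}

def log_likelihood(tokens, probs):
    """Sum of log P(w_i | w_{i-1}) over the sequence: the quantity the
    language modeling objective asks the model to maximize."""
    total = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        total += math.log(probs[(prev, nxt)])
    return total

ll = log_likelihood(["the", "cat", "sat", "down"], bigram_probs)
```

<p>A real model conditions on the whole preceding context (not just one previous word) and maximizes this quantity over billions of tokens, but the objective has the same shape.</p>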
<h3 id="heading-fine-tuning-adapting-to-tasks">Fine-Tuning (Adapting to Tasks)</h3>
<p>Once the model has learned general language patterns, the next step is fine-tuning, where it is adapted to specific tasks using labeled data.</p>
<p>According to the authors, this includes tasks like question answering, text classification, natural language inference, and semantic similarity. Instead of building a new model for each task, the same pre-trained model is reused with only small adjustments.</p>
<p>In practice, this is what makes the approach powerful: the model already understands language at a general level, so it can quickly adapt to different tasks without needing to be redesigned from scratch.</p>
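<p>The division of labor can be sketched in a few lines. Everything here is hypothetical (a stand-in feature extractor and a hand-set linear head), meant only to show that the task-specific part is tiny compared to the reused pre-trained part:</p>

```python
def pretrained_features(text):
    # Stand-in for the reused pre-trained model: maps text to a fixed-size
    # feature vector. (A real model would return learned representations.)
    return [len(text) / 100.0, text.count(" ") / 10.0]

def classify(text, weights, bias):
    # The only task-specific parameters are `weights` and `bias`:
    # a single linear layer on top of the reused features.
    features = pretrained_features(text)
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0

label = classify("a short movie review", weights=[1.0, 0.5], bias=0.0)
```

<p>Fine-tuning adjusts these few new parameters (and lightly updates the reused ones); switching tasks means swapping only the small head, not rebuilding the model.</p>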
<h2 id="heading-transformer-vs-bert-vs-gpt">Transformer vs. BERT vs. GPT</h2>
<p>Before diving into GPT-1, it helps to understand how modern language models are structured. Most of them are based on the Transformer architecture, but they use it in different ways: encoder-only models (like BERT), decoder-only models (like GPT), or full encoder–decoder models.</p>
<p>The original encoder–decoder Transformer was mainly used for tasks like machine translation. Encoder-only models are typically used for understanding tasks such as text classification and sentiment analysis, while decoder-only models are designed for generation tasks like text creation, powering systems such as ChatGPT, Gemini, and Claude.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/e7348479-5fa0-4adf-92e1-644ae2039b03.png" alt="e7348479-5fa0-4adf-92e1-644ae2039b03" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><em>Illustration comparing Transformer, GPT, and BERT architectures, adapted from</em> <a href="https://automotivevisions.wordpress.com/2025/03/21/comparing-large-language-models-gpt-vs-bert-vs-t5/">Comparing Large Language Models: GPT vs. BERT vs. T5</a> <em>showing encoder-decoder, decoder-only, and encoder-only designs</em></p>
<h3 id="heading-transformer-vs-bert-vs-gpt-key-differences">Transformer vs BERT vs GPT: Key Differences</h3>
<table style="min-width:100px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Aspect</strong></p></td><td><p><strong>Transformer (Original)</strong></p></td><td><p><strong>BERT</strong></p></td><td><p><strong>GPT</strong></p></td></tr><tr><td><p><strong>Paper</strong></p></td><td><p>Attention Is All You Need (2017)</p></td><td><p>BERT (2018)</p></td><td><p>GPT (2018–2019)</p></td></tr><tr><td><p><strong>Architecture Type</strong></p></td><td><p>Encoder + Decoder</p></td><td><p>Encoder-only</p></td><td><p>Decoder-only</p></td></tr><tr><td><p><strong>Primary Goal</strong></p></td><td><p>Sequence-to-sequence tasks (for example, translation)</p></td><td><p>Language understanding</p></td><td><p>Language generation</p></td></tr><tr><td><p><strong>Training Objective</strong></p></td><td><p>Predict next token (seq2seq setup)</p></td><td><p>Masked language modeling (fill in blanks)</p></td><td><p>Predict next token (autoregressive)</p></td></tr><tr><td><p><strong>Directionality</strong></p></td><td><p>Bidirectional (encoder) + left-to-right (decoder)</p></td><td><p>Fully bidirectional</p></td><td><p>Left-to-right only</p></td></tr><tr><td><p><strong>Context Understanding</strong></p></td><td><p>Strong (via attention)</p></td><td><p>Very strong (full bidirectional context)</p></td><td><p>Strong (but only past context)</p></td></tr><tr><td><p><strong>Input/Output Style</strong></p></td><td><p>Input → Output sequence</p></td><td><p>Input → Representation</p></td><td><p>Input → Generated text</p></td></tr><tr><td><p><strong>Fine-tuning</strong></p></td><td><p>Required for each task</p></td><td><p>Required for each task</p></td><td><p>Optional (GPT-2+ supports zero-shot)</p></td></tr><tr><td><p><strong>Typical Tasks</strong></p></td><td><p>Translation, summarization</p></td><td><p>Classification, QA, NLI</p></td><td><p>Text generation, QA, 
chat</p></td></tr><tr><td><p><strong>Strength</strong></p></td><td><p>Flexible architecture foundation</p></td><td><p>Deep understanding of text</p></td><td><p>General-purpose generation</p></td></tr><tr><td><p><strong>Limitation</strong></p></td><td><p>Not directly usable without adaptation</p></td><td><p>Cannot generate text naturally</p></td><td><p>Limited bidirectional context</p></td></tr><tr><td><p><strong>Key Innovation</strong></p></td><td><p>Self-attention mechanism</p></td><td><p>Deep bidirectional encoding</p></td><td><p>Scaled generative pre-training</p></td></tr><tr><td><p><strong>Evolution Role</strong></p></td><td><p>Foundation of all modern LLMs</p></td><td><p>Specialized understanding models</p></td><td><p>Path to general-purpose AI</p></td></tr></tbody></table>

<h2 id="heading-model-architecture">Model Architecture</h2>
<p>To support this pre-training and fine-tuning approach, the GPT-1 model is built on a Transformer (decoder) architecture.</p>
<p>According to the authors, this choice is important for a few reasons. Unlike older models such as LSTMs, Transformers handle long-range dependencies more effectively, meaning they can better understand relationships between words that are far apart in a sentence.</p>
<p>They also rely on self-attention, a mechanism that allows the model to focus on the most relevant parts of the text when processing each word. This helps the model capture context more accurately.</p>
<p>Another key advantage is that Transformers make transfer learning more effective, since the same learned representations can be reused across different tasks with minimal changes.</p>
<p>The paper highlights that, in these transfer learning scenarios, Transformers outperform LSTM-based models.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/59df10f6-d843-4db7-9def-e302594d0b7e.png" alt="59df10f6-d843-4db7-9def-e302594d0b7e" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><em>Figure 1 from</em> “Improving Language Understanding by Generative Pre-Training” <em>(Radford et al., 2018), showing the Transformer architecture and task-specific input transformations.</em></p>
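<p>The "decoder" part of the design has one defining property: each position may attend only to positions at or before it (left-to-right). Here's a minimal sketch of that causal mask, independent of any framework:</p>

```python
def causal_mask(n):
    """n x n mask where entry [i][j] is 1 if position i may attend to
    position j (that is, j <= i), else 0. Decoder-only models apply this
    mask inside self-attention so tokens never see the future."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

mask = causal_mask(3)
```

<p>In practice the mask sets disallowed attention scores to negative infinity before the softmax, but the lower-triangular pattern is the same.</p>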
<h2 id="heading-key-techniques">Key Techniques</h2>
<p>Along with the main approach, the authors introduce a few practical techniques that make the model more flexible across tasks.</p>
<p>According to the paper, different tasks are handled by converting them into text-based formats, so they can all be processed in a similar way. This makes it easier to use the same model across multiple problems without redesigning it each time.</p>
<p>Another important point is that the model requires only minimal architectural changes when switching between tasks. Most of the knowledge learned during pre-training is reused as-is.</p>
<p>The authors also include an auxiliary language modeling objective during fine-tuning, which helps the model retain its general understanding of language while adapting to specific tasks.</p>
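<p>Combining the two objectives is just a weighted sum of losses; the paper adds the auxiliary language-modeling loss to the task loss with a weight of 0.5. The loss values below are invented for illustration:</p>

```python
def combined_loss(task_loss, lm_loss, lam=0.5):
    """Fine-tuning objective: task loss plus the auxiliary language
    modeling loss, weighted by lam (the paper sets lam = 0.5)."""
    return task_loss + lam * lm_loss

# Illustrative values, not results from the paper.
total = combined_loss(task_loss=1.2, lm_loss=0.8)
```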
<h2 id="heading-key-findings">Key Findings</h2>
<p>After training and evaluation, the results weren't just competitive – they surpassed the state of the art on most benchmarks.</p>
<p>According to the authors, the model outperformed state-of-the-art systems in 9 out of 12 tasks. It also showed clear improvements, including +8.9% in commonsense reasoning and +5.7% in question answering.</p>
<p>Another important observation is that the model performed well across datasets of different sizes, although performance was weaker on some smaller datasets.</p>
<p>This suggests that the pre-training step helped it generalize better, even when labeled data was limited.</p>
<p>In practice, what makes these results significant is that a single model was able to compete with specialized systems that were specifically designed for each individual task.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/14e5a9dd-9919-4b2a-ad42-6b011770b7fe.png" alt="14e5a9dd-9919-4b2a-ad42-6b011770b7fe" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><em>Figure 2 from</em> “Improving Language Understanding by Generative Pre-Training” <em>(Radford et al., 2018), illustrating performance gains from layer transfer and zero-shot learning behavior.</em></p>
<h2 id="heading-conclusions">Conclusions</h2>
<p>To wrap things up, this paper introduced a major shift in how AI systems are built.</p>
<p>According to the authors, instead of training a new model from scratch for every task, we can first teach a model the structure of language through pre-training, and then adapt it to specific tasks through fine-tuning. This simple idea turns out to be highly effective.</p>
<p>The key takeaway is that language models can develop a general understanding of text, especially when combined with Transformer architectures and large-scale data. This makes transfer learning practical across many different tasks.</p>
<p>In my view, this is what makes the paper so impactful. It doesn’t just improve performance on a few benchmarks. It changes the overall approach to building AI systems.</p>
<p>This idea later became the foundation for models like GPT-2, GPT-3, and ChatGPT, and continues to shape modern large language models today.</p>
<h2 id="heading-limitations">Limitations</h2>
<p>Like any approach, this method comes with its own limitations.</p>
<p>According to the paper, one of the main challenges is the need for large amounts of unlabeled data during the pre-training stage, which may not always be easy to get. The model’s performance also depends heavily on how well the fine-tuning step is done.</p>
<p>The authors also note that multi-task learning was not fully explored in this work, leaving some open questions about how well the model can handle multiple tasks at the same time.</p>
<p>In practice, another limitation is that performance can be weaker when working with very small datasets, especially if the fine-tuning process is not carefully handled.</p>
<h2 id="heading-related-work-amp-context">Related Work &amp; Context</h2>
<p>To better understand where this paper fits, it helps to look at the ideas it builds on.</p>
<p>According to the authors, earlier approaches such as word embeddings (like Word2Vec and GloVe), LSTM-based language models, and semi-supervised learning had already made progress in understanding language. But these methods were often limited to learning representations at the word level or required more task-specific design.</p>
<p>What this paper does differently is move beyond that. Instead of focusing only on individual words, it learns broader language representations that capture context and meaning across entire sequences. This shift is what enables the model to generalize better across different tasks.</p>
<h2 id="heading-final-insight">Final Insight</h2>
<p>If there’s one idea to take away from this paper, it’s this: you don’t need to teach an AI system every task separately.</p>
<p>According to the authors, once a model learns the structure of language, it can adapt to a wide range of tasks with minimal changes. That shift – from task-specific models to general language understanding – is what makes this work so important.</p>
<p>In my view, this is the moment where things really changed. What started here with GPT-1 became the foundation for the systems we use today, including ChatGPT and other modern language models.</p>
<h2 id="heading-resources">Resources</h2>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD/Pytorch-Collections/tree/main/GPT">Pytorch Projects for GPT series</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1301.3781">Word2Vec (Mikolov et al., 2013)</a></p>
</li>
<li><p><a href="https://aclanthology.org/D14-1162.pdf">GloVe (Pennington et al., 2014)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1706.03762">Attention Is All You Need (Vaswani et al., 2017)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1511.01432">Semi-supervised Sequence Learning (Dai and Le, 2015)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1801.06146">Universal Language Model Fine-tuning for Text Classification (Howard and Ruder, 2018)</a></p>
</li>
<li><p><a href="https://aclanthology.org/N18-1202.pdf">Deep Contextualized Word Representations (Peters et al., 2018)</a></p>
</li>
<li><p><a href="https://aclanthology.org/P17-1194.pdf">Semi-supervised Multitask Learning for Sequence Labeling (Rei, 2017)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1506.06726">Skip-Thought Vectors (Kiros et al., 2015)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1705.02364">Supervised Learning of Universal Sentence Representations (Conneau et al., 2017)</a></p>
</li>
</ul>
<h3 id="heading-contact-me">Contact Me</h3>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD"><strong>Github</strong></a></p>
</li>
<li><p><a href="https://x.com/programmingoce"><strong>X</strong></a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/mohammed-abrah-6435a63ba/"><strong>Linkedin</strong></a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Neural Networks Work – Explained Using the Straight Line Equation y = ax + b ]]>
                </title>
                <description>
                    <![CDATA[ Did you know that every data scientist who builds a complex neural network starts with a fundamental question, “How does the output change when the input changes?“ A straight line equation y = ax+b answers it in the simplest way possible. y can incre... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/neural-networks-explained-using-y-ax-b/</link>
                <guid isPermaLink="false">695ef4246f1bfe13bf31abe9</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Samyukta Hegde ]]>
                </dc:creator>
                <pubDate>Thu, 08 Jan 2026 00:02:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767800625537/5bb99a58-d247-4933-b60b-fd2c14651542.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Did you know that every data scientist who builds a complex neural network starts with a fundamental question, “How does the output change when the input changes?“</p>
<p>A straight line equation <code>y = ax+b</code> answers it in the simplest way possible. <code>y</code> can increase, decrease, or stay the same when <code>x</code> changes.</p>
<p>On the other hand, a deep neural network answers it in a flexible way. This is only possible because of multiple layers of straight-line calculations stacked one over another, along with non-linear adjustments that help the network adapt and produce the desired result.</p>
<p>Since a straight line is the essence of neural networks, I think it’s time we try to understand the subtle details of <code>y = ax+b</code>, which I refer to as the <strong>magical equation</strong>. We’ll also go through the basics of linear regression and classification, which should help you understand the progression of a simple straight line to a complex deep neural network.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-yaxb">y=ax+b</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-linear-regression">Linear Regression</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-linear-classification">Linear Classification</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-comparison">Comparison</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-additions-to-help-build-deep-neural-networks">Key Additions to Help Build Deep Neural Networks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-modelling-a-deep-neural-network">Modelling a Deep Neural Network</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A basic understanding of linear algebra, particularly <code>y=ax+b</code>.</p>
</li>
<li><p>General idea about linear regression and classification.</p>
</li>
<li><p>Familiarity with the concept of deep neural networks.</p>
</li>
</ul>
<h2 id="heading-yaxb">y=ax+b</h2>
<p>A straight line simply means that the output changes steadily as the input changes. There are no surprises (that is, no non-linearity). Let’s analyze it properly.</p>
<pre><code class="lang-plaintext">y =&gt; Output variable
x =&gt; Input variable
a =&gt; Amount by which y changes when x changes (slope)
b =&gt; Value of y when x is 0 (y intercept)
</code></pre>
<p>We can take an example and model it in the same form to understand it better.</p>
<p>Ms. Poly is a math teacher who wants to formulate a study plan for her students to excel in an upcoming final exam. For simplicity, she creates a rule of thumb using only one factor: the number of hours studied per week. It has a direct impact on the marks scored by a student.</p>
<p>Before beginning, she makes certain assumptions:</p>
<ul>
<li><p>Every student is capable of scoring at least 30 without studying.</p>
</li>
<li><p>For every hour a student studies, an additional 3 marks can be scored.</p>
</li>
</ul>
<p>She then comes up with the following equation based on her ideas: <code>y = 3x+30</code></p>
<pre><code class="lang-plaintext">y =&gt; Marks scored.
x =&gt; Number of hours studied.
a=3 =&gt; Increase in marks for every hour studied
b=30 =&gt; Minimum marks
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764650083131/997f2a53-78ac-4b6f-a0c1-b995fb515075.png" alt="Plot of y=3x+30" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>In the above graph, she plots the points based on the results of the equation. As expected, it is a straight line. If she needs the marks scored for <code>9</code> hours of study, she can get it by just substituting <code>x=9</code> in <code>y=3x+30</code>. Note that the data (<code>x</code> and <code>y</code>) are just based on her hunch and aren’t real.</p>
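<p>Her rule of thumb is simple enough to check in code. Here's a quick sketch (the function and variable names are mine):</p>

```python
def predicted_marks(hours, slope=3, intercept=30):
    """Ms. Poly's rule of thumb: y = 3x + 30."""
    return slope * hours + intercept

marks = predicted_marks(9)  # substituting x = 9 gives 3*9 + 30 = 57
```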
<p>But Ms. Poly wants to guide her students on how to prepare for the final exam based on actual data. So she conducts a pop quiz and grades it. In order to formulate a study plan, she interviews her students and collects information on how many hours they study math per week. She creates a table with two columns: number of hours studied (<code>x</code>) per week and marks scored (<code>y</code>). She tries her old formula <code>y=3x+30</code>, but it doesn’t seem to work. Thus, she doesn’t have any sensible equation describing the relation between <code>x</code> and <code>y</code>.</p>
<p>Let’s assume that a new student who hasn’t attended any exam (no <code>y</code> available) joins the class the next day, and Ms. Poly only knows the number of hours dedicated per week (<code>x</code>). How can she answer the question below?</p>
<p><em>If the new student studies for a certain number of hours (</em><code>x</code><em>), what can be the marks scored (</em><code>y</code><em>) in the exam?</em></p>
<p>It’s impossible unless there’s an equation defining the sample data. So, her task is to find one that fits the given points. This process is called curve fitting or regression.</p>
<h2 id="heading-linear-regression">Linear Regression</h2>
<p>The core idea of linear regression is to find a straight line that captures the trend of the existing data to facilitate predictions for new input data. Now, let’s dive straight into the example to understand the concept better.</p>
<p>Ms. Poly is determined to arrive at a solution. She plots the collected data on a graph to get a better picture.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651274954/0aa2dfc2-d846-40e6-872d-e7d5abe598a8.png" alt="Input Data" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>She has absolutely no idea how <code>x</code> and <code>y</code> are related. So, she must figure out a formula, by trial and error, that roughly fits the points. She has to start with an intuitive guess, try to improve it in the subsequent steps and then arrive at the best possible solution.</p>
<p><strong>Trial 1</strong>: Ms. Poly begins with her previous straight line equation.</p>
<p><code>y = 3x+30</code></p>
<p>She substitutes different values of <code>x</code> and plots it alongside the collected input data. This way she can get a clear picture of the differences in her assumption and reality.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651323645/a3e79765-99bc-42be-8836-82119d7fbf66.png" alt="Linear Regression-Trial 1" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Trial 2</strong>: She observes that the line needs a little more slope. This simply means that, in reality, more marks are being scored for every additional hour of study. By changing it from <code>3</code> to <code>4</code>, the equation becomes:</p>
<p><code>y = 4x+30</code></p>
<p>The following graph depicts the new line alongside the sample data:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651379913/42a8fc61-7927-46de-aadf-b691544b9a1b.png" alt="Linear Regression-Trial 2" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Trial 3:</strong> It looks better but she feels there is a need to shift the whole line upwards. This means that higher marks are being scored even if a student doesn’t dedicate any time for math in a week. She decides to retain the previous slope but changes the starting marks by <code>10</code>, thus arriving at:</p>
<p><code>y = 4x+40</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651454435/5fea2d39-8254-48e6-be14-69c803982ec7.png" alt="Linear Regression-Trial 3" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This particular line covers most of the points and can be considered the best possible solution.</p>
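<p>Her trial-and-error can be made systematic by scoring each candidate line numerically, for example with a mean squared error over the collected points. The data points below are made up for illustration (the article only shows them as a plot):</p>

```python
# Hypothetical (hours studied, marks scored) pairs standing in for
# Ms. Poly's collected data.
data = [(1, 45), (2, 47), (3, 52), (4, 57), (5, 59)]

def mse(slope, intercept, points):
    """Mean squared error of the line y = slope*x + intercept."""
    errors = [(slope * x + intercept - y) ** 2 for x, y in points]
    return sum(errors) / len(errors)

# Score the three trials; a lower value means a better fit.
trial1 = mse(3, 30, data)  # y = 3x + 30
trial2 = mse(4, 30, data)  # y = 4x + 30
trial3 = mse(4, 40, data)  # y = 4x + 40
```

<p>On this made-up data the third trial scores far better than the first two, matching what the plots show visually. Minimizing exactly this kind of error is what a training algorithm automates.</p>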
<p>Now, if she wishes to ascertain the marks scored by the new student who studied for <code>3.5</code> hours, she plugs the value into the formula and calculates the answer: <code>y = 4*(3.5)+40 = 54</code></p>
<p>We saw how Ms. Poly arrived at a straight line equation to predict the output for an unknown input. Now she can chalk out a study plan for her class based on the equation.</p>
<p>Here, an expression is formulated to ascertain the change in output when the input changes. It looks like Ms. Poly is thinking like a data scientist. She has in fact modelled a very simple neural network for regression. The equation <code>y=4x+40</code> can be considered as the only neuron (processing unit) within it. She’s adjusted the parameters <code>a</code> (weight) and <code>b</code> (bias) to arrive at the final formula which covers most of the points (thus minimizing the loss).</p>
<p>Here’s a breakdown of the <code>y = 4x+40</code> equation:</p>
<pre><code class="lang-plaintext">y =&gt; Marks scored.
x =&gt; Number of hours studied.
a=4 =&gt; Increase in marks for every hour studied
b=40 =&gt; Minimum marks
</code></pre>
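<p>Her final model can be written as a one-line function. This is just a sketch; the function name is ours, not something from the article:</p>

```python
def predict_marks(hours, a=4, b=40):
    """Ms. Poly's fitted line: marks = a * hours + b."""
    return a * hours + b

print(predict_marks(3.5))  # 4 * 3.5 + 40 = 54.0
```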
<p>At present, it is a rudimentary neural network with no layering and no non-linearity.</p>
<p>Now let’s shift our attention to a completely different scenario. Ms. Poly, being a teacher, wants to ensure that all her students pass the exam. Suppose that, this time, she isn’t interested in predicting the exact marks scored. She just wants to know:</p>
<p><em>If a student studies for a certain number of hours (</em><code>x</code><em>), will the student pass or fail (</em><code>y</code><em>) the exam?</em></p>
<p>This leads her to the process of classification.</p>
<h2 id="heading-linear-classification">Linear Classification</h2>
<p>The linear classification process uses a simple straight line to divide the data into categories or classes. The line acts as a boundary so that the classes fall on either side of it. First, Ms. Poly defines the boundary condition for pass and fail.</p>
<p><em>If marks scored&gt;=50, pass</em></p>
<p><em>If marks scored&lt;50, fail</em></p>
<p>According to the data table, <code>x=3</code> corresponds to <code>y=52</code> (the boundary condition). Therefore she considers <code>x=3</code> as the classification line.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651531018/e669ed7b-1c86-4093-b7e5-feb06464ebfe.png" alt="Linear Classification" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><code>x=3</code> seems to segregate the points into the categories properly. She tries to confirm it by substituting another value. Thus, if a student studied for <code>9</code> hours, the score would lie towards the right side of <code>x=3</code>. So, they’d pass as per the classification equation.</p>
<p>Again, she’s arrived at an expression to ascertain the change in output when the input changes. But here, she has modelled a basic neural network for classification. The equation <code>x=3</code> is the only neuron within it. It can be considered to have two parts, as explained below.</p>
<ol>
<li><p><strong>Pre-Activation Part:</strong> This portion of the neuron computes an intermediate value which is helpful in further processing. She’s figured out the parameters <code>a</code> (weight) and <code>b</code> (bias) to arrive at the following formula: <code>z = x-3</code></p>
<pre><code class="lang-plaintext"> z =&gt; Intermediate Value.
 x =&gt; Number of hours studied.
 a=1 =&gt; Influence of the number of hours studied on the marks scored
 b=-3 =&gt; Minimum number of hours to study to pass the exam = 3
</code></pre>
</li>
<li><p><strong>Activation Part:</strong> This portion triggers the neuron to make decisions based on a threshold value. The following equation segregates the points into two classes.</p>
<pre><code class="lang-plaintext"> y = 1 (Pass) if z&gt;=0
 y = 0 (Fail) if z&lt;0
</code></pre>
</li>
</ol>
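<p>The two parts of this classification neuron can be sketched together as a single function (an illustrative sketch, with the step activation written out explicitly):</p>

```python
def passes_exam(hours, a=1, b=-3):
    """One neuron: pre-activation z = a*x + b, then a step activation."""
    z = a * hours + b          # pre-activation: z = x - 3
    return 1 if z >= 0 else 0  # activation: 1 = pass, 0 = fail

print(passes_exam(9))  # 1 (pass: 9 hours lies to the right of x = 3)
print(passes_exam(2))  # 0 (fail)
```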
<p>This is a very plain neural network with no layering and no non-linearity, but it does have pre-activation and activation parts inside a neuron.</p>
<h2 id="heading-comparison">Comparison</h2>
<p>We looked at the examples of both linear regression and classification used by Ms. Poly. <strong>Regression</strong> helps in predicting a value while <strong>Classification</strong> helps in decision making. Let’s draw a small table to summarize the differences.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764652317811/f4411011-fcd3-4a53-b116-a3c8a27c81d8.png" alt="Comparison between Linear Regression and Classification" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Upon careful observation, we notice that both answer the question of how a change in input affects the output, just at a slightly higher level of complexity than a bare straight line: in both regression and classification, we figure out the equation parameters by trial and error.</p>
<p>Here, since the requirements are simple, Ms. Poly just uses a straight line to solve both. A simple linear equation can handle only one steady trend. But in real life, problems that need solving are far more challenging and unpredictable. Some examples are:</p>
<p><strong>Image Classification</strong>: An output label is produced based on the input images.</p>
<p><strong>Text Translation</strong>: An English sentence can be given as an input to be translated to say, Spanish.</p>
<p><strong>Chatbots</strong>: A text prompt is typed in by a user and a meaningful and relevant output is generated.</p>
<p>If both the data and the task were this complex, she would probably have to use a deep neural network. That presents another question: <strong>How does one build a deep neural network?</strong></p>
<p>We will explore it further by extending the same example to a more realistic version.</p>
<h2 id="heading-key-additions-to-help-build-deep-neural-networks">Key Additions to Help Build Deep Neural Networks</h2>
<p>In the above sections, we noted that Ms. Poly was interested in predicting the exam results of a student using just one factor: the number of hours studied. However, in practice, is that one factor sufficient to determine the marks scored, or whether the student passes the exam?</p>
<p>No. It’s not enough. She needs to take into account a lot of aspects like:</p>
<ul>
<li><p>Number of hours studied</p>
</li>
<li><p>Number of hours of sleep/rest</p>
</li>
<li><p>Burnout due to over-studying</p>
</li>
<li><p>Difficulty level of topics in math</p>
</li>
<li><p>Pattern of the exam, and so on.</p>
</li>
</ul>
<p>All of the above neither act independently nor have a simple linear relation with the marks scored. So, she has to solve this problem by stacking the contributing factors one above the other in layers and also adding the element of non-linearity. Let’s take a look at each in detail.</p>
<h3 id="heading-layering">Layering</h3>
<p>Burnout leads to a lower score, whereas good sleep increases it. But burnout can be reduced if the student is well rested. So, the impact on the final score when these two factors interact should be taken into account. This is possible only when the system solves the problem in layers: the first layer can deal with how the factors independently influence the score, while the next layer can explore the interaction between them.</p>
<h3 id="heading-non-linearity">Non-Linearity</h3>
<p>If the number of hours studied increases, the score might increase, but when burnout overpowers the effect of study hours, the score drops. The combined effect results in a non-linear graph: there is a rise and then a dip in the score based on the number of hours studied. It’s evident that the relationship is not as straightforward as a straight line. That’s where it becomes necessary to add non-linearity to the calculations. It helps the system respond differently according to the conditions, allowing for flexibility in dealing with real-world data.</p>
<p>Thus, Ms. Poly would have to extend the idea of linear regression/classification by including layering and non-linearity to build a fully functional neural network to help build a practical study plan.</p>
<h2 id="heading-modelling-a-deep-neural-network">Modelling a Deep Neural Network</h2>
<p>Ms. Poly should start the work on modelling a deep neural network by following the steps mentioned below:</p>
<h3 id="heading-step-1-define-the-problem-clearly"><strong>Step #1 - Define the Problem Clearly</strong></h3>
<p>The following factors should be considered before she begins the process of modelling:</p>
<ul>
<li><p>What are the input features?</p>
</li>
<li><p>What are the output features?</p>
</li>
<li><p>What type of problem is it (regression/classification)?</p>
</li>
</ul>
<h3 id="heading-step-2-define-the-input-layer"><strong>Step #2 - Define the Input Layer</strong></h3>
<p>The input features form the first layer. There is no computation in this stage. They are represented as:</p>
<pre><code class="lang-plaintext">x1: Number of hours studied
x2: Number of hours of sleep/rest
x3: Burnout due to over-studying
x4: Difficulty level of topics in Maths
x5: Pattern of the exam
</code></pre>
<h3 id="heading-step-3-define-the-first-hidden-layer"><strong>Step #3 - Define the First Hidden Layer</strong></h3>
<p>This step consists of two parts:</p>
<p><strong>Apply Linear Transformation</strong>: The actual learning begins here. A straight line equation is used to understand the combined effect of the inputs. The general formula is <code>z=Wx+b</code>.</p>
<pre><code class="lang-plaintext">z: Intermediate value or pre-activation
W: Weight matrix consisting of values corresponding to the impact of each input feature
x: Vector consisting of the input features, [x1, x2, x3, x4, x5]
b: Bias, which represents the teacher's initial assumption (when x=0)
</code></pre>
<p>It looks similar to a linear regression/classification equation. At first, <code>W</code> and <code>b</code> are initialized to random values. Then, in the subsequent steps, they are adjusted as was done in the earlier examples. Assuming we have two neurons in this layer, we can consider the following combinations:</p>
<p><strong>Neuron 1:</strong> It can focus on study hours, burnout, and rest, with other features contributing less significantly.</p>
<p><strong>Neuron 2</strong>: It can emphasize more on the difficulty level of the topic and the exam type compared to other inputs.</p>
<p>It’s important to note that this layer doesn’t calculate the interactions between the features; it only learns how different linear combinations of them contribute independently, with those independent contributions added together. We don’t yet know how one input feature influences another. For example, we know sleep increases the score and burnout reduces it, but what we don’t know at this stage is whether sleep reduces burnout, which in turn can influence the final score.</p>
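<p>A minimal sketch of the pre-activation computed by one neuron, using plain Python lists. The weight values below are made up for illustration; they are not taken from the article:</p>

```python
def pre_activation(x, w, b):
    """z = w . x + b for a single neuron over all input features."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Input features: [study hours, sleep, burnout, topic difficulty, exam pattern]
x = [6, 7, 2, 5, 3]

# Hypothetical "Neuron 1": weighs study hours, sleep, and burnout heavily.
z1 = pre_activation(x, [0.8, 0.5, -0.6, 0.1, 0.1], 0.0)
print(z1)
```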
<p><strong>Add Non-Linearity</strong>: This step, also called activation, helps in capturing the complexities in different combinations of the features. Less study results in low marks, and too much burnout also results in low marks. It means there is a curve in the score graph which can’t be represented by a linear equation. The activation function is applied to the intermediate value and can be expressed as:</p>
<p><strong>a = g(z)</strong></p>
<pre><code class="lang-plaintext">a: Activation output
g: Activation function
z: Intermediate value or Pre-activation
</code></pre>
<p>For example, <code>ReLU</code> is an activation function which outputs <code>z</code> only if <code>z</code> is positive, and <code>0</code> otherwise:</p>
<p><strong>a = ReLU(z) = max(0, z)</strong></p>
<p>We can see that it has no steady slope and is a non-linear activation function. It suits this scenario as it lets the value pass through to the next layer only if the combined effect of the features is greater than 0. Neuron 1 will let its output go to the next layer only if the intermediate value (<code>z</code>) that results from study hours, burnout, and rest is large enough to influence the final decision; otherwise it’s ignored. There are multiple non-linear activation functions that one can choose from.</p>
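<p>ReLU itself is a one-liner in Python:</p>

```python
def relu(z):
    """ReLU activation: pass z through only if it is positive, else 0."""
    return max(0.0, z)

print(relu(7.9))   # 7.9
print(relu(-2.5))  # 0.0
```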
<h3 id="heading-step-4-stack-layers-one-above-the-other"><strong>Step #4 - Stack Layers One Above the Other</strong></h3>
<p>This step helps in learning the mutual interactions between the inferences drawn in the first hidden layer. The network attempts to understand the intricate details of the influencing factors and build a stable system. It is here that details such as whether sleep reduces burnout are figured out. Every layer applies a linear and a non-linear transformation to its input, which consists of the values obtained from the previous layer. Multiple layers can be stacked one over the other in this way. In this example, for representation, we have taken two hidden layers with two neurons each; the number of layers and neurons can vary based on requirements.</p>
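<p>Stacking layers just means feeding each layer’s activations into the next. Here’s a sketch with two hidden layers of two neurons each; all weight and bias values are arbitrary illustrative numbers, not ones from the article:</p>

```python
def relu(z):
    return max(0.0, z)

def layer(inputs, weights, biases):
    """One hidden layer: linear transformation z = Wx + b followed by ReLU."""
    return [relu(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [6, 7, 2, 5, 3]  # the five input features

# First hidden layer: independent linear combinations of the raw inputs.
h1 = layer(x, [[0.8, 0.5, -0.6, 0.1, 0.1],
               [0.1, 0.1, 0.1, 0.7, 0.6]], [0.0, 0.0])

# Second hidden layer: interactions between the first layer's outputs.
h2 = layer(h1, [[0.5, 0.5], [0.3, -0.4]], [0.0, 0.0])
print(h2)
```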
<h3 id="heading-step-5-define-the-output-features"><strong>Step #5 - Define the Output Feature(s)</strong></h3>
<p>This is the final stage of the deep neural network. Ms. Poly can decide what she wants as output: the marks scored by a student, or whether the student passes or fails the exam. If she wants the final marks, she just has to apply a linear transformation in the final layer’s neuron to produce the output. If she wants the pass/fail status, she has to apply both linear and non-linear transformations to achieve the desired result.</p>
<p>The diagram below shows an abstract representation of the deep neural network.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766153114888/1e513840-483a-43cf-b062-ce5af886a04e.png" alt="Abstract Representation of a Deep Neural Network" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>The next steps are:</p>
<p><strong>Training the model</strong>: The network is trained in the following way:</p>
<ul>
<li><p>Random weights and biases are assigned to the linear transformation portions of the network.</p>
</li>
<li><p>Then the network makes a prediction which is compared with the expected result.</p>
</li>
<li><p>If there are gaps between the actual result and the predicted result, corrections are made in weights and biases (this step is similar to what was done in linear regression and classification).</p>
</li>
<li><p>The steps above are repeated until the results improve.</p>
</li>
</ul>
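<p>For the simple straight-line case, the training steps above can be sketched with gradient descent standing in for manual trial and error. The sample points are hypothetical, chosen to lie exactly on <code>y = 4x + 40</code> for illustration:</p>

```python
data = [(1, 44), (2, 48), (3, 52), (5, 60)]  # hypothetical (hours, marks) pairs

a, b = 0.0, 0.0             # 1. start from arbitrary weight and bias
lr = 0.01                   # learning rate: how big each correction is

for _ in range(20_000):     # 4. repeat until the results improve
    for x, y in data:
        pred = a * x + b    # 2. make a prediction
        err = pred - y      # 3. compare it with the expected result...
        a -= lr * err * x   #    ...and nudge the weight to close the gap
        b -= lr * err       #    ...and the bias too

print(round(a, 2), round(b, 2))  # converges close to 4.0 and 40.0
```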
<p><strong>Using the model</strong>: After the model has been trained, it is capable of yielding results for new input values.</p>
<h2 id="heading-final-thoughts"><strong>Final Thoughts</strong></h2>
<p>In this article, we began with the basics of a straight line equation. Then we gradually navigated through slightly more elaborate concepts like linear regression and classification. They laid the groundwork for delving into the seemingly mysterious deep neural networks. But they are in fact built by stacking layers of linear transformations and non-linear activations, which help understand sophisticated real world patterns.</p>
<p>Despite all the complexities and layers, we can see that the straight line remains the foundation upon which neural networks are built. As we saw earlier, the equation that a deep neural network begins with is our <em>magical equation:</em> <code>y = ax+b</code>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Set Up CUDA and WSL2 for Windows 11 (including PyTorch and TensorFlow GPU) ]]>
                </title>
                <description>
                    <![CDATA[ If you’re working on complex Machine Learning projects, you’ll need a good Graphics Processing Unit (or GPU) to power everything. And Nvidia is a popular option these days, as it has great compatibility and widespread support. If you’re new to Machin... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-set-up-cuda-and-wsl2-for-windows-11-including-pytorch-and-tensorflow-gpu/</link>
                <guid isPermaLink="false">69309b9e8c594b8177306456</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Windows ]]>
                    </category>
                
                    <category>
                        <![CDATA[ WSL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GPU ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cuda ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Md. Fahim Bin Amin ]]>
                </dc:creator>
                <pubDate>Wed, 03 Dec 2025 20:20:46 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764786287487/f0c28401-ce77-4873-b238-59fc6b737ce7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’re working on complex Machine Learning projects, you’ll need a good Graphics Processing Unit (or GPU) to power everything. And Nvidia is a popular option these days, as it has great compatibility and widespread support.</p>
<p>If you’re new to Machine Learning and are just getting started, then a free <a target="_blank" href="https://www.kaggle.com/">Kaggle</a> or <a target="_blank" href="https://colab.research.google.com/">Colab</a> might be enough for you. But that won’t be the case when you want to go deeper. You’ll need a GPU, which can get costly if you’re continuously using it on the cloud.</p>
<p>But there’s some good news: you can utilize your computer’s Nvidia GPU (GTX/RTX) quite easily and perform machine learning-related tasks right on your local machine. The cool thing is, it won’t cost you anything other than the electricity it uses!</p>
<p>When you’re running Machine Learning models on your local machines, the most suitable operating system is a Linux-based one, like Ubuntu. But Windows has improved a lot for this purpose. If you’re using the latest Windows 11, you can leverage Windows Subsystem for Linux (WSL) and use your GPU directly for Machine Learning-related workflows.</p>
<p>This process can be quite tricky, though, as can making two popular Machine Learning frameworks, TensorFlow and PyTorch, compatible with your system GPU in Windows 11. That’s why I have written this comprehensive guide to ease your pain.</p>
<p>In it, I’ll help you set up CUDA on Windows Subsystem for Linux 2 (WSL2) so you can leverage your Nvidia GPU for machine learning tasks.</p>
<p>By following these steps, you’ll be able to run ML frameworks like TensorFlow and PyTorch with GPU acceleration on Windows 11.</p>
<p>Keep in mind that this guide assumes you have a compatible Nvidia GPU. Make sure to check <a target="_blank" href="https://developer.nvidia.com/cuda-gpus">Nvidia's official compatibility list</a> before proceeding.</p>
<p>I have also prepared a video for you that’ll help you follow proper guidelines throughout this article.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/qOJ49nkU4rY" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p>Also, if this tutorial helps you, then don’t forget to add a star to the GitHub repository <a target="_blank" href="https://github.com/FahimFBA/CUDA-WSL2-Ubuntu-v2">CUDA-WSL2-Ubuntu-v2</a>. If you face any issues or have any suggestions/improvements, then please raise an issue in the GitHub repository. Currently, the live website is available at <a target="_blank" href="https://ml-win11-v2.fahimbinamin.com/">ml-win11-v2.fahimbinamin.com</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-windows-terminal">Windows Terminal</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-windows-powershell-latest-amp-greatest">Windows PowerShell (Latest &amp; Greatest)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configure-windows-terminal">Configure Windows Terminal</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configuration-of-my-computer">Configuration of my computer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cpu-virtualization">CPU Virtualization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-wsl2">Install WSL2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-latest-lts-ubuntu-via-wsl2">Install Latest LTS Ubuntu via WSL2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-update-amp-upgrade-ubuntu-packages">Update &amp; Upgrade Ubuntu Packages</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-and-configure-miniconda">Install and Configure Miniconda</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-jupyter-amp-ipykernel">Install Jupyter &amp; Ipykernel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-nvidia-driver">Nvidia Driver</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-cuda-dependencies">Install CUDA dependencies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cuda-toolkit">CUDA Toolkit</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-add-path-to-shell-profile-for-cuda">Add Path to Shell Profile for CUDA</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-nvcc-version">nvcc Version</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cudnn-sdk">cuDNN SDK</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tensorflow-gpu">TensorFlow GPU</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-check-tensorflow-gpu">Check TensorFlow GPU</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-pytorch-gpu">PyTorch GPU</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-check-pytorch-gpu">Check PyTorch GPU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-check-pytorch-amp-tensorflow-gpu-inside-jupyter-notebook">Check PyTorch &amp; TensorFlow GPU inside Jupyter Notebook</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, make sure you have the following requirements met:</p>
<ul>
<li><p>Windows 11 operating system</p>
</li>
<li><p>Nvidia GPU (GTX/RTX series)</p>
</li>
<li><p>Administrator access to your PC</p>
</li>
<li><p>At least 30 GB of free disk space</p>
</li>
<li><p>Internet connection for downloads</p>
</li>
<li><p>Latest Nvidia drivers installed</p>
</li>
</ul>
<h2 id="heading-windows-terminal">Windows Terminal</h2>
<p>First, you’ll need to ensure that you have Windows Terminal installed properly in your operating system. It is the newest terminal application for users of command-line tools and shells like Command Prompt, PowerShell, and WSL. You can download it from the <a target="_blank" href="https://apps.microsoft.com/detail/9N0DX20HK701?hl=en-us&amp;gl=BD&amp;ocid=pdpshare">Microsoft Store</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094104150/c73ae561-6888-4eea-9419-186c6659a62f.png" alt="Preview of Windows Terminal on Windows 11" class="image--center mx-auto" width="1133" height="641" loading="lazy"></p>
<p>After ensuring that it’s installed properly, you can proceed to the next steps.</p>
<h2 id="heading-windows-powershell-latest-amp-greatest">Windows PowerShell (Latest &amp; Greatest)</h2>
<p>Windows PowerShell is a modern, updated command-line shell from Microsoft. You can use some Linux-specific commands directly in it, and it comes with built-in command suggestions. You can download it from the <a target="_blank" href="https://github.com/PowerShell/PowerShell/releases/">official GitHub page</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094138179/78315197-f4f2-4df4-b022-37cb9e74cda2.png" alt="Preview of Windows PowerShell on GitHub" class="image--center mx-auto" width="1519" height="904" loading="lazy"></p>
<p>Download the latest x64 installer and install it. After ensuring that it is installed properly, you can proceed to the next steps.</p>
<h2 id="heading-configure-windows-terminal">Configure Windows Terminal</h2>
<p>Now you’ll need to configure your Windows Terminal to use PowerShell as the default shell. This step is optional and you can skip it, but I recommend doing it for a better experience.</p>
<p>Open Windows Terminal. Click on the down arrow icon in the title bar and select "Settings".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094162440/6ea767c8-da3b-4280-84f8-0eb2b0647a46.png" alt="Preview of Windows PowerShell settings window" class="image--center mx-auto" width="1166" height="660" loading="lazy"></p>
<p>In the Settings tab, under "Startup", find the "Default profile" dropdown menu. Select "PowerShell" from the list.</p>
<p>Now for the "Default terminal application", select "Windows Terminal".</p>
<p>By default, Windows PowerShell always shows the version number in the title bar. If you want to disable it, select the "PowerShell" profile from the left sidebar. Click on the "Command Line" field and add an <code>--nologo</code> argument at the end of the command. After this, the line becomes <code>"C:\Program Files\PowerShell\7\pwsh.exe" --nologo</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094185648/3641d5f0-ba34-44b9-8a63-86b53068d02e.png" alt="Preview of Windows PowerShell --nologo setting" class="image--center mx-auto" width="1170" height="654" loading="lazy"></p>
<p>If you don’t use other shells frequently and want to hide them in the dropdown, then you’ll need to select those profiles one by one from the left sidebar. Scroll down to the bottom and find the "Hide profile from dropdown" toggle and enable it. It will hide that specific shell from the dropdown menu.</p>
<p>For example, I am hiding the <strong>Azure Cloud Shell</strong> profile as I don't use it frequently:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094214632/73add1b7-bcdd-4368-86a6-975fa2f72b54.png" alt="Preview of hiding profiles in Windows Terminal" class="image--center mx-auto" width="1151" height="657" loading="lazy"></p>
<p>Now click on the "Save" button at the bottom right corner to apply the changes. Close the Windows Terminal for now.</p>
<h2 id="heading-configuration-of-my-computer">Configuration of My Computer</h2>
<p>I figured it’d be helpful to share my current computer’s configuration so you can have a clear idea of which setup I’m using in this guide. Here are the details:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Component</strong></td><td><strong>Specification</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Processor</strong></td><td>AMD Ryzen 7 7700 8-Core Processor (8 Core 16 Threads)</td></tr>
<tr>
<td><strong>RAM</strong></td><td>64GB DDR5 6000MHz</td></tr>
<tr>
<td><strong>Storage</strong></td><td>1 TB Samsung 980 NVMe SSD, 4 TB HDD, 2 TB SATA SSD</td></tr>
<tr>
<td><strong>GPU</strong></td><td>NVIDIA GeForce RTX 3060 12GB GDDR6</td></tr>
<tr>
<td><strong>Operating System</strong></td><td>Windows 11 Pro Version 25H2</td></tr>
</tbody>
</table>
</div><p>Now that you have an idea about my computer’s configuration, we can proceed to the next steps.</p>
<h2 id="heading-cpu-virtualization">CPU Virtualization</h2>
<p>As we are going to use WSL2, we’ll need to make sure that the CPU virtualization is enabled. To check whether virtualization is enabled or not from Windows, simply open the Windows Task Manager. Go to the Performance tab and select CPU from the left sidebar. In the bottom right corner, you will see the Virtualization status. If it shows "Enabled", then you are good to go. If it shows "Disabled", then you need to enable it from the BIOS.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094252181/29efa40c-ec0a-4d99-adb7-50596348a1aa.png" alt="Preview of Virtualization enabled status in Windows Task Manager" class="image--center mx-auto" width="824" height="760" loading="lazy"></p>
<p>⚠️ You have to ensure that CPU Virtualization is enabled in your BIOS settings. Different manufacturers have different ways to access the BIOS. Usually, you can access the BIOS by pressing the Delete or F2 key during the boot process. Once in BIOS, look for settings related to "Virtualization Technology" or "Intel VT-x"/"AMD-V" and make sure it is enabled. Save the changes and exit the BIOS.</p>
<h2 id="heading-install-wsl2">Install WSL2</h2>
<p>Open the Windows Terminal or Windows PowerShell as an administrator. Run the following command to install WSL2 along with the latest Ubuntu LTS distribution:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-install</span>
</code></pre>
<p>It will install Windows Subsystem for Linux 2 (WSL2). After the installation is complete, you will be prompted to restart your computer. Do so to finalize the installation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094306994/41db30c0-ecb9-4436-a425-8a059b199c42.png" alt="Preview of WSL installation in Windows PowerShell" class="image--center mx-auto" width="1295" height="656" loading="lazy"></p>
<p>⚠️ If you encounter any issues during installation, refer to the <a target="_blank" href="https://learn.microsoft.com/en-us/windows/wsl/troubleshooting">official Microsoft documentation</a> for troubleshooting WSL installation problems.</p>
<h2 id="heading-install-latest-lts-ubuntu-via-wsl2">Install Latest LTS Ubuntu via WSL2</h2>
<p>Open the Windows Terminal or Windows PowerShell again with the administrator privileges. If you want to check the available Linux distributions to install via WSL, run the following command:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-list</span> -<span class="hljs-literal">-online</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094455888/8f1f2382-41cc-410f-a7b9-a47d3bb634b6.png" alt="Preview of available WSL distributions in Windows PowerShell" class="image--center mx-auto" width="1291" height="660" loading="lazy"></p>
<p>For installing any specific distribution, run the following command:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-install</span> &lt;DistroName&gt;
</code></pre>
<p>We are going to install the latest LTS Ubuntu distribution. As of now, the latest LTS version is Ubuntu 24.04, but I prefer to install <code>Ubuntu</code> directly, as it always points to the latest LTS version. So, run the following command:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-install</span> Ubuntu
</code></pre>
<p>You need to give it a default user account name. For me, I am going with <code>fahim</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094505280/9beb24de-54da-4e0c-993d-b15f985867e3.png" alt="Preview of Ubuntu installation in Windows PowerShell" class="image--center mx-auto" width="1666" height="858" loading="lazy"></p>
<p>It also comes with a nice GUI management tool for WSL.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094530944/89073fb9-881f-48bd-b5ef-a0b08f74e4c5.png" alt="Preview of WSL GUI management tool" class="image--center mx-auto" width="1114" height="724" loading="lazy"></p>
<p>From the settings GUI window, you can configure many options, including limits on CPU cores, RAM, disk space, and other resources.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094551095/66aea1e1-e204-4115-80e0-b3dea2d7a2ac.png" alt="Preview of WSL GUI settings window (Memory &amp; Processor)" class="image--center mx-auto" width="1919" height="1024" loading="lazy"></p>
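<p>If you prefer configuration as code, the same limits can also be set globally in a <code>.wslconfig</code> file in your Windows user profile folder; WSL reads it after a <code>wsl --shutdown</code> and restart. The values below are illustrative examples, not recommendations:</p>

```ini
# %UserProfile%\.wslconfig (applies to all WSL2 distributions)
[wsl2]
memory=8GB       # cap the RAM available to WSL2
processors=4     # cap the number of virtual processors
swap=2GB         # size of the swap file
```

<p>The changes take effect the next time WSL starts.</p>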
<h2 id="heading-update-amp-upgrade-ubuntu-packages">Update &amp; Upgrade Ubuntu Packages</h2>
<p>Open your Ubuntu terminal from Windows Terminal. First, we need to update and upgrade the existing packages to their latest versions.</p>
<p>To refresh the list of available packages, use the following command:</p>
<pre><code class="lang-bash">sudo apt update -y
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094594281/be41e056-7e55-4139-b84b-6b7921a2d435.png" alt="Preview of apt update command in Ubuntu terminal" class="image--center mx-auto" width="1649" height="888" loading="lazy"></p>
<p>To upgrade all the packages at once, simply use the following command:</p>
<pre><code class="lang-bash">sudo apt upgrade -y
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094627958/b1c17b1c-5290-470b-aafe-5b89bb03bd01.png" alt="Preview of apt upgrade command in Ubuntu terminal" class="image--center mx-auto" width="1659" height="934" loading="lazy"></p>
<p>⚠️ Make sure that you have a stable internet connection during the update and upgrade process to avoid any interruptions.</p>
<h2 id="heading-install-and-configure-miniconda">Install and Configure Miniconda</h2>
<p>In Machine Learning, we need to manage multiple environments with different package versions. Conda is a popular package and environment management system that makes it easy to create and manage isolated environments for different projects. We will install Miniconda, a minimal installer for Conda, to manage our Python environments. But if you prefer Anaconda, you can install it instead.</p>
<p>Go to the official Miniconda website. Currently, the Miniconda installer is hosted on the Anaconda site <a target="_blank" href="https://www.anaconda.com/docs/getting-started/miniconda/install">here</a>. If the official website gets updated, you can always search for "Miniconda installer" on Google to find the latest version. You can also create an issue in the <a target="_blank" href="https://github.com/FahimFBA/CUDA-WSL2-Ubuntu-v2/issues">official GitHub repository of this project</a> to notify me about it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094667031/7ee2c854-88b6-49ce-8c04-41bf0a052c90.png" alt="Preview of Miniconda official website" class="image--center mx-auto" width="1895" height="935" loading="lazy"></p>
<p>As we are installing it inside WSL, select the macOS/Linux installation, then the Linux Terminal Installer, and choose Linux x86 to get the download command:</p>
<pre><code class="lang-bash">wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
</code></pre>
<p>It will download the installer to your WSL directory. Then use the following command to install it properly:</p>
<pre><code class="lang-bash">bash ~/Miniconda3-latest-Linux-x86_64.sh
</code></pre>
<p>⚠️ Make sure that you are in the correct directory where the installer is downloaded. If you downloaded it to a different location, adjust the path accordingly. Also, replace bash with zsh or sh if you are using a different shell.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094706995/3a317eb9-0340-4a84-8826-45324c93dd2f.png" alt="Preview of Miniconda installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1794" height="922" loading="lazy"></p>
<p>Make sure to choose the initialization option properly. I prefer to keep the conda env active whenever I open a new shell. Therefore, I chose "Yes".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094727839/f3fc8902-0c37-432c-a912-a92810e89fd1.png" alt="Preview of Miniconda initialization option during installation" class="image--center mx-auto" width="1656" height="924" loading="lazy"></p>
<p>Make sure that the installation succeeds without any errors.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094754454/53dfd998-62c9-4c2a-a71e-0d33e123e027.png" alt="Preview of successful Miniconda installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1652" height="914" loading="lazy"></p>
<p>For the changes to take effect, you can close and reopen the current shell, or apply them immediately by running the command below:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">source</span> ~/.bashrc
</code></pre>
<p>⚠️ If you’re using a different shell like zsh or fish, make sure to source the appropriate configuration file (e.g., ~/.zshrc for zsh).</p>
<h2 id="heading-install-jupyter-amp-ipykernel">Install Jupyter &amp; Ipykernel</h2>
<p>I prefer to use Jupyter Notebook for running my machine learning experiments. It provides an interactive environment for coding and data analysis. We’ll install Jupyter Notebook and Ipykernel to run Jupyter notebooks in our conda environment. We will do that in every conda environment, starting with the <strong>base</strong> environment. Installing Ipykernel in each environment is what lets Jupyter Notebook expose that environment as a selectable kernel.</p>
<p>First, make sure that you are in the base conda environment. You will see (base) on the left side of the terminal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094812122/66ad5de8-7553-42da-b920-78d20c3bdc9a.png" alt="Preview of conda base environment in WSL Ubuntu terminal" class="image--center mx-auto" width="1917" height="1027" loading="lazy"></p>
<p>Now install Jupyter and Ipykernel both by applying the following command:</p>
<pre><code class="lang-bash">conda install jupyter ipykernel -y
</code></pre>
<p>Make sure that you accept the terms of service of Conda.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094839808/90fe3dcf-053d-4bc7-a031-22f81eb706ca.png" alt="Preview of Jupyter and Ipykernel installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1659" height="927" loading="lazy"></p>
<p>Now, I will create a separate conda environment for both TensorFlow GPU and PyTorch GPU. You can install them directly in the base environment, or in any other environment, as you prefer. I am not specifying a Python version while creating the environment, so it will automatically install the latest stable version of Python.</p>
<pre><code class="lang-bash">conda create --name ml -y
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094865498/ac9ef1f1-4494-4221-8376-5e257c4f9243.png" alt="Preview of creating a new conda environment named 'ml' in WSL Ubuntu terminal" class="image--center mx-auto" width="1659" height="925" loading="lazy"></p>
<p>To activate any specific conda environment, you have to use the following command:</p>
<pre><code class="lang-bash">conda activate &lt;conda-env-name&gt;
</code></pre>
<p>For example, if I want to activate my newly created <strong>ml</strong> environment, I will use this command:</p>
<pre><code class="lang-bash">conda activate ml
</code></pre>
<p>If you’re not sure which conda environments exist on your system, you can list them all by running the following command:</p>
<pre><code class="lang-bash">conda env list
</code></pre>
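<p>If you later want a specific environment (for example the <code>ml</code> one) to appear as its own kernel inside Jupyter Notebook, you can register it with Ipykernel after activating the environment. The name and display name below are just examples:</p>

```shell
# run inside the activated conda environment (ipykernel must be installed there)
python -m ipykernel install --user --name ml --display-name "Python (ml)"
```

<p>Afterwards, the "Python (ml)" kernel shows up in Jupyter's kernel picker.</p>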
<h2 id="heading-nvidia-driver">Nvidia Driver</h2>
<p>Ensure that you have the latest Nvidia drivers installed on Windows. WSL2 uses the Windows driver, so no separate driver installation is needed in Ubuntu. You can download the latest drivers from the <a target="_blank" href="https://www.nvidia.com/Download/index.aspx">official Nvidia website</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094915617/cd9b0bfc-77a1-45f1-9dab-4349c8f489ef.png" alt="Preview of Nvidia driver download page" class="image--center mx-auto" width="1750" height="916" loading="lazy"></p>
<p>After installing or updating the driver, restart your computer to ensure the changes take effect. You can use either the GeForce Game Ready Driver or the NVIDIA Studio Driver, but I recommend the Studio Driver for better stability with creative and ML applications.</p>
<h2 id="heading-install-cuda-dependencies">Install CUDA Dependencies</h2>
<p>You might face some issues if you do not have the CUDA dependencies installed properly. I recommend that you install the required dependencies before proceeding further:</p>
<pre><code class="lang-bash">sudo apt install gcc g++ build-essential
</code></pre>
<p>With these dependencies in place, you can retry the CUDA installation if you ran into any issues earlier.</p>
<h2 id="heading-cuda-toolkit">CUDA Toolkit</h2>
<p>TensorFlow GPU is very picky about the CUDA version. So we need to install a specific version of CUDA Toolkit that is compatible with the TensorFlow version we are going to install.</p>
<p>To understand exactly which CUDA version is compatible with which TensorFlow version, you can check the official TensorFlow GPU support matrix <a target="_blank" href="https://www.tensorflow.org/install/pip">here</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095089103/87a44961-9426-4d20-95ac-cde06961b41a.png" alt="Preview of TensorFlow GPU support in official docs" class="image--center mx-auto" width="1879" height="931" loading="lazy"></p>
<p>At the time I’m writing this article, the TensorFlow GPU documentation says that we should have CUDA Toolkit 12.3. So I will ensure that I install exactly that version. You can simply click on that version link in the official docs and it will redirect you to the official Nvidia CUDA Toolkit download page. But if the link gets updated in the future, you can always search for "Nvidia CUDA Toolkit" on Google to find the latest version.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095106589/19689d63-5ebd-4783-8da4-e3dedd277efb.png" alt="Preview of Nvidia CUDA Toolkit official website" class="image--center mx-auto" width="1620" height="925" loading="lazy"></p>
<p>Since TensorFlow GPU requires version 12.3 specifically, I will select version 12.3.0.</p>
<p>On the CUDA Toolkit download page, make sure to choose the Operating System as Linux, Architecture as x86_64, Distribution as WSL-Ubuntu, Version as 2.0, and the Installer Type as runfile (local).</p>
<p>⚠️ As we are using Ubuntu in our WSL2, you can also choose Ubuntu as your operating system. But I prefer to choose WSL-Ubuntu for better compatibility.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095151533/b6996611-d4ce-4e07-9c73-30bdc93dbf19.png" alt="Preview of CUDA Toolkit 12.3 download page for WSL-Ubuntu" class="image--center mx-auto" width="1311" height="898" loading="lazy"></p>
<p>After selecting those, it will give you the download commands, which you have to run sequentially. Make sure that you <strong>leave "Kernel Objects" unchecked during the CUDA installation</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095169368/c2f81594-536f-4788-b765-1aab3b040fa7.png" alt="Preview of CUDA Toolkit 12.3 download commands for WSL-Ubuntu" class="image--center mx-auto" width="1895" height="1001" loading="lazy"></p>
<p>⚠️ Make sure to copy and paste the commands one by one in your WSL Ubuntu terminal to download and install the CUDA Toolkit properly. If you face any issues related to CUDA dependency, then quickly go through the <a class="post-section-overview" href="#heading-install-cuda-dependencies">Install CUDA dependencies</a> section, where I have explained how to install the CUDA dependencies properly.</p>
<h2 id="heading-add-path-to-shell-profile-for-cuda">Add Path to Shell Profile for CUDA</h2>
<p>After installing CUDA Toolkit, we need to add the CUDA binaries to our shell profile for easy access. This will allow us to run CUDA commands from any directory in the terminal.</p>
<p>Note that, depending on the shell you are using (bash, zsh, and so on), you need to add the CUDA path to the appropriate configuration file. Make sure to replace <strong>.bashrc</strong> with <strong>.zshrc</strong> or other configuration files if you are using a different shell.</p>
<p>To add the CUDA binary path, follow the command below:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">'export PATH=/usr/local/cuda-12.3/bin:$PATH'</span> &gt;&gt; ~/.bashrc
</code></pre>
<p>Use the actual path where CUDA was installed; your terminal prints it at the end of the CUDA installation:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095215437/15768563-c956-472e-9633-95b3dd1cb7a3.png" alt="Preview of CUDA installation path in WSL Ubuntu terminal" class="image--center mx-auto" width="1912" height="1011" loading="lazy"></p>
<p>Next, add the CUDA libraries to the library path, again using the exact path where you installed CUDA (your terminal lists it as well):</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">'export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH'</span> &gt;&gt; ~/.bashrc
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095242744/3c708db4-d267-4043-aa11-d04d890904f9.png" alt="Preview of CUDA library path in WSL Ubuntu terminal" class="image--center mx-auto" width="1284" height="693" loading="lazy"></p>
<p>After adding those paths, you need to source the shell profile for the changes to take effect. You can do that by running the following command:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">source</span> ~/.bashrc
</code></pre>
<h2 id="heading-nvcc-version">nvcc Version</h2>
<p>NVCC stands for Nvidia CUDA Compiler. It is a compiler driver for the CUDA platform that allows developers to write parallel programs that run on Nvidia GPUs. Since we have already installed the CUDA Toolkit, we should confirm that the compiler is working as well.</p>
<p>Verify that CUDA is properly installed by checking the version:</p>
<pre><code class="lang-bash">nvcc --version
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095277858/2d1ded0a-01ac-4f78-9f6c-ac499d623207.png" alt="Preview of nvcc version check in WSL Ubuntu terminal" class="image--center mx-auto" width="1839" height="946" loading="lazy"></p>
<p>If the output shows the correct CUDA version, then you have successfully installed CUDA Toolkit in your WSL2 Ubuntu environment.</p>
<h2 id="heading-cudnn-sdk">cuDNN SDK</h2>
<p>The cuDNN (CUDA Deep Neural Network) SDK is a <a target="_blank" href="https://developer.nvidia.com/cudnn">GPU accelerated library of primitives for deep neural networks</a>, developed by Nvidia. It provides highly optimized building blocks for common deep learning operations, significantly speeding up the training and inference processes of AI models on Nvidia GPUs.</p>
<p>Note: Even though TensorFlow GPU suggests a specific cuDNN version, it’s often compatible with multiple versions. Because of this, I recommend downloading the latest cuDNN version that is compatible with your installed CUDA version. You can find the cuDNN download page <a target="_blank" href="https://developer.nvidia.com/cudnn-downloads">here</a>.</p>
<p>Select the Operating System as Linux, Architecture as x86_64, Distribution as Ubuntu, Version as 24.04, Installer Type as deb (local), Configuration as FULL. After selecting those, it will give you the download commands. You have to apply them sequentially.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095312370/1fca5959-f492-4160-8027-deec0674863b.png" alt="Preview of cuDNN download commands for Ubuntu 24.04" class="image--center mx-auto" width="1543" height="938" loading="lazy"></p>
<p>⚠️ Make sure to copy and paste the commands one by one in your WSL Ubuntu terminal to download and install the cuDNN SDK properly. If you face any issues related to CUDA dependency, then quickly go through the <a class="post-section-overview" href="#heading-install-cuda-dependencies">Install CUDA dependencies</a> section, where I have explained how to install the CUDA dependencies properly.</p>
<h2 id="heading-tensorflow-gpu">TensorFlow GPU</h2>
<p>Now, we are going to install TensorFlow GPU in our conda environment. Make sure that you have activated the conda environment where you want to install it. I’m going to install it in my previously created <strong>ml</strong> environment. To activate it, I’ll use the following command:</p>
<pre><code class="lang-bash">conda activate ml
</code></pre>
<p>⚠️ Make sure that you have activated the correct conda environment before installing TensorFlow GPU. You will see the environment name in the terminal prompt.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095398777/0c7d8813-eb6c-4e2e-bad9-1fc7d344d7a2.png" alt="Preview of activating 'ml' conda environment in WSL Ubuntu terminal" class="image--center mx-auto" width="1227" height="692" loading="lazy"></p>
<p>I will install ipykernel and jupyter in this new environment.</p>
<pre><code class="lang-bash">conda install jupyter ipykernel -y
</code></pre>
<p>Now, to install TensorFlow GPU, I will simply use the following command:</p>
<pre><code class="lang-bash">pip install tensorflow[and-cuda]
</code></pre>
<p>It might take a couple of minutes depending on the internet speed you have. Just have patience and wait for it to finish the installation.</p>
<h3 id="heading-check-tensorflow-gpu">Check TensorFlow GPU</h3>
<p>After installing TensorFlow GPU, we need to verify that it is working properly with GPU support. Run the following command in your Ubuntu terminal:</p>
<pre><code class="lang-bash">python3 -c <span class="hljs-string">"import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"</span>
</code></pre>
<p>If the output shows a list of available GPU devices, then TensorFlow GPU is successfully installed and working properly.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095453933/ccda58fc-9ae9-4185-9c78-6196c98d8b7c.png" alt="Preview of TensorFlow GPU check in WSL Ubuntu terminal" width="1903" height="1029" loading="lazy"></p>
<h2 id="heading-pytorch-gpu">PyTorch GPU</h2>
<p>Now, we’re going to install PyTorch GPU in our conda environment. Make sure that you have activated the conda environment where you want to install it. I’m going to install it in my previously created ml environment. To activate it, I will use the following command:</p>
<pre><code class="lang-bash">conda activate ml
</code></pre>
<p>Installing PyTorch GPU is very straightforward. You can use the official PyTorch installation command generator <a target="_blank" href="https://pytorch.org/get-started/locally/">here</a>.</p>
<p>Make sure to select the latest stable PyTorch Build, Your OS as Linux, Package as Pip, and Language as Python. For the Compute Platform, select the CUDA version that matches your installed CUDA Toolkit (for me, CUDA 12.3). If you cannot find the exact version, choose the closest one: as CUDA 12.3 is not available now, I am choosing CUDA 12.6.</p>
<p>After selecting those, it will give you the installation command. You have to apply it in your WSL Ubuntu terminal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095511862/6f631369-c8db-4681-9d1c-669ad88df69d.png" alt="Preview of PyTorch installation command generator" class="image--center mx-auto" width="1618" height="911" loading="lazy"></p>
<p>It might take a couple of minutes depending on the internet speed you have. Just have patience and wait for it to finish the installation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095532246/56232263-36ea-4043-9881-df162965c514.png" alt="Preview of PyTorch GPU installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1280" height="689" loading="lazy"></p>
<h3 id="heading-check-pytorch-gpu">Check PyTorch GPU</h3>
<p>After installing PyTorch GPU, verify that it is working properly with GPU support by running the following commands in your Ubuntu terminal:</p>
<pre><code class="lang-bash">python3 - &lt;&lt; <span class="hljs-string">'EOF'</span>
import torch
<span class="hljs-built_in">print</span>(torch.cuda.is_available())
<span class="hljs-built_in">print</span>(torch.cuda.device_count())
<span class="hljs-built_in">print</span>(torch.cuda.current_device())
<span class="hljs-built_in">print</span>(torch.cuda.device(0))
<span class="hljs-built_in">print</span>(torch.cuda.get_device_name(0))
EOF
</code></pre>
<p>The output should look similar to the screenshot, showing:</p>
<ul>
<li><p><strong>True</strong>: GPU is available for PyTorch</p>
</li>
<li><p><strong>1</strong>: Number of detected CUDA devices</p>
</li>
<li><p><strong>0</strong>: Index of the current active CUDA device</p>
</li>
<li><p>A device object representation</p>
</li>
<li><p><strong>NVIDIA GeForce RTX 3060</strong> (or your GPU name)</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095584921/69269152-7ea6-404b-b1ca-8534b51f2491.png" alt="Preview of PyTorch GPU check in WSL Ubuntu terminal" class="image--center mx-auto" width="1917" height="937" loading="lazy"></p>
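<p>Beyond these checks, a common pattern in PyTorch code is to pick the GPU when it is available and fall back to the CPU otherwise, so the same script runs on any machine. A minimal sketch:</p>

```python
import torch

# select the GPU if PyTorch can see one, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# tensors (and models, via .to(device)) must be moved to the device explicitly
x = torch.rand(2, 3, device=device)
y = (x * 2).sum()

print(device.type, y.shape)
```

<p>On a correctly configured GPU setup, <code>device.type</code> will be <code>cuda</code>; on a CPU-only machine the same code still runs unchanged.</p>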
<h3 id="heading-check-pytorch-amp-tensorflow-gpu-inside-jupyter-notebook">Check PyTorch &amp; TensorFlow GPU inside Jupyter Notebook</h3>
<p>Now that the environment is fully configured, we will verify GPU support directly inside Jupyter Notebook. This ensures both PyTorch and TensorFlow can successfully detect and use your GPU.</p>
<h4 id="heading-1-test-pytorch-gpu">1. Test PyTorch GPU</h4>
<p>Create a new Jupyter Notebook and run the following commands one by one:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.device(<span class="hljs-number">0</span>))
print(torch.cuda.get_device_name(<span class="hljs-number">0</span>))
</code></pre>
<p>If everything is configured correctly, you will see your GPU (for example <strong>NVIDIA GeForce RTX 3060</strong>) detected properly:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095624229/f94c97a0-2e44-45ad-a2a8-52f40c922482.png" alt="Preview of PyTorch GPU check inside Jupyter Notebook" class="image--center mx-auto" width="1861" height="743" loading="lazy"></p>
<h4 id="heading-2-test-tensorflow-gpu">2. Test TensorFlow GPU</h4>
<p>Next, run the following code to check whether TensorFlow detects your GPU:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf

print(tf.config.list_physical_devices(<span class="hljs-string">'GPU'</span>))
</code></pre>
<p>You can also check the number of GPUs detected:</p>
<pre><code class="lang-python">print(<span class="hljs-string">"Num GPUs Available:"</span>, len(tf.config.list_physical_devices(<span class="hljs-string">'GPU'</span>)))
</code></pre>
<p>Finally, run TensorFlow’s built-in GPU validation (warnings are normal; note that <code>tf.test.is_gpu_available()</code> is deprecated in recent TensorFlow releases in favor of <code>tf.config.list_physical_devices('GPU')</code>):</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf

<span class="hljs-keyword">assert</span> tf.test.is_gpu_available()
<span class="hljs-keyword">assert</span> tf.test.is_built_with_cuda()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095666216/f9017979-b5c9-4b86-9f60-d9aaa2fe8ac1.png" alt="TensorFlow GPU initialization and CUDA validation output" class="image--center mx-auto" width="1638" height="935" loading="lazy"></p>
<p>If TensorFlow logs show your GPU model (such as <strong>RTX 3060</strong>), then TensorFlow GPU is successfully installed and fully working inside Jupyter Notebook.</p>
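<p>Once TensorFlow detects the GPU, you can also pin individual operations to a device with <code>tf.device</code>; on a CPU-only setup the same code simply runs on <code>/CPU:0</code>. A minimal sketch:</p>

```python
import tensorflow as tf

# choose a device string based on what TensorFlow actually detects
gpus = tf.config.list_physical_devices("GPU")
device = "/GPU:0" if gpus else "/CPU:0"

# run a small matrix multiplication on the chosen device
with tf.device(device):
    a = tf.random.uniform((256, 256))
    b = tf.random.uniform((256, 256))
    c = tf.matmul(a, b)

print(device, c.shape)
```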
<h2 id="heading-conclusion">Conclusion</h2>
<p>Thank you so much for reading all the way through. I hope you have been able to configure your Windows 11 computer properly for running almost any kind of Machine Learning-based experiments.</p>
<p>To get more content like this, you can follow me on <a target="_blank" href="https://www.linkedin.com/in/fahimfba/">LinkedIn</a> and <a target="_blank" href="https://x.com/Fahim_FBA">X</a>. You can also check <a target="_blank" href="https://www.fahimbinamin.com/">my website</a> and follow me on <a target="_blank" href="https://github.com/FahimFBA">GitHub</a> if you are into open source and development.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build End-to-End Machine Learning Lineage ]]>
                </title>
                <description>
                    <![CDATA[ Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance. While many services for tracking ML lineage exist, creating a comprehensive and manageabl... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-end-to-end-machine-learning-lineage/</link>
                <guid isPermaLink="false">68f0f6719ac2ae80d4c5be03</guid>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Thu, 16 Oct 2025 13:43:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760622158648/b990ff01-06f0-495d-8554-f832813609ab.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance.</p>
<p>While many services for tracking ML lineage exist, creating a comprehensive and manageable lineage often proves complicated.</p>
<p>In this article, I’ll walk you through integrating a comprehensive ML lineage solution for an ML application deployed on serverless AWS Lambda, covering the end-to-end pipeline stages:</p>
<ul>
<li><p>ETL pipeline</p>
</li>
<li><p>Data drift detection</p>
</li>
<li><p>Preprocessing</p>
</li>
<li><p>Model tuning</p>
</li>
<li><p>Risk and fairness evaluation</p>
</li>
</ul>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-machine-learning-lineage">What is Machine Learning Lineage?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-well-build">What We’ll Build</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-the-system-architecture-ai-pricing-for-retailers">The System Architecture - AI Pricing for Retailers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-ml-lineage">The ML Lineage</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-workflow-in-action">Workflow in Action</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-initiating-a-dvc-project">Step 1: Initiating a DVC Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-the-ml-lineage">Step 2: The ML Lineage</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-stage-1-the-etl-pipeline">Stage 1: The ETL Pipeline</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-2-the-data-drift-check">Stage 2: The Data Drift Check</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-3-preprocessing">Stage 3: Preprocessing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-4-tuning-the-model">Stage 4: Tuning the Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-5-performing-inference">Stage 5: Performing Inference</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-6-assessing-model-risk-and-fairness">Stage 6: Assessing Model Risk and Fairness</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-test-in-local">Test in Local</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-deploying-the-dvc-project">Step 3: Deploying the DVC Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-configuring-scheduled-run-with-prefect">Step 4: Configuring Scheduled Run with Prefect</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-configuring-the-docker-image-registry">Configuring the Docker Image Registry</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configure-prefect-tasks-and-flows">Configure Prefect Tasks and Flows</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-test-in-local-1">Test in Local</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-deploying-the-application">Step 5: Deploying the Application</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-test-in-local-2">Test in Local</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h3 id="heading-prerequisites">Prerequisites:</h3>
<ul>
<li><p>Knowledge of key Machine Learning / Deep Learning concepts including the full lifecycle: data handling, model training, tuning, and validation.</p>
</li>
<li><p>Proficiency in Python, with experience using major ML libraries.</p>
</li>
<li><p>Basic understanding of DevOps principles.</p>
</li>
</ul>
<h3 id="heading-tools-well-use">Tools we’ll use:</h3>
<p>Here is a summary of the tools we’re going to use to track the ML lineage:</p>
<ul>
<li><p><strong>DVC</strong>: An open-source version system for data. Used to track the ML lineage.</p>
</li>
<li><p><strong>AWS S3</strong>: A secure object storage service from AWS. Used as a remote storage.</p>
</li>
<li><p><strong>Evidently AI</strong>: An open-source ML and LLM observability framework. Used to detect data drift.</p>
</li>
<li><p><strong>Prefect</strong>: A workflow orchestration engine. Used to manage the schedule run of the lineage.</p>
</li>
</ul>
<h2 id="heading-what-is-machine-learning-lineage">What is Machine Learning Lineage?</h2>
<p><strong>Machine learning (ML) lineage</strong> is a framework for tracking and understanding the complete lifecycle of a machine learning model.</p>
<p>It contains information at different levels such as:</p>
<ul>
<li><p><strong>Code:</strong> The scripts, libraries, and configurations for model training.</p>
</li>
<li><p><strong>Data:</strong> The original data, transformations, and features.</p>
</li>
<li><p><strong>Experiments:</strong> Training runs, hyperparameter tuning results.</p>
</li>
<li><p><strong>Models:</strong> The trained models and their versions.</p>
</li>
<li><p><strong>Predictions:</strong> The outputs of deployed models.</p>
</li>
</ul>
<p>ML lineage is essential for multiple reasons:</p>
<ul>
<li><p><strong>Reproducibility:</strong> Recreate the same model and prediction for validation.</p>
</li>
<li><p><strong>Root cause analysis:</strong> Trace back to the data, code, or configuration change when a model fails in production.</p>
</li>
<li><p><strong>Compliance:</strong> Some regulated industries require proof of model training to ensure fairness, transparency, and adherence to laws like GDPR and the EU AI Act.</p>
</li>
</ul>
<h2 id="heading-what-well-build">What We’ll Build</h2>
<p>In this project, I’ll integrate an ML lineage into <a target="_blank" href="https://levelup.gitconnected.com/building-a-dynamic-pricing-system-with-a-multi-layered-neural-network-c2a4c70bfcec">this price prediction system built on AWS Lambda architecture</a> using DVC, an open-source version control system for ML applications.</p>
<p>The diagram below illustrates the system architecture and the ML lineage we’ll integrate:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759825040233/5027e5dd-a2fc-4d35-b7a3-4d9184f5f179.png" alt="Figure A. A comprehensive ML lineage for an ML application on serverless Lambda (Created by Kuriko IWAI)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Figure A:</strong> A comprehensive ML lineage for an ML application on serverless Lambda (Created by <a target="_blank" href="https://kuriko-iwai.vercel.app/">Kuriko IWAI</a>)</p>
<h3 id="heading-the-system-architecture-ai-pricing-for-retailers">The System Architecture: AI Pricing for Retailers</h3>
<p>The system operates as a containerized, serverless microservice designed to provide optimal price recommendations to maximize retailer sales.</p>
<p>Its core intelligence comes from AI models trained on historical purchase data to predict the quantity of the product sold at various prices, allowing sellers to determine the best price.</p>
<p>For consistent deployment, the prediction logic and its dependencies are packaged into a Docker container image and stored in AWS ECR (Elastic Container Registry).</p>
<p>The prediction is then served by an AWS Lambda function, which retrieves and runs the container from ECR and exposes the result via AWS API Gateway for the Flask application to consume.</p>
<p>If you want to see how to build this from the ground up, you can follow along with my tutorial <a target="_blank" href="https://www.freecodecamp.org/news/how-to-build-a-machine-learning-system-on-serverless-architecture/">How to Build a Machine Learning System on Serverless Architecture</a>.</p>
<h3 id="heading-the-ml-lineage">The ML Lineage</h3>
<p>In the system, GitHub handles the code lineage, while DVC captures the lineage of:</p>
<ul>
<li><p><strong>Data</strong> (blue boxes): ETL and preprocessing.</p>
</li>
<li><p><strong>Experiments</strong> (light orange): Hyperparameter tuning and validation.</p>
</li>
<li><p><strong>Models</strong> and <strong>Prediction</strong> (dark orange): Final model artifacts and prediction results.</p>
</li>
</ul>
<p><strong>DVC</strong> tracks the lineage through separate stages, from data extraction to fairness testing (yellow rows in Figure A).</p>
<p>For each stage, DVC uses an <strong>MD5</strong> or <strong>SHA256 hash</strong> to track and push metadata like artifacts, metrics, and reports to its remote on <strong>AWS S3</strong>.</p>
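To make the hashing idea concrete, here’s a minimal sketch of how a file can be fingerprinted with an MD5 digest, similar in spirit to what DVC does (DVC’s actual cache layout and hashing internals differ):

```python
import hashlib

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file in chunks and return its MD5 hex digest."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```

Two byte-identical files always produce the same digest, which is what lets a tool like DVC detect whether a tracked artifact has changed since the last run.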
<p>The pipeline incorporates <strong>Evidently AI</strong> to handle data drift tests, which are essential for identifying shifts in data distributions that could compromise the model's generalization capabilities in production.</p>
<p>Only models that successfully pass both the data drift and fairness tests can serve predictions via the AWS API Gateway (red box in Figure A).</p>
<p>Lastly, this entire lineage process is triggered weekly by the open-source workflow scheduler, <strong>Prefect</strong>.</p>
<p>Prefect prompts DVC to check for updates in data and scripts, and executes the full lineage process if changes are detected.</p>
<h2 id="heading-workflow-in-action">Workflow in Action</h2>
<p>The building process involves five main steps:</p>
<ol>
<li><p>Initiate a DVC project</p>
</li>
<li><p>Define the lineage stages in the <code>dvc.yaml</code> file and the corresponding Python scripts</p>
</li>
<li><p>Deploy the DVC project</p>
</li>
<li><p>Configure scheduled run with Prefect</p>
</li>
<li><p>Deploy the application</p>
</li>
</ol>
<p>Let’s walk through each step together.</p>
<h2 id="heading-step-1-initiating-a-dvc-project">Step 1: Initiating a DVC Project</h2>
<p>The first step is to initiate a DVC project:</p>
<pre><code class="lang-bash">$ dvc init
</code></pre>
<p>This command automatically creates a <code>.dvc</code> directory at the root of the project folder:</p>
<pre><code class="lang-bash">.
└── .dvc/
    ├── cache/         <span class="hljs-comment"># [.gitignore] cached copies of the actual data files</span>
    ├── tmp/           <span class="hljs-comment"># [.gitignore]</span>
    ├── .gitignore     <span class="hljs-comment"># ignores cache, tmp, and config.local</span>
    ├── config         <span class="hljs-comment"># dvc config for production</span>
    └── config.local   <span class="hljs-comment"># [.gitignore] dvc config for local</span>
</code></pre>
<p>DVC keeps the Git repository fast and lightweight by separating large data files from the repository itself.</p>
<p>The process involves caching the original data in the local <code>.dvc/cache</code> directory, creating a small <code>.dvc</code> metadata file that contains an MD5 hash and the path to the original data file, pushing <em>only</em> the small metadata files to Git, and pushing the original data to the DVC remote.</p>
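For illustration, the small `.dvc` metadata file that replaces a tracked data file in Git looks roughly like this (the hash and size values here are hypothetical):

```yaml
# data/original_df.parquet.dvc -- committed to Git instead of the data itself
outs:
- md5: 3f2ab7c9d41e8b6a9c0d1e2f3a4b5c6d   # hypothetical content hash
  size: 48211937                          # hypothetical size in bytes
  path: original_df.parquet               # data file path, relative to this file
```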
<h2 id="heading-step-2-the-ml-lineage">Step 2: The ML Lineage</h2>
<p>Next, we’ll configure the ML lineage with the following stages:</p>
<ol>
<li><p><code>etl_pipeline</code>: Extract, clean, and impute the original data, and perform feature engineering.</p>
</li>
<li><p><code>data_drift_check</code>: Run data drift tests. If they fail, the system exits.</p>
</li>
<li><p><code>preprocess</code>: Create training, validation, and test datasets.</p>
</li>
<li><p><code>tune_primary_model</code>: Tune hyperparameters and train the model.</p>
</li>
<li><p><code>inference_primary_model</code>: Perform inference on the test dataset.</p>
</li>
<li><p><code>assess_model_risk</code>: Run risk and fairness tests.</p>
</li>
</ol>
<p>Each stage requires defining the DVC command and its corresponding Python script.</p>
<p>Let’s get started.</p>
<h3 id="heading-stage-1-the-etl-pipeline">Stage 1: The ETL Pipeline</h3>
<p>The first stage extracts, cleans, and imputes the original data, and performs feature engineering.</p>
<h4 id="heading-dvc-configuration"><strong>DVC Configuration</strong></h4>
<p>We’ll create the <code>dvc.yaml</code> file at the root of the project directory and add the <code>etl_pipeline</code> stage:</p>
<p><code>dvc.yaml</code></p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment"># the main command dvc will run in this stage</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">python</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>

    <span class="hljs-comment"># dependencies necessary to run the main command</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>

    <span class="hljs-comment"># output paths for dvc to track</span>
    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/original_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/processed_df.parquet</span>
</code></pre>
<p>The <code>dvc.yaml</code> file defines a sequence of steps (stages) using sections like:</p>
<ul>
<li><p><code>cmd</code>: The shell command to be executed for that stage</p>
</li>
<li><p><code>deps</code>: Dependencies needed to run the <code>cmd</code></p>
</li>
<li><p><code>params</code>: Default parameters for the <code>cmd</code>, defined in the <code>params.yaml</code> file</p>
</li>
<li><p><code>metrics</code>: The metrics files to track</p>
</li>
<li><p><code>plots</code>: The DVC plot files for visualization</p>
</li>
<li><p><code>outs</code>: The output files produced by the <code>cmd</code>, which DVC will track</p>
</li>
</ul>
<p>The configuration helps DVC ensure reproducibility by explicitly listing the dependencies, outputs, and command of each stage. It also lets DVC manage the lineage by establishing a <strong>Directed Acyclic Graph (DAG)</strong> of the workflow, linking each stage to the next.</p>
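Conceptually, DVC derives the execution order by topologically sorting this DAG. Here’s a toy sketch of that idea using our six stage names (the graph logic is illustrative, not DVC’s actual implementation):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# each stage maps to the stages whose outputs it consumes (illustrative)
stage_deps = {
    'etl_pipeline': set(),
    'data_drift_check': {'etl_pipeline'},
    'preprocess': {'data_drift_check'},
    'tune_primary_model': {'preprocess'},
    'inference_primary_model': {'tune_primary_model'},
    'assess_model_risk': {'inference_primary_model'},
}

# a valid run order: every stage appears after all of its dependencies
run_order = list(TopologicalSorter(stage_deps).static_order())
print(run_order)  # etl_pipeline first, assess_model_risk last
```

Because the order is derived from the declared dependencies, adding or reordering stages in `dvc.yaml` never requires manually specifying when each one runs.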
<h4 id="heading-python-scripts"><strong>Python Scripts</strong></h4>
<p>Next, let’s add Python scripts, ensuring the data is stored using the file paths specified in the <code>outs</code> section of the <code>dvc.yaml</code> file:</p>
<p><code>src/data_handling/etl_pipeline.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">etl_pipeline</span>(<span class="hljs-params">stockcode: str = <span class="hljs-string">''</span>, impute_stockcode: bool = <span class="hljs-literal">False</span></span>):</span>
    <span class="hljs-comment"># extract the entire data</span>
    df = scripts.extract_original_dataframe()

    <span class="hljs-comment"># save the original data as a parquet file</span>
    ORIGINAL_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'original_df.parquet'</span>)
    df.to_parquet(ORIGINAL_DF_PATH, index=<span class="hljs-literal">False</span>) <span class="hljs-comment"># dvc tracked</span>

    <span class="hljs-comment"># transform</span>
    df = scripts.structure_missing_values(df=df)
    df = scripts.handle_feature_engineering(df=df)

    PROCESSED_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'processed_df.parquet'</span>)
    df.to_parquet(PROCESSED_DF_PATH, index=<span class="hljs-literal">False</span>) <span class="hljs-comment"># dvc tracked</span>
    <span class="hljs-keyword">return</span> df

<span class="hljs-comment"># for dvc execution</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:  
    parser = argparse.ArgumentParser(description=<span class="hljs-string">"run etl pipeline"</span>)
    parser.add_argument(<span class="hljs-string">'--stockcode'</span>, type=str, default=<span class="hljs-string">''</span>, help=<span class="hljs-string">"specific stockcode to process. empty runs full pipeline."</span>)
    parser.add_argument(<span class="hljs-string">'--impute'</span>, action=<span class="hljs-string">'store_true'</span>, help=<span class="hljs-string">"flag to create imputation values"</span>)
    args = parser.parse_args()

    etl_pipeline(stockcode=args.stockcode, impute_stockcode=args.impute)
</code></pre>
<h4 id="heading-outputs"><strong>Outputs</strong></h4>
<p>The original and processed data, stored as pandas DataFrames, are saved to the DVC cache:</p>
<ul>
<li><p><code>data/original_df.parquet</code></p>
</li>
<li><p><code>data/processed_df.parquet</code></p>
</li>
</ul>
<h3 id="heading-stage-2-the-data-drift-check">Stage 2: The Data Drift Check</h3>
<p>Before jumping into preprocessing, we’ll run data drift tests to check whether there is any notable drift in the data. To do this, we’ll use <strong>Evidently AI</strong>, an open-source ML and LLM observability framework.</p>
<h4 id="heading-what-is-data-drift">What is Data Drift?</h4>
<p>Data drift refers to any changes in the statistical properties like the mean, variance, or distribution of the data that the model is trained on.</p>
<p>There are three main types of data drift:</p>
<ul>
<li><p><strong>Covariate Drift</strong> (Feature Drift): A change in the input feature distribution.</p>
</li>
<li><p><strong>Prior Probability Drift</strong> (Label Drift): A change in the target variable distribution.</p>
</li>
<li><p><strong>Concept Drift</strong>: A change in the relationship between the input data and the target variable.</p>
</li>
</ul>
<p>Data drift compromises the model's generalization capabilities over time, making its detection after deployment crucial.</p>
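To make covariate drift concrete, here’s a minimal Population Stability Index (PSI) check written from scratch. This is independent of Evidently’s internals, and the 0.2 threshold used below is a common rule of thumb rather than something from this pipeline:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two 1-D samples.
    Bins come from the reference distribution's quantiles; PSI near 0
    means similar distributions, larger values mean stronger drift."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # widen the outer edges so current values outside the reference range still count
    edges[0] = min(edges[0], current.min()) - 1e-9
    edges[-1] = max(edges[-1], current.max()) + 1e-9
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # clip to avoid log(0) for empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(42)
ref = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)      # no drift
shifted = rng.normal(1, 1, 10_000)   # mean shift = covariate drift
print(psi(ref, same), psi(ref, shifted))  # small vs. clearly above 0.2
```

Frameworks like Evidently run a battery of such statistical tests per column and aggregate them into the drift report we generate in this stage.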
<h4 id="heading-dvc-configuration-1">DVC Configuration</h4>
<p>We’ll add the <code>data_drift_check</code> stage right after the <code>etl_pipeline</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
     <span class="hljs-comment"># the main command dvc will run in this stage</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/data_handling/report_data_drift.py
      data/processed/processed_df.csv 
      data/processed_df_${params.stockcode}.parquet
      reports/data_drift_report_${params.stockcode}.html
      metrics/data_drift_${params.stockcode}.json
      ${params.stockcode}
</span>
    <span class="hljs-comment"># default values to the parameters (defined in the param.yaml file)</span>
    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>

    <span class="hljs-comment"># dependencies necessary to run the main command</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/report_data_drift.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/</span>

    <span class="hljs-comment"># output file paths for dvc to track</span>
    <span class="hljs-attr">plots:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/data_drift_report_${params.stockcode}.html:</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/data_drift_${params.stockcode}.json:</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>
</code></pre>
<p>Then, add default values to the parameters passed to the DVC command:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">stockcode:</span> <span class="hljs-string">&lt;STOCKCODE</span> <span class="hljs-string">OF</span> <span class="hljs-string">CHOICE&gt;</span>
</code></pre>
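DVC resolves the `${params.stockcode}` placeholders from `params.yaml` before executing the stage command. Purely as an analogy (this is not DVC’s actual templating engine), the substitution behaves like Python’s `string.Template`; the stockcode value below is hypothetical:

```python
from string import Template

# hypothetical parameter value; in the real pipeline it comes from params.yaml
params = {'stockcode': '85123A'}

cmd = Template('python src/data_handling/report_data_drift.py '
               'metrics/data_drift_${stockcode}.json ${stockcode}')
rendered = cmd.substitute(params)
print(rendered)  # every ${stockcode} placeholder replaced with the value
```

Because the same parameter feeds both the command line and the tracked output paths, each stockcode gets its own report and metrics files.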
<h4 id="heading-python-scripts-1">Python Scripts</h4>
<p>After <a target="_blank" href="https://docs.evidentlyai.com/quickstart_ml#1-1-set-up-evidently-cloud">generating an API token from the Evidently AI workspace,</a> we’ll add a Python script to detect data drift and store the results in the <code>metrics</code> variable:</p>
<p><code>src/data_handling/report_data_drift.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">from</span> evidently <span class="hljs-keyword">import</span> Dataset, DataDefinition, Report
<span class="hljs-keyword">from</span> evidently.presets <span class="hljs-keyword">import</span> DataDriftPreset
<span class="hljs-keyword">from</span> evidently.ui.workspace <span class="hljs-keyword">import</span> CloudWorkspace

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># initiate the evidently cloud workspace</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    ws = CloudWorkspace(token=os.getenv(<span class="hljs-string">'EVENTLY_API_TOKEN'</span>), url=<span class="hljs-string">'https://app.evidently.cloud'</span>)

    <span class="hljs-comment"># retrieve the evidently project</span>
    project = ws.get_project(<span class="hljs-string">'EVENTLY AI PROJECT ID'</span>)

    <span class="hljs-comment"># retrieve paths from the command line args</span>
    REFERENCE_DATA_PATH = sys.argv[<span class="hljs-number">1</span>]
    CURRENT_DATA_PATH = sys.argv[<span class="hljs-number">2</span>]
    REPORT_OUTPUT_PATH = sys.argv[<span class="hljs-number">3</span>]
    METRICS_OUTPUT_PATH = sys.argv[<span class="hljs-number">4</span>]
    STOCKCODE = sys.argv[<span class="hljs-number">5</span>]

    <span class="hljs-comment"># create folders if not exist</span>
    os.makedirs(os.path.dirname(REPORT_OUTPUT_PATH), exist_ok=<span class="hljs-literal">True</span>)
    os.makedirs(os.path.dirname(METRICS_OUTPUT_PATH), exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># extract datasets</span>
    reference_data_full = pd.read_csv(REFERENCE_DATA_PATH)
    reference_data_stockcode = reference_data_full[reference_data_full[<span class="hljs-string">'stockcode'</span>] == STOCKCODE]
    current_data_stockcode = pd.read_parquet(CURRENT_DATA_PATH)

    <span class="hljs-comment"># define data schema</span>
    nums, cats = scripts.categorize_num_cat_cols(df=reference_data_stockcode)
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> nums: current_data_stockcode[col] = pd.to_numeric(current_data_stockcode[col], errors=<span class="hljs-string">'coerce'</span>)

    schema = DataDefinition(numerical_columns=nums, categorical_columns=cats)

    <span class="hljs-comment"># define evidently datasets w/ the data schema</span>
    eval_data_1 = Dataset.from_pandas(reference_data_stockcode, data_definition=schema)
    eval_data_2 = Dataset.from_pandas(current_data_stockcode, data_definition=schema)

    <span class="hljs-comment"># execute drift detection</span>
    report = Report(metrics=[DataDriftPreset()])
    data_eval = report.run(reference_data=eval_data_1, current_data=eval_data_2)
    data_eval.save_html(REPORT_OUTPUT_PATH)

    <span class="hljs-comment"># create metrics for dvc tracking</span>
    report_dict = json.loads(data_eval.json())
    num_drifts = report_dict[<span class="hljs-string">'metrics'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'value'</span>][<span class="hljs-string">'count'</span>]
    shared_drifts = report_dict[<span class="hljs-string">'metrics'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'value'</span>][<span class="hljs-string">'share'</span>]
    metrics = dict(
        drift_detected=bool(num_drifts &gt; <span class="hljs-number">0.0</span>), num_drifts=num_drifts, shared_drifts=shared_drifts,
        num_cols=nums,
        cat_cols=cats,
        stockcode=STOCKCODE,
        timestamp=datetime.datetime.now().isoformat(),
    )

    <span class="hljs-comment"># load metrics file</span>
    <span class="hljs-keyword">with</span> open(METRICS_OUTPUT_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        json.dump(metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... drift metrics saved to <span class="hljs-subst">{METRICS_OUTPUT_PATH}</span>... '</span>)

    <span class="hljs-comment"># stop the system if data drift is found</span>
    <span class="hljs-keyword">if</span> num_drifts &gt; <span class="hljs-number">0.0</span>: sys.exit(<span class="hljs-string">'❌ FATAL: data drift detected. stopping pipeline'</span>)
</code></pre>
<p>If data drift is found, the script immediately exits using the final <code>sys.exit</code> command.</p>
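Calling `sys.exit()` with a string prints the message to stderr and exits with status 1, and DVC treats any nonzero exit code as a stage failure, halting downstream stages. You can see the exit behavior with plain Python (no DVC required):

```python
import subprocess
import sys

# simulate a stage script that aborts when drift is detected
result = subprocess.run(
    [sys.executable, '-c', "import sys; sys.exit('drift detected')"],
    capture_output=True, text=True,
)
print(result.returncode)       # nonzero, so an orchestrator would stop here
print(result.stderr.strip())
```

This is why the drift check can act as a gate: downstream stages like `preprocess` simply never run when it fails.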
<h4 id="heading-outputs-1">Outputs</h4>
<p>The script generates two files that DVC will track:</p>
<ul>
<li><p><code>reports/data_drift_report.html</code>: The data drift report in an HTML file.</p>
</li>
<li><p><code>metrics/data_drift.json</code>: The data drift metrics in a JSON file, including drift results along with feature columns and a timestamp:</p>
</li>
</ul>
<p><code>metrics/data_drift.json</code>:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"drift_detected"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">"num_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-attr">"shared_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-attr">"num_cols"</span>: [
        <span class="hljs-string">"invoiceno"</span>,
        <span class="hljs-string">"invoicedate"</span>,
        <span class="hljs-string">"unitprice"</span>,
        <span class="hljs-string">"product_avg_quantity_last_month"</span>,
        <span class="hljs-string">"product_max_price_all_time"</span>,
        <span class="hljs-string">"unitprice_vs_max"</span>,
        <span class="hljs-string">"unitprice_to_avg"</span>,
        <span class="hljs-string">"unitprice_squared"</span>,
        <span class="hljs-string">"unitprice_log"</span>
    ],
    <span class="hljs-attr">"cat_cols"</span>: [
        <span class="hljs-string">"stockcode"</span>,
        <span class="hljs-string">"customerid"</span>,
        <span class="hljs-string">"country"</span>,
        <span class="hljs-string">"year"</span>,
        <span class="hljs-string">"year_month"</span>,
        <span class="hljs-string">"day_of_week"</span>,
        <span class="hljs-string">"is_registered"</span>
    ],
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:24:29.899495"</span>
}
</code></pre>
<p>The drift test results are also available on the Evidently workspace dashboard for further analysis:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/0*2C1ICzvVazAUH7fk.png" alt="Figure B. Screenshot of the Evidently workspace dashboard" width="600" height="400" loading="lazy"></p>
<p><strong>Figure B.</strong> Screenshot of the Evidently workspace dashboard</p>
<h3 id="heading-stage-3-preprocessing">Stage 3: Preprocessing</h3>
<p>If no data drift is detected, the lineage moves on to the preprocessing stage.</p>
<h4 id="heading-dvc-configuration-2">DVC Configuration</h4>
<p>We’ll add the <code>preprocess</code> stage right after the <code>data_drift_check</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/data_handling/preprocess.py --target_col ${params.target_col} --should_scale ${params.should_scale} --verbose ${params.verbose}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/preprocess.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils</span>

    <span class="hljs-comment"># params from params.yaml</span>
    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.target_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.should_scale</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.verbose</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-comment"># train, val, test datasets</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_train_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_val_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_test_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_train_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_val_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_test_df.parquet</span>

      <span class="hljs-comment"># preprocessed input datasets</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_train_processed.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_val_processed.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_test_processed.parquet</span>

      <span class="hljs-comment"># trained preprocessor and human readable feature names for shap analysis</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">preprocessors/column_transformer.pkl</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">preprocessors/feature_names.json</span>
</code></pre>
<p>Then add default values for the parameters used in the <code>cmd</code>:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>
</code></pre>
<h4 id="heading-python-scripts-2">Python Scripts</h4>
<p>Next, we’ll add a Python script to create training, validation, and test datasets and preprocess input data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess</span>(<span class="hljs-params">stockcode: str = <span class="hljs-string">''</span>, target_col: str = <span class="hljs-string">'quantity'</span>, should_scale: bool = True, verbose: bool = False</span>):</span>
    <span class="hljs-comment"># initiate metrics to track (dvc)</span>
    DATA_DRIFT_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'data_drift_<span class="hljs-subst">{stockcode}</span>.json'</span>)

    <span class="hljs-keyword">if</span> os.path.exists(DATA_DRIFT_METRICS_PATH):
        <span class="hljs-keyword">with</span> open(DATA_DRIFT_METRICS_PATH, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
            metrics = json.load(f)
    <span class="hljs-keyword">else</span>: metrics = dict()

    <span class="hljs-comment"># load processed df from dvc cache</span>
    PROCESSED_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'processed_df.parquet'</span>)
    df = pd.read_parquet(PROCESSED_DF_PATH)

    <span class="hljs-comment"># categorize num and cat columns</span>
    num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
    <span class="hljs-keyword">if</span> verbose: main_logger.info(<span class="hljs-string">f'num_cols: <span class="hljs-subst">{num_cols}</span> \ncat_cols: <span class="hljs-subst">{cat_cols}</span>'</span>)

    <span class="hljs-comment"># structure cat cols</span>
    <span class="hljs-keyword">if</span> cat_cols:
        <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cat_cols: df[col] = df[col].astype(<span class="hljs-string">'string'</span>)

    <span class="hljs-comment"># initiate preprocessor (either load from the dvc cache or create from scratch)</span>
    PREPROCESSOR_PATH = os.path.join(<span class="hljs-string">'preprocessors'</span>, <span class="hljs-string">'column_transformer.pkl'</span>)
    <span class="hljs-keyword">try</span>:
        preprocessor = joblib.load(PREPROCESSOR_PATH)
    <span class="hljs-keyword">except</span> FileNotFoundError:
        preprocessor = scripts.create_preprocessor(num_cols=num_cols <span class="hljs-keyword">if</span> should_scale <span class="hljs-keyword">else</span> [], cat_cols=cat_cols)

    <span class="hljs-comment"># creates train, val, test datasets</span>
    y = df[target_col]
    X = df.copy().drop(target_col, axis=<span class="hljs-string">'columns'</span>)

    <span class="hljs-comment"># split</span>
    test_size, random_state = <span class="hljs-number">50000</span>, <span class="hljs-number">42</span>
    X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, shuffle=<span class="hljs-literal">False</span>)
    X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, random_state=random_state, shuffle=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># store train, val, test datasets (dvc track)</span>
    X_train.to_parquet(<span class="hljs-string">'data/x_train_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    X_val.to_parquet(<span class="hljs-string">'data/x_val_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    X_test.to_parquet(<span class="hljs-string">'data/x_test_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_train.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_train_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_val.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_val_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_test.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_test_df.parquet'</span>, index=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># preprocess</span>
    X_train = preprocessor.fit_transform(X_train)
    X_val = preprocessor.transform(X_val)
    X_test = preprocessor.transform(X_test)

    <span class="hljs-comment"># store preprocessed input data (dvc track)</span>
    pd.DataFrame(X_train).to_parquet(<span class="hljs-string">'data/x_train_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)
    pd.DataFrame(X_val).to_parquet(<span class="hljs-string">'data/x_val_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)
    pd.DataFrame(X_test).to_parquet(<span class="hljs-string">'data/x_test_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># save feature names (dvc track) for shap</span>
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">'preprocessors/feature_names.json'</span>, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        feature_names = preprocessor.get_feature_names_out()
        json.dump(feature_names.tolist(), f)

    <span class="hljs-keyword">return</span>  X_train, X_val, X_test, y_train, y_val, y_test, preprocessor


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    parser = argparse.ArgumentParser(description=<span class="hljs-string">'run data preprocessing'</span>)
    parser.add_argument(<span class="hljs-string">'--stockcode'</span>, type=str, default=<span class="hljs-string">''</span>, help=<span class="hljs-string">'specific stockcode'</span>)
    parser.add_argument(<span class="hljs-string">'--target_col'</span>, type=str, default=<span class="hljs-string">'quantity'</span>, help=<span class="hljs-string">'the target column name'</span>)
    <span class="hljs-comment"># note: argparse's type=bool would treat any non-empty string (even 'False') as True</span>
    parser.add_argument(<span class="hljs-string">'--should_scale'</span>, type=<span class="hljs-keyword">lambda</span> s: s.lower() == <span class="hljs-string">'true'</span>, default=<span class="hljs-literal">True</span>, help=<span class="hljs-string">'flag to scale numerical features'</span>)
    parser.add_argument(<span class="hljs-string">'--verbose'</span>, type=<span class="hljs-keyword">lambda</span> s: s.lower() == <span class="hljs-string">'true'</span>, default=<span class="hljs-literal">False</span>, help=<span class="hljs-string">'flag for verbose logging'</span>)
    args = parser.parse_args()

    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = preprocess(
        target_col=args.target_col,
        should_scale=args.should_scale,
        verbose=args.verbose,
        stockcode=args.stockcode,
    )
</code></pre>
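<p>Because the frame is already time-sorted and <code>shuffle=False</code>, <code>train_test_split</code> reduces to positional slicing: the last <code>test_size</code> rows become the holdout. A minimal stand-alone sketch of the same behavior (the <code>chrono_split</code> helper is illustrative, not part of the project):</p>
<pre><code class="lang-python">import pandas as pd

def chrono_split(df, test_size):
    # positional head/tail split: the last test_size rows become the holdout
    return df.iloc[:-test_size], df.iloc[-test_size:]

df = pd.DataFrame({'quantity': range(10)})
train, test = chrono_split(df, test_size=3)
</code></pre>
<p>With a time-ordered frame, this guarantees the validation and test sets always come strictly after the training data, which is what we want for a forecasting task.</p>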
<h4 id="heading-outputs-2">Outputs</h4>
<p>This stage generates the necessary datasets for both model training and inference:</p>
<p>Input features:</p>
<ul>
<li><p><code>data/x_train_df.parquet</code></p>
</li>
<li><p><code>data/x_val_df.parquet</code></p>
</li>
<li><p><code>data/x_test_df.parquet</code></p>
</li>
</ul>
<p>Preprocessed input features:</p>
<ul>
<li><p><code>data/x_train_processed.parquet</code></p>
</li>
<li><p><code>data/x_val_processed.parquet</code></p>
</li>
<li><p><code>data/x_test_processed.parquet</code></p>
</li>
</ul>
<p>Target variables:</p>
<ul>
<li><p><code>data/y_train_df.parquet</code></p>
</li>
<li><p><code>data/y_val_df.parquet</code></p>
</li>
<li><p><code>data/y_test_df.parquet</code></p>
</li>
</ul>
<p>The preprocessor and human-readable feature names are also stored in cache for inference and SHAP feature impact analysis later:</p>
<ul>
<li><p><code>preprocessors/column_transformer.pkl</code></p>
</li>
<li><p><code>preprocessors/feature_names.json</code></p>
</li>
</ul>
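<p>At inference time, these cached artifacts are loaded back instead of re-fitting the transformer. A minimal sketch, assuming the <code>ColumnTransformer</code> was serialized with <code>pickle</code> (the helper name is illustrative, not the project's actual function):</p>
<pre><code class="lang-python">import json
import pickle

def load_inference_artifacts(preprocessor_path, feature_names_path):
    # reload the fitted ColumnTransformer and the human-readable feature names
    with open(preprocessor_path, 'rb') as f:
        preprocessor = pickle.load(f)
    with open(feature_names_path, 'r') as f:
        feature_names = json.load(f)
    return preprocessor, feature_names
</code></pre>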
<p>Lastly, the stage appends <code>preprocess_status</code>, <code>x_train_processed_path</code>, and <code>preprocessor_path</code> to the data summary metrics file <code>data.json</code> created in Step 2, so DVC can track the end-to-end process of Steps 2 and 3:</p>
<p><code>metrics/data.json</code>:</p>
<pre><code class="lang-python">{
    <span class="hljs-string">"drift_detected"</span>: false,
    <span class="hljs-string">"num_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-string">"shared_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-string">"num_cols"</span>: [
        <span class="hljs-string">"invoiceno"</span>,
        <span class="hljs-string">"invoicedate"</span>,
        <span class="hljs-string">"unitprice"</span>,
        <span class="hljs-string">"product_avg_quantity_last_month"</span>,
        <span class="hljs-string">"product_max_price_all_time"</span>,
        <span class="hljs-string">"unitprice_vs_max"</span>,
        <span class="hljs-string">"unitprice_to_avg"</span>,
        <span class="hljs-string">"unitprice_squared"</span>,
        <span class="hljs-string">"unitprice_log"</span>
    ],
    <span class="hljs-string">"cat_cols"</span>: [
        <span class="hljs-string">"stockcode"</span>,
        <span class="hljs-string">"customerid"</span>,
        <span class="hljs-string">"country"</span>,
        <span class="hljs-string">"year"</span>,
        <span class="hljs-string">"year_month"</span>,
        <span class="hljs-string">"day_of_week"</span>,
        <span class="hljs-string">"is_registered"</span>
    ],
    <span class="hljs-string">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:24:29.899495"</span>,

    <span class="hljs-comment"># updates</span>
    <span class="hljs-string">"preprocess_status"</span>: <span class="hljs-string">"completed"</span>,
    <span class="hljs-string">"x_train_processed_path"</span>: <span class="hljs-string">"data/x_train_processed_85123A.parquet"</span>,
    <span class="hljs-string">"preprocessor_path"</span>: <span class="hljs-string">"preprocessors/column_transformer.pkl"</span>
}
</code></pre>
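<p>The update itself is a plain read, merge, write on the JSON file. A sketch of how the stage might append its fields (the function name is an assumption, not the project's actual helper):</p>
<pre><code class="lang-python">import json

def append_to_data_summary(path, updates):
    # read the existing summary, merge in the new keys, and write it back
    with open(path, 'r') as f:
        summary = json.load(f)
    summary.update(updates)
    with open(path, 'w') as f:
        json.dump(summary, f, indent=4)
    return summary
</code></pre>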
<p>Next, let’s move on to the model/experiment lineage.</p>
<h3 id="heading-stage-4-tuning-the-model">Stage 4: Tuning the Model</h3>
<p>Now that we’ve created the datasets, we’ll tune and train the primary model: a multi-layer feedforward network built with <strong>PyTorch</strong>, using the training and validation datasets created in the <code>preprocess</code> stage.</p>
<h4 id="heading-dvc-configuration-3">DVC Configuration</h4>
<p>First, we’ll add the <code>tune_primary_model</code> stage right after the <code>preprocess</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/main.py
      data/x_train_processed_${params.stockcode}.parquet
      data/x_val_processed_${params.stockcode}.parquet
      data/y_train_df_${params.stockcode}.parquet
      data/y_val_df_${params.stockcode}.parquet
      ${tuning.should_local_save}
      ${tuning.grid}
      ${tuning.n_trials}
      ${tuning.num_epochs}
      ${params.stockcode}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/main.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.n_trials</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.grid</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.num_epochs</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.should_local_save</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">models/production/dfn_best_${params.stockcode}.pth</span> <span class="hljs-comment"># dvc track</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_val_${params.stockcode}.json:</span> <span class="hljs-comment"># dvc track</span>
</code></pre>
<p>Then we’ll add default values to the parameters:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">stockcode:</span> <span class="hljs-string">"85123A"</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>

<span class="hljs-attr">tuning:</span>
  <span class="hljs-attr">n_trials:</span> <span class="hljs-number">100</span>
  <span class="hljs-attr">num_epochs:</span> <span class="hljs-number">3000</span>
  <span class="hljs-attr">should_local_save:</span> <span class="hljs-literal">False</span>
  <span class="hljs-attr">grid:</span> <span class="hljs-literal">False</span>
</code></pre>
<h4 id="heading-python-scripts-3">Python Scripts</h4>
<p>Next, we’ll add the Python scripts to tune the model using <strong>Bayesian optimization</strong> and then train the optimal model on the complete <code>X_train</code> and <code>y_train</code> datasets created in the <code>preprocess</code> stage.</p>
<p><code>src/model/torch_model/main.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn

<span class="hljs-keyword">import</span> src.model.torch_model.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">tune_and_train</span>(<span class="hljs-params">
        X_train, X_val, y_train, y_val,
        stockcode: str = <span class="hljs-string">''</span>,
        should_local_save: bool = True,
        grid: bool = False,
        n_trials: int = <span class="hljs-number">50</span>,
        num_epochs: int = <span class="hljs-number">3000</span>
    </span>) -&gt; tuple[nn.Module, dict]:</span>

    <span class="hljs-comment"># perform bayesian optimization</span>
    best_dfn, best_optimizer, best_batch_size, best_checkpoint = scripts.bayesian_optimization(
        X_train, X_val, y_train, y_val, n_trials=n_trials, num_epochs=num_epochs
    )

    <span class="hljs-comment"># save the model artifact (dvc track)</span>
    DFN_FILE_PATH = os.path.join(<span class="hljs-string">'models'</span>, <span class="hljs-string">'production'</span>, <span class="hljs-string">f'dfn_best_<span class="hljs-subst">{stockcode}</span>.pth'</span> <span class="hljs-keyword">if</span> stockcode <span class="hljs-keyword">else</span> <span class="hljs-string">'dfn_best.pth'</span>)
    os.makedirs(os.path.dirname(DFN_FILE_PATH), exist_ok=<span class="hljs-literal">True</span>)
    torch.save(best_checkpoint, DFN_FILE_PATH)

    <span class="hljs-keyword">return</span> best_dfn, best_checkpoint



<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">track_metrics_by_stockcode</span>(<span class="hljs-params">X_val, y_val, best_model, checkpoint: dict, stockcode: str</span>):</span>
    MODEL_VAL_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'dfn_val_<span class="hljs-subst">{stockcode}</span>.json'</span>)
    os.makedirs(os.path.dirname(MODEL_VAL_METRICS_PATH), exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># validate the tuned model</span>
    _, mse, exp_mae, rmsle = scripts.perform_inference(model=best_model, X=X_val, y=y_val)
    model_version = <span class="hljs-string">f"dfn_<span class="hljs-subst">{stockcode}</span>_<span class="hljs-subst">{os.getpid()}</span>"</span>
    metrics = dict(
        stockcode=stockcode,
        mse_val=mse,
        mae_val=exp_mae,
        rmsle_val=rmsle,
        model_version=model_version,
        hparams=checkpoint[<span class="hljs-string">'hparams'</span>],
        optimizer=checkpoint[<span class="hljs-string">'optimizer_name'</span>],
        batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>],
        lr=checkpoint[<span class="hljs-string">'lr'</span>],
        timestamp=datetime.datetime.now().isoformat()
    )
    <span class="hljs-comment"># store the validation results (dvc track)</span>
    <span class="hljs-keyword">with</span> open(MODEL_VAL_METRICS_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        json.dump(metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... validation metrics saved to <span class="hljs-subst">{MODEL_VAL_METRICS_PATH}</span> ...'</span>)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># fetch command arg values</span>
    X_TRAIN_PATH = sys.argv[<span class="hljs-number">1</span>]
    X_VAL_PATH = sys.argv[<span class="hljs-number">2</span>]
    Y_TRAIN_PATH = sys.argv[<span class="hljs-number">3</span>]
    Y_VAL_PATH = sys.argv[<span class="hljs-number">4</span>]
    SHOULD_LOCAL_SAVE = sys.argv[<span class="hljs-number">5</span>] == <span class="hljs-string">'True'</span>
    GRID = sys.argv[<span class="hljs-number">6</span>] == <span class="hljs-string">'True'</span>
    N_TRIALS = int(sys.argv[<span class="hljs-number">7</span>])
    NUM_EPOCHS = int(sys.argv[<span class="hljs-number">8</span>])
    STOCKCODE = str(sys.argv[<span class="hljs-number">9</span>])

    <span class="hljs-comment"># extract training and validation datasets from dvc cache</span>
    X_train, X_val = pd.read_parquet(X_TRAIN_PATH), pd.read_parquet(X_VAL_PATH)
    y_train, y_val = pd.read_parquet(Y_TRAIN_PATH), pd.read_parquet(Y_VAL_PATH)

    <span class="hljs-comment"># tuning</span>
    best_model, checkpoint = tune_and_train(
        X_train, X_val, y_train, y_val,
        stockcode=STOCKCODE, should_local_save=SHOULD_LOCAL_SAVE, grid=GRID, n_trials=N_TRIALS, num_epochs=NUM_EPOCHS
    )

    <span class="hljs-comment"># metrics tracking</span>
    track_metrics_by_stockcode(X_val, y_val, best_model=best_model, checkpoint=checkpoint, stockcode=STOCKCODE)
</code></pre>
<h4 id="heading-outputs-3">Outputs</h4>
<p>The stage generates two files:</p>
<ul>
<li><p><code>models/production/dfn_best.pth</code>: Includes the model artifact and checkpoint data, such as the optimal hyperparameter set.</p>
</li>
<li><p><code>metrics/dfn_val.json</code>: Contains tuning results, model version, timestamp, and validation results for MSE, MAE, and RMSLE:</p>
</li>
</ul>
<p><code>metrics/dfn_val.json</code>:</p>
<pre><code class="lang-yaml">{
    <span class="hljs-attr">"stockcode":</span> <span class="hljs-string">"85123A"</span>,
    <span class="hljs-attr">"mse_val":</span> <span class="hljs-number">0.6137686967849731</span>,
    <span class="hljs-attr">"mae_val":</span> <span class="hljs-number">9.092489242553711</span>,
    <span class="hljs-attr">"rmsle_val":</span> <span class="hljs-number">0.6953299045562744</span>,
    <span class="hljs-attr">"model_version":</span> <span class="hljs-string">"dfn_85123A_35604"</span>,
    <span class="hljs-attr">"hparams":</span> {
        <span class="hljs-attr">"num_layers":</span> <span class="hljs-number">4</span>,
        <span class="hljs-attr">"batch_norm":</span> <span class="hljs-literal">false</span>,
        <span class="hljs-attr">"dropout_rate_layer_0":</span> <span class="hljs-number">0.13765888061300502</span>,
        <span class="hljs-attr">"n_units_layer_0":</span> <span class="hljs-number">184</span>,
        <span class="hljs-attr">"dropout_rate_layer_1":</span> <span class="hljs-number">0.5509872409359128</span>,
        <span class="hljs-attr">"n_units_layer_1":</span> <span class="hljs-number">122</span>,
        <span class="hljs-attr">"dropout_rate_layer_2":</span> <span class="hljs-number">0.2408753527744403</span>,
        <span class="hljs-attr">"n_units_layer_2":</span> <span class="hljs-number">35</span>,
        <span class="hljs-attr">"dropout_rate_layer_3":</span> <span class="hljs-number">0.03451842588822594</span>,
        <span class="hljs-attr">"n_units_layer_3":</span> <span class="hljs-number">224</span>,
        <span class="hljs-attr">"learning_rate":</span> <span class="hljs-number">0.026240673135104406</span>,
        <span class="hljs-attr">"optimizer":</span> <span class="hljs-string">"adamax"</span>,
        <span class="hljs-attr">"batch_size":</span> <span class="hljs-number">64</span>
    },
    <span class="hljs-attr">"optimizer":</span> <span class="hljs-string">"adamax"</span>,
    <span class="hljs-attr">"batch_size":</span> <span class="hljs-number">64</span>,
    <span class="hljs-attr">"lr":</span> <span class="hljs-number">0.026240673135104406</span>,
    <span class="hljs-attr">"timestamp":</span> <span class="hljs-string">"2025-10-07T00:31:08.700294"</span>
}
</code></pre>
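<p>The <code>hparams</code> block above fully describes the network's shape: four hidden layers with the listed unit counts, each followed by dropout. As a library-free illustration (a numpy stand-in, not the project's PyTorch module), the forward pass of such a feedforward network is just alternating affine maps and activations, with dropout disabled at inference:</p>
<pre><code class="lang-python">import numpy as np

def mlp_forward(x, weights, biases):
    # hidden layers: affine map followed by ReLU
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(W @ x + b, 0.0)
    # final linear layer: a single output unit for the quantity regression
    return weights[-1] @ x + biases[-1]

# toy shape check: 3 inputs, one hidden layer of 2 units, 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
biases = [np.zeros(2), np.zeros(1)]
y = mlp_forward(np.ones(3), weights, biases)
</code></pre>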
<h3 id="heading-stage-5-performing-inference">Stage 5: Performing Inference</h3>
<p>After the model tuning phase is complete, we’ll run inference on the test set for a final evaluation.</p>
<p>The final evaluation uses the MSE, MAE, and RMSLE metrics, as well as SHAP for feature impact and interpretability analysis.</p>
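<p>For reference, the three metrics reduce to a few lines of numpy. The project computes them inside <code>scripts.perform_inference</code>; the <code>regression_metrics</code> function here is just an illustrative stand-in, and the RMSLE term assumes non-negative values:</p>
<pre><code class="lang-python">import numpy as np

def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)    # mean squared error
    mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error
    # rmsle penalizes relative rather than absolute error
    rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
    return mse, mae, rmsle
</code></pre>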
<p><strong>SHAP</strong> <strong>(SHapley Additive exPlanations)</strong> is a framework for quantifying how much each feature contributes to a model’s prediction by using the concept of Shapley values from game theory.</p>
<p>The SHAP values are leveraged for future EDA and feature engineering.</p>
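<p>The intuition behind SHAP is easiest to see in the linear case: for a linear model with independent features, the exact SHAP value of a feature on a sample is its coefficient times the feature's deviation from the dataset mean, and the values sum to the gap between the model's prediction and its average prediction. A toy numpy check (illustrative only; the pipeline itself uses <code>shap.DeepExplainer</code> on the trained network):</p>
<pre><code class="lang-python">import numpy as np

def linear_shap(w, X, x):
    # exact SHAP values for a linear model with independent features:
    # each feature's coefficient times its deviation from the dataset mean
    return np.asarray(w, dtype=float) * (np.asarray(x, dtype=float) - X.mean(axis=0))

w = np.array([3.0, 5.0])                 # model coefficients
X = np.array([[0.0, 0.0], [2.0, 2.0]])   # background data, feature means [1, 1]
phi = linear_shap(w, X, x=np.array([2.0, 0.0]))
# phi = [3, -5]; the values sum to f(x) - E[f(X)] = -2
</code></pre>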
<h4 id="heading-dvc-configuration-4">DVC Configuration</h4>
<p>First, we’ll add the <code>inference_primary_model</code> stage to the DVC configuration.</p>
<p>This stage has a <code>plots</code> section, where DVC tracks and versions the generated SHAP visualization data.</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">inference_primary_model:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/inference.py
      data/x_test_processed_${params.stockcode}.parquet
      data/y_test_df_${params.stockcode}.parquet
      models/production/dfn_best_${params.stockcode}.pth
      ${params.stockcode}
      ${tracking.sensitive_feature_col}
      ${tracking.privileged_group}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/inference.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">models/production/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.sensitive_feature_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.privileged_group</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_inf_${params.stockcode}.json:</span> <span class="hljs-comment"># dvc track</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>

    <span class="hljs-attr">plots:</span>
      <span class="hljs-comment"># shap summary / beeswarm plot for global interpretability</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_shap_summary_${params.stockcode}.json:</span>
          <span class="hljs-attr">template:</span> <span class="hljs-string">simple</span>
          <span class="hljs-attr">x:</span> <span class="hljs-string">shap_value</span>
          <span class="hljs-attr">y:</span> <span class="hljs-string">feature_name</span>
          <span class="hljs-attr">title:</span> <span class="hljs-string">SHAP</span> <span class="hljs-string">Beeswarm</span> <span class="hljs-string">Plot</span>

      <span class="hljs-comment"># shap mean absolute vals - feature importance bar plot</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_shap_mean_abs_${params.stockcode}.json:</span>
          <span class="hljs-attr">template:</span> <span class="hljs-string">bar</span>
          <span class="hljs-attr">x:</span> <span class="hljs-string">mean_abs_shap</span>
          <span class="hljs-attr">y:</span> <span class="hljs-string">feature_name</span>
          <span class="hljs-attr">title:</span> <span class="hljs-string">Mean</span> <span class="hljs-string">Absolute</span> <span class="hljs-string">SHAP</span> <span class="hljs-string">Importance</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/dfn_inference_results_${params.stockcode}.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_raw_shap_values_${params.stockcode}.parquet</span> <span class="hljs-comment"># save raw shap vals for detailed analysis later</span>
</code></pre>
<h4 id="heading-python-scripts-4">Python Scripts</h4>
<p>Next, we’ll add scripts where the trained model performs inference:</p>
<p><code>src/model/torch_model/inference.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> shap

<span class="hljs-keyword">import</span> src.model.torch_model.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># load test dataset</span>
    X_TEST_PATH = sys.argv[<span class="hljs-number">1</span>]
    Y_TEST_PATH = sys.argv[<span class="hljs-number">2</span>]
    X_test, y_test = pd.read_parquet(X_TEST_PATH), pd.read_parquet(Y_TEST_PATH)

    <span class="hljs-comment"># create X_test w/ column names for shap analysis and sensitive feature tracking</span>
    X_test_with_col_names = X_test.copy()
    FEATURE_NAMES_PATH = os.path.join(<span class="hljs-string">'preprocessors'</span>, <span class="hljs-string">'feature_names.json'</span>)
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">with</span> open(FEATURE_NAMES_PATH, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f: feature_names = json.load(f)
    <span class="hljs-keyword">except</span> FileNotFoundError: feature_names = X_test.columns.tolist()
    <span class="hljs-keyword">if</span> len(X_test_with_col_names.columns) == len(feature_names): X_test_with_col_names.columns = feature_names

    <span class="hljs-comment"># reconstruct the optimal model tuned in the previous stage</span>
    MODEL_PATH = sys.argv[<span class="hljs-number">3</span>]
    checkpoint = torch.load(MODEL_PATH)
    model = scripts.load_model(checkpoint=checkpoint)

    <span class="hljs-comment"># perform inference</span>
    y_pred, mse, exp_mae, rmsle = scripts.perform_inference(model=model, X=X_test, y=y_test, batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>])

    <span class="hljs-comment"># create result df w/ y_pred, y_true, and sensitive features</span>
    STOCKCODE = sys.argv[<span class="hljs-number">4</span>]
    SENSITIVE_FEATURE = sys.argv[<span class="hljs-number">5</span>]
    PRIVILEGED_GROUP = sys.argv[<span class="hljs-number">6</span>]
    inference_df = pd.DataFrame(y_pred.cpu().numpy().flatten(), columns=[<span class="hljs-string">'y_pred'</span>])
    inference_df[<span class="hljs-string">'y_true'</span>] = y_test
    inference_df[SENSITIVE_FEATURE] = X_test_with_col_names[<span class="hljs-string">f'cat__<span class="hljs-subst">{SENSITIVE_FEATURE}</span>_<span class="hljs-subst">{str(PRIVILEGED_GROUP)}</span>'</span>].astype(bool)
    inference_df.to_parquet(path=os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">f'dfn_inference_results_<span class="hljs-subst">{STOCKCODE}</span>.parquet'</span>))

    <span class="hljs-comment"># record inference metrics</span>
    MODEL_INF_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'dfn_inf_<span class="hljs-subst">{STOCKCODE}</span>.json'</span>)
    os.makedirs(os.path.dirname(MODEL_INF_METRICS_PATH), exist_ok=<span class="hljs-literal">True</span>)
    model_version = <span class="hljs-string">f"dfn_<span class="hljs-subst">{STOCKCODE}</span>_<span class="hljs-subst">{os.getpid()}</span>"</span>
    inf_metrics = dict(
        stockcode=STOCKCODE,
        mse_inf=mse,
        mae_inf=exp_mae,
        rmsle_inf=rmsle,
        model_version=model_version,
        hparams=checkpoint[<span class="hljs-string">'hparams'</span>],
        optimizer=checkpoint[<span class="hljs-string">'optimizer_name'</span>],
        batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>],
        lr=checkpoint[<span class="hljs-string">'lr'</span>],
        timestamp=datetime.datetime.now().isoformat()
    )
    <span class="hljs-keyword">with</span> open(MODEL_INF_METRICS_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f: <span class="hljs-comment"># dvc track</span>
        json.dump(inf_metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... inference metrics saved to <span class="hljs-subst">{MODEL_INF_METRICS_PATH}</span> ...'</span>)


    <span class="hljs-comment">## shap analysis</span>
    model.eval()
    device_type = <span class="hljs-string">'cuda'</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'cpu'</span>

    <span class="hljs-comment"># prepare background data</span>
    X_test_tensor = torch.from_numpy(X_test.values.astype(np.float32)).to(device_type)

    <span class="hljs-comment"># take a small random sample from x_test as the background dataset</span>
    background = X_test_tensor[np.random.choice(X_test_tensor.shape[<span class="hljs-number">0</span>], <span class="hljs-number">100</span>, replace=<span class="hljs-literal">False</span>)]

    <span class="hljs-comment"># define deepexplainer</span>
    explainer = shap.DeepExplainer(model, background)

    <span class="hljs-comment"># compute shap vals</span>
    shap_values = explainer.shap_values(X_test_tensor) <span class="hljs-comment"># outputs = numpy array or tensor</span>

    <span class="hljs-comment"># convert shap array to pandas df</span>
    <span class="hljs-keyword">if</span> isinstance(shap_values, list): shap_values = shap_values[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">if</span> isinstance(shap_values, torch.Tensor): shap_values = shap_values.cpu().numpy()
    shap_values = shap_values.squeeze(axis=<span class="hljs-number">-1</span>) <span class="hljs-comment"># type: ignore</span>
    shap_df = pd.DataFrame(shap_values, columns=feature_names)

    <span class="hljs-comment"># shap raw data (dvc track)</span>
    RAW_SHAP_OUT_PATH = os.path.join(<span class="hljs-string">'reports'</span>, <span class="hljs-string">f'dfn_raw_shap_values_<span class="hljs-subst">{STOCKCODE}</span>.parquet'</span>)
    os.makedirs(os.path.dirname(RAW_SHAP_OUT_PATH), exist_ok=<span class="hljs-literal">True</span>)
    shap_df.to_parquet(RAW_SHAP_OUT_PATH, index=<span class="hljs-literal">False</span>)
    main_logger.info(<span class="hljs-string">f'... shap values saved to <span class="hljs-subst">{RAW_SHAP_OUT_PATH}</span> ...'</span>)

    <span class="hljs-comment"># bar plot of mean abs shap vals (dvc report)</span>
    mean_abs_shap = shap_df.abs().mean().sort_values(ascending=<span class="hljs-literal">False</span>)
    <span class="hljs-comment"># use the sorted index so feature names stay aligned with their values</span>
    shap_mean_abs_df = pd.DataFrame({<span class="hljs-string">'feature_name'</span>: mean_abs_shap.index, <span class="hljs-string">'mean_abs_shap'</span>: mean_abs_shap.values })
    MEAN_ABS_SHAP_PATH = os.path.join(<span class="hljs-string">'reports'</span>, <span class="hljs-string">f'dfn_shap_mean_abs_<span class="hljs-subst">{STOCKCODE}</span>.json'</span>)
    shap_mean_abs_df.to_json(MEAN_ABS_SHAP_PATH, orient=<span class="hljs-string">'records'</span>, indent=<span class="hljs-number">4</span>)
</code></pre>
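<p>Note that the <code>plots</code> section in <code>dvc.yaml</code> also expects <code>reports/dfn_shap_summary_${params.stockcode}.json</code> with <code>feature_name</code>/<code>shap_value</code> records for the beeswarm plot; the excerpt above doesn't show that step. Reshaping the SHAP frame into those long-format records is a one-liner with pandas (a sketch of how it could be done, not the project's exact code):</p>
<pre><code class="lang-python">import pandas as pd

def shap_summary_records(shap_df):
    # melt the (samples, features) SHAP frame into one row per (feature, value) pair
    return shap_df.melt(var_name='feature_name', value_name='shap_value')

shap_df = pd.DataFrame({'num__unitprice': [0.1, -0.2], 'num__invoicedate': [0.3, 0.0]})
records = shap_summary_records(shap_df)
# records.to_json('reports/dfn_shap_summary_85123A.json', orient='records', indent=4)
</code></pre>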
<h4 id="heading-outputs-4">Outputs</h4>
<p>This stage generates five output files:</p>
<ul>
<li><p><code>data/dfn_inference_results_${params.stockcode}.parquet</code>: Stores prediction results, labeled targets, and any columns with sensitive features like gender, age, or income. We’ll use this file for the fairness test in the last stage.</p>
</li>
<li><p><code>metrics/dfn_inf.json</code>: Stores evaluation metrics and tuning results:</p>
</li>
</ul>
<pre><code class="lang-json">{
    <span class="hljs-attr">"stockcode"</span>: <span class="hljs-string">"85123A"</span>,
    <span class="hljs-attr">"mse_inf"</span>: <span class="hljs-number">0.6841545701026917</span>,
    <span class="hljs-attr">"mae_inf"</span>: <span class="hljs-number">11.5866117477417</span>,
    <span class="hljs-attr">"rmsle_inf"</span>: <span class="hljs-number">0.7423332333564758</span>,
    <span class="hljs-attr">"model_version"</span>: <span class="hljs-string">"dfn_85123A_35834"</span>,
    <span class="hljs-attr">"hparams"</span>: {
        <span class="hljs-attr">"num_layers"</span>: <span class="hljs-number">4</span>,
        <span class="hljs-attr">"batch_norm"</span>: <span class="hljs-literal">false</span>,
        <span class="hljs-attr">"dropout_rate_layer_0"</span>: <span class="hljs-number">0.13765888061300502</span>,
        <span class="hljs-attr">"n_units_layer_0"</span>: <span class="hljs-number">184</span>,
        <span class="hljs-attr">"dropout_rate_layer_1"</span>: <span class="hljs-number">0.5509872409359128</span>,
        <span class="hljs-attr">"n_units_layer_1"</span>: <span class="hljs-number">122</span>,
        <span class="hljs-attr">"dropout_rate_layer_2"</span>: <span class="hljs-number">0.2408753527744403</span>,
        <span class="hljs-attr">"n_units_layer_2"</span>: <span class="hljs-number">35</span>,
        <span class="hljs-attr">"dropout_rate_layer_3"</span>: <span class="hljs-number">0.03451842588822594</span>,
        <span class="hljs-attr">"n_units_layer_3"</span>: <span class="hljs-number">224</span>,
        <span class="hljs-attr">"learning_rate"</span>: <span class="hljs-number">0.026240673135104406</span>,
        <span class="hljs-attr">"optimizer"</span>: <span class="hljs-string">"adamax"</span>,
        <span class="hljs-attr">"batch_size"</span>: <span class="hljs-number">64</span>
    },
    <span class="hljs-attr">"optimizer"</span>: <span class="hljs-string">"adamax"</span>,
    <span class="hljs-attr">"batch_size"</span>: <span class="hljs-number">64</span>,
    <span class="hljs-attr">"lr"</span>: <span class="hljs-number">0.026240673135104406</span>,
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:31:12.946405"</span>
}
</code></pre>
<ul>
<li><code>reports/dfn_shap_mean_abs.json</code>: Stores the mean absolute SHAP value of each feature:</li>
</ul>
<pre><code class="lang-json">[
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__invoicedate"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.219255722</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__unitprice"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.1069829418</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__product_avg_quantity_last_month"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.1021453096</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__product_max_price_all_time"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.0855356899</span>
    },
...
]
</code></pre>
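<p>For reference, the mean absolute SHAP values above can be derived from the raw SHAP matrix as shown in this short sketch (the array shape and the sample values are illustrative, not taken from the project's actual outputs):</p>

```python
import numpy as np

# raw SHAP values: one row per sample, one column per feature (illustrative)
shap_values = np.array([
    [ 0.2, -0.1 ],
    [-0.3,  0.05],
])
feature_names = ['num__invoicedate', 'num__unitprice']

# take the absolute value, then average over samples (axis 0)
mean_abs = np.abs(shap_values).mean(axis=0)
summary = [
    {'feature_name': n, 'mean_abs_shap': float(v)}
    for n, v in zip(feature_names, mean_abs)
]
print(summary)
```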
<ul>
<li><p><code>reports/dfn_shap_summary.json</code>: Contains the data points necessary to draw the beeswarm/bar plots.</p>
</li>
<li><p><code>reports/dfn_raw_shap_values.parquet</code>: Stores raw SHAP values.</p>
</li>
</ul>
<h3 id="heading-stage-6-assessing-model-risk-and-fairness">Stage 6: Assessing Model Risk and Fairness</h3>
<p>The last stage assesses the risk and fairness of the final inference results.</p>
<h4 id="heading-the-fairness-testing">Fairness Testing</h4>
<p>Fairness testing in ML is the process of systematically evaluating a model’s predictions to ensure they are not unfairly biased toward specific groups defined by sensitive attributes like race and gender.</p>
<p>In this project, we’ll use the registration status <code>is_registered</code> column as a sensitive feature and make sure the <strong>Mean Outcome Difference (MOD)</strong> is within the specified threshold of <code>0.1</code>.</p>
<p>The MOD is calculated as the absolute difference between the mean prediction values of the privileged (registered) and unprivileged (unregistered) groups.</p>
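<p>The MOD calculation described above can be sketched in a few lines (a minimal example with made-up predictions; the column names follow the project's conventions):</p>

```python
import pandas as pd

# minimal sketch of the Mean Outcome Difference (MOD) check:
# absolute difference between the mean predictions of the two groups
def mean_outcome_difference(df, sensitive_col='is_registered',
                            pred_col='y_pred', privileged_group=1):
    mean_priv = df.loc[df[sensitive_col] == privileged_group, pred_col].mean()
    mean_unpriv = df.loc[df[sensitive_col] != privileged_group, pred_col].mean()
    return abs(mean_priv - mean_unpriv)

df = pd.DataFrame({
    'is_registered': [1, 1, 0, 0],
    'y_pred': [2.0, 2.0, 2.05, 2.05],
})
mod = mean_outcome_difference(df)
print(mod <= 0.1)  # True: within the 0.1 threshold
```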
<h4 id="heading-dvc-configuration-5">DVC Configuration</h4>
<p>First, we’ll add the <code>assess_model_risk</code> stage right after the <code>inference_primary_model</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">inference_primary_model:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">assess_model_risk:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/assess_risk_and_fairness.py
      data/dfn_inference_results_${params.stockcode}.parquet
      metrics/dfn_risk_fairness_${params.stockcode}.json
      ${tracking.sensitive_feature_col}
      ${params.stockcode}
      ${tracking.privileged_group}
      ${tracking.mod_threshold}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/assess_risk_and_fairness.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/dfn_inference_results_${params.stockcode}.parquet</span> <span class="hljs-comment"># declare the inference results as a dependency</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.sensitive_feature_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.privileged_group</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.mod_threshold</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_risk_fairness_${params.stockcode}.json:</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>
</code></pre>
<p>Then we’ll add default values to the parameters:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>

<span class="hljs-attr">tuning:</span>
  <span class="hljs-attr">n_trials:</span> <span class="hljs-number">100</span>
  <span class="hljs-attr">num_epochs:</span> <span class="hljs-number">3000</span>
  <span class="hljs-attr">should_local_save:</span> <span class="hljs-literal">False</span>
  <span class="hljs-attr">grid:</span> <span class="hljs-literal">False</span>

<span class="hljs-comment"># default values for the tracking parameters</span>
<span class="hljs-attr">tracking:</span>
  <span class="hljs-attr">sensitive_feature_col:</span> <span class="hljs-string">"is_registered"</span>
  <span class="hljs-attr">privileged_group:</span> <span class="hljs-number">1</span> <span class="hljs-comment"># member</span>
  <span class="hljs-attr">mod_threshold:</span> <span class="hljs-number">0.1</span>
</code></pre>
<h4 id="heading-python-script">Python Script</h4>
<p>The corresponding Python script contains the <code>calculate_fairness_metrics</code> function, which performs the risk and fairness assessment:</p>
<p><code>src/model/torch_model/assess_risk_and_fairness.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_absolute_error, mean_squared_error, root_mean_squared_log_error

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">calculate_fairness_metrics</span>(<span class="hljs-params">
        df: pd.DataFrame,
        sensitive_feature_col: str,
        label_col: str = <span class="hljs-string">'y_true'</span>,
        prediction_col: str = <span class="hljs-string">'y_pred'</span>,
        privileged_group: int = <span class="hljs-number">1</span>,
        mod_threshold: float = <span class="hljs-number">0.1</span>,
    </span>) -&gt; dict:</span>

    metrics = dict()
    unprivileged_group = <span class="hljs-number">0</span> <span class="hljs-keyword">if</span> privileged_group == <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-number">1</span>

    <span class="hljs-comment">## 1. risk assessment - predictive performance metrics by group</span>
    <span class="hljs-keyword">for</span> group, name <span class="hljs-keyword">in</span> zip([unprivileged_group, privileged_group], [<span class="hljs-string">'unprivileged'</span>, <span class="hljs-string">'privileged'</span>]):
        subset = df[df[sensitive_feature_col] == group]
        <span class="hljs-keyword">if</span> len(subset) == <span class="hljs-number">0</span>: <span class="hljs-keyword">continue</span>

        y_true = subset[label_col].values
        y_pred = subset[prediction_col].values

        metrics[<span class="hljs-string">f'mse_<span class="hljs-subst">{name}</span>'</span>] = float(mean_squared_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>
        metrics[<span class="hljs-string">f'mae_<span class="hljs-subst">{name}</span>'</span>] = float(mean_absolute_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>
        metrics[<span class="hljs-string">f'rmsle_<span class="hljs-subst">{name}</span>'</span>] = float(root_mean_squared_log_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>

        <span class="hljs-comment"># mean prediction (outcome disparity component)</span>
        metrics[<span class="hljs-string">f'mean_prediction_<span class="hljs-subst">{name}</span>'</span>] = float(y_pred.mean()) <span class="hljs-comment"># type: ignore</span>

    <span class="hljs-comment">## 2. bias assessment - fairness metrics</span>
    <span class="hljs-comment"># difference in mean absolute error between groups</span>
    mae_diff = metrics.get(<span class="hljs-string">'mae_unprivileged'</span>, <span class="hljs-number">0</span>) - metrics.get(<span class="hljs-string">'mae_privileged'</span>, <span class="hljs-number">0</span>)
    metrics[<span class="hljs-string">'mae_diff'</span>] = float(mae_diff)

    <span class="hljs-comment"># mean outcome difference</span>
    mod = metrics.get(<span class="hljs-string">'mean_prediction_unprivileged'</span>, <span class="hljs-number">0</span>) - metrics.get(<span class="hljs-string">'mean_prediction_privileged'</span>, <span class="hljs-number">0</span>)
    metrics[<span class="hljs-string">'mean_outcome_difference'</span>] = float(mod)
    metrics[<span class="hljs-string">'is_mod_acceptable'</span>] = <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> abs(mod) &lt;= mod_threshold <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>

    <span class="hljs-keyword">return</span> metrics


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    parser = argparse.ArgumentParser(description=<span class="hljs-string">'assess bias and fairness metrics on model inference results.'</span>)
    parser.add_argument(<span class="hljs-string">'inference_file_path'</span>, type=str, help=<span class="hljs-string">'parquet file path to the inference results w/ y_true, y_pred, and sensitive feature cols.'</span>)
    parser.add_argument(<span class="hljs-string">'metrics_output_path'</span>, type=str, help=<span class="hljs-string">'json file path to save the metrics output.'</span>)
    parser.add_argument(<span class="hljs-string">'sensitive_feature_col'</span>, type=str, help=<span class="hljs-string">'column name of sensitive features'</span>)
    parser.add_argument(<span class="hljs-string">'stockcode'</span>, type=str)
    parser.add_argument(<span class="hljs-string">'privileged_group'</span>, type=int, nargs=<span class="hljs-string">'?'</span>, default=<span class="hljs-number">1</span>) <span class="hljs-comment"># nargs='?' makes the default effective</span>
    parser.add_argument(<span class="hljs-string">'mod_threshold'</span>, type=float, nargs=<span class="hljs-string">'?'</span>, default=<span class="hljs-number">.1</span>)
    args = parser.parse_args()

    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># load inf df</span>
        df_inference = pd.read_parquet(args.inference_file_path)
        LABEL_COL = <span class="hljs-string">'y_true'</span>
        PREDICTION_COL = <span class="hljs-string">'y_pred'</span>
        SENSITIVE_COL = args.sensitive_feature_col

        <span class="hljs-comment"># compute fairness metrics</span>
        metrics = calculate_fairness_metrics(
            df=df_inference,
            sensitive_feature_col=SENSITIVE_COL,
            label_col=LABEL_COL,
            prediction_col=PREDICTION_COL,
            privileged_group=args.privileged_group,
            mod_threshold=args.mod_threshold,
        )

        <span class="hljs-comment"># add items to metrics</span>
        metrics[<span class="hljs-string">'model_version'</span>] = <span class="hljs-string">f'dfn_<span class="hljs-subst">{args.stockcode}</span>_<span class="hljs-subst">{os.getpid()}</span>'</span>
        metrics[<span class="hljs-string">'sensitive_feature'</span>] = args.sensitive_feature_col
        metrics[<span class="hljs-string">'privileged_group'</span>] = args.privileged_group
        metrics[<span class="hljs-string">'mod_threshold'</span>] = args.mod_threshold
        metrics[<span class="hljs-string">'stockcode'</span>] = args.stockcode
        metrics[<span class="hljs-string">'timestamp'</span>] = datetime.datetime.now().isoformat()

        <span class="hljs-comment"># save metrics (tracked by dvc)</span>
        <span class="hljs-keyword">with</span> open(args.metrics_output_path, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
            json_metrics = { k: (v <span class="hljs-keyword">if</span> pd.notna(v) <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>) <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> metrics.items() }
            json.dump(json_metrics, f, indent=<span class="hljs-number">4</span>)

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        main_logger.error(<span class="hljs-string">f'... an error occurred during risk and fairness assessment: <span class="hljs-subst">{e}</span> ...'</span>)
        exit(<span class="hljs-number">1</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    main()
</code></pre>
<h4 id="heading-outputs-5">Outputs</h4>
<p>The final stage generates a metrics file containing the test results and the model version:</p>
<p><code>metrics/dfn_risk_fairness_85123A.json</code>:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"mse_unprivileged"</span>: <span class="hljs-number">3.5370739412593575</span>,
    <span class="hljs-attr">"mae_unprivileged"</span>: <span class="hljs-number">1.48263614013523</span>,
    <span class="hljs-attr">"rmsle_unprivileged"</span>: <span class="hljs-number">0.6080000224747837</span>,
    <span class="hljs-attr">"mean_prediction_unprivileged"</span>: <span class="hljs-number">1.8507767915725708</span>,
    <span class="hljs-attr">"mae_diff"</span>: <span class="hljs-number">1.48263614013523</span>,
    <span class="hljs-attr">"mean_outcome_difference"</span>: <span class="hljs-number">1.8507767915725708</span>,
    <span class="hljs-attr">"is_mod_acceptable"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-attr">"model_version"</span>: <span class="hljs-string">"dfn_85123A_35971"</span>,
    <span class="hljs-attr">"sensitive_feature"</span>: <span class="hljs-string">"is_registered"</span>,
    <span class="hljs-attr">"privileged_group"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-attr">"mod_threshold"</span>: <span class="hljs-number">0.1</span>,
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:31:15.998590"</span>
}
</code></pre>
<p>That’s all for the lineage configuration. Now, we’ll test it locally.</p>
<h3 id="heading-test-in-local">Test in Local</h3>
<p>We’ll run the entire ML lineage with this command:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> repro -f
</code></pre>
<p>The <code>-f</code> flag forces DVC to rerun all stages, whether or not their dependencies have changed.</p>
<p>The command will automatically create the <code>dvc.lock</code> file at the root of the project directory:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">schema:</span> <span class="hljs-string">'2.0'</span>
<span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline_full:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">python</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>
    <span class="hljs-attr">deps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">src/_utils/</span>
      <span class="hljs-attr">hash:</span> <span class="hljs-string">md5</span>
      <span class="hljs-attr">md5:</span> <span class="hljs-string">ae41392532188d290395495f6827ed00.dir</span>
      <span class="hljs-attr">size:</span> <span class="hljs-number">15870</span>
      <span class="hljs-attr">nfiles:</span> <span class="hljs-number">10</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-attr">hash:</span> <span class="hljs-string">md5</span>
      <span class="hljs-attr">md5:</span> <span class="hljs-string">a8a61a4b270581a7c387d51e416f4e86.dir</span>
      <span class="hljs-attr">size:</span> <span class="hljs-number">95715</span>
<span class="hljs-string">...</span>
</code></pre>
<p>The <code>dvc.lock</code> file must be committed to Git to make sure DVC loads the latest tracked files:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$git</span> add dvc.lock .dvc dvc.yaml params.yaml
<span class="hljs-variable">$git</span> commit -m<span class="hljs-string">'updated dvc config'</span>
<span class="hljs-variable">$git</span> push
</code></pre>
<h2 id="heading-step-3-deploying-the-dvc-project">Step 3: Deploying the DVC Project</h2>
<p>Next, we’ll deploy the DVC project to ensure the AWS Lambda function can access the cached files in production.</p>
<p>We’ll start by configuring the DVC remote where the cached files are stored.</p>
<p>DVC offers <a target="_blank" href="https://dvc.org/doc/user-guide/data-management/remote-storage#supported-storage-types">various storage types</a> like AWS S3 and Google Cloud. We’ll use AWS S3 for this project, but your choice depends on the project ecosystem, your familiarity with the tool, and any resource constraints.</p>
<p>First, we’ll create a new S3 bucket in the selected AWS region:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$aws</span> s3 mb s3://&lt;PROJECT NAME&gt;/&lt;BUCKET NAME&gt;  --region &lt;AWS REGION&gt;
</code></pre>
<p>Make sure the IAM role has the following permissions: <code>s3:ListBucket</code>, <code>s3:GetObject</code>, <code>s3:PutObject</code>, and <code>s3:DeleteObject</code>.</p>
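<p>A minimal IAM policy granting those four permissions might look like the following sketch (the bucket ARN is a placeholder — substitute your own bucket name):</p>

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::<BUCKET NAME>"
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::<BUCKET NAME>/*"
        }
    ]
}
```

Note that <code>s3:ListBucket</code> applies to the bucket itself, while the object-level actions apply to the objects inside it (the <code>/*</code> resource).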
<p>Then, add the URI of the S3 bucket as the DVC remote:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> remote add -d &lt;DVC REMOTE NAME&gt; s3://&lt;PROJECT NAME&gt;/&lt;BUCKET NAME&gt;
</code></pre>
<p>Next, push the cache files to the DVC remote:</p>
<pre><code class="lang-bash">$dvc push
</code></pre>
<p>Now, all cache files are stored in the S3 bucket:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/0*yl9N4P8LNI7d_G_z.png" alt="Figure C. Screenshot of the DVC remote in AWS S3 bucket" width="600" height="400" loading="lazy"></p>
<p><strong>Figure C.</strong> Screenshot of the DVC remote in AWS S3 bucket</p>
<p>As shown in <strong>Figure A,</strong> this deployment step is necessary for the AWS Lambda function to access the DVC cache in production.</p>
<h2 id="heading-step-4-configuring-scheduled-run-with-prefect"><strong>Step 4: Configuring Scheduled Run with Prefect</strong></h2>
<p>The next step is to configure the scheduled run of the entire lineage with Prefect.</p>
<p>Prefect is an open-source workflow orchestration tool for building, scheduling, and monitoring pipelines. It uses a concept called a work pool to decouple the orchestration logic from the execution infrastructure.</p>
<p>The work pool serves as a standardized base configuration, running each flow in a Docker container image to guarantee a consistent execution environment.</p>
<h3 id="heading-configuring-the-docker-image-registry">Configuring the Docker Image Registry</h3>
<p>The first step is to configure the Docker image registry for the Prefect work pool:</p>
<ul>
<li><p>For local deployment: <strong>A container registry in the Docker Hub.</strong></p>
</li>
<li><p>For production deployment: <strong>AWS ECR</strong>.</p>
</li>
</ul>
<p>For local deployment, we’ll first authenticate the Docker client:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> login
</code></pre>
<p>And grant a user permission to run Docker commands without <code>sudo</code>:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$sudo</span> dscl . -append /Groups/docker GroupMembership <span class="hljs-variable">$USER</span>
</code></pre>
<p>For production deployment, we’ll create a new ECR repository:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$aws</span> ecr create-repository --repository-name &lt;REPOSITORY NAME&gt; --region &lt;AWS REGION&gt;
</code></pre>
<p>(Make sure the IAM role has access to this new ECR URI.)</p>
<h3 id="heading-configure-prefect-tasks-and-flows">Configure Prefect Tasks and Flows</h3>
<p>Next, we’ll configure the Prefect <code>task</code> and <code>flow</code> in the project:</p>
<ul>
<li><p>The Prefect <code>task</code> executes the <code>dvc repro</code> and <code>dvc push</code> commands</p>
</li>
<li><p>The Prefect <code>flow</code> executes the Prefect <code>task</code> on a weekly schedule.</p>
</li>
</ul>
<p><code>src/prefect_flows.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> subprocess
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> timedelta, datetime
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> prefect <span class="hljs-keyword">import</span> flow, task
<span class="hljs-keyword">from</span> prefect.schedules <span class="hljs-keyword">import</span> Schedule
<span class="hljs-keyword">from</span> prefect_aws <span class="hljs-keyword">import</span> AwsCredentials

<span class="hljs-comment"># add project root to the python path before importing project modules</span>
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), <span class="hljs-string">'..'</span>)))

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># define the prefect task</span>
<span class="hljs-meta">@task(retries=3, retry_delay_seconds=30)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_dvc_pipeline</span>():</span>
    <span class="hljs-comment"># execute the dvc pipeline and log its output</span>
    result = subprocess.run([<span class="hljs-string">"dvc"</span>, <span class="hljs-string">"repro"</span>], capture_output=<span class="hljs-literal">True</span>, text=<span class="hljs-literal">True</span>, check=<span class="hljs-literal">True</span>)
    main_logger.info(result.stdout)

    <span class="hljs-comment"># push the updated data</span>
    subprocess.run([<span class="hljs-string">"dvc"</span>, <span class="hljs-string">"push"</span>], check=<span class="hljs-literal">True</span>)


<span class="hljs-comment"># define the prefect flow</span>
<span class="hljs-meta">@flow(name="Weekly Data Pipeline")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">weekly_data_flow</span>():</span>
    run_dvc_pipeline()

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># docker image registry (either docker hub or aws ecr)</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    ENV = os.getenv(<span class="hljs-string">'ENV'</span>, <span class="hljs-string">'production'</span>)
    DOCKER_HUB_REPO = os.getenv(<span class="hljs-string">'DOCKER_HUB_REPO'</span>)
    ECR_FOR_PREFECT_PATH = os.getenv(<span class="hljs-string">'S3_BUCKET_FOR_PREFECT_PATH'</span>)
    image_repo = <span class="hljs-string">f'<span class="hljs-subst">{DOCKER_HUB_REPO}</span>:ml-sales-pred-data-latest'</span> <span class="hljs-keyword">if</span> ENV == <span class="hljs-string">'local'</span> <span class="hljs-keyword">else</span> <span class="hljs-string">f'<span class="hljs-subst">{ECR_FOR_PREFECT_PATH}</span>:latest'</span>

    <span class="hljs-comment"># define weekly schedule</span>
    weekly_schedule = Schedule(
        interval=timedelta(weeks=<span class="hljs-number">1</span>),
        anchor_date=datetime(<span class="hljs-number">2025</span>, <span class="hljs-number">9</span>, <span class="hljs-number">29</span>, <span class="hljs-number">9</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>),
        active=<span class="hljs-literal">True</span>,
    )

    <span class="hljs-comment"># aws credentials to access ecr</span>
    AwsCredentials(
        aws_access_key_id=os.getenv(<span class="hljs-string">'AWS_ACCESS_KEY_ID'</span>),
        aws_secret_access_key=os.getenv(<span class="hljs-string">'AWS_SECRET_ACCESS_KEY'</span>),
        region_name=os.getenv(<span class="hljs-string">'AWS_REGION_NAME'</span>),
    ).save(<span class="hljs-string">'aws'</span>, overwrite=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># deploy the prefect flow</span>
    weekly_data_flow.deploy(
        name=<span class="hljs-string">'weekly-data-flow'</span>,
        schedule=weekly_schedule, <span class="hljs-comment"># schedule</span>
        work_pool_name=<span class="hljs-string">"wp-ml-sales-pred"</span>, <span class="hljs-comment"># work pool where the docker image (flow) runs</span>
        image=image_repo, <span class="hljs-comment"># create a docker image at docker hub (local) or ecr (production)</span>
        concurrency_limit=<span class="hljs-number">3</span>,
        push=<span class="hljs-literal">True</span> <span class="hljs-comment"># push the docker image to the image_repo</span>
    )
</code></pre>
<h3 id="heading-test-in-local-1">Test in Local</h3>
<p>Next, we’ll test the workflow locally with the Prefect server:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$uv</span> run prefect server start

<span class="hljs-variable">$export</span> PREFECT_API_URL=<span class="hljs-string">"http://127.0.0.1:4200/api"</span>
</code></pre>
<p>Run the <code>prefect_flows.py</code> script:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$uv</span> run src/prefect_flows.py
</code></pre>
<p>Upon successful execution, the Prefect dashboard shows that the workflow is scheduled to run:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*pUJppTJ4MloU2DVr.png" alt="Figure D. The screenshot of the Prefect dashboard" width="600" height="400" loading="lazy"></p>
<p><strong>Figure D.</strong> A screenshot of the Prefect dashboard</p>
<h2 id="heading-step-5-deploying-the-application">Step 5: Deploying the Application</h2>
<p>The final step is to deploy the entire application as a containerized Lambda by configuring the <code>Dockerfile</code> and the Flask application scripts.</p>
<p>The specific process in this final deployment step depends on your infrastructure.</p>
<p>The common point, though, is that DVC eliminates the need to store large Parquet or CSV files directly in the feature store or model store, because it caches them as lightweight hashed files.</p>
<p>So, first, we’ll simplify the loading logic of the Flask application script by using the <code>dvc.api</code> framework:</p>
<p><code>app.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment">### ... the rest components remain the same  ...</span>

<span class="hljs-keyword">import</span> dvc.api

DVC_REMOTE_NAME = <span class="hljs-string">'&lt;REMOTE NAME IN .dvc/config file&gt;'</span>


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">configure_dvc_for_lambda</span>():</span>
    <span class="hljs-comment"># set dvc directories to /tmp</span>
    os.environ.update({
        <span class="hljs-string">'DVC_CACHE_DIR'</span>: <span class="hljs-string">'/tmp/dvc-cache'</span>,
        <span class="hljs-string">'DVC_DATA_DIR'</span>: <span class="hljs-string">'/tmp/dvc-data'</span>,
        <span class="hljs-string">'DVC_CONFIG_DIR'</span>: <span class="hljs-string">'/tmp/dvc-config'</span>,
        <span class="hljs-string">'DVC_GLOBAL_CONFIG_DIR'</span>: <span class="hljs-string">'/tmp/dvc-global-config'</span>,
        <span class="hljs-string">'DVC_SITE_CACHE_DIR'</span>: <span class="hljs-string">'/tmp/dvc-site-cache'</span>
    })
    <span class="hljs-keyword">for</span> dir_path <span class="hljs-keyword">in</span> [<span class="hljs-string">'/tmp/dvc-cache'</span>, <span class="hljs-string">'/tmp/dvc-data'</span>, <span class="hljs-string">'/tmp/dvc-config'</span>]:
        os.makedirs(dir_path, exist_ok=<span class="hljs-literal">True</span>)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_x_test</span>():</span>
    <span class="hljs-keyword">global</span> X_test
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.environ.get(<span class="hljs-string">'PYTEST_RUN'</span>, <span class="hljs-literal">False</span>):
        main_logger.info(<span class="hljs-string">"... loading x_test ..."</span>)

        <span class="hljs-comment"># config dvc directories</span>
        configure_dvc_for_lambda()
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">with</span> dvc.api.open(X_TEST_PATH, remote=DVC_REMOTE_NAME, mode=<span class="hljs-string">'rb'</span>) <span class="hljs-keyword">as</span> fd:
                X_test = pd.read_parquet(fd)
                main_logger.info(<span class="hljs-string">'✅ successfully loaded x_test via dvc api'</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f'❌ general loading error: <span class="hljs-subst">{e}</span>'</span>, exc_info=<span class="hljs-literal">True</span>)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_preprocessor</span>():</span>
    <span class="hljs-keyword">global</span> preprocessor
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.environ.get(<span class="hljs-string">'PYTEST_RUN'</span>, <span class="hljs-literal">False</span>):
        main_logger.info(<span class="hljs-string">"... loading preprocessor ..."</span>)
        configure_dvc_for_lambda()
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">with</span> dvc.api.open(PREPROCESSOR_PATH, remote=DVC_REMOTE_NAME, mode=<span class="hljs-string">'rb'</span>) <span class="hljs-keyword">as</span> fd:
                preprocessor = joblib.load(fd)
                main_logger.info(<span class="hljs-string">'✅ successfully loaded preprocessor via dvc api'</span>)

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f'❌ general loading error: <span class="hljs-subst">{e}</span>'</span>, exc_info=<span class="hljs-literal">True</span>)

<span class="hljs-comment">### ... the rest components remain the same  ...</span>
</code></pre>
<p>Then, update the Dockerfile to enable Docker to correctly reference the DVC components:</p>
<p><code>Dockerfile.lambda.production</code>:</p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># use an official python runtime</span>
FROM public.ecr.aws/lambda/python:3.12

<span class="hljs-comment"># set environment variables (adding dvc related env variables)</span>
ENV JOBLIB_MULTIPROCESSING=<span class="hljs-number">0</span>
ENV DVC_HOME=<span class="hljs-string">"/tmp/.dvc"</span>
ENV DVC_CACHE_DIR=<span class="hljs-string">"/tmp/.dvc/cache"</span>
ENV DVC_REMOTE_NAME=<span class="hljs-string">"storage"</span>
ENV DVC_GLOBAL_SITE_CACHE_DIR=<span class="hljs-string">"/tmp/dvc_global"</span>

<span class="hljs-comment"># copy requirements file and install dependencies</span>
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir dvc dvc-s3

<span class="hljs-comment"># setup dvc</span>
RUN dvc init --no-scm
RUN dvc config core.no_scm true

<span class="hljs-comment"># copy the code to the lambda task root</span>
COPY . ${LAMBDA_TASK_ROOT}
CMD [ <span class="hljs-string">"app.handler"</span> ]
</code></pre>
<p>Lastly, ensure the large files are excluded from the Docker container image:</p>
<p><code>.dockerignore</code>:</p>
<pre><code class="lang-bash"><span class="hljs-comment">### ... the rest of the components remain the same ...</span>

<span class="hljs-comment"># dvc cache contains large files</span>
.dvc/cache
.dvcignore

<span class="hljs-comment"># add all folders that DVC will track</span>
data/
preprocessors/
models/
reports/
metrics/
</code></pre>
<h3 id="heading-test-in-local-2">Test Locally</h3>
<p>Finally, we’ll build and test the Docker image:</p>
<pre><code class="lang-bash">$ docker build -t my-app -f Dockerfile.lambda.local .
$ docker run -p 5002:5002 -e ENV=local my-app app.py
</code></pre>
<p>Once configured successfully, the Waitress server will serve the Flask application.</p>
<p>After confirming the changes, push the code to Git:</p>
<pre><code class="lang-bash">$ git add .
$ git commit -m 'updated dockerfiles and flask app scripts'
$ git push
</code></pre>
<p>This <code>push</code> command triggers the CI/CD pipeline via GitHub Actions, which generates a Docker container image and pushes it to AWS ECR.</p>
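<p>For reference, a minimal GitHub Actions workflow that builds the image and pushes it to ECR might look like the sketch below. This is an illustrative outline only: the workflow filename, repository name, region, and secret names are placeholders, not taken from the article's repository.</p>
<pre><code class="lang-yaml"># .github/workflows/build.yml (hypothetical sketch)
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - run: |
          docker build -t ${{ steps.ecr.outputs.registry }}/my-app:latest -f Dockerfile.lambda.production .
          docker push ${{ steps.ecr.outputs.registry }}/my-app:latest
</code></pre>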
<p>After the pipeline completes successfully and you've verified the results, you can manually trigger the deployment workflow in GitHub Actions.</p>
<p>And that’s it!</p>
<p>You can learn more here: <a target="_blank" href="https://medium.com/towards-artificial-intelligence/integrating-ci-cd-pipelines-to-machine-learning-applications-f5657c7fa164">Integrating the infrastructure CI/CD pipeline to an ML application</a></p>
<p>All code is available in <a target="_blank" href="https://github.com/krik8235/ml-sales-prediction">my GitHub repository</a>.</p>
<p>The mock app is available <a target="_blank" href="https://kuriko-iwai.vercel.app/online-commerce-intelligence-hub">here</a>.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Building robust ML applications requires comprehensive ML lineage to ensure reliability and traceability.</p>
<p>In this article, you learned how to build an ML lineage by integrating open-source services like DVC and Prefect.</p>
<p>In practice, initial planning matters. Specifically, defining how metrics are tracked and at which stages leads directly to a cleaner, more maintainable code structure and better extensibility down the road.</p>
<p>Moving forward, we can consider adding more stages to the lineage and integrating advanced logic for data drift detection or fairness tests.</p>
<p>This will further ensure continued model performance and data integrity in the production environment.</p>
<p><strong>You can check out my</strong> <a target="_blank" href="https://kuriko-iwai.vercel.app/"><strong>Portfolio</strong></a> <strong>/</strong> <a target="_blank" href="https://github.com/krik8235"><strong>Github</strong></a><strong>.</strong></p>
<p><em>All images, unless otherwise noted, are by the author.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Transformers for Real-Time Gesture Recognition ]]>
                </title>
                <description>
                    <![CDATA[ Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are no... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/using-transformers-for-real-time-gesture-recognition/</link>
                <guid isPermaLink="false">68e3c692aa82abf4b593114c</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ transformers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pytorch ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ONNX ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gradio ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Gesture Recognition ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Tutorial ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Mon, 06 Oct 2025 13:39:30 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759757931295/5f19fd4e-93c0-4bd7-a75c-a7858e061ecd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are not static images. Rather, they unfold over time. To build more robust, real-time systems, we need models that can capture both spatial details and temporal context.</p>
<p>This is where Transformers come in. Originally built for language, they’ve become state-of-the-art in vision tasks thanks to models like the Vision Transformer (ViT) and video-focused variants such as TimeSformer.</p>
<p>In this tutorial, we’ll use a Transformer backbone to create a lightweight real-time gesture recognition tool, optimized for small datasets and deployable on a regular laptop webcam.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-transformers-for-gestures">Why Transformers for Gestures?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-youll-learn">What You’ll Learn</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-generate-a-gesture-dataset">Generate a Gesture Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-training-script-trainpy">Training Script: train.py</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-export-the-model-to-onnx">Export the Model to ONNX</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-next-steps">Next Steps</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-transformers-for-gestures">Why Transformers for Gestures?</h2>
<p>Transformers are powerful because they use self-attention to model relationships across a sequence. For gestures, this means the model doesn’t just see isolated frames, but also learns how movements evolve over time. A wave, for example, looks different from a raised hand only when viewed as a sequence.</p>
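<p>As a toy illustration, scaled dot-product self-attention over a sequence of frame embeddings can be sketched in a few lines of NumPy. This is a simplification, not the article's model: real Transformer layers learn separate query, key, and value projections, while here the raw embeddings play all three roles.</p>

```python
# Minimal self-attention sketch over a sequence of frame embeddings.
# Sizes are illustrative; the embeddings double as queries, keys, and values.
import numpy as np

def self_attention(x):
    """x: (T, D) sequence of frame embeddings -> (T, D) context-aware output."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # (T, T) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over time steps
    return weights @ x                              # each frame attends to all frames

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))   # 8 frames, 16-dim embeddings
out = self_attention(frames)
print(out.shape)                    # (8, 16)
```

<p>Each output row is a weighted mix of every frame's embedding, which is exactly what lets the model distinguish a wave from a static raised hand.</p>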
<p>Vision Transformers process images as patches, while video Transformers extend this to multiple frames with temporal attention. Even a simple approach, like applying ViT to each frame and pooling across time, can outperform traditional CNN-based methods for small datasets.</p>
<p>Combined with Hugging Face’s pre-trained models and ONNX Runtime for optimization, Transformers make it possible to train on a modest dataset and still achieve smooth real-time recognition.</p>
<h2 id="heading-what-youll-learn">What You’ll Learn</h2>
<p>In this tutorial, you’ll build a gesture recognition system using Transformers. By the end, you’ll know how to:</p>
<ul>
<li><p>Create (or record) a tiny gesture dataset</p>
</li>
<li><p>Train a Vision Transformer (ViT) with temporal pooling</p>
</li>
<li><p>Export the model to ONNX for faster inference</p>
</li>
<li><p>Build a real-time Gradio app that classifies gestures from your webcam</p>
</li>
<li><p>Evaluate your model’s accuracy and latency with simple scripts</p>
</li>
<li><p>Understand the accessibility potential and ethical limits of gesture recognition</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you should have:</p>
<ul>
<li><p>Basic Python knowledge (functions, scripts, virtual environments)</p>
</li>
<li><p>Familiarity with PyTorch (tensors, datasets, training loops) – helpful but not required</p>
</li>
<li><p>Python 3.8+ installed on your system</p>
</li>
<li><p>A webcam (for the live demo in Gradio)</p>
</li>
<li><p>Optionally: GPU access (training on CPU works, but is slower)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<p>Create a new project folder and install the required libraries.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create a new project directory and navigate into it</span>
mkdir transformer-gesture &amp;&amp; <span class="hljs-built_in">cd</span> transformer-gesture

<span class="hljs-comment"># Set up a Python virtual environment</span>
python -m venv .venv

<span class="hljs-comment"># Activate the virtual environment</span>
<span class="hljs-comment"># Windows PowerShell</span>
.venv\Scripts\Activate.ps1

<span class="hljs-comment"># macOS/Linux</span>
<span class="hljs-built_in">source</span> .venv/bin/activate
</code></pre>
<p>Here's what each command does:</p>
<ol>
<li><p><code>mkdir transformer-gesture &amp;&amp; cd transformer-gesture</code>: This command creates a new directory named "transformer-gesture" and then navigates into it.</p>
</li>
<li><p><code>python -m venv .venv</code>: This command creates a new virtual environment in the current directory. The virtual environment is stored in a folder named ".venv".</p>
</li>
<li><p>Activating the virtual environment:</p>
<ul>
<li><p>For Windows PowerShell, you can use <code>.venv\Scripts\Activate.ps1</code> to activate the virtual environment.</p>
</li>
<li><p>For macOS/Linux, use <code>source .venv/bin/activate</code> to activate the virtual environment.</p>
</li>
</ul>
</li>
</ol>
<p>Activating a virtual environment ensures that the Python interpreter and any packages you install are isolated to this specific project, preventing conflicts with other projects or system-wide packages.</p>
<p>Create a <code>requirements.txt</code> file:</p>
<pre><code class="lang-plaintext">torch&gt;=2.0
torchvision
torchaudio
timm
huggingface_hub

onnx
onnxruntime

gradio

numpy
opencv-python
pillow

matplotlib
seaborn
scikit-learn
</code></pre>
<p>Here's a brief explanation of each package in <code>requirements.txt</code>:</p>
<ol>
<li><p><strong>torch&gt;=2.0</strong>: PyTorch is a popular open-source deep learning framework that provides a flexible and efficient platform for building and training neural networks. Version 2.0 and above includes improvements in performance and new features.</p>
</li>
<li><p><strong>torchvision</strong>: This library is part of the PyTorch ecosystem and provides tools for computer vision tasks, including datasets, model architectures, and image transformations.</p>
</li>
<li><p><strong>torchaudio</strong>: Also part of the PyTorch ecosystem, Torchaudio provides audio processing tools and datasets, making it easier to work with audio data in deep learning projects.</p>
</li>
<li><p><strong>timm</strong>: The PyTorch Image Models (timm) library offers a collection of pre-trained models and utilities for computer vision tasks, facilitating quick experimentation and deployment.</p>
</li>
<li><p><strong>huggingface_hub</strong>: This library allows easy access to models and datasets hosted on the Hugging Face Hub, a platform for sharing and collaborating on machine learning models and datasets.</p>
</li>
<li><p><strong>onnx</strong>: The Open Neural Network Exchange (ONNX) format is used to represent machine learning models, enabling interoperability between different frameworks.</p>
</li>
<li><p><strong>onnxruntime</strong>: This is a high-performance runtime for executing ONNX models, allowing for efficient deployment across various platforms.</p>
</li>
<li><p><strong>gradio</strong>: Gradio is a library for creating user interfaces for machine learning models, making them accessible through a web interface for easy interaction and testing.</p>
</li>
<li><p><strong>numpy</strong>: A fundamental package for numerical computing in Python, providing support for arrays and a wide range of mathematical functions.</p>
</li>
<li><p><strong>opencv-python</strong>: OpenCV is a library for computer vision and image processing tasks, widely used for real-time applications.</p>
</li>
<li><p><strong>pillow</strong>: A Python Imaging Library (PIL) fork, Pillow provides tools for opening, manipulating, and saving many different image file formats.</p>
</li>
<li><p><strong>matplotlib</strong>: A plotting library for Python, Matplotlib is used for creating static, interactive, and animated visualizations in Python.</p>
</li>
<li><p><strong>seaborn</strong>: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.</p>
</li>
<li><p><strong>scikit-learn</strong>: A machine learning library in Python that provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction.</p>
</li>
</ol>
<p>Install dependencies:</p>
<pre><code class="lang-bash">pip install -r requirements.txt
</code></pre>
<p>The command <code>pip install -r requirements.txt</code> tells <code>pip</code>, the Python package installer, to read <code>requirements.txt</code> and install every package listed there, so the project has all the dependencies it needs to run. This is the standard way to manage and share dependencies in Python projects.</p>
<h2 id="heading-generate-a-gesture-dataset">Generate a Gesture Dataset</h2>
<p>To train our Transformer-based gesture recognizer, we need some data. Instead of downloading a huge dataset, we’ll start with a tiny synthetic dataset you can generate in seconds. This makes the tutorial lightweight and ensures that everyone can follow along without dealing with multi-gigabyte downloads.</p>
<h2 id="heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</h2>
<p>We’ll use a small Python script that creates short <code>.mp4</code> clips of a moving (or still) coloured box. Each class represents a gesture:</p>
<ul>
<li><p><strong>swipe_left</strong> – box moves from right to left</p>
</li>
<li><p><strong>swipe_right</strong> – box moves from left to right</p>
</li>
<li><p><strong>stop</strong> – box stays still in the center</p>
</li>
</ul>
<p>Save this script as <code>generate_synthetic_gestures.py</code> in your project root:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, cv2, numpy <span class="hljs-keyword">as</span> np, random, argparse

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ensure_dir</span>(<span class="hljs-params">p</span>):</span> os.makedirs(p, exist_ok=<span class="hljs-literal">True</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">make_clip</span>(<span class="hljs-params">mode, out_path, seconds=<span class="hljs-number">1.5</span>, fps=<span class="hljs-number">16</span>, size=<span class="hljs-number">224</span>, box_size=<span class="hljs-number">60</span>, seed=<span class="hljs-number">0</span>, codec=<span class="hljs-string">"mp4v"</span></span>):</span>
    rng = random.Random(seed)
    frames = int(seconds * fps)
    H = W = size

    <span class="hljs-comment"># background + box color</span>
    bg_val = rng.randint(<span class="hljs-number">160</span>, <span class="hljs-number">220</span>)
    bg = np.full((H, W, <span class="hljs-number">3</span>), bg_val, dtype=np.uint8)
    color = (rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>))

    <span class="hljs-comment"># path of motion</span>
    y = rng.randint(<span class="hljs-number">40</span>, H - <span class="hljs-number">40</span> - box_size)
    <span class="hljs-keyword">if</span> mode == <span class="hljs-string">"swipe_left"</span>:
        x_start, x_end = W - <span class="hljs-number">20</span> - box_size, <span class="hljs-number">20</span>
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"swipe_right"</span>:
        x_start, x_end = <span class="hljs-number">20</span>, W - <span class="hljs-number">20</span> - box_size
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"stop"</span>:
        x_start = x_end = (W - box_size) // <span class="hljs-number">2</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"Unknown mode: <span class="hljs-subst">{mode}</span>"</span>)

    fourcc = cv2.VideoWriter_fourcc(*codec)
    vw = cv2.VideoWriter(out_path, fourcc, fps, (W, H))
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> vw.isOpened():
        <span class="hljs-keyword">raise</span> RuntimeError(
            <span class="hljs-string">f"Could not open VideoWriter with codec '<span class="hljs-subst">{codec}</span>'. "</span>
            <span class="hljs-string">"Try --codec XVID and use .avi extension, e.g. out.avi"</span>
        )

    <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> range(frames):
        alpha = t / max(<span class="hljs-number">1</span>, frames - <span class="hljs-number">1</span>)
        x = int((<span class="hljs-number">1</span> - alpha) * x_start + alpha * x_end)
        <span class="hljs-comment"># small jitter to avoid being too synthetic</span>
        jitter_x, jitter_y = rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>), rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>)
        frame = bg.copy()
        cv2.rectangle(frame, (x + jitter_x, y + jitter_y),
                      (x + jitter_x + box_size, y + jitter_y + box_size),
                      color, thickness=<span class="hljs-number">-1</span>)
        <span class="hljs-comment"># overlay text</span>
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>), <span class="hljs-number">2</span>, cv2.LINE_AA)
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">255</span>, <span class="hljs-number">255</span>, <span class="hljs-number">255</span>), <span class="hljs-number">1</span>, cv2.LINE_AA)
        vw.write(frame)

    vw.release()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_labels</span>(<span class="hljs-params">labels, out_dir</span>):</span>
    <span class="hljs-keyword">with</span> open(os.path.join(out_dir, <span class="hljs-string">"labels.txt"</span>), <span class="hljs-string">"w"</span>, encoding=<span class="hljs-string">"utf-8"</span>) <span class="hljs-keyword">as</span> f:
        <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> labels:
            f.write(c + <span class="hljs-string">"\n"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    ap = argparse.ArgumentParser(description=<span class="hljs-string">"Generate a tiny synthetic gesture dataset."</span>)
    ap.add_argument(<span class="hljs-string">"--out"</span>, default=<span class="hljs-string">"data"</span>, help=<span class="hljs-string">"Output directory (default: data)"</span>)
    ap.add_argument(<span class="hljs-string">"--classes"</span>, nargs=<span class="hljs-string">"+"</span>,
                    default=[<span class="hljs-string">"swipe_left"</span>, <span class="hljs-string">"swipe_right"</span>, <span class="hljs-string">"stop"</span>],
                    help=<span class="hljs-string">"Class names (default: swipe_left swipe_right stop)"</span>)
    ap.add_argument(<span class="hljs-string">"--clips"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Clips per class (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--seconds"</span>, type=float, default=<span class="hljs-number">1.5</span>, help=<span class="hljs-string">"Seconds per clip (default: 1.5)"</span>)
    ap.add_argument(<span class="hljs-string">"--fps"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Frames per second (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--size"</span>, type=int, default=<span class="hljs-number">224</span>, help=<span class="hljs-string">"Frame size WxH (default: 224)"</span>)
    ap.add_argument(<span class="hljs-string">"--box"</span>, type=int, default=<span class="hljs-number">60</span>, help=<span class="hljs-string">"Box size (default: 60)"</span>)
    ap.add_argument(<span class="hljs-string">"--codec"</span>, default=<span class="hljs-string">"mp4v"</span>, help=<span class="hljs-string">"Codec fourcc (mp4v or XVID)"</span>)
    ap.add_argument(<span class="hljs-string">"--ext"</span>, default=<span class="hljs-string">".mp4"</span>, help=<span class="hljs-string">"File extension (.mp4 or .avi)"</span>)
    args = ap.parse_args()

    ensure_dir(args.out)
    write_labels(args.classes, <span class="hljs-string">"."</span>)  <span class="hljs-comment"># writes labels.txt to project root</span>

    print(<span class="hljs-string">f"Generating synthetic dataset -&gt; <span class="hljs-subst">{args.out}</span>"</span>)
    <span class="hljs-keyword">for</span> cls <span class="hljs-keyword">in</span> args.classes:
        cls_dir = os.path.join(args.out, cls)
        ensure_dir(cls_dir)
        <span class="hljs-keyword">if</span> <span class="hljs-string">"left"</span> <span class="hljs-keyword">in</span> cls:
            mode = <span class="hljs-string">"swipe_left"</span>
        <span class="hljs-keyword">elif</span> <span class="hljs-string">"right"</span> <span class="hljs-keyword">in</span> cls:
            mode = <span class="hljs-string">"swipe_right"</span>
        <span class="hljs-keyword">else</span>:
            mode = <span class="hljs-string">"stop"</span>
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(args.clips):
            filename = os.path.join(cls_dir, <span class="hljs-string">f"<span class="hljs-subst">{cls}</span>_<span class="hljs-subst">{i+<span class="hljs-number">1</span>:<span class="hljs-number">03</span>d}</span><span class="hljs-subst">{args.ext}</span>"</span>)
            make_clip(
                mode=mode,
                out_path=filename,
                seconds=args.seconds,
                fps=args.fps,
                size=args.size,
                box_size=args.box,
                seed=i + <span class="hljs-number">1</span>,
                codec=args.codec
            )
        print(<span class="hljs-string">f"  <span class="hljs-subst">{cls}</span>: <span class="hljs-subst">{args.clips}</span> clips"</span>)

    print(<span class="hljs-string">"Done. You can now run: python train.py, python export_onnx.py, python app.py"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<p>The script generates a synthetic gesture dataset by creating video clips of a moving or stationary coloured box, simulating gestures like "swipe left," "swipe right," and "stop," and saves them in a specified output directory.</p>
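<p>The box position in each frame comes from plain linear interpolation between the start and end x-coordinates. Isolated from the script, the formula looks like this (the example values below correspond to a "swipe_left" path with the script's default 224-px frame and 60-px box):</p>

```python
# Linear interpolation of the box's x-position, as used in make_clip():
# alpha sweeps 0 -> 1 over the clip, moving the box from x_start to x_end.
def lerp(x_start, x_end, t, frames):
    alpha = t / max(1, frames - 1)
    return int((1 - alpha) * x_start + alpha * x_end)

# "swipe_left" over 4 frames: x_start = 224 - 20 - 60 = 144, x_end = 20
positions = [lerp(144, 20, t, 4) for t in range(4)]
print(positions)  # [144, 102, 61, 20]
```

<p>With <code>x_start == x_end</code> (the "stop" class), the same formula simply keeps the box in place, so one code path covers all three gestures.</p>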
<p>Now run it inside your virtual environment:</p>
<pre><code class="lang-bash">python generate_synthetic_gestures.py --out data --clips 16 --seconds 1.5
</code></pre>
<p>The command above runs a Python script named <code>generate_synthetic_gestures.py</code>, which generates a synthetic gesture dataset with 16 clips per gesture, each lasting 1.5 seconds, and saves the output in a directory named "data".</p>
<p>This creates a dataset like:</p>
<pre><code class="lang-plaintext">data/
  swipe_left/*.mp4
  swipe_right/*.mp4
  stop/*.mp4
labels.txt
</code></pre>
<p>Each folder contains short clips of a moving (or still) box that simulate gestures. This is perfect for testing the pipeline.</p>
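<p>Before training, it can help to sanity-check that the layout matches what you expect. Here's a small, hypothetical helper (not part of the article's scripts) that counts clips per class folder using only the standard library:</p>

```python
# Hypothetical sanity check: count video clips in each class subfolder.
from pathlib import Path

def count_clips(root="data", exts=(".mp4", ".avi")):
    """Return {class_name: clip_count} for each subfolder of `root`."""
    root = Path(root)
    return {
        d.name: sum(1 for f in d.iterdir() if f.suffix in exts)
        for d in sorted(root.iterdir()) if d.is_dir()
    }

# e.g. after running the generator with --clips 16:
# count_clips("data")  # -> {"stop": 16, "swipe_left": 16, "swipe_right": 16}
```

<p>If any class reports fewer clips than you requested, re-run the generator (or check the <code>--codec</code>/<code>--ext</code> combination) before moving on to training.</p>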
<h3 id="heading-training-script-trainpy">Training Script: <code>train.py</code></h3>
<p>Now that we have our dataset, let’s fine-tune a Vision Transformer with temporal pooling. This model applies ViT frame-by-frame, averages embeddings across time, and trains a classification head on your gestures.</p>
<p>Here’s the full training script:</p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
<span class="hljs-keyword">import</span> torch, torch.nn <span class="hljs-keyword">as</span> nn, torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader
<span class="hljs-keyword">import</span> timm
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ViTTemporal</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-string">"""Frame-wise ViT encoder -&gt; mean pool over time -&gt; linear head."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, num_classes, vit_name=<span class="hljs-string">"vit_tiny_patch16_224"</span></span>):</span>
        super().__init__()
        self.vit = timm.create_model(vit_name, pretrained=<span class="hljs-literal">True</span>, num_classes=<span class="hljs-number">0</span>, global_pool=<span class="hljs-string">"avg"</span>)
        feat_dim = self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>  <span class="hljs-comment"># x: (B,T,C,H,W)</span>
        B, T, C, H, W = x.shape
        x = x.view(B * T, C, H, W)
        feats = self.vit(x)                  <span class="hljs-comment"># (B*T, D)</span>
        feats = feats.view(B, T, <span class="hljs-number">-1</span>).mean(dim=<span class="hljs-number">1</span>)  <span class="hljs-comment"># (B, D)</span>
        <span class="hljs-keyword">return</span> self.head(feats)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>():</span>
    device = <span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>
    labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
    n_classes = len(labels)

    train_ds = GestureClips(train=<span class="hljs-literal">True</span>)
    val_ds   = GestureClips(train=<span class="hljs-literal">False</span>)
    print(<span class="hljs-string">f"Train clips: <span class="hljs-subst">{len(train_ds)}</span> | Val clips: <span class="hljs-subst">{len(val_ds)}</span>"</span>)

    <span class="hljs-comment"># Windows/CPU friendly</span>
    train_dl = DataLoader(train_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">True</span>,  num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)
    val_dl   = DataLoader(val_ds,   batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>, num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)

    model = ViTTemporal(num_classes=n_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=<span class="hljs-number">3e-4</span>, weight_decay=<span class="hljs-number">0.05</span>)

    best_acc = <span class="hljs-number">0.0</span>
    epochs = <span class="hljs-number">5</span>
    <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, epochs + <span class="hljs-number">1</span>):
        <span class="hljs-comment"># ---- Train ----</span>
        model.train()
        total, correct, loss_sum = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0.0</span>
        <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> train_dl:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()

            loss_sum += loss.item() * x.size(<span class="hljs-number">0</span>)
            correct += (logits.argmax(<span class="hljs-number">1</span>) == y).sum().item()
            total += x.size(<span class="hljs-number">0</span>)

        train_acc = correct / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>
        train_loss = loss_sum / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        <span class="hljs-comment"># ---- Validate ----</span>
        model.eval()
        vtotal, vcorrect = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
        <span class="hljs-keyword">with</span> torch.no_grad():
            <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
                x, y = x.to(device), y.to(device)
                vcorrect += (model(x).argmax(<span class="hljs-number">1</span>) == y).sum().item()
                vtotal += x.size(<span class="hljs-number">0</span>)
        val_acc = vcorrect / vtotal <span class="hljs-keyword">if</span> vtotal <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch:<span class="hljs-number">02</span>d}</span> | train_loss <span class="hljs-subst">{train_loss:<span class="hljs-number">.4</span>f}</span> "</span>
              <span class="hljs-string">f"| train_acc <span class="hljs-subst">{train_acc:<span class="hljs-number">.3</span>f}</span> | val_acc <span class="hljs-subst">{val_acc:<span class="hljs-number">.3</span>f}</span>"</span>)

        <span class="hljs-keyword">if</span> val_acc &gt; best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), <span class="hljs-string">"vit_temporal_best.pt"</span>)

    print(<span class="hljs-string">"Best val acc:"</span>, best_acc)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    train()
</code></pre>
<p>Running the command <code>python train.py</code> initiates the training process for your gesture recognition model. Here's a breakdown of what happens:</p>
<ol>
<li><p><strong>Load your dataset from data/</strong>: The script will access and load the gesture dataset stored in the "data" directory. This dataset is used to train the model.</p>
</li>
<li><p><strong>Fine-tune a pre-trained Vision Transformer</strong>: The training script will take a Vision Transformer model that has been pre-trained on a larger dataset and fine-tune it using your specific gesture dataset. Fine-tuning helps the model adapt to the nuances of your data, improving its performance on the specific task of gesture recognition.</p>
</li>
<li><p><strong>Save the best checkpoint as vit_temporal_best.pt</strong>: During training, the script will evaluate the model's performance on a validation set after each epoch. The best-performing version of the model (by validation accuracy) will be saved as a checkpoint file named "vit_temporal_best.pt". This file can later be used for inference or further training.</p>
</li>
</ol>
<h4 id="heading-what-training-looks-like">What Training Looks Like</h4>
<p>You should see logs similar to this:</p>
<pre><code class="lang-plaintext">Train clips: 38 | Val clips: 10
Epoch 01 | train_loss 1.4508 | train_acc 0.395 | val_acc 0.200
Epoch 02 | train_loss 1.2466 | train_acc 0.263 | val_acc 0.200
Epoch 03 | train_loss 1.1361 | train_acc 0.368 | val_acc 0.200
Best val acc: 0.200
</code></pre>
<p>Don’t worry if your accuracy is low at first; with a tiny synthetic dataset, that’s normal. The key is proving that the Transformer pipeline works end to end. You can boost results later by:</p>
<ul>
<li><p>Adding more clips per class</p>
</li>
<li><p>Training for more epochs</p>
</li>
<li><p>Switching to real recorded gestures</p>
</li>
</ul>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/training-logs.png?raw=true" alt="Training logs" width="600" height="400" loading="lazy"></p>
<p>Figure 1. Example training logs from <code>train.py</code>, where the Vision Transformer with temporal pooling is fine-tuned on a tiny synthetic dataset.</p>
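<p>The first suggestion above doesn’t have to mean recording new footage: simple augmentation squeezes more training signal out of the clips you already have. Here’s a minimal NumPy sketch (the <code>augment_clip</code> helper is mine, not part of the repo; you would call it inside the dataset’s <code>__getitem__</code> before normalization):</p>

```python
import numpy as np

def augment_clip(clip, rng=None):
    """Randomly flip a clip horizontally and jitter its brightness.

    clip: float32 array of shape (T, 3, H, W), values in [0, 1].
    """
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < 0.5:
        clip = clip[..., ::-1]          # flip along the width axis
    scale = rng.uniform(0.8, 1.2)       # brightness jitter
    return np.clip(clip * scale, 0.0, 1.0).astype(np.float32)

clip = np.random.rand(16, 3, 224, 224).astype(np.float32)
aug = augment_clip(clip)
print(aug.shape)  # (16, 3, 224, 224)
```

<p>Because the same transform is applied to every frame in the clip, the temporal structure of the gesture is preserved.</p>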
<h3 id="heading-export-the-model-to-onnx">Export the Model to ONNX</h3>
<p>To make our model easier to run in real time (and lighter on CPU), we’ll export it to the ONNX format.</p>
<p><strong>Note:</strong> ONNX, which stands for Open Neural Network Exchange, is an open-source format designed to facilitate the interchange of deep learning models between different frameworks. It lets you train a model in one framework, such as PyTorch or TensorFlow, and then deploy it in another, like Caffe2 or MXNet, without needing to completely rewrite the model. This interoperability is achieved by providing a standardized representation of the model's architecture and parameters.</p>
<p>ONNX supports a wide range of operators and is continually updated to include new features, making it a versatile choice for deploying machine learning models across various platforms and devices.</p>
<p>Create a file called <code>export_onnx.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

<span class="hljs-comment"># Dummy input: batch=1, 16 frames, 3x224x224</span>
dummy = torch.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>)

<span class="hljs-comment"># Export</span>
torch.onnx.export(
    model, dummy, <span class="hljs-string">"vit_temporal.onnx"</span>,
    input_names=[<span class="hljs-string">"video"</span>], output_names=[<span class="hljs-string">"logits"</span>],
    dynamic_axes={<span class="hljs-string">"video"</span>: {<span class="hljs-number">0</span>: <span class="hljs-string">"batch"</span>}},
    opset_version=<span class="hljs-number">13</span>
)

print(<span class="hljs-string">"Exported vit_temporal.onnx"</span>)
</code></pre>
<p>Run it with <code>python export_onnx.py</code>.</p>
<p>This generates a file <code>vit_temporal.onnx</code> in your project folder. The export lets us run inference with <code>onnxruntime</code>, which is typically much faster than eager PyTorch on CPU.</p>
<p>Create a file called <code>app.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, tempfile, cv2, torch, onnxruntime, numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

T = <span class="hljs-number">16</span>
SIZE = <span class="hljs-number">224</span>
MODEL_PATH = <span class="hljs-string">"vit_temporal.onnx"</span>

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

<span class="hljs-comment"># --- ONNX session + auto-detect names ---</span>
ort_session = onnxruntime.InferenceSession(MODEL_PATH, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
<span class="hljs-comment"># detect first input and first output names to avoid mismatches</span>
INPUT_NAME = ort_session.get_inputs()[<span class="hljs-number">0</span>].name   <span class="hljs-comment"># e.g. "input" or "video"</span>
OUTPUT_NAME = ort_session.get_outputs()[<span class="hljs-number">0</span>].name <span class="hljs-comment"># e.g. "logits" or something else</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_clip</span>(<span class="hljs-params">frames_rgb</span>):</span>
    <span class="hljs-keyword">if</span> len(frames_rgb) == <span class="hljs-number">0</span>:
        frames_rgb = [np.zeros((SIZE, SIZE, <span class="hljs-number">3</span>), dtype=np.uint8)]
    <span class="hljs-keyword">if</span> len(frames_rgb) &lt; T:
        frames_rgb = frames_rgb + [frames_rgb[<span class="hljs-number">-1</span>]] * (T - len(frames_rgb))
    frames_rgb = frames_rgb[:T]
    clip = [cv2.resize(f, (SIZE, SIZE), interpolation=cv2.INTER_AREA) <span class="hljs-keyword">for</span> f <span class="hljs-keyword">in</span> frames_rgb]
    clip = np.stack(clip, axis=<span class="hljs-number">0</span>)                                    <span class="hljs-comment"># (T,H,W,3)</span>
    clip = np.transpose(clip, (<span class="hljs-number">0</span>, <span class="hljs-number">3</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>)).astype(np.float32) / <span class="hljs-number">255</span> <span class="hljs-comment"># (T,3,H,W)</span>
    clip = (clip - <span class="hljs-number">0.5</span>) / <span class="hljs-number">0.5</span>
    clip = np.expand_dims(clip, <span class="hljs-number">0</span>)                                   <span class="hljs-comment"># (1,T,3,H,W)</span>
    <span class="hljs-keyword">return</span> clip

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_extract_path_from_gradio_video</span>(<span class="hljs-params">inp</span>):</span>
    <span class="hljs-keyword">if</span> isinstance(inp, str) <span class="hljs-keyword">and</span> os.path.exists(inp):
        <span class="hljs-keyword">return</span> inp
    <span class="hljs-keyword">if</span> isinstance(inp, dict):
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"video"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"path"</span>, <span class="hljs-string">"filepath"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, str) <span class="hljs-keyword">and</span> os.path.exists(v):
                <span class="hljs-keyword">return</span> v
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"data"</span>, <span class="hljs-string">"video"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, (bytes, bytearray)):
                tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>)
                tmp.write(v); tmp.flush(); tmp.close()
                <span class="hljs-keyword">return</span> tmp.name
    <span class="hljs-keyword">if</span> isinstance(inp, (list, tuple)) <span class="hljs-keyword">and</span> inp <span class="hljs-keyword">and</span> isinstance(inp[<span class="hljs-number">0</span>], str) <span class="hljs-keyword">and</span> os.path.exists(inp[<span class="hljs-number">0</span>]):
        <span class="hljs-keyword">return</span> inp[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_read_uniform_frames</span>(<span class="hljs-params">video_path</span>):</span>
    cap = cv2.VideoCapture(video_path)
    frames = []
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) <span class="hljs-keyword">or</span> <span class="hljs-number">1</span>
    idxs = np.linspace(<span class="hljs-number">0</span>, total - <span class="hljs-number">1</span>, max(T, <span class="hljs-number">1</span>)).astype(int)
    want = set(int(i) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> idxs.tolist())
    j = <span class="hljs-number">0</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        ok, bgr = cap.read()
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
        <span class="hljs-keyword">if</span> j <span class="hljs-keyword">in</span> want:
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(rgb)
        j += <span class="hljs-number">1</span>
    cap.release()
    <span class="hljs-keyword">return</span> frames

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_video</span>(<span class="hljs-params">gradio_video</span>):</span>
    video_path = _extract_path_from_gradio_video(gradio_video)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> video_path <span class="hljs-keyword">or</span> <span class="hljs-keyword">not</span> os.path.exists(video_path):
        <span class="hljs-keyword">return</span> {}
    frames = _read_uniform_frames(video_path)

    <span class="hljs-comment"># If OpenCV choked on the codec (common with recorded webm), re-encode once:</span>
    <span class="hljs-keyword">if</span> len(frames) == <span class="hljs-number">0</span>:
        tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>); tmp_name = tmp.name; tmp.close()
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*<span class="hljs-string">"mp4v"</span>)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) <span class="hljs-keyword">or</span> <span class="hljs-number">640</span>
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) <span class="hljs-keyword">or</span> <span class="hljs-number">480</span>
        out = cv2.VideoWriter(tmp_name, fourcc, <span class="hljs-number">20.0</span>, (w, h))
        <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
            ok, frame = cap.read()
            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
            out.write(frame)
        cap.release(); out.release()
        frames = _read_uniform_frames(tmp_name)

    clip = preprocess_clip(frames)
    <span class="hljs-comment"># &gt;&gt;&gt; use the detected ONNX input/output names &lt;&lt;&lt;</span>
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]  <span class="hljs-comment"># (1, C)</span>
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_image</span>(<span class="hljs-params">image</span>):</span>
    <span class="hljs-keyword">if</span> image <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">return</span> {}
    clip = preprocess_clip([image] * T)
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-keyword">with</span> gr.Blocks() <span class="hljs-keyword">as</span> demo:
    gr.Markdown(<span class="hljs-string">"# Gesture Classifier (ONNX)\nRecord or upload a short video, then click **Classify Video**."</span>)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Video (record or upload)"</span>):
        vid_in = gr.Video(label=<span class="hljs-string">"Record from webcam or upload a short clip"</span>)
        vid_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Video"</span>).click(fn=predict_from_video, inputs=vid_in, outputs=vid_out)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Single Image (fallback)"</span>):
        img_in = gr.Image(label=<span class="hljs-string">"Upload an image frame"</span>, type=<span class="hljs-string">"numpy"</span>)
        img_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Image"</span>).click(fn=predict_from_image, inputs=img_in, outputs=img_out)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    demo.launch()
</code></pre>
<p>Running the command <code>python app.py</code> launches a Gradio application in your web browser. Here's what happens:</p>
<ol>
<li><p><strong>Record or upload a clip</strong>: In the Video tab, you can record a short gesture clip with your webcam or upload an existing video file.</p>
</li>
<li><p><strong>Classify on demand</strong>: When you click <strong>Classify Video</strong>, the app samples 16 frames from the clip, preprocesses them, and runs a single ONNX inference per clip (it does not stream predictions frame by frame).</p>
</li>
<li><p><strong>Top 3 gesture classes displayed with probabilities</strong>: The application displays the top three predicted gesture classes along with their probabilities, giving you an idea of the model's confidence in its predictions.</p>
</li>
</ol>
<p>When you open the app in your browser, you'll find two tabs. In the <strong>Video tab</strong>, you can click <em>Record from webcam</em> to capture a short clip of your gesture, typically lasting 2–4 seconds. After recording, click <strong>Classify Video</strong>. The model will then process the captured frames using the Transformer model and display the predicted gesture probabilities. This setup allows for interactive testing and demonstration of the gesture recognition system.</p>
<p>Here’s an example where I raised my hand for a <strong>stop</strong> gesture, and the model predicts “stop” as the top class:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/realtime-demo.png?raw=true" alt="Gradio demo output" width="600" height="400" loading="lazy"></p>
<p>Figure 2. The Gradio app running locally. After recording a short clip, the Transformer model predicts the gesture with class probabilities.</p>
<h3 id="heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</h3>
<p>Now that the model runs in a demo app, let’s check how well it performs. There are two sides to this:</p>
<ul>
<li><p><strong>Accuracy</strong>: does the model predict the right gesture class?</p>
</li>
<li><p><strong>Latency</strong>: how fast does it respond, especially on CPU vs GPU?</p>
</li>
</ul>
<h4 id="heading-1-quick-accuracy-check">1. Quick Accuracy Check</h4>
<p>Save this as <code>eval.py</code> in the same folder as your other scripts:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load validation data</span>
val_ds = GestureClips(train=<span class="hljs-literal">False</span>)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

correct, total = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
all_preds, all_labels = [], []

<span class="hljs-keyword">with</span> torch.no_grad():
    <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
        logits = model(x)
        preds = logits.argmax(dim=<span class="hljs-number">1</span>)
        correct += (preds == y).sum().item()
        total += y.size(<span class="hljs-number">0</span>)
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

print(<span class="hljs-string">f"Validation accuracy: <span class="hljs-subst">{correct/total:<span class="hljs-number">.2</span>%}</span>"</span>)
</code></pre>
<h4 id="heading-2-confusion-matrix">2. Confusion Matrix</h4>
<p>Let’s also visualize which gestures are confused. Add this snippet at the bottom of <code>eval.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix

cm = confusion_matrix(all_labels, all_preds)

plt.figure(figsize=(<span class="hljs-number">6</span>,<span class="hljs-number">6</span>))
sns.heatmap(cm, annot=<span class="hljs-literal">True</span>, fmt=<span class="hljs-string">"d"</span>, xticklabels=labels, yticklabels=labels, cmap=<span class="hljs-string">"Blues"</span>)
plt.xlabel(<span class="hljs-string">"Predicted"</span>)
plt.ylabel(<span class="hljs-string">"True"</span>)
plt.title(<span class="hljs-string">"Confusion Matrix"</span>)
plt.tight_layout()
plt.show()
</code></pre>
<p>When you run <code>python eval.py</code>, a heatmap like this will pop up:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/confusion-matrix.png?raw=true" alt="Confusion matrix" width="600" height="400" loading="lazy"></p>
<p>Figure 3. Confusion matrix on the validation set. Correct predictions appear along the diagonal. Off-diagonal counts show gesture confusions.</p>
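<p>Per-class accuracy falls straight out of the confusion matrix: divide the diagonal by the row sums. A small helper (my code, not part of the repo) makes this explicit:</p>

```python
import numpy as np

def per_class_accuracy(cm):
    """cm[i, j] = count of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1)
    # Guard against division by zero for classes absent from the val set
    return np.divide(np.diag(cm), row_sums,
                     out=np.zeros(len(cm)), where=row_sums > 0)

# Example: 3 classes, 5 validation clips each
cm = [[4, 1, 0],
      [0, 5, 0],
      [1, 1, 3]]
print(per_class_accuracy(cm))  # per-class accuracy: [0.8, 1.0, 0.6]
```

<p>Uneven per-class scores are a hint to record more clips for the weaker classes rather than simply training longer.</p>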
<h4 id="heading-3-latency-benchmark">3. Latency Benchmark</h4>
<p>Finally, let’s see how fast inference runs. Save the following as <code>benchmark.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time, numpy <span class="hljs-keyword">as</span> np, onnxruntime
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

ort = onnxruntime.InferenceSession(<span class="hljs-string">"vit_temporal.onnx"</span>, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
INPUT_NAME = ort.get_inputs()[<span class="hljs-number">0</span>].name
OUTPUT_NAME = ort.get_outputs()[<span class="hljs-number">0</span>].name

dummy = np.random.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>).astype(np.float32)

<span class="hljs-comment"># Warmup</span>
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">3</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})

<span class="hljs-comment"># Benchmark</span>
t0 = time.time()
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">50</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
t1 = time.time()

print(<span class="hljs-string">f"Average latency: <span class="hljs-subst">{(t1 - t0)/<span class="hljs-number">50</span>:<span class="hljs-number">.3</span>f}</span> seconds per clip"</span>)
</code></pre>
<p>Run: <code>python benchmark.py</code></p>
<p>On CPU, you might see ~0.05–0.15s per clip; on GPU it’s much faster.</p>
<p><strong>Note</strong>: If latency is high, you can enable <strong>quantization</strong> in ONNX to shrink the model and speed up inference.</p>
<h2 id="heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</h2>
<p>If you’d prefer to see your model trained on <em>real</em> gesture clips instead of synthetic moving boxes, you can grab a handful of videos from open datasets. You don’t need to download an entire dataset (which can be several GB); just a few <code>.mp4</code> samples are enough to follow along.</p>
<h3 id="heading-recommended-sources">Recommended sources</h3>
<ul>
<li><p><strong>20BN Jester Dataset</strong>: Contains short clips of hand gestures like swiping, clapping, and pointing.</p>
</li>
<li><p><strong>WLASL</strong>: A large-scale dataset of isolated sign language words.</p>
</li>
</ul>
<p>Both projects provide small <code>.mp4</code> videos you can use as realistic training examples. I’ve linked them below.</p>
<h3 id="heading-setting-up-your-dataset-folder">Setting up your dataset folder</h3>
<p>Once you download a few clips, place them in the <code>data/</code> folder under subfolders named after each gesture class. For example:</p>
<pre><code class="lang-plaintext">data/
├── swipe_left/
│   ├── clip1.mp4
│   └── clip2.mp4
├── swipe_right/
│   ├── clip1.mp4
│   └── clip2.mp4
└── stop/
    ├── clip1.mp4
    └── clip2.mp4
</code></pre>
<p>And update <code>labels.txt</code> to match the folder names:</p>
<pre><code class="lang-plaintext">swipe_left
swipe_right
stop
</code></pre>
<p>Now your dataset is ready, and the same training scripts from earlier (<code>train.py</code>, <code>eval.py</code>) will work without modification.</p>
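<p>Before kicking off training, a quick count of clips per class helps catch empty or misnamed folders. A small helper (<code>count_clips</code> is my name, not part of the repo):</p>

```python
import os

def count_clips(data_dir="data", exts=(".mp4", ".avi")):
    """Return {class_name: number_of_clips} for each subfolder of data_dir."""
    counts = {}
    for name in sorted(os.listdir(data_dir)):
        folder = os.path.join(data_dir, name)
        if os.path.isdir(folder):
            counts[name] = sum(f.lower().endswith(exts)
                               for f in os.listdir(folder))
    return counts

if __name__ == "__main__":
    if os.path.isdir("data"):
        for cls, n in count_clips().items():
            print(f"{cls}: {n} clips")
```

<p>The class names printed here should match <code>labels.txt</code> line for line.</p>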
<h3 id="heading-why-choose-this-option">Why choose this option?</h3>
<ul>
<li><p>Gives more realistic results than synthetic coloured boxes</p>
</li>
<li><p>Lets you see how the model handles <em>actual human hand movements</em></p>
</li>
<li><p>The only trade-off is a bit more effort (downloading clips and trimming them if needed)</p>
</li>
</ul>
<p><strong>Tip:</strong> If downloading from these datasets feels too heavy, you can also record your own short gestures using your laptop webcam. Just save them as <code>.mp4</code> files and organize them in the same folder structure.</p>
<h2 id="heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</h2>
<p>While this project shows the technical workflow for gesture recognition with Transformers, it’s important to step back and consider the <strong>human context</strong>:</p>
<ul>
<li><p><strong>Accessibility first</strong>: Tools like this can help students with speech or motor difficulties, but they should always be co-designed with the people who will use them. Don’t assume one-size-fits-all.</p>
</li>
<li><p><strong>Dataset sensitivity</strong>: Using publicly available sign or gesture datasets is fine for prototyping, but deploying such a system requires careful consideration of consent and representation.</p>
</li>
<li><p><strong>Error tolerance</strong>: Even small misclassifications can have big consequences in accessibility contexts (for example, confusing <em>stop</em> with <em>go</em>). Always plan for fallback options (like manual input or confirmation).</p>
</li>
<li><p><strong>Bias and inclusivity</strong>: Models trained on narrow datasets may fail for different skin tones, lighting conditions, or cultural gesture variations. Broad and diverse training data is essential for fairness.</p>
</li>
</ul>
<p>In other words: this demo is a <strong>teaching scaffold</strong>, not a production-ready accessibility tool. Responsible deployment requires collaboration with educators, therapists, and end users.</p>
<h2 id="heading-next-steps">Next Steps</h2>
<p>If you’d like to push this project further, here are some directions to explore:</p>
<ul>
<li><p><strong>Better models</strong>: Try video-focused Transformers like <a target="_blank" href="https://arxiv.org/abs/2102.05095">TimeSformer</a> or <a target="_blank" href="https://arxiv.org/abs/2203.12602">VideoMAE</a> for stronger temporal reasoning.</p>
</li>
<li><p><strong>Larger vocabularies</strong>: Add more gesture classes, build your own dataset, or use portions of public datasets like <a target="_blank" href="https://www.kaggle.com/datasets/toxicmender/20bn-jester">20BN Jester</a> or <a target="_blank" href="https://www.kaggle.com/datasets/risangbaskoro/wlasl-processed">WLASL.</a></p>
</li>
<li><p><strong>Pose fusion</strong>: Combine gesture video with human pose keypoints from <a target="_blank" href="https://mediapipe.readthedocs.io/en/latest/solutions/hands.html">MediaPipe</a> or <a target="_blank" href="https://github.com/CMU-Perceptual-Computing-Lab/openpose">OpenPose</a> for more robust predictions.</p>
</li>
<li><p><strong>Real-time smoothing</strong>: Implement temporal smoothing or debounce logic in the app so predictions are more stable during live use.</p>
</li>
<li><p><strong>Quantization + edge devices</strong>: Convert your ONNX model to an INT8 quantized version and deploy it on a Raspberry Pi or Jetson Nano for classroom-ready prototypes.</p>
</li>
</ul>
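<p>For the real-time smoothing idea above, a majority vote over a short window of recent predictions is often enough to stop the displayed label from flickering. A sketch (the window size <code>k</code> is a tuning knob, and the class is my own, not from the repo):</p>

```python
from collections import Counter, deque

class PredictionSmoother:
    """Majority vote over the last k predictions to suppress flicker."""
    def __init__(self, k=5):
        self.buffer = deque(maxlen=k)

    def update(self, label):
        self.buffer.append(label)
        return Counter(self.buffer).most_common(1)[0][0]

smoother = PredictionSmoother(k=5)
stream = ["stop", "stop", "go", "stop", "stop", "go", "stop"]
print([smoother.update(p) for p in stream])
# all "stop": the isolated "go" frames are voted out
```

<p>In the app, you would call <code>update()</code> with each new top-1 prediction and display its return value instead of the raw label.</p>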
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you learned how to create a gesture recognition system using Transformer models, demonstrating the potential of cutting-edge machine learning techniques. By preparing a small dataset, training a Vision Transformer with temporal pooling, exporting the model to ONNX for efficient inference, and deploying a real-time Gradio app, you showcased a practical application of these technologies. The evaluation of accuracy and latency further highlighted the system's effectiveness and responsiveness.</p>
<p>This project illustrates how you can leverage advanced ML methods to enhance accessibility and communication, paving the way for more inclusive learning environments.</p>
<p>Remember: while this demo works with small datasets, real-world applications need larger, more diverse data and careful consideration of accessibility, inclusivity, and ethics.</p>
<p>Here’s the GitHub repo for full source code: <a target="_blank" href="https://github.com/tayo4christ/transformer-gesture">transformer-gesture</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Machine Learning vs Deep Learning vs Generative AI - What are the Differences? ]]>
                </title>
                <description>
                    <![CDATA[ When I started using LLMs for work and personal use, I picked up on some technical terms, such as "machine learning" and "deep learning," which are the main technologies behind these LLMs. I've always been interested in learning about the differences... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/machine-learning-vs-deep-learning-vs-generative-ai/</link>
                <guid isPermaLink="false">68de98a534a379d15102109e</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ generative ai ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nitheesh Poojary ]]>
                </dc:creator>
                <pubDate>Thu, 02 Oct 2025 15:22:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759006391065/3cd87534-e2e9-49df-a9c7-1b636e491032.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When I started using LLMs for work and personal use, I picked up on some technical terms, such as "machine learning" and "deep learning," which are the main technologies behind these LLMs. I've always been interested in learning about the differences between these technologies. Most companies in the industry are now developing their own AI tools, which makes MLOps necessary for managing and utilizing them.</p>
<p>Before I began learning about MLOps, I tried to understand the technologies behind LLMs and how they work. In this article, I’ll share my understanding of machine learning, deep learning, and generative AI, along with their potential applications.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-artificial-intelligence-ai">Artificial Intelligence (AI)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-machine-learning-ml-the-foundation">Machine Learning (ML): The Foundation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-deep-learning-adding-complexity">Deep Learning: Adding Complexity</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-generative-ai-write-new">Generative AI: Creating New Content</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-summary-of-differences-between-machine-learning-vs-deep-learning-vs-generative-ai">Summary of Differences Between Machine Learning vs Deep Learning vs Generative AI</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759006565108/9698f88c-7d81-40b6-b902-c3d75b054728.jpeg" alt="how AI works" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-artificial-intelligence-ai">Artificial Intelligence (AI)</h2>
<p>Artificial Intelligence (AI) is technology that lets machines solve problems in ways similar to how people do. It helps businesses make better decisions at scale by recognizing images, creating content, and making predictions from data. Artificial intelligence encompasses machine learning, deep learning, and generative AI.</p>
<h2 id="heading-machine-learning-ml-the-foundation">Machine Learning (ML): The Foundation</h2>
<p>When we give computers many examples, they learn how to make their own decisions or guesses. It's like teaching a kid to tell the difference between animals. You show them a lot of pictures of cats and dogs and say things like "This is a cat" and "This is a dog." In the end, they learn to tell the difference between cats and dogs on their own. Machine learning is similar in that you give a computer a lot of data with examples, and it learns how to make predictions about new data.</p>
<h3 id="heading-how-does-machine-learning-work">How Does Machine Learning Work?</h3>
<p>Machine Learning (ML) is the process of teaching computers to find patterns in data and make decisions or predictions without being explicitly programmed. There are usually six main steps in this process:</p>
<p><strong>Data Collection:</strong> Get many examples, like thousands of emails, photos, or sales records. The more training data you have, the more accurate your predictions will be.</p>
<p><strong>Data Preparation</strong>: At this stage, you clean the data by getting rid of mistakes and adding missing labels.</p>
<p><strong>Selecting an Algorithm (Model):</strong> This is like choosing the right tool for the job. Models find patterns in data or make predictions. You can browse common machine learning algorithms <a target="_blank" href="https://www.ibm.com/think/topics/machine-learning-algorithms">here</a>.</p>
<p><strong>Training Phase:</strong> After you pick the right model for your cleaned-up data, you teach it. This is like getting ready for a test.</p>
<p><strong>Evaluation</strong>: Use the test data to assess the model's performance and see if it can make accurate predictions on unseen data.</p>
<p><strong>Deployment</strong>: Put the trained model to work in the real world.</p>
<p>For example, consider predicting house prices:</p>
<p><strong>Training</strong>: Show the computer 10,000 past house sales, with details like size (2,000 sq ft), number of bedrooms (3), location (downtown), and the sale price ($300,000).</p>
<p><strong>Learning</strong>: The algorithm finds patterns: bigger houses cost more, downtown locations cost more, and more bedrooms make a house worth more.</p>
<p><strong>Prediction</strong>: Given a new house with 1,800 square feet, two bedrooms, and a suburban location, it estimates a price based on what it has learned.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759006771594/12afae06-9d72-4d65-af81-c10fda1e2099.png" alt="how machine learning works" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
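<p>The house-price example above can be sketched in a few lines with scikit-learn. All of the numbers and feature names below are made up purely for illustration:</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [size_sqft, bedrooms, is_downtown]
X_train = np.array([
    [2000, 3, 1],
    [1500, 2, 0],
    [2500, 4, 1],
    [1200, 2, 0],
])
y_train = np.array([300_000, 180_000, 390_000, 150_000])  # sale prices

# "Training": fit the model to the labeled examples
model = LinearRegression().fit(X_train, y_train)

# "Prediction": estimate the price of an unseen 1,800 sq ft,
# two-bedroom suburban house
new_house = np.array([[1800, 2, 0]])
predicted_price = model.predict(new_house)[0]
print(f"Predicted price: ${predicted_price:,.0f}")
```

<p>A real system would use far more examples and features, but the shape of the workflow (collect, train, predict) is the same.</p>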
<h3 id="heading-types-of-machine-learning">Types of Machine Learning</h3>
<ol>
<li><p><strong>Supervised Learning</strong>: Give algorithms labeled and defined training data to look for patterns. The sample data tells the algorithm what to do and what to expect as an output. For instance, millions of X-ray reports that say someone is healthy or sick would need to be tagged. Then, machine learning programs could use this training data to guess if a new X-ray shows signs of illness.</p>
</li>
<li><p><strong>Unsupervised Learning</strong>: Algorithms that use unsupervised learning learn from data that doesn't have labels. The algorithm must find patterns in untagged data without outside help. For instance, finding groups of people on Facebook or Twitter who have similar interests.</p>
</li>
<li><p><strong>Reinforcement Learning</strong>: This technique is a kind of machine learning in which an agent learns how to make choices by interacting with the world around it. The agent receives points for doing things right and loses points for doing things wrong. Its goal is to get as many points as possible. For instance, cars learn how to drive safely by making mistakes in simulations. They get rewards for staying in their lane, following traffic rules, and not hitting other cars.</p>
</li>
</ol>
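<p>To make the supervised/unsupervised distinction concrete, here is a small sketch of unsupervised learning using scikit-learn's KMeans. The user-interest data is invented for illustration: the algorithm receives no labels, yet discovers the two groups on its own:</p>

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: [hours on sports pages, hours on music pages]
users = np.array([
    [9.0, 0.5], [8.5, 1.0], [9.5, 0.2],   # (sports fans, unknown to the model)
    [0.3, 8.0], [1.0, 9.2], [0.5, 8.8],   # (music fans, unknown to the model)
])

# Ask for two clusters; KMeans groups similar users without any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(users)
print(kmeans.labels_)  # the first three users share one label, the last three another
```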
<h3 id="heading-machine-learningreal-world-examples">Machine Learning—Real-World Examples</h3>
<p><strong>Email Spam Detection</strong></p>
<p>You show the computer thousands of emails labeled "spam" or "not spam." It learns patterns, like how emails containing "FREE MONEY" are usually spam, and can then sort your inbox automatically.</p>
<p><strong>Photo Recognition</strong></p>
<p>Give the computer millions of pictures with labels that say what's in them. It learns that apples are likely to be round and have stems. Your phone can now tell what things are in your pictures.</p>
<p><strong>Movie Recommendations</strong></p>
<p>Netflix keeps track of the movies you've seen and rated. It finds people who like the same things you do. It suggests movies that other people like.</p>
<h2 id="heading-deep-learning-adding-complexity">Deep Learning: Adding Complexity</h2>
<p>Deep learning is a subset of machine learning that helps computers understand data the way humans do. It can identify complex patterns in images, text, sound, and other data to make accurate predictions. It uses artificial neural networks, inspired by the human brain: layers of connected nodes that process information.</p>
<h3 id="heading-how-does-deep-learning-work">How Does Deep Learning Work?</h3>
<p>Artificial neural networks are used in deep learning to learn from data. These networks consist of interconnected layers of nodes. Each node learns a different thing about the data.</p>
<p>For instance, when you show a computer a picture of a cat, the picture goes through a lot of steps. The first layer looks for shapes and edges. The second layer puts these shapes together to make ears, eyes, and whiskers. The last layers say things like "This picture looks like a cat." Deep learning can make a lot of mistakes when learning, but it gets better and better after each piece of feedback.</p>
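<p>The layer-by-layer idea can be illustrated with a toy forward pass in NumPy. The weights here are random, so the "prediction" is meaningless; the point is only how each layer transforms the previous layer's output:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # a common activation: keep positives, zero out negatives
    return np.maximum(0, x)

x = rng.normal(size=(1, 64))            # a flattened "image" input
W1 = rng.normal(size=(64, 32)) * 0.1    # layer 1: low-level features (edges)
W2 = rng.normal(size=(32, 16)) * 0.1    # layer 2: combinations (ears, eyes)
W3 = rng.normal(size=(16, 2)) * 0.1     # layer 3: class scores (cat vs. not-cat)

h1 = relu(x @ W1)        # first layer's view of the input
h2 = relu(h1 @ W2)       # second layer builds on the first
scores = h2 @ W3         # final scores for each class
probs = np.exp(scores) / np.exp(scores).sum()  # softmax -> probabilities
print(probs)
```

<p>Training adjusts the weight matrices after each batch of feedback, which is why the network "gets better and better" over time.</p>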
<h3 id="heading-deep-learningreal-world-examples">Deep Learning—Real-World Examples</h3>
<ul>
<li><p><strong>Tesla Autopilot</strong>: Processes eight cameras simultaneously to navigate roads, recognize traffic signs, and avoid obstacles.</p>
</li>
<li><p><strong>Google's DeepMind</strong>: Detects over fifty eye diseases from retinal scans with 94% accuracy.</p>
</li>
<li><p><strong>ChatGPT</strong>: Helps with writing, coding, and problem-solving.</p>
</li>
</ul>
<h2 id="heading-generative-ai-write-new">Generative AI: Creating New Content</h2>
<p>Generative AI is a subset of deep learning that makes new things, like stories, pictures, music, or code, instead of just looking at or sorting through things that are already there. Generative AI systems learn patterns from a lot of training data and then use those patterns to make new content.</p>
<h3 id="heading-real-world-examples">Real-World Examples</h3>
<ul>
<li><p>Chatbots help institutions give better customer service by making product suggestions and answering questions.</p>
</li>
<li><p>Automatically generate technical documents from the source code.</p>
</li>
<li><p>Auto-generate quizzes, practice problems, and explanations</p>
</li>
</ul>
<h2 id="heading-summary-of-differences-between-machine-learning-vs-deep-learning-vs-generative-ai">Summary of Differences Between Machine Learning vs Deep Learning vs Generative AI</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Feature</strong></td><td><strong>Machine Learning (ML)</strong></td><td><strong>Deep Learning (DL)</strong></td><td><strong>Generative AI (GenAI)</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Definition</strong></td><td>Subset of AI where machines learn from data to make predictions or decisions.</td><td>Subset of AI using artificial neural networks with multiple layers to model complex patterns</td><td>Subset of Deep learning that can create new content (text, images, code, etc.) similar to human-created content</td></tr>
<tr>
<td><strong>Data Requirements</strong></td><td>Small-to-medium datasets.</td><td>Large amounts of data (structured and unstructured)</td><td>Massive datasets for training, varying amounts for generation</td></tr>
<tr>
<td><strong>Computational Power</strong></td><td>Works on CPUs, moderate hardware.</td><td>Needs GPUs/TPUs for training.</td><td>Requires large-scale GPU/TPU clusters.</td></tr>
<tr>
<td><strong>Use Cases</strong></td><td>Predictions and classification.</td><td>Recognize complex data like speech, images, and language.</td><td>Generate new, original content.</td></tr>
<tr>
<td><strong>When NOT to Use</strong></td><td>Data is very complex or unstructured; accuracy is critical (medical, legal); you need to handle images/audio/video.</td><td>The dataset is small (&lt;1,000 samples) or computational resources are limited.</td><td>Copyright/IP restrictions.</td></tr>
<tr>
<td><strong>Cost Comparison</strong></td><td>Low ($1K–$10K) on standard servers.</td><td>Medium ($10K–$100K).</td><td>High ($100K–$1M+).</td></tr>
<tr>
<td><strong>Real-World Examples</strong></td><td>Netflix recommendations, fraud detection, spam filters.</td><td>Face recognition, self-driving cars, Siri/Alexa.</td><td>Original creative outputs (text, images, code, video).</td></tr>
</tbody>
</table>
</div><h2 id="heading-conclusion">Conclusion</h2>
<p>To sum it up, anyone who is keen to learn more about artificial intelligence needs to know the differences between machine learning, deep learning, and generative AI.</p>
<p>Machine learning is the basis for this because it lets computers learn from data and make predictions. Deep learning takes this a step further by using neural networks to process complicated data patterns in a way that is similar to how humans understand things.</p>
<p>Generative AI goes a step further by making new things, which shows how creative AI can be. As these technologies get better, they open up a lot of new opportunities in many fields, such as improving customer service, making medical diagnoses more accurate, and making new content. To maximize AI's benefits in your life, stay current on new developments.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Machine Learning System on Serverless Architecture ]]>
                </title>
                <description>
                    <![CDATA[ Let’s say you’ve built a fantastic machine learning model that performs beautifully in notebooks. But a model isn’t truly valuable until it’s in production, serving real users and solving real problems. In this article, you’ll learn how to ship a pro... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-machine-learning-system-on-serverless-architecture/</link>
                <guid isPermaLink="false">68addf802314e8b22eae4655</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ coding ]]>
                    </category>
                
                    <category>
                        <![CDATA[ serverless ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Tue, 26 Aug 2025 16:23:28 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756225357023/04572f1b-b9a7-43e0-aabc-2842faa2703f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Let’s say you’ve built a fantastic machine learning model that performs beautifully in notebooks.</p>
<p>But a model isn’t truly valuable until it’s in production, serving real users and solving real problems.</p>
<p>In this article, you’ll learn how to ship a production-ready ML application built on serverless architecture.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-were-building">What We’re Building</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-ai-pricing-for-retailers">AI Pricing for Retailers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-models">The Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tuning-and-training">Tuning and Training</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-prediction">The Prediction</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-performance-validation">Performance Validation</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-system-architecture">The System Architecture</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-core-aws-resources-in-the-architecture">Core AWS Resources in the Architecture</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-deployment-workflow-in-action">The Deployment Workflow in Action</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-step-1-draft-python-scripts">Step 1: Draft Python Scripts</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-configure-featuremodel-stores-in-s3">Step 2: Configure Feature/Model Stores in S3</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-create-a-flask-application-with-api-endpoints">Step 3: Create a Flask Application with API Endpoints</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-publish-a-docker-image-to-ecr">Step 4: Publish a Docker Image to ECR</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-create-a-lambda-function">Step 5: Create a Lambda Function</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-configure-aws-resources">Step 6: Configure AWS Resources</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-building-a-client-application-optional">Building a Client Application (Optional)</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-the-react-application">The React Application</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-final-results">Final Results</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>This project requires some basic experience with:</p>
<ul>
<li><p><strong>Machine Learning / Deep Learning:</strong> The full lifecycle, including data handling, model training, tuning, and validation.</p>
</li>
<li><p><strong>Coding:</strong> Proficiency in Python, with experience using major ML libraries such as PyTorch and Scikit-Learn.</p>
</li>
<li><p><strong>Full-stack deployment:</strong> Experience deploying applications using RESTful APIs.</p>
</li>
</ul>
<h2 id="heading-what-were-building">What We’re Building</h2>
<h3 id="heading-ai-pricing-for-retailers">AI Pricing for Retailers</h3>
<p>This project aims to help a mid-sized retailer compete with large players like Amazon.</p>
<p>Smaller companies often can’t afford significant price discounts, so they face challenges finding optimal price points as they expand their product lines.</p>
<p>Our goal is to leverage AI models to recommend the best price for a selected product to maximize sales for the retailer, and display it on a client-side user interface (UI):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755873936847/ecf696ef-e161-4453-a6ad-e97d92ac1677.png" alt="What the UI will look like" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You can explore the UI from <a target="_blank" href="https://kuriko-iwai.vercel.app/online-commerce-intelligence-hub">here</a>.</p>
<h3 id="heading-the-models">The Models</h3>
<p>I’ll train and tune multiple models so that when the primary model fails, a backup model gets loaded to serve predictions.</p>
<ul>
<li><p><strong>Primary Model</strong>: Multi-layered feedforward network (on the <strong>PyTorch</strong> library)</p>
</li>
<li><p><strong>Backup Models</strong>: LightGBM, SVR, and Elastic Net (on the <strong>Scikit-Learn</strong> library)</p>
</li>
</ul>
<p>The backup models are prioritized based on learning capabilities.</p>
<h3 id="heading-tuning-and-training">Tuning and Training</h3>
<p>The primary model was trained on a dataset of around 500,000 samples (<a target="_blank" href="https://archive.ics.uci.edu/dataset/352/online+retail">source)</a> and fine-tuned using <code>Optuna</code>'s Bayesian Optimization, with grid search available for further refinement.</p>
<p>The backups are also trained on the same samples and tuned using the <code>Scikit-Optimize</code> framework.</p>
<h3 id="heading-the-prediction">The Prediction</h3>
<p>All models serve predictions on <strong>logged quantity values.</strong></p>
<p>Log-transforming the quantity data compresses its range, which helps models learn patterns more effectively: logarithms reduce the impact of extreme values (outliers) and help normalize skewed data.</p>
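<p>A quick NumPy sketch shows how the log transform tames an outlier. The quantities below are invented for illustration:</p>

```python
import numpy as np

# Hypothetical skewed quantity data with one extreme outlier
quantities = np.array([1, 2, 3, 5, 8, 13, 500], dtype=float)

logged = np.log1p(quantities)  # log(1 + x), safe even when a quantity is 0

# On the raw scale the outlier dominates; on the log scale it is far
# closer to the rest of the data
print(quantities.max() / quantities.mean())  # large ratio
print(logged.max() / logged.mean())          # much smaller ratio

# np.expm1 inverts the transform, e.g. to recover real quantities
# from a model's logged predictions
assert np.allclose(np.expm1(logged), quantities)
```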
<h3 id="heading-performance-validation">Performance Validation</h3>
<p>We’ll evaluate model performance using different metrics for the transformed and original data, with a lower value always indicating better performance.</p>
<ul>
<li><p><strong>Logged values</strong>: Mean Squared Error (MSE)</p>
</li>
<li><p><strong>Actual values</strong>: Root Mean Squared Log Error (RMSLE) and Mean Absolute Error (MAE)</p>
</li>
</ul>
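<p>These metrics can all be computed with scikit-learn. The actual and predicted values below are illustrative only; note that scikit-learn's MSLE already applies <code>log1p</code> internally, so the MSE of the logged values equals the squared RMSLE:</p>

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_squared_log_error,
)

# Hypothetical actual vs. predicted quantities (original scale)
y_true = np.array([10.0, 25.0, 3.0, 120.0])
y_pred = np.array([12.0, 20.0, 4.0, 100.0])

# MSE on logged values (the scale the models are trained on)
mse_logged = mean_squared_error(np.log1p(y_true), np.log1p(y_pred))

# RMSLE and MAE on the original scale
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

print(f"MSE (logged): {mse_logged:.4f}, RMSLE: {rmsle:.4f}, MAE: {mae:.2f}")
```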
<h2 id="heading-the-system-architecture">The System Architecture</h2>
<p>We’re going to build a complete ecosystem around an <strong>AWS Lambda function</strong> to create a scalable ML system:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:4680/0*ulcNtwJeU5EOfhTg.png" alt="Fig. The system architecture (Created by Kuriko IWAI)" width="600" height="400" loading="lazy"></p>
<p>Fig. The system architecture (Created by <a target="_blank" href="https://kuriko-iwai.vercel.app/">Kuriko IWAI)</a></p>
<p><strong>AWS Lambda</strong> is a <strong>serverless compute service</strong> that runs your application without you managing any servers. Once you upload the code, AWS takes responsibility for the underlying infrastructure.</p>
<p>In a serverless environment, the code is deployed as <strong>a stateless function</strong> that runs only when it’s triggered by an event, such as an HTTP request or a scheduled task.</p>
<p>This event-driven nature makes serverless computing extremely efficient in resource allocation because:</p>
<ul>
<li><p><strong>There’s no server management</strong>: The cloud provider takes care of operational tasks.</p>
</li>
<li><p><strong>You have automatic scaling</strong>: Serverless applications automatically scale up or down based on demand.</p>
</li>
<li><p><strong>You have pay-per-use billing</strong>: You’re charged for exactly the compute resources the application consumes.</p>
</li>
</ul>
<p>Note that other cloud ecosystems like Google Cloud Platform (GCP) and Microsoft Azure offer comprehensive alternatives to AWS. Which one you choose depends on your budget, project type, and familiarity with each ecosystem.</p>
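<p>To make the event-driven model concrete, here is a minimal, purely illustrative Lambda-style handler in Python. The payload fields (<code>product_id</code>, <code>recommended_price</code>) and the hard-coded response are hypothetical stand-ins for the real inference logic:</p>

```python
import json

def lambda_handler(event, context):
    """Illustrative handler: API Gateway passes the HTTP request as `event`."""
    body = json.loads(event.get("body") or "{}")
    product_id = body.get("product_id")
    # ...in the real system, models are loaded from S3 and inference runs here...
    prediction = {"product_id": product_id, "recommended_price": 19.99}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(prediction),
    }

# Simulate an API Gateway invocation locally
event = {"body": json.dumps({"product_id": "SKU-123"})}
response = lambda_handler(event, None)
print(response["statusCode"])  # 200
```

<p>The function holds no state between invocations, which is exactly why heavy work like training belongs in a batch process rather than in the handler.</p>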
<h3 id="heading-core-aws-resources-in-the-architecture">Core AWS Resources in the Architecture</h3>
<p>The system architecture focuses on the following points:</p>
<ul>
<li><p>The application is fully containerized on Docker for universal accessibility.</p>
</li>
<li><p>The container image is stored in AWS Elastic Container Registry (ECR).</p>
</li>
<li><p>The API Gateway’s REST API endpoints trigger an event to invoke the Lambda function.</p>
</li>
<li><p>The Lambda function loads the container image from ECR and performs inference.</p>
</li>
<li><p>Trained models, processors, and input features are stored in AWS S3 buckets.</p>
</li>
<li><p>A Redis client serves cached analytical data and past predictions stored in the ElastiCache.</p>
</li>
</ul>
<p>And to build the system, we’ll use the following AWS resources:</p>
<ul>
<li><p><strong>Lambda</strong>: Runs the function that performs inference.</p>
</li>
<li><p><strong>API Gateway:</strong> Routes API calls to the Lambda function.</p>
</li>
<li><p><strong>S3 Storage</strong>: Serves feature store and model store.</p>
</li>
<li><p><strong>ElastiCache:</strong> Stores cached predictions and analytical data.</p>
</li>
<li><p><strong>ECR</strong>: Stores Docker container images to allow Lambda to pull the image.</p>
</li>
</ul>
<p>Each resource requires configuration. I’ll explore those details in the next section.</p>
<h2 id="heading-the-deployment-workflow-in-action"><strong>The Deployment Workflow in Action</strong></h2>
<p>The deployment workflow involves the following steps:</p>
<ol>
<li><p>Draft data preparation, model training, and serialization scripts</p>
</li>
<li><p>Configure designated feature store and model store in S3</p>
</li>
<li><p>Create a Flask application with API endpoints</p>
</li>
<li><p>Publish a Docker image to ECR</p>
</li>
<li><p>Create a Lambda function</p>
</li>
<li><p>Configure related AWS resources</p>
</li>
</ol>
<p>We’ll now walk through each of these steps to help you fully understand the process.</p>
<p>For your reference, here is the repository structure:</p>
<pre><code class="lang-markdown">.
.venv/                  [.gitignore]    # stores uv venv
│
└── data/               [.gitignore]
│     └──raw/                           # stores raw data
│     └──preprocessed/                  # stores processed data after imputation and engineering
│
└── models/             [.gitignore]    # stores serialized model after training and tuning
│     └──dfn/                           # deep feedforward network
│     └──gbm/                           # light gbm
│     └──en/                            # elastic net
│     └──production/                    # models to be stored in S3 for production use
|
└── notebooks/                          # stores experimentation notebooks
│
└── src/                                # core functions
│     └──_utils/                        # utility functions
│     └──data_handling/                 # functions to engineer features
│     └──model/                         # functions to train, tune, validate models
│     │     └── sklearn_model
│     │     └── torch_model
│     │     └── ...
│     └──main.py                        # main script to run the inference locally
│
└──app.py                               # Flask application (API endpoints)
└──pyproject.toml                       # project configuration
└──.env                [.gitignore]     # environment variables
└──uv.lock                              # dependency locking
└──Dockerfile                           # for Docker container image
└──.dockerignore
└──requirements.txt
└──.python-version                      # python version locking (3.12)
</code></pre>
<h3 id="heading-step-1-draft-python-scripts">Step 1: Draft Python Scripts</h3>
<p>The first step is to draft Python scripts for data preparation, model training and tuning.</p>
<p>We’ll run these scripts in a <strong>batch process</strong> because these are resource-intensive and stateful tasks that aren’t suitable for serverless functions optimized for short-lived, stateless, and event-driven tasks.</p>
<p>Serverless functions can also experience <a target="_blank" href="https://www.freecodecamp.org/news/cold-start-problem-in-recommender-systems/"><strong>cold starts</strong></a>. With heavy tasks in the function, the API Gateway would time out before predictions could be served.</p>
<p><code>src/main.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> warnings
<span class="hljs-keyword">import</span> pickle
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> lightgbm <span class="hljs-keyword">as</span> lgb
<span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> ElasticNet
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVR
<span class="hljs-keyword">from</span> skopt.space <span class="hljs-keyword">import</span> Real, Integer, Categorical
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">import</span> src.data_handling <span class="hljs-keyword">as</span> data_handling
<span class="hljs-keyword">import</span> src.model.torch_model <span class="hljs-keyword">as</span> t
<span class="hljs-keyword">import</span> src.model.sklearn_model <span class="hljs-keyword">as</span> sk


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>: 
    load_dotenv(override=<span class="hljs-literal">True</span>)
    os.makedirs(PRODUCTION_MODEL_FOLDER_PATH, exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># create train, validation, test datasets</span>
    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = data_handling.main_script()

    <span class="hljs-comment"># store the trained preprocessor in local storage</span>
    joblib.dump(preprocessor, PREPROCESSOR_PATH)

    <span class="hljs-comment"># model tuning and training</span>
    best_dfn_full_trained, checkpoint = t.main_script(X_train, X_val, y_train, y_val)

    <span class="hljs-comment"># serialize the trained model</span>
    torch.save(checkpoint, DFN_FILE_PATH)

    <span class="hljs-comment"># svr</span>
    best_svr_trained, best_hparams_svr = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[<span class="hljs-number">1</span>]
    )
    <span class="hljs-keyword">if</span> best_svr_trained <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">with</span> open(SVR_FILE_PATH, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
            pickle.dump({ <span class="hljs-string">'best_model'</span>: best_svr_trained, <span class="hljs-string">'best_hparams'</span>: best_hparams_svr }, f)

    <span class="hljs-comment"># elastic net</span>
    best_en_trained, best_hparams_en = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[<span class="hljs-number">0</span>]
    )
    <span class="hljs-keyword">if</span> best_en_trained <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">with</span> open(EN_FILE_PATH, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
            pickle.dump({ <span class="hljs-string">'best_model'</span>: best_en_trained, <span class="hljs-string">'best_hparams'</span>: best_hparams_en }, f)

    <span class="hljs-comment"># light gbm</span>
    best_gbm_trained, best_hparams_gbm = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[<span class="hljs-number">2</span>]
    )

    <span class="hljs-keyword">if</span> best_gbm_trained <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">with</span> open(GBM_FILE_PATH, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
            pickle.dump({<span class="hljs-string">'best_model'</span>: best_gbm_trained, <span class="hljs-string">'best_hparams'</span>: best_hparams_gbm }, f)
</code></pre>
<p>Run the script to train and serialize the models using the <code>uv</code> package management:</p>
<pre><code class="lang-bash">$ uv venv
$ source .venv/bin/activate
$ uv run src/main.py
</code></pre>
<p>The <code>main.py</code> script includes several key components.</p>
<h4 id="heading-scripts-for-data-handling">Scripts for Data Handling</h4>
<p>These scripts involve loading the original data, structuring missing values, and engineering the features needed for prediction.</p>
<p><code>src/data_handling/main.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-comment"># load and save the original data frame in parquet</span>
df = scripts.load_original_dataframe()
df.to_parquet(ORIGINAL_DF_PATH, index=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># imputation</span>
df = scripts.structure_missing_values(df=df)

<span class="hljs-comment"># feature engineering</span>
df = scripts.handle_feature_engineering(df=df)

<span class="hljs-comment"># save processed df in csv and parquet</span>
scripts.save_df_to_csv(df=df)
df.to_parquet(PROCESSED_DF_PATH, index=<span class="hljs-literal">False</span>)


<span class="hljs-comment"># for preprocessing, classify numerical and categorical columns</span>
num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
<span class="hljs-keyword">if</span> cat_cols:
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cat_cols: df[col] = df[col].astype(<span class="hljs-string">'string'</span>)

<span class="hljs-comment"># creates training, validation, and test datasets (test dataset is for inference only)</span>
y = df[target_col]
X = df.copy().drop(target_col, axis=<span class="hljs-string">'columns'</span>)
test_size, random_state = <span class="hljs-number">50000</span>, <span class="hljs-number">42</span>
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=test_size, random_state=random_state
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=test_size, random_state=random_state
)

<span class="hljs-comment"># transform the input datasets</span>
X_train, X_val, X_test, preprocessor = scripts.transform_input(
    X_train, X_val, X_test, num_cols=num_cols, cat_cols=cat_cols
)

<span class="hljs-comment"># retrain and serialize the preprocessor</span>
<span class="hljs-keyword">if</span> preprocessor <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>: preprocessor.fit(X)
joblib.dump(preprocessor, PREPROCESSOR_PATH)
</code></pre>
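<p>The helpers referenced above (<code>load_original_dataframe</code>, <code>categorize_num_cat_cols</code>, and so on) are defined in <code>src/data_handling/scripts</code> and are not shown here. As an illustration only, here is a minimal sketch of what a dtype-based <code>categorize_num_cat_cols</code> might look like (our assumption, not the project’s actual implementation):</p>
<pre><code class="lang-python">import pandas as pd

def categorize_num_cat_cols(df: pd.DataFrame, target_col: str):
    # split feature columns by dtype: numeric columns vs. everything else
    feature_cols = [c for c in df.columns if c != target_col]
    num_cols = [c for c in feature_cols if pd.api.types.is_numeric_dtype(df[c])]
    cat_cols = [c for c in feature_cols if c not in num_cols]
    return num_cols, cat_cols
</code></pre>
<p>Keeping the target column out of both lists ensures the preprocessor never transforms the label.</p>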
<h4 id="heading-scripts-for-model-training-and-tuning-pytorch-model">Scripts for Model Training and Tuning (PyTorch Model)</h4>
<p>These scripts initialize the model, search for the optimal neural architecture and hyperparameters, and serialize the fully trained model so the system can load it when performing inference.</p>
<p>Because the primary model is built on PyTorch and the backups use Scikit-Learn, we’re drafting the scripts separately.</p>
<h4 id="heading-1-pytorch-models">1. PyTorch Models</h4>
<p><strong>The training script</strong> trains the model while validating it on a subset of the training data.</p>
<p>It also implements early stopping: training halts when the validation loss has not improved for a given number of consecutive epochs (here, 10).</p>
<p><code>src/model/torch_model/scripts/training.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> optuna <span class="hljs-comment"># type: ignore</span>
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># device</span>
device_type = device_type <span class="hljs-keyword">if</span> device_type <span class="hljs-keyword">else</span> <span class="hljs-string">'cuda'</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'mps'</span> <span class="hljs-keyword">if</span> torch.backends.mps.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'cpu'</span>
device = torch.device(device_type)

<span class="hljs-comment"># gradient scaler for stability (only applicable to cuda)</span>
scaler = torch.GradScaler(device=device_type) <span class="hljs-keyword">if</span> device_type == <span class="hljs-string">'cuda'</span> <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># start training</span>
best_val_loss = float(<span class="hljs-string">'inf'</span>)
epochs_no_improve = <span class="hljs-number">0</span>
<span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(num_epochs):
    model.train()
    <span class="hljs-keyword">for</span> batch_X, batch_y <span class="hljs-keyword">in</span> train_data_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()

        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># pytorch's AMP system automatically handles the casting of tensors to Float16 or Float32</span>
            <span class="hljs-keyword">with</span> torch.autocast(device_type=device_type):
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)

                <span class="hljs-comment"># break the training loop when models return nan or inf</span>
                <span class="hljs-keyword">if</span> torch.any(torch.isnan(outputs)) <span class="hljs-keyword">or</span> torch.any(torch.isinf(outputs)):
                    main_logger.error(
                        <span class="hljs-string">'pytorch model returns nan or inf. break the training loop.'</span>
                    )
                    <span class="hljs-keyword">break</span>

            <span class="hljs-comment"># create scaled gradients of losses</span>
            <span class="hljs-keyword">if</span> scaler <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
                scaler.scale(loss).backward()
                scaler.unscale_(optimizer)  <span class="hljs-comment"># unscale gradients before clipping</span>
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=<span class="hljs-number">1.0</span>)
                scaler.step(optimizer)  <span class="hljs-comment"># steps the optimizer, skipping the step if grads contain inf/nan</span>
                scaler.update()  <span class="hljs-comment"># updates the scale</span>

            <span class="hljs-keyword">else</span>:
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=<span class="hljs-number">1.0</span>) <span class="hljs-comment"># clip gradients</span>
                optimizer.step()

        <span class="hljs-keyword">except</span> Exception:
            <span class="hljs-comment"># fall back to full-precision training if autocast/scaling fails</span>
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()


    <span class="hljs-comment"># run validation on a subset of the training dataset</span>
    model.eval()
    val_loss = <span class="hljs-number">0.0</span>

    <span class="hljs-comment"># switch the torch mode</span>
    <span class="hljs-keyword">with</span> torch.inference_mode():
        <span class="hljs-keyword">for</span> batch_X_val, batch_y_val <span class="hljs-keyword">in</span> val_data_loader:
            batch_X_val, batch_y_val = batch_X_val.to(device), batch_y_val.to(device)
            outputs_val = model(batch_X_val)
            val_loss += criterion(outputs_val, batch_y_val).item()

    val_loss /= len(val_data_loader)

    <span class="hljs-comment"># check if early stop</span>
    <span class="hljs-keyword">if</span> val_loss &lt; best_val_loss - min_delta:
        best_val_loss = val_loss
        epochs_no_improve = <span class="hljs-number">0</span>
    <span class="hljs-keyword">else</span>:
        epochs_no_improve += <span class="hljs-number">1</span>
        <span class="hljs-keyword">if</span> epochs_no_improve &gt;= patience:
            main_logger.info(<span class="hljs-string">f'early stopping at epoch <span class="hljs-subst">{epoch + <span class="hljs-number">1</span>}</span>'</span>)
            <span class="hljs-keyword">break</span>
</code></pre>
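<p>The early stopping bookkeeping in the loop above can be isolated into a small helper, which makes it easy to unit test on its own. A minimal sketch (the <code>EarlyStopper</code> name is ours, not from the project):</p>
<pre><code class="lang-python">class EarlyStopper:
    """Signals a stop after `patience` consecutive epochs without improvement."""
    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_val_loss = float('inf')
        self.epochs_no_improve = 0

    def should_stop(self, val_loss: float) -> bool:
        # an epoch "improves" only if it beats the best loss by at least min_delta
        if val_loss < self.best_val_loss - self.min_delta:
            self.best_val_loss = val_loss
            self.epochs_no_improve = 0
        else:
            self.epochs_no_improve += 1
        return self.epochs_no_improve >= self.patience
</code></pre>
<p>In the training loop, <code>stopper = EarlyStopper(patience=10)</code> would replace the <code>best_val_loss</code>/<code>epochs_no_improve</code> pair, and <code>if stopper.should_stop(val_loss): break</code> would end the loop.</p>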
<p><strong>The tuning script</strong> uses the <code>study</code> component from the <code>Optuna</code> library to run Bayesian optimization.</p>
<p>The <code>study</code> component chooses a neural architecture and a hyperparameter set to test from the global search space.</p>
<p>Then it builds, trains, and validates the model to find the optimal neural architecture that minimizes the loss (MSE, for instance).</p>
<p><code>src/model/torch_model/scripts/tuning.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> itertools
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> optuna
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader, TensorDataset
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">from</span> src.model.torch_model.scripts.pretrained_base <span class="hljs-keyword">import</span> DFN
<span class="hljs-keyword">from</span> src.model.torch_model.scripts.training <span class="hljs-keyword">import</span> train_model
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># device</span>
device_type = <span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"mps"</span> <span class="hljs-keyword">if</span> torch.backends.mps.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>
device = torch.device(device_type)

<span class="hljs-comment"># loss function</span>
criterion = nn.MSELoss()

<span class="hljs-comment"># define objective function for optuna</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">objective</span>(<span class="hljs-params">trial</span>):</span>
    <span class="hljs-comment"># model</span>
    num_layers = trial.suggest_int(<span class="hljs-string">'num_layers'</span>, <span class="hljs-number">1</span>, <span class="hljs-number">20</span>)
    batch_norm = trial.suggest_categorical(<span class="hljs-string">'batch_norm'</span>, [<span class="hljs-literal">True</span>, <span class="hljs-literal">False</span>])
    dropout_rates = []
    hidden_units_per_layer = []
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(num_layers):
        dropout_rates.append(trial.suggest_float(<span class="hljs-string">f'dropout_rate_layer_<span class="hljs-subst">{i}</span>'</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.6</span>))
        hidden_units_per_layer.append(trial.suggest_int(<span class="hljs-string">f'n_units_layer_<span class="hljs-subst">{i}</span>'</span>, <span class="hljs-number">8</span>, <span class="hljs-number">256</span>)) <span class="hljs-comment"># hidden units per layer</span>

    model = DFN(
        input_dim=X_train.shape[<span class="hljs-number">1</span>],
        num_layers=num_layers,
        dropout_rates=dropout_rates,
        batch_norm=batch_norm,
        hidden_units_per_layer=hidden_units_per_layer
    ).to(device)

    <span class="hljs-comment"># optimizer</span>
    learning_rate = trial.suggest_float(<span class="hljs-string">'learning_rate'</span>, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1e-1</span>, log=<span class="hljs-literal">True</span>)
    optimizer_name = trial.suggest_categorical(<span class="hljs-string">'optimizer'</span>, [<span class="hljs-string">'adam'</span>, <span class="hljs-string">'rmsprop'</span>, <span class="hljs-string">'sgd'</span>, <span class="hljs-string">'adamw'</span>, <span class="hljs-string">'adamax'</span>, <span class="hljs-string">'adadelta'</span>, <span class="hljs-string">'radam'</span>])
    optimizer = _handle_optimizer(optimizer_name=optimizer_name, model=model, lr=learning_rate)

    <span class="hljs-comment"># data loaders</span>
    batch_size = trial.suggest_categorical(<span class="hljs-string">'batch_size'</span>, [<span class="hljs-number">32</span>, <span class="hljs-number">64</span>, <span class="hljs-number">128</span>, <span class="hljs-number">256</span>])
    test_size = <span class="hljs-number">10000</span> <span class="hljs-keyword">if</span> len(X_train) &gt; <span class="hljs-number">15000</span> <span class="hljs-keyword">else</span> int(len(X_train) * <span class="hljs-number">0.2</span>)
    X_train_search, X_val_search, y_train_search, y_val_search = train_test_split(X_train, y_train, test_size=test_size, random_state=<span class="hljs-number">42</span>)
    train_data_loader = create_torch_data_loader(X=X_train_search, y=y_train_search, batch_size=batch_size)
    val_data_loader = create_torch_data_loader(X=X_val_search, y=y_val_search, batch_size=batch_size)

    <span class="hljs-comment"># training</span>
    num_epochs = <span class="hljs-number">3000</span> <span class="hljs-comment"># ensure enough epochs (early stopping would stop the loop when overfitting)</span>
    _, best_val_loss = train_model(
        train_data_loader=train_data_loader,
        val_data_loader=val_data_loader,
        model=model,
        optimizer=optimizer,
        criterion = criterion,
        num_epochs=num_epochs,
        trial=trial,
    )
    <span class="hljs-keyword">return</span> best_val_loss


<span class="hljs-comment"># start to optimize hyperparameters and architecture</span>
study = optuna.create_study(direction=<span class="hljs-string">'minimize'</span>, sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=<span class="hljs-number">50</span>, timeout=<span class="hljs-number">600</span>)

<span class="hljs-comment"># best trial and its hyperparameters</span>
best_trial = study.best_trial
best_hparams = best_trial.params

<span class="hljs-comment"># construct the model based on the tuning results</span>
best_lr = best_hparams[<span class="hljs-string">'learning_rate'</span>]
best_batch_size = best_hparams[<span class="hljs-string">'batch_size'</span>]
input_dim = X_train.shape[<span class="hljs-number">1</span>]
best_model = DFN(
    input_dim=input_dim,
    num_layers=best_hparams[<span class="hljs-string">'num_layers'</span>],
    hidden_units_per_layer=[v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> best_hparams.items() <span class="hljs-keyword">if</span> <span class="hljs-string">'n_units_layer_'</span> <span class="hljs-keyword">in</span> k],
    batch_norm=best_hparams[<span class="hljs-string">'batch_norm'</span>],
    dropout_rates=[v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> best_hparams.items() <span class="hljs-keyword">if</span> <span class="hljs-string">'dropout_rate_layer_'</span> <span class="hljs-keyword">in</span> k],
).to(device)

<span class="hljs-comment"># construct an optimizer based on the tuning results</span>
best_optimizer_name = best_hparams[<span class="hljs-string">'optimizer'</span>]
best_optimizer = _handle_optimizer(
    optimizer_name=best_optimizer_name, model=best_model, lr=best_lr
)

<span class="hljs-comment"># create torch data loaders</span>
train_data_loader = create_torch_data_loader(
    X=X_train, y=y_train, batch_size=best_batch_size
)
val_data_loader = create_torch_data_loader(
    X=X_val, y=y_val, batch_size=best_batch_size
)

<span class="hljs-comment"># retrain the best model with full training dataset applying the optimal batch size and optimizer</span>
best_model, _ = train_model(
    train_data_loader=train_data_loader,
    val_data_loader=val_data_loader,
    model=best_model,
    optimizer=best_optimizer,
    criterion = criterion,
    num_epochs=<span class="hljs-number">1000</span>
)

<span class="hljs-comment"># create a checkpoint for serialization (reconstruct the model using the checkpoint)</span>
checkpoint = {
    <span class="hljs-string">'state_dict'</span>: best_model.state_dict(),
    <span class="hljs-string">'hparams'</span>: best_hparams,
    <span class="hljs-string">'input_dim'</span>: X_train.shape[<span class="hljs-number">1</span>],
    <span class="hljs-string">'optimizer'</span>: best_optimizer,
    <span class="hljs-string">'batch_size'</span>: best_batch_size
}

<span class="hljs-comment"># serialize the model w/ checkpoint</span>
torch.save(checkpoint, FILE_PATH)
</code></pre>
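<p>At inference time, the checkpoint lets us rebuild the same architecture before loading the weights. Here is a minimal sketch of reconstructing the <code>DFN</code> constructor arguments from the saved hyperparameters (the sample values are illustrative, and the torch calls are commented out to keep the sketch self-contained):</p>
<pre><code class="lang-python"># in production this dict would come from torch.load(FILE_PATH)
checkpoint = {
    'hparams': {
        'num_layers': 2, 'batch_norm': True, 'learning_rate': 1e-3,
        'n_units_layer_0': 64, 'n_units_layer_1': 32,
        'dropout_rate_layer_0': 0.1, 'dropout_rate_layer_1': 0.2,
    },
    'input_dim': 10,
}

hp = checkpoint['hparams']
n = hp['num_layers']
model_kwargs = dict(
    input_dim=checkpoint['input_dim'],
    num_layers=n,
    # index explicitly by layer number rather than relying on dict insertion order
    hidden_units_per_layer=[hp[f'n_units_layer_{i}'] for i in range(n)],
    dropout_rates=[hp[f'dropout_rate_layer_{i}'] for i in range(n)],
    batch_norm=hp['batch_norm'],
)
# model = DFN(**model_kwargs).to(device)
# model.load_state_dict(checkpoint['state_dict'])
</code></pre>
<p>Indexing by <code>range(num_layers)</code> guarantees the per-layer lists come out in layer order, which the list-comprehension filter over <code>best_hparams.items()</code> only gets implicitly from dict insertion order.</p>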
<h4 id="heading-2-scikit-learn-models-backups">2. Scikit-Learn Models (Backups)</h4>
<p>For Scikit-Learn models, we’ll run <strong>k-fold cross validation</strong> during training to prevent overfitting.</p>
<p>K-fold cross-validation is a technique for evaluating a machine learning model's performance by repeatedly training and validating it on different subsets (folds) of the training data.</p>
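<p>For a model that fits in a single call, scikit-learn’s <code>cross_val_score</code> expresses the same idea directly. A minimal sketch on synthetic data (the regressor and data here are illustrative):</p>
<pre><code class="lang-python">import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# synthetic regression data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ rng.normal(size=4) + rng.normal(scale=0.1, size=200)

# 5-fold CV: each fold serves once as the held-out validation split
scores = cross_val_score(Ridge(), X, y, cv=5, scoring='neg_mean_squared_error')
mean_mse = -scores.mean()
</code></pre>
<p>The hand-rolled loop that follows re-implements this so that per-iteration early stopping can run inside each fold.</p>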
<p>We define the <code>run_kfold_validation</code> function where the model is trained and validated using <strong>5-fold cross-validation</strong>.</p>
<p><code>src/model/sklearn_model/scripts/tuning.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> KFold
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_squared_error

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_kfold_validation</span>(<span class="hljs-params">
        X_train,
        y_train,
        base_model,
        hparams: dict,
        n_splits: int = <span class="hljs-number">5</span>, <span class="hljs-comment"># the number of folds </span>
        early_stopping_rounds: int = <span class="hljs-number">10</span>,
        max_iters: int = <span class="hljs-number">200</span>
    </span>) -&gt; float:</span>

    mses = <span class="hljs-number">0.0</span>

    <span class="hljs-comment"># create k-fold component</span>
    kf = KFold(n_splits=n_splits, shuffle=<span class="hljs-literal">True</span>, random_state=<span class="hljs-number">42</span>)

    <span class="hljs-keyword">for</span> fold, (train_index, val_index) <span class="hljs-keyword">in</span> enumerate(kf.split(X_train)):
        <span class="hljs-comment"># create a subset of training and validation datasets from the entire training data</span>
        X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
        y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

        <span class="hljs-comment"># reconstruct a model</span>
        model = base_model(**hparams)

        <span class="hljs-comment"># start the cross validation</span>
        best_val_mse = float(<span class="hljs-string">'inf'</span>)
        patience_counter = <span class="hljs-number">0</span>
        best_model_state = <span class="hljs-literal">None</span>
        best_iteration = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> iteration <span class="hljs-keyword">in</span> range(max_iters):
            <span class="hljs-comment"># train on a subset of the training data</span>
            <span class="hljs-keyword">try</span>:
                model.train_one_step(X_train_fold, y_train_fold, iteration)
            <span class="hljs-keyword">except</span>:
                model.fit(X_train_fold, y_train_fold)

            <span class="hljs-comment"># make a prediction on validation data </span>
            y_pred_val_kf = model.predict(X_val_fold)

            <span class="hljs-comment"># compute validation loss (MSE)</span>
            current_val_mse = mean_squared_error(y_val_fold, y_pred_val_kf)

            <span class="hljs-comment"># check if epochs should be stopped (early stopping)</span>
            <span class="hljs-keyword">if</span> current_val_mse &lt; best_val_mse:
                best_val_mse = current_val_mse
                patience_counter = <span class="hljs-number">0</span>
                best_model_state = model.get_params()
                best_iteration = iteration
            <span class="hljs-keyword">else</span>:
                patience_counter += <span class="hljs-number">1</span>

            <span class="hljs-comment"># execute early stopping when patience_counter exceeds early_stopping_rounds</span>
            <span class="hljs-keyword">if</span> patience_counter &gt;= early_stopping_rounds:
                main_logger.info(<span class="hljs-string">f"Fold <span class="hljs-subst">{fold}</span>: Early stopping triggered at iteration <span class="hljs-subst">{iteration}</span> (best at <span class="hljs-subst">{best_iteration}</span>). Best MSE: <span class="hljs-subst">{best_val_mse:<span class="hljs-number">.4</span>f}</span>"</span>)
                <span class="hljs-keyword">break</span>


        <span class="hljs-comment"># after training epochs, reconstruct the best performing model </span>
        <span class="hljs-keyword">if</span> best_model_state: model.set_params(**best_model_state)

        <span class="hljs-comment"># make prediction</span>
        y_pred_val_kf = model.predict(X_val_fold)

        <span class="hljs-comment"># accumulate MSEs</span>
        mses += mean_squared_error(y_val_fold, y_pred_val_kf)

    <span class="hljs-comment"># compute the final loss (average of MSEs across folds)</span>
    ave_mse = mses / n_splits
    <span class="hljs-keyword">return</span> ave_mse
</code></pre>
<p>Then, for the <strong>tuning script</strong>, we use the <code>gp_minimize</code> function from the <code>Scikit-Optimize</code> library.</p>
<p>The <code>gp_minimize</code> function is used to tune hyperparameters with Bayesian optimization.</p>
<p>This function intelligently searches for the hyperparameter set that minimizes the model's error, computed by the <code>run_kfold_validation</code> function defined earlier.</p>
<p>The best-performing hyperparameters are then used to reconstruct and train the final model.</p>
<p><code>src/model/sklearn_model/scripts/tuning.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> functools <span class="hljs-keyword">import</span> partial
<span class="hljs-keyword">from</span> skopt <span class="hljs-keyword">import</span> gp_minimize


<span class="hljs-comment"># define the objective function for Bayesian Optimization using Scikit-Optimize</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">objective</span>(<span class="hljs-params">params, X_train, y_train, base_model, hparam_names</span>):</span>
    hparams = {item: params[i] <span class="hljs-keyword">for</span> i, item <span class="hljs-keyword">in</span> enumerate(hparam_names)}
    ave_mse = run_kfold_validation(X_train=X_train, y_train=y_train, base_model=base_model, hparams=hparams)
    <span class="hljs-keyword">return</span> ave_mse

<span class="hljs-comment"># extract the hyperparameter names from the search space</span>
hparam_names = [s.name <span class="hljs-keyword">for</span> s <span class="hljs-keyword">in</span> space]
objective_partial = partial(objective, X_train=X_train, y_train=y_train, base_model=base_model, hparam_names=hparam_names)

<span class="hljs-comment"># search the optimal hyperparameters</span>
results = gp_minimize(
    func=objective_partial,
    dimensions=space,
    n_calls=n_calls,
    random_state=<span class="hljs-number">42</span>,
    verbose=<span class="hljs-literal">False</span>,
    n_initial_points=<span class="hljs-number">10</span>,
)
<span class="hljs-comment"># results</span>
best_hparams = dict(zip(hparam_names, results.x))
best_mse = results.fun

<span class="hljs-comment"># reconstruct the model with the best hyperparameters</span>
best_model = base_model(**best_hparams)

<span class="hljs-comment"># retrain the model with full training dataset</span>
best_model.fit(X_train, y_train)
</code></pre>
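<p>The snippet above assumes a <code>space</code> list of Scikit-Optimize dimension objects defined elsewhere in the repository. As an illustration only, a search space for a gradient-boosting regressor might look like this (the parameter names and ranges are our assumptions, not the project’s):</p>
<pre><code class="lang-python">from skopt.space import Real, Integer, Categorical

# each dimension's `name` must match a keyword argument of the model constructor
space = [
    Real(1e-3, 0.3, prior='log-uniform', name='learning_rate'),
    Integer(50, 500, name='n_estimators'),
    Integer(8, 64, name='num_leaves'),
    Categorical(['gbdt', 'dart'], name='boosting_type'),
]
</code></pre>
<p><code>gp_minimize</code> consumes this list through its <code>dimensions</code> argument and passes candidate values to the objective in the same order.</p>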
<h3 id="heading-step-2-configure-featuremodel-stores-in-s3">Step 2: Configure Feature/Model Stores in S3</h3>
<p>The processed data is stored in the S3 bucket as <strong>Parquet files</strong>, alongside the serialized trained models.</p>
<p>We’ll draft the <code>s3_upload</code> function, where the <strong>Boto3 client</strong>, a low-level interface to AWS services, initiates the connection to S3:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">s3_upload</span>(<span class="hljs-params">file_path: str</span>):</span>
    <span class="hljs-comment"># initiate the boto3 client</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    S3_BUCKET_NAME = os.environ.get(<span class="hljs-string">'S3_BUCKET_NAME'</span>) <span class="hljs-comment"># the bucket created in s3</span>
    s3_client = boto3.client(<span class="hljs-string">'s3'</span>, region_name=os.environ.get(<span class="hljs-string">'AWS_REGION_NAME'</span>)) <span class="hljs-comment"># your default region</span>

    <span class="hljs-keyword">if</span> s3_client:
        <span class="hljs-comment"># create s3 key and upload the file to the bucket</span>
        s3_key = file_path <span class="hljs-keyword">if</span> file_path[<span class="hljs-number">0</span>] != <span class="hljs-string">'/'</span> <span class="hljs-keyword">else</span> file_path[<span class="hljs-number">1</span>:]
        s3_client.upload_file(file_path, S3_BUCKET_NAME, s3_key)
        main_logger.info(<span class="hljs-string">f"file uploaded to s3://<span class="hljs-subst">{S3_BUCKET_NAME}</span>/<span class="hljs-subst">{s3_key}</span>"</span>)
    <span class="hljs-keyword">else</span>:
        main_logger.error(<span class="hljs-string">'failed to create an S3 client.'</span>)
</code></pre>
<h4 id="heading-model-store">Model Store</h4>
<p>Trained PyTorch models are serialized into <code>.pth</code> files.</p>
<p>Then, these files are uploaded to the S3 bucket, enabling the system to load the trained model when it performs inference in production.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> s3_upload

<span class="hljs-comment"># model serialization, store in local</span>
torch.save(trained_model.state_dict(), MODEL_FILE_PATH)

<span class="hljs-comment"># upload to s3 model store</span>
s3_upload(file_path=MODEL_FILE_PATH)
</code></pre>
<h4 id="heading-feature-store">Feature Store</h4>
<p>The processed data is saved in both CSV and Parquet file formats.</p>
<p>Then, the Parquet files are uploaded to the S3 bucket, enabling the system to load the lightweight data when it creates prediction data to perform inference in production.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> s3_upload

<span class="hljs-comment"># store csv and parquet files in local</span>
df.to_csv(file_path, index=<span class="hljs-literal">False</span>)
df.to_parquet(DATA_FILE_PATH, index=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># store in s3 feature store</span>
s3_upload(file_path=DATA_FILE_PATH)

<span class="hljs-comment"># trained preprocessor is also stored to transform the prediction data</span>
s3_upload(file_path=PROCESSOR_PATH)
</code></pre>
<h3 id="heading-step-3-create-a-flask-application-with-api-endpoints">Step 3: Create a Flask Application with API Endpoints</h3>
<p>Next, we’ll create a Flask application with API endpoints.</p>
<p>The Flask application is configured in the <code>app.py</code> file located at the root of the project repository.</p>
<p>As shown in the code snippet below, the <code>app.py</code> file contains the following components, in order:</p>
<ol>
<li><p>AWS Boto3 client setup,</p>
</li>
<li><p>Flask app configuration and API endpoint setup,</p>
</li>
<li><p>Loading the trained preprocessor, processed input data <code>X_test</code>, and trained models,</p>
</li>
<li><p>Invoking the Lambda function via API Gateway, and</p>
</li>
<li><p>The local test section.</p>
</li>
</ol>
<p>Note that <code>X_test</code> should never be used during model training to avoid data leakage.</p>
<p><code>app.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">from</span> flask <span class="hljs-keyword">import</span> Flask, request, jsonify
<span class="hljs-keyword">from</span> flask_cors <span class="hljs-keyword">import</span> cross_origin
<span class="hljs-keyword">from</span> waitress <span class="hljs-keyword">import</span> serve
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># global variables (will be loaded from the S3 buckets)</span>
_redis_client = <span class="hljs-literal">None</span>
X_test = <span class="hljs-literal">None</span>
preprocessor = <span class="hljs-literal">None</span>
model = <span class="hljs-literal">None</span>
backup_model = <span class="hljs-literal">None</span>

<span class="hljs-comment"># load env if local else skip (lambda refers to env in production)</span>
AWS_LAMBDA_RUNTIME_API = os.environ.get(<span class="hljs-string">'AWS_LAMBDA_RUNTIME_API'</span>, <span class="hljs-literal">None</span>)
<span class="hljs-keyword">if</span> AWS_LAMBDA_RUNTIME_API <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>: load_dotenv(override=<span class="hljs-literal">True</span>)


<span class="hljs-comment">#### &lt;---- 1. AWS BOTO3 CLIENT ----&gt;</span>
<span class="hljs-comment"># boto3 client </span>
S3_BUCKET_NAME = os.environ.get(<span class="hljs-string">'S3_BUCKET_NAME'</span>, <span class="hljs-string">'ml-sales-pred'</span>)
s3_client = boto3.client(<span class="hljs-string">'s3'</span>, region_name=os.environ.get(<span class="hljs-string">'AWS_REGION_NAME'</span>, <span class="hljs-string">'us-east-1'</span>))
<span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># test connection to boto3 client</span>
    sts_client = boto3.client(<span class="hljs-string">'sts'</span>)
    identity = sts_client.get_caller_identity()
    main_logger.info(<span class="hljs-string">f"Lambda is using role: <span class="hljs-subst">{identity[<span class="hljs-string">'Arn'</span>]}</span>"</span>)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    main_logger.error(<span class="hljs-string">f"Lambda credentials/permissions error: <span class="hljs-subst">{e}</span>"</span>)

<span class="hljs-comment">#### &lt;---- 2. FLASK CONFIGURATION &amp; API ENDPOINTS ----&gt;</span>
<span class="hljs-comment"># configure the flask app</span>
app = Flask(__name__)
app.config[<span class="hljs-string">'CORS_HEADERS'</span>] = <span class="hljs-string">'Content-Type'</span>

<span class="hljs-comment"># add a simple API endpoint to serve the prediction by price point to test</span>
<span class="hljs-meta">@app.route('/v1/predict-price/&lt;string:stockcode&gt;', methods=['GET', 'OPTIONS'])</span>
<span class="hljs-meta">@cross_origin(origins=origins, methods=['GET', 'OPTIONS'], supports_credentials=True)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_price</span>(<span class="hljs-params">stockcode</span>):</span>
    df_stockcode = <span class="hljs-literal">None</span>

    <span class="hljs-comment"># fetch request params</span>
    data = request.args.to_dict()

    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># fetch cache</span>
        <span class="hljs-keyword">if</span> _redis_client <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            <span class="hljs-comment"># returns cached prediction results if any without performing inference</span>
            cached_prediction_result = _redis_client.get(cache_key_prediction_result_by_stockcode)
            <span class="hljs-keyword">if</span> cached_prediction_result: 
                <span class="hljs-keyword">return</span> jsonify(json.loads(cached_prediction_result))

            <span class="hljs-comment"># historical data of the selected product</span>
            cached_df_stockcode = _redis_client.get(cache_key_df_stockcode)
            <span class="hljs-keyword">if</span> cached_df_stockcode: df_stockcode = json.loads(cached_df_stockcode)


        <span class="hljs-comment"># define the price range to make predictions. can be a request param, or historical min/max prices</span>
        min_price = float(data.get(<span class="hljs-string">'unitprice_min'</span>, df_stockcode[<span class="hljs-string">'unitprice_min'</span>][<span class="hljs-number">0</span>]))
        max_price = float(data.get(<span class="hljs-string">'unitprice_max'</span>, df_stockcode[<span class="hljs-string">'unitprice_max'</span>][<span class="hljs-number">0</span>]))

        <span class="hljs-comment"># create bins in the price range. more bins give a smoother prediction curve but cost more compute</span>
        NUM_PRICE_BINS = int(data.get(<span class="hljs-string">'num_price_bins'</span>, <span class="hljs-number">100</span>))
        price_range = np.linspace(min_price, max_price, NUM_PRICE_BINS)

        <span class="hljs-comment"># create a prediction dataset by merging X_test (dataset never used in model training) and df_stockcode</span>
        price_range_df = pd.DataFrame({ <span class="hljs-string">'unitprice'</span>: price_range })
        test_sample = X_test.sample(n=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">42</span>) <span class="hljs-keyword">if</span> X_test <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>
        test_sample_merged = test_sample.merge(price_range_df, how=<span class="hljs-string">'cross'</span>) <span class="hljs-keyword">if</span> test_sample <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">else</span> price_range_df
        <span class="hljs-comment"># the cross merge leaves two unitprice columns: drop the sampled one, keep the price grid</span>
        <span class="hljs-keyword">if</span> <span class="hljs-string">'unitprice_x'</span> <span class="hljs-keyword">in</span> test_sample_merged.columns:
            test_sample_merged.drop(<span class="hljs-string">'unitprice_x'</span>, axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-literal">True</span>)
            test_sample_merged.rename(columns={<span class="hljs-string">'unitprice_y'</span>: <span class="hljs-string">'unitprice'</span>}, inplace=<span class="hljs-literal">True</span>)

        <span class="hljs-comment"># preprocess the dataset</span>
        X = preprocessor.transform(test_sample_merged) <span class="hljs-keyword">if</span> preprocessor <span class="hljs-keyword">else</span> test_sample_merged

        <span class="hljs-comment"># perform inference</span>
        y_pred_actual = <span class="hljs-literal">None</span>
        epsilon = <span class="hljs-number">0</span>
        <span class="hljs-comment"># try using the primary model</span>
        <span class="hljs-keyword">if</span> model:
            input_tensor = torch.tensor(X, dtype=torch.float32)
            model.eval()
            <span class="hljs-keyword">with</span> torch.inference_mode():
                y_pred = model(input_tensor)
                y_pred = y_pred.cpu().numpy().flatten()
                y_pred_actual = np.exp(y_pred + epsilon)

        <span class="hljs-comment"># if not, use backups</span>
        <span class="hljs-keyword">elif</span> backup_model:
            y_pred = backup_model.predict(X)
            y_pred_actual = np.exp(y_pred + epsilon)


        <span class="hljs-comment"># finalize the outcome for client app</span>
        df_ = test_sample_merged.copy()
        df_[<span class="hljs-string">'quantity'</span>] = np.floor(y_pred_actual) <span class="hljs-comment"># quantity must be an integer</span>
        df_[<span class="hljs-string">'sales'</span>] = df_[<span class="hljs-string">'quantity'</span>] * df_[<span class="hljs-string">'unitprice'</span>] <span class="hljs-comment"># compute sales</span>
        df_ = df_.sort_values(by=<span class="hljs-string">'unitprice'</span>)

        <span class="hljs-comment"># aggregate the results by the unitprice in the price range</span>
        df_results = df_.groupby(<span class="hljs-string">'unitprice'</span>).agg(
            quantity=(<span class="hljs-string">'quantity'</span>, <span class="hljs-string">'median'</span>),
            quantity_min=(<span class="hljs-string">'quantity'</span>, <span class="hljs-string">'min'</span>),
            quantity_max=(<span class="hljs-string">'quantity'</span>, <span class="hljs-string">'max'</span>),
            sales=(<span class="hljs-string">'sales'</span>, <span class="hljs-string">'median'</span>),
        ).reset_index()

        <span class="hljs-comment"># find the optimal price point</span>
        optimal_row = df_results.loc[df_results[<span class="hljs-string">'sales'</span>].idxmax()]
        optimal_price = optimal_row[<span class="hljs-string">'unitprice'</span>]
        optimal_quantity = optimal_row[<span class="hljs-string">'quantity'</span>]
        best_sales = optimal_row[<span class="hljs-string">'sales'</span>]

        all_outputs = []
        <span class="hljs-keyword">for</span> _, row <span class="hljs-keyword">in</span> df_results.iterrows():
            current_output = {
                <span class="hljs-string">"stockcode"</span>: stockcode,
                <span class="hljs-string">"unit_price"</span>: float(row[<span class="hljs-string">'unitprice'</span>]),
                <span class="hljs-string">'quantity'</span>: int(row[<span class="hljs-string">'quantity'</span>]),
                <span class="hljs-string">'quantity_min'</span>: int(row[<span class="hljs-string">'quantity_min'</span>]),
                <span class="hljs-string">'quantity_max'</span>: int(row[<span class="hljs-string">'quantity_max'</span>]),
                <span class="hljs-string">"predicted_sales"</span>: float(row[<span class="hljs-string">'sales'</span>]),
            }
            all_outputs.append(current_output)

        <span class="hljs-comment"># store the prediction results in cache</span>
        <span class="hljs-keyword">if</span> all_outputs <span class="hljs-keyword">and</span> _redis_client <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            serialized_data = json.dumps(all_outputs)
            _redis_client.set(
                cache_key_prediction_result_by_stockcode, 
                serialized_data,
                ex=<span class="hljs-number">3600</span>     <span class="hljs-comment"># expire in an hour</span>
            )

        <span class="hljs-comment"># return a list of all outputs</span>
        <span class="hljs-keyword">return</span> jsonify(all_outputs)

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        main_logger.error(<span class="hljs-string">f"prediction failed for stockcode <span class="hljs-subst">{stockcode}</span>: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> jsonify([])


<span class="hljs-comment"># request header management (for the process from API gateway to the Lambda)</span>
<span class="hljs-meta">@app.after_request</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add_header</span>(<span class="hljs-params">response</span>):</span>
    response.headers[<span class="hljs-string">'Cache-Control'</span>] = <span class="hljs-string">'public, max-age=0'</span>
    response.headers[<span class="hljs-string">'Access-Control-Allow-Origin'</span>] = CLIENT_A
    response.headers[<span class="hljs-string">'Access-Control-Allow-Headers'</span>] = <span class="hljs-string">'Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token,Origin'</span>
    response.headers[<span class="hljs-string">'Access-Control-Allow-Methods'</span>] = <span class="hljs-string">'GET, POST, OPTIONS'</span>
    response.headers[<span class="hljs-string">'Access-Control-Allow-Credentials'</span>] = <span class="hljs-string">'true'</span>
    <span class="hljs-keyword">return</span> response

<span class="hljs-comment">#### &lt;---- 3. LOADING PROCESSOR, DATASET, AND MODELS ----&gt;</span>
load_preprocessor()
load_x_test()
load_model()

<span class="hljs-comment">#### &lt;---- 4. INVOKE LAMBDA ----&gt;</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handler</span>(<span class="hljs-params">event, context</span>):</span>
    main_logger.info(<span class="hljs-string">"lambda handler invoked."</span>)
    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># connecting the redis client after the lambda is invoked</span>
        get_redis_client()
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        main_logger.critical(<span class="hljs-string">f"failed to establish initial Redis connection in handler: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">'statusCode'</span>: <span class="hljs-number">500</span>,
            <span class="hljs-string">'body'</span>: json.dumps({<span class="hljs-string">'error'</span>: <span class="hljs-string">'Failed to initialize Redis client. Check environment variables and network config.'</span>})
        }

    <span class="hljs-comment"># use the awsgi package to convert JSON to WSGI</span>
    <span class="hljs-keyword">return</span> awsgi.response(app, event, context)


<span class="hljs-comment">#### &lt;---- 5. FOR LOCAL TEST ----&gt;</span>
<span class="hljs-comment"># serve the application locally on WSGI server, waitress</span>
<span class="hljs-comment"># lambda will ignore this section.</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:   
    <span class="hljs-keyword">if</span> os.getenv(<span class="hljs-string">'ENV'</span>) == <span class="hljs-string">'local'</span>:
        main_logger.info(<span class="hljs-string">"...start the operation (local)..."</span>)
        serve(app, host=<span class="hljs-string">'0.0.0.0'</span>, port=<span class="hljs-number">5002</span>)
    <span class="hljs-keyword">else</span>:
        app.run(host=<span class="hljs-string">'0.0.0.0'</span>, port=<span class="hljs-number">8080</span>)
</code></pre>
<p>I’ll test the endpoint locally using the <code>uv</code> package manager:</p>
<pre><code class="lang-bash">$uv run app.py --cache-clear

$curl http://localhost:<span class="hljs-number">5002</span>/v1/predict-price/{STOCKCODE}
</code></pre>
<p>The system provided a list of sales predictions for each price point:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755607075000/e0e8cbcb-8817-4aa5-b3d1-37b76cc684fb.png" alt="Fig. Screenshot of the Flask app local response" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the Flask app local response</p>
<h4 id="heading-key-points-on-flask-app-configuration">Key Points on Flask App Configuration</h4>
<p>There are various points you should take into consideration when configuring a Flask application with Lambda. Let’s go over them now:</p>
<h5 id="heading-1-a-few-api-endpoints-per-container"><strong>1. A Few API Endpoints Per Container</strong></h5>
<p>Adding many API endpoints to a single serverless instance can lead to a <strong>monolithic function</strong>, where issues in one endpoint impact the others.</p>
<p>In this project, we’ll focus on a single endpoint per container – and if needed, we can add separate Lambda functions to the system.</p>
<h5 id="heading-2-understanding-the-handler-function-and-the-role-of-awsgi"><strong>2. Understanding the</strong> <code>handler</code> <strong>Function and the role of AWSGI</strong></h5>
<p>The <code>handler</code> function is invoked every time the Lambda function receives a client request from the API Gateway.</p>
<p>The function takes an <code>event</code> argument containing the request details as a <strong>JSON dictionary</strong> and passes it to the Flask application.</p>
<p><strong>AWSGI</strong> acts as an adapter, translating a Lambda event in JSON format into a WSGI request that a Flask application can understand, and converts the application’s response back into a JSON format that Lambda and API Gateway can process.</p>
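<p>To make the translation concrete, here is a minimal, illustrative sketch of what an adapter like AWSGI does internally. This is not the <code>aws-wsgi</code> source, just a simplified stand-in: it builds a WSGI <code>environ</code> from a Lambda-style event dictionary, calls the WSGI app, and packages the response back into the JSON shape API Gateway expects (real adapters also handle request headers, base64-encoded bodies, and multi-value parameters):</p>

```python
import io
import sys
from urllib.parse import urlencode

def wsgi_response(app, event):
    """Simplified Lambda-event-to-WSGI translation (illustration only)."""
    body = (event.get("body") or "").encode()
    # build the minimal WSGI environ from the API Gateway event
    environ = {
        "REQUEST_METHOD": event.get("httpMethod", "GET"),
        "PATH_INFO": event.get("path", "/"),
        "QUERY_STRING": urlencode(event.get("queryStringParameters") or {}),
        "CONTENT_LENGTH": str(len(body)),
        "SERVER_NAME": "lambda", "SERVER_PORT": "80",
        "SERVER_PROTOCOL": "HTTP/1.1",
        "wsgi.version": (1, 0), "wsgi.url_scheme": "https",
        "wsgi.input": io.BytesIO(body), "wsgi.errors": sys.stderr,
        "wsgi.multithread": False, "wsgi.multiprocess": False,
        "wsgi.run_once": False,
    }
    captured = {}
    def start_response(status, headers, exc_info=None):
        captured["status"], captured["headers"] = status, headers
    chunks = app(environ, start_response)   # call the WSGI app (e.g. Flask)
    # package the WSGI response back into the Lambda/API Gateway JSON shape
    return {
        "statusCode": int(captured["status"].split()[0]),
        "headers": dict(captured["headers"]),
        "body": b"".join(chunks).decode(),
    }

# a tiny WSGI app standing in for the Flask application
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "application/json")])
    return [b'{"ok": true}']

result = wsgi_response(demo_app, {"httpMethod": "GET", "path": "/v1/predict-price/85123A"})
```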
<h5 id="heading-3-using-cache-storage"><strong>3. Using Cache Storage</strong></h5>
<p>The <code>get_redis_client</code> function is invoked when the <code>handler</code> function is triggered by the API Gateway. This allows the Flask application to store and fetch cached data through the Redis client:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> redis
<span class="hljs-keyword">import</span> redis.cluster
<span class="hljs-keyword">from</span> redis.cluster <span class="hljs-keyword">import</span> ClusterNode

_redis_client = <span class="hljs-literal">None</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_redis_client</span>():</span>
    <span class="hljs-keyword">global</span> _redis_client
    <span class="hljs-keyword">if</span> _redis_client <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        REDIS_HOST = os.environ.get(<span class="hljs-string">"REDIS_HOST"</span>)
        REDIS_PORT = int(os.environ.get(<span class="hljs-string">"REDIS_PORT"</span>, <span class="hljs-number">6379</span>))
        REDIS_TLS = os.environ.get(<span class="hljs-string">"REDIS_TLS"</span>, <span class="hljs-string">"true"</span>).lower() == <span class="hljs-string">"true"</span>
        <span class="hljs-keyword">try</span>:
            startup_nodes = [ClusterNode(host=REDIS_HOST, port=REDIS_PORT)]
            _redis_client = redis.cluster.RedisCluster(
                startup_nodes=startup_nodes,
                decode_responses=<span class="hljs-literal">True</span>,
                skip_full_coverage_check=<span class="hljs-literal">True</span>,
                ssl=REDIS_TLS,                  <span class="hljs-comment"># elasticache has encryption in transit: enabled -&gt; must be true</span>
                ssl_cert_reqs=<span class="hljs-literal">None</span>,
                socket_connect_timeout=<span class="hljs-number">5</span>,
                socket_timeout=<span class="hljs-number">5</span>,
                health_check_interval=<span class="hljs-number">30</span>,
                retry_on_timeout=<span class="hljs-literal">True</span>,
                retry_on_error=[
                    redis.exceptions.ConnectionError,
                    redis.exceptions.TimeoutError
                ],
                max_connections=<span class="hljs-number">10</span>,            <span class="hljs-comment"># limit connections for Lambda</span>
                max_connections_per_node=<span class="hljs-number">2</span>     <span class="hljs-comment"># limit per node</span>
            )
            _redis_client.ping()
            main_logger.info(<span class="hljs-string">"successfully connected to ElastiCache Redis Cluster (Configuration Endpoint)"</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f"an unexpected error occurred during Redis Cluster connection: <span class="hljs-subst">{e}</span>"</span>, exc_info=<span class="hljs-literal">True</span>)
            _redis_client = <span class="hljs-literal">None</span>
    <span class="hljs-keyword">return</span> _redis_client
</code></pre>
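<p>Stripped of the Redis specifics, the cache-aside flow the prediction route uses (check the cache first, fall back to inference on a miss, then store the result with a TTL) can be sketched with a stand-in client. <code>FakeRedis</code> and the stockcode key below are hypothetical stand-ins for illustration:</p>

```python
import json

class FakeRedis:
    """Dict-backed stand-in for the Redis client (TTL ignored)."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value, ex=None):
        self._store[key] = value

def get_prediction(client, cache_key, compute):
    cached = client.get(cache_key)          # 1. try the cache first
    if cached is not None:
        return json.loads(cached)           # cache hit: skip inference
    result = compute()                      # 2. cache miss: run inference
    client.set(cache_key, json.dumps(result), ex=3600)  # 3. store for an hour
    return result

client = FakeRedis()
calls = []
def compute():
    calls.append(1)  # count how many times inference actually ran
    return [{"unit_price": 2.5, "predicted_sales": 120.0}]

first = get_prediction(client, "pred:85123A", compute)
second = get_prediction(client, "pred:85123A", compute)  # served from cache
```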
<h5 id="heading-4-handling-heavy-tasks-outside-of-the-handler-function"><strong>4. Handling Heavy Tasks Outside of the</strong> <code>handler</code> <strong>Function</strong></h5>
<p>Serverless functions can suffer from a <strong>cold start</strong>: extra latency on the first invocation while the container and application are initialized.</p>
<p>While a Lambda function can run for up to 15 minutes, its associated API Gateway has a timeout of 29 seconds (29,000 ms) for a RESTful API.</p>
<p>So, any heavy tasks like loading preprocessors, input data, or models should be performed once outside of the <code>handler</code> function, ensuring they are ready <em>before</em> the API endpoint is called.</p>
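<p>The pattern itself is just module-level initialization: anything assigned at import time runs once per container (during the cold start), and every warm invocation of the handler reuses it. A minimal sketch, with a fake loader standing in for the S3 downloads:</p>

```python
import time

LOAD_CALLS = []

def load_artifacts():
    """Stand-in for downloading the preprocessor, X_test, and model from S3."""
    LOAD_CALLS.append(time.time())  # record how many times loading actually ran
    return {"model": "ready"}

# module level: executed once per container, during the cold start
ARTIFACTS = load_artifacts()

def handler(event, context):
    # warm invocations reuse the already-loaded artifacts
    return {"statusCode": 200, "model": ARTIFACTS["model"]}

responses = [handler({}, None) for _ in range(3)]
```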
<p>Here are the loading functions called in <code>app.py</code>.</p>
<p><code>app.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> joblib

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> s3_load, s3_load_to_temp_file

preprocessor = <span class="hljs-literal">None</span>
X_test = <span class="hljs-literal">None</span>
model = <span class="hljs-literal">None</span>
backup_model = <span class="hljs-literal">None</span>


<span class="hljs-comment"># load processor</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_preprocessor</span>():</span>
    <span class="hljs-keyword">global</span> preprocessor
    preprocessor_tempfile_path = s3_load_to_temp_file(PREPROCESSOR_PATH)
    preprocessor = joblib.load(preprocessor_tempfile_path)
    os.remove(preprocessor_tempfile_path)


<span class="hljs-comment"># load input data</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_x_test</span>():</span>
    <span class="hljs-keyword">global</span> X_test
    x_test_io = s3_load(file_path=X_TEST_PATH)
    X_test = pd.read_parquet(x_test_io)


<span class="hljs-comment"># load model</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_model</span>():</span>
    <span class="hljs-keyword">global</span> model, backup_model
    <span class="hljs-comment"># try loading &amp; reconstructing the primary model</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># first load io file from the s3 bucket</span>
        model_data_bytes_io_ = s3_load(file_path=DFN_FILE_PATH)
        <span class="hljs-comment"># convert to checkpoint dictionary (containing hyperparameter set)</span>
        checkpoint_ = torch.load(
            model_data_bytes_io_, 
            weights_only=<span class="hljs-literal">False</span>, 
            map_location=device
        )
        <span class="hljs-comment"># reconstruct the model</span>
        model = t.scripts.load_model(checkpoint=checkpoint_, file_path=DFN_FILE_PATH)
        <span class="hljs-comment"># set the model evaluation mode</span>
        model.eval()

    <span class="hljs-comment"># else, backup model</span>
    <span class="hljs-keyword">except</span> Exception:
        load_artifacts_backup_model()
</code></pre>
<h3 id="heading-step-4-publish-a-docker-image-to-ecr">Step 4: Publish a Docker Image to ECR</h3>
<p>After configuring the Flask application, we’ll containerize the entire application on <strong>Docker</strong>.</p>
<p>Containerization packages the application, including the models, their dependencies, and configuration, into a single container.</p>
<p>Docker creates a container image based on the instructions defined in a Dockerfile, and the Docker engine uses the image to run the isolated container.</p>
<p>In this project, we’ll upload the Docker container image to ECR, so the Lambda function can access it in production.</p>
<p>After this, we’ll define the <code>.dockerignore</code> file to optimize the container image:</p>
<p><code>.dockerignore</code></p>
<pre><code class="lang-plaintext"># any irrelevant data
__pycache__/
.ruff_cache/
.DS_Store/
.venv/
dist/
.vscode
*.psd
*.pdf
[a-f]*.log
tmp/
awscli-bundle/

# add any experimental models, unnecessary data
dfn_bayesian/
dfn_grid/
data/
notebooks/
</code></pre>
<p><code>Dockerfile</code></p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># serve from aws ecr </span>
<span class="hljs-keyword">FROM</span> public.ecr.aws/lambda/python:<span class="hljs-number">3.12</span>

<span class="hljs-comment"># define a working directory in the container</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>

<span class="hljs-comment"># copy the entire repository (except .dockerignore) into the container at /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> . /app/</span>

<span class="hljs-comment"># install dependencies defined in the requirements.txt</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install --no-cache-dir -r requirements.txt</span>

<span class="hljs-comment"># define commands</span>
<span class="hljs-keyword">ENTRYPOINT</span><span class="bash"> [ <span class="hljs-string">"python"</span> ]</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [ <span class="hljs-string">"-m"</span>, <span class="hljs-string">"awslambdaric"</span>, <span class="hljs-string">"app.handler"</span> ]</span>
</code></pre>
<h4 id="heading-test-in-local">Test Locally</h4>
<p>Next, we’ll test the Docker image by building the container named <code>my-app</code> locally:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> build -t my-app -f Dockerfile .
</code></pre>
<p>Then, we’ll run the container locally with the <code>waitress</code> server:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> run -p 5002:5002 -e ENV=<span class="hljs-built_in">local</span> my-app app.py
</code></pre>
<p>The <code>-e ENV=local</code> flag sets the environment variable inside the container, which will trigger the <code>waitress.serve()</code> call in the <code>app.py</code>.</p>
<p>In the terminal, you’ll find a message saying the following:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*zu8mamgKMKOUxwCA.png" alt="Flask app response" width="600" height="400" loading="lazy"></p>
<p>You can also call the endpoint created to see the results returned:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$curl</span> http://localhost:5002/v1/predict-price/{STOCKCODE}
</code></pre>
<h4 id="heading-publish-the-docker-image-to-ecr">Publish the Docker Image to ECR</h4>
<p>To publish the Docker image, we first need to configure the default AWS credentials and region:</p>
<ul>
<li><p>From the AWS account console, issue an access token and check the default region.</p>
</li>
<li><p>Store them in the <code>~/.aws/credentials</code> and <code>~/.aws/config</code> files:</p>
</li>
</ul>
<p><code>~/.aws/credentials</code></p>
<pre><code class="lang-plaintext">[default] 
aws_secret_access_key=
aws_access_key_id=
</code></pre>
<p><code>~/.aws/config</code></p>
<pre><code class="lang-plaintext">[default]
region=
</code></pre>
<p>After the configuration, we’ll publish the Docker image to ECR.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># authenticate the docker client to ECR</span>
<span class="hljs-variable">$aws</span> ecr get-login-password --region &lt;your-aws-region&gt; | docker login --username AWS --password-stdin &lt;your-aws-account-id&gt;.dkr.ecr.&lt;your-aws-region&gt;.amazonaws.com

<span class="hljs-comment"># create repository</span>
<span class="hljs-variable">$aws</span> ecr create-repository --repository-name &lt;your-repo-name&gt; --region &lt;your-aws-region&gt;

<span class="hljs-comment"># tag the docker image</span>
<span class="hljs-variable">$docker</span> tag &lt;your-repo-name&gt;:&lt;your-app-version&gt; &lt;your-aws-account-id&gt;.dkr.ecr.&lt;your-aws-region&gt;.amazonaws.com/&lt;your-repo-name&gt;:&lt;your-app-version&gt;

<span class="hljs-comment"># push</span>
<span class="hljs-variable">$docker</span> push &lt;your-aws-account-id&gt;.dkr.ecr.&lt;your-aws-region&gt;.amazonaws.com/&lt;your-repo-name&gt;:&lt;your-app-version&gt;
</code></pre>
<p>Here’s what’s going on:</p>
<ul>
<li><p><code>&lt;your-aws-region&gt;</code>: Your default AWS region (for example, <code>us-east-1</code>).</p>
</li>
<li><p><code>&lt;your-aws-account-id&gt;</code>: 12-digit AWS account ID.</p>
</li>
<li><p><code>&lt;your-repo-name&gt;</code>: Your desired repository name.</p>
</li>
<li><p><code>&lt;your-app-version&gt;</code>: Your desired tag name (for example, <code>v1.0</code>).</p>
</li>
</ul>
<p>Now, the Docker image is stored in ECR with the tag:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*tUQkbDW-uAmrjBfx.png" alt="Fig. Screenshot of the AWS ECR console" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the AWS ECR console</p>
<h3 id="heading-step-5-create-a-lambda-function">Step 5: Create a Lambda Function</h3>
<p>Next, we’ll create a Lambda function.</p>
<p>From the Lambda console, choose:</p>
<ul>
<li><p>The <code>Container Image</code> option,</p>
</li>
<li><p>The container image URI from the drop-down list,</p>
</li>
<li><p>A function name of our choice, and</p>
</li>
<li><p>An architecture type (arm64 is recommended for better price-performance).</p>
</li>
</ul>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*3b-wIEUzRooQcvN_.png" alt="Fig. Screenshot of AWS Lambda function configuration" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of AWS Lambda function configuration</p>
<p>The Lambda function <code>my-app</code> was successfully launched.</p>
<h4 id="heading-connect-the-lambda-function-to-api-gateway">Connect the Lambda function to API Gateway</h4>
<p>Next, we’ll add API Gateway as an event trigger to the Lambda function.</p>
<p>First, visit the API Gateway console and create <strong>REST API methods</strong> using the ARN of the Lambda function:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*60TP64gdSjhKfiO8.png" alt="Fig. Screenshot of the AWS API Gateway configuration" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the AWS API Gateway configuration</p>
<p>Then, add resources to the created API gateway to create an endpoint:<br><code>API Gateway &gt; APIs &gt; Resources &gt; Create Resource</code></p>
<ul>
<li><p>Align the resource endpoint with the API endpoint defined in the <a target="_blank" href="http://app.py"><code>app.py</code></a>.</p>
</li>
<li><p>Configure CORS (for example, accept specific origins).</p>
</li>
<li><p>Deploy the resource to the stage.</p>
</li>
</ul>
<p>Going back to the Lambda console, you’ll find the API Gateway is connected as an event trigger:<br><code>Lambda &gt; Function &gt; my-app (your function name)</code></p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*DlfiEieZArmYlOuT.png" alt="Fig. Screenshot of the AWS Lambda dashboard" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the AWS Lambda dashboard</p>
<h3 id="heading-step-6-configure-aws-resources">Step 6: Configure AWS Resources</h3>
<p>Lastly, we’ll configure the related AWS resources to make the system work in production.</p>
<p>This process involves the following steps:</p>
<h4 id="heading-1-the-iam-role-controls-who-to-access-resources">1. The IAM Role: Controls Who Can Access Resources</h4>
<p>AWS requires <strong>IAM roles</strong> to grant temporary, secure permissions to users, mitigating security risks related to long-term credentials like passwords.</p>
<p>An IAM role uses policies to grant access to selected services. Policies can be AWS-managed or customized by the user as inline policies.</p>
<p>It is important to avoid overly permissive access rights for the IAM role.</p>
<ol>
<li><p>In the Lambda function console, check the execution role:<br> <code>Lambda &gt; Function &gt; &lt;FUNCTION&gt; &gt; Permission &gt; The execution role</code>.</p>
</li>
<li><p>Set up the following policies to allow the Lambda’s IAM role to handle necessary operations:</p>
<ul>
<li><p><strong>Lambda</strong> <code>AWSLambdaExecute</code>: Allows executing the function.</p>
</li>
<li><p><strong>EC2</strong> <code>Inline policy</code>: Allows controlling the security group and the VPC of the Lambda function.</p>
</li>
<li><p><strong>ECR</strong> <code>AmazonElasticContainerRegistryPublicFullAccess</code> + <code>Inline policy</code>: Allows storing and pulling the Docker image.</p>
</li>
<li><p><strong>ElastiCache</strong> <code>AmazonElastiCacheFullAccess</code> + <code>Inline policy</code>: Allows storing and pulling caches.</p>
</li>
<li><p><strong>S3</strong>: <code>AmazonS3ReadOnlyAccess</code> + <code>Inline policy</code>: Allows reading and storing contents.</p>
</li>
</ul>
</li>
</ol>
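<p>As an illustration, an inline policy scoped to the project’s S3 bucket might look like the following (the bucket name <code>ml-sales-pred</code> is the default used in the code; in general, prefer a least-privilege statement like this over full-access managed policies):</p>

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadModelArtifacts",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::ml-sales-pred",
                "arn:aws:s3:::ml-sales-pred/*"
            ]
        }
    ]
}
```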
<p>Now, the IAM role can access these resources and perform the allowed actions.</p>
<h4 id="heading-2-the-security-group-controls-network-traffic">2. The Security Group: Controls Network Traffic</h4>
<p>A <strong>security group</strong> is a virtual firewall that controls inbound and outbound network traffic for AWS resources.</p>
<p>It uses stateful (allowing return traffic automatically) “allow-only” rules based on protocol, port, and IP address, where it denies all traffic by default.</p>
<p>Create a new security group for the Lambda function:<br><code>EC2 &gt; Security Groups &gt; &lt;YOUR SECURITY GROUP&gt;</code></p>
<p>Now, we’ll want to set up the inbound and outbound traffic rules.</p>
<p>The inbound rules:</p>
<ul>
<li><p><strong>S3 → Lambda</strong>: Type: HTTPS / Protocol: TCP / Port range: 443 / Source: Custom*</p>
</li>
<li><p><strong>ElastiCache → Lambda</strong>: Type: Custom TCP / Port range: 6379 / Source: Custom*</p>
</li>
</ul>
<p>*Choose the security group created for the Lambda function as the custom source.</p>
<p>The outbound rules:</p>
<ul>
<li><p><strong>Lambda → Internet</strong>: Type: HTTPS / Protocol: TCP / Port range: 443 / Destination: 0.0.0.0/0</p>
</li>
<li><p><strong>ElastiCache → Internet</strong>: Type: All traffic / Destination: 0.0.0.0/0</p>
</li>
</ul>
<h4 id="heading-3-the-virtual-private-cloud-vpc">3. The Virtual Private Cloud (VPC)</h4>
<p>A <strong>Virtual Private Cloud (VPC)</strong> provides a logically isolated private network for the AWS resources, acting as our own private data center within AWS.</p>
<p>AWS can create a <strong>Hyperplane ENI</strong> (Elastic Network Interface) for the Lambda function and its connected resources in the subnets of the VPC.</p>
<p>Though it’s optional, we’ll use the VPC to connect the Lambda function to the S3 storage and ElastiCache.</p>
<p>This process involves:</p>
<ol>
<li><p>Creating a VPC from the VPC console: <code>VPC &gt; Create VPC</code>.</p>
</li>
<li><p>Creating an STS (Security Token Service) endpoint:<br> <code>VPC &gt; PrivateLink and Lattice &gt; Endpoints &gt; Create Endpoint &gt;</code></p>
<ul>
<li><p><strong>Type</strong>: AWS Service</p>
</li>
<li><p><strong>Service name</strong>: com.amazonaws.&lt;YOUR REGION&gt;.sts</p>
</li>
<li><p><strong>Type</strong>: Interface</p>
</li>
<li><p><strong>VPC:</strong> Select the VPC created earlier.</p>
</li>
<li><p><strong>Subnets</strong>: Select all subnets.</p>
</li>
<li><p><strong>Security groups</strong>: Select the security group of the Lambda function.</p>
</li>
<li><p><strong>Policy</strong>: Full access</p>
</li>
<li><p><strong>Enable DNS names</strong></p>
</li>
</ul>
</li>
</ol>
<p>The VPC must have a dedicated STS endpoint so that resources inside it can receive temporary credentials from STS.</p>
<ol start="3">
<li><p>Create an S3 endpoint in the VPC:<br> <code>VPC &gt; PrivateLink and Lattice &gt; Endpoints &gt; Create Endpoint &gt;</code></p>
<ul>
<li><p><strong>Type</strong>: AWS Service</p>
</li>
<li><p><strong>Service name</strong>: com.amazonaws.&lt;YOUR REGION&gt;.s3</p>
</li>
<li><p><strong>Type</strong>: Gateway</p>
</li>
<li><p><strong>VPC</strong>: Select the VPC created earlier.</p>
</li>
<li><p><strong>Route tables</strong>: Select the route tables associated with the subnets (a Gateway endpoint attaches to route tables rather than to subnets or security groups).</p>
</li>
<li><p><strong>Policy</strong>: Full access</p>
</li>
</ul>
</li>
</ol>
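<p>As a hedged sketch, the two endpoints can be described as parameter dictionaries for boto3’s <code>create_vpc_endpoint</code> call. All IDs and the region below are hypothetical placeholders, and the API calls themselves are left commented out:</p>

```python
# Hedged sketch: parameters for the STS (Interface) and S3 (Gateway)
# VPC endpoints, shaped for boto3's ec2.create_vpc_endpoint. All IDs
# and the region are placeholders.

REGION = "us-east-1"                              # placeholder
VPC_ID = "vpc-0123456789abcdef0"                  # placeholder
SUBNET_IDS = ["subnet-aaa", "subnet-bbb"]         # placeholders
LAMBDA_SG_ID = "sg-0123456789abcdef0"             # placeholder
ROUTE_TABLE_IDS = ["rtb-0123456789abcdef0"]       # placeholder

# Interface endpoint for STS (delivers temporary credentials privately)
sts_endpoint = {
    "VpcEndpointType": "Interface",
    "VpcId": VPC_ID,
    "ServiceName": f"com.amazonaws.{REGION}.sts",
    "SubnetIds": SUBNET_IDS,
    "SecurityGroupIds": [LAMBDA_SG_ID],
    "PrivateDnsEnabled": True,  # the "Enable DNS names" option
}

# Gateway endpoint for S3 (attaches to route tables, not subnets)
s3_endpoint = {
    "VpcEndpointType": "Gateway",
    "VpcId": VPC_ID,
    "ServiceName": f"com.amazonaws.{REGION}.s3",
    "RouteTableIds": ROUTE_TABLE_IDS,
}

if __name__ == "__main__":
    # import boto3
    # ec2 = boto3.client("ec2", region_name=REGION)
    # ec2.create_vpc_endpoint(**sts_endpoint)
    # ec2.create_vpc_endpoint(**s3_endpoint)
    pass
```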
<p>Lastly, check the security group of the Lambda function and ensure that its VPC ID points to the VPC created: <code>EC2 &gt; Security Group &gt; &lt;YOUR SECURITY GROUP FOR THE LAMBDA FUNCTION&gt; &gt; VPC ID</code>.</p>
<p>That’s all for the deployment flow.</p>
<p>We can now test the API endpoint in production. Copy the <strong>Invoke URL</strong> of the deployed API endpoint: <code>API Gateway &gt; APIs &gt; Stages &gt; Invoke URL</code>. Then call the API endpoint and check that it responds with predictions:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$curl</span> -H <span class="hljs-string">'Authorization: Bearer YOUR_API_TOKEN'</span> -H <span class="hljs-string">'Accept: application/json'</span> \
     <span class="hljs-string">'&lt;INVOKE URL&gt;/&lt;ENDPOINT&gt;'</span>
</code></pre>
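<p>For reference, the same smoke test can be written with Python’s standard library. The Invoke URL, token, and the <code>predict</code> path used in the example call are all hypothetical placeholders for your own values:</p>

```python
# Hedged sketch: calling the deployed endpoint with urllib. URL, token,
# and the example path are placeholders, not values from the article.
import json
import urllib.request

INVOKE_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod"  # placeholder
API_TOKEN = "YOUR_API_TOKEN"  # placeholder

def call_endpoint(path: str) -> dict:
    """GET <INVOKE_URL>/<path> with the bearer token and parse the JSON body."""
    req = urllib.request.Request(
        f"{INVOKE_URL}/{path}",
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(call_endpoint("predict"))  # hypothetical endpoint name
```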
<p>For logging and debugging, we’ll use CloudWatch Live Tail: <code>CloudWatch &gt; Live Tail</code>.</p>
<h2 id="heading-building-a-client-application-optional">Building a Client Application (Optional)</h2>
<p>For full-stack deployment, we’ll build a simple React application that displays the predictions, using the <a target="_blank" href="https://recharts.org/en-US">recharts</a> library for visualization.</p>
<p>Other options for quick frontend deployment include <a target="_blank" href="https://streamlit.io/">Streamlit</a> or <a target="_blank" href="https://www.gradio.app/">Gradio</a>.</p>
<h3 id="heading-the-react-application">The React Application</h3>
<p>The React application creates a web page that fetches and visualizes sales predictions from an external API, recommending an optimal price point.</p>
<p>The app uses <code>useState</code> to manage its data and state, including the selected product, the list of sales predictions, and the loading/error status.</p>
<p>When the user initiates a request, a <code>useEffect</code> hook triggers a <code>fetch</code> request to a Flask backend. It handles the API response as a <strong>data stream</strong>, processing it line by line to progressively update the predictions.</p>
<p>The <code>AreaChart</code> from the <code>recharts</code> library then visualizes this data. The X-axis represents the <code>price</code> and the Y-axis represents the <code>sales</code>. The chart updates in real-time as the data streams in. Finally, the app displays the optimal price once all the predictions are received.</p>
<p><code>App.js</code>: (in a separate React app)</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> { useState, useEffect } <span class="hljs-keyword">from</span> <span class="hljs-string">"react"</span>
<span class="hljs-keyword">import</span> { AreaChart, Area, XAxis, YAxis, CartesianGrid, Tooltip, ResponsiveContainer, ReferenceLine } <span class="hljs-keyword">from</span> <span class="hljs-string">'recharts'</span>


<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">App</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-comment">// state</span>
  <span class="hljs-keyword">const</span> [predictions, setPredictions] = useState([])
  <span class="hljs-keyword">const</span> [start, setStart] = useState(<span class="hljs-literal">false</span>)
  <span class="hljs-keyword">const</span> [isLoading, setIsLoading] = useState(<span class="hljs-literal">false</span>)

  <span class="hljs-comment">// product data</span>
  <span class="hljs-keyword">let</span> selectedStockcode = <span class="hljs-string">'85123A'</span>
  <span class="hljs-keyword">let</span> selectedProduct = productOptions.filter(<span class="hljs-function"><span class="hljs-params">item</span> =&gt;</span> item.id === selectedStockcode)[<span class="hljs-number">0</span>]

  <span class="hljs-comment">// api endpoint</span>
  <span class="hljs-keyword">const</span> flaskBackendUrl = <span class="hljs-string">"YOUR FLASK BACKEND URL"</span>

  <span class="hljs-comment">// create chart data to display</span>
  <span class="hljs-keyword">const</span> chartDataSales = predictions &amp;&amp; predictions.length &gt; <span class="hljs-number">0</span>
    ? predictions
      .map(<span class="hljs-function"><span class="hljs-params">item</span> =&gt;</span> ({
        <span class="hljs-attr">price</span>: item.unit_price,
        <span class="hljs-attr">sales</span>: item.predicted_sales,
        <span class="hljs-attr">volume</span>: item.unit_price !== <span class="hljs-number">0</span> ? item.predicted_sales / item.unit_price : <span class="hljs-number">0</span>
      }))
      .sort(<span class="hljs-function">(<span class="hljs-params">a, b</span>) =&gt;</span> a.price - b.price)
    : [...selectedProduct[<span class="hljs-string">'histPrices'</span>]]

  <span class="hljs-comment">// optimal price to display</span>
  <span class="hljs-keyword">const</span> optimalPrice = predictions.length &gt; <span class="hljs-number">0</span>
    ? [...predictions].sort(<span class="hljs-function">(<span class="hljs-params">a, b</span>) =&gt;</span> b.predicted_sales - a.predicted_sales)[<span class="hljs-number">0</span>][<span class="hljs-string">'unit_price'</span>] <span class="hljs-comment">// copy before sorting to avoid mutating state</span>
    : <span class="hljs-number">0</span>

  <span class="hljs-comment">// fetch prediction results</span>
  useEffect(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-keyword">const</span> handlePrediction = <span class="hljs-keyword">async</span> () =&gt; {
      setIsLoading(<span class="hljs-literal">true</span>)
      setPredictions([])
      <span class="hljs-keyword">const</span> errorPrices = selectedProduct[<span class="hljs-string">'errorPrices'</span>]

      <span class="hljs-keyword">await</span> fetch(flaskBackendUrl)
        .then(<span class="hljs-function"><span class="hljs-params">res</span> =&gt;</span> {
          <span class="hljs-keyword">if</span> (res.status !== <span class="hljs-number">200</span>) { setPredictions(errorPrices); setIsLoading(<span class="hljs-literal">false</span>); setStart(<span class="hljs-literal">false</span>) }
          <span class="hljs-keyword">else</span> <span class="hljs-keyword">return</span> <span class="hljs-built_in">Promise</span>.resolve(res.clone().json())
        })
        .then(<span class="hljs-function"><span class="hljs-params">res</span> =&gt;</span> {
          <span class="hljs-keyword">if</span> (res &amp;&amp; res.length &gt; <span class="hljs-number">0</span>) setPredictions(res)
          <span class="hljs-keyword">else</span> setPredictions(errorPrices)
          setIsLoading(<span class="hljs-literal">false</span>); setStart(<span class="hljs-literal">false</span>)
        })
        .catch(<span class="hljs-function"><span class="hljs-params">err</span> =&gt;</span> { setPredictions(errorPrices); setIsLoading(<span class="hljs-literal">false</span>); setStart(<span class="hljs-literal">false</span>) })
        .finally(<span class="hljs-function">() =&gt;</span> setStart(<span class="hljs-literal">false</span>))
    }

    <span class="hljs-keyword">if</span> (start) handlePrediction()
    <span class="hljs-keyword">if</span> (predictions &amp;&amp; predictions.length &gt; <span class="hljs-number">0</span>) setStart(<span class="hljs-literal">false</span>)
  }, [flaskBackendUrl, start])


  <span class="hljs-comment">// render</span>
  <span class="hljs-keyword">if</span> (isLoading) <span class="hljs-keyword">return</span> <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">Loading</span> /&gt;</span></span>
  <span class="hljs-keyword">return</span> (
    <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">ResponsiveContainer</span> <span class="hljs-attr">width</span>=<span class="hljs-string">"100%"</span> <span class="hljs-attr">height</span>=<span class="hljs-string">"100%"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">AreaChart</span>
          <span class="hljs-attr">key</span>=<span class="hljs-string">{chartDataSales.length}</span>
          <span class="hljs-attr">data</span>=<span class="hljs-string">{chartDataSales.sort(data</span> =&gt;</span> data.unit_price)}
          margin={{ top: 10, right: 30, left: 0, bottom: 0 }}
        &gt;
          <span class="hljs-tag">&lt;<span class="hljs-name">CartesianGrid</span> <span class="hljs-attr">strokeDasharray</span>=<span class="hljs-string">"3 3"</span> <span class="hljs-attr">strokeOpacity</span>=<span class="hljs-string">{0.6}</span> /&gt;</span>

          <span class="hljs-tag">&lt;<span class="hljs-name">XAxis</span>
            <span class="hljs-attr">dataKey</span>=<span class="hljs-string">"price"</span>
            <span class="hljs-attr">label</span>=<span class="hljs-string">{{</span> <span class="hljs-attr">value:</span> "<span class="hljs-attr">Unit</span> <span class="hljs-attr">Price</span> ($)", <span class="hljs-attr">position:</span> "<span class="hljs-attr">insideBottom</span>", <span class="hljs-attr">offset:</span> <span class="hljs-attr">0</span>, <span class="hljs-attr">fontSize:</span> <span class="hljs-attr">12</span>, <span class="hljs-attr">marginTop:</span> <span class="hljs-attr">10</span> }}
            <span class="hljs-attr">tickFormatter</span>=<span class="hljs-string">{(tick)</span> =&gt;</span> `$${parseFloat(tick).toFixed(2)}`}
            tick={{ fontSize: 12 }}
            padding={{ left: 20, right: 20 }}
          /&gt;

          <span class="hljs-tag">&lt;<span class="hljs-name">YAxis</span>
            <span class="hljs-attr">label</span>=<span class="hljs-string">{{</span> <span class="hljs-attr">value:</span> "<span class="hljs-attr">Predicted</span> <span class="hljs-attr">Sales</span> ($)", <span class="hljs-attr">angle:</span> <span class="hljs-attr">-90</span>, <span class="hljs-attr">position:</span> "<span class="hljs-attr">insideLeft</span>", <span class="hljs-attr">fontSize:</span> <span class="hljs-attr">12</span> }}
            <span class="hljs-attr">tick</span>=<span class="hljs-string">{{</span> <span class="hljs-attr">fontSize:</span> <span class="hljs-attr">12</span> }}
            <span class="hljs-attr">tickFormatter</span>=<span class="hljs-string">{(tick)</span> =&gt;</span> `$${tick.toLocaleString()}`}
          /&gt;

          {/* tooltips with the prediction result data */}
          <span class="hljs-tag">&lt;<span class="hljs-name">Tooltip</span>
            <span class="hljs-attr">contentStyle</span>=<span class="hljs-string">{{</span>
              <span class="hljs-attr">borderRadius:</span> '<span class="hljs-attr">8px</span>',
              <span class="hljs-attr">padding:</span> '<span class="hljs-attr">10px</span>',
              <span class="hljs-attr">boxShadow:</span> '<span class="hljs-attr">0px</span> <span class="hljs-attr">0px</span> <span class="hljs-attr">15px</span> <span class="hljs-attr">rgba</span>(<span class="hljs-attr">0</span>,<span class="hljs-attr">0</span>,<span class="hljs-attr">0</span>,<span class="hljs-attr">0.5</span>)'
            }}
            <span class="hljs-attr">formatter</span>=<span class="hljs-string">{(value,</span> <span class="hljs-attr">name</span>) =&gt;</span> {
              if (name === 'sales') {
                return [`$${value.toFixed(4)}`, 'Predicted Sales']
              }
              if (name === 'volume') {
                return [`${value.toFixed(0)}`, 'Volume']
              }
              return value
            }}
            labelFormatter={(label) =&gt; `Price: $${label.toFixed(2)}`}
          /&gt;

          {/* chart area = sales */}
          <span class="hljs-tag">&lt;<span class="hljs-name">Area</span>
            <span class="hljs-attr">type</span>=<span class="hljs-string">"monotone"</span>
            <span class="hljs-attr">dataKey</span>=<span class="hljs-string">"sales"</span>
            <span class="hljs-attr">fillOpacity</span>=<span class="hljs-string">{1}</span>
            <span class="hljs-attr">fill</span>=<span class="hljs-string">"url(#colorSales)"</span>
          /&gt;</span>

          {/* vertical line for the optimal price */}
          {optimalPrice &gt; 0 &amp;&amp;
            <span class="hljs-tag">&lt;<span class="hljs-name">ReferenceLine</span>
              <span class="hljs-attr">x</span>=<span class="hljs-string">{optimalPrice}</span>
              <span class="hljs-attr">strokeDasharray</span>=<span class="hljs-string">"4 4"</span>
              <span class="hljs-attr">ifOverflow</span>=<span class="hljs-string">"visible"</span>
              <span class="hljs-attr">label</span>=<span class="hljs-string">{{</span>
                <span class="hljs-attr">value:</span> `<span class="hljs-attr">Optimal</span> <span class="hljs-attr">Price:</span> $${<span class="hljs-attr">optimalPrice</span> !== <span class="hljs-string">null</span> &amp;&amp; <span class="hljs-attr">optimalPrice</span> &gt;</span> 0 ? Math.ceil(optimalPrice * 10000) / 10000 : ''}`,
                position: "right",
                fontSize: 12,
                offset: 10
              }}
            /&gt;
          }
        <span class="hljs-tag">&lt;/<span class="hljs-name">AreaChart</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">ResponsiveContainer</span>&gt;</span>

      {optimalPrice &gt; 0 &amp;&amp; <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>Optimal Price: $ {Math.ceil(optimalPrice * 10000) / 10000}<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>}

    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span></span>
  )
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> App
</code></pre>
<h2 id="heading-final-results">Final Results</h2>
<p>Now, the application is ready to serve.</p>
<p>You can explore the UI from <a target="_blank" href="https://kuriko-iwai.vercel.app/online-commerce-intelligence-hub">here</a>.</p>
<p>All the backend code is available in <a target="_blank" href="https://github.com/krik8235/ml-sales-prediction">my GitHub repo</a>.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building a machine learning system requires thoughtful project scoping and architecture design.</p>
<p>In this article, we built a dynamic pricing system as a simple single interface on containerized serverless architecture.</p>
<p>Moving forward, we’d need to consider potential drawbacks of this minimal architecture:</p>
<ul>
<li><p><strong>Increase in cold start duration</strong>: The WSGI adapter <code>awsgi</code> layer adds a small overhead, and loading a larger container image takes longer.</p>
</li>
<li><p><strong>Monolithic function:</strong> Adding endpoints to the Lambda function can lead to a monolithic function where an issue in one endpoint impacts others.</p>
</li>
<li><p><strong>Less granular observability</strong>: AWS CloudWatch cannot provide individual invocation/error metrics per API endpoint without custom instrumentation.</p>
</li>
</ul>
<p>To scale the application effectively, extracting functionality into separate microservices is a good next step.</p>
<p>I’m Kuriko IWAI, and you can find more of my work and learn more about me here:</p>
<p><a target="_blank" href="https://kuriko-iwai.vercel.app/"><strong>Portfolio</strong></a> <strong>/</strong> <a target="_blank" href="https://www.linkedin.com/in/k-i-i/"><strong>LinkedIn</strong></a> <strong>/</strong> <a target="_blank" href="https://github.com/krik8235"><strong>Github</strong></a></p>
<p><em>All images, unless otherwise noted, are by the author. This application uses a synthetic dataset licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.</em></p>
<p><em>This information about AWS is current as of August 2025 and is subject to change.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Deep Reinforcement Learning in Natural Language Understanding ]]>
                </title>
                <description>
                    <![CDATA[ Language is messy, subtle, and full of meaning that shifts with context. Teaching machines to truly understand it is one of the hardest problems in artificial intelligence. That challenge is what natural language understanding (NLU) sets out to solve... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/deep-reinforcement-learning-in-natural-language-understanding/</link>
                <guid isPermaLink="false">689f4b8b1694c0dba616a0d0</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oyedele Tioluwani ]]>
                </dc:creator>
                <pubDate>Fri, 15 Aug 2025 15:00:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755270013761/005fd330-7f59-4753-ba14-8852f4240f3c.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Language is messy, subtle, and full of meaning that shifts with context. Teaching machines to truly understand it is one of the hardest problems in artificial intelligence.</p>
<p>That challenge is what natural language understanding (NLU) sets out to solve. From voice assistants that follow instructions to support systems that interpret user intent, NLU sits at the core of many real-world AI applications.</p>
<p>Most systems today are trained using labeled data and supervised techniques. But there's growing interest in something more adaptive: deep reinforcement learning (DRL). Instead of learning from fixed examples, DRL allows a model to improve through trial, error, and feedback, much like a person learning through experience.</p>
<p>This article looks at where DRL fits into the modern NLU landscape. We'll explore how it's being used to fine-tune responses, guide conversation flow, and align models with human values.</p>
<h3 id="heading-what-well-cover">What we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-overview-of-deep-reinforcement-learning">Overview of Deep Reinforcement Learning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-natural-language-understanding-nlu">What is Natural Language Understanding (NLU)?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-challenges-in-nlu-and-how-to-address-them">Challenges in NLU and How to Address Them</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-where-drl-adds-value-in-nlu">Where DRL Adds Value in NLU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-modern-architectures-in-nlu-from-bert-to-claude">Modern Architectures in NLU from BERT to Claude</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-niche-role-of-drl-in-modern-nlu">The Niche Role of DRL in Modern NLU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-reinforcement-learning-from-human-feedback-rlhf">Reinforcement Learning from Human Feedback (RLHF)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ecosystem-and-tools-for-drl-in-nlp">Ecosystem and Tools for DRL in NLP</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-hands-on-demo-simulating-drl-feedback-in-nlu">Hands-On Demo: Simulating DRL Feedback in NLU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-case-studies-of-drl-in-nlu">Case Studies of DRL in NLU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-overview-of-deep-reinforcement-learning">Overview of Deep Reinforcement Learning</h2>
<p>Reinforcement learning is a subfield of machine learning inspired by behavioral psychology, in which agents learn to maximize cumulative rewards by taking actions in a given environment.</p>
<p>Traditionally, reinforcement learning techniques have been used to solve simple problems with discrete state and action spaces. But the development of deep learning has opened the door to applying these techniques to more complicated, high-dimensional environments, like computer vision, natural language processing (NLP), and robotics.</p>
<p>DRL uses deep neural networks to approximate complex functions that translate observations into actions, allowing agents to learn from raw sensory data. Deep neural networks, which represent knowledge in numerous layers of abstraction, can capture intricate patterns and relationships in data, allowing for more effective decision-making.</p>
<p>Imagine you’re playing a video game where you’re controlling a character, and your goal is to get the highest score possible. Now, when you first start playing, you might not know the best way to play, right? You might try different things like jumping, running, or shooting, and you see what works and what doesn’t.</p>
<p>We can think of DRL as a technique that enables computers or robots to learn, over time, how to play such a game. DRL involves a computer learning from its environment, improving through its experiences and mistakes. The computer, like the player, tries different actions and receives feedback based on its performance. If it performs well, it gets rewards, while if it fails, it gets a penalty.</p>
<p>The computer’s job is to figure out the best possible actions to take in different situations to maximize rewards. Rather than relying on trial and error alone, DRL uses deep neural networks, which act like super-smart brains that can understand vast amounts of data and patterns. These neural networks help the computer make better decisions in the future, and over time, it can become even better at playing the game – sometimes even better than humans.</p>
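<p>The trial-and-feedback loop described above can be made concrete with a minimal sketch: an epsilon-greedy agent (a classic bandit algorithm standing in for a full DRL agent) tries actions, receives noisy rewards, and gradually settles on the action with the highest average reward. The actions and reward values here are invented for illustration:</p>

```python
# Toy illustration of the reward loop: an epsilon-greedy agent tries
# actions, receives rewards, and gradually prefers the best action.
# (A real DRL agent would replace the value table with a deep network.)
import random

random.seed(0)

ACTIONS = ["jump", "run", "shoot"]
TRUE_REWARD = {"jump": 0.2, "run": 0.5, "shoot": 0.8}  # hidden from the agent

values = {a: 0.0 for a in ACTIONS}   # estimated value of each action
counts = {a: 0 for a in ACTIONS}
EPSILON = 0.1                        # exploration rate

def choose_action() -> str:
    """Mostly exploit the best-known action, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: values[a])

for step in range(2000):
    action = choose_action()
    # noisy reward signal from the environment
    reward = TRUE_REWARD[action] + random.gauss(0, 0.1)
    counts[action] += 1
    # incremental average: move the estimate toward the observed reward
    values[action] += (reward - values[action]) / counts[action]

best = max(ACTIONS, key=lambda a: values[a])
print(best)  # the agent converges on the highest-reward action
```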
<p><img src="https://cdn-images-1.medium.com/max/1600/1*7UeewswDEpqTALIvwkNNAw.png" alt="Deep reinforcement learning approach" width="600" height="400" loading="lazy"></p>
<p><a target="_blank" href="https://www.researchgate.net/publication/333909668_Demand_Response_Management_for_Industrial_Facilities_A_Deep_Reinforcement_Learning_Approach">Image Source</a></p>
<h2 id="heading-what-is-natural-language-understanding-nlu">What is Natural Language Understanding (NLU)?</h2>
<p>NLU is a subfield of artificial intelligence (AI), and its aim is to help computers understand, interpret, and respond to human language in meaningful ways. It involves creating algorithms and models that can process and analyze text to extract meaningful information, determine the intent behind it, and provide appropriate replies.</p>
<p>NLU is a basic part of many AI applications, such as chatbots, virtual assistants, and personalized recommendation systems, which require the ability to interpret and respond to human language.</p>
<p>Its key components include:</p>
<ul>
<li><p><strong>Text processing:</strong> NLU systems must be able to process and interpret text, which includes tokenization (breaking the text into words or phrases), part-of-speech tagging, and named entity recognition.</p>
</li>
<li><p><strong>Sentiment analysis:</strong> Identifying the sentiment communicated in a piece of text (positive, negative, or neutral) is a common task in NLU.</p>
</li>
<li><p><strong>Intent recognition:</strong> Identifying the goal or objective of a user’s input, such as buying a flight or requesting weather forecasts.</p>
</li>
<li><p><strong>Language generation</strong> (technically part of Natural Language Generation, or NLG): While NLU focuses on understanding text, NLG is about producing coherent, contextually appropriate text. Many AI systems combine both, first interpreting the input through NLU, then generating an appropriate response using NLG.</p>
</li>
<li><p><strong>Entity extraction:</strong> Identifying and categorizing essential details in the text, such as dates, locations, and people.</p>
</li>
</ul>
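<p>To make these components concrete, here is a deliberately naive, rule-based sketch of tokenization, intent recognition, and entity extraction. Real NLU systems learn these behaviors from data; the keyword rules and example utterance below are illustrative assumptions only:</p>

```python
# Naive sketch of three NLU components: tokenization, intent
# recognition, and entity extraction, using hand-written rules.
import re

INTENT_KEYWORDS = {  # toy intent vocabulary
    "book_flight": {"book", "flight", "fly"},
    "get_weather": {"weather", "forecast", "rain"},
}

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def recognize_intent(tokens: list[str]) -> str:
    """Pick the intent whose keyword set overlaps the tokens the most."""
    scores = {intent: len(kws & set(tokens)) for intent, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def extract_entities(text: str) -> list[str]:
    """Very naive entity extraction: capitalized words not at the sentence start."""
    return re.findall(r"(?<!^)(?<!\. )\b[A-Z][a-z]+", text)

utterance = "Book a flight to Paris"
tokens = tokenize(utterance)
print(recognize_intent(tokens))     # book_flight
print(extract_entities(utterance))  # ['Paris']
```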
<h2 id="heading-challenges-in-nlu-and-how-to-address-them"><strong>Challenges in NLU and How to Address Them</strong></h2>
<p>NLU aims to help machines interpret, understand, and respond to human language in ways that make sense. While it has made great progress, there are still challenges that limit how well it works in practice.</p>
<p>Below are some of these challenges and how Deep Reinforcement Learning (DRL) can play a supportive role. DRL is not a replacement for large-scale pretraining or instruction tuning, but it can complement them by helping models adapt through interaction and feedback.</p>
<h3 id="heading-ambiguity"><strong>Ambiguity</strong></h3>
<p>Naturally, words can have more than one meaning, and a single sentence or phrase might be understood in different ways. This makes it hard for NLU systems to always pinpoint what the speaker or writer intends.</p>
<p>DRL can help reduce ambiguity by allowing models to learn from feedback. If a certain interpretation gets positive results, the model can prioritize it. If not, it can try a different approach. While this does not remove ambiguity entirely, it can improve a model’s ability to make better choices over time, especially when combined with a strong pretrained foundation.</p>
<h3 id="heading-contextual-understanding"><strong>Contextual understanding</strong></h3>
<p>Understanding language often depends on context such as cultural references, sarcasm, or the tone behind certain words. These are straightforward for people but challenging for machines to recognize.</p>
<p>By learning from interaction signals such as whether a user is satisfied with a response, DRL can help a model adapt to context more effectively. However, the core ability to understand context still comes from large-scale pretraining. DRL mainly fine-tunes and adjusts this behavior during use.</p>
<h3 id="heading-language-variation"><strong>Language variation</strong></h3>
<p>Human language comes in many forms including different dialects, slang, colloquialisms, and regional expressions. This variety can challenge NLU systems that have not seen enough examples of these patterns during training.</p>
<p>With DRL, models can adapt to new language styles when exposed to them repeatedly in real-world use. This makes them more flexible and responsive, although their base understanding still relies on the diversity of the data used during pretraining.</p>
<h3 id="heading-scalability"><strong>Scalability</strong></h3>
<p>As text data continues to grow, NLU systems must be able to process large volumes quickly and efficiently, especially in real-time applications such as chatbots and virtual assistants.</p>
<p>DRL can contribute by helping models optimize certain processing steps through trial and feedback. While it will not replace architectural or infrastructure improvements, it can help fine-tune performance for specific high-traffic tasks.</p>
<h3 id="heading-computational-complexity"><strong>Computational complexity</strong></h3>
<p>Training advanced NLU models is resource-intensive, which can be a challenge for mobile devices, edge computing, or other resource-limited environments.</p>
<p>DRL can make the learning process more efficient by reusing past experiences through techniques such as off-policy learning and reward modeling. Combined with smaller, distilled model architectures, this can make it easier to deploy capable NLU systems even with limited computing power.</p>
<h2 id="heading-where-drl-adds-value-in-nlu"><strong>Where DRL Adds Value in NLU</strong></h2>
<p>DRL is not a primary training method for most NLU models. Its main value comes when interaction, feedback, or rewards can be used to improve how a system behaves after it has already been pretrained. When applied selectively, DRL can help refine and personalize model performance in ways that matter for specific use cases.</p>
<p>Here are some areas where DRL has shown potential:</p>
<ol>
<li><p><strong>Dialogue systems</strong><br> DRL can help chatbots and virtual assistants manage conversations more smoothly. It can be used to refine turn-taking, handle vague questions in a better way, or adjust responses to improve user satisfaction during longer conversations.</p>
</li>
<li><p><strong>Text summarization</strong><br> Most summarization models rely on supervised learning. DRL can be added as a fine-tuning step to focus on factors such as relevance or fluency, especially when custom reward signals are linked to specific goals or user preferences.</p>
</li>
<li><p><strong>Response generation and language modeling</strong><br> DRL can guide language generation toward outputs that are more useful, aligned with user intent, or better suited to certain tone and safety requirements.</p>
</li>
<li><p><strong>Reward-based optimization in parsing or classification</strong><br> In certain cases, DRL has been used to improve outputs based on downstream objectives such as increasing label confidence or enhancing the quality of supporting explanations, alongside accuracy.</p>
</li>
<li><p><strong>Interactive machine translation</strong><br> DRL can help translation systems adapt over time by learning from reinforcement signals like human corrections or post-editing feedback, leading to gradual improvements in quality.</p>
</li>
</ol>
<p>In short, DRL works best as a targeted enhancement. It is not used to build general-purpose NLU systems from scratch, but it can make existing systems more adaptable, aligned, and responsive when feedback loops are part of the application.</p>
<h2 id="heading-modern-architectures-in-nlu-from-bert-to-claude"><strong>Modern Architectures in NLU from BERT to Claude</strong></h2>
<p>Early NLU systems used Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), but most modern systems use transformers.</p>
<p>These models use a mechanism called self-attention to capture long-range dependencies. <strong>Self-attention</strong> allows each word to “attend” to every other word in the input, assigning weights that determine relevance for understanding the current word. <strong>Long-range dependencies</strong> occur when the meaning of one word depends on another far away in the text (like linking “he” to “the president” from earlier sentences). This helps maintain context over large spans of text.</p>
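<p>A small numeric sketch shows the mechanism: each token’s query is scored against every token’s key, the scores are softmax-normalized into attention weights, and the weights blend the value vectors. The 2-d vectors below are toy embeddings, not learned ones, and this is single-head attention without the learned projection matrices a real transformer uses:</p>

```python
# Numeric sketch of single-head self-attention:
# output_i = softmax(q_i . k_j / sqrt(d)) applied to the value vectors.
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Return one output vector per token."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]  # relevance of each token
        weights = softmax(scores)                          # attention distribution
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]             # weighted blend of values
        outputs.append(out)
    return outputs

# three tokens with 2-d toy embeddings (identical Q, K, V for simplicity)
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(emb, emb, emb)
# each output row is a weighted blend of all three token embeddings
```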
<p>Here’s how the main types of transformer models are used today:</p>
<h3 id="heading-encoder-only-models">Encoder-only models</h3>
<p>Examples: BERT, RoBERTa, ALBERT, DeBERTa</p>
<p>These models process text input and create rich contextual representations without generating new text. They are excellent for classification, entity extraction, and tasks that require understanding rather than producing language. The encoder reads the whole input and encodes it into a vector representation, which is then used by a task-specific head for predictions.  </p>
<p>They're often fine-tuned for specific tasks and perform especially well in structured language understanding.</p>
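<p>The encode-then-predict pattern can be sketched with random vectors standing in for the encoder's contextual embeddings; all shapes and names here are illustrative, not any particular model's.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an encoder's output: one contextual vector per token
# (seq_len=6 tokens, hidden size 8)
token_embeddings = rng.normal(size=(6, 8))

# Pool the sequence into one fixed-size sentence vector
# (BERT-style models often use the [CLS] token's vector instead of mean pooling)
sentence_vector = token_embeddings.mean(axis=0)

# A task-specific head: a linear layer mapping hidden size 8 -> 3 class logits
W = rng.normal(size=(8, 3))
b = np.zeros(3)
logits = sentence_vector @ W + b
predicted_class = int(np.argmax(logits))
print(predicted_class)
```

<p>Fine-tuning trains both the head and (usually) the encoder weights on labeled examples for the target task.</p>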
<h3 id="heading-encoder-decoder-models">Encoder-decoder models</h3>
<p>Examples: T5, FLAN-T5</p>
<p>These models have two components: an encoder that reads and encodes the input text, and a decoder that generates an output sequence based on that encoded representation. They are ideal for sequence-to-sequence tasks such as summarization, translation, and instruction following. The encoder captures the meaning of the input, while the decoder produces coherent output in the target form.  </p>
<p>They’re flexible and particularly useful in multi-task learning setups.</p>
<h3 id="heading-decoder-only-models">Decoder-only models</h3>
<p>Examples: GPT-4, Claude 3, Gemini</p>
<p>These models generate text one token at a time, predicting the next token based on all previous tokens in the sequence. They excel in open-ended text generation, creative writing, and reasoning tasks. Because they are trained to predict the next word given any context, they can perform many tasks simply by being prompted, without additional training.  </p>
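<p>Stripped to its core, autoregressive generation is a loop of “predict the next token, append it, repeat.” A toy sketch with a hand-written bigram table (the table and tokens are invented for illustration; a real model computes these probabilities from the full context with a neural network):</p>

```python
# Toy next-token distribution: for each token, which tokens tend to follow it
bigram_probs = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(start, max_tokens=10):
    """Greedy decoding: always pick the most probable next token."""
    tokens = [start]
    for _ in range(max_tokens):
        dist = bigram_probs[tokens[-1]]
        nxt = max(dist, key=dist.get)
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens

print(generate("the"))  # greedy path: ['the', 'cat', 'sat']
```

<p>Sampling from the distribution instead of taking the argmax (often with a temperature parameter) is what makes generation varied rather than deterministic.</p>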
<p>They’re typically aligned with human preferences using techniques like Reinforcement Learning from Human Feedback (RLHF).</p>
<p>These models are now widely used in real-world applications, such as chatbots, enterprise tools, and multilingual digital assistants, and many can handle new tasks with just a prompt, requiring no additional training.</p>
<h2 id="heading-the-niche-role-of-drl-in-modern-nlu"><strong>The Niche Role of DRL in Modern NLU</strong></h2>
<p>DRL is not a general-purpose solution for most NLU challenges, such as handling ambiguity or understanding context. These problems are typically addressed using large-scale pretraining and supervised or instruction-based fine-tuning.</p>
<p>That said, DRL still plays a valuable role in specific areas where feedback and long-term optimization are useful. It is commonly applied in:</p>
<ul>
<li><p><strong>Improving dialogue strategy:</strong> DRL helps conversational agents manage turn-taking, adjust tone, and adapt to user preferences across multiple interactions.</p>
</li>
<li><p><strong>Aligning model behavior using RLHF:</strong> Reinforcement learning from human feedback (RLHF – more on this below) uses DRL to train models that respond in ways people find more helpful, safe, or contextually appropriate.</p>
</li>
<li><p><strong>Reward modeling for alignment and safety:</strong> DRL enables the training of reward models that guide language systems toward ethical, culturally aware, or domain-specific behavior.</p>
</li>
</ul>
<p>Looking ahead, DRL is likely to grow in importance for applications that involve real-time interaction, long-horizon reasoning, or agent-driven workflows. For now, it serves as a targeted enhancement alongside more widely used training methods.</p>
<h3 id="heading-reinforcement-learning-from-human-feedback-rlhf">Reinforcement Learning from Human Feedback (RLHF)</h3>
<p>Let’s talk a bit more about RLHF, as it’s pretty important here. It’s also currently the primary way DRL is applied in large-scale language models such as GPT‑4, Claude, and Gemini.  </p>
<p>It works in three main steps:</p>
<ol>
<li><p><strong>Reward model training</strong> – Human annotators rank model outputs for the same prompt. These rankings are used to train a reward model that scores outputs based on how helpful, safe, or relevant they are.</p>
</li>
<li><p><strong>Policy optimization</strong> – Using algorithms such as PPO (Proximal Policy Optimization), the base language model is fine-tuned to maximize the reward model’s score.</p>
</li>
<li><p><strong>Iteration and safety</strong> – RLHF loops are often combined with safety-focused reward modeling, constitutional AI (following explicit guidelines for safe behavior), refusal strategies for harmful requests, and red‑teaming to probe weaknesses.</p>
</li>
</ol>
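<p>Step 1 can be sketched with the pairwise loss commonly used to train reward models (a Bradley–Terry-style objective): given the scores the reward model assigns to a preferred and a rejected response, the loss is small when the preferred response outscores the rejected one. The scores below are illustrative numbers, not real model outputs.</p>

```python
import numpy as np

def pairwise_reward_loss(score_chosen, score_rejected):
    """-log(sigmoid(chosen - rejected)): low when the chosen response outscores the rejected one."""
    return -np.log(1.0 / (1.0 + np.exp(-(score_chosen - score_rejected))))

# A reward model that already ranks the human-preferred answer higher incurs low loss
print(pairwise_reward_loss(2.0, -1.0))  # ≈ 0.049

# Ranking them the wrong way round is penalized heavily
print(pairwise_reward_loss(-1.0, 2.0))  # ≈ 3.049
```

<p>Minimizing this loss over many human-ranked pairs yields the scalar reward signal that PPO then maximizes in step 2.</p>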
<p>Data‑efficient variants are increasingly common, such as offline RL, replay buffers, and leveraging implicit feedback like click‑through logs.</p>
<p>In practice, RLHF has significantly improved the ability of models to follow instructions, avoid harmful outputs, and align with human values.</p>
<h2 id="heading-ecosystem-and-tools-for-drl-in-nlp"><strong>Ecosystem and Tools for DRL in NLP</strong></h2>
<p>If you're looking to explore DRL in NLU, you don't have to start from scratch. There’s a solid ecosystem of tools that make it easier to test ideas, build prototypes, and fine-tune models using rewards and feedback.</p>
<p>Here are a few go-to libraries:</p>
<ol>
<li><p><code>trl</code> by Hugging Face: A lightweight framework built specifically for applying reinforcement learning to transformer models. It's widely used for RLHF, reward modeling, and steering model outputs based on human preferences.</p>
</li>
<li><p>Stable-Baselines3: A simple, well-documented library for classic DRL algorithms like PPO, A2C, and DQN. It’s great for testing DRL setups in smaller or custom environments.</p>
</li>
<li><p>RLlib (part of Ray): Designed for scaling up. If you're working on distributed training or combining DRL with larger pipelines, RLlib helps manage the complexity.</p>
</li>
</ol>
<p>These libraries pair well with open-source large language models like LLaMA, Mistral, Gemma, and Command R+. Together, they give you everything you need to experiment with DRL-backed language systems, whether you're tuning responses in a chatbot or building a reward model for alignment.</p>
<h2 id="heading-hands-on-demo-simulating-drl-feedback-in-nlu">Hands-On Demo: Simulating DRL Feedback in NLU</h2>
<p>You don’t need a full reinforcement learning pipeline to understand reward signals. This notebook demonstrates how you can simulate <strong>preference-based feedback</strong> using GPT-3.5. Users interact with the model, provide binary feedback (good or bad), and the system logs each interaction with a corresponding reward. It mirrors the principles behind techniques like RLHF.</p>
<h3 id="heading-setup-and-authentication">Setup and Authentication</h3>
<p>First, you’ll need to install the required packages and set up your API key. Note that this demo uses the legacy <code>openai</code> Python interface (<code>openai.ChatCompletion</code>), so pin a pre-1.0 version of the package. In a notebook cell, run:</p>
<pre><code class="lang-python">!pip install "openai<1.0" ipywidgets pandas matplotlib
</code></pre>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> openai
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> ipywidgets <span class="hljs-keyword">as</span> widgets
<span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> display, Markdown, clear_output
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-comment"># Load your OpenAI API key</span>
openai.api_key = os.getenv(<span class="hljs-string">"OPENAI_API_KEY"</span>) <span class="hljs-keyword">or</span> input(<span class="hljs-string">"Enter your OpenAI API key: "</span>)
</code></pre>
<p><strong>What this does</strong>:</p>
<ul>
<li><p>Installs and loads required libraries</p>
</li>
<li><p>Reads your OpenAI key from an environment variable or prompts for it interactively</p>
</li>
</ul>
<h3 id="heading-step-1-generate-a-gpt-35-response">Step 1: Generate a GPT-3.5 Response</h3>
<p>Now, try sending a prompt and seeing what response you get:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_gpt_response</span>(<span class="hljs-params">prompt</span>):</span>
    <span class="hljs-keyword">try</span>:
        response = openai.ChatCompletion.create(
            model=<span class="hljs-string">"gpt-3.5-turbo"</span>,
            messages=[{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: prompt}],
            temperature=<span class="hljs-number">0.7</span>
        )
        <span class="hljs-keyword">return</span> response[<span class="hljs-string">'choices'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'message'</span>][<span class="hljs-string">'content'</span>].strip()
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"Error: <span class="hljs-subst">{e}</span>"</span>
</code></pre>
<p><strong>What this does</strong>:</p>
<ul>
<li><p>Uses OpenAI’s GPT-3.5 to generate a response</p>
</li>
<li><p>Handles errors if the API call fails</p>
</li>
</ul>
<h3 id="heading-step-2-store-feedback-history">Step 2: Store Feedback History</h3>
<p>You can now track user responses and simulated reward signals like this:</p>
<pre><code class="lang-python">history = []
</code></pre>
<p>This code initializes a list to store logs of each interaction.</p>
<h3 id="heading-step-3-run-feedback-interaction">Step 3: Run Feedback Interaction</h3>
<p>Now you can capture the prompt, display the response, and accept feedback.</p>
<pre><code class="lang-python"><span class="hljs-comment">#  Main interaction logic</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_interaction</span>(<span class="hljs-params">prompt</span>):</span>
    clear_output(wait=<span class="hljs-literal">True</span>)
    response = get_gpt_response(prompt)
    display(Markdown(<span class="hljs-string">f"### Prompt\n`<span class="hljs-subst">{prompt}</span>`"</span>))
    display(Markdown(<span class="hljs-string">f"### GPT-3.5 Response\n&gt; <span class="hljs-subst">{response}</span>"</span>))

    <span class="hljs-comment"># Feedback buttons</span>
    good_btn = widgets.Button(description=<span class="hljs-string">"👍 Good"</span>, button_style=<span class="hljs-string">'success'</span>)
    bad_btn = widgets.Button(description=<span class="hljs-string">"👎 Bad"</span>, button_style=<span class="hljs-string">'danger'</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_feedback</span>(<span class="hljs-params">feedback</span>):</span>
        reward = <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> feedback == <span class="hljs-string">'good'</span> <span class="hljs-keyword">else</span> <span class="hljs-number">-1</span>
        history.append({
            <span class="hljs-string">"prompt"</span>: prompt,
            <span class="hljs-string">"response"</span>: response,
            <span class="hljs-string">"feedback"</span>: feedback,
            <span class="hljs-string">"reward"</span>: reward
        })
        display(Markdown(
            <span class="hljs-string">f"**Feedback Recorded:** `<span class="hljs-subst">{feedback}</span>` — Reward = `<span class="hljs-subst">{reward}</span>`"</span>
        ))
        display(Markdown(<span class="hljs-string">"---"</span>))
        display(Markdown(<span class="hljs-string">"### Reward History"</span>))
        df = pd.DataFrame(history)
        display(df.tail(<span class="hljs-number">5</span>))
        plot_rewards()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_good</span>(<span class="hljs-params">_</span>):</span> on_feedback(<span class="hljs-string">'good'</span>)
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_bad</span>(<span class="hljs-params">_</span>):</span> on_feedback(<span class="hljs-string">'bad'</span>)

    display(widgets.HBox([good_btn, bad_btn]))
    good_btn.on_click(on_good)
    bad_btn.on_click(on_bad)
</code></pre>
<p><strong>What this does</strong>:</p>
<ul>
<li><p>Shows GPT-3.5’s response to the user’s prompt</p>
</li>
<li><p>Displays feedback buttons</p>
</li>
<li><p>Logs reward and shows feedback history</p>
</li>
</ul>
<h3 id="heading-step-4-plot-reward-history">Step 4: Plot Reward History</h3>
<p>You can also visualize reward trends:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">plot_rewards</span>():</span>
    df = pd.DataFrame(history)
    plt.figure(figsize=(<span class="hljs-number">6</span>,<span class="hljs-number">3</span>))
    plt.plot(df[<span class="hljs-string">'reward'</span>], marker=<span class="hljs-string">'o'</span>)
    plt.title(<span class="hljs-string">"Reward Over Time"</span>)
    plt.xlabel(<span class="hljs-string">"Interaction"</span>)
    plt.ylabel(<span class="hljs-string">"Reward"</span>)
    plt.grid(<span class="hljs-literal">True</span>)
    plt.show()
</code></pre>
<p>This plots the user’s reward signals over time to simulate policy shaping.</p>
<h3 id="heading-step-5-build-input-interface">Step 5: Build Input Interface</h3>
<p>You can also allow users to type and submit prompts.</p>
<pre><code class="lang-python">prompt_input = widgets.Textarea(
    placeholder=<span class="hljs-string">"Ask something..."</span>,
    description=<span class="hljs-string">"Prompt:"</span>,
    layout=widgets.Layout(width=<span class="hljs-string">'100%'</span>, height=<span class="hljs-string">'80px'</span>),
    style={<span class="hljs-string">'description_width'</span>: <span class="hljs-string">'initial'</span>}
)

generate_btn = widgets.Button(
    description=<span class="hljs-string">"Generate Response"</span>, button_style=<span class="hljs-string">'primary'</span>
)

output_area = widgets.Output()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_generate_click</span>(<span class="hljs-params">_</span>):</span>
    <span class="hljs-keyword">with</span> output_area:
        run_interaction(prompt_input.value)

generate_btn.on_click(on_generate_click)

display(prompt_input)
display(generate_btn)
display(output_area)
</code></pre>
<p>This sets up a simple form to collect prompts and connects the generate button to the main interaction logic.</p>
<p>This gives the output:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753736920176/35079f63-2ca0-4bd4-aea6-3de3589b0c9f.png" alt="Demo result" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This demo captures the fundamentals of preference-based learning using GPT-3.5. It doesn’t update model weights but shows how feedback can be structured as a reward signal. This is the foundation of reinforcement learning in modern LLM pipelines.</p>
<p><strong>Note:</strong> This demo only logs feedback. In true RLHF, a second phase fine-tunes the model weights based on it.</p>
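<p>Even without fine-tuning, the logged history is already preference data. For example, you could aggregate it to spot which prompts draw negative feedback (the entries below are hypothetical, in the same shape as the demo's <code>history</code> list):</p>

```python
import pandas as pd

# Hypothetical logged interactions, matching the demo's history format
history = [
    {"prompt": "Explain DRL", "response": "...", "feedback": "good", "reward": 1},
    {"prompt": "Explain DRL", "response": "...", "feedback": "bad",  "reward": -1},
    {"prompt": "Summarize X", "response": "...", "feedback": "good", "reward": 1},
]

df = pd.DataFrame(history)
mean_reward = df.groupby("prompt")["reward"].mean()
print(mean_reward)  # "Explain DRL" averages 0.0, "Summarize X" averages 1.0
```

<p>In a real RLHF pipeline, exactly this kind of logged preference data would feed the reward-model training phase.</p>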
<p>A real-world example of this is <a target="_blank" href="https://openai.com/index/instruction-following/"><strong>InstructGPT</strong></a>. This is a version of OpenAI’s GPT models that’s trained to follow instructions written by people. Instead of just predicting the next word, it tries to really figure out and then do what you’ve asked, the way you asked it.</p>
<p>Despite being over 100× smaller than GPT-3, InstructGPT was preferred by humans in <strong>85%</strong> of blind comparisons. One of the key reasons is that it uses RLHF. This made it safer, more truthful, and better at following complex instructions, showing how reward signals like the one simulated here can greatly improve real-world model performance.</p>
<h2 id="heading-case-studies-of-drl-in-nlu">Case Studies of DRL in NLU</h2>
<p>While DRL is not the default approach for most NLU tasks, it has shown promising results in targeted use cases, especially where learning from interaction or adapting over time adds value. Below are a few examples that illustrate how DRL can enhance language understanding in practice:</p>
<h3 id="heading-1-welocalize-amp-global-e-commerce-giant-drl-powered-multilingual-nlu">1. Welocalize &amp; Global E-Commerce Giant – DRL-Powered Multilingual NLU</h3>
<p>A global e-commerce platform partnered with Welocalize to <a target="_blank" href="https://www.welocalize.com/insights/case-study-transforming-global-customer-interactions-with-nlu/">launch a DRL-powered multilingual NLU system</a> capable of interpreting customer intent across 30+ languages and domains. This system used reinforcement learning to adapt to cultural nuances and refine predictions through user interaction. Over 13 million high-quality utterances were delivered, supporting culturally adaptive, accurate customer support and product recommendations.</p>
<h3 id="heading-2-reinforcement-learning-with-label-sensitive-reward-acl-2024">2. Reinforcement Learning with Label-Sensitive Reward (ACL 2024)</h3>
<p>Researchers introduced a framework called <a target="_blank" href="https://aclanthology.org/anthology-files/pdf/acl/2024.acl-long.231.pdf">RLLR (Reinforcement Learning with Label-Sensitive Reward)</a> to improve NLU tasks like sentiment classification, topic labeling, and intent detection. By incorporating label-sensitive reward signals and optimizing via Proximal Policy Optimization (PPO), the model aligned its predictions with both rationale quality and true label accuracy.</p>
<p>These examples show how DRL, when paired with specific feedback signals or interactive goals, can be a useful layer on top of traditional NLU systems. Though still niche, the approach continues to evolve through research and industry experimentation.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>The integration of DRL with NLU has shown promising results in niche but growing areas. Adaptive learning through various interactions and feedback allows DRL to enhance NLU models’ ability to handle ambiguity, context, and linguistic differences. </p>
<p>As research progresses, the link between DRL and NLU is expected to drive advancements in AI-powered language applications, making them more efficient, scalable, and context-aware.</p>
<p>I hope this was helpful!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code ]]>
                </title>
                <description>
                    <![CDATA[ The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design. In this tutorial, I’ll show you how to build both single layer and multi-layer perceptrons (MLPs) across three frameworks: Custom class... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-multilayer-perceptron-with-examples-and-python-code/</link>
                <guid isPermaLink="false">6839f729798ea464918cffe8</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ neural networks ]]>
                    </category>
                
                    <category>
                        <![CDATA[ binary classification ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MLP (Multi-Layer Perceptrons) ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Fri, 30 May 2025 18:21:29 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748616370600/01903917-4be7-476b-90d1-18295d19edef.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The <strong>perceptron</strong> is a fundamental concept in deep learning, with many algorithms stemming from its original design.</p>
<p>In this tutorial, I’ll show you how to build both single layer and multi-layer perceptrons (MLPs) across three frameworks:</p>
<ul>
<li><p>Custom classifier</p>
</li>
<li><p>Scikit-learn’s MLPClassifier</p>
</li>
<li><p>Keras Sequential classifier using SGD and Adam optimizers.</p>
</li>
</ul>
<p>This will help you learn about their various use cases and how they work.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-a-perceptron">What is a Perceptron?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-a-single-layered-classifier">How to Build a Single-Layered Classifier</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-a-multi-layer-perceptron">What is a Multi-Layer Perceptron?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-multi-layered-perceptrons">How to Build Multi-Layered Perceptrons</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-optimizers">Understanding Optimizers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-an-mlp-classifier-with-sgd-optimizer">How to Build an MLP Classifier with SGD Optimizer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-an-mlp-classifier-with-adam-optimizer">How to Build an MLP Classifier with Adam Optimizer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-results-generalization">Final Results: Generalization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>Mathematics (Calculus, Linear Algebra, Statistics)</p>
</li>
<li><p>Coding in Python</p>
</li>
<li><p>Basic understanding of Machine Learning concepts</p>
</li>
</ul>
<h2 id="heading-what-is-a-perceptron">What is a Perceptron?</h2>
<p>A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.</p>
<p>A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.</p>
<p>But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.</p>
<p>The perceptron consists of four main parts:</p>
<ul>
<li><p><strong>Input layer</strong>: Takes the initial numerical values into the system for further processing.</p>
</li>
<li><p><strong>Weights</strong>: Combines input values with weights (and bias terms).</p>
</li>
<li><p><strong>Activation function</strong>: Determines whether the neuron should fire based on the threshold value.</p>
</li>
<li><p><strong>Output layer</strong>: Produces classification result.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748438698612/5b2920db-4ec1-455b-840e-7b5e9d6c2e75.png" alt="Image: Organization of a perceptron. Source: Rosenblatt 1958" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.</p>
<p>So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.</p>
<h3 id="heading-applications-of-perceptrons">Applications of Perceptrons</h3>
<p>Perceptrons are applied to tasks such as:</p>
<ul>
<li><p><strong>Image classification:</strong> Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.</p>
</li>
<li><p><strong>Linear regression:</strong> With a linear (identity) activation instead of a step function, a perceptron can predict continuous outputs from input features, which makes it useful for linear regression problems.</p>
</li>
</ul>
<h3 id="heading-how-the-activation-function-works">How the Activation Function Works</h3>
<p>For a single perceptron used for binary classification, the most common activation function is the <strong>step function</strong> (also known as the threshold function):</p>
<p>$$\phi(z) = \begin{cases} 1 &amp;\text{if } z \geq \theta \\ \\ 0 &amp;\text{if } z &lt; \theta \end{cases}$$</p><p>where:</p>
<ul>
<li><p><code>ϕ(z)</code>: the output of the activation function.</p>
</li>
<li><p><code>z</code>: the weighted sum of the inputs plus the bias:</p>
</li>
</ul>
<p>$$z = \sum_{i=1}^m w_i x_i + b$$</p><p>(<code>x_i</code>: input values, <code>w_i</code>: weight associated with each input, <code>b</code>: bias term)</p>
<p><code>θ</code> is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.</p>
<p>In that case, the formula becomes:</p>
<p>$$\phi(z) = \begin{cases} 1 &amp;\text{if } z \geq 0 \\ \\ 0 &amp;\text{if } z &lt; 0 \end{cases}$$</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748439460839/e74f1c1c-4e89-419b-aa9e-24a297d81ff5.png" alt="Image: Step Function (Author)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.</p>
<p>This occurs <strong>when the weighted sum is greater than zero,</strong> leading the perceptron to predict the input is in this binary class.</p>
<p>While the step function is conceptually the original activation for a perceptron, its discontinuity at zero causes computational challenges.</p>
<p>In modern implementations, we can use other activation functions like the <strong>sigmoid</strong> function:</p>
<p>$$\sigma (z) = \frac {1} {1 + e^{-z}}$$</p><p>The sigmoid function outputs a value between zero and one depending on the weighted sum (z), which can be thresholded (for example at 0.5) to produce a binary class label.</p>
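<p>The difference between the two activations is easy to see numerically; here is a quick NumPy check on a few sample values of z:</p>

```python
import numpy as np

def step(z, threshold=0.0):
    """Step activation: hard 0/1 decision at the threshold."""
    return np.where(z >= threshold, 1, 0)

def sigmoid(z):
    """Sigmoid activation: smooth value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(step(z))     # [0 1 1]  - hard decisions
print(sigmoid(z))  # ≈ [0.119 0.5 0.881] - graded confidence
```

<p>The smoothness of the sigmoid is what makes it differentiable, and therefore usable with gradient descent, unlike the discontinuous step function.</p>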
<h3 id="heading-how-the-loss-function-works">How the Loss Function Works</h3>
<p>The <strong>loss function</strong> is a crucial concept in machine learning that quantifies the error or discrepancy between the model's predictions and the actual target values.</p>
<p>Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model's parameters in a way that minimizes this error and improves performance.</p>
<p>In a binary classification task, the model may adopt the <strong>hinge loss function</strong> to penalize misclassifications by incurring an additional cost for incorrect predictions:</p>
<p>$$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$</p><p>(h(x): prediction score, y: true label, encoded as −1 or +1)</p>
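<p>A few worked values make the hinge loss concrete. Note that it assumes labels encoded as −1/+1, and that it penalizes not only misclassifications but also correct predictions that fall inside the margin:</p>

```python
def hinge_loss(y_true, y_score):
    """Hinge loss; assumes labels are encoded as -1 or +1."""
    return max(0.0, 1.0 - y_true * y_score)

print(hinge_loss(+1, 2.5))   # 0.0 - correct and confidently beyond the margin
print(hinge_loss(+1, 0.4))   # 0.6 - correct but inside the margin, still penalized
print(hinge_loss(+1, -1.0))  # 2.0 - misclassified, large penalty
```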
<h2 id="heading-how-to-build-a-single-layered-classifier">How to Build a Single-Layered Classifier</h2>
<p>Now, let’s build a simple single-layer perceptron for binary classification.</p>
<h3 id="heading-1-custom-classifier">1. Custom Classifier</h3>
<h4 id="heading-initialize-the-classifier">Initialize the classifier</h4>
<p>We’ll first initialize the classifier with <code>weights</code>, <code>bias</code>, the number of epochs (<code>n_iterations</code>), and <code>learning_rate</code>.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
    self.learning_rate = learning_rate
    self.n_iterations = n_iterations
    self.weights = <span class="hljs-literal">None</span>
    self.bias = <span class="hljs-literal">None</span>
</code></pre>
<h4 id="heading-define-the-activation-function">Define the activation function</h4>
<p>Use a step function that returns zero if input (x) ≤ 0, else 1. By default, the <code>threshold</code> is set to zero.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
     <span class="hljs-keyword">return</span> np.where(x &gt; threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
</code></pre>
<h4 id="heading-train-the-model">Train the model</h4>
<p>Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: <code>weights</code> and <code>bias</code>.</p>
<p>This process is controlled by a specified number of training epochs defined by <code>n_iterations</code>.</p>
<p>In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined <code>learning_rate</code>.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
    n_samples, n_features = X.shape

    self.weights = np.zeros(n_features)
    self.bias = <span class="hljs-number">0</span>

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
            <span class="hljs-comment"># compute weighted sum (z)</span>
            z = np.dot(X[i], self.weights) + self.bias

            <span class="hljs-comment"># apply the activation function</span>
            y_pred = self._step_function(z)

            <span class="hljs-comment"># update weights and bias</span>
            self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
            self.bias += self.learning_rate * (y[i] - y_pred)
</code></pre>
<h4 id="heading-how-the-weights-work-in-the-iteration-loop">How the weights work in the iteration loop</h4>
<p>The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.</p>
<p>Its iterative update in the <code>for</code> loop aims to reduce classification errors such that:</p>
<p>$$\begin {align*} w_j &amp;:= w_j + \Delta w_j \\ &amp; := w_j + \eta (y_i - \hat y_i)x_{ij} \\ &amp;= \begin{cases} w_j &amp;\text{(a) } y_i - \hat y_i = 0\\ w_j + \eta x_{ij} &amp;\text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} &amp;\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$</p><p>(<code>w_j</code>: j-th weight, <code>η</code>: learning rate, <code>(y_i − ŷ_i)</code>: error)</p>
<p>This means that:</p>
<ol>
<li><p>When the prediction is <strong>correct</strong>, the error is zero, so the weight is unchanged.</p>
</li>
<li><p>When the prediction is <strong>too low</strong> (<code>y_i = 1</code> and <code>ŷ_i = 0</code>), the weight is adjusted in the same direction as the input to increase the weighted sum.</p>
</li>
<li><p>When the prediction is <strong>too high</strong> (<code>y_i = 0</code> and <code>ŷ_i = 1</code>), the weight is adjusted in the opposite direction to pull the weighted sum lower.</p>
</li>
</ol>
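<p>To make the three cases concrete, here is a minimal sketch, with hypothetical numbers, of a single update in case (b), where the prediction is too low:</p>
<pre><code class="lang-python">import numpy as np

# one perceptron update for a single sample (case (b): y_i = 1, y_hat = 0)
eta = 0.1                            # learning rate
w = np.array([0.2, -0.5])            # current weights (hypothetical)
b = 0.0                              # current bias
x_i, y_i = np.array([1.0, 2.0]), 1   # one training sample

z = x_i @ w + b                      # weighted sum: 0.2 - 1.0 = -0.8
y_hat = int(np.heaviside(z, 0))      # step function gives 0: prediction too low
w = w + eta * (y_i - y_hat) * x_i    # becomes [0.3, -0.3]
b = b + eta * (y_i - y_hat)          # becomes 0.1
print(w, b)
</code></pre>
<p>With an error of +1, each weight moves by <code>η·x_ij</code> in the direction of the input, which is exactly case (b) above.</p>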
<h4 id="heading-how-the-bias-terms-work-in-the-iteration-loop">How the bias terms work in the iteration loop</h4>
<p>The bias determines the decision boundary’s intercept (position from the origin).</p>
<p>Similar to the weights, we adjust the bias term at each update to position the decision boundary:</p>
<p>$$\begin {align*} b &amp;:= b + \Delta b \\ &amp; := b + \eta (y_i - \hat y_i) \\ &amp;= \begin{cases} b &amp;\text{(a) } y_i - \hat y_i = 0\\ b + \eta &amp;\text{(b) } y_i - \hat y_i = 1 \\ b - \eta &amp;\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$</p><p>This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.</p>
<h4 id="heading-make-a-prediction">Make a prediction</h4>
<p>Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
      linear_output = np.dot(X, self.weights) + self.bias
      predictions = self._step_function(linear_output)
      <span class="hljs-keyword">return</span> predictions
</code></pre>
<p><strong>The entire classifier looks like this:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Perceptron</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = <span class="hljs-literal">None</span>
        self.bias = <span class="hljs-literal">None</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
        <span class="hljs-keyword">return</span> np.where(x &gt; threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
                linear_output = np.dot(X[i], self.weights) + self.bias
                y_pred = self._step_function(linear_output)
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
        <span class="hljs-keyword">return</span> self

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
        linear_output = np.dot(X, self.weights) + self.bias
        y_pred = self._step_function(linear_output)
        <span class="hljs-keyword">return</span> y_pred
</code></pre>
<h4 id="heading-simulate-with-synthetic-datasets">Simulate with synthetic datasets</h4>
<p>First, we generate a synthetic, linearly separable dataset using <code>make_blobs</code>, train the classifier we created, and then compute its decision boundary.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> make_blobs
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># create a mock dataset</span>
X, y = make_blobs(n_features=<span class="hljs-number">2</span>, centers=<span class="hljs-number">2</span>, n_samples=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">12</span>)

<span class="hljs-comment"># split</span>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># train the model</span>
perceptron = Perceptron(learning_rate=<span class="hljs-number">0.1</span>, n_iterations=<span class="hljs-number">1000</span>).fit(X_train, y_train)

<span class="hljs-comment"># make a prediction</span>
y_pred_train = perceptron.predict(X_train)
y_pred_test = perceptron.predict(X_test)

<span class="hljs-comment"># evaluate the results</span>
acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(<span class="hljs-string">f"Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>}</span> \nAccuracy (Test): <span class="hljs-subst">{acc_test:<span class="hljs-number">.3</span>}</span>"</span>)
</code></pre>
<h4 id="heading-results">Results</h4>
<p>The classifier generated a clear, highly accurate linear decision boundary.</p>
<ul>
<li><p><em>Accuracy (Train): 0.981</em></p>
</li>
<li><p><em>Accuracy (Test): 0.975</em></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440470195/0a01c5ad-124e-4f59-b4d5-9ee5dd5b23ce.png" alt="Decision boundary of single-layer perceptron (Custom classifier)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-2-leverage-sckitlearns-mcp-classifier">2. Leverage Scikit-Learn’s MLPClassifier</h3>
<p>For our convenience, we’ll use scikit-learn’s built-in classifier (<code>MLPClassifier</code>) to build a similar, yet more robust classifier:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(), <span class="hljs-comment"># intentionally set empty to create a single layer perceptron</span>
    activation=<span class="hljs-string">'logistic'</span>, <span class="hljs-comment"># choosing a sigmoid function as an activation function</span>
    solver=<span class="hljs-string">'sgd'</span>, <span class="hljs-comment"># choosing SGD optimizer</span>
    max_iter=<span class="hljs-number">1000</span>,
    random_state=<span class="hljs-number">42</span>, 
    learning_rate=<span class="hljs-string">'constant'</span>, 
    learning_rate_init=<span class="hljs-number">0.1</span>
).fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(<span class="hljs-string">f"MLPClassifier\nAccuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>}</span> \nAccuracy (Test): <span class="hljs-subst">{acc_test:<span class="hljs-number">.3</span>}</span>"</span>)
</code></pre>
<h4 id="heading-results-1">Results</h4>
<p>The MLPClassifier generated a clear linear decision boundary with slightly better accuracy scores.</p>
<ul>
<li><p><em>Accuracy (Train): 0.985</em></p>
</li>
<li><p><em>Accuracy (Test): 0.995</em></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440118956/f5391f47-711a-4948-b956-1a76dbd7ca92.png" alt="Decision boundary of single-layer perceptron (MLPClassifier)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-limitations-of-single-layer-perceptrons">Limitations of Single-Layer Perceptrons</h3>
<p>Now, let’s talk about the key differences between the MLPClassifier and our custom single-layer perceptron.</p>
<p>Unlike more general neural networks, single-layer perceptrons use a <strong>step function</strong> as their activation.</p>
<p>Due to its discontinuity at x=0, the step function is not differentiable over its entire domain (−∞ to ∞).</p>
<p>This fundamental property precludes the use of <strong>gradient-based optimization algorithms</strong> such as SGD or Adam, as these methods depend on computing gradients (partial derivatives of the cost function).</p>
<p>In contrast, most neural networks employ differentiable activation functions (for example, <strong>sigmoid</strong>, <strong>ReLU</strong>) and loss functions (for example, <strong>MSE</strong>, <strong>Cross-Entropy</strong>) for effective optimization.</p>
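<p>A quick numerical check illustrates the contrast: the step function’s derivative is zero everywhere it is defined and so carries no gradient signal, while sigmoid and ReLU provide usable gradients (a minimal sketch):</p>
<pre><code class="lang-python">import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)            # smooth and nonzero everywhere

def relu_grad(z):
    return np.heaviside(z, 0.0)   # 1 for positive z, 0 otherwise

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(sigmoid_grad(z))   # all entries strictly positive
print(relu_grad(z))      # [0. 0. 1. 1.]
</code></pre>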
<p>Other challenges of a single-layer perceptron include:</p>
<ul>
<li><p><strong>Limited to linear separability:</strong> Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.</p>
</li>
<li><p><strong>Lack of depth:</strong> Being single-layered, they cannot learn complex hierarchical representations.</p>
</li>
<li><p><strong>Limited optimizer options:</strong> As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.</p>
</li>
</ul>
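<p>The linear-separability limitation is easiest to see on XOR, the classic counter-example. A compact re-implementation of the same update rule (a sketch, independent of the class above) can never reach perfect accuracy, because no linear boundary separates the XOR classes:</p>
<pre><code class="lang-python">import numpy as np

# minimal perceptron using the same update rule as the custom classifier
def fit_perceptron(X, y, lr=0.1, n_iter=100):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        for i in range(len(X)):
            pred = np.heaviside(X[i] @ w + b, 0)   # step activation
            w = w + lr * (y[i] - pred) * X[i]
            b = b + lr * (y[i] - pred)
    return w, b

# XOR: not linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

w, b = fit_perceptron(X, y)
preds = np.heaviside(X @ w + b, 0)
print("XOR accuracy:", np.mean(preds == y))   # never reaches 1.0
</code></pre>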
<p>So, in the next section, you’ll learn about multi-layered perceptrons to overcome the disadvantages.</p>
<h2 id="heading-what-is-a-multi-layer-perceptron">What is a Multi-Layer Perceptron?</h2>
<p>An MLP is a class of feedforward artificial neural networks that consists of at least <strong>three layers</strong> of nodes:</p>
<ul>
<li><p>an input layer,</p>
</li>
<li><p>one or more hidden layers, and</p>
</li>
<li><p>an output layer.</p>
</li>
</ul>
<p>Except for the input nodes, each node is a neuron that uses a <strong>nonlinear</strong> activation function.​</p>
<p>MLPs are applied to both classification and regression tasks:</p>
<ul>
<li><p><strong>Classification tasks:</strong> MLPs are widely used for classification problems, such as handwriting recognition and speech recognition.​</p>
</li>
<li><p><strong>Regression analysis:</strong> They are also applied in regression problems where the relationship between input and output is complex.​</p>
</li>
</ul>
<h2 id="heading-how-to-build-multi-layered-perceptrons">How to Build Multi-Layered Perceptrons</h2>
<p>Let’s handle a binary classification task using a standard MLP architecture.</p>
<h3 id="heading-outline-of-the-project">Outline of the Project</h3>
<h4 id="heading-objective">Objective</h4>
<ul>
<li>Detect fraudulent transactions</li>
</ul>
<h4 id="heading-evaluation-metrics">Evaluation Metrics</h4>
<ul>
<li><p>Considering the cost of misclassification, we’ll prioritize improving <strong>Recall</strong> and <strong>Precision scores</strong></p>
</li>
<li><p>Then check the overall correctness of classification with the <strong>Accuracy</strong> score ((TP + TN) / (TP + TN + FP + FN))</p>
</li>
</ul>
<p><strong>Cost of Misclassification (from high to low):</strong></p>
<ul>
<li><p><strong>False Negative (FN):</strong> The model incorrectly identifies a fraudulent transaction as legitimate (Missing actual fraud)</p>
</li>
<li><p><strong>False Positive (FP):</strong> The model incorrectly identifies a legitimate transaction as fraudulent (Blocking legitimate customers.)</p>
</li>
<li><p><strong>True Positive (TP):</strong> The model correctly identifies a fraudulent transaction as fraud.</p>
</li>
<li><p><strong>True Negative (TN):</strong>  The model correctly identifies a non-fraudulent transaction as non-fraud.</p>
</li>
</ul>
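<p>Given those four outcomes, the metrics follow directly from the counts. A minimal sketch with hypothetical labels (1 = fraud, 0 = non-fraud):</p>
<pre><code class="lang-python">import numpy as np

# hypothetical ground truth and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum(y_true * y_pred)              # 3: fraud caught
fn = np.sum(y_true * (1 - y_pred))        # 1: fraud missed (most costly)
fp = np.sum((1 - y_true) * y_pred)        # 1: legitimate customer blocked
tn = np.sum((1 - y_true) * (1 - y_pred))  # 3: legitimate passed through

recall = tp / (tp + fn)                   # 0.75
precision = tp / (tp + fp)                # 0.75
accuracy = (tp + tn) / len(y_true)        # 0.75
print(recall, precision, accuracy)
</code></pre>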
<h3 id="heading-planning-an-mlp-architecture">Planning an MLP Architecture</h3>
<p>In the network, 19 input features feed into the first hidden layer’s 30 neurons, which use a ReLU activation function.</p>
<p>Their outputs are then passed to the second layer, and the final output is produced by a sigmoid activation.</p>
<p>During optimization, we’ll let the optimizer (SGD or Adam) perform forward and backward passes to adjust the parameters.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440761512/37753a4c-f7f8-44bc-bea9-c50360830456.png" alt="Standard MLP Architecture for Binary Classification Tasks" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using <a target="_blank" href="https://www.researchgate.net/publication/355148120_SS-MLP_A_Novel_Spectral-Spatial_MLP_Architecture_for_Hyperspectral_Image_Classification">image source</a>)</p>
<p>Especially in deeper networks, <strong>ReLU</strong> is advantageous in preventing <a target="_blank" href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem#:~:text=In%20machine%20learning%2C%20the%20vanishing,derivative%20of%20the%20loss%20function">vanishing gradient problems</a>, where gradients become extremely small as they are backpropagated from the output layers.</p>
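<p>The effect is easy to quantify: the sigmoid’s derivative never exceeds 0.25, so gradients backpropagated through many sigmoid layers shrink multiplicatively with depth, while the ReLU derivative stays at 1 on the active side (a minimal sketch):</p>
<pre><code class="lang-python"># sigmoid'(z) peaks at 0.25, so the gradient signal decays with depth;
# ReLU'(z) = 1 for positive z, so its product stays at 1
max_sigmoid_grad = 0.25
for depth in (5, 10, 20):
    print(depth, max_sigmoid_grad ** depth)
</code></pre>
<p>Even the best-case sigmoid gradient is below 1e-12 after 20 layers, which is why deep networks favor ReLU.</p>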
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440797954/ba19bf66-cdb9-4bfb-9b92-e1e3f72e9fc7.png" alt="Comparison of major activation functions: From left to right: Sigmoid, Tanh, ReLU" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><a target="_blank" href="https://medium.com/data-science-collective/a-comprehensive-guide-on-neural-network-in-deep-learning-442ba9f1f0e5">Learn More: A Comprehensive Guide on Neural Network in Deep Learning</a></p>
<h3 id="heading-preprocessing-the-datasets">Preprocessing the Datasets</h3>
<p>First, we consolidate <a target="_blank" href="https://www.kaggle.com/datasets/computingvictor/transactions-fraud-datasets">three datasets  –  transaction, customer, and credit card</a>  –  into a single DataFrame, independently sanitizing numerical and categorical data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler, OneHotEncoder
<span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
<span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline

<span class="hljs-comment"># download the raw data to local</span>
<span class="hljs-keyword">import</span> kagglehub
path = kagglehub.dataset_download(<span class="hljs-string">"computingvictor/transactions-fraud-datasets"</span>)
dir = <span class="hljs-string">f'<span class="hljs-subst">{path}</span>/gd_card_flaud_demo'</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">sanitize_df</span>(<span class="hljs-params">amount_str</span>):</span>
    <span class="hljs-string">"""Removes '$' and converts the string to a float."""</span>
    <span class="hljs-keyword">if</span> isinstance(amount_str, str):
        <span class="hljs-keyword">return</span> float(amount_str.replace(<span class="hljs-string">'$'</span>, <span class="hljs-string">''</span>))
    <span class="hljs-keyword">return</span> amount_str

<span class="hljs-comment"># load transaction data</span>
trx_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/transactions_data.csv'</span>)

<span class="hljs-comment"># sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)</span>
trx_df = trx_df[trx_df[<span class="hljs-string">'errors'</span>].isna()]
trx_df = trx_df.drop(columns=[<span class="hljs-string">'merchant_city'</span>,<span class="hljs-string">'merchant_state'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'mcc'</span>, <span class="hljs-string">'errors'</span>], axis=<span class="hljs-string">'columns'</span>)
trx_df[<span class="hljs-string">'amount'</span>] = trx_df[<span class="hljs-string">'amount'</span>].apply(sanitize_df)

<span class="hljs-comment"># merge the dataframe with fraud transaction flag.</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/train_fraud_labels.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> fp:
    fraud_labels_json = json.load(fp=fp)

fraud_labels_dict = fraud_labels_json.get(<span class="hljs-string">'target'</span>, {})
fraud_labels_series = pd.Series(fraud_labels_dict, name=<span class="hljs-string">'is_fraud'</span>)
fraud_labels_series.index = fraud_labels_series.index.astype(int) <span class="hljs-comment"># convert the datatype from string to integer</span>
merged_df = pd.merge(trx_df, fraud_labels_series, left_on=<span class="hljs-string">'id'</span>, right_index=<span class="hljs-literal">True</span>, how=<span class="hljs-string">'left'</span>)
merged_df.fillna({<span class="hljs-string">'is_fraud'</span>: <span class="hljs-string">'No'</span>}, inplace=<span class="hljs-literal">True</span>)
merged_df[<span class="hljs-string">'is_fraud'</span>] = merged_df[<span class="hljs-string">'is_fraud'</span>].map({<span class="hljs-string">'Yes'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'No'</span>: <span class="hljs-number">0</span>})

<span class="hljs-comment"># load card data</span>
card_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/cards_data.csv'</span>)
card_df = card_df.drop(columns=[<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'acct_open_date'</span>, <span class="hljs-string">'card_number'</span>, <span class="hljs-string">'expires'</span>, <span class="hljs-string">'cvv'</span>], axis=<span class="hljs-string">'columns'</span>)
card_df[<span class="hljs-string">'credit_limit'</span>] = card_df[<span class="hljs-string">'credit_limit'</span>].apply(sanitize_df)

<span class="hljs-comment"># merge transaction and card data</span>
merged_df = pd.merge(left=merged_df, right=card_df, left_on=<span class="hljs-string">'card_id'</span>, right_on=<span class="hljs-string">'id'</span>, how=<span class="hljs-string">'inner'</span>)
merged_df = merged_df.drop(columns=[<span class="hljs-string">'id_y'</span>, <span class="hljs-string">'card_id'</span>], axis=<span class="hljs-string">'columns'</span>)

<span class="hljs-comment"># converts categorical variables into a new binary column (0 or 1)</span>
categorical_cols = merged_df.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=<span class="hljs-literal">False</span>, dtype=float) 
df = df.dropna().drop([<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'id_x'</span>], axis=<span class="hljs-number">1</span>)
print(<span class="hljs-string">'\nDataFrame: \n'</span>, df.head(n=<span class="hljs-number">3</span>))
</code></pre>
<p>DataFrame:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440856826/ba79bdaf-e0a1-457f-ab19-fda3e0f08141.png" alt="Base DataFrame" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Our DataFrame shows an extremely <strong>skewed data distribution</strong> with:</p>
<ul>
<li><p>Fraud samples: 1,191</p>
</li>
<li><p>Non-fraud samples: 11,477,397</p>
</li>
</ul>
<p>For classification tasks, <strong>it's crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact</strong> on classification model performance, especially regarding the minority class.</p>
<p>For our data, we’ll:</p>
<ol>
<li><p>split the 1,191 fraud samples into training, validation, and test sets,</p>
</li>
<li><p>add an equal number of randomly chosen non-fraud samples from the DataFrame, and</p>
</li>
<li><p>adjust split balances later if generalization challenges arise.</p>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-comment"># separate fraud and non-fraud samples</span>
df_fraud = df[df[<span class="hljs-string">'is_fraud'</span>] == <span class="hljs-number">1</span>]
df_non_fraud = df[df[<span class="hljs-string">'is_fraud'</span>] == <span class="hljs-number">0</span>]

<span class="hljs-comment"># define the desired size of the fraud samples for the validation and test sets</span>
val_size_per_class = <span class="hljs-number">200</span>
test_size_per_class = <span class="hljs-number">200</span>

<span class="hljs-comment"># create test sets</span>
X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)
X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># combine to form the balanced test set</span>
X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_test = X_test[<span class="hljs-string">'is_fraud'</span>]
X_test = X_test.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)

<span class="hljs-comment"># remove sampled rows from the original dataframes to avoid data leakage</span>
df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)


<span class="hljs-comment"># create validation sets</span>
X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)
X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># combine to form the balanced validation set</span>
X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_val = X_val[<span class="hljs-string">'is_fraud'</span>]
X_val = X_val.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)

<span class="hljs-comment"># remove sampled rows from the remaining dataframes</span>
df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)


<span class="hljs-comment"># create training sets</span>
min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))

X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)
X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)

X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_train = X_train[<span class="hljs-string">'is_fraud'</span>]
X_train = X_train.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)


print(<span class="hljs-string">"\n--- Final Dataset Shapes and Distributions ---"</span>)
print(<span class="hljs-string">f"X_train shape: <span class="hljs-subst">{X_train.shape}</span>, y_train distribution: <span class="hljs-subst">{np.unique(y_train, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
print(<span class="hljs-string">f"X_val shape: <span class="hljs-subst">{X_val.shape}</span>, y_val distribution: <span class="hljs-subst">{np.unique(y_val, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
print(<span class="hljs-string">f"X_test shape: <span class="hljs-subst">{X_test.shape}</span>, y_test distribution: <span class="hljs-subst">{np.unique(y_test, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
</code></pre>
<p>After the operation, we secured 1,582 training, 400 validation, and 400 test samples, each dataset maintaining a <strong>50:50 split between fraud and non-fraud transactions</strong>:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*IZtK3l0hSqmkOrm9h_d9Jw.png" alt="X, y datasets shape" width="600" height="400" loading="lazy"></p>
<p>Considering the high dimensional feature space with 19 input features, we’ll apply <strong>SMOTE</strong> to resample the training data (SMOTE should not be applied to validation or test sets to avoid data leakage):</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> imblearn.over_sampling <span class="hljs-keyword">import</span> SMOTE
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter

train_target = <span class="hljs-number">2000</span>

smote_train = SMOTE(
  sampling_strategy={<span class="hljs-number">0</span>: train_target, <span class="hljs-number">1</span>: train_target},  <span class="hljs-comment"># increase sample size to 2,000</span>
  random_state=<span class="hljs-number">12</span>
)
X_train, y_train = smote_train.fit_resample(X_train, y_train)

print(<span class="hljs-string">f"\nAfter SMOTE with custom sampling_strategy (target train: <span class="hljs-subst">{train_target}</span>):"</span>)
print(<span class="hljs-string">f"X_train_oversampled shape: <span class="hljs-subst">{X_train.shape}</span>"</span>)
print(<span class="hljs-string">f"y_train_oversampled distribution: <span class="hljs-subst">{Counter(y_train)}</span>"</span>)
</code></pre>
<p>We’ve secured 4,000 training samples, maintaining a 50:50 split between fraud and non-fraud transactions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440986995/ed079321-3972-4226-b1a8-244010445162.png" alt="Training sample shape after SMOTE" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Lastly, we’ll apply <strong>column transformers</strong> to numerical and categorical features separately.</p>
<p>Column transformers are advantageous in handling datasets with multiple data types, as they can apply different transformations to different subsets of columns while preventing data leakage.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
<span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline

categorical_features = X_train.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns.tolist()
categorical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'most_frequent'</span>)),(<span class="hljs-string">'onehot'</span>, OneHotEncoder(handle_unknown=<span class="hljs-string">'ignore'</span>))])

numerical_features = X_train.select_dtypes(include=[<span class="hljs-string">'int64'</span>, <span class="hljs-string">'float64'</span>]).columns.tolist()
numerical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)), (<span class="hljs-string">'scaler'</span>, StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        (<span class="hljs-string">'num'</span>, numerical_transformer, numerical_features),
        (<span class="hljs-string">'cat'</span>, categorical_transformer, categorical_features)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)
</code></pre>
<h2 id="heading-understanding-optimizers">Understanding Optimizers</h2>
<p>In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training. Its primary role is to minimize the model’s loss function, enhancing performance.</p>
<p>Various optimization algorithms, known as optimizers, employ distinct strategies to converge efficiently toward optimal parameters and improve predictions.</p>
<p>In this article, we’ll use the SGD Optimizer and Adam Optimizer.</p>
<h3 id="heading-1-how-a-sgd-stochastic-gradient-descent-optimizer-works">1. How an SGD (Stochastic Gradient Descent) Optimizer Works</h3>
<p>SGD is a major optimization algorithm that computes the gradient (the partial derivatives of the cost function) on a small mini-batch of examples at each update step:</p>
<p>$$\begin{align*} w_j &amp;:= w_j - \eta \frac {\partial J} {\partial w_j} \\ \\ b &amp;:= b - \eta \frac {\partial J} {\partial b} \end{align*}$$</p><p>(w: weight, b: bias, J: cost function, <em>η</em>: learning rate)</p>
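<p>The update rule can be sketched on a one-dimensional toy objective (a full-batch sketch with a hypothetical cost J(w) = (w − 3)², whose gradient is 2(w − 3)):</p>
<pre><code class="lang-python"># gradient descent on J(w) = (w - 3)^2, with dJ/dw = 2 * (w - 3)
eta = 0.1     # learning rate
w = 0.0       # initial weight
for _ in range(100):
    w = w - eta * 2 * (w - 3)   # w := w - eta * dJ/dw
print(round(w, 4))   # converges to 3.0, the minimizer
</code></pre>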
<p>In binary classification, the cost function (J) is defined with a sigmoid function (σ(z)), where z is the weighted sum of the inputs plus the bias term:</p>
<p>$$\begin{align*} J(y, \hat y) &amp;=−[y log(\hat y) + (1-y)log(1-\hat y)] \\ \\ \hat y &amp;= \sigma (z) = \frac {1} {1+e^{-z}} \\ \\ z &amp;= \sum_{i=1}^m w_i x_i + b \end {align*}$$</p><h3 id="heading-2-how-adam-adaptive-moment-estimation-optimizer-works">2. How Adam (Adaptive Moment Estimation) Optimizer Works</h3>
<p>Adam is an optimization algorithm that computes <strong>individual adaptive learning rates</strong> for different parameters from estimates of first and second moments of the gradients.</p>
<p>Adam optimizer combines the advantages of <a target="_blank" href="https://keras.io/api/optimizers/rmsprop/"><strong>RMSprop</strong></a> (using squared gradients to scale the learning rate) and <a target="_blank" href="https://optimization.cbe.cornell.edu/index.php?title=Momentum"><strong>Momentum</strong></a> (using past gradients to accelerate convergence):</p>
<p>$$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$</p><p>where:</p>
<ul>
<li><p><code>α</code>: The learning rate (default is 0.001)</p>
</li>
<li><p><code>ϵ</code>: A small positive constant used to avoid division by zero</p>
</li>
<li><p><code>m^</code>: First moment (mean) estimate with a bias correction, leveraging <strong>Momentum</strong>:</p>
</li>
</ul>
<p>$$\begin{align*} \hat m_t &amp;= \frac {m_t} {1 - \beta_1^t} \\ \\ m_t &amp;= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{ \frac {\partial L} {\partial w_t}}_{\text{gradient}} \end{align*}$$</p><p>(<code>β1</code>: <strong>decay rate</strong>, typically set to β1 = 0.9)</p>
<p><code>v^</code>: Second moment (variance) estimate with a bias correction, leveraging <strong>RMSprop</strong>:</p>
<p>$$\begin{align*} \hat v_t &amp;= \frac {v_t} {1 - \beta_2^t} \\ \\ v_t &amp;=\beta_2 v_{t-1} + (1- \beta_2) (\frac {\partial L} {\partial w_t})^2 \end {align*}$$</p><p>(<code>β2</code>: <strong>decay rate</strong>, typically set to β2 = 0.999)</p>
<p>Since both <code>m</code> and <code>v</code> are initialized at zero, Adam computes the bias-corrected estimates to prevent them from being biased toward zero.</p>
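<p>Putting these pieces together, one Adam update per step can be sketched for a single parameter with the default decay rates (a minimal sketch; the toy objective J(w) = w², with gradient 2w, and the learning rate 0.05 are assumptions for illustration):</p>
<pre><code class="lang-python">import numpy as np

def adam_step(w, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first moment (Momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (RMSprop)
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)             # bias correction
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# minimize J(w) = w^2 (gradient 2w), starting from w = 5
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):                  # t starts at 1 for bias correction
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(round(w, 2))   # close to the minimizer at 0
</code></pre>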
<p>Learn More: <a target="_blank" href="https://medium.com/@kuriko-iwai/a-comprehensive-guide-on-neural-network-in-deep-learning-9c795a1f1648">A Comprehensive Guide on Neural Network in Deep Learning</a></p>
<h2 id="heading-how-to-build-an-mlp-classifier-with-sgd-optimizer">How to Build an MLP Classifier with SGD Optimizer</h2>
<h3 id="heading-custom-classifier">Custom Classifier</h3>
<p>This process involves a <strong>forward pass</strong> and <strong>backpropagation</strong>, during which SGD computes optimal weights and biases using gradients:</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
    <span class="hljs-comment"># SGD starts with randomly selected mini-batch for the epoch</span>
    X_batch = X_shuffled[i : i + self.batch_size]
    y_batch = y_shuffled[i : i + self.batch_size]

    <span class="hljs-comment"># A. forward pass</span>
    activations, zs = self._forward_pass(X_batch)
    y_pred = activations[<span class="hljs-number">-1</span>]  <span class="hljs-comment"># final output of the network</span>

    <span class="hljs-comment"># B. backpropagation</span>
    <span class="hljs-comment"># 1) calculate gradients for the output layer</span>
    delta = y_pred - y_batch
    dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
    db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

    <span class="hljs-comment"># 2) update output layer parameters</span>
    self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
    self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db

    <span class="hljs-comment"># 3) iterate backward from last hidden layer to the input layer</span>
    <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
        delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
        dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
        db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

        self.weights[l] -= self.learning_rate * dW
        self.biases[l] -= self.learning_rate * db
</code></pre>
<p>During the forward pass, each layer computes a weighted sum of its inputs plus a bias (z), applies the ReLU activation in the hidden layers, and the output layer produces the predicted probability (y_pred) through a sigmoid function.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
    activations = [X]
    zs = []

    <span class="hljs-comment"># forward through hidden layers</span>
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
        z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
        zs.append(z)
        a = self._relu(z) <span class="hljs-comment"># using ReLU for hidden layers</span>
        activations.append(a)

    <span class="hljs-comment"># forward through output layer</span>
    z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
    zs.append(z_output)

    <span class="hljs-comment"># computes the final output using sigmoid function</span>
    y_pred = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(z_output, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))
    activations.append(y_pred)
    <span class="hljs-keyword">return</span> activations, zs
</code></pre>
<p>So the final classifier looks like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_SGD</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.01</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span></span>):</span>
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.weights = []
        self.biases = []
        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        s = self._sigmoid(x)
        <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> (x &gt; <span class="hljs-number">0</span>).astype(float)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]
        self.weights = []
        self.biases = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
            limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))
            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
        activations = [X]
        zs = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
            z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        <span class="hljs-keyword">return</span> activations, zs

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
        y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
        loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
        <span class="hljs-keyword">return</span> loss

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
        X = np.asarray(X)
        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
        self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
        self.loss_history.append(initial_loss)

        <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
            <span class="hljs-comment"># shuffle datasets</span>
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            <span class="hljs-comment"># mini-batch loop</span>
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[<span class="hljs-number">-1</span>]

                delta = y_pred - y_batch
                dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
                self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
                self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db

                <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                    delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                    dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                    db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                    self.weights[l] -= self.learning_rate * dW
                    self.biases[l] -= self.learning_rate * db

            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(epoch_loss)

            <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
        <span class="hljs-keyword">return</span> self

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
        activations, _ = self._forward_pass(X)
        <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
        probabilities = self.predict_proba(X)
        <span class="hljs-keyword">return</span> (probabilities &gt;= threshold).astype(int).flatten() <span class="hljs-comment"># for 1D output</span>
</code></pre>
<h3 id="heading-training-prediction">Training / Prediction</h3>
<p>Train the model and make a prediction using training and validation datasets:</p>
<pre><code class="lang-python"><span class="hljs-comment"># 1. define the model</span>
mlp_sgd = MLP_SGD(
  hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>, ), <span class="hljs-comment"># 2 hidden layers with 30 neurons each</span>
  learning_rate=<span class="hljs-number">0.001</span>,           <span class="hljs-comment"># a step size</span>
  n_epochs=<span class="hljs-number">1000</span>,                 <span class="hljs-comment"># number of epochs</span>
  batch_size=<span class="hljs-number">32</span>                  <span class="hljs-comment"># mini-batch size</span>
)

<span class="hljs-comment"># 2. train the model</span>
mlp_sgd.fit(X_train_processed, y_train)

<span class="hljs-comment"># 3. make a prediction with training and validation datasets</span>
y_pred_train = mlp_sgd.predict(X_train_processed)
y_pred_val = mlp_sgd.predict(X_val_processed)

<span class="hljs-comment"># 4. compute evaluation metrics</span>
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score, precision_score, recall_score, f1_score

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)
precision_val = precision_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)
recall_val = recall_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)
f1_val = f1_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)


print(<span class="hljs-string">f"\nMLP (Custom SGD) Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"MLP (Custom SGD) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-results-2">Results</h3>
<ul>
<li><p>Recall: <em>0.7930 — 0.6650 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.7790 — 0.6786 (from training to validation)</em></p>
</li>
</ul>
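<p>As a reminder of how these metrics are derived, recall is TP / (TP + FN) and precision is TP / (TP + FP). A minimal sketch with hypothetical prediction vectors (not the article’s actual results):</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical labels and predictions for an imbalanced binary problem
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 1, 1, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)        # share of actual positives caught
precision = tp / (tp + fp)     # share of flagged positives that are correct
```

<p>Here the model catches 3 of 5 actual positives (recall = 0.6), and 3 of its 4 positive flags are correct (precision = 0.75).</p>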
<p>The model learned and generalized the patterns effectively, achieving a <strong>Recall of 79.3%</strong> on the training set (correctly identifying roughly 80% of fraudulent transactions), with about a 13-point drop in Recall on the validation set.</p>
<p><strong>Loss history:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441103897/088deb38-846d-4026-a706-701be93036ca.png" alt="Loss by epoch, weight history, bias history (Source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>We visualized the <strong>decision boundary</strong> using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442430297/032ee809-1b7e-4bb1-81c0-8715361658a5.png" alt="Image: Decision Boundary of MLP Classifier with SGD optimizer (Source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
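<p>A minimal sketch of how such a plot can be produced, using synthetic data and a stand-in classifier in place of the article’s actual dataset and trained MLP:</p>

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))                    # stand-in for the processed features
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)      # synthetic non-linear labels

# project the features onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# fit any classifier in the 2D space so its boundary can be drawn
clf = LogisticRegression().fit(X_2d, y)

# classify every point on a grid covering the 2D plane
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# to plot: plt.contourf(xx, yy, Z, alpha=0.3); plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
```

<p>To visualize the custom MLP’s boundary instead, replace <code>clf</code> with a model trained on the two PCA components and call its <code>predict</code> on the grid points.</p>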
<h3 id="heading-leverage-sckitlearns-mcp-classifier">Leverage Scikit-learn’s MLPClassifier</h3>
<p>We can use scikit-learn’s <code>MLPClassifier</code> to define a similar model, incorporating:</p>
<ul>
<li><p><strong>Early stopping</strong> using internal validation to prevent overfitting and</p>
</li>
<li><p><strong>L2 regularization</strong> with a small tolerance.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier

<span class="hljs-comment"># define a model</span>
model_sklearn_mlp_sgd = MLPClassifier(
    hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
    activation=<span class="hljs-string">'relu'</span>,
    solver=<span class="hljs-string">'sgd'</span>,
    learning_rate_init=<span class="hljs-number">0.001</span>,
    learning_rate=<span class="hljs-string">'constant'</span>,
    momentum=<span class="hljs-number">0.9</span>,
    nesterovs_momentum=<span class="hljs-literal">True</span>,
    alpha=<span class="hljs-number">0.00001</span>,           <span class="hljs-comment"># L2 regularization strength</span>
    max_iter=<span class="hljs-number">3000</span>,           <span class="hljs-comment"># max epochs (keep it high)</span>
    batch_size=<span class="hljs-number">16</span>,           <span class="hljs-comment"># mini-batch size</span>
    random_state=<span class="hljs-number">42</span>,
    early_stopping=<span class="hljs-literal">True</span>,     <span class="hljs-comment"># apply early stopping</span>
    n_iter_no_change=<span class="hljs-number">50</span>,     <span class="hljs-comment"># stop the iteration if internal validation score doesn't improve for 50 epochs</span>
    validation_fraction=<span class="hljs-number">0.1</span>, <span class="hljs-comment"># proportion of training data for internal validation (default is 0.1)</span>
    tol=<span class="hljs-number">1e-4</span>,                <span class="hljs-comment"># tolerance for optimization</span>
    verbose=<span class="hljs-literal">False</span>,
)

<span class="hljs-comment"># training</span>
model_sklearn_mlp_sgd.fit(X_train_processed, y_train)

<span class="hljs-comment"># make a prediction</span>
y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)
</code></pre>
<h3 id="heading-results-3">Results</h3>
<ul>
<li><p>Recall: <em>0.7830 - 0.6200 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.8208  - 0.6703 (from training to validation)</em></p>
</li>
</ul>
<p>The model showed strong performance during training, achieving a <strong>Recall of 78.3%</strong>, but its performance declined on the validation set.</p>
<p>This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.</p>
<h3 id="heading-leverage-keras-sequential-classifier">Leverage Keras Sequential Classifier</h3>
<p>For the sequential classifier, we can further enhance the classifier by:</p>
<ul>
<li><p>Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (y_train​) to address dataset imbalance and promote faster convergence,</p>
</li>
<li><p>Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,</p>
</li>
<li><p>Including Precision and Recall in the model’s compilation metrics to optimize for classification performance,</p>
</li>
<li><p>Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and</p>
</li>
<li><p>Utilizing a separate validation dataset for monitoring performance during training, which helps detect overfitting and guides hyperparameter tuning.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
<span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
<span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> SGD
<span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
<span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight


<span class="hljs-comment"># calculates an initial bias for the output layer </span>
initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])


<span class="hljs-comment"># defines the model</span>
model_keras_sgd = Sequential([
    Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>), <span class="hljs-comment"># 10% of the neurons in that layer randomly dropped out</span>
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, <span class="hljs-comment"># binary classification</span>
          bias_initializer=tf.keras.initializers.Constant(initial_bias)) <span class="hljs-comment"># to address the imbalanced datasets</span>
])



<span class="hljs-comment"># compiles the model with the SGD optimizer</span>
opt = SGD(learning_rate=<span class="hljs-number">0.001</span>)
model_keras_sgd.compile(
    optimizer=opt, 
    loss=<span class="hljs-string">'binary_crossentropy'</span>,
    metrics=[
        <span class="hljs-string">'accuracy'</span>, <span class="hljs-comment"># add several metrics to return</span>
        tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
        tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
        tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
    ]
)


<span class="hljs-comment"># defines early stopping to prevent overfitting</span>
early_stopping_callback = EarlyStopping(
    monitor=<span class="hljs-string">'val_recall'</span>,  <span class="hljs-comment"># monitor recall </span>
    mode=<span class="hljs-string">'max'</span>,         <span class="hljs-comment"># maximize recall</span>
    patience=<span class="hljs-number">50</span>,        <span class="hljs-comment"># stop after 50 epochs without recall improvement</span>
    min_delta=<span class="hljs-number">1e-4</span>,     <span class="hljs-comment"># minimum change to be considered an improvement (tol)</span>
    verbose=<span class="hljs-number">0</span>
)


<span class="hljs-comment"># compute the class weight</span>
class_weights = class_weight.compute_class_weight(
    class_weight=<span class="hljs-string">'balanced'</span>,
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))


<span class="hljs-comment"># train the model</span>
history = model_keras_sgd.fit(
    X_train_processed, y_train,
    epochs=<span class="hljs-number">1000</span>,
    batch_size=<span class="hljs-number">32</span>,
    validation_data=(X_val_processed, y_val), <span class="hljs-comment"># use our external val set</span>
    callbacks=[early_stopping_callback], <span class="hljs-comment"># early stopping to prevent overfitting</span>
    class_weight=class_weights_dict, <span class="hljs-comment"># penalize misclassifications of the minority class more heavily</span>
    verbose=<span class="hljs-number">0</span>
)

<span class="hljs-comment"># evaluate</span>
loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Train) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)

loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Validation) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)

<span class="hljs-comment"># display model summary</span>
model_keras_sgd.summary()
</code></pre>
<h3 id="heading-results-4">Results</h3>
<ul>
<li><p>Recall: <em>0.7125 — 0.7250 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.7607 — 0.7545 (from training to validation)</em></p>
</li>
</ul>
<p>Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.</p>
<p>It suggests that the regularization techniques are likely effective in preventing significant overfitting.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441165170/4e0528e3-514a-454c-b52a-2a0318ba405a.png" alt="Image: Summary of the Keras Sequential Model with SGD Optimizer" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-build-an-mlp-classifier-with-adam-optimizer">How to Build an MLP Classifier with Adam Optimizer</h2>
<h3 id="heading-custom-classifier-1">Custom Classifier</h3>
<p>Within each mini-batch loop, Adam iteratively updates the weights and biases using bias-corrected moment estimates:</p>
<pre><code class="lang-python"><span class="hljs-comment"># apply Adam updates for output layer parameters</span>
<span class="hljs-comment"># 1) weights (w)</span>
self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

<span class="hljs-comment"># 2) bias (b)</span>
self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
</code></pre>
<p>Following the same forward- and backward-pass logic, we construct the final classifier on top of the <code>MLP_SGD</code> architecture, initializing it with the Adam hyperparameters <code>beta1</code>, <code>beta2</code>, and <code>epsilon</code>:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_Adam</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span>,
                 beta1=<span class="hljs-number">0.9</span>, beta2=<span class="hljs-number">0.999</span>, epsilon=<span class="hljs-number">1e-8</span></span>):</span>
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        self.weights = [] 
        self.biases = []

        <span class="hljs-comment"># Adam optimizer internal states for each parameter (weights and biases)</span>
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        s = self._sigmoid(x)
        <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> (x &gt; <span class="hljs-number">0</span>).astype(float)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]

        self.weights = []
        self.biases = []
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
            limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))

            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))

            self.m_weights.append(np.zeros((fan_in, fan_out)))
            self.v_weights.append(np.zeros((fan_in, fan_out)))
            self.m_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
            self.v_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
        activations = [X]
        zs = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
            z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        <span class="hljs-keyword">return</span> activations, zs

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
        y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
        loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
        <span class="hljs-keyword">return</span> loss

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
        X = np.asarray(X)

        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
        self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
        self.loss_history.append(initial_loss)

        <span class="hljs-comment"># global time step for Adam bias correction</span>
        t = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            <span class="hljs-comment"># Mini-batch loop</span>
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                t += <span class="hljs-number">1</span>

                <span class="hljs-comment"># 1. forward pass</span>
                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[<span class="hljs-number">-1</span>] <span class="hljs-comment"># Output of the network</span>

                <span class="hljs-comment"># 2. backpropagation</span>
                delta = y_pred - y_batch
                grad_w_output = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>] <span class="hljs-comment"># Average over batch</span>
                grad_b_output = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                <span class="hljs-comment"># apply Adam updates to weights</span>
                self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
                self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
                m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                <span class="hljs-comment"># apply Adam updates to bias</span>
                self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
                self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
                m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


                <span class="hljs-comment"># Propagate gradients backward through hidden layers</span>
                <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                    delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                    grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                    grad_b_hidden = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                    <span class="hljs-comment"># apply Adam updates to weights</span>
                    self.m_weights[l] = self.beta1 * self.m_weights[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_hidden
                    self.v_weights[l] = self.beta2 * self.v_weights[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_hidden ** <span class="hljs-number">2</span>)
                    m_w_hat = self.m_weights[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_w_hat = self.v_weights[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                    <span class="hljs-comment"># apply Adam updates to bias</span>
                    self.m_biases[l] = self.beta1 * self.m_biases[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_hidden
                    self.v_biases[l] = self.beta2 * self.v_biases[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_hidden ** <span class="hljs-number">2</span>)
                    m_b_hat = self.m_biases[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_b_hat = self.v_biases[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(epoch_loss)

            <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
        <span class="hljs-keyword">return</span> self


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
        activations, _ = self._forward_pass(X)
        <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
        probabilities = self.predict_proba(X)
        <span class="hljs-keyword">return</span> (probabilities &gt;= threshold).astype(int).flatten()
</code></pre>
<h3 id="heading-training-prediction-1">Training / Prediction</h3>
<p>Train the model and make predictions on the training and validation datasets:</p>
<pre><code class="lang-python">mlp_adam = MLP_Adam(hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">10</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">500</span>, batch_size=<span class="hljs-number">32</span>)
mlp_adam.fit(X_train_processed, y_train)

y_pred_train = mlp_adam.predict(X_train_processed)
y_pred_val = mlp_adam.predict(X_val_processed)

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

print(<span class="hljs-string">f"\nMLP (Custom Adam) Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"MLP (Custom Adam) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-results-5">Results</h3>
<ul>
<li><p>Recall: <em>0.9870–0.6150 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.9811–0.6474 (from training to validation)</em></p>
</li>
</ul>
<p>While the Adam optimizer outperformed SGD, the model exhibited significant overfitting, with both Recall and Precision falling by more than 30 points between training and validation.</p>
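<p>The train/validation gaps in Recall and Precision reported above can be measured with scikit-learn's metric helpers. A minimal sketch (the helper name is illustrative; it assumes the fitted <code>mlp_adam</code> model and the data splits from earlier sections):</p>
<pre><code class="lang-python">from sklearn.metrics import precision_score, recall_score

def report_gap(model, X_train, y_train, X_val, y_val):
    """Compare Recall and Precision on training vs. validation data."""
    pred_train = model.predict(X_train)
    pred_val = model.predict(X_val)
    return {
        "recall_train": recall_score(y_train, pred_train),
        "recall_val": recall_score(y_val, pred_val),
        "precision_train": precision_score(y_train, pred_train),
        "precision_val": precision_score(y_val, pred_val),
    }

# e.g. report_gap(mlp_adam, X_train_processed, y_train, X_val_processed, y_val)
</code></pre>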
<p><strong>Loss History</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442341394/3183a9b1-5df0-4f74-9473-6b5b595dc9c0.png" alt="Loss by epoch, middle: weights history by epoch, right: bias history by epoch (source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442311514/34f004c9-bf1d-41e5-a0af-08c62802b78c.png" alt="Decision Boundary of MLP with Adam Optimizer (source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-leverage-sckitlearns-mcp-classifier-1">Leverage SckitLearn’s MCP Classifier</h3>
<p>We’ve switched the optimizer from SGD to Adam, keeping all other settings constant:</p>
<pre><code class="lang-python">model_sklearn_mlp_adam = MLPClassifier(
    hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
    activation=<span class="hljs-string">'relu'</span>,
    solver=<span class="hljs-string">'adam'</span>,             <span class="hljs-comment"># update the optimizer from SGD to Adam</span>
    learning_rate_init=<span class="hljs-number">0.001</span>,
    learning_rate=<span class="hljs-string">'constant'</span>,
    alpha=<span class="hljs-number">0.0001</span>,
    max_iter=<span class="hljs-number">3000</span>,
    batch_size=<span class="hljs-number">16</span>,
    random_state=<span class="hljs-number">42</span>,
    early_stopping=<span class="hljs-literal">True</span>,
    n_iter_no_change=<span class="hljs-number">50</span>,
    validation_fraction=<span class="hljs-number">0.1</span>,
    tol=<span class="hljs-number">1e-4</span>,
    verbose=<span class="hljs-literal">False</span>,
)

model_sklearn_mlp_adam.fit(X_train_processed, y_train)

y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)
</code></pre>
<h3 id="heading-results-6">Results</h3>
<ul>
<li><p><em>Recall: 0.8975–0.6400 (from training to validation)</em></p>
</li>
<li><p><em>Precision: 0.8864–0.6305 (from training to validation)</em></p>
</li>
</ul>
<p>Despite a performance improvement compared to the SGD optimizer, the significant drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation data indicates that the model is still overfitting.</p>
<h3 id="heading-leverage-keras-sequential-classifier-1">Leverage Keras Sequential Classifier</h3>
<p>Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
<span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
<span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> Adam
<span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
<span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight


initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])
model_keras_adam = Sequential([
    Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>)),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, 
          bias_initializer=tf.keras.initializers.Constant(initial_bias))
])


optimizer_keras = Adam(learning_rate=<span class="hljs-number">0.001</span>)
model_keras_adam.compile(
    optimizer=optimizer_keras, 
    loss=<span class="hljs-string">'binary_crossentropy'</span>, 
    metrics=[
        <span class="hljs-string">'accuracy'</span>,
        tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
        tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
        tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
    ]
)

early_stopping_callback = EarlyStopping(
    monitor=<span class="hljs-string">'val_recall'</span>,
    mode=<span class="hljs-string">'max'</span>,
    patience=<span class="hljs-number">50</span>,
    min_delta=<span class="hljs-number">1e-4</span>,
    verbose=<span class="hljs-number">0</span>
)

class_weights = class_weight.compute_class_weight(
    class_weight=<span class="hljs-string">'balanced'</span>,
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

model_keras_adam.fit(
    X_train_processed, y_train,
    epochs=<span class="hljs-number">1000</span>,
    batch_size=<span class="hljs-number">32</span>,
    validation_data=(X_val_processed, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict,
    verbose=<span class="hljs-number">0</span>
)


loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Train) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)


loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Validation) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)


model_keras_adam.summary()
</code></pre>
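<p>A note on the <code>bias_initializer</code> above: setting the output-layer bias to <code>log(pos / neg)</code> means the sigmoid initially outputs roughly the positive-class base rate, so training starts from sensible probabilities on imbalanced data. A quick check (the class counts here are hypothetical):</p>
<pre><code class="lang-python">import numpy as np

pos, neg = 90, 210                        # hypothetical class counts
b = np.log(pos / neg)                     # same formula as initial_bias above
p0 = 1 / (1 + np.exp(-b))                 # sigmoid(b)
assert np.isclose(p0, pos / (pos + neg))  # equals the base rate, 0.3
</code></pre>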
<h3 id="heading-results-7">Results</h3>
<ul>
<li><p><em>Recall: 0.7995–0.7500 (from training to validation)</em></p>
</li>
<li><p><em>Precision: 0.8409–0.8065 (from training to validation)</em></p>
</li>
</ul>
<p>The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).</p>
<p>This indicates good generalization, with only minor performance degradation on unseen data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441767800/fe43f181-4323-461f-b56a-125fc78e9c84.png" alt="Image: Keras Sequential Model with Adam Optimizer (Source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-final-results-generalization">Final Results: Generalization</h2>
<p>Finally, we’ll evaluate the model’s ultimate performance on the test dataset, which has remained completely separate from all prior training and validation processes.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Custom classifiers</span>
y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

<span class="hljs-comment"># MLPClassifer</span>
y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

<span class="hljs-comment"># Keras Sequential</span>
_, accuracy_val_sgd, precision_val_sgd, recall_val_sgd, auc_val_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=<span class="hljs-number">0</span>)
_, accuracy_val_adam, precision_val_adam, recall_val_adam, auc_val_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=<span class="hljs-number">0</span>)
</code></pre>
<p>Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an <strong>AUPRC (Area Under Precision-Recall Curve) of 0.72.</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748874699534/f0f008c4-9067-4e2a-b070-4bb5cbae8f23.png" alt="Precision-Recall Curves for Six Classifier Models (Comparing Custom, MLP, and Keras Sequential Classifiers with SGD and Adam Optimizers (Source: Kuriko Iwai)" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.</p>
<p>Our findings underscore that effective machine learning hinges on three critical factors:</p>
<ol>
<li><p><strong>robust data preprocessing</strong> (tailored to objectives and data distribution),</p>
</li>
<li><p><strong>judicious model selection</strong>, and</p>
</li>
<li><p><strong>strategic framework or library choices</strong>.</p>
</li>
</ol>
<h3 id="heading-choosing-the-right-framework"><strong>Choosing the right framework</strong></h3>
<p>Generally speaking, choose <code>MLPClassifier</code> when:</p>
<ul>
<li><p>You’re primarily working with <strong>tabular data,</strong></p>
</li>
<li><p>You want to prioritize <strong>simplicity, quick iteration, and seamless integration,</strong></p>
</li>
<li><p>You have simple, shallow architectures, and</p>
</li>
<li><p>You have a moderate dataset size (manageable on a CPU).</p>
</li>
</ul>
<p>Choose Keras <code>Sequential</code> when:</p>
<ul>
<li><p>You’re dealing with <strong>image, text, audio, or other sequential data,</strong></p>
</li>
<li><p>You’re building <strong>deep learning models</strong> such as CNNs, RNNs, LSTMs,</p>
</li>
<li><p>You need <strong>fine-grained control</strong> over the model architecture, training process, or custom components,</p>
</li>
<li><p>You need to leverage <strong>GPU acceleration</strong>,</p>
</li>
<li><p>You’re planning for <strong>production deployment</strong>, and</p>
</li>
<li><p>You want to experiment with more advanced deep learning techniques.</p>
</li>
</ul>
<h3 id="heading-limitation-of-mlps">Limitation of MLPs</h3>
<p>While Multilayer Perceptrons (MLPs) proved valuable, their computational cost and susceptibility to overfitting emerged as key challenges.</p>
<p>Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.</p>
<p>You can find more info about me on my <a target="_blank" href="https://kuriko.vercel.app/">Portfolio</a> / <a target="_blank" href="https://www.linkedin.com/in/k-i-i">LinkedIn</a> / <a target="_blank" href="https://github.com/versionhq/multi-agent-system">Github</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create a DeepSeek R1 API in R with Plumber ]]>
                </title>
                <description>
                    <![CDATA[ To create an AI chatbot and integrate it with another platform, you have to communicate with large language model using an API. This API receives prompts from the client and sends them to the model to generate answers. In this tutorial, you will lear... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-a-deepseek-r1-api-in-r-with-plumber/</link>
                <guid isPermaLink="false">67b7b91a534c03e678009324</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ APIs ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Adejumo Ridwan Suleiman ]]>
                </dc:creator>
                <pubDate>Thu, 20 Feb 2025 23:22:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740093558918/453118b9-3c93-4e57-a1ad-7471e1046ef1.png" medium="image" />
                <content:encoded>
<![CDATA[ <p>To create an AI chatbot and integrate it with another platform, you have to communicate with a large language model using an API. This API receives prompts from the client and sends them to the model to generate answers.</p>
<p>In this tutorial, you will learn how to create such an API using the DeepSeek R1 large language model so external applications can call it. We will use the <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-R1">DeepSeek R1 model</a>, available on HuggingFace, and the Plumber R package to deploy it as an API.</p>
<p>HuggingFace is an open source platform for building, training, and deploying machine learning models, while Plumber is an R package that exposes R code as RESTful APIs accessible to other applications through HTTP requests.</p>
<p>With this API, you can:</p>
<ul>
<li><p>Build AI applications</p>
</li>
<li><p>Connect to external data and extract meaningful insights</p>
</li>
<li><p>Integrate into existing applications to provide customer support, create documentation, and so on.</p>
</li>
</ul>
<h2 id="heading-what-is-the-deepseek-r1-model">What is the DeepSeek R1 Model?</h2>
<p>DeepSeek R1 is the latest large language model from the Chinese company <a target="_blank" href="https://www.deepseek.com/">DeepSeek</a>. It was designed to enhance the problem-solving and analytic capabilities of AI systems.</p>
<p>DeepSeek-R1 uses reinforcement learning and supervised fine-tuning to handle complex reasoning tasks. Unlike proprietary models, DeepSeek R1 is open-source and free to use.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>Sign up for a <a target="_blank" href="https://huggingface.co/">HuggingFace account</a> if you don’t already have one</p>
</li>
<li><p>Install <a target="_blank" href="https://posit.co/download/rstudio-desktop/">R and R Studio</a>.</p>
</li>
<li><p>Install the <a target="_blank" href="https://www.rplumber.io/"><code>plumber</code></a> R package to build the API endpoint</p>
</li>
<li><p>Install the <a target="_blank" href="https://httr2.r-lib.org/"><code>httr2</code></a> R package to work with HTTP requests and interact with the Hugging Face API</p>
</li>
</ul>
<h2 id="heading-step-1-create-your-project-repository">Step 1: Create Your Project Repository</h2>
<p>You need to create an R project to build your API application. This ensures that all the files your API needs are kept together in the same directory. R Studio already provides a template for API projects, so you can follow the steps below to create yours.</p>
<p>In your R Studio IDE, click on the File menu and go to New Project to open the New Project Wizard. Once in the wizard, select New Directory, then click New Plumber API Project. Inside the directory name field, give it a name (for example <code>DeepSeek-R1 API</code>), and then click on Create Project.</p>
<p>You will see a file called <code>plumber.R</code> with a sample API template. This is where you’ll write the code to connect to the DeepSeek R1 model on HuggingFace. Make sure that you clear this template before proceeding.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738503866976/60b959cd-b564-486d-8b65-c9ca0278e239.gif" alt="GIF showing how to create a new Plumber project in R" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Next, go to your terminal and create a <code>.env</code> file. This is where you will store the Hugging Face API key.</p>
<pre><code class="lang-plaintext">touch .env
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738504109388/6ce9bda3-305a-4f2e-87b8-adbe4c245861.png" alt="Image showing how to create a .env variable on the terminal" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Create a <code>.gitignore</code> file and add the <code>.env</code> file to it. This ensures that sensitive information like access tokens and API keys are not pushed to your Git repository.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738504889229/0d433bcb-2a7d-4379-a0c7-e09fb53e288f.png" alt="Image showing the .env file in the .gitignore file" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-step-2-create-a-hugging-face-access-token">Step 2: Create a Hugging Face Access Token</h2>
<p>We need to create an access token to connect to Hugging Face models. Go to your profile, click Settings, and click Create New Token to create your access token for the Hugging Face repository.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738504360986/077a2778-d790-4ff9-94e1-c2c372b2efef.png" alt="Image showing the access tokens page, with options to create a new token " class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Copy the access token and paste it into your <code>.env</code> file, and give it the name <code>HUGGINGFACE_ACCESS_TOKEN</code>.</p>
<pre><code class="lang-plaintext">HUGGINGFACE_ACCESS_TOKEN="&lt;your-access-token&gt;"
</code></pre>
<p>Next, install the <code>dotenv</code> package and paste the following code at the top of your <code>plumber.R</code> file:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Load environment variables from .env</span>
dotenv::load_dot_env()
</code></pre>
<p><code>dotenv::load_dot_env()</code> loads all environment variables in the <code>.env</code> file, making them available to the <code>plumber.R</code> script.</p>
<h2 id="heading-step-3-build-the-deepseek-api-endpoint">Step 3: Build the DeepSeek API Endpoint</h2>
<p>Now that we have our project environment set up and API token ready, we’ll write the code to build the API application by connecting to the DeepSeek R1 model on HuggingFace.</p>
<p>Go to the <code>plumber.R</code> file and load the following libraries:</p>
<pre><code class="lang-r"><span class="hljs-keyword">library</span>(plumber)
<span class="hljs-keyword">library</span>(httr2)
</code></pre>
<p>Copy and paste the following code into <code>plumber.R</code>:</p>
<pre><code class="lang-r">
api_key &lt;- Sys.getenv(<span class="hljs-string">"HUGGINGFACE_ACCESS_TOKEN"</span>)



<span class="hljs-comment">#* @post /deepseek_chat</span>
<span class="hljs-keyword">function</span>(prompt) {
  url &lt;- <span class="hljs-string">"https://huggingface.co/api/inference-proxy/together/v1/chat/completions"</span>

  <span class="hljs-comment"># Create a request object</span>
  req &lt;- request(url) |&gt;
    req_auth_bearer_token(api_key) |&gt;
    req_body_json(list(
      model = <span class="hljs-string">"deepseek-ai/DeepSeek-R1"</span>,
      messages = list(
        list(role = <span class="hljs-string">"user"</span>, content = prompt)
      ),
      max_tokens = <span class="hljs-number">500</span>,
      stream = <span class="hljs-literal">FALSE</span>
    ))

  <span class="hljs-comment"># Perform the request and capture the response</span>
  res &lt;- req_perform(req)

  <span class="hljs-comment"># Parse the JSON response</span>
  parsed_data &lt;- res |&gt;
    resp_body_json()

  <span class="hljs-comment"># Extract the content from the response</span>
  content &lt;- parsed_data$choices
  <span class="hljs-keyword">return</span>(content)
}
</code></pre>
<p>Here’s what’s going on in the above code:</p>
<ul>
<li><p><code>Sys.getenv</code> gets the HuggingFace access token and stores it in the variable <code>api_key</code>.</p>
</li>
<li><p>The <code>url</code> variable contains the API link to access the DeepSeek model on HuggingFace. You can get this by searching the model name <code>deepseek-ai/DeepSeek-R1</code> on HuggingFace. Click the <strong>View Code</strong> button, and under the <strong>cURL</strong> tab, copy the API URL.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739177037117/0781bce2-7bf8-411d-ad71-cb2bf11fe1bb.gif" alt="GIF showing how to copy the API url you are going to use in your plumber API code" width="600" height="400" loading="lazy"></p>
</li>
<li><p><code>#* @post /deepseek_chat</code> means that the endpoint makes a POST request through the path <code>/deepseek_chat</code>.</p>
</li>
<li><p>This endpoint takes a single argument, <code>prompt</code>: the text or question the user provides.</p>
</li>
<li><p>The <code>req</code> object is built as a chain of operations: <code>request()</code> creates a request to the <code>url</code>, <code>req_auth_bearer_token()</code> attaches the <code>api_key</code>, and <code>req_body_json()</code> adds model properties such as the <code>model</code> name, <code>role</code>, <code>prompt</code>, and <code>max_tokens</code> to the request body.</p>
</li>
<li><p>Authorization for the HuggingFace API is handled by <code>req_auth_bearer_token()</code>, which attaches the access token to the request as a bearer token header.</p>
</li>
<li><p>The request is performed and captured in a response object <code>res</code> using the <code>req_perform()</code> function.</p>
</li>
<li><p>The <code>res</code> object holds a JSON response, which is parsed into an R list using the <code>resp_body_json()</code> function.</p>
</li>
<li><p>The <code>content</code> extracted from <code>parsed_data</code> is returned, so whatever application consumes the API can pull out the information it needs.</p>
</li>
</ul>
<h2 id="heading-step-4-test-the-api-endpoint">Step 4: Test the API Endpoint</h2>
<p>Let’s run the API endpoint to see how the application performs. Click the <strong>Run API</strong> button. This will automatically open the API documentation in your browser at <a target="_blank" href="http://127.0.0.1:8634/docs/"><code>http://127.0.0.1:8634/docs/</code></a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739282692303/82a029ea-31f5-4088-9e72-2fe1b69d0f7d.png" alt="Image showing the Run API button" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Click on the API endpoint dropdown, provide a prompt, and click the Execute button. You should receive a reply in a few minutes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739282620577/b1a52679-b397-4d82-af56-0f81ebc5888e.gif" alt="Image showing how the API endpoint returns a response when a prompt is given" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>With your API, you can make inference requests to the HuggingFace model and build AI applications in R or other programming languages. To make the API accessible to clients online, you need to host it. There are various <a target="_blank" href="https://www.rplumber.io/articles/hosting.html">ways of hosting an R Plumber application</a>: you can use Docker, or host it on DigitalOcean using the plumberDeploy R package. The simplest option, however, is <a target="_blank" href="https://posit.co/products/enterprise/connect/">Posit Connect</a>.</p>
<p>You can use the same approach from this tutorial to try out other HuggingFace models and build APIs that generate images or translate between languages. R Plumber is easy to use, and its documentation provides many resources.</p>
<p>If you are interested in model deployment using R Plumber, you can check out <a target="_blank" href="https://learndata.xyz/posts/forecasting%20time%20series%20data%20with%20facebook%20prophet/">this article</a> on how to deploy a Time Series model built on Prophet using R Plumber.</p>
<p>If you find this article interesting, please check my other articles on <a target="_blank" href="https://learndata.xyz/blog">learndata.xyz</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Understanding Deep Learning Research Tutorial - Theory, Code, and Math ]]>
                </title>
                <description>
                    <![CDATA[ Understanding deep learning research can often feel like unraveling a dense and intricate puzzle. From decoding mathematical notation to navigating complex code bases, the process can be daunting, especially for newcomers. But with the right guidance... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/understanding-deep-learning-research-tutorial-theory-code-and-math/</link>
                <guid isPermaLink="false">678921f0a510ee899ead3f7c</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 16 Jan 2025 15:12:48 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737040354258/4ad88afd-82ee-4b59-bc6a-cdc9a5537c59.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Understanding deep learning research can often feel like unraveling a dense and intricate puzzle. From decoding mathematical notation to navigating complex code bases, the process can be daunting, especially for newcomers. But with the right guidance, you can build the skills necessary to break down cutting-edge AI research and make it accessible.</p>
<p>We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to read, understand, and implement deep learning research. Taught by Yacine, a published researcher and machine learning practitioner, this tutorial provides a step-by-step approach to mastering essential skills like interpreting technical papers, understanding advanced mathematics, and navigating research codebases. With practical examples and a focus on recent AI papers, this course empowers you to confidently engage with the latest developments in machine learning.</p>
<h3 id="heading-what-youll-learn-in-this-course">What You’ll Learn in This Course</h3>
<p>The course is structured to address the key challenges that aspiring researchers and practitioners face when diving into deep learning:</p>
<h4 id="heading-1-how-to-read-research-papers"><strong>1. How to Read Research Papers</strong></h4>
<p>This section provides a comprehensive framework for effectively breaking down research papers:</p>
<ul>
<li><p>Learn how to <strong>get external context</strong> and perform an initial casual read to grasp the paper’s main idea.</p>
</li>
<li><p>Dive deeper into <strong>filling knowledge gaps</strong> and achieving conceptual understanding.</p>
</li>
<li><p>Explore how to conduct a <strong>code deep dive</strong> and meticulously analyze the paper’s methods and results.</p>
</li>
<li><p>Develop strategies to identify and address <strong>weird gaps</strong> or inconsistencies in the paper.</p>
</li>
</ul>
<h4 id="heading-2-understanding-deep-learning-math"><strong>2. Understanding Deep Learning Math</strong></h4>
<p>Many papers rely heavily on mathematical notation, which can be intimidating. Yacine simplifies this process by teaching:</p>
<ul>
<li><p>Techniques to <strong>relax and approach formulas systematically</strong>.</p>
</li>
<li><p>How to <strong>translate symbols into meaning</strong> and build intuition around complex equations (e.g., the QHAdam optimizer).</p>
</li>
<li><p>Methods to summarize mathematical insights for practical application.</p>
</li>
</ul>
<h4 id="heading-3-learning-math-efficiently"><strong>3. Learning Math Efficiently</strong></h4>
<p>Mastering the mathematics behind deep learning doesn’t have to be overwhelming. This section focuses on:</p>
<ul>
<li><p>Selecting the <strong>right subfields of math</strong> to study based on your goals.</p>
</li>
<li><p>Leveraging <strong>exercise-rich resources</strong> to reinforce learning.</p>
</li>
<li><p>Using the <strong>Green-Yellow-Red method</strong> to identify strengths and weaknesses.</p>
</li>
<li><p>Fixing gaps in understanding through targeted study of theory.</p>
</li>
</ul>
<h4 id="heading-4-navigating-deep-learning-codebases"><strong>4. Navigating Deep Learning Codebases</strong></h4>
<p>Research codebases are often sprawling and complex. Yacine walks you through:</p>
<ul>
<li><p>How to <strong>map the structure of a codebase</strong> after reading the related research paper.</p>
</li>
<li><p>Strategies to <strong>run, debug, and understand the code</strong>.</p>
</li>
<li><p>Methods to elucidate unclear components and take detailed notes for clarity.</p>
</li>
</ul>
<h4 id="heading-5-segment-anything-model-sam-deep-dive"><strong>5. Segment Anything Model (SAM) Deep Dive</strong></h4>
<p>The course culminates in an in-depth exploration of the Segment Anything Model (SAM), a groundbreaking approach to segmentation in computer vision. You’ll learn about:</p>
<ul>
<li><p>The <strong>task and testing</strong> process for SAM.</p>
</li>
<li><p>Its <strong>theoretical underpinnings</strong> and key model components, including the image encoder, prompt encoder, and mask decoder.</p>
</li>
<li><p>How the <strong>data pipeline and engine</strong> are structured.</p>
</li>
<li><p>Insights into SAM’s <strong>zero-shot results</strong> and limitations.</p>
</li>
</ul>
<h3 id="heading-why-this-course">Why This Course</h3>
<p>Whether you're a beginner curious about deep learning or an experienced developer aiming to engage with AI research, this course equips you with practical tools and methodologies to demystify deep learning research. By combining theory, hands-on practice, and real-world examples, Yacine ensures that you’ll walk away with actionable insights and confidence in your ability to tackle even the most complex papers.</p>
<p>Check out the <a target="_blank" href="https://youtu.be/onU5Hbb3qao">Deep Learning Research Tutorial</a> now on the freeCodeCamp.org YouTube channel (2-hour course).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/onU5Hbb3qao" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Do Generative Models Work in Deep Learning? Generative Models For Data Augmentation Explained ]]>
                </title>
                <description>
                    <![CDATA[ Data is at the heart of model training in the world of deep learning. The quantity and quality of training data determine the effectiveness of machine learning algorithms. On the other hand, obtaining massive amounts of precisely categorized data is ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/generative-models-for-data-augmentation/</link>
                <guid isPermaLink="false">66d4608ac7632f8bfbf1e469</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oyedele Tioluwani ]]>
                </dc:creator>
                <pubDate>Fri, 26 Jul 2024 12:22:23 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/glenn-carstens-peters-1F4MukO0UNg-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Data is at the heart of model training in the world of deep learning. The quantity and quality of training data determine the effectiveness of machine learning algorithms.</p>
<p>On the other hand, obtaining massive amounts of precisely categorized data is a difficult and resource-intensive operation. This is where data augmentation comes into play as an appealing solution, with the innovative potential of generative models at its forefront.</p>
<p>In this article, we'll look into the fundamental relevance of generative models in data augmentation for deep learning, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).</p>
<h2 id="heading-what-are-generative-models">What are Generative Models?</h2>
<p>Generative models are machine learning models that create new data samples similar to those in a given dataset. They learn the hidden patterns and structures in the data, which allows them to generate synthetic data points that resemble the real data.</p>
<p>These models are used in a variety of applications, such as image generation, text generation, data augmentation, and others. For example, in an image generation project, a generative model could be trained on images of cats and dogs to learn how to generate new images of cats and dogs.</p>
<p>They learn patterns and styles from existing data and apply that information to create similar things. It’s like your computer having a creative engine that generates fresh ideas after studying the tactics utilized in prior ones.</p>
<h2 id="heading-what-is-data-augmentation">What is Data Augmentation?</h2>
<p>Data augmentation is a machine learning and deep learning technique that uses various transformations and adjustments to existing data to improve the quality and quantity of a training dataset. This entails generating new data samples from existing ones to expand the size and diversity of a dataset.</p>
<p>The main purpose of data augmentation is to improve a machine learning model’s performance, generalization, and robustness, notably in computer vision tasks and other data-driven areas.</p>
<p>Data augmentation can be used to improve datasets for a wide range of machine-learning applications, such as image classification, object detection, and natural language processing. Data augmentation, for example, can be used to create synthetic photos of faces, which can then be used to train a deep-learning model to detect faces in real-world images.</p>
<p>Data augmentation is an important method in the data world because it addresses the underlying concerns of data quantity and quality. Access to large amounts of diverse, well-labeled data is required for building strong and accurate models in many machine learning and deep learning applications.</p>
<p>Data augmentation is a beneficial method for expanding limited datasets by creating new samples, which improves model generalization and performance. Furthermore, it improves the ability of machine learning algorithms to manage real-world fluctuations, resulting in more trustworthy and flexible AI systems.</p>
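<p>To make the idea concrete, here is a small illustrative sketch (not from the original article) of classic augmentation transformations in Python with NumPy. The <code>augment</code> function and every parameter value in it are invented for illustration:</p>

```python
import numpy as np

def augment(image, rng):
    """Return augmented variants of one image array of shape (H, W)."""
    variants = [
        np.fliplr(image),   # horizontal flip
        np.rot90(image),    # 90-degree rotation
        np.clip(image + rng.normal(0.0, 0.05, image.shape), 0.0, 1.0),  # additive Gaussian noise
    ]
    return variants

rng = np.random.default_rng(0)
img = rng.random((32, 32))   # stand-in for one real training image
augmented = augment(img, rng)
print(len(augmented))        # three new samples derived from one original
```

Each call turns one labeled example into several, which is exactly how augmentation stretches a limited dataset.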
<h2 id="heading-why-use-generative-models-for-data-augmentation">Why Use Generative Models for Data Augmentation?</h2>
<p>There are several reasons why generative models are employed for data augmentation in machine learning:</p>
<ol>
<li><p><strong>Increased Data Diversity:</strong> Generative models can help boost dataset variety, making machine learning models more resilient to real-world fluctuations. A generative model could be used to generate synthetic images of faces with various expressions, ages, and ethnicities. This could help a machine learning model learn to detect faces more reliably in a wide range of real-world scenarios.</p>
</li>
<li><p><strong>Improved Model Generalization:</strong> Using generative models to augment data exposes machine learning models to a broader collection of data variables during training. This procedure improves the model’s ability to generalize to new, previously unknown data and its overall performance. This is particularly relevant for deep learning models, which require vast volumes of data to adequately train.</p>
</li>
<li><p><strong>Overcoming Data Scarcity:</strong> Obtaining a large and diverse labeled dataset can be a substantial issue in many machine learning applications. By developing synthetic data, generative models can assist in managing data scarcity by lowering reliance on limited real data.</p>
</li>
<li><p><strong>Reduction of Bias:</strong> By generating new data samples that address underrepresented or biased categories, generative models can be used to eliminate bias in training data, improving balance in AI applications.</p>
</li>
</ol>
<h2 id="heading-generative-models-for-data-augmentation">Generative Models for Data Augmentation</h2>
<p>Two main types of generative models can be used for data augmentation:</p>
<ul>
<li><p>Generative Adversarial Networks (GANs)</p>
</li>
<li><p>Variational AutoEncoders (VAEs)</p>
</li>
</ul>
<h3 id="heading-generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>
<p>GANs are neural network architectures used to create fresh data samples that are comparable to the training data. They learn to construct new items that appear to be drawn from a given dataset. GANs, for example, can be trained on a collection of photos and then used to produce new images that look like they came from the original set.</p>
<p>Here’s a short explanation of how GANs work:</p>
<ul>
<li><p>A new data sample is generated by the generator. The discriminator is provided with both new and real data samples.</p>
</li>
<li><p>The discriminator attempts to determine which samples are real and which are fabricated.</p>
</li>
<li><p>The output of the discriminator is used to update both the generator and the discriminator.</p>
</li>
</ul>
<p>The generator takes random noise as input and produces a synthetic image. The discriminator tries to correctly classify both the generator’s fake image and a real image from the training set.</p>
<p>The generator updates its parameters to produce a more convincing fake image that can fool the discriminator, while the discriminator adjusts its parameters to better distinguish real images from fake ones. The two networks keep competing and improving until the generator produces data that closely resembles the real data.</p>
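<p>The alternating training described above can be sketched on a toy 1-D problem. This is an illustrative sketch, not a production GAN: both networks are shrunk to two scalar parameters each, the real data is a simple Gaussian, and all names, learning rates, and step counts here are invented for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy setup: real data ~ N(3, 1); generator g(z) = a*z + b; discriminator D(x) = sigmoid(w*x + c)
a, b = 1.0, 0.0   # generator parameters
w, c = 0.0, 0.0   # discriminator parameters
lr = 0.05

for step in range(2000):
    z = rng.normal(0.0, 1.0, 64)      # noise fed to the generator
    real = rng.normal(3.0, 1.0, 64)   # real samples
    fake = a * z + b                  # generated (fake) samples

    # Discriminator step: ascend on log D(real) + log(1 - D(fake))
    d_real = sigmoid(w * real + c)
    d_fake = sigmoid(w * fake + c)
    w += lr * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator step: ascend on log D(fake) (the non-saturating objective)
    d_fake = sigmoid(w * fake + c)
    a += lr * np.mean((1 - d_fake) * w * z)
    b += lr * np.mean((1 - d_fake) * w)

# After training, the generator's output distribution should drift toward the real mean (3)
gen_mean = float(np.mean(a * rng.normal(0.0, 1.0, 10000) + b))
```

The key structural point is the alternation: each iteration first improves the discriminator against the current generator, then improves the generator against the current discriminator.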
<p>GANs are well suited to data augmentation because they can generate synthetic data that is hard to distinguish from genuine samples. This matters because machine learning algorithms learn from data: the more data used to train a model, the better it tends to perform. On the other hand, collecting enough real-world data to train a machine-learning model can be costly and time-consuming.</p>
<p>GANs can help to reduce the cost and time required to collect data by producing synthetic data that is similar to real-world data. This is especially beneficial for applications when collecting real-world data is difficult or expensive, such as medical imaging or video surveillance data.</p>
<p>GANs are also valued for the variety they produce: they can generate data samples that did not exist in the original dataset, which helps improve the robustness of machine learning models to real-world variations.</p>
<h3 id="heading-variational-autoencoders-vaes">Variational AutoEncoders (VAEs)</h3>
<p>VAEs are a type of generative model built as a variation of the autoencoder architecture used in machine learning and deep learning. Like GANs, they can generate fresh data samples comparable to the data they were trained on.</p>
<p>VAEs are probabilistic (Bayesian) models: they use probability distributions to represent the uncertainty in the data. This helps VAEs generate realistic variations of the training data.</p>
<p>VAEs work by learning about data representation in latent space. The latent space is a compressed representation of data that captures the data’s most relevant qualities. By sampling from the latent space and decoding the samples back into the original data space, VAEs can then be utilized to produce new data samples.</p>
<p>Here’s a simple illustration of how a VAE works:</p>
<ul>
<li><p>As input, the encoder receives a data sample, such as an image of an animal.</p>
</li>
<li><p>The encoder generates a latent space representation of the data, which is a compressed version of the image that captures the animal’s most relevant characteristics, such as shape, size, and fur color.</p>
</li>
<li><p>The latent space representation is fed into the decoder.</p>
</li>
<li><p>The decoder generates a reconstructed data sample, which is a new image of an animal that resembles the original image.</p>
</li>
</ul>
<p>The encoder and decoder are trained jointly to minimize the difference between the reconstructed and original images. This is accomplished with a loss function that measures how similar the two images are.</p>
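<p>As a hedged sketch of the mechanics described above (not code from the article), the encode, sample, and decode steps and the standard VAE loss terms can be written in a few lines of NumPy. All weights here are random stand-ins, so the "model" is untrained and the numbers are purely illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encoder: maps one input vector to the mean and log-variance of a latent Gaussian.
# Every weight matrix here is a random placeholder: nothing is trained.
x = rng.random(16)                        # input sample (e.g. a flattened image patch)
W_mu = rng.normal(0.0, 0.1, (4, 16))
W_logvar = rng.normal(0.0, 0.1, (4, 16))
mu, logvar = W_mu @ x, W_logvar @ x       # parameters of q(z|x), latent dimension 4

# Reparameterization trick: z = mu + sigma * eps keeps the sampling step differentiable
eps = rng.normal(0.0, 1.0, 4)
z = mu + np.exp(0.5 * logvar) * eps

# Toy decoder: maps the latent code back to input space, squashed into (0, 1)
W_dec = rng.normal(0.0, 0.1, (16, 4))
x_hat = 1.0 / (1.0 + np.exp(-(W_dec @ z)))

# VAE loss = reconstruction error + KL divergence of q(z|x) from the N(0, I) prior
recon = np.sum((x - x_hat) ** 2)
kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
loss = recon + kl
```

Sampling different <code>z</code> values from the latent space and decoding them is exactly how a trained VAE produces new, augmented data points.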
<p>VAEs are a strong generative modeling tool that can be used for image production, text generation, data compression, and data denoising. They provide a probabilistic framework for modeling and producing complex data distributions while preserving a structured latent space for data production and interpolation.</p>
<p>Their ability to generate data similar to real-world data also makes VAEs well suited for data augmentation. The augmented data they produce is highly realistic and aligned with the underlying data distribution, which is required for effective data augmentation.</p>
<p>Each point in the structured latent space of VAEs represents a meaningful data variation. This enables controlled data creation. Users can build new data instances with specific attributes or variants by sampling different places in the latent space, making it suited for targeted data augmentation.</p>
<p>VAEs can address data scarcity issues by generating synthetic data when real data is limited. This is particularly valuable in scenarios where collecting more real data is impractical or expensive.</p>
<p>As VAEs continue to improve, they will likely play an increasingly important role in training machine learning models.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Generative models have played a significant part in the practice of data augmentation in the machine-learning field.</p>
<p>For instance, GANs have been used to generate synthetic images of faces, which have been used to train machine learning models to detect faces in real-world images.</p>
<p>VAEs have also been used to create synthetic images of cars, which were then used to train machine-learning models to recognize cars in real-world photographs.</p>
<p>These are all real-life applications of generative models in data augmentation.</p>
<p>I hope this article was helpful.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build an Interpretable Artificial Intelligence Model – Simple Python Code Example ]]>
                </title>
                <description>
                    <![CDATA[ Artificial Intelligence is being used everywhere these days. And many of the groundbreaking applications come from Machine Learning, a subfield of AI. Within Machine Learning, a field called Deep Learning represents one of the main areas of research.... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-an-interpretable-ai-deep-learning-model/</link>
                <guid isPermaLink="false">66ba533a79bcbbffd5d70c56</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Tue, 23 Jul 2024 22:11:31 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/pexels-dmitry-demidov-515774-3852577.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Artificial Intelligence is being used everywhere these days. And many of the groundbreaking applications come from Machine Learning, a subfield of AI.</p>
<p>Within Machine Learning, a field called Deep Learning represents one of the main areas of research. It is from Deep Learning that most new, truly effective AI systems are born.</p>
<p>But typically, the AI systems born from Deep Learning are quite narrow, focused systems. They can outperform humans in one very specific area for which they were made.</p>
<p>Because of this, many new developments in AI come from specialized systems or a combination of systems working together.</p>
<p>One of the bigger problems in the field of Deep Learning models is their lack of interpretability. Interpretability means understanding how decisions are made. </p>
<p>This is a big problem that has its own field, called explainable AI. This is the field within AI that focuses on making an AI model's decisions more easily understandable.</p>
<p>Here's what we'll cover in this article:</p>
<ul>
<li><a class="post-section-overview" href="#heading-artificial-intelligence-and-the-rise-of-deep-learning">Artificial Intelligence and the Rise of Deep Learning</a></li>
<li><a class="post-section-overview" href="#heading-a-big-problem-in-deep-learning-lack-of-interpretability">A big problem in deep learning: Lack of interpretability</a></li>
<li><a class="post-section-overview" href="#heading-a-solution-to-interpretability-glass-box-models">A solution to interpretability: Glass Box models</a></li>
<li><a class="post-section-overview" href="#heading-code-example-solving-the-problem-with-explainable-ai">Code example: Solving the problem with Explainable AI</a></li>
<li><a class="post-section-overview" href="#heading-conclusion-kan-kolmogorovarnold-networks">Conclusion: KAN (Kolmogorov–Arnold Networks)</a></li>
</ul>
<p>This article won't cover dropout or other regularization techniques, hyperparameter optimization, complex architectures like CNNs, or detailed differences in gradient descent variants.</p>
<p>We'll just discuss the basics of deep learning, the lack of interpretability problem, and a code example.</p>
<h2 id="Artificial">Artificial Intelligence and the Rise of Deep Learning</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/AI.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/robot-pointing-on-a-wall-8386440/">Tara Winstead</a></em></p>
<h3 id="heading-what-is-deep-learning-in-artificial-intelligence">What is Deep Learning in Artificial Intelligence?</h3>
<p>Deep Learning is a subfield of artificial intelligence. It uses neural networks to process complex patterns, just like the strategies a sports team uses to win a match.</p>
<p>The bigger the neural network, the more capable it is of doing awesome things – like ChatGPT, for example, which uses natural language processing to answer questions and interact with users.</p>
<p>To truly understand the basics of neural networks – what every single AI model has in common that enables it to work – we need to understand activation layers.</p>
<h3 id="heading-deep-learning-training-neural-networks">Deep Learning = Training Neural Networks</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/4-2.png" alt="4-2" width="600" height="400" loading="lazy">
<em>Simple neural network</em></p>
<p>At the core of deep learning is the training of neural networks.</p>
<p>In essence, this means using data to learn the right weights for each neuron so the network can predict what we want.</p>
<p>Neural networks are made of neurons organized in layers. Each layer extracts unique features from the data.</p>
<p>This layered structure allows deep learning models to analyze and interpret complex data.</p>
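<p>As a minimal illustration of this layered structure (a sketch with made-up sizes and random, untrained weights, not code from the article), a two-layer network is just a couple of matrix multiplications with nonlinear activations in between:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda t: np.maximum(t, 0.0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# An untrained toy network: 4 inputs -> hidden layer of 8 neurons -> 1 output.
# The random weights are placeholders; training would tune them from data.
W1, b1 = rng.normal(0.0, 1.0, (8, 4)), np.zeros(8)
W2, b2 = rng.normal(0.0, 1.0, (1, 8)), np.zeros(1)

x = np.array([0.2, 0.7, 0.1, 0.9])   # one input sample with 4 features
hidden = relu(W1 @ x + b1)           # layer 1 extracts intermediate features
output = sigmoid(W2 @ hidden + b2)   # layer 2 maps those features to a score in (0, 1)
```

Each layer transforms the previous layer's output, which is what lets deeper networks build up progressively more abstract features.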
<h2 id="problem">A Big Problem in Deep Learning: Lack of Interpretability</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/interptret.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by <a target="_blank" href="https://www.pexels.com/photo/crop-unrecognizable-woman-reading-book-on-soft-bed-4170628/">Koshevaya_k</a></em></p>
<p>Deep Learning has revolutionized many fields by achieving great results in very complex tasks.</p>
<p>However, there is a big problem: the lack of interpretability.</p>
<p>While it is true that neural networks can perform very well, we don't understand how they arrive at their results internally.</p>
<p>In other words, we know they do very well with the tasks we give them, but not how they do them in detail.</p>
<p>It is important to know how the model thinks in fields such as healthcare and autonomous driving.</p>
<p>By understanding how a model thinks, we can be more confident in its reliability in certain critical areas.</p>
<p>So models that operate in fields with strict regulations are easier to audit for compliance and build more trust when they're interpretable.</p>
<p>Models that allow interpretability are called <strong>glass box models</strong>. On the other hand, models that do not have this capability (that is, most of them) are called <strong>black box models.</strong></p>
<h2 id="solution">A Solution to Interpretability: Glass Box Models</h2>

<h3 id="heading-glass-box-models">Glass Box Models</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/glass-pixabay-416528.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Pixabay: https://www.pexels.com/photo/fluid-pouring-in-pint-glass-416528/</em></p>
<p>Glass box models are machine learning models designed to be easily understood by humans.</p>
<p>Glass box models provide clear insights into how they make their decisions.</p>
<p>This transparency in the decision-making process is important for trust, compliance, and improvement.</p>
<p>Below, we'll see a code example of an AI model that achieves an accuracy of 97% on a breast cancer prediction dataset.</p>
<p>We'll also find out which characteristics of the data were most important in predicting the cancer.</p>
<h3 id="heading-black-box-models">Black Box Models</h3>
<p>In addition to glass box models, there are also black box models. </p>
<p>These models are essentially different neural network architectures applied to various kinds of data. Some examples are:</p>
<ul>
<li><strong>CNN (Convolutional Neural Networks)</strong>: Designed specifically for image classification and interpretation.</li>
<li><strong>RNN (Recurrent Neural Networks) and LSTM (Long Short Term Memory)</strong>: Primarily used for sequential data – text and time series data. In 2017, they were surpassed by a neural network architecture called transformers in a paper called <a target="_blank" href="https://arxiv.org/abs/1706.03762">Attention is All You Need.</a></li>
<li><strong>Transformer-based architectures</strong>: Revolutionized AI in 2017 due to their ability to handle sequential data more efficiently. RNN and LSTM have limited capabilities in this regard.</li>
</ul>
<p>Nowadays, most models that process text are transformer-based models.</p>
<p>For instance, in ChatGPT, <strong>GPT</strong> stands for <strong>Generative Pre-trained Transformer</strong>, indicating a transformer neural network architecture that generates text.</p>
<p>All these models—CNN, RNN, LSTM and Transformers—are examples of narrow artificial intelligence (AI).</p>
<p>Achieving general intelligence, in my view, involves combining many of these narrow AI models to mimic human behavior.</p>
<h2 id="example">Code Example: Solving the Problem with Explainable AI</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/cancer-chokniti-khongchum-1197604-2280571.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Chokniti Khongchum: https://www.pexels.com/photo/person-holding-laboratory-flask-2280571/</em></p>
<p>In this code example, we will create an interpretable AI model based on 30 characteristics.</p>
<p>We'll also learn what the 5 characteristics are that are more important in the detection of breast cancer, based on this dataset.</p>
<p>We will use a machine learning glass box model called the Explainable Boosting Machine (EBM).</p>
<p>Here is the full code, which we'll then walk through block by block:</p>
<pre><code><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score
<span class="hljs-keyword">from</span> interpret.glassbox <span class="hljs-keyword">import</span> ExplainableBoostingClassifier
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

# Load a sample dataset
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)

# Train an EBM model
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# Make predictions
y_pred = ebm.predict(X_test)
print(f<span class="hljs-string">"Accuracy: {accuracy_score(y_test, y_pred)}"</span>)

# Interpret the model
ebm_global = ebm.explain_global(name=<span class="hljs-string">'EBM'</span>)

# Extract feature importances
feature_names = ebm_global.data()[<span class="hljs-string">'names'</span>]
importances = ebm_global.data()[<span class="hljs-string">'scores'</span>]

# Sort features by importance
sorted_idx = np.argsort(importances)
sorted_feature_names = np.array(feature_names)[sorted_idx]
sorted_importances = np.array(importances)[sorted_idx]

# Increase spacing between the feature names
y_positions = np.arange(len(sorted_feature_names)) * <span class="hljs-number">1.5</span>  # Increase multiplier <span class="hljs-keyword">for</span> more space

# Plot feature importances
plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">14</span>))  # Increase figure height <span class="hljs-keyword">if</span> necessary
plt.barh(y_positions, sorted_importances, color=<span class="hljs-string">'skyblue'</span>, align=<span class="hljs-string">'center'</span>)
plt.yticks(y_positions, sorted_feature_names)
plt.xlabel(<span class="hljs-string">'Importance'</span>)
plt.title(<span class="hljs-string">'Feature Importances from Explainable Boosting Classifier'</span>)
plt.gca().invert_yaxis()

# Adjust spacing
plt.subplots_adjust(left=<span class="hljs-number">0.3</span>, right=<span class="hljs-number">0.95</span>, top=<span class="hljs-number">0.95</span>, bottom=<span class="hljs-number">0.08</span>)  # Fine-tune the margins <span class="hljs-keyword">if</span> needed

plt.show()
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/1-4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Full Code</em></p>
<p>Alright, now let's break it down.</p>
<h3 id="heading-importing-libraries">Importing Libraries</h3>
<p>First, we'll import the libraries we need for our example. You can do that with the following code:</p>
<pre><code><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score
<span class="hljs-keyword">from</span> interpret.glassbox <span class="hljs-keyword">import</span> ExplainableBoostingClassifier
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/2-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing libraries</em></p>
<p>These are the libraries we are going to use:</p>
<ul>
<li><a target="_blank" href="https://pandas.pydata.org/">Pandas</a>: This is a Python library used for data manipulation and analysis.</li>
<li><a target="_blank" href="https://scikit-learn.org/stable/index.html">sklearn</a>: The <a target="_blank" href="https://scikit-learn.org/stable/index.html">scikit-learn library</a> is used to implement machine learning algorithms. We're importing it for data pre processing and model evaluation.</li>
<li><a target="_blank" href="https://interpret.ml/">Interpret</a>: The <a target="_blank" href="https://interpret.ml/">interpretAI</a> Python library is what we'll use to import the model we'll use. </li>
<li><a target="_blank" href="https://matplotlib.org/">Matplotlib</a>: A Python library used to make graphs in Python.</li>
<li><a target="_blank" href="https://numpy.org/">Numpy</a>: Used for very fast numerical computations.</li>
</ul>
<h3 id="heading-loading-preparing-the-dataset-and-splitting-the-data">Loading, Preparing the Dataset, and Splitting the Data</h3>
<pre><code># Load a sample dataset
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/3-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Loading, Preparing the Dataset, and Splitting the Data</em></p>
<p><strong>First, we load a sample dataset</strong>: We import a breast cancer dataset using the scikit-learn library.</p>
<p><strong>Next, we prepare the data</strong>: The features (data points) from the dataset are organized into a table format, where each column is labeled with a specific feature name. The target outcomes (labels) from the dataset are stored separately.</p>
<p><strong>Then we split the data into training and testing sets</strong>: The data is divided into two parts: one for training the model and one for testing the model. 80% of the data is used for training, while 20% is reserved for testing.</p>
<p>A specific random seed is set to ensure that the data split is consistent every time the code is run.</p>
<p>Quick note: In real life, the dataset would first be pre-processed with data manipulation techniques to make the resulting AI model faster and smaller.</p>
<h3 id="heading-training-the-model-making-predictions-and-evaluating-the-model">Training the Model, Making Predictions, and Evaluating the Model</h3>
<pre><code># Train an EBM model
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# Make predictions
y_pred = ebm.predict(X_test)
print(f<span class="hljs-string">"Accuracy: {accuracy_score(y_test, y_pred)}"</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/4-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Training the Model, Making Predictions and Evaluating the Model</em></p>
<p><strong>First, we train an EBM model</strong>: We initialize an Explainable Boosting Machine model and then train it using the training data.</p>
<p>This way, with a single line of code, we create an AI model that predicts breast cancer from the dataset.</p>
<p><strong>Then we make our predictions</strong>: The trained EBM model is used to make predictions on the test data. Next, we calculate and print the accuracy of the model's predictions.</p>
<h3 id="heading-interpreting-the-model-extracting-and-sorting-feature-importances">Interpreting the Model, Extracting, and Sorting Feature Importances</h3>
<pre><code># Interpret the model
ebm_global = ebm.explain_global(name=<span class="hljs-string">'EBM'</span>)

# Extract feature importances
feature_names = ebm_global.data()[<span class="hljs-string">'names'</span>]
importances = ebm_global.data()[<span class="hljs-string">'scores'</span>]

# Sort features by importance
sorted_idx = np.argsort(importances)
sorted_feature_names = np.array(feature_names)[sorted_idx]
sorted_importances = np.array(importances)[sorted_idx]
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/5-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Interpreting the Model, Extracting and Sorting Feature Importances</em></p>
<p><strong>At this point, we need to interpret the model</strong>: The global explanation of the trained Explainable Boosting Machine (EBM) model is obtained, providing an overview of how the model makes decisions.</p>
<p>For this model, the accuracy is approximately 0.9737 – which means the model's predictions are correct about 97% of the time.</p>
<p>Of course, this only applies to the breast cancer data from <strong>this dataset</strong> – not for every single case of breast cancer detection. Since this is a sample, the dataset does not represent the full population of people seeking to detect breast cancer.</p>
<p>Quick note: In the real world, for classification, we'd use the <strong>F1 score</strong> instead of accuracy to evaluate a model, since it takes both <strong>precision</strong> and <strong>recall</strong> into account.</p>
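<p>To see why accuracy alone can mislead, here is a small toy example of my own (assuming scikit-learn, as used in this article) with an imbalanced dataset:</p>

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced data: 9 negative cases, 1 positive case
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# A useless model that always predicts "negative"
y_pred = [0] * 10

# Accuracy looks impressive, but the positive class is never found
print(accuracy_score(y_true, y_pred))  # 0.9
print(f1_score(y_true, y_pred))        # 0.0
```

<p>The F1 score exposes the failure that accuracy hides, which is why it is preferred for imbalanced classification problems like disease detection.</p>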
<p><strong>Next, we extract feature importances</strong>: We extract the names and corresponding importance scores of the features used by the model from the global explanation.</p>
<p><strong>Then we sort the features by importance</strong>: The features are sorted based on their importance scores, resulting in a list of feature names and their respective importance scores ordered from least to most important.</p>
<h3 id="heading-plotting-feature-importances">Plotting Feature Importances</h3>
<pre><code># Increase spacing between the feature names
y_positions = np.arange(len(sorted_feature_names)) * <span class="hljs-number">1.5</span>  # Increase multiplier <span class="hljs-keyword">for</span> more space

# Plot feature importances
plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">14</span>))  # Increase figure height <span class="hljs-keyword">if</span> necessary
plt.barh(y_positions, sorted_importances, color=<span class="hljs-string">'skyblue'</span>, align=<span class="hljs-string">'center'</span>)
plt.yticks(y_positions, sorted_feature_names)
plt.xlabel(<span class="hljs-string">'Importance'</span>)
plt.title(<span class="hljs-string">'Feature Importances from Explainable Boosting Classifier'</span>)
plt.gca().invert_yaxis()

# Adjust spacing
plt.subplots_adjust(left=<span class="hljs-number">0.3</span>, right=<span class="hljs-number">0.95</span>, top=<span class="hljs-number">0.95</span>, bottom=<span class="hljs-number">0.08</span>)  # Fine-tune the margins <span class="hljs-keyword">if</span> needed

plt.show()
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/6-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Plotting Feature Importances</em></p>
<p><strong>Now we need to increase the spacing between feature names</strong>: The positions of the feature names on the y-axis are adjusted to increase the spacing between them.</p>
<p><strong>Then we plot feature importances</strong>: A horizontal bar plot is created to visualize the feature importances. The plot's size is set to ensure it is clear and readable.</p>
<p>The bars represent the importance scores of the features, and the feature names are displayed along the y-axis.</p>
<p>The plot's x-axis is labeled "Importance," and the title "Feature Importances from Explainable Boosting Classifier" is added. The y-axis is inverted to have the most important features at the top.</p>
<p><strong>Then we adjust the spacing</strong>: The margins around the plot are fine-tuned to ensure proper spacing and a neat appearance.</p>
<p><strong>Finally, we display the plot</strong>: The plot is displayed to visualize the feature importances effectively.</p>
<p>The final result should look like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/interpret-1.png" alt="Image" width="600" height="400" loading="lazy">
<em>Features importance graph</em></p>
<p>This way, we can conclude from an artificial intelligence model that is interpretable and has an accuracy of 97%, that the five most important factors in detecting breast tumors are: </p>
<ul>
<li>Worst concave points </li>
<li>Worst texture </li>
<li>Worst area </li>
<li>Mean concave points </li>
<li>Area error &amp; worst concavity</li>
</ul>
<p>Again, this is according to the provided dataset.</p>
<p>So according to the population that this sample dataset represents, we can conclude in a <strong>data-driven way</strong> that these factors are key indicators for breast cancer tumor detection. </p>
<p>In short, by interpreting the model, we get clear insights into which features are most significant for its predictions.</p>
<h2 id="Conclusion">Conclusion: KAN (Kolmogorov–Arnold Networks) </h2>

<p>Thanks to explainable AI, we can study populations using new data-driven methods.</p>
<p>Instead of only using traditional statistics, surveys, and manual data analysis, we can draw conclusions more accurately using an AI programming library and a database or Excel file.</p>
<p>But this is not the only way to have models built with explainable AI.</p>
<p>In April 2024, a paper called <a target="_blank" href="https://arxiv.org/html/2404.19756v1">KAN: Kolmogorov–Arnold Networks</a> was published that might shake up the field even more.</p>
<p>Kolmogorov–Arnold Networks (KANs) promise to be both more accurate and easier to interpret than traditional models.</p>
<p>They are also easier to visualize and interact with. So we'll see what happens with them.</p>
<p>You can find the full code here:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Quantum Artificial Intelligence Model – With Python Code Examples ]]>
                </title>
                <description>
                    <![CDATA[ Machine learning (ML) is one of the most important subareas of AI used in building great AI systems. In ML, deep learning is a narrow area focused solely on neural networks. Through the field of deep learning, systems like ChatGPT and many other AI m... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-quantum-ai-model/</link>
                <guid isPermaLink="false">66ba5330f77647345442b9d1</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Tiago Capelo Monteiro ]]>
                </dc:creator>
                <pubDate>Tue, 23 Jul 2024 18:28:43 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/article_cover.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Machine learning (ML) is one of the most important subareas of AI used in building great AI systems.</p>
<p>In ML, deep learning is a narrow area focused solely on neural networks. Through the field of deep learning, systems like ChatGPT and many other AI models can be created. In other words, ChatGPT is just a giant system based on neural networks. </p>
<p>However, there is a big problem with deep learning: computational efficiency. Creating big and effective AI systems with neural networks often requires a lot of energy, which is expensive.</p>
<p>So, the more efficient the hardware is, the better. There are many approaches to this problem, one of which is quantum computing.</p>
<p>This article hopes to show, in plain English, the connection between quantum computing and artificial intelligence.</p>
<p>We'll talk about these:</p>
<ul>
<li><a class="post-section-overview" href="#heading-artificial-intelligence-and-the-rise-of-deep-learning">Artificial Intelligence and the Rise of Deep Learning</a></li>
<li><a class="post-section-overview" href="#heading-a-big-problem-in-deep-learning-computational-efficiency">A Big Problem in Deep Learning: Computational Efficiency</a></li>
<li><a class="post-section-overview" href="#heading-a-solution-quantum-computing">A Solution: Quantum Computing</a></li>
<li><a class="post-section-overview" href="#heading-code-example-a-quantum-ai-model-for-quantum-chemistry">Code Example: A Quantum AI Model for Quantum Chemistry</a></li>
<li><a class="post-section-overview" href="#heading-conclusion-limitations-of-quantum-computing-and-development">Conclusion: Limitations of Quantum Computing and Development</a></li>
</ul>
<h2 id="Artificial">Artificial Intelligence and the Rise of Deep Learning</h2>

<h3 id="heading-what-is-deep-learning-in-artificial-intelligence">What is Deep Learning in Artificial Intelligence?</h3>
<p>Deep learning is a subfield of artificial intelligence. It uses neural networks to process complex patterns, just like the strategies a sports team uses to win a match.</p>
<p>The bigger the neural network, the more capable it is of doing awesome things – like ChatGPT, for example, which uses natural language processing to answer questions and interact with users.</p>
<p>To truly understand the basics of neural networks – what every single AI model has in common that enables it to work – we need to understand activation layers.</p>
<h3 id="heading-deep-learning-training-neural-networks">Deep Learning = Training Neural Networks</h3>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/01/4-2.png" alt="4-2" width="600" height="400" loading="lazy">
<em>Simple neural network</em></p>
<p>At the core of deep learning is the training of neural networks. That means using data to get the right values for each neuron to be able to predict what we want.</p>
<p>Neural networks are made of neurons organized in layers. Each layer extracts unique features from the data.</p>
<p>This layered structure allows deep learning models to analyze and interpret complex data.</p>
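<p>The layered structure described above can be sketched as a tiny forward pass in NumPy (an illustrative toy of my own, not a real trained network):</p>

```python
import numpy as np

# Toy two-layer forward pass: each layer transforms its input,
# producing a new representation (new "features") of the data.
rng = np.random.default_rng(0)
x = rng.normal(size=4)           # input with 4 features

W1 = rng.normal(size=(3, 4))     # layer 1: 4 inputs -> 3 neurons
h = np.maximum(0, W1 @ x)        # ReLU activation (non-negative outputs)

W2 = rng.normal(size=(2, 3))     # layer 2: 3 neurons -> 2 outputs
y = W2 @ h

print(h.shape, y.shape)          # (3,) (2,)
```

<p>Training means adjusting the weight matrices (here `W1` and `W2`) so the final output matches the desired prediction – that adjustment is what consumes the data and compute discussed next.</p>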
<h2 id="problem">A Big Problem in Deep learning: Computational Efficiency</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/data-brett-sayles-4597280.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Brett Sayles: https://www.pexels.com/photo/black-hardwares-on-data-server-room-4597280/</em></p>
<p>Deep learning powers much of the transformation AI is bringing to society. However, it comes with a big problem: computational efficiency.</p>
<p>Training deep learning AI systems requires massive amounts of data and computational power. It can take anywhere from minutes to weeks, consuming a lot of energy and computational resources in the process.</p>
<p>There are many solutions to this problem, such as better algorithmic efficiency. </p>
<p>In large language models, much AI research focuses on exactly this: making smaller models match the performance of larger ones.</p>
<p>Another solution, besides algorithmic efficiency, is better computational efficiency. Quantum computing is one of the solutions related to better computational efficiency.</p>
<h2 id="Solution">A Solution: Quantum Computing</h2>

<p>Quantum computing is a promising solution to the computational efficiency problem in deep learning.</p>
<p>While normal computers work in bits (either 0 or 1), quantum computers work with qubits – which can be 0 and 1 at the same time.</p>
<p>With the qubits representing 0 and 1 at the same time, it is possible to process many possibilities simultaneously, thanks to a property called superposition in quantum physics.</p>
<p>This makes quantum computers far more efficient than normal computers for certain tasks.</p>
<p>It is therefore possible to design quantum algorithms that are more efficient than their classical counterparts, reducing the energy consumed when training AI models.</p>
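<p>Superposition can be sketched with a plain NumPy state-vector simulation (an illustration of my own, not actual quantum hardware): applying a Hadamard gate to the |0⟩ state gives a qubit equal amplitudes for 0 and 1.</p>

```python
import numpy as np

# State-vector sketch of superposition: a qubit's state is a
# 2-element complex vector; |0> is [1, 0].
ket0 = np.array([1.0, 0.0])

# The Hadamard gate rotates |0> into an equal superposition
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

state = H @ ket0
probs = np.abs(state) ** 2
print(probs)  # [0.5 0.5] – measuring yields 0 or 1 with equal probability
```

<p>Quantum libraries like PennyLane (used in the code example below) manage exactly these state vectors and gates, but across many entangled qubits at once.</p>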
<h3 id="heading-why-are-quantum-computers-not-so-widely-used">Why Are Quantum Computers Not So Widely Used?</h3>
<p>The problem with quantum computation is that there isn't a good, cheap physical representation of qubits.</p>
<p>Bits are created and managed with logic gates made from tiny transistors, which can be easily created by the billions.</p>
<p>Qubits are created and managed by superconducting circuits, trapped ions, and topological qubits, which are all very expensive.</p>
<p>This is the biggest problem in quantum computation. However, IBM, Amazon, and many others in cloud services allow people to run code on their quantum computers.</p>
<h2 id="Code">Code Example: A Quantum AI Model for Quantum Chemistry</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/chemnistry-pixabay-248152.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Pixabay: https://www.pexels.com/photo/two-clear-glass-jars-beside-several-flasks-248152/</em></p>
<p>In this code example, we'll solve a quantum chemistry problem:</p>
<p><em>What is the lowest energy level of the H₂ molecule using quantum computing?</em></p>
<p>Before understanding the problem at hand, let's discuss quantum chemistry.</p>
<h3 id="heading-what-is-quantum-chemistry">What is Quantum Chemistry?</h3>
<p>Quantum chemistry is a field of science that looks at how electrons behave in atoms and molecules.</p>
<p>It is about using quantum physics to understand how electrons, atoms, molecules and many more tiny particles interact and form different chemical substances.</p>
<h4 id="heading-the-problem-we-want-to-solve">The Problem We Want to Solve</h4>
<p>We want to find the "ground state energy" of the H₂ molecule. </p>
<p>The H₂ molecule means hydrogen gas, which is present in:</p>
<ul>
<li>Water</li>
<li>Organic compounds</li>
<li>Stars</li>
</ul>
<p>Actually, life on Earth would not be possible without it. </p>
<p>By finding the "ground state energy," which is the lowest possible energy that the molecule can have, we can know its most stable form and properties. </p>
<p>This allows scientists to better understand chemical reactions related to H₂. </p>
<p>With classical computers, this problem can be very complex due to a huge number of possibilities and intricate interactions. </p>
<p>With quantum computers, qubits are good representations of electrons, which can directly simulate the behavior of electrons in molecules.</p>
<h3 id="heading-approximating-with-the-vqe-variational-quantum-eigensolver-vqe">Approximating with the VQE (Variational Quantum Eigensolver (VQE)</h3>
<p>The Variational Quantum Eigensolver (VQE) is a hybrid algorithm that leverages both quantum and classical computing. </p>
<p>In this example, the VQE algorithm is used to find the ground state energy of a simple H₂ molecule. </p>
<p>The code is designed to run on a quantum simulator (which is a classical computer running a quantum algorithm).</p>
<p>However, it can be adapted to run on actual quantum hardware through a cloud-based quantum computing service. </p>
<p>This would involve using both quantum and classical resources in practice. Let’s go through the code step by step!</p>
<pre><code><span class="hljs-keyword">import</span> pennylane <span class="hljs-keyword">as</span> qml
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

# Define the molecule (H2 at bond length <span class="hljs-keyword">of</span> <span class="hljs-number">0.74</span> Å)
symbols = [<span class="hljs-string">"H"</span>, <span class="hljs-string">"H"</span>]
coordinates = np.array([<span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.74</span>])

# Generate the Hamiltonian <span class="hljs-keyword">for</span> the molecule
hamiltonian, qubits = qml.qchem.molecular_hamiltonian(
    symbols, coordinates
)

# Define the quantum device
dev = qml.device(<span class="hljs-string">"default.qubit"</span>, wires=qubits)

# Define the ansatz (variational quantum circuit)
def ansatz(params, wires):
    qml.BasisState(np.array([<span class="hljs-number">0</span>] * qubits), wires=wires)
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(qubits):
        qml.RY(params[i], wires=wires[i])
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(qubits - <span class="hljs-number">1</span>):
        qml.CNOT(wires=[wires[i], wires[i + <span class="hljs-number">1</span>]])

# Define the cost <span class="hljs-function"><span class="hljs-keyword">function</span>
@<span class="hljs-title">qml</span>.<span class="hljs-title">qnode</span>(<span class="hljs-params">dev</span>)
<span class="hljs-title">def</span> <span class="hljs-title">cost_fn</span>(<span class="hljs-params">params</span>):
    <span class="hljs-title">ansatz</span>(<span class="hljs-params">params, wires=range(qubits)</span>)
    <span class="hljs-title">return</span> <span class="hljs-title">qml</span>.<span class="hljs-title">expval</span>(<span class="hljs-params">hamiltonian</span>)

# <span class="hljs-title">Set</span> <span class="hljs-title">a</span> <span class="hljs-title">fixed</span> <span class="hljs-title">seed</span> <span class="hljs-title">for</span> <span class="hljs-title">reproducibility</span>
<span class="hljs-title">np</span>.<span class="hljs-title">random</span>.<span class="hljs-title">seed</span>(<span class="hljs-params"><span class="hljs-number">42</span></span>)

# <span class="hljs-title">Set</span> <span class="hljs-title">the</span> <span class="hljs-title">initial</span> <span class="hljs-title">parameters</span>
<span class="hljs-title">params</span> = <span class="hljs-title">np</span>.<span class="hljs-title">random</span>.<span class="hljs-title">random</span>(<span class="hljs-params">qubits, requires_grad=True</span>)

# <span class="hljs-title">Choose</span> <span class="hljs-title">an</span> <span class="hljs-title">optimizer</span>
<span class="hljs-title">optimizer</span> = <span class="hljs-title">qml</span>.<span class="hljs-title">GradientDescentOptimizer</span>(<span class="hljs-params">stepsize=<span class="hljs-number">0.4</span></span>)

# <span class="hljs-title">Number</span> <span class="hljs-title">of</span> <span class="hljs-title">optimization</span> <span class="hljs-title">steps</span>
<span class="hljs-title">max_iterations</span> = 100
<span class="hljs-title">conv_tol</span> = 1<span class="hljs-title">e</span>-06

# <span class="hljs-title">Optimization</span> <span class="hljs-title">loop</span>
<span class="hljs-title">energies</span> = []

<span class="hljs-title">for</span> <span class="hljs-title">n</span> <span class="hljs-title">in</span> <span class="hljs-title">range</span>(<span class="hljs-params">max_iterations</span>):
    params, prev_energy = optimizer.step_and_cost(cost_fn, params)

    energy = cost_fn(params)
    energies.append(energy)
    <span class="hljs-keyword">if</span> np.abs(energy - prev_energy) &lt; conv_tol:
        <span class="hljs-keyword">break</span>

    print(f<span class="hljs-string">"Step = {n}, Energy = {energy:.8f} Ha"</span>)

print(f<span class="hljs-string">"Final ground state energy = {energy:.8f} Ha"</span>)

# Visualize the results
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

iterations = range(len(energies))

plt.plot(iterations, energies)
plt.xlabel(<span class="hljs-string">'Iteration'</span>)
plt.ylabel(<span class="hljs-string">'Energy (Ha)'</span>)
plt.title(<span class="hljs-string">'Convergence of VQE for H2 Molecule'</span>)
plt.show()</span>
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/1-5.png" alt="Image" width="600" height="400" loading="lazy">
<em>Full Code Image</em></p>
<h3 id="heading-importing-libraries">Importing Libraries</h3>
<pre><code><span class="hljs-keyword">import</span> pennylane <span class="hljs-keyword">as</span> qml
<span class="hljs-keyword">from</span> pennylane <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np  # PennyLane's NumPy wrapper, needed for requires_grad
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/2-4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Importing libraries</em></p>
<ul>
<li><a target="_blank" href="https://pennylane.ai/">pennylane</a>: A library for quantum computing that provides tools for creating and optimizing quantum circuits, and for running quantum machine learning algorithms.</li>
<li><a target="_blank" href="https://numpy.org/">numpy</a>: A library for numerical operations in Python, used here for handling arrays and mathematical computations.</li>
<li><a target="_blank" href="https://matplotlib.org/">matplotlib</a>: A library for creating visualizations and plots in Python, used here to graph the convergence of the VQE algorithm.</li>
</ul>
<h3 id="heading-defining-the-molecule-and-generating-the-hamiltonian">Defining the Molecule and Generating the Hamiltonian</h3>
<pre><code># Define the molecule (H2 at bond length <span class="hljs-keyword">of</span> <span class="hljs-number">0.74</span> Å)
symbols = [<span class="hljs-string">"H"</span>, <span class="hljs-string">"H"</span>]
coordinates = np.array([<span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.74</span>])

# Generate the Hamiltonian <span class="hljs-keyword">for</span> the molecule
hamiltonian, qubits = qml.qchem.molecular_hamiltonian(
    symbols, coordinates
)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/3-4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Defining the Molecule and generating the Hamiltonian</em></p>
<p><strong>Defining the Molecule</strong>:</p>
<ul>
<li>We define a hydrogen molecule (H₂).</li>
<li><code>symbols = ["H", "H"]</code>: This means the molecule consists of two hydrogen (H) atoms.</li>
<li><code>coordinates = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.74])</code>: This gives the positions of the two hydrogen atoms. The first hydrogen atom is at the origin (0.0, 0.0, 0.0), and the second hydrogen atom is at (0.0, 0.0, 0.74), which means it is 0.74 angstroms away from the first atom along the z-axis.</li>
</ul>
<p><strong>Generating the Hamiltonian</strong>:</p>
<ul>
<li><code>hamiltonian, qubits = qml.qchem.molecular_hamiltonian(symbols, coordinates)</code>: This line generates the Hamiltonian for the hydrogen molecule. The Hamiltonian is a mathematical object used to describe the energy of the molecule.</li>
<li><code>hamiltonian</code>: Represents the energy operator for the molecule.</li>
<li><code>qubits</code>: Represents the number of quantum bits (qubits) needed to simulate the molecule on a quantum computer.</li>
</ul>
<h3 id="heading-defining-the-quantum-device-and-ansatz-variational-quantum-circuit">Defining the Quantum Device and Ansatz (Variational Quantum Circuit)</h3>
<pre><code># Define the quantum device
dev = qml.device(<span class="hljs-string">"default.qubit"</span>, wires=qubits)

# Define the ansatz (variational quantum circuit)
def ansatz(params, wires):
    qml.BasisState(np.array([<span class="hljs-number">0</span>] * qubits), wires=wires)
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(qubits):
        qml.RY(params[i], wires=wires[i])
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(qubits - <span class="hljs-number">1</span>):
        qml.CNOT(wires=[wires[i], wires[i + <span class="hljs-number">1</span>]])
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/4-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Defining the Quantum Device and Ansatz (Variational Quantum Circuit)</em></p>
<p><strong>Defining the Quantum Device</strong>:</p>
<ul>
<li><code>dev = qml.device("default.qubit", wires=qubits)</code>: This line sets up a quantum computing device to simulate our molecule.</li>
<li><code>"default.qubit"</code>: This specifies the type of quantum simulator we are using (a default qubit-based simulator).</li>
<li><code>wires=qubits</code>: This tells the simulator how many qubits (quantum bits) it needs to use, based on the number of qubits we determined earlier.</li>
</ul>
<p><strong>Defining the Ansatz (Variational Quantum Circuit)</strong>:</p>
<ul>
<li><code>def ansatz(params, wires)</code>: This defines a function named <code>ansatz</code> which describes the variational quantum circuit. This circuit will be used to find the ground state energy of the molecule.</li>
<li><code>qml.BasisState(np.array([0] * qubits), wires=wires)</code>: This initializes every qubit in the |0⟩ state. The <code>np.array([0] * qubits)</code> creates an array of zeros, one for each qubit.</li>
<li><code>for i in range(qubits): qml.RY(params[i], wires=wires[i])</code>: This loop applies a rotation around the Y-axis to each qubit. <code>params[i]</code> provides the angle for each rotation.</li>
<li><code>for i in range(qubits - 1): qml.CNOT(wires=[wires[i], wires[i + 1]])</code>: This loop applies Controlled-NOT (CNOT) gates between consecutive qubits, entangling them.</li>
</ul>
<h3 id="heading-defining-the-cost-function-setting-initial-parameters-and-optimizer">Defining the Cost Function, Setting Initial Parameters and Optimizer</h3>
<pre><code># Define the cost function
@qml.qnode(dev)
<span class="hljs-keyword">def</span> cost_fn(params):
    ansatz(params, wires=range(qubits))
    <span class="hljs-keyword">return</span> qml.expval(hamiltonian)

# Set a fixed seed for reproducibility
np.random.seed(<span class="hljs-number">42</span>)

# Set the initial parameters
params = np.random.random(qubits, requires_grad=True)

# Choose an optimizer
optimizer = qml.GradientDescentOptimizer(stepsize=<span class="hljs-number">0.4</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/5-3.png" alt="Image" width="600" height="400" loading="lazy">
<em>Defining the Cost Function, Setting Initial Parameters and Optimizer</em></p>
<p><strong>Defining the Cost Function</strong>:</p>
<ul>
<li><code>@qml.qnode(dev)</code>: This line is a decorator that transforms the <code>cost_fn</code> function into a quantum node, allowing it to run on the quantum device <code>dev</code>.</li>
<li><code>def cost_fn(params)</code>: This defines a function named <code>cost_fn</code> that takes some parameters (<code>params</code>) as input.</li>
<li><code>ansatz(params, wires=range(qubits))</code>: Inside this function, we call the previously defined <code>ansatz</code> function, passing in the parameters and specifying that it should use all the qubits.</li>
<li><code>return qml.expval(hamiltonian)</code>: This line returns the expected value of the Hamiltonian, which represents the energy of the molecule. The cost function is what we aim to minimize to find the ground state energy.</li>
</ul>
<p><strong>Setting a Fixed Seed for Reproducibility</strong>:</p>
<ul>
<li><code>np.random.seed(42)</code>: This line sets a fixed seed for the random number generator. This ensures that the random numbers generated will be the same each time the code is run, making the results reproducible.</li>
</ul>
<p><strong>Setting the Initial Parameters</strong>:</p>
<ul>
<li><code>params = np.random.random(qubits, requires_grad=True)</code>: This line initializes the parameters for the ansatz with random values. The number of parameters is equal to the number of qubits. The <code>requires_grad=True</code> part indicates that these parameters can be adjusted during optimization.</li>
</ul>
<p><strong>Choosing an Optimizer</strong>:</p>
<ul>
<li><code>optimizer = qml.GradientDescentOptimizer(stepsize=0.4)</code>: This line creates an optimizer that will adjust the parameters to minimize the cost function. Specifically, it uses gradient descent with a step size of 0.4.</li>
</ul>
<h3 id="heading-optimization-loop">Optimization Loop</h3>
<pre><code># <span class="hljs-built_in">Number</span> <span class="hljs-keyword">of</span> optimization steps
max_iterations = <span class="hljs-number">100</span>
conv_tol = <span class="hljs-number">1e-06</span>

# Optimization loop
energies = []

<span class="hljs-keyword">for</span> n <span class="hljs-keyword">in</span> range(max_iterations):
    params, prev_energy = optimizer.step_and_cost(cost_fn, params)

    energy = cost_fn(params)
    energies.append(energy)
    <span class="hljs-keyword">if</span> np.abs(energy - prev_energy) &lt; conv_tol:
        <span class="hljs-keyword">break</span>

    print(f<span class="hljs-string">"Step = {n}, Energy = {energy:.8f} Ha"</span>)

print(f<span class="hljs-string">"Final ground state energy = {energy:.8f} Ha"</span>)
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/6-2.png" alt="Image" width="600" height="400" loading="lazy">
<em>Optimization Loop</em></p>
<p><strong>Setting the Number of Optimization Steps</strong>:</p>
<ul>
<li><code>max_iterations = 100</code>: This sets the maximum number of steps the optimization will take. In this case, it is 100 steps.</li>
<li><code>conv_tol = 1e-06</code>: This defines the convergence tolerance. If the change in energy between steps is less than this value, the optimization will stop.</li>
</ul>
<p><strong>Optimization Loop</strong>:</p>
<ul>
<li><code>energies = []</code>: This initializes an empty list to store the energies calculated at each step.</li>
</ul>
<p><strong>Looping Through Optimization Steps</strong>:</p>
<ul>
<li><code>for n in range(max_iterations):</code>: This starts a loop that will run up to <code>max_iterations</code> times.</li>
<li><code>params, prev_energy = optimizer.step_and_cost(cost_fn, params)</code>: This line performs one step of optimization. It updates the parameters and returns the new parameters and the previous energy.</li>
<li><code>energy = cost_fn(params)</code>: This calculates the current energy using the updated parameters.</li>
<li><code>energies.append(energy)</code>: This adds the current energy to the <code>energies</code> list.</li>
<li><code>if np.abs(energy - prev_energy) &lt; conv_tol: break</code>: This checks if the absolute difference between the current energy and the previous energy is less than the convergence tolerance. If it is, the loop stops early because the optimization has converged.</li>
<li><code>print(f"Step = {n}, Energy = {energy:.8f} Ha")</code>: This prints the current step number and the energy in Hartree (Ha) to eight decimal places.</li>
</ul>
<p><strong>Printing the Final Energy</strong>:</p>
<ul>
<li><code>print(f"Final ground state energy = {energy:.8f} Ha")</code>: After the loop, this prints the final ground state energy.</li>
</ul>
<h3 id="heading-visualizing-the-results">Visualizing the Results</h3>
<pre><code># Visualize the results
iterations = range(len(energies))

plt.plot(iterations, energies)
plt.xlabel(<span class="hljs-string">'Iteration'</span>)
plt.ylabel(<span class="hljs-string">'Energy (Ha)'</span>)
plt.title(<span class="hljs-string">'Convergence of VQE for H2 Molecule'</span>)
plt.show()
</code></pre><p><img src="https://www.freecodecamp.org/news/content/images/2024/07/7.png" alt="Image" width="600" height="400" loading="lazy">
<em>Visualizing the Results</em></p>
<p><strong>Setting Up the Data for Visualization</strong>:</p>
<ul>
<li><code>iterations = range(len(energies))</code>: This creates a range object representing the number of iterations (steps) the optimization went through. <code>len(energies)</code> gives the number of energy values recorded.</li>
</ul>
<p><strong>Plotting the Results</strong>:</p>
<ul>
<li><code>plt.plot(iterations, energies)</code>: This line creates a plot with the iteration numbers on the x-axis and the corresponding energy values on the y-axis.</li>
<li><code>plt.xlabel('Iteration')</code>: This sets the label for the x-axis to "Iteration".</li>
<li><code>plt.ylabel('Energy (Ha)')</code>: This sets the label for the y-axis to "Energy (Ha)", where "Ha" stands for Hartree, a unit of energy.</li>
<li><code>plt.title('Convergence of VQE for H2 Molecule')</code>: This sets the title of the plot to "Convergence of VQE for H2 Molecule".</li>
<li><code>plt.show()</code>: This displays the plot.</li>
</ul>
<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/H2H.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The graph titled "Convergence of VQE for H2 Molecule" shows the energy (in Hartree, Ha) of the H2 molecule plotted against the number of iterations of the Variational Quantum Eigensolver (VQE) algorithm.</p>
<ul>
<li><strong>X-Axis (Iteration):</strong> Number of VQE iterations.</li>
<li><strong>Y-Axis (Energy (Ha)):</strong> Energy of the H2 molecule in Hartree.</li>
</ul>
<h3 id="heading-key-points">Key Points:</h3>
<ul>
<li><strong>Initial Energy:</strong> Approximately 1.4 Ha at iteration 0.</li>
<li><strong>Rapid Decrease:</strong> Energy quickly drops within the first 20 iterations.</li>
<li><strong>Plateau:</strong> Energy stabilizes around 0.4 Ha after 20 iterations, indicating convergence to an optimal or near-optimal solution.</li>
</ul>
<h2 id="Conclusion">Conclusion: Limitations of Quantum Computing and Development</h2>

<p><img src="https://www.freecodecamp.org/news/content/images/2024/07/PC-richasharma96-4247412.jpg" alt="Image" width="600" height="400" loading="lazy">
<em>Photo by Richa Sharma: https://www.pexels.com/photo/ceramic-mug-on-black-laptop-on-table-in-office-4247412/</em></p>
<p>Besides making AI algorithms far more computationally efficient, quantum computing can revolutionize many fields like:</p>
<ul>
<li>Drug discovery</li>
<li>Material science</li>
<li>Cryptography</li>
<li>Financial modeling</li>
<li>Optimization problems</li>
<li>Climate modeling</li>
<li>Machine learning</li>
</ul>
<p>However, before everyone can use quantum computing, we need a way to physically realize qubits in hardware compact enough to fit on our laptops. That will take years.</p>
<p>The full code can be found here:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code">https://github.com/tiagomonteiro0715/freecodecamp-my-articles-source-code</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Does Knowledge Distillation Work in Deep Learning Models? ]]>
                </title>
                <description>
                    <![CDATA[ Deep learning models have transformed several industries, including computer vision and natural language processing. However, the rising complexity and resource requirements of these models have motivated academics to look into ways to condense their... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/knowledge-distillation-in-deep-learning-models/</link>
                <guid isPermaLink="false">66d4608c677cb8c6c15f316b</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oyedele Tioluwani ]]>
                </dc:creator>
                <pubDate>Tue, 09 Jul 2024 13:35:16 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/kenny-eliason-5afenxnLDjs-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Deep learning models have transformed several industries, including computer vision and natural language processing. However, the rising complexity and resource requirements of these models have motivated academics to look into ways to condense their knowledge into more compact and efficient forms.</p>
<p>Knowledge distillation, a strategy for transferring knowledge from a complicated model to a simpler one, has emerged as an effective instrument for accomplishing this goal. In this article, we’ll look at the notion of knowledge distillation in deep learning models and its applications.</p>
<h2 id="heading-concept-of-knowledge-distillation">Concept of Knowledge Distillation</h2>
<p>Knowledge distillation is a deep learning process in which knowledge is transferred from a complicated, well-trained model, known as the “teacher,” to a simpler and lighter model, known as the “student.”</p>
<p>The basic purpose of knowledge distillation is to produce a more efficient model that retains the important information and performance of the bigger model while being computationally less demanding.</p>
<p>The process consists of two steps:</p>
<h3 id="heading-1-training-the-teacher-model">1. Training the “teacher” Model</h3>
<ul>
<li><p>The teacher model is trained on labeled data to discover patterns and correlations within it.</p>
</li>
<li><p>The teacher model’s large capacity allows it to capture minute details, resulting in superior performance on the assigned task.</p>
</li>
<li><p>The teacher model’s predictions on the training data provide a source of knowledge that the student model seeks to imitate.</p>
</li>
</ul>
<h3 id="heading-2-transferring-knowledge-to-the-student-model">2. Transferring Knowledge to the “student” Model:</h3>
<ul>
<li><p>The student model is then trained using the same data as the teacher but with a difference.</p>
</li>
<li><p>Instead of typical hard labels (a data point’s final class assignment), the student model is trained with soft labels (a significantly richer representation of the data), which are probability distributions over the classes supplied by the teacher model.</p>
</li>
<li><p>Using soft labels, the student learns not just to copy the teacher’s final judgments, but also to understand the uncertainty and logic behind those predictions.</p>
</li>
<li><p>The goal is for the student model to generalize and approximate the knowledge encoded in the teacher model, resulting in a more compact representation of the data.</p>
</li>
</ul>
<p>Knowledge distillation uses the teacher model’s soft targets, which reflect not just the predicted class but the probability distribution across all possible classes. These soft targets provide subtle cues, exposing not just the destination but also the terrain the student model must navigate. By incorporating these cues into its training, the student learns not only to replicate the teacher model’s outcomes but also to recognize the larger patterns and correlations buried in the data.</p>
<p>The soft labels give a smoother gradient during training, allowing the student model to benefit more from the teacher’s knowledge. This procedure helps the student model generalize better and frequently results in a smaller model that retains a considerable share of the teacher’s performance.</p>
<p>The temperature parameter used in the <a target="_blank" href="https://en.wikipedia.org/wiki/Softmax_function">softmax</a> function during the knowledge distillation process influences the sharpness of the probability distributions. Higher temperatures cause softer probability distributions, emphasizing information transfer, whereas lower temperatures produce sharper distributions, favoring precise predictions.</p>
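<p>As a rough illustration, here is a minimal NumPy sketch of the standard temperature-scaled softmax (the logits below are made-up values, not from any real model); raising the temperature visibly flattens the distribution:</p>

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T; larger T gives a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.2])                 # hypothetical teacher logits
sharp = softmax_with_temperature(logits, T=1.0)    # most mass on the top class
soft = softmax_with_temperature(logits, T=5.0)     # probabilities pulled together
print(sharp, soft)
```

Both outputs are valid probability distributions; the T=5 version simply spreads probability mass across the classes, which is the extra "dark knowledge" the student trains on.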
<p>Overall, knowledge distillation is the process of transferring gained knowledge from a powerful and complicated model to a smaller one, making it more suitable for use in circumstances with limited computational resources.</p>
<h2 id="heading-relevance-of-knowledge-distillation-in-deep-learning">Relevance of Knowledge Distillation in Deep Learning</h2>
<p>Knowledge distillation is important in deep learning for a variety of reasons, and its applications encompass multiple fields. Here are some major factors that demonstrate the importance of knowledge distillation in the field of deep learning:</p>
<ol>
<li><p><strong>Model Compression:</strong> Model compression is a fundamental motivator for knowledge distillation. Deep learning models, particularly those with millions of parameters, can be computationally expensive and resource-consuming. Knowledge distillation allows for the production of smaller, more lightweight models that retain a significant fraction of the performance of their larger counterparts.</p>
</li>
<li><p><strong>Model Pruning:</strong> Knowledge distillation can be used to find and eliminate duplicate or irrelevant neurons and connections in a deep learning model. Training a student model to emulate the behavior of a teacher model allows the student model to learn which aspects of the teacher model are most important and which can be safely deleted.</p>
</li>
<li><p><strong>Enhanced Generalization:</strong> Knowledge distillation frequently produces student models with increased generalization capabilities. By learning not only the final predictions but also the logic and uncertainty from the teacher model, the student may better generalize to previously unseen data, making it a powerful strategy for boosting model resilience.</p>
</li>
<li><p><strong>Transfer Learning:</strong> Knowledge distillation can be used to transfer knowledge from a pre-trained deep learning model to a new model trained on a separate but related problem. By training a student model to imitate the behavior of a pre-trained teacher model, the student model can learn the broad characteristics and patterns common to both tasks, allowing it to perform effectively on the new task with less data and computational resources.</p>
</li>
<li><p><strong>Scalability and Accessibility:</strong> Knowledge distillation helps to make complex artificial intelligence technology more accessible to a wider audience. Smaller models demand fewer computational resources, making it easier for researchers, developers, and businesses to implement and incorporate deep learning technologies into their applications.</p>
</li>
<li><p><strong>Performance Improvement:</strong> In special cases, knowledge distillation can even result in enhanced performance on specific tasks, particularly when data is scarce. The student model benefits from the teacher’s deeper understanding of data distribution, resulting in improved generalization and robustness.</p>
</li>
</ol>
<h2 id="heading-applications-of-knowledge-distillation">Applications of Knowledge Distillation</h2>
<p>Knowledge distillation can be applied in a variety of fields in deep learning, providing advantages such as model compression, enhanced generalization, and efficient deployment. Here are some notable applications for knowledge distillation:</p>
<ol>
<li><p><strong>Computer Vision:</strong> Object detection uses knowledge distillation to compress large and complicated object identification models, making them acceptable for deployment on devices with limited processing resources, such as security cameras and drones.</p>
</li>
<li><p><strong>Natural Language Processing (NLP):</strong> Knowledge distillation is used to generate compact models for text classification, sentiment analysis, and other NLP applications. These models are more suitable for real-time applications and can be implemented on platforms such as chatbots and mobile devices.<br> Distilled models in NLP are also utilized for language translation, enabling effective language processing across multiple platforms.</p>
</li>
<li><p><strong>Recommendation Systems:</strong> Knowledge distillation is used in recommendation systems to build efficient models capable of providing individualized recommendations depending on user behavior. These models are better suited for distribution across several platforms.</p>
</li>
<li><p><strong>Edge Computing:</strong> Knowledge-distilled models enable the deployment of deep learning models on edge devices with low resources. This is critical for applications such as real-time video analysis, edge-based image processing, and IoT devices.</p>
</li>
<li><p><strong>Anomaly Detection:</strong> In cybersecurity and anomaly detection, knowledge distillation is used to generate lightweight models for detecting unexpected patterns in network traffic or user behavior. These models help to detect threats quickly and efficiently.</p>
</li>
<li><p><strong>Quantum Computing:</strong> In the growing field of quantum computing, knowledge distillation is being investigated to create more compact quantum models that can run efficiently on quantum hardware.</p>
</li>
<li><p><strong>Transfer Learning:</strong> Knowledge distillation enhances transfer learning, allowing pre-trained models to quickly apply their knowledge to new tasks. This is useful in cases where labeled data for the target job is limited.</p>
</li>
</ol>
<p>There are numerous case studies demonstrating the effectiveness of knowledge distillation in diverse fields. These case studies highlight the versatility of knowledge distillation across different domains, including natural language processing, computer vision, and finance. Examples include:</p>
<ul>
<li><p>In the healthcare industry, knowledge distillation is being used to train smaller, faster models for medical image analysis and illness detection. Early research indicates that lowering model size while retaining diagnostic accuracy is a promising approach.</p>
</li>
<li><p>Knowledge distillation has been used to increase the accuracy and resilience of speech recognition models, particularly for low-resource languages with limited data. Baidu and Google have shown considerable improvements in word error rate (WER) by distilling knowledge from large pre-trained models.</p>
</li>
<li><p>Knowledge distillation can be used to train robot gripping devices to handle a variety of things efficiently. By extracting knowledge from a pre-trained model that has gripped a variety of items, a smaller model can acquire efficient grasping methods with less training data and processing resources.</p>
</li>
<li><p>Knowledge distillation can help train AI models for resource-constrained IoT devices. A smaller variant can run on low-power devices while still performing important activities like sensor data analysis and anomaly detection.</p>
</li>
</ul>
<p>These examples demonstrate knowledge distillation’s adaptability beyond its conventional use in vision and language tasks. Its capacity to bridge the gap between model accuracy and efficiency has major real-world applications, allowing AI solutions to function effectively in diverse and resource-constrained situations.</p>
<h3 id="heading-techniques-and-methods-for-knowledge-distillation">Techniques and Methods for Knowledge Distillation</h3>
<p>To ensure effective knowledge distillation, a variety of strategies and tactics are used. Here are some important strategies for knowledge distillation:</p>
<p><strong>1. Soft Target Labels:</strong> Soft target labels in knowledge distillation involve using probability distributions, known as soft labels, instead of standard hard labels during the training of a student model. These soft labels are created by applying a softmax function to the output logits of a more advanced teacher model. The temperature parameter in the softmax function affects the smoothness of the probability distributions.</p>
<p>By training the student model to match these soft target labels, it learns not only the teacher’s final predictions but also the teacher’s level of confidence and uncertainty for each prediction. This refined approach improves the student model’s capacity to generalize and capture the complex knowledge embedded in the teacher model, yielding a more efficient and compact model.</p>
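<p>A common way to combine the hard and soft objectives follows the formulation popularized by Hinton et al. Here is a minimal NumPy sketch; the logits, temperature, and weighting are hypothetical illustration values, not a definitive implementation:</p>

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Weighted sum of (a) cross-entropy against the hard label and
    (b) KL divergence between temperature-softened teacher and student
    distributions. The T**2 factor is the usual convention for keeping
    gradient magnitudes comparable across temperatures."""
    hard_loss = -np.log(softmax(student_logits)[hard_label])
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_loss = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    return alpha * hard_loss + (1 - alpha) * T**2 * soft_loss

teacher_logits = np.array([4.0, 1.0, 0.5])   # hypothetical, well-trained teacher
student_logits = np.array([2.5, 1.2, 0.3])   # hypothetical student being trained
print(distillation_loss(student_logits, teacher_logits, hard_label=0))
```

The soft term vanishes when the student exactly matches the teacher, so minimizing this loss pulls the student's whole distribution, not just its top prediction, toward the teacher's.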
<p><strong>2.</strong> <strong>Feature Mimicry:</strong> Feature mimicry is a knowledge distillation technique in which a simpler student model is trained to replicate the intermediate feature representations of a more complex teacher model.</p>
<p>Rather than just reproducing the teacher’s final predictions, the student model is instructed to match its internal feature maps at various layers with those of the teacher.</p>
<p>This method tries to convey both the high-level information embodied in the teacher’s predictions and the deep hierarchical features learned throughout the network. By including feature mimicry, the student model can capture deeper information and linkages in the teacher’s representations, resulting in better generalization and performance.</p>
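<p>In its simplest form, feature mimicry can be expressed as a mean-squared-error term between intermediate feature maps. The sketch below is a toy NumPy version with made-up tensor shapes; real setups typically take activations from specific network layers and often add a small learned projection when shapes differ:</p>

```python
import numpy as np

def feature_mimicry_loss(student_feat, teacher_feat):
    """Mean squared error between intermediate feature maps of student and
    teacher. Assumes matching shapes; in practice a small adapter layer often
    projects the student's features to the teacher's dimensionality first."""
    s = np.asarray(student_feat, dtype=float)
    t = np.asarray(teacher_feat, dtype=float)
    return float(np.mean((s - t) ** 2))

rng = np.random.default_rng(0)
teacher_map = rng.normal(size=(8, 8, 16))                      # hypothetical teacher feature map
student_map = teacher_map + 0.1 * rng.normal(size=(8, 8, 16))  # student features, slightly off
print(feature_mimicry_loss(student_map, teacher_map))          # small: noise variance is about 0.01
```

Adding this term to the student's training loss pushes its internal representations, not just its outputs, toward the teacher's.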
<p><strong>3. Self Distillation:</strong> This is a knowledge distillation technique in which a model transfers its knowledge to a simplified version of itself. The teacher and student models share the same architecture. The process can be iterative, with the distilled student serving as the teacher for the subsequent round of distillation.</p>
<p>Self-distillation uses the model’s inherent complexity to guide the learning of a more compact version, allowing for a gradual refining of understanding. This strategy is especially beneficial when a model needs to adapt and reduce its information into a smaller form, resulting in a balance of model size and performance.</p>
<p><strong>4. Multi-Teacher Distillation:</strong> Multi-teacher distillation is a method for transferring knowledge from many teacher models to a single student model. Each teacher model brings a distinct viewpoint or skill to the task at hand.</p>
<p>The student model learns from the combined knowledge of these varied teachers, aiming to capture a more complete understanding of the data.</p>
<p>This method frequently improves the robustness and generality of the student model by combining information from different sources. Multi-teacher distillation is especially useful when the work requires complicated and diverse patterns that can be better grasped from multiple perspectives.</p>
<p><strong>5. Attention Transfer:</strong> Attention transfer is a knowledge distillation technique that trains a simpler student model to emulate the attention mechanisms of a more complex teacher model.</p>
<p>Attention mechanisms highlight relevant portions of the input data, allowing the model to concentrate on key elements. In this strategy, the student model learns not only to imitate the teacher’s final predictions but also to emulate attention patterns.</p>
<p>This improves the student model’s interpretability and performance by capturing the selective focus and reasoning used by the teacher model during decision-making.</p>
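<p>One widely used formulation of this idea works with activation-based attention maps: a convolutional feature map is collapsed into a spatial attention map by summing squared activations over channels, and the student is trained to match the teacher’s normalized maps. A hedged sketch with toy convolutional layers:</p>

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    # collapse channels: sum of squared activations, then L2-normalise per sample
    amap = feat.pow(2).sum(dim=1).flatten(start_dim=1)
    return F.normalize(amap, dim=1)

teacher_conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)  # toy teacher layer
student_conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)  # toy student layer

x = torch.randn(4, 3, 8, 8)  # stand-in for a batch of images
with torch.no_grad():
    t_map = attention_map(teacher_conv(x))
s_map = attention_map(student_conv(x))
at_loss = (s_map - t_map).pow(2).mean()  # added to the task loss with a weight
```

Because the maps are purely spatial, the teacher and student layers can have different channel counts, which is exactly the situation in which a smaller student is being distilled.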
<h2 id="heading-challenges-and-limitations-of-knowledge-distillation">Challenges and Limitations of Knowledge Distillation</h2>
<p>While knowledge distillation is a powerful technique with many benefits, it also has drawbacks and limitations. Understanding these difficulties is critical for practitioners hoping to use knowledge distillation effectively. Here are some obstacles and constraints related to knowledge distillation:</p>
<ol>
<li><p><strong>Computational Overhead:</strong> Knowledge distillation necessitates training both a teacher and a student model, potentially increasing the overall computational burden. The technique requires more steps than training a single model, which may make it less suitable for resource-constrained applications.</p>
</li>
<li><p><strong>Finding the Optimal Teacher-Student Pair:</strong> It is critical to select a teacher model whose qualities complement the student’s. A mismatch might result in poor performance or overfitting to the teacher’s biases.</p>
</li>
<li><p><strong>Hyperparameter Tuning:</strong> The performance of knowledge distillation depends on the hyperparameters used, such as the temperature parameter in soft label generation. Finding the ideal balance can be difficult and may require extensive experimentation.</p>
</li>
<li><p><strong>Risk of Overfitting to Teacher’s Biases:</strong> If the teacher model has biases or was trained on biased data, the student model may inherit them during the distillation process. Care must be taken to address and reduce any potential biases in the teacher model.</p>
</li>
<li><p><strong>Sensitivity to Noisy Labels:</strong> Knowledge distillation can be vulnerable to noisy labels in the training data, potentially resulting in the transfer of incorrect or unreliable knowledge from the teacher to the student.</p>
</li>
</ol>
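<p>The temperature mentioned under hyperparameter tuning has an effect that is easy to see directly: higher temperatures flatten the teacher’s output distribution, so the relative probabilities of the non-top classes carry more of the training signal. A quick illustration with made-up logits:</p>

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, 0.2])  # example teacher logits for one input
for T in (1.0, 2.0, 5.0):
    probs = F.softmax(logits / T, dim=0)
    print(f"T={T}: {probs.tolist()}")
# Higher T -> the top class keeps less probability mass, exposing more of
# the teacher's "dark knowledge" about the remaining classes.
```

Tuning this value (often somewhere between 1 and 20) against a validation set is a typical part of setting up a distillation pipeline.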
<p>Despite these obstacles and limitations, knowledge distillation remains an effective method for transferring knowledge from a large, complex model to a smaller, simpler one.</p>
<p>With careful consideration and modification, knowledge distillation can improve the performance of machine learning models in a variety of applications.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Knowledge distillation is a powerful technique in the field of deep learning, providing a road to more efficient, compact, and flexible models.</p>
<p>Knowledge distillation addresses model size, computational efficiency, and generalization issues by transferring knowledge from large teacher models to simpler student models in a nuanced way.</p>
<p>The distilled models not only retain much of their teachers’ predictive capability, but they frequently offer faster inference times and greater adaptability as well.</p>
<p>I hope this article was helpful!</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
