<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Deep Learning - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Deep Learning - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Thu, 11 Jun 2026 05:18:31 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/deep-learning/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ AI Paper Review: Training Language Models to Follow Instructions with Human Feedback (InstructGPT) ]]>
                </title>
                <description>
                    <![CDATA[ GPT-3 was a major breakthrough in natural language processing. With 175 billion parameters, it demonstrated remarkable few-shot learning abilities and showed that scaling large language models could u ]]>
                </description>
                <link>https://www.freecodecamp.org/news/ai-paper-review-training-language-models-to-follow-instructions-with-human-feedback-instructgpt/</link>
                <guid isPermaLink="false">6a206bf72a223bf98b13dcfc</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ large language models ]]>
                    </category>
                
                    <category>
                        <![CDATA[ llm ]]>
                    </category>
                
                    <category>
                        <![CDATA[ chatgpt ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mohammed Fahd Abrah ]]>
                </dc:creator>
                <pubDate>Wed, 03 Jun 2026 18:01:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/494c3fa7-d7a0-448b-9983-99575f91836d.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>GPT-3 was a major breakthrough in natural language processing. With 175 billion parameters, it demonstrated remarkable few-shot learning abilities and showed that scaling large language models could unlock a wide range of capabilities.</p>
<p>Yet despite its impressive performance, GPT-3 revealed an important limitation: raw capability doesn't automatically create a useful assistant.</p>
<p>A language model can generate fluent text, answer questions, and solve complex tasks while still failing to follow what the user actually wants.</p>
<p>GPT-3 could produce responses that were inconsistent, overly confident, difficult to control, or misaligned with user instructions. It was a powerful prediction engine, but it wasn't designed to reliably act as a helpful assistant.</p>
<p>This challenge motivated one of the most influential papers in modern AI: <em>Training Language Models to Follow Instructions with Human Feedback</em>. Rather than making the model larger, the researchers focused on teaching it how to better follow human intent.</p>
<p>The result was InstructGPT, a system fine-tuned from GPT-3 that demonstrated how human feedback could transform a capable language model into a far more useful and aligned assistant.</p>
<p>This gap between capability and usefulness became one of the most important problems in modern AI: alignment.</p>
<p>Researchers realized that building larger models was only part of the solution. While scaling improved capabilities, it didn't guarantee that models would reliably follow instructions or behave in ways that matched user expectations. The next stage of progress required teaching models how to respond in a more helpful, truthful, and safe manner.</p>
<p>This led to the development of instruction-following systems and Reinforcement Learning from Human Feedback (RLHF). Instead of optimizing models solely to predict the next word, researchers began training them to better align with human preferences and intentions.</p>
<p>This shift marked a major turning point in the evolution of large language models.</p>
<p>GPT-3 demonstrated the power of large-scale language modeling and introduced many people to prompting and few-shot learning.</p>
<p>InstructGPT built on that foundation by showing how human feedback could significantly improve instruction following and model behavior. ChatGPT then brought these ideas to a much broader audience by packaging aligned language models into an accessible conversational interface used by millions of people.</p>
<p>In many ways, language models became capable before they became aligned.</p>
<p>That's why the transition from GPT-3 to InstructGPT represents one of the most important milestones in the history of artificial intelligence. The focus was no longer only on making models more capable. It was also about making them more useful, reliable, and responsive to human intent.</p>
<p>InstructGPT pioneered many of the alignment techniques that later became a core part of systems such as ChatGPT and GPT-4.</p>
<h2 id="heading-paper-overview"><strong>Paper Overview:</strong></h2>
<p>In this article, we’ll mainly focus on the paper <a href="https://arxiv.org/pdf/2203.02155"><strong>Training Language Models to Follow Instructions with Human Feedback</strong></a>, published by OpenAI in 2022.</p>
<p>This paper introduced <strong>InstructGPT</strong>, one of the most important transitions in the history of large language models. While earlier GPT systems focused heavily on scaling model size and improving raw capabilities, this work shifted attention toward something equally important: <strong>alignment</strong>.</p>
<p>The paper explores how language models can be trained to better follow human instructions using reinforcement learning from human feedback (RLHF). Instead of optimizing only for next-token prediction, the model is further optimized to produce responses that humans actually prefer – responses that are more helpful, safer, and more aligned with user intent.</p>
<p>What makes this paper historically important is that it became the foundation for the modern ChatGPT alignment pipeline.</p>
<p>Many of the interaction patterns people now associate with ChatGPT (like instruction following, conversational behavior, refusal handling, and safer responses) can be traced directly back to the ideas introduced here.</p>
<p>Here’s the original paper again if you want to explore it directly: <a href="https://arxiv.org/pdf/2203.02155">Training language models to follow instructions with human feedback</a></p>
<p>And here’s a quick infographic of what we’ll cover throughout this review:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/6986f1fe-7ee5-4bc6-b144-44aad5d2bb3e.png" alt="AI Papers Quick Insights- InstructGPT" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<h2 id="heading-table-of-contents"><strong>Table of Contents:</strong></h2>
<ul>
<li><p><a href="#heading-executive-summary">Executive Summary</a></p>
</li>
<li><p><a href="#heading-the-core-problem">The Core Problem</a></p>
</li>
<li><p><a href="#heading-why-gpt-3-was-not-enough">Why GPT-3 Was Not Enough</a></p>
</li>
<li><p><a href="#heading-instructgpt-the-birth-of-alignment-centered-llms">InstructGPT: The Birth of Alignment-Centered LLMs</a></p>
</li>
<li><p><a href="#heading-rlhf-pipeline-how-instructgpt-learned-to-behave-like-an-assistant">RLHF Pipeline: How InstructGPT Learned to Behave Like an Assistant</a></p>
<ul>
<li><p><a href="#heading-stage-1-supervised-fine-tuning-sft">Stage 1 — Supervised Fine-Tuning (SFT)</a></p>
</li>
<li><p><a href="#heading-stage-2-reward-model-training">Stage 2 — Reward Model Training</a></p>
</li>
<li><p><a href="#heading-stage-3-ppo-reinforcement-learning">Stage 3 — PPO Reinforcement Learning</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-helpful-honest-harmless">Helpful, Honest, Harmless</a></p>
</li>
<li><p><a href="#heading-human-feedback-as-the-new-scaling-factor">Human Feedback as the New Scaling Factor</a></p>
</li>
<li><p><a href="#heading-why-chatgpt-exploded-globally">Why ChatGPT Exploded Globally</a></p>
</li>
<li><p><a href="#heading-chatgpt-as-an-interface-revolution">ChatGPT as an Interface Revolution</a></p>
</li>
<li><p><a href="#heading-benchmarks-and-results">Benchmarks and Results</a></p>
</li>
<li><p><a href="#heading-truthfulness-and-hallucinations">Truthfulness and Hallucinations</a></p>
</li>
<li><p><a href="#heading-safety-and-refusal-behavior">Safety and Refusal Behavior</a></p>
</li>
<li><p><a href="#heading-limitations">Limitations</a></p>
</li>
<li><p><a href="#heading-historical-importance">Historical Importance</a></p>
</li>
<li><p><a href="#heading-discussion-the-real-shift">Discussion: The Real Shift</a></p>
</li>
<li><p><a href="#heading-connection-to-gpt-4">Connection to GPT-4</a></p>
</li>
<li><p><a href="#heading-gpt-3-vs-instructgpt-vs-chatgpt-vs-gpt-4-key-differences">GPT-3 vs InstructGPT vs ChatGPT vs GPT-4: Key Differences</a></p>
</li>
<li><p><a href="#heading-from-gpt-1-to-gpt-4-a-timeline-of-modern-ai-systems-and-alignment-evolution">From GPT-1 to GPT-4: A Timeline of Modern AI Systems and Alignment Evolution</a></p>
</li>
<li><p><a href="#heading-final-insight">Final Insight</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this breakdown, it helps to already be familiar with a few foundational ideas.</p>
<p>Reading the previous reviews in this series will be especially helpful:</p>
<ul>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-improving-language-understanding-by-generative-pre-training-gpt-1/">AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)</a></p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-language-models-are-unsupervised-multitask-learners-gpt-2/">AI Paper Review: Language Models are Unsupervised Multitask Learners (GPT-2)</a></p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-language-models-are-few-shot-learners-gpt-3/">AI Paper Review: Language Models are Few-Shot Learners (GPT-3)</a></p>
</li>
</ul>
<p>Even though GPT-4 was released after InstructGPT, reading the GPT-4 review can still be helpful. It provides a broader view of how alignment techniques evolved and how they were combined with stronger reasoning and multimodal capabilities in later generations of GPT models.</p>
<ul>
<li><a href="https://www.freecodecamp.org/news/ai-paper-review-gpt-4-technical-report/">AI Paper Review: GPT-4 Technical Report (GPT-4)</a></li>
</ul>
<p>It also helps to have:</p>
<ul>
<li><p>A general understanding of natural language processing (NLP) and large language models</p>
</li>
<li><p>A high-level idea of Transformer-based autoregressive models</p>
</li>
<li><p>Familiarity with prompting, few-shot learning, and in-context learning</p>
</li>
<li><p>A basic understanding of reinforcement learning and human feedback systems</p>
</li>
<li><p>General machine learning concepts like training data, fine-tuning, scaling, and inference</p>
</li>
<li><p>Some familiarity with alignment, safety, and AI behavior control concepts</p>
</li>
</ul>
<p>You don't need to be an AI researcher to follow this article, though.</p>
<p>I’ll keep the explanations practical and intuitive, focusing more on understanding how InstructGPT changed modern AI systems rather than getting lost in dense mathematical details or academic terminology.</p>
<h2 id="heading-executive-summary">Executive Summary</h2>
<p>The paper <em>Training Language Models to Follow Instructions with Human Feedback</em> marks one of the biggest turning points in the history of modern AI systems. Instead of asking only how to make language models larger or smarter, OpenAI focused on a different question: how do we make these models actually helpful for real people?</p>
<p>The paper introduces <strong>InstructGPT</strong>, a version of GPT-3 fine-tuned to follow human instructions more accurately using a method called <strong>Reinforcement Learning from Human Feedback (RLHF)</strong>.</p>
<p>The core insight of the paper is simple but extremely important:</p>
<p>Bigger language models don't automatically become better assistants.</p>
<p>Even highly capable models like GPT-3 could still:</p>
<ul>
<li><p>ignore instructions</p>
</li>
<li><p>hallucinate facts</p>
</li>
<li><p>generate toxic or biased outputs</p>
</li>
<li><p>produce responses that were technically fluent but not actually useful to users</p>
</li>
</ul>
<p>To solve this problem, OpenAI built a multi-stage alignment pipeline: humans first demonstrate ideal answers, humans then rank model outputs, and finally the model learns from those preferences using reinforcement learning.</p>
<p>This changed the direction of modern AI development.</p>
<p>The paper shows that alignment and usability can matter more than raw model size itself. One of the most surprising findings was that the 1.3B InstructGPT model was often preferred by human evaluators over the original 175B GPT-3 model, despite being dramatically smaller.</p>
<p>The paper also demonstrates improvements in instruction following, truthfulness, toxicity reduction, conversational behavior, and general user preference.</p>
<p>Historically, this paper became the foundation behind modern conversational AI systems.</p>
<p>GPT-3 proved that language models could learn from prompts.</p>
<p>GPT-4 later proved that scaling and multimodal reasoning could unlock even stronger capabilities.</p>
<p>But InstructGPT showed something equally important: AI systems must be aligned with human intent to become truly usable products.</p>
<p>In many ways, this paper represents the transition from raw language modeling to aligned assistants, from capability scaling to behavior shaping, and from research demos to real-world conversational AI systems.</p>
<p>And that transition eventually led directly to ChatGPT.</p>
<h2 id="heading-the-core-problem">The Core Problem</h2>
<p>One of the most important ideas in this paper is that raw language modeling is not the same thing as building a useful assistant.</p>
<p>Before InstructGPT, models like GPT-3 were trained mainly with a simple objective: predict the next token in a sequence.</p>
<p>That objective made language models extremely powerful at generating fluent text, but it also created a major limitation. The model learned how to continue internet text, not necessarily how to help humans.</p>
<p>This became one of the defining realizations behind modern AI alignment research.</p>
<p>Despite its impressive capabilities, GPT-3 often struggled to behave like a reliable assistant. The model could produce fluent text, but it was not explicitly trained to follow user intent.</p>
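<p>To make the next-token objective concrete, here is a minimal sketch in plain Python. The function name and the probability values are hypothetical and chosen only for illustration. The loss rewards the model for assigning high probability to whatever text actually came next, and nothing in it measures whether that text is helpful.</p>

```python
import math

def next_token_loss(token_probs):
    """Average negative log-likelihood of the tokens that actually came next.

    token_probs: probability the model assigned to each correct next token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities for two different continuations.
confident = [0.9, 0.8, 0.95]   # the model expected this text
unsure = [0.2, 0.1, 0.3]       # the model did not expect this text

print(next_token_loss(confident))  # lower loss
print(next_token_loss(unsure))     # higher loss
```

<p>The loss only asks "how likely was this text?" A fluent continuation that ignores the user's instruction can still score well under this objective.</p>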
<p>Here are some examples that highlight the differences between GPT-3 and InstructGPT in how they respond to user prompts:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/22cfce35-8c0e-4560-9419-15c6e33123ce.png" alt="Comparison of GPT-3 and InstructGPT responses to the same prompts. GPT-3 often continues generating similar prompts instead of completing the requested task, while InstructGPT follows the instruction directly and produces the requested answer, demonstrating stronger instruction-following behavior." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Source: <a href="https://openai.com/index/instruction-following/"><strong>Aligning language models to follow instructions</strong></a></p>
<img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/cd366a10-f872-4468-bff3-64d05d0597d6.png" alt="Additional example comparing GPT-3 and InstructGPT responses to the same prompt" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Source: <a href="https://openai.com/index/instruction-following/"><strong>Aligning language models to follow instructions</strong></a></p>
<p>These examples reveal the central weakness of early GPT systems. GPT-3 often continued the pattern of the prompt rather than completing the requested task. InstructGPT, by contrast, responded directly to the user's instruction. The difference wasn't a matter of raw intelligence. It was a difference in training objectives.</p>
<p>GPT models were trained on massive internet-scale datasets where the goal was simply to predict what text comes next. As a result, the model optimized for plausibility, continuation, and pattern completion, not necessarily for truthfulness, safety, helpfulness, or alignment with human goals.</p>
<p>This created a major gap between language capability and useful assistant behavior.</p>
<p>For example, if a user asked a harmful, misleading, or nonsensical question, the model might still attempt to continue the pattern naturally instead of recognizing the issue. In many cases, the model behaved more like an internet text simulator than a reliable assistant.</p>
<p>The paper repeatedly emphasizes that scaling alone couldn't solve this problem.</p>
<p>Better behavior also required stronger instruction following, better alignment with human intent, improved safety behavior, greater truthfulness, and optimization around real user needs.</p>
<h2 id="heading-why-gpt-3-was-not-enough">Why GPT-3 Was Not Enough</h2>
<p>When GPT-3 was released, it felt like a massive leap forward in AI capabilities.</p>
<p>The model could perform few-shot learning, answer questions, summarize text, generate code, translate languages, and even solve certain reasoning tasks, all without traditional fine-tuning. For many researchers, it was the first time a language model started to feel genuinely general-purpose.</p>
<p>Yet using GPT-3 in practice was often less reliable than its benchmark performance suggested.</p>
<p>Getting good results often required careful prompt engineering. Small wording changes could completely change the quality of the response. Sometimes the model followed instructions well, and other times it ignored them entirely.</p>
<p>Users often found themselves rewriting prompts repeatedly to obtain the response they actually wanted.</p>
<p>This became the core motivation behind InstructGPT.</p>
<p>OpenAI responded by exploring ways to make model behavior more consistent, predictable, and useful for users.</p>
<h2 id="heading-instructgpt-the-birth-of-alignment-centered-llms">InstructGPT: The Birth of Alignment-Centered LLMs</h2>
<p>The release of InstructGPT marked one of the biggest shifts in the history of large language models.</p>
<p>Before InstructGPT, most advances in language models came from scaling data, compute, and model size.</p>
<p>With InstructGPT, the focus shifted toward alignment: building systems that could follow instructions more reliably and behave in ways users actually preferred.</p>
<p>This is where InstructGPT brought one of the most important ideas in modern AI systems to large language models at scale: Reinforcement Learning from Human Feedback (RLHF).</p>
<p>Instead of optimizing models only to predict internet text, OpenAI started optimizing models based on what humans actually preferred. Human labelers ranked model outputs, and those preferences became part of the training process itself.</p>
<p>This fundamentally changed the objective of language models.</p>
<p>Rather than optimizing solely for next-token prediction, the system was increasingly optimized to produce responses that humans judged to be helpful, safe, and aligned with their intentions.</p>
<p>That distinction may sound subtle, but it completely changed the direction of AI development.</p>
<p>InstructGPT combined instruction-following training with human preference optimization, creating a model whose behavior could be shaped directly through feedback rather than solely through pretraining.</p>
<p>The model was no longer trained only to imitate the internet. It was trained to behave more like an assistant.</p>
<h2 id="heading-rlhf-pipeline-how-instructgpt-learned-to-behave-like-an-assistant">RLHF Pipeline: How InstructGPT Learned to Behave Like an Assistant</h2>
<p>At the center of the InstructGPT paper is a training pipeline that completely changed how modern AI assistants are built.</p>
<p>RLHF was designed to build on traditional language-model pretraining rather than replace it.</p>
<p>The paper's central question was simple: instead of training models only on internet text, why not also train them on human preferences directly?</p>
<p>This led to the development of the RLHF pipeline: Reinforcement Learning from Human Feedback. This approach would later become a standard component of modern conversational AI systems.</p>
<p>The paper’s Figure 2 is especially important because it visualizes the entire alignment pipeline introduced by OpenAI. Rather than relying on a single training stage, the system uses multiple stages where human feedback gradually shapes model behavior.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/d1ccebd1-00b4-48ea-8bc7-e3953bc88fc6.png" alt="RLHF Training Pipeline for InstructGPT" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p><strong>Source:</strong> <em>Training Language Models to Follow Instructions with Human Feedback</em> (OpenAI, 2022).</p>
<p>As you can see in the image above, the process happens in three major stages.</p>
<h3 id="heading-stage-1-supervised-fine-tuning-sft">Stage 1 — Supervised Fine-Tuning (SFT)</h3>
<p>The first stage starts with human-written demonstrations.</p>
<p>Labelers are given prompts and asked to write ideal responses – the kinds of answers a helpful assistant should produce. These examples become the initial training dataset for the model.</p>
<p>At this stage, the model learns the basic patterns of assistant-style responses.</p>
<p>This is still traditional supervised learning, but the goal is different from standard language modeling. Instead of learning only from web text, the model now learns from examples of preferred assistant behavior.</p>
<p>This stage creates what the paper calls the Supervised Fine-Tuned model (SFT model).</p>
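<p>As a rough sketch, supervised fine-tuning reuses the same next-token loss, but applies it to prompt and demonstration pairs written by labelers. The example below assumes the loss is computed only over the demonstration tokens. That masking is a common practice rather than something this article confirms about the paper, and the function name and values are hypothetical.</p>

```python
import math

def sft_loss(token_probs, is_response_token):
    """Next-token loss averaged over the labeler's response tokens only.

    token_probs: probability the model gave each correct token.
    is_response_token: True where the token belongs to the demonstration.
    """
    losses = [
        -math.log(p)
        for p, keep in zip(token_probs, is_response_token)
        if keep
    ]
    return sum(losses) / len(losses)

# Hypothetical example: 2 prompt tokens followed by 3 demonstration tokens.
probs = [0.05, 0.10, 0.60, 0.70, 0.80]
mask = [False, False, True, True, True]

print(sft_loss(probs, mask))
```

<p>With this masking assumption, the model is graded on how well it reproduces the ideal answer, not on how well it predicts the prompt itself.</p>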
<p>And while this already improves behavior significantly, OpenAI realized something important: human preferences are more complex than simple “correct answers.”</p>
<p>There are often many possible responses to a prompt, but humans may strongly prefer some answers over others.</p>
<p>That leads to the next stage.</p>
<h3 id="heading-stage-2-reward-model-training">Stage 2 — Reward Model Training</h3>
<p>In the second stage, humans no longer write responses directly.</p>
<p>Instead, the model generates multiple answers for the same prompt, and human labelers rank them from best to worst.</p>
<p>For a given prompt, one response may be clearer, another more accurate, and another safer or more appropriate. Human labelers rank these alternatives according to their preferences.</p>
<p>The rankings are then used to train a separate neural network called the Reward Model (RM).</p>
<p>This model learns something extremely important: which outputs humans prefer.</p>
<p>In other words, the system converts human preferences into a trainable reward signal.</p>
<p>This becomes one of the biggest conceptual breakthroughs in the paper. Instead of manually programming behavior rules, OpenAI trains the model to approximate human judgment itself.</p>
<p>Given a prompt and a candidate response, the reward model outputs a single score that estimates how strongly a human labeler would prefer that response.</p>
<p>That reward signal becomes the foundation for the final training stage.</p>
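<p>One way to picture how rankings become a training signal is a pairwise ranking loss: every ranked list is broken into (preferred, rejected) pairs, and the reward model is penalized whenever it scores the rejected response close to or above the preferred one. The sketch below follows that general form in plain Python. The function name and the scores are hypothetical, and a real reward model would be a neural network producing these scores.</p>

```python
import math
from itertools import combinations

def pairwise_ranking_loss(scores_best_to_worst):
    """Average -log(sigmoid(r_preferred - r_rejected)) over all pairs.

    scores_best_to_worst: reward-model scores for the responses to one
    prompt, ordered the way the human labeler ranked them (best first).
    """
    # combinations() keeps list order, so the first item of each pair
    # is always the response the labeler ranked higher.
    pairs = list(combinations(scores_best_to_worst, 2))
    total = 0.0
    for preferred, rejected in pairs:
        margin = preferred - rejected
        total += math.log(1.0 + math.exp(-margin))  # equals -log(sigmoid(margin))
    return total / len(pairs)

# Hypothetical scores for three responses, listed in the human ranking order.
agrees_with_humans = [2.0, 0.5, -1.0]
disagrees_with_humans = [-1.0, 0.5, 2.0]

print(pairwise_ranking_loss(agrees_with_humans))     # smaller loss
print(pairwise_ranking_loss(disagrees_with_humans))  # larger loss
```

<p>Minimizing this loss pushes the reward model to give higher scores to the responses labelers ranked higher, which is what lets it stand in for human judgment in the next stage.</p>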
<h3 id="heading-stage-3-ppo-reinforcement-learning">Stage 3 — PPO Reinforcement Learning</h3>
<p>The final stage uses reinforcement learning to optimize the language model against the reward model.</p>
<p>More specifically, the paper uses PPO (Proximal Policy Optimization), a widely used policy-gradient reinforcement learning algorithm.</p>
<p>At this stage, the model generates responses, receives scores from the reward model, and gradually shifts toward responses that score higher.</p>
<p>The key innovation is that optimization now occurs against a learned representation of human preferences rather than only a language-modeling objective.</p>
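<p>The full PPO update is too long to show here, but the quantity being maximized can be sketched in a few lines. In the paper, the reward-model score is combined with a KL penalty that discourages the model from drifting too far from the SFT model. The function name, the <code>beta</code> value, and the numbers below are hypothetical and meant only to illustrate the idea.</p>

```python
def penalized_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """Reward-model score minus a penalty for drifting from the SFT model.

    rm_score: scalar score from the reward model for one response.
    logp_policy: log-probability of the response under the model being trained.
    logp_sft: log-probability of the same response under the frozen SFT model.
    beta: penalty strength (hypothetical value for illustration).
    """
    return rm_score - beta * (logp_policy - logp_sft)

# Same reward-model score, different amounts of drift from the SFT model.
no_drift = penalized_reward(1.0, logp_policy=-10.0, logp_sft=-10.0)
some_drift = penalized_reward(1.0, logp_policy=-5.0, logp_sft=-10.0)

print(no_drift)
print(some_drift)
```

<p>The penalty matters because the reward model is only an approximation of human preferences. Without it, the policy could learn to exploit quirks of the reward model instead of actually becoming more helpful.</p>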
<p>According to the paper, this RLHF pipeline significantly improved instruction following and user preference ratings while also reducing toxic and unsafe behavior.</p>
<p>And in many ways, this pipeline became the blueprint for the modern era of conversational AI systems.</p>
<h2 id="heading-helpful-honest-harmless">Helpful, Honest, Harmless</h2>
<p>The authors argue that evaluating language models requires more than measuring capability alone. Models should also be evaluated by how they behave around humans.</p>
<p>At the time, this represented a significant shift in how researchers evaluated language models.</p>
<p>That is why the paper repeatedly emphasizes a new alignment philosophy centered around three goals:</p>
<ul>
<li><p>Helpful</p>
</li>
<li><p>Honest</p>
</li>
<li><p>Harmless</p>
</li>
</ul>
<p>These ideas became the conceptual foundation behind modern alignment research and conversational AI systems.</p>
<h3 id="heading-helpful">Helpful</h3>
<p>The first goal is straightforward: the model should genuinely help the user accomplish what they want.</p>
<p>In practice, helpfulness means following instructions clearly, answering questions directly, providing relevant information, and adapting to the user's intent.</p>
<p>This may seem simple, but it fundamentally changes the training objective.</p>
<p>The model is no longer optimized only for linguistic fluency. It's optimized for usefulness.</p>
<h3 id="heading-honest">Honest</h3>
<p>The second goal is honesty.</p>
<p>One of the biggest problems with large language models is that they often produce convincing answers even when those answers are wrong. The models can hallucinate facts, invent references, or respond confidently despite uncertainty.</p>
<p>The paper recognizes that a useful assistant shouldn't merely sound intelligent. It should also behave truthfully and acknowledge uncertainty when necessary.</p>
<p>This is especially important because language models are optimized to generate plausible text, not verified truth.</p>
<p>As a result, earlier models sometimes prioritized sounding coherent over being accurate.</p>
<p>The alignment process introduced in InstructGPT attempts to reduce this behavior through human feedback and preference optimization. Human evaluators consistently prefer responses that are more accurate, transparent, and reliable, and those preferences gradually shape the model during RLHF training.</p>
<p>The paper doesn't claim that hallucinations disappear completely. Far from it. But it marks one of the first large-scale attempts to explicitly optimize language models for truthfulness and reliability rather than pure text generation quality.</p>
<h3 id="heading-harmless">Harmless</h3>
<p>The third goal is harmlessness.</p>
<p>Large language models trained on internet data inevitably absorb toxic, biased, unsafe, or harmful patterns from that data. Without alignment, models may generate dangerous instructions, offensive content, or manipulative behavior.</p>
<p>The paper directly addresses this concern and treats safety as a central part of model development.</p>
<p>Through RLHF and human preference ranking, the model learns to refuse certain harmful requests, avoid toxic generations, produce safer responses, and behave more responsibly during interaction.</p>
<p>This became one of the defining characteristics of modern conversational AI systems.</p>
<p>Instead of maximizing unrestricted generation, the system begins balancing usefulness, safety, and alignment with human values.</p>
<p>But the paper is also honest about limitations.</p>
<p>The authors acknowledge that harmful outputs, biases, and unsafe behavior can still appear. Alignment is imperfect, and human values themselves are complex and difficult to define universally.</p>
<p>Still, this paper marks the moment when safety and alignment became core engineering goals rather than secondary concerns.</p>
<p>Taken together, these three principles (helpful, honest, and harmless) became much more than training objectives. They became the philosophical foundation behind ChatGPT-era AI systems.</p>
<p>Earlier GPT papers mainly explored how to scale intelligence. But InstructGPT explored something deeper: how to make intelligence usable for humans.</p>
<h2 id="heading-human-feedback-as-the-new-scaling-factor">Human Feedback as the New Scaling Factor</h2>
<p>One of the most fascinating ideas behind the InstructGPT paper is that it quietly changed what “scaling” meant in modern AI.</p>
<p>For years, progress in language models was largely measured through scaling.</p>
<p>GPT-1 showed that pretraining works. GPT-2 showed that larger models develop stronger zero-shot behavior. GPT-3 pushed this idea even further by scaling to 175 billion parameters and demonstrating impressive few-shot learning abilities.</p>
<p>The implicit assumption was that bigger models would simply be better. And to some extent, that was true. Larger models became better at reasoning, code generation, language understanding, translation, and generalization.</p>
<p>But scale alone did not make models reliably follow instructions or behave the way users expected. That is where human feedback became central.</p>
<p>Instead of relying purely on internet-scale text, OpenAI introduced a training pipeline where human preferences directly shaped model behavior. Human labelers ranked responses, evaluated quality, and guided the system toward outputs people actually preferred.</p>
<p>In many ways, this created a completely new scaling dimension for AI systems:</p>
<ul>
<li><p>scaling human feedback</p>
</li>
<li><p>scaling preference learning</p>
</li>
<li><p>scaling alignment pipelines</p>
</li>
</ul>
<p>Historically, this shifted attention from model scale alone toward the quality of model behavior.</p>
<p>InstructGPT focused on scaling usability. And the results were surprisingly powerful.</p>
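<p>According to the paper, outputs from the 1.3B-parameter InstructGPT model were often preferred by human evaluators over outputs from the original 175B GPT-3 model, despite the aligned model having over 100 times fewer parameters.</p>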
<p>That finding changed how the industry thought about progress.</p>
<p>The result suggested that improving behavior could sometimes matter as much as increasing scale.</p>
<p>This is why RLHF became one of the defining ideas of the ChatGPT era.</p>
<p>After InstructGPT, modern AI systems were no longer evaluated only by benchmark scores, parameter counts, or scaling curves.</p>
<p>They were increasingly evaluated by usefulness, conversational quality, safety, reliability, and how well they interact with humans.</p>
<p>And that shift fundamentally changed the future direction of large language models.</p>
<h2 id="heading-why-chatgpt-exploded-globally">Why ChatGPT Exploded Globally</h2>
<p>When ChatGPT launched publicly, the reaction was immediate and unlike anything the AI industry had seen before.</p>
<p>Millions of people started using it within days. Developers, students, writers, researchers, businesses, and everyday users suddenly felt like they were interacting with AI in a completely different way.</p>
<p>What made this moment so important was that advanced AI capabilities finally became accessible to ordinary users. After all, the underlying language models were already extremely capable before ChatGPT existed. GPT-3 could generate essays, answer questions, write code, summarize text, and perform impressive few-shot learning tasks. GPT-4 later pushed reasoning and multimodal abilities even further.</p>
<p>The challenge was no longer whether language models could perform useful tasks, but whether people could interact with them naturally.</p>
<p>ChatGPT combined powerful language-model capabilities with RLHF-based alignment, conversational interaction, safer behavior, and a user-friendly chat interface.</p>
<p>Earlier systems often required significant prompt experimentation to achieve consistent results. Users had to carefully engineer prompts, retry questions, or work around strange outputs. The models could be brilliant one moment and confusing the next.</p>
<p>ChatGPT changed that experience dramatically.</p>
<p>Thanks to the alignment techniques introduced in the InstructGPT paper, the system became far better at following instructions, maintaining conversational flow, understanding intent, and responding in a way that felt cooperative rather than purely generative.</p>
<p>The conversational interface itself also mattered enormously.</p>
<p>Before ChatGPT, interacting with advanced AI systems often required APIs, coding knowledge, prompt experimentation, or technical understanding.</p>
<p>ChatGPT simplified everything into a familiar chat format: you simply typed naturally, and the system responded naturally.</p>
<p>That design decision may sound small, but historically it was transformative. It turned large language models from research tools into consumer products.</p>
<p>Although imperfect, the system felt substantially more reliable than earlier language-model interfaces.</p>
<p>The system was designed to communicate in ways that felt more natural and cooperative.</p>
<p>The breakthrough was not simply that the AI became smarter. The breakthrough was that the AI became usable.</p>
<p>And that usability is what transformed large language models from impressive research demonstrations into globally adopted AI assistants.</p>
<h2 id="heading-chatgpt-as-an-interface-revolution">ChatGPT as an Interface Revolution</h2>
<p>One of the most important things about ChatGPT is that it changed how humans interact with computers.</p>
<p>Before ChatGPT, powerful AI systems mostly lived behind APIs, research demos, developer tools, and complex prompting workflows.</p>
<p>Using advanced language models often required technical knowledge. Developers experimented with prompt engineering, API parameters, temperature settings, and carefully structured inputs just to get reliable outputs from the model.</p>
<p>Even GPT-3, despite being extremely powerful, still felt like a research system for many users. You had to learn how to “talk to the model.”</p>
<p>And in many cases, the interaction felt fragile. Slight changes in wording could completely change the quality of the response.</p>
<p>ChatGPT changed that dynamic almost overnight.</p>
<p>Instead of making users adapt to the AI, the AI became much better at adapting to humans.</p>
<p>Natural conversation became the interface.</p>
<p>For decades, human-computer interaction depended on commands, menus, search boxes, forms, programming languages, and specialized software interfaces.</p>
<p>ChatGPT introduced something different: you could simply explain what you wanted in plain language. And the system would usually understand.</p>
<p>This made AI feel accessible to people who had never written code, used APIs, or interacted with machine learning systems before.</p>
<p>In many ways, ChatGPT transformed prompting into a universal interface for computing. And that single shift affected nearly every digital field.</p>
<p>In education, students started using conversational AI to explain difficult concepts, summarize lessons, practice languages, and receive tutoring-style help.</p>
<p>In coding, developers began using AI systems for debugging, code generation, documentation, and learning new frameworks.</p>
<p>This eventually led to the rise of AI coding assistants integrated directly into development environments.</p>
<p>In writing and content creation, conversational AI became a brainstorming partner capable of drafting ideas, rewriting text, organizing articles, and helping people communicate more effectively.</p>
<p>Search behavior also started changing. Instead of searching through lists of links, users increasingly expected direct conversational answers. This fundamentally challenged traditional search-engine interaction models.</p>
<p>And across productivity tools, AI systems began acting less like software features and more like collaborative assistants.</p>
<p>This shift was enabled by advances in conversational AI and interaction design that made dialogue feel natural and useful.</p>
<p>The alignment techniques introduced by InstructGPT were an important part of making these conversational experiences practical.</p>
<p>Historically, this may become one of the most important consequences of the GPT era: earlier software required humans to learn interfaces. ChatGPT pushed computing toward interfaces that learn humans instead.</p>
<h2 id="heading-benchmarks-and-results">Benchmarks and Results</h2>
<p>We've already discussed how one of the biggest improvements didn't come from making the model larger. Instead, it came from making the model better aligned with humans.</p>
<p>This is one of the central findings of the entire paper, and it changed how many researchers thought about progress in large language models.</p>
<p>Before this work, the dominant belief was that scaling was the main path forward, with bigger models, more parameters, more compute, and more data. And GPT-3 seemed to confirm that idea. Larger models consistently showed stronger few-shot learning, reasoning, and generalization abilities.</p>
<p>But the InstructGPT paper introduced a different perspective. The researchers found that a relatively small 1.3B parameter InstructGPT model was often preferred by human evaluators over the original 175B GPT-3 model.</p>
<p>That result was extremely important. It suggested that alignment sometimes outperformed scale.</p>
<p>This became one of the defining insights of the ChatGPT era.</p>
<p>According to the paper, human evaluators consistently preferred InstructGPT responses because they were more helpful, more accurate, safer, and better aligned with what users were actually asking for.</p>
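<p>Human preference results like these are typically summarized as a win rate: the fraction of head-to-head comparisons in which evaluators preferred one model's response. The sketch below shows the arithmetic. The <code>win_rate</code> helper and the list of judgments are hypothetical and are not data from the paper.</p>
<pre><code class="language-python">def win_rate(judgments, model_name):
    """Fraction of head-to-head comparisons in which `model_name` was preferred."""
    wins = sum(1 for winner in judgments if winner == model_name)
    return wins / len(judgments)


# Hypothetical labeler judgments, one per prompt (illustrative only).
# Each entry names the model whose response the labeler preferred.
judgments = [
    "aligned", "aligned", "baseline", "aligned",
    "baseline", "aligned", "aligned", "baseline",
]

print(f"aligned model win rate: {win_rate(judgments, 'aligned'):.3f}")
</code></pre>
<p>A win rate above 0.5 means evaluators preferred that model more often than the baseline. It says nothing about how large the quality gap was on any single prompt.</p>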
<p>The improvements appeared across several important areas.</p>
<p>One major improvement was instruction following. Earlier GPT models often ignored instructions, drifted off-topic, or generated responses that sounded fluent but failed to solve the user’s actual task. InstructGPT behaved much more like a cooperative assistant and followed prompts more reliably.</p>
<p>The paper also reports improvements in truthfulness. Large language models are known for hallucinating information and confidently generating false statements. Through RLHF and preference optimization, InstructGPT reduced some of these behaviors and produced answers humans judged to be more truthful and reliable.</p>
<p>Another important improvement involved toxicity and harmful outputs. The researchers evaluated the system on toxicity benchmarks and found that aligned models generated fewer toxic or unsafe responses compared to earlier GPT systems.</p>
<p>What makes these findings historically important is that they changed the industry’s understanding of what “better AI” actually meant.</p>
<p>Before InstructGPT, improvement was mostly measured through benchmark scores, scaling curves, and parameter counts.</p>
<p>After InstructGPT, researchers increasingly focused on usability, safety, alignment, conversational quality, and human preference satisfaction.</p>
<p>This was a major shift in AI development philosophy.</p>
<h2 id="heading-truthfulness-and-hallucinations">Truthfulness and Hallucinations</h2>
<p>A major challenge for language models is that fluent responses are not always truthful.</p>
<p>This behavior is now commonly called hallucination.</p>
<p>Hallucinations can take many forms, including invented facts, fabricated references, incorrect explanations, or confident answers that lack factual support.</p>
<p>And because the responses are fluent and natural, the mistakes can sometimes look believable to users. The InstructGPT paper treats this as a serious issue rather than a minor flaw.</p>
<p>The authors note that language models are optimized for plausibility rather than verified truth. This is an important distinction: a language model can generate text that <em>looks</em> correct while still being inaccurate.</p>
<p>This is why the paper places particular emphasis on truthfulness and factual reliability.</p>
<p>Through RLHF and human preference optimization, InstructGPT was trained to produce answers humans judged to be more accurate and trustworthy. Human evaluators generally preferred responses that were more transparent about uncertainty and less likely to contain misleading information.</p>
<p>The paper also evaluates the model on truthfulness benchmarks such as <a href="https://arxiv.org/pdf/2109.07958">TruthfulQA</a>, where aligned models demonstrated improvements compared to earlier GPT systems.</p>
<p>But the paper is also careful not to overstate the results. Hallucinations didn't disappear. The aligned models could still make reasoning mistakes, generate false information, misunderstand prompts, or produce overconfident answers.</p>
<p>This nuance is extremely important: the paper doesn't claim that RLHF solved factuality or reasoning completely. Alignment improved the model's behavior, but it did not make the model perfect.</p>
<p>That distinction became increasingly important as ChatGPT and later GPT-4 systems reached millions of users worldwide.</p>
<p>The models became more useful, more truthful, and more aligned, but they still remained probabilistic language models rather than guaranteed fact engines.</p>
<p>In many ways, the InstructGPT paper marks the beginning of large-scale efforts to make AI systems not only intelligent, but also trustworthy enough for real-world human interaction.</p>
<h2 id="heading-safety-and-refusal-behavior">Safety and Refusal Behavior</h2>
<p>As language models became more powerful, researchers realized that safety was becoming a deployment problem.</p>
<p>A model that can generate human-like language at scale can also generate harmful instructions, produce toxic content, spread misinformation, or be manipulated into unsafe behavior.</p>
<p>The InstructGPT paper treats these risks very seriously and frames alignment as a necessary part of deploying large language models responsibly.</p>
<p>One of the biggest changes introduced through RLHF was safer refusal behavior.</p>
<p>Earlier GPT systems often attempted to answer almost anything. As a result, they often responded to unsafe prompts rather than recognizing when a refusal was appropriate.</p>
<p>InstructGPT begins changing that behavior.</p>
<p>Through human feedback and preference optimization, the model learns that some requests shouldn't be answered directly. Human labelers consistently prefer safer responses, refusals for harmful instructions, and outputs that avoid dangerous or toxic behavior.</p>
<p>This leads to systems that are better at refusing unsafe requests, avoiding toxic generations, and behaving more cautiously during interaction.</p>
<p>The paper also evaluates toxicity reduction using safety-related benchmarks and finds that aligned models generally produce fewer harmful outputs than earlier GPT systems.</p>
<p>Another important issue is harmful content filtering. Large language models absorb patterns from massive internet datasets, which inevitably contain biased language, misinformation, unsafe instructions, and toxic behavior.</p>
<p>Without alignment, models may reproduce these patterns surprisingly easily.</p>
<p>RLHF acts as a corrective layer on top of pretraining. Instead of only imitating internet text, the model is further optimized toward responses humans judge to be safer and more appropriate.</p>
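<p>One way to picture this corrective layer is the reward used during the reinforcement learning stage: the reward model's score minus a penalty for drifting too far from the model the tuning started from. The sketch below is a simplified, single-response version of that idea. The function name, the log-probabilities, and the <code>beta</code> value are hypothetical placeholders, not values from the paper.</p>
<pre><code class="language-python">def kl_penalized_reward(reward_score, logprob_policy, logprob_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference model."""
    # Log-ratio of the tuned policy to the reference model on one sampled
    # response. Averaged over the policy's samples, this estimates the KL
    # divergence between the two models.
    log_ratio = logprob_policy - logprob_reference
    return reward_score - beta * log_ratio


# Hypothetical numbers: the tuned policy assigns the response a higher
# log-probability than the reference model does, so part of the reward is
# given back as a penalty. beta is an arbitrary illustrative value.
print(kl_penalized_reward(1.0, logprob_policy=-10.0, logprob_reference=-12.0))

# When the policy has not moved away from the reference, nothing is deducted.
print(kl_penalized_reward(1.0, logprob_policy=-12.0, logprob_reference=-12.0))
</code></pre>
<p>The penalty keeps the tuned model close to what it learned before RLHF, so it can't chase reward-model scores at the cost of drifting into text the original model would consider very unlikely.</p>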
<p>Of course, the paper is also realistic about limitations.</p>
<p>The authors acknowledge that alignment is incomplete and that unsafe outputs can still occur. Models may still be vulnerable to adversarial prompting or attempts to bypass safety behavior (what later became widely known as jailbreaks).</p>
<p>This is an important nuance: alignment reduces risk, but it doesn't eliminate it.</p>
<p>And historically, this realization became incredibly important for the future of large-scale AI deployment.</p>
<p>In many ways, the InstructGPT paper marks the beginning of modern AI safety engineering inside flagship language models.</p>
<p>InstructGPT introduced large-scale behavior alignment. Then GPT-4 expanded this even further with red teaming, adversarial testing, deployment monitoring, and much larger safety evaluation pipelines.</p>
<p>So this paper becomes a direct bridge between early generative language models and the much more safety-focused AI systems that followed in the GPT-4 era.</p>
<h2 id="heading-limitations">Limitations</h2>
<p>One of the strongest aspects of the InstructGPT paper is that it doesn't present alignment as a solved problem.</p>
<p>Even though the results are impressive, the authors are careful and surprisingly honest about the system’s remaining weaknesses and risks.</p>
<p>This balance is important because the paper isn't arguing that RLHF creates perfect AI systems. The authors consistently frame alignment as a work in progress rather than a finished solution.</p>
<p>One major limitation is that the models still hallucinate.</p>
<p>The paper acknowledges that hallucinations remain a significant challenge despite alignment improvements.</p>
<p>RLHF improves truthfulness and instruction adherence, but it doesn't fundamentally solve the probabilistic nature of language models. The system still predicts likely text patterns rather than verifying objective truth.</p>
<p>Another important issue is <a href="https://arxiv.org/pdf/2209.13085">reward hacking</a>.</p>
<p>Because the model is optimized against a learned reward signal, it can sometimes discover shortcuts that maximize reward without genuinely improving reasoning or understanding. In other words, the model may learn behaviors that <em>look</em> aligned to evaluators while still hiding deeper problems underneath.</p>
<p>This is a common challenge in reinforcement learning systems more broadly.</p>
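<p>A toy example makes the failure mode easier to see. The reward function below is deliberately flawed and entirely hypothetical: it scores confident-sounding words and ignores correctness. Picking the highest-scoring candidate then selects the wrong answer. Real reward models are learned neural networks, so this only illustrates the pattern and does not represent how InstructGPT's reward model behaves.</p>
<pre><code class="language-python">def proxy_reward(response):
    """Hypothetical flawed reward: counts confident-sounding words only."""
    confident_words = {"definitely", "certainly", "clearly"}
    words = (word.strip(".,").lower() for word in response.split())
    return sum(word in confident_words for word in words)


candidates = [
    "The capital of Australia is Canberra.",                               # correct
    "Definitely, certainly, clearly the capital of Australia is Sydney.",  # wrong
]

# Selecting by the proxy reward picks the confident but incorrect answer.
best = max(candidates, key=proxy_reward)
for text in candidates:
    print(proxy_reward(text), text)
print("selected:", best)
</code></pre>
<p>The optimizer did exactly what the reward asked for. The problem is that the reward only approximates what people actually want, and a strong enough optimizer will find the gap.</p>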
<p>The paper also hints at a problem that later became widely discussed in ChatGPT-era systems: <a href="https://arxiv.org/pdf/2406.11717">over-refusal</a> and <a href="https://arxiv.org/pdf/2310.13548">sycophancy</a>.</p>
<p>Sometimes aligned models become too cautious and refuse harmless requests unnecessarily. In other cases, models may become overly agreeable, telling users what they appear to want to hear instead of providing more balanced or truthful responses.</p>
<p>This creates a difficult tension between safety, helpfulness, and honesty.</p>
<p>Another major limitation is bias.</p>
<p>Since these systems are trained on massive internet datasets and further shaped through human labeling, they inevitably inherit biases from both sources. The paper explicitly acknowledges that alignment doesn't remove all harmful or biased behavior.</p>
<p>And perhaps most importantly, the paper emphasizes that RLHF aligns models to labeler preferences, not to universal human values. This is a very important nuance.</p>
<p>The system learns from the judgments of specific human annotators operating within specific cultural and organizational contexts. That means alignment itself is subjective and imperfect.</p>
<p>There is no single universally agreed definition of helpfulness, fairness, safety, or acceptable behavior.</p>
<p>The paper discusses these concerns carefully and recognizes that human feedback introduces its own limitations and assumptions.</p>
<p>The alignment itself is also fragile. Even aligned systems can sometimes be manipulated through adversarial prompting or jailbreak-style attacks that bypass safety behavior. This later became one of the defining challenges of ChatGPT and GPT-4 deployment.</p>
<p>And finally, there's the practical issue of scale.</p>
<p>RLHF requires large amounts of human labeling, ranking, evaluation, and monitoring. Building these alignment pipelines is expensive, time-consuming, and operationally complex. Unlike raw pretraining data scraped automatically from the internet, human feedback doesn't scale nearly as easily.</p>
<p>In many ways, the paper reveals an important truth about modern AI systems: making models intelligent is difficult. But making them reliably aligned with humans may be even harder.</p>
<h2 id="heading-historical-importance">Historical Importance</h2>
<p>Looking back now, it's difficult to overstate how important the InstructGPT paper became for the entire AI industry.</p>
<p>Earlier GPT papers focused mostly on one central question: How do we make language models more capable?</p>
<p>That era was largely driven by larger datasets, larger parameter counts, scaling laws, and benchmark performance.</p>
<p>The models became increasingly impressive at generating text, solving tasks, and demonstrating emergent abilities. But they still behaved primarily like prediction engines trained to continue internet text.</p>
<p>InstructGPT changed the focus completely. For the first time, large-scale AI development began shifting from model-centric AI to interaction-centric AI.</p>
<p>This was a major philosophical transition: the industry realized that users didn't only care about raw intelligence, benchmark scores, or parameter counts.</p>
<p>They cared about usability, conversational quality, safety, trust, and whether the system could actually help them effectively.</p>
<p>This is why ChatGPT felt so different to the public. The underlying language model capabilities were important, but the real breakthrough came from how those capabilities were shaped into a usable human experience.</p>
<p>The interface became conversational. The system became more cooperative. The AI became more aligned with user intent.</p>
<p>That shift fundamentally changed public perception of artificial intelligence.</p>
<p>Before ChatGPT, most people saw AI as research software, technical demos, or specialized tools for experts.</p>
<p>After ChatGPT, millions of people started interacting with AI systems conversationally on a daily basis.</p>
<p>And that changed everything.</p>
<p>Earlier GPT papers focused mainly on discovering what scaling could achieve. InstructGPT introduced a different challenge: How do we safely deploy these systems in the real world?</p>
<p>That shift helped create entirely new areas of research and engineering, including RLHF pipelines, safety tuning, refusal behavior, red teaming, adversarial testing, policy frameworks, and large-scale human-feedback infrastructure.</p>
<p>In many ways, the ChatGPT era began the moment researchers realized that building powerful models was only part of the problem.</p>
<p>The harder challenge was making those systems reliable enough for human interaction at global scale.</p>
<p>It also helps explain why later systems placed much greater emphasis on safety, alignment, deployment practices, and real-world reliability.</p>
<p>The industry was no longer building language models only for research papers. It was building AI systems intended to operate in the real world. And the InstructGPT paper became one of the clearest turning points in that transformation.</p>
<h2 id="heading-discussion-the-real-shift">Discussion: The Real Shift</h2>
<p>The transition from GPT-3 to ChatGPT represents something much deeper than a simple improvement in model performance.</p>
<p>It changed the central question driving the entire AI industry.</p>
<p>During the GPT-3 era, the big question was, “Can language models learn tasks directly from prompts?”</p>
<p>That was the breakthrough introduced by GPT-3.</p>
<p>Research attention shifted toward scaling and emergent capabilities.</p>
<p>But the ChatGPT era introduced a completely different challenge: the question was no longer simply “Can the model perform the task?” Instead, it became, “Can humans actually trust and use these systems every day?”</p>
<p>That shift changed everything.</p>
<p>Once millions of people began interacting with AI systems directly, raw intelligence alone was no longer sufficient. Users needed systems that were understandable, reliable, safe, conversational, and aligned with human expectations.</p>
<p>This is exactly why the InstructGPT paper became so historically important. It introduced the idea that large language models should not only optimize for capability, but also for human interaction quality.</p>
<p>In many ways, the industry moved from “How smart is the model?” to “How usable is the model?”</p>
<p>And that transition fundamentally changed AI development.</p>
<p>After ChatGPT, success was no longer measured only by benchmark scores, parameter counts, or scaling curves.</p>
<p>It was increasingly measured by alignment, conversational quality, safety, and real-world usability.</p>
<p>This also explains why alignment research suddenly became central to modern AI systems.</p>
<p>GPT-3 showed that models could learn from prompts. ChatGPT showed that humans needed models that could cooperate.</p>
<p>That was the real shift.</p>
<p>And it may ultimately become one of the most important turning points in the history of artificial intelligence.</p>
<h2 id="heading-connection-to-gpt-4">Connection to GPT-4</h2>
<p>One of the most important things to understand about GPT-4 is that it didn't appear out of nowhere.</p>
<p>It was built on top of the alignment ideas introduced by InstructGPT and refined through the large-scale deployment experience of ChatGPT.</p>
<p>GPT-4 is often discussed in terms of its reasoning, multimodal abilities, and benchmark performance.</p>
<p>But beneath all of those improvements is something equally important: the alignment pipeline.</p>
<p>Without the work introduced in the InstructGPT paper, GPT-4 would likely feel far less usable as a real-world assistant.</p>
<p>That distinction matters enormously.</p>
<p>Many of GPT-4's alignment techniques can be traced back to ideas introduced by InstructGPT, including RLHF, instruction tuning, conversational alignment, safer refusal behavior, and human preference optimization.</p>
<p>ChatGPT then became the large-scale real-world testing ground for these ideas.</p>
<p>Millions of user interactions exposed weaknesses ranging from hallucinations and jailbreak attempts to broader safety and usability issues.</p>
<p>Those deployment lessons became incredibly valuable.</p>
<p>By the time GPT-4 arrived, OpenAI was no longer simply training a larger language model. It was building a large-scale aligned conversational system shaped by RLHF pipelines, human feedback, safety engineering, adversarial testing, and real-world user interaction.</p>
<p>This is why GPT-4 feels fundamentally different from earlier GPT models.</p>
<p>In many ways, GPT-4 represents the convergence of two major ideas: scaling capability and scaling alignment.</p>
<ul>
<li><p>GPT-3 proved that language models could learn tasks from prompts.</p>
</li>
<li><p>InstructGPT proved that models could be shaped through human feedback.</p>
</li>
<li><p>ChatGPT proved that aligned conversational AI could work at global scale.</p>
</li>
<li><p>GPT-4 combined all of those ideas into a much more capable multimodal system.</p>
</li>
</ul>
<p>That historical progression is important because it shows that modern AI systems aren't built through scaling alone. They're built through the combination of intelligence, alignment, interaction design, and deployment experience.</p>
<p>And the InstructGPT paper became one of the key foundations that made GPT-4 possible.</p>
<h2 id="heading-gpt-3-vs-instructgpt-vs-chatgpt-vs-gpt-4-key-differences">GPT-3 vs InstructGPT vs ChatGPT vs GPT-4: Key Differences</h2>
<p>By this point, we've discussed GPT-3, InstructGPT, ChatGPT, and GPT-4 individually. But it can be helpful to see them side by side.</p>
<p>Although these systems are closely related, each one introduced a different shift in the evolution of modern AI.</p>
<p>GPT-3 focused on capability through scale, InstructGPT focused on alignment through human feedback, ChatGPT focused on conversational usability, and GPT-4 combined these ideas with stronger reasoning and multimodal capabilities.</p>
<p>The table below summarizes the main differences between them and shows how each system built on the progress of the previous generation.</p>
<table style="min-width:125px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Aspect</strong></p></td><td><p><strong>GPT-3</strong></p></td><td><p><strong>InstructGPT</strong></p></td><td><p><strong>ChatGPT</strong></p></td><td><p><strong>GPT-4</strong></p></td></tr><tr><td><p><strong>Core Idea</strong></p></td><td><p>Large-scale language model enabling few-shot and in-context learning</p></td><td><p>Align language models with human instructions using RLHF</p></td><td><p>Conversational AI assistant optimized for dialogue and usability</p></td><td><p>Aligned multimodal foundation model with stronger reasoning and deployment maturity</p></td></tr><tr><td><p><strong>Main Goal</strong></p></td><td><p>Scale capability through massive pretraining</p></td><td><p>Improve instruction following and alignment</p></td><td><p>Deliver usable conversational AI for the public</p></td><td><p>Build reliable multimodal AI systems for real-world deployment</p></td></tr><tr><td><p><strong>Training Objective</strong></p></td><td><p>Predict next token from internet-scale text</p></td><td><p>Optimize outputs using human feedback and preference learning</p></td><td><p>Conversational interaction optimized through RLHF and dialogue tuning</p></td><td><p>Large-scale multimodal pretraining combined with RLHF, safety tuning, and deployment optimization</p></td></tr><tr><td><p><strong>Alignment Focus</strong></p></td><td><p>Minimal explicit alignment</p></td><td><p>Central focus of the paper</p></td><td><p>Strong conversational alignment</p></td><td><p>Advanced alignment and safety engineering</p></td></tr><tr><td><p><strong>RLHF Usage</strong></p></td><td><p>Not central</p></td><td><p>Core innovation of the system</p></td><td><p>Major component of interaction quality</p></td><td><p>Expanded and refined at larger scale</p></td></tr><tr><td><p><strong>Human 
Feedback Role</strong></p></td><td><p>Limited</p></td><td><p>Human rankings shape model behavior directly</p></td><td><p>Human feedback improves conversation flow and usability</p></td><td><p>Human feedback combined with large-scale safety evaluation and red teaming</p></td></tr><tr><td><p><strong>Interaction Style</strong></p></td><td><p>Prompt-based text generation</p></td><td><p>Instruction-following assistant</p></td><td><p>Natural multi-turn conversational assistant</p></td><td><p>Advanced conversational and multimodal assistant</p></td></tr><tr><td><p><strong>Prompting Style</strong></p></td><td><p>Zero-shot, one-shot, and few-shot prompting</p></td><td><p>Instruction prompts become more reliable</p></td><td><p>Conversational prompting becomes primary interface</p></td><td><p>Conversational and multimodal prompting</p></td></tr><tr><td><p><strong>Conversation Memory</strong></p></td><td><p>Limited contextual continuity</p></td><td><p>Better instruction adherence</p></td><td><p>Maintains dialogue flow across interactions</p></td><td><p>Stronger contextual reasoning across longer interactions</p></td></tr><tr><td><p><strong>Instruction Following</strong></p></td><td><p>Often inconsistent</p></td><td><p>Significantly improved</p></td><td><p>Strong conversational instruction following</p></td><td><p>More reliable and nuanced instruction handling</p></td></tr><tr><td><p><strong>Truthfulness</strong></p></td><td><p>Frequent hallucinations and overconfidence</p></td><td><p>Improved factual alignment through RLHF</p></td><td><p>More reliable but still hallucinates</p></td><td><p>Improved reasoning and factual performance, though hallucinations remain</p></td></tr><tr><td><p><strong>Safety Behavior</strong></p></td><td><p>Weak safety control</p></td><td><p>Safer refusal behavior introduced</p></td><td><p>More robust refusal and moderation behavior</p></td><td><p>Advanced safety pipelines and adversarial testing</p></td></tr><tr><td><p><strong>Harmful Output 
Handling</strong></p></td><td><p>Often continues unsafe prompts</p></td><td><p>Learns safer refusals from human feedback</p></td><td><p>Stronger refusal behavior in public deployment</p></td><td><p>More sophisticated alignment and safety systems</p></td></tr><tr><td><p><strong>Reasoning Ability</strong></p></td><td><p>Strong emergent reasoning for its time</p></td><td><p>Similar base capability but behaviorally improved</p></td><td><p>Improved practical reasoning in conversation</p></td><td><p>Major leap in reasoning and problem-solving</p></td></tr><tr><td><p><strong>Multimodal Capability</strong></p></td><td><p>Text only</p></td><td><p>Text only</p></td><td><p>Primarily text-based at launch</p></td><td><p>Text and image multimodal understanding</p></td></tr><tr><td><p><strong>Coding Ability</strong></p></td><td><p>Strong code generation emergence</p></td><td><p>Improved usability for coding tasks</p></td><td><p>Widely used as coding assistant</p></td><td><p>Much stronger coding and debugging performance</p></td></tr><tr><td><p><strong>Context Handling</strong></p></td><td><p>2048-token context window</p></td><td><p>Similar GPT-3-based context limits</p></td><td><p>Improved conversational memory handling</p></td><td><p>Much larger context capabilities</p></td></tr><tr><td><p><strong>Model Size</strong></p></td><td><p>175B parameters</p></td><td><p>Fine-tuned versions of GPT-3 models</p></td><td><p>Based on aligned GPT-3.5/GPT-4 systems</p></td><td><p>Undisclosed by OpenAI</p></td></tr><tr><td><p><strong>Training Data</strong></p></td><td><p>Massive internet-scale text datasets</p></td><td><p>GPT-3 pretraining plus human demonstrations and rankings</p></td><td><p>Large conversational interaction tuning datasets</p></td><td><p>Large-scale multimodal and internet-scale datasets</p></td></tr><tr><td><p><strong>Learning Paradigm</strong></p></td><td><p>In-context learning through scale</p></td><td><p>Human preference learning through RLHF</p></td><td><p>Conversational 
alignment at deployment scale</p></td><td><p>Combined capability scaling and alignment scaling</p></td></tr><tr><td><p><strong>Key Innovation</strong></p></td><td><p>Emergent few-shot learning</p></td><td><p>RLHF-based alignment pipeline</p></td><td><p>Conversational AI interface revolution</p></td><td><p>Multimodal aligned foundation systems</p></td></tr><tr><td><p><strong>User Experience</strong></p></td><td><p>Powerful but difficult to control</p></td><td><p>More cooperative and instruction-aware</p></td><td><p>Feels like talking to an assistant</p></td><td><p>More reliable, capable, and multimodal interaction</p></td></tr><tr><td><p><strong>Reliability</strong></p></td><td><p>Often unstable across prompts</p></td><td><p>More stable instruction behavior</p></td><td><p>Significantly improved usability</p></td><td><p>Stronger robustness and interaction quality</p></td></tr><tr><td><p><strong>Deployment Style</strong></p></td><td><p>Research and API usage</p></td><td><p>Alignment research milestone</p></td><td><p>Mass public deployment</p></td><td><p>Large-scale multimodal deployment</p></td></tr><tr><td><p><strong>Benchmark Emphasis</strong></p></td><td><p>Capability scaling and few-shot tasks</p></td><td><p>Human preference evaluations and alignment</p></td><td><p>Real-world conversational usability</p></td><td><p>Broad multimodal benchmark dominance</p></td></tr><tr><td><p><strong>Main Limitation</strong></p></td><td><p>Poor alignment and hallucinations</p></td><td><p>Alignment still incomplete and subjective</p></td><td><p>Hallucinations and jailbreak vulnerabilities</p></td><td><p>Hallucinations, safety tradeoffs, and lack of transparency</p></td></tr><tr><td><p><strong>Historical Importance</strong></p></td><td><p>Proved scaling produces emergent abilities</p></td><td><p>Introduced modern alignment-centered LLM training</p></td><td><p>Brought conversational AI to mainstream global use</p></td><td><p>Defined the era of aligned multimodal AI 
systems</p></td></tr><tr><td><p><strong>What Changed in AI</strong></p></td><td><p>Prompting became central</p></td><td><p>Alignment became a core research priority</p></td><td><p>AI became a mainstream consumer interface</p></td><td><p>AI became deployable multimodal infrastructure</p></td></tr><tr><td><p><strong>Legacy</strong></p></td><td><p>Foundation of prompt-driven AI</p></td><td><p>Foundation of ChatGPT alignment pipeline</p></td><td><p>Popularized conversational AI globally</p></td><td><p>Established modern multimodal AI ecosystem</p></td></tr></tbody></table>

<h2 id="heading-from-gpt-1-to-gpt-4-a-timeline-of-modern-ai-systems-and-alignment-evolution">From GPT-1 to GPT-4: A Timeline of Modern AI Systems and Alignment Evolution</h2>
<p>Before we wrap up, it's worth stepping back and looking at the bigger picture.</p>
<p>The InstructGPT paper didn't emerge in isolation. It was part of a much larger evolution that transformed GPT models from research-focused language models into the conversational AI systems we use today.</p>
<p>Each generation introduced a new idea that pushed the field forward.</p>
<p>GPT-1 introduced large-scale pretraining, GPT-2 demonstrated zero-shot capabilities, GPT-3 popularized prompting and in-context learning, and InstructGPT introduced alignment through human feedback. ChatGPT then brought these ideas to millions of users through a conversational interface, while GPT-4 combined alignment with stronger reasoning and multimodal capabilities.</p>
<p>The timeline below summarizes the key transitions that shaped the modern AI era.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/6e4cc89c-7772-41e4-b5dc-b61820e1521a.png" alt="From GPT-1 to GPT-4 A Timeline of Modern AI Systems and Alignment Evolution" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<table style="min-width:150px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Year</strong></p></td><td><p><strong>System</strong></p></td><td><p><strong>Main Transition</strong></p></td><td><p><strong>What Changed</strong></p></td><td><p><strong>Key Paper / Release</strong></p></td><td><p><strong>Historical Importance</strong></p></td></tr><tr><td><p><strong>2018</strong></p></td><td><p>GPT-1</p></td><td><p>Pretraining + Fine-Tuning Era</p></td><td><p>Introduced generative pretraining using Transformers before supervised fine-tuning</p></td><td><p><em>Improving Language Understanding by Generative Pre-Training</em></p></td><td><p>Started the modern large-scale NLP pretraining paradigm</p></td></tr><tr><td><p><strong>2019</strong></p></td><td><p>GPT-2</p></td><td><p>Zero-Shot Language Modeling Era</p></td><td><p>Showed that larger language models could perform multiple tasks without task-specific fine-tuning</p></td><td><p><em>Language Models are Unsupervised Multitask Learners</em></p></td><td><p>Shifted AI toward general-purpose generative models</p></td></tr><tr><td><p><strong>2020</strong></p></td><td><p>GPT-3</p></td><td><p>In-Context Learning Era</p></td><td><p>Demonstrated few-shot, one-shot, and zero-shot learning at massive scale using prompts alone</p></td><td><p><em>Language Models are Few-Shot Learners</em></p></td><td><p>Made prompting the central interface for AI systems</p></td></tr><tr><td><p><strong>March 2022</strong></p></td><td><p>InstructGPT</p></td><td><p>Alignment and RLHF Era</p></td><td><p>Introduced reinforcement learning from human feedback (RLHF) to align models with user intent</p></td><td><p><em>Training Language Models to Follow Instructions with Human Feedback</em></p></td><td><p>Shifted AI development from raw capability to alignment and 
usability</p></td></tr><tr><td><p><strong>Nov 2022</strong></p></td><td><p>GPT-3.5 / ChatGPT</p></td><td><p>Conversational AI Era</p></td><td><p>Combined GPT-3.5 with RLHF and chat-based interaction for public deployment</p></td><td><p>ChatGPT public release based on GPT-3.5 family</p></td><td><p>Turned LLMs into mainstream conversational assistants used globally</p></td></tr><tr><td><p><strong>2023</strong></p></td><td><p>GPT-4</p></td><td><p>Multimodal Aligned Foundation Model Era</p></td><td><p>Expanded aligned AI into multimodal reasoning across text and images with stronger reliability and safety systems</p></td><td><p>GPT-4 Technical Report</p></td><td><p>Established the modern era of deployable multimodal AI systems</p></td></tr><tr><td><p><strong>2023–Present</strong></p></td><td><p>GPT-4 + ChatGPT Ecosystem</p></td><td><p>AI Assistant Infrastructure Era</p></td><td><p>AI systems evolved into integrated assistants for coding, education, productivity, reasoning, and multimodal interaction</p></td><td><p>GPT-4 deployment ecosystem</p></td><td><p>Transitioned AI from research products into global infrastructure platforms</p></td></tr></tbody></table>

<h2 id="heading-final-insight">Final Insight</h2>
<p>When people look back at the history of modern AI, they often focus on the moments when models became larger, more powerful, or more capable. But the story of the GPT series is not just a story about scale. It is also a story about learning how to make that intelligence useful.</p>
<p>GPT-1 showed that language models could learn surprisingly rich representations from large amounts of text before being adapted to specific tasks.</p>
<p>GPT-2 expanded that idea and revealed that scale itself could unlock new behaviors.</p>
<p>GPT-3 pushed the field into entirely new territory, demonstrating that a single model could perform a wide variety of tasks simply by responding to prompts and examples.</p>
<p>For a moment, it seemed as though scaling might be the answer to everything.</p>
<p>Then InstructGPT arrived and exposed a different challenge.</p>
<p>The problem was no longer whether a model could generate text, answer questions, or complete tasks. Models were already becoming remarkably capable.</p>
<p>The real question was whether people could actually rely on them. Could they follow instructions consistently? Could they respond in ways users found helpful? Could they become something more than sophisticated prediction engines?</p>
<p>That was the breakthrough at the heart of InstructGPT.</p>
<p>Rather than focusing solely on making models smarter, the paper focused on making them behave better.</p>
<p>Human feedback became part of the training process itself.</p>
<p>Alignment moved from a research concern to a core design principle. For the first time, improving the relationship between humans and AI became just as important as improving the model's raw capabilities.</p>
<p>The impact of that shift extended far beyond a single paper.</p>
<p>It laid the groundwork for ChatGPT, which introduced millions of people to conversational AI. Suddenly, interacting with advanced language models no longer required APIs, research expertise, or carefully engineered prompts. People could simply ask questions, seek advice, explore ideas, or learn something new through natural conversation.</p>
<p>That change transformed AI from a research breakthrough into a widely used product.</p>
<p>GPT-4 would later build on this foundation, combining stronger reasoning and broader capabilities with the alignment techniques that began with InstructGPT. But by then, the industry had already learned an important lesson: capability alone was not enough. Intelligence had to be usable.</p>
<p>In hindsight, the lasting significance of the InstructGPT paper is not that it introduced a new training pipeline. It is that it helped redefine the goal of modern AI.</p>
<p>The challenge was no longer just building systems that could generate language.</p>
<p>It was building systems that people could work with, learn from, and trust.</p>
<p>And that may ultimately be the transition that defined this era of artificial intelligence.</p>
<h2 id="heading-resources"><strong>Resources:</strong></h2>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD/Pytorch-Collections/tree/main/GPT">Pytorch Projects for GPT series</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2203.02155">Training Language Models to Follow Instructions with Human Feedback</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2005.14165">Language Models are Few-Shot Learners</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2009.01325">Learning to Summarize from Human Feedback</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1909.08593">Fine-Tuning Language Models from Human Preferences</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1706.03741">Deep Reinforcement Learning from Human Preferences</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2008.02275">Aligning AI With Shared Human Values</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2107.05637">Asking for Help on Recursive Decomposition</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2112.09332">WebGPT: Browser-assisted Question-Answering with Human Feedback</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2212.08073">Constitutional AI: Harmlessness from AI Feedback</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2109.07958">TruthfulQA: Measuring How Models Mimic Human Falsehoods</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2009.11462">RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2104.08691">The Power of Scale for Parameter-Efficient Prompt Tuning</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2109.01652">Finetuned Language Models Are Zero-Shot Learners</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2110.08207">Multitask Prompted Training Enables Zero-Shot Task Generalization</a></p>
</li>
</ul>
<p><strong>Contact Me</strong></p>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD"><strong>Github</strong></a></p>
</li>
<li><p><a href="https://x.com/programmingoce"><strong>X</strong></a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/mohammed-abrah-6435a63ba/"><strong>Linkedin</strong></a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Why Your Deep Learning Model Isn't Learning: Diagnosing Data Problems in Medical Imaging ]]>
                </title>
                <description>
                    <![CDATA[ I built a clean, well-structured deep learning pipeline using MONAI (Medical Open Network for AI) on a public abdominal ultrasound dataset. The pipeline included: proper subject-grouped train/validat ]]>
                </description>
                <link>https://www.freecodecamp.org/news/why-your-deep-learning-model-isn-t-learning-data-problems-in-medical-imaging/</link>
                <guid isPermaLink="false">6a19aed9b55c6a731d1d7c06</guid>
                
                    <category>
                        <![CDATA[ Medical Imaging ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Healthcare AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Dataanalysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Lakshmi Mahabaleshwara ]]>
                </dc:creator>
                <pubDate>Fri, 29 May 2026 15:20:57 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/36be814e-4189-4905-9470-1cb5860e7124.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>I built a clean, well-structured deep learning pipeline using <a href="https://project-monai.github.io/">MONAI</a> (Medical Open Network for AI) on a public abdominal ultrasound dataset.</p>
<p>The pipeline included:</p>
<ul>
<li><p>proper subject-grouped train/validation splits</p>
</li>
<li><p>robust preprocessing</p>
</li>
<li><p>carefully decoded segmentation masks</p>
</li>
<li><p>sensible loss functions</p>
</li>
<li><p>consistent evaluation</p>
</li>
</ul>
<p>And the model still struggled to learn.</p>
<p>The interesting part isn't that the model underperformed. It's the diagnosis: a series of simple checks that traced the problem back to the dataset, not the model.</p>
<p>Those checks are useful far beyond medical imaging. They apply to almost any machine learning project.</p>
<p>If you're new to ML, this is a lesson worth carrying into every project: <strong>understand your data before you tune your model.</strong></p>
<p>I set out to build a medical image segmentation tutorial. I ended up learning a more valuable lesson: no amount of careful engineering can rescue a model from a dataset that can't support the task.</p>
<p>By the end of this article, you'll understand:</p>
<ul>
<li><p>How to evaluate whether a dataset can actually support your task</p>
</li>
<li><p>Why "the model isn't learning" is often a data problem</p>
</li>
<li><p>How to rule out engineering bugs before blaming the data</p>
</li>
<li><p>Practical diagnostics you can run in minutes</p>
</li>
<li><p>Why synthetic training data often struggles in real-world deployment</p>
</li>
<li><p>When to stop tuning and walk away from a dataset</p>
</li>
</ul>
<p>This is not a beginner introduction to deep learning – it assumes familiarity with concepts like UNet architectures and training loops. But the data-quality lessons apply broadly to many ML projects.</p>
<h3 id="heading-what-well-cover">What We'll Cover:</h3>
<ul>
<li><p><a href="#heading-the-dataset">The Dataset</a></p>
</li>
<li><p><a href="#heading-step-1-rule-out-the-pipeline-before-blaming-the-data">Step 1: Rule Out the Pipeline Before Blaming the Data</a></p>
<ul>
<li><p><a href="#heading-subject-grouped-splits">Subject-Grouped Splits</a></p>
</li>
<li><p><a href="#heading-decoding-masks-correctly">Decoding Masks Correctly</a></p>
</li>
<li><p><a href="#heading-loss-design-and-class-weighting">Loss Design and Class Weighting</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-2-the-model-still-struggled">Step 2: The Model Still Struggled</a></p>
</li>
<li><p><a href="#heading-step-3-interrogating-the-dataset">Step 3: Interrogating the Dataset</a></p>
<ul>
<li><p><a href="#heading-diagnostic-1-what-does-the-dataset-actually-contain">Diagnostic 1: What Does the Dataset Actually Contain?</a></p>
</li>
<li><p><a href="#heading-diagnostic-2-do-synthetic-and-real-images-look-similar">Diagnostic 2: Do Synthetic and Real Images Look Similar?</a></p>
</li>
<li><p><a href="#heading-diagnostic-3-can-the-gap-be-fixed-by-adding-real-data">Diagnostic 3: Can the Gap Be Fixed by Adding Real Data?</a></p>
</li>
</ul>
</li>
<li><p><a href="#heading-step-4-knowing-when-to-stop">Step 4: Knowing When to Stop</a></p>
</li>
<li><p><a href="#heading-a-practical-dataset-evaluation-checklist">A Practical Dataset Evaluation Checklist</a></p>
</li>
<li><p><a href="#heading-what-i-would-try-next">What I Would Try Next</a></p>
</li>
<li><p><a href="#heading-the-bigger-lesson">The Bigger Lesson</a></p>
</li>
</ul>
<h2 id="heading-the-dataset">The Dataset</h2>
<p>I used the <a href="https://www.kaggle.com/datasets/ignaciorlando/ussimandsegm">US Simulation &amp; Segmentation dataset</a>, a public collection of abdominal ultrasound images with organ segmentation labels from Kaggle.</p>
<p>It contains:</p>
<ul>
<li><p><strong>926 synthetic ultrasound images</strong> — generated by a ray-casting simulator from CT scans, with full organ annotations</p>
</li>
<li><p><strong>617 real ultrasound images</strong> — from an actual ultrasound scanner</p>
</li>
<li><p><strong>Labels for 8 organs</strong> — liver, kidney, gallbladder, pancreas, spleen, bones, vessels, and adrenals</p>
</li>
</ul>
<p>At first glance, the dataset looked ideal:</p>
<ul>
<li><p>more than 1,500 images</p>
</li>
<li><p>multiple organ classes</p>
</li>
<li><p>both synthetic and real ultrasound data</p>
</li>
</ul>
<p>Whether it actually supported the task was a different question.</p>
<h2 id="heading-step-1-rule-out-the-pipeline-before-blaming-the-data">Step 1: Rule Out the Pipeline Before Blaming the Data</h2>
<p>Ground rule: you should always rule out the pipeline before blaming the data. A model failing on buggy code looks exactly like a model failing on bad data. The engineering needs to be trustworthy.</p>
<h3 id="heading-subject-grouped-splits">Subject-Grouped Splits</h3>
<p>A common mistake in medical imaging is randomly splitting images into train and test sets.</p>
<p>That approach is problematic because many frames come from the same patient. Those frames share anatomy, scanner settings, and noise patterns.</p>
<p>If frames from the same patient appear in both the train and test sets, the model can partially memorize patient-specific patterns. Test scores look artificially good, even though the model may fail on truly unseen patients.</p>
<p>This is called <strong>subject leakage</strong>.</p>
<p>The fix is to split by patient instead of by image:</p>
<pre><code class="language-python">from sklearn.model_selection import GroupShuffleSplit

def assign_splits(manifest, val_fraction=0.15, seed=42):
    train_data = manifest[manifest["orig_split"] == "train"]
    groups = train_data["subject_id"].values

    gss = GroupShuffleSplit(n_splits=1, test_size=val_fraction, random_state=seed)
    train_idx, val_idx = next(gss.split(X=train_data, y=None, groups=groups))

    train_subjects = set(train_data.iloc[train_idx]["subject_id"].unique())
    val_subjects = set(train_data.iloc[val_idx]["subject_id"].unique())

    # Crash loudly if leakage ever sneaks in
    assert train_subjects.isdisjoint(val_subjects), "Subject leak detected!"
    return train_subjects, val_subjects
</code></pre>
<p><strong>That assertion matters.</strong> If the split logic ever breaks, the pipeline fails loudly instead of silently producing misleading metrics.</p>
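<p>The same guarantee can be re-checked on any saved split, independently of how the split was produced. Here is a minimal sketch, assuming a hypothetical list of <code>(image_name, subject_id, split)</code> rows. The helper name and the example rows are illustrative and do not come from the dataset:</p>
<pre><code class="language-python">from collections import defaultdict

def find_leaky_subjects(rows):
    """Return subject IDs that appear in more than one split."""
    splits_by_subject = defaultdict(set)
    for _image_name, subject_id, split in rows:
        splits_by_subject[subject_id].add(split)
    return sorted(s for s, splits in splits_by_subject.items() if len(splits) != 1)

# Hypothetical manifest rows: (image_name, subject_id, split)
rows = [
    ("img_001.png", "a", "train"),
    ("img_002.png", "a", "train"),
    ("img_003.png", "b", "val"),
    ("img_004.png", "c", "train"),
    ("img_005.png", "c", "val"),  # subject "c" appears in both splits
]

leaky = find_leaky_subjects(rows)
print(leaky)  # ['c']
</code></pre>
<p>An empty result means every subject lives in exactly one split.</p>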
<h3 id="heading-decoding-masks-correctly">Decoding Masks Correctly</h3>
<p>The dataset stores labels as color-coded masks. Each organ corresponds to a different RGB color.</p>
<p>Training requires converting those colors into integer class labels.</p>
<p>A naïve implementation uses exact color matching, but resizing operations can slightly alter colors at mask boundaries.</p>
<p>A more robust approach maps each pixel to its nearest palette color:</p>
<pre><code class="language-python">import numpy as np

PALETTE = np.array([
    [0, 0, 0],
    [100, 0, 100],
    [255, 255, 255],
    [0, 255, 0],
    [255, 255, 0],
    [0, 0, 255],
    [255, 0, 0],
    [255, 0, 255],
    [0, 255, 255],
], dtype=np.int32)

def decode_mask(mask_rgb):
    h, w = mask_rgb.shape[:2]
    flat = mask_rgb.reshape(-1, 3).astype(np.int32)
    d2 = (
        (flat[:, None, :] - PALETTE[None, :, :]) ** 2
    ).sum(-1)
    classes = d2.argmin(axis=1).astype(np.uint8)
    return classes.reshape(h, w)
</code></pre>
<p>Before training, it’s worth visually checking a few decoded masks against the original images. This catches issues like incorrect palettes, RGB/BGR channel swaps, or resizing artifacts that silently corrupt labels.</p>
<p>These bugs rarely throw errors. Instead, the model simply learns poorly. And “<em>trained on wrong labels</em>” looks exactly like “<em>the model can’t learn the data.</em>”</p>
<p>Verifying masks early removes that uncertainty.</p>
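<p>Alongside the visual check, a quick round-trip test can confirm that the decoder tolerates small color drift. The sketch below repeats <code>PALETTE</code> and <code>decode_mask</code> from above so it runs on its own. The random class map and the noise of up to 10 intensity levels per channel are assumptions chosen for illustration, not values measured from the dataset:</p>
<pre><code class="language-python">import numpy as np

PALETTE = np.array([
    [0, 0, 0], [100, 0, 100], [255, 255, 255],
    [0, 255, 0], [255, 255, 0], [0, 0, 255],
    [255, 0, 0], [255, 0, 255], [0, 255, 255],
], dtype=np.int32)

def decode_mask(mask_rgb):
    """Map each pixel to the index of its nearest palette color."""
    flat = mask_rgb.reshape(-1, 3).astype(np.int32)
    d2 = ((flat[:, None, :] - PALETTE[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1).astype(np.uint8).reshape(mask_rgb.shape[:2])

rng = np.random.default_rng(0)
true_classes = rng.integers(0, len(PALETTE), size=(32, 32)).astype(np.uint8)

# Encode classes as RGB, then perturb colors slightly to mimic interpolation drift
clean_rgb = PALETTE[true_classes]
noise = rng.integers(-10, 11, size=clean_rgb.shape)
noisy_rgb = np.clip(clean_rgb + noise, 0, 255).astype(np.uint8)

# Fraction of pixels that exact color matching could still decode
exact_match = (noisy_rgb[:, :, None, :] == PALETTE[None, None, :, :]).all(-1).any(-1)
round_trip_ok = np.array_equal(decode_mask(noisy_rgb), true_classes)

print("pixels with an exact palette color:", exact_match.mean())
print("nearest-color decoding recovers all labels:", round_trip_ok)
</code></pre>
<p>Because the palette colors are far apart, a perturbation this small can't move a pixel closer to the wrong class, so the round trip should recover every label even where exact matching fails.</p>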
<h3 id="heading-loss-design-and-class-weighting">Loss Design and Class Weighting</h3>
<p>For training, I used standard MONAI segmentation losses. The goal wasn’t to aggressively maximize performance, but to establish a stable and trustworthy baseline.</p>
<p>The training curves below show that the model optimized normally: the loss decreased consistently, and the validation Dice stabilized rather than diverging. This helped rule out optimization instability as the primary cause of poor final performance.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/841346d4-d3df-48a9-bc4d-31a5dd0d9bb0.png" alt="Two training curves from a MONAI liver segmentation experiment. The left plot shows training loss steadily decreasing across 50 epochs, while the right plot shows validation Dice scores stabilizing around 0.55–0.60 after initial fluctuations, indicating stable optimization but limited segmentation performance." style="display:block;margin:0 auto" width="1594" height="448" loading="lazy">

<p>Three choices were deliberate:</p>
<ul>
<li><p><strong>Dice + Cross-Entropy combined:</strong> Cross-entropy keeps learning stable early on, while Dice directly rewards good region overlap. Together they balance each other.</p>
</li>
<li><p><code>include_background=False</code> <strong>for binary segmentation:</strong> In a single-organ task, background can be 85–90% of the pixels. Counting it in the loss drowns out the signal for the organ you actually care about, so it's better left out.</p>
</li>
<li><p><strong>Class weighting for multi-class segmentation:</strong> With organs of very different sizes, an unweighted loss lets the model ignore the small, rare ones and still score well. Weighting rare-class mistakes more heavily pushes back against that.</p>
</li>
</ul>
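<p>One common way to derive class weights is inverse-frequency weighting, where each weight is proportional to the reciprocal of that class's pixel share. The sketch below shows that heuristic only. It isn't necessarily the weighting used in this pipeline, and the pixel counts are made-up numbers for illustration:</p>
<pre><code class="language-python">import numpy as np

def inverse_frequency_weights(pixel_counts):
    """Weights proportional to 1 / class frequency, normalized to mean 1.

    Assumes every class has at least one pixel (no zero counts).
    """
    counts = np.asarray(pixel_counts, dtype=np.float64)
    freqs = counts / counts.sum()
    weights = 1.0 / freqs
    return weights / weights.mean()

# Hypothetical pixel counts for three classes: large, medium, and small organ
pixel_counts = [900_000, 90_000, 10_000]
weights = inverse_frequency_weights(pixel_counts)
print(weights.round(3))
</code></pre>
<p>The rarest class ends up with the largest weight. A weight vector like this could then be passed to a weighted loss, for example the <code>weight</code> argument of PyTorch's <code>CrossEntropyLoss</code>.</p>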
<h2 id="heading-step-2-the-model-still-struggled">Step 2: The Model Still Struggled</h2>
<p>The first experiment focused on liver segmentation — the simplest single-organ task in the dataset.</p>
<table>
<thead>
<tr>
<th>Test set</th>
<th>Liver Dice</th>
</tr>
</thead>
<tbody><tr>
<td>Synthetic test set</td>
<td>~0.68</td>
</tr>
<tr>
<td>Real ultrasound test set</td>
<td>~0.48</td>
</tr>
</tbody></table>
<p>Dice scores range from 0 (no overlap) to 1 (perfect overlap).</p>
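<p>For reference, the Dice score for two binary masks A and B is 2|A ∩ B| / (|A| + |B|). Here is a minimal NumPy sketch on toy masks. The arrays are illustrative, and returning 1.0 when both masks are empty is a convention assumed here:</p>
<pre><code class="language-python">import numpy as np

def dice_score(pred, target):
    """Dice coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * np.logical_and(pred, target).sum() / denom

target = np.zeros((4, 4), dtype=bool)
target[1:3, 1:3] = True  # 4 foreground pixels

pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 2:4] = True    # 4 pixels, 2 of them overlapping the target

print(dice_score(pred, target))    # 0.5
print(dice_score(target, target))  # 1.0
</code></pre>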
<p>Qualitatively, the predictions often captured the rough liver region but were imprecise at boundaries and inconsistent across real scans.</p>
<p>Especially important:</p>
<ul>
<li><p>the model struggled even on synthetic in-domain data</p>
</li>
<li><p>performance dropped further on real ultrasound images</p>
</li>
</ul>
<p>At this point, two explanations were possible:</p>
<ol>
<li><p>the model or pipeline was flawed</p>
</li>
<li><p>the dataset itself was limiting performance</p>
</li>
</ol>
<p>Because the engineering had been carefully validated, the second possibility became worth investigating seriously.</p>
<p>That's where the real lesson began.</p>
<h2 id="heading-step-3-interrogating-the-dataset">Step 3: Interrogating the Dataset</h2>
<p>Rather than endlessly tuning the model, the productive move is to turn the diagnostic lens on the dataset.</p>
<p>Three simple checks revealed the real problem. None required retraining or expensive experiments.</p>
<h3 id="heading-diagnostic-1-what-does-the-dataset-actually-contain">Diagnostic 1: What Does the Dataset Actually Contain?</h3>
<p>The first step was simply plotting the dataset composition.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/d2855b12-b416-4a76-b743-971bf4389628.png" alt="Bar chart showing the composition of the ultrasound segmentation dataset. The dataset contains 926 labeled synthetic ultrasound images, 60 labeled real ultrasound images, and 557 unlabeled real ultrasound images, for a total of 1,543 images. Labeled real data represents only 3.9% of the dataset." style="display:block;margin:0 auto" width="1574" height="932" loading="lazy">

<ul>
<li><p><strong>926 labeled synthetic images</strong> (the bulk of training data)</p>
</li>
<li><p><strong>Only 60 labeled real images</strong> — less than 4% of the dataset</p>
</li>
<li><p><strong>557 unlabeled real images</strong> — real data exists, but without labels it can't be used for supervised training</p>
</li>
</ul>
<p>This immediately changed the interpretation of the dataset.</p>
<p>Although the dataset contains many real ultrasound scans, almost all labeled training data is synthetic.</p>
<p>The model is effectively trained on synthetic ultrasound and expected to generalize to real ultrasound.</p>
<p>That's a difficult transfer problem from the start.</p>
<p>The limitation is simple: the real images mostly don't have labels, so supervised training has very little real-world data to learn from.</p>
<p><strong>Lesson:</strong> Before training anything, chart the dataset composition. A headline image count can be misleading. "1,500 images" sounds large until you discover that only a tiny fraction are labeled examples from the target domain.</p>
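<p>A composition check like this needs only a few lines. The sketch below uses the counts reported above. The <code>(domain, is_labeled)</code> record layout is a hypothetical manifest format, not the dataset's actual file structure:</p>
<pre><code class="language-python">from collections import Counter

# Hypothetical manifest: one (domain, is_labeled) record per image,
# built here from the counts reported above.
records = (
    [("synthetic", True)] * 926
    + [("real", True)] * 60
    + [("real", False)] * 557
)

composition = Counter(records)
total = sum(composition.values())

for (domain, is_labeled), count in composition.most_common():
    label = "labeled" if is_labeled else "unlabeled"
    print(f"{domain:10s} {label:10s} {count:5d}  {100 * count / total:5.1f}%")

labeled_real_share = composition[("real", True)] / total
print(f"labeled real share: {labeled_real_share:.1%}")  # 3.9%
</code></pre>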
<h3 id="heading-diagnostic-2-do-synthetic-and-real-images-look-similar">Diagnostic 2: Do Synthetic and Real Images Look Similar?</h3>
<p>The next question was whether the synthetic and real ultrasound images actually followed similar visual distributions.</p>
<p>Plotting intensity histograms showed a clear mismatch.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69fd77e89f93a850a46d376f/baac5168-292e-45f8-ab9c-fd468dc63b46.png" alt="Histogram comparing pixel intensity distributions between synthetic and real ultrasound images. Synthetic images cluster heavily around lower intensity values, while real ultrasound images show a broader mid-range distribution. The figure also reports summary statistics including mean intensity, standard deviation, and percentile ranges for both datasets." style="display:block;margin:0 auto" width="1705" height="951" loading="lazy">

<ul>
<li><p>synthetic images clustered heavily near darker intensities</p>
</li>
<li><p>real ultrasound images had broader mid-range intensity distributions</p>
</li>
</ul>
<p>The synthetic simulator captured anatomical geometry reasonably well, but it didn't reproduce the texture and noise characteristics of real ultrasound:</p>
<ul>
<li><p>speckle patterns</p>
</li>
<li><p>intensity falloff</p>
</li>
<li><p>scanner-specific artifacts</p>
</li>
</ul>
<p>This is the classic <strong>synthetic-to-real domain gap.</strong></p>
<p>The model learned features tuned to synthetic images and then encountered a substantially different distribution during evaluation. Poor transfer performance became expected rather than surprising.</p>
<p><strong>Lesson:</strong> Whenever training and deployment happen on different domains — synthetic → real, scanner A → scanner B, hospital A → hospital B — measure the distribution shift directly. Simple histogram comparisons can reveal major problems in minutes.</p>
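<p>One simple way to put a number on the shift is to compare normalized intensity histograms. The sketch below uses randomly generated stand-in pixel values: a dark-skewed sample and a mid-range sample, both assumptions for illustration rather than the actual ultrasound data. It reports the total variation distance between the two histograms, which is 0 for identical distributions and 1 for distributions that don't overlap at all:</p>
<pre><code class="language-python">import numpy as np

def intensity_histogram(pixels, bins=32):
    """Normalized histogram of 8-bit intensities (sums to 1)."""
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    return hist / hist.sum()

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
# Stand-in data only: dark-skewed "synthetic" vs. mid-range "real" intensities
synthetic_pixels = np.clip(rng.normal(40, 20, 50_000), 0, 255)
real_pixels = np.clip(rng.normal(110, 40, 50_000), 0, 255)

p = intensity_histogram(synthetic_pixels)
q = intensity_histogram(real_pixels)

print("synthetic vs real:", round(total_variation(p, q), 3))
print("synthetic vs itself:", total_variation(p, p))
</code></pre>
<p>On real data, the pixel arrays would come from flattening a sample of images from each domain. A distance well above what you see between two random halves of the same domain is a sign of meaningful shift.</p>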
<h3 id="heading-diagnostic-3-can-the-gap-be-fixed-by-adding-real-data">Diagnostic 3: Can the Gap Be Fixed by Adding Real Data?</h3>
<p>The obvious next idea was: why not include some real labeled data during training?</p>
<p>But before implementing that approach, it's worth checking how many distinct patients actually had labels.</p>
<pre><code class="language-plaintext">Labeled real images: 60
Distinct subjects (labeled real): 4

Frames per subject:
  subject h: 26
  subject a: 16
  subject g: 10
  subject b: 8
</code></pre>
<p>Only <strong>four</strong> patients.</p>
<p>That result fundamentally changed the situation.</p>
<p>Proper medical imaging evaluation requires subject-grouped train/test splits. But with only four patients, any evaluation becomes statistically unstable.</p>
<p>Training on two or three patients and testing on one or two patients would produce highly unreliable metrics that depend heavily on which patient happened to be held out.</p>
<p>At that point, the dataset simply couldn't support trustworthy real-world evaluation.</p>
<p><strong>Lesson:</strong> In medical imaging, count subjects, not images. The true size of a dataset is bounded by the number of independent patients, not the number of files.</p>
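<p>To see how fragile a four-subject evaluation is, it helps to enumerate the leave-one-subject-out folds. The sketch below uses the frame counts listed above. Leave-one-subject-out is just one possible evaluation scheme, shown here for illustration:</p>
<pre><code class="language-python"># Frames per labeled real subject, as listed above
frames_per_subject = {"h": 26, "a": 16, "g": 10, "b": 8}
total_frames = sum(frames_per_subject.values())

# Leave-one-subject-out: each fold holds out every frame from one patient
folds = []
for held_out, n_test in frames_per_subject.items():
    n_train = total_frames - n_test
    folds.append((held_out, n_train, n_test))
    print(f"hold out {held_out}: train on {n_train} frames, test on {n_test}")

test_sizes = [n_test for _, _, n_test in folds]
print("test set size ranges from", min(test_sizes), "to", max(test_sizes), "frames")
</code></pre>
<p>With only four folds, each score reflects a single patient, so the spread across folds says more about which patient was held out than about the model.</p>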
<h2 id="heading-step-4-knowing-when-to-stop">Step 4: Knowing When to Stop</h2>
<p>At this point, additional tuning no longer made sense.</p>
<p>The bottleneck was not the architecture, optimizer, or learning rate. The bottleneck was the dataset itself.</p>
<p>The pipeline was still valuable and reusable. But this particular dataset couldn't reliably support the intended segmentation task.</p>
<p>That distinction matters: sometimes a problem is difficult but solvable, and sometimes the data simply can't support the conclusion you want to draw.</p>
<p>Learning to recognize the difference is an important ML skill.</p>
<h2 id="heading-a-practical-dataset-evaluation-checklist">A Practical Dataset Evaluation Checklist</h2>
<p>Before committing weeks to model development, these checks are worth running on any dataset:</p>
<ol>
<li><p><strong>Chart the dataset composition</strong> — labeled vs unlabeled, class distribution, domain distribution</p>
</li>
<li><p><strong>Count subjects, not images</strong> — independent patients matter more than frame count</p>
</li>
<li><p><strong>Check class balance</strong> — rare classes are often ignored without weighting or sampling strategies</p>
</li>
<li><p><strong>Compare train and deployment distributions</strong> — especially for cross-domain problems</p>
</li>
<li><p><strong>Verify labels visually</strong> — catch preprocessing or annotation errors early</p>
</li>
<li><p><strong>Look for published baselines</strong> — low published performance may indicate dataset limitations</p>
</li>
</ol>
<p>These checks take minutes and can save weeks of unnecessary tuning.</p>
<h2 id="heading-what-i-would-try-next">What I Would Try Next</h2>
<p>Improving results would likely require better data rather than a larger model. The next steps I'd prioritize:</p>
<ul>
<li><p>collecting more labeled real ultrasound scans, from more distinct patients</p>
</li>
<li><p>improving annotation consistency</p>
</li>
<li><p>semi-supervised learning to make use of the unlabeled real images</p>
</li>
<li><p>domain adaptation between synthetic and real ultrasound</p>
</li>
</ul>
<p>All of these target the actual bottleneck: data quality and data diversity.</p>
<h2 id="heading-the-bigger-lesson">The Bigger Lesson</h2>
<p>In machine learning, it's easy to focus most of our attention on architectures, hyperparameters, optimization tricks, and newer models.</p>
<p>But the dataset quietly defines the ceiling.</p>
<p>A sophisticated model on weak data often disappoints, while a simpler model on strong data performs surprisingly well.</p>
<p>That was the real lesson from this project.</p>
<p>The most valuable skill wasn't building the pipeline. It was diagnosing why the model couldn't succeed and being willing to trust what the data was saying.</p>
<p>The workflow — checking dataset composition, counting subjects, comparing distributions, ruling out engineering bugs, and deciding when to stop — transfers to almost any ML project.</p>
<p>In many projects, better judgment about the data matters more than a better model.</p>
<p>The pipeline code and diagnostic notebooks are available at the <a href="https://github.com/lakshmi-mahabaleshwara/wg-ultrasound/tree/abdomen_simulation_segmentation/data_and_tutorials/abdomen_us_multiorgan_segmentation">MONAI Ultrasound Working Group repository</a>. Questions, corrections, and improvements are always welcome.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ AI Paper Review: GPT-4 Technical Report (GPT-4) ]]>
                </title>
                <description>
                    <![CDATA[ When GPT-3 was released in 2020, it completely changed how people thought about language models. It showed that a sufficiently large neural network could learn tasks directly from prompts and examples ]]>
                </description>
                <link>https://www.freecodecamp.org/news/ai-paper-review-gpt-4-technical-report/</link>
                <guid isPermaLink="false">6a17653cbadcd8afcb2bb430</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GPT 4 ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mohammed Fahd Abrah ]]>
                </dc:creator>
                <pubDate>Wed, 27 May 2026 21:42:20 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/2a5eb5e0-bd3c-4423-b9b5-b94edbaaba98.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When GPT-3 was released in 2020, it completely changed how people thought about language models. It showed that a sufficiently large neural network could learn tasks directly from prompts and examples without traditional fine-tuning.</p>
<p>That idea eventually led to prompt engineering, AI assistants, and the first wave of large language model applications.</p>
<p>But GPT-4 felt different.</p>
<p>GPT-3 still felt like a research breakthrough: powerful, experimental, and sometimes unpredictable. GPT-4, on the other hand, felt like the beginning of a real AI platform. The focus was no longer just on scaling language models to achieve better benchmarks. Instead, the conversation shifted toward reliability, multimodal understanding, alignment, safety, and real-world deployment.</p>
<p>This change is visible throughout the GPT-4 Technical Report released by <a href="https://openai.com">OpenAI</a>.</p>
<p>Unlike the earlier GPT papers, OpenAI didn't publish a traditional research paper with detailed architecture diagrams, parameter counts, datasets, or training configurations. Instead, they released a more limited technical report focused primarily on capabilities, evaluations, safety work, and deployment considerations.</p>
<p>That decision itself reflects how much the field had changed.</p>
<p>By the time GPT-4 arrived, large language models were no longer just research projects used inside labs. They had become globally deployed systems used by millions of people through products like <a href="https://chatgpt.com">ChatGPT</a>. Questions about misuse, hallucinations, bias, cybersecurity risks, and alignment were now just as important as raw model performance.</p>
<p>GPT-4 also introduced another major shift: multimodality.</p>
<p>Previous GPT models worked only with text. GPT-4 expanded this idea by accepting both images and text as input, allowing the model to analyze screenshots, diagrams, documents, visual jokes, and other mixed forms of information. This pushed large language models closer to more general-purpose AI systems rather than narrow text generators.</p>
<p>Historically, the progression becomes surprisingly clear:</p>
<ul>
<li><p>GPT-1 introduced pretraining and transfer learning</p>
</li>
<li><p>GPT-2 introduced zero-shot multitask learning</p>
</li>
<li><p>GPT-3 introduced few-shot prompting and in-context learning</p>
</li>
<li><p>GPT-4 introduced the era of aligned, multimodal AI systems</p>
</li>
</ul>
<p>In many ways, GPT-4 marks the moment when large language models stopped being viewed primarily as research experiments and started becoming foundational computing interfaces for real-world applications.</p>
<h2 id="heading-paper-overview"><strong>Paper Overview</strong></h2>
<p>In this article, we’ll review the <em>GPT-4 Technical Report</em> published by OpenAI in 2023.</p>
<p>Many important technical details were intentionally omitted from this report, including:</p>
<ul>
<li><p>parameter count</p>
</li>
<li><p>exact architecture</p>
</li>
<li><p>training compute</p>
</li>
<li><p>dataset composition</p>
</li>
<li><p>hardware configuration</p>
</li>
</ul>
<p>According to OpenAI, these details were withheld because of both the competitive landscape and the growing safety implications of large-scale AI systems.</p>
<p>That difference is historically important.</p>
<p>The GPT-1, GPT-2, and GPT-3 papers openly discussed architecture scaling, datasets, and training methodology in significant detail. GPT-4 marks a noticeable shift toward more restricted disclosure as language models became commercially valuable and widely deployed.</p>
<p>You can read the original report here:</p>
<p><a href="https://arxiv.org/abs/2303.08774">GPT-4 Technical Report</a></p>
<p>And here’s a quick infographic of what we’ll cover throughout this review:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/6edf3f33-6994-46a6-abd9-b04b7e75ddee.png" alt="GPT4 AI Paper Quick Insight" style="display:block;margin:0 auto" width="1414" height="2000" loading="lazy">

<h2 id="heading-table-of-content"><strong>Table of Contents</strong></h2>
<ul>
<li><p><a href="#heading-executive-summary">Executive Summary</a></p>
</li>
<li><p><a href="#heading-goals-of-the-report">Goals of the Report</a></p>
</li>
<li><p><a href="#heading-core-idea">Core Idea</a></p>
</li>
<li><p><a href="#heading-predictable-scaling">Predictable Scaling</a></p>
</li>
<li><p><a href="#heading-model-architecture">Model Architecture</a></p>
</li>
<li><p><a href="#heading-multimodal-learning">Multimodal Learning</a></p>
</li>
<li><p><a href="#heading-fine-tuning-vs-zero-shot-vs-few-shot-vs-aligned-multimodal-learning">Fine-Tuning vs Zero-Shot vs Few-Shot vs Aligned Multimodal Learning</a></p>
</li>
<li><p><a href="#heading-rlhf-and-alignment">RLHF and Alignment</a></p>
</li>
<li><p><a href="#heading-benchmarks-and-experiments">Benchmarks and Experiments</a></p>
</li>
<li><p><a href="#heading-coding-and-reasoning-ability">Coding and Reasoning Ability</a></p>
</li>
<li><p><a href="#heading-multilingual-capabilities">Multilingual Capabilities</a></p>
</li>
<li><p><a href="#heading-emergent-behavior">Emergent Behavior</a></p>
</li>
<li><p><a href="#heading-limitations">Limitations</a></p>
</li>
<li><p><a href="#heading-safety-and-risks">Safety and Risks</a></p>
</li>
<li><p><a href="#heading-discussion">Discussion</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-final-insight">Final Insight</a></p>
</li>
<li><p><a href="#heading-gpt-1-vs-gpt-2-vs-gpt-3-vs-gpt-4-key-differences">GPT-1 vs GPT-2 vs GPT-3 vs GPT-4: Key Differences</a></p>
</li>
<li><p><a href="#heading-pytorch-implementations-of-the-gpt-architecture-evolution">PyTorch Implementations of the GPT Architecture Evolution</a></p>
</li>
<li><p><a href="#heading-resources">Resources:</a></p>
</li>
</ul>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>To get the most out of this breakdown, it helps to already be familiar with some of the core ideas behind modern language models.</p>
<p>Reading the earlier reviews in this series will be especially useful:</p>
<ul>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-improving-language-understanding-by-generative-pre-training-gpt-1/">AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)</a></p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-language-models-are-unsupervised-multitask-learners-gpt-2/">AI Paper Review: Language Models are Unsupervised Multitask Learners (GPT-2)</a></p>
</li>
<li><p><a href="https://www.freecodecamp.org/news/ai-paper-review-language-models-are-few-shot-learners-gpt-3/">AI Paper Review: Language Models are Few-Shot Learners (GPT-3)</a></p>
</li>
</ul>
<p>GPT-4 builds directly on many of the concepts introduced in those papers, especially large-scale pretraining, zero-shot and few-shot learning, and in-context prompting.</p>
<p>It also helps to have a general understanding of:</p>
<ul>
<li><p>Transformer architectures and self-attention</p>
</li>
<li><p>The evolution from GPT-1 → GPT-3</p>
</li>
<li><p>Few-shot learning and prompting</p>
</li>
<li><p>Basic prompt engineering concepts</p>
</li>
<li><p>Reinforcement Learning from Human Feedback (RLHF)</p>
</li>
<li><p>Scaling laws and why larger models often develop new capabilities</p>
</li>
</ul>
<p>You don't need deep mathematical knowledge to follow this article, though.</p>
<p>As with the previous reviews, I’ll focus more on explaining the ideas intuitively and practically rather than diving too deeply into heavy equations or dense academic terminology.</p>
<h2 id="heading-executive-summary"><strong>Executive Summary</strong></h2>
<p>GPT-4 is not simply a larger version of GPT-3.</p>
<p>That may sound obvious today, but when it was released, many people assumed GPT-4 was just another scaling step in the same direction. The technical report shows something more important: GPT-4 represents a shift from experimental language models toward deployable general-purpose AI systems.</p>
<p>According to the report, GPT-4 introduces several major advances at once.</p>
<p>First, as mentioned above, the model becomes <em>multimodal</em>. Unlike previous GPT systems that only worked with text, GPT-4 can process both images and text as input while still generating text outputs. This allows the model to analyze screenshots, diagrams, documents, photographs, visual jokes, and mixed media prompts.</p>
<p>Second, GPT-4 demonstrates significantly stronger reasoning and benchmark performance across a wide range of professional and academic evaluations. The report shows GPT-4 reaching human-level results on many exams, including a simulated Uniform Bar Exam (where it scored around the top 10% of test takers), the LSAT, GRE, SAT, and AP tests, alongside strong results on coding and reasoning benchmarks.</p>
<p>The report also places heavy emphasis on <em>alignment</em> and <em>factuality</em> improvements.</p>
<p>Earlier GPT systems often produced unsafe, misleading, or overly confident outputs. GPT-4 still has these problems, but OpenAI invested heavily in reinforcement learning from human feedback (RLHF), adversarial testing, refusal behavior, and safety evaluation pipelines to reduce harmful behavior and improve adherence to user intent.</p>
<p>Another major theme throughout the report is <em>predictable scaling</em>.</p>
<p>According to the authors, OpenAI developed infrastructure and optimization methods that allowed them to accurately predict some aspects of GPT-4’s final performance using much smaller training runs.</p>
<p>That detail matters more than it might seem.</p>
<p>GPT-3 demonstrated that scaling works. GPT-4 demonstrates that scaling large language models was becoming an engineering discipline with increasingly predictable behavior.</p>
<p>The broader implication is what makes this report historically important.</p>
<p>GPT-4 transforms large language models from research demonstrations into deployable AI assistants capable of reasoning across many domains, interacting through natural language, following instructions more reliably, and operating at global scale through systems like ChatGPT.</p>
<p>In many ways, this report marks the beginning of the modern AI deployment era.</p>
<h2 id="heading-goals-of-the-report"><strong>Goals of the Report</strong></h2>
<p>The GPT-4 Technical Report is not only about showing a more capable language model. In many ways, the report is about demonstrating that large AI systems can be developed more reliably, more safely, and more predictably than before.</p>
<p>One of the main goals behind GPT-4 was improving reasoning and reliability across a broad range of tasks, which we discussed above.</p>
<p>Another major objective was improving <em>alignment</em> with user intent – investing in RLHF, safety fine-tuning, refusal training, and adversarial testing to make the model more helpful and better aligned with intended behavior.</p>
<p>The report also marks a significant shift beyond text-only AI systems, as GPT-4 introduces multimodal capabilities. This expands the system from being purely a language generator into something closer to a general-purpose reasoning interface capable of interpreting visual and textual information together.</p>
<p>Safety is another central theme throughout the report.</p>
<p>OpenAI repeatedly emphasizes efforts to reduce harmful outputs, improve refusal behavior, mitigate misuse risks, and build safer deployment systems around the model. The report discusses red teaming, domain expert testing, policy enforcement, and model-assisted safety pipelines designed to reduce dangerous behavior during real-world usage.</p>
<p>But one of the most historically important goals may actually be <em>predictability</em>.</p>
<p>According to the authors, GPT-4 was developed using infrastructure and optimization methods designed to scale in highly predictable ways. OpenAI claims they could estimate aspects of GPT-4’s final performance using models trained with thousands of times less compute.</p>
<p>That idea may sound technical, but it represents a major shift in how frontier AI systems were being built.</p>
<p>Earlier generations of language models often involved substantial uncertainty during scaling. GPT-4 suggests that large-scale AI development was becoming more systematic and engineering-driven rather than purely experimental.</p>
<p>In practice, the report reflects a broader transition happening across the AI industry, from research prototypes to deployable infrastructure systems designed for real-world use at massive scale.</p>
<h2 id="heading-core-idea"><strong>Core Idea</strong></h2>
<p>One of the most surprising things about GPT-4 is that, underneath all the hype and new capabilities, the core learning objective is still fundamentally very simple.</p>
<p>Like GPT-1, GPT-2, and GPT-3, GPT-4 is still trained primarily as a next-token prediction model. In other words, the system learns by repeatedly predicting the next piece of text in a sequence.</p>
<p>The architecture also remains Transformer-based and autoregressive.</p>
<p>That means GPT-4 generates outputs one token at a time while using self-attention to understand relationships between words, sentences, images, and context inside the input sequence.</p>
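<p>To make the objective concrete, here's a minimal sketch in plain Python. It's a toy bigram counter, not GPT-4's architecture (which the report doesn't disclose), and the corpus and the names <code>train_bigram</code>, <code>next_token_loss</code>, and <code>generate</code> are made up for illustration. It shows the two ideas above: the training signal is the negative log-probability of the true next token, and generation appends one token at a time.</p>

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the counts for one context into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def next_token_loss(counts, tokens):
    """Average negative log-likelihood of each true next token."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_probs(counts, prev)[nxt]
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

def generate(counts, start, max_new_tokens):
    """Greedy autoregressive decoding: append the most likely next token."""
    out = [start]
    for _ in range(max_new_tokens):
        if not counts[out[-1]]:
            break  # no known continuation for this context
        out.append(counts[out[-1]].most_common(1)[0][0])
    return out

# Toy corpus (illustrative only).
corpus = "the model predicts the next token and the model learns".split()
counts = train_bigram(corpus)
loss = next_token_loss(counts, corpus)
sample = generate(counts, "the", 3)
```

<p>A real GPT model replaces the count table with a Transformer that conditions on the whole preceding context, but the loss it minimizes and the one-token-at-a-time decoding loop have the same shape as in this sketch.</p>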
<p>At a high level, the underlying principle hasn't changed very much since GPT-2:</p>
<ul>
<li><p>train on massive amounts of data</p>
</li>
<li><p>predict the next token</p>
</li>
<li><p>scale the model aggressively</p>
</li>
</ul>
<p>But GPT-4 pushes this approach much further.</p>
<p>The report doesn't disclose the model's size, but it does describe infrastructure and optimization methods designed specifically for predictable behavior at large scale.</p>
<p>The biggest conceptual change is that GPT-4 is no longer limited to text-only input.</p>
<p>Another major difference is the importance of <em>post-training alignment</em>.</p>
<p>GPT-3 already demonstrated strong few-shot learning abilities, but GPT-4 places much heavier emphasis on reinforcement learning from human feedback (RLHF), safety tuning, refusal behavior, and instruction following. According to the report, these post-training processes significantly improve factuality, adherence to desired behavior, and response safety.</p>
<p>This leads to one of the most important ideas behind modern AI systems:</p>
<p>Capability doesn't emerge from scale alone.</p>
<p>GPT-4 suggests that powerful AI behavior comes from the combination of:</p>
<ul>
<li><p>large-scale pretraining</p>
</li>
<li><p>scaling laws</p>
</li>
<li><p>optimization improvements</p>
</li>
<li><p>alignment training</p>
</li>
<li><p>RLHF</p>
</li>
<li><p>post-training refinement</p>
</li>
</ul>
<p>In practice, GPT-4 feels less like a raw predictive model and more like an interactive assistant because of this additional alignment layer.</p>
<p>That distinction matters historically.</p>
<p>GPT-3 showed that scaling language models could unlock powerful emergent behavior. GPT-4 shows that scaling alone is not enough — the model also needs alignment, safety training, and deployment-focused refinement to become broadly usable in the real world.</p>
<h2 id="heading-predictable-scaling"><strong>Predictable Scaling</strong></h2>
<p>One of the most important ideas in the GPT-4 Technical Report is something that many people overlooked when the paper first came out: predictable scaling.</p>
<p>Earlier generations of large language models involved a huge amount of uncertainty.</p>
<p>Researchers could train larger systems and hope performance would improve, but nobody fully knew how far scaling would go or whether massive training runs would behave the way they expected.</p>
<p>GPT-4 changed that. According to the report, OpenAI developed infrastructure and optimization methods that allowed them to accurately predict GPT-4’s final training loss, and even some capabilities, using models trained with thousands of times less compute.</p>
<p>This is far more important than it first sounds. GPT-3 proved that scaling language models works.</p>
<p>GPT-4 suggested that scaling was starting to become predictable engineering rather than trial-and-error experimentation.</p>
<p>That shift introduced several major advantages:</p>
<ul>
<li><p>Better capability forecasting before training massive models</p>
</li>
<li><p>Reduced risk of wasting millions of dollars on failed training runs</p>
</li>
<li><p>Safer deployment planning through earlier evaluation of model behavior</p>
</li>
<li><p>More reliable scaling from small experiments to frontier-scale systems</p>
</li>
</ul>
<p>The report also shows that model loss followed remarkably stable power-law behavior across scales, allowing OpenAI to estimate GPT-4’s final performance long before training finished.</p>
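<p>The report describes fitting a scaling law with an irreducible loss term, of the form L(C) = a·C^b + c, to smaller training runs and extrapolating it to the full compute budget. Here's a rough sketch of that idea in plain Python. All the numbers are synthetic, the helper names <code>fit_power_law</code> and <code>predict</code> are hypothetical, and the simple grid search over <code>c</code> is my own choice for illustration, not OpenAI's fitting procedure.</p>

```python
import math

def fit_power_law(compute, loss, c_grid):
    """Fit loss ~ a * compute**b + c.

    For each candidate irreducible loss c, fit a straight line to
    log(loss - c) versus log(compute) and keep the best fit.
    """
    xs = [math.log(x) for x in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    best = None
    for c in c_grid:
        shifted = [value - c for value in loss]
        if not all(s > 0 for s in shifted):
            continue  # log(loss - c) would be undefined
        ys = [math.log(s) for s in shifted]
        mean_y = sum(ys) / n
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        b = sxy / sxx
        log_a = mean_y - b * mean_x
        err = sum((log_a + b * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or best[0] > err:
            best = (err, math.exp(log_a), b, c)
    _, a, b, c = best
    return a, b, c

def predict(params, compute):
    """Evaluate the fitted law at a given compute budget."""
    a, b, c = params
    return a * compute ** b + c

# Synthetic "small runs": compute is a fraction of the final run's budget.
true_a, true_b, true_c = 0.8, -0.15, 1.7
small_compute = [1e-6, 1e-5, 1e-4, 1e-3]
small_loss = [true_a * x ** true_b + true_c for x in small_compute]

c_grid = [i / 100 for i in range(100, 250)]  # candidate values 1.00 .. 2.49
params = fit_power_law(small_compute, small_loss, c_grid)
predicted_final = predict(params, 1.0)       # extrapolate to the full budget
```

<p>Because the synthetic data follows the law exactly, the extrapolation recovers the final loss almost perfectly. Real training curves are noisier, which is why the stability the report describes is notable.</p>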
<p>But the paper also makes an important point: not every capability scales smoothly. Some behaviors, especially reasoning-related tasks, can emerge unpredictably or even temporarily worsen before improving again.</p>
<p>Some important limitations of predictable scaling include:</p>
<ul>
<li><p>Some capabilities still emerge unpredictably at larger scales</p>
</li>
<li><p>Benchmark performance can behave nonlinearly instead of improving smoothly</p>
</li>
<li><p>Scaling laws may not hold forever as models continue growing</p>
</li>
<li><p>Even with predictable training curves, reasoning failures and hallucinations can still appear unexpectedly</p>
</li>
</ul>
<p>That tension between predictable scaling and unexpected emergence became one of the defining themes of modern frontier AI research.</p>
<h2 id="heading-model-architecture"><strong>Model Architecture</strong></h2>
<p>One of the most unusual aspects of the GPT-4 Technical Report is how little OpenAI reveals about the actual model architecture.</p>
<p>As discussed above, in the GPT-1, GPT-2, and GPT-3 papers, OpenAI openly discussed details like parameter counts, dataset sizes, scaling configurations, and training methodology.</p>
<p>GPT-4 is very different. The report leaves out several major technical details, including the exact parameter count, the precise architecture configuration, the dataset size and composition, the training compute used, and the hardware setup.</p>
<p>The report explicitly states that these omissions were motivated by both the competitive landscape and safety considerations surrounding large-scale AI systems.</p>
<p>That decision became one of the most discussed aspects of the release.</p>
<p>Historically, GPT-4 marks a transition where frontier AI research started becoming more closed and product-oriented. Earlier GPT papers felt like traditional research publications. GPT-4 feels more like a controlled systems report from a company deploying AI at global scale.</p>
<p>Even though many implementation details remain hidden, the report still confirms several important things:</p>
<ol>
<li><p>GPT-4 is still fundamentally a Transformer-based model trained using autoregressive next-token prediction.</p>
</li>
<li><p>Like previous GPT systems, it generates outputs sequentially while using self-attention mechanisms to process context.</p>
</li>
<li><p>GPT-4 is multimodal, meaning it can accept both image and text inputs while producing text outputs.</p>
</li>
</ol>
<p>This is one of the biggest architectural shifts in the GPT series because it extends the model beyond pure language understanding into combined visual and textual reasoning.</p>
<p>Another important component is post-training alignment, which we've already discussed a bit. In practice, it means that GPT-4 isn't just a raw pretrained language model anymore. It's a heavily refined system built through multiple stages:</p>
<ul>
<li><p>large-scale pretraining</p>
</li>
<li><p>optimization and scaling improvements</p>
</li>
<li><p>multimodal integration</p>
</li>
<li><p>RLHF alignment</p>
</li>
<li><p>safety fine-tuning</p>
</li>
<li><p>deployment-oriented post-training</p>
</li>
</ul>
<p>The secrecy surrounding GPT-4’s architecture is historically important because it reflects a broader change happening in AI.</p>
<p>As language models became commercially valuable and socially impactful, frontier AI research started moving away from full openness toward controlled disclosure, safety-focused deployment, and competitive protection.</p>
<h2 id="heading-multimodal-learning"><strong>Multimodal Learning</strong></h2>
<p>One of the most important breakthroughs in GPT-4 is that the model is no longer limited to text alone. GPT-4 can accept both images and text as input while generating text outputs.</p>
<p>That may sound simple today, but at the time, this represented a major shift in how people thought about large language models.</p>
<p>Earlier GPT systems worked purely with language. GPT-4 expands the idea into something much broader: a model capable of reasoning across multiple forms of information at the same time.</p>
<p>In practice, GPT-4 can analyze:</p>
<ul>
<li><p>screenshots</p>
</li>
<li><p>diagrams</p>
</li>
<li><p>photographs</p>
</li>
<li><p>documents</p>
</li>
<li><p>charts</p>
</li>
<li><p>visual jokes and memes</p>
</li>
<li><p>mixed image-and-text prompts</p>
</li>
</ul>
<p>The report demonstrates this capability through several examples, but one became especially memorable: the famous VGA cable meme example.</p>
<p>In the image, a smartphone is plugged into what looks like a bulky VGA monitor connector, which is clearly absurd in real life. GPT-4 correctly explains that the humor comes from the mismatch between a large, outdated VGA connector and a small, modern smartphone charging port.</p>
<p>What made this example important was not just object recognition. The model was interpreting <em>contextual humor</em> from a visual scene.</p>
<p>That distinction matters.</p>
<p>Traditional computer vision systems could often identify objects inside images, but GPT-4 demonstrated something closer to multimodal reasoning: understanding relationships, context, intent, and even jokes across combined visual and textual information.</p>
<p>The report also notes that many prompting techniques developed for language models (including few-shot prompting and chain-of-thought reasoning) continue working effectively in multimodal settings.</p>
<p>This suggests that GPT-4 is not simply attaching an image classifier onto a chatbot. Instead, the model appears to integrate visual and language understanding into a more unified reasoning system.</p>
<p>Historically, this was a major moment for the GPT series.</p>
<ul>
<li><p>GPT-1 focused on language pretraining</p>
</li>
<li><p>GPT-2 expanded zero-shot capabilities</p>
</li>
<li><p>GPT-3 introduced in-context learning</p>
</li>
<li><p>GPT-4 publicly demonstrated practical multimodal AI</p>
</li>
</ul>
<p>And unlike many earlier research demos, GPT-4’s multimodal abilities were not just experimental prototypes hidden inside papers. They became part of real-world products used by millions of people.</p>
<p>That shift made multimodal AI feel practical and deployable rather than purely theoretical.</p>
<h2 id="heading-fine-tuning-vs-zero-shot-vs-few-shot-vs-aligned-multimodal-learning"><strong>Fine-Tuning vs Zero-Shot vs Few-Shot vs Aligned Multimodal Learning</strong></h2>
<p>One of the clearest ways to understand how GPT models evolved is by comparing how they learn and adapt to tasks.</p>
<p>Earlier NLP systems relied heavily on fine-tuning with labeled datasets, while later GPT models increasingly shifted toward zero-shot prompting, few-shot learning, and eventually aligned multimodal interaction.</p>
<p>The table below summarizes how these approaches differ in flexibility, training requirements, scalability, and real-world usability.</p>
<table style="min-width:125px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Aspect</strong></p></td><td><p><strong>Fine-Tuning</strong></p></td><td><p><strong>Zero-Shot Learning</strong></p></td><td><p><strong>Few-Shot Learning</strong></p></td><td><p><strong>GPT-4 Style Aligned Multimodal Learning</strong></p></td></tr><tr><td><p><strong>Definition</strong></p></td><td><p>The model is additionally trained on labeled data for a specific task</p></td><td><p>The model performs a task using only instructions, without examples</p></td><td><p>The model learns the task from a small number of examples inside the prompt</p></td><td><p>The model combines prompting, multimodal reasoning, and alignment training to perform general-purpose tasks</p></td></tr><tr><td><p><strong>Training Requirement</strong></p></td><td><p>Requires supervised task-specific datasets</p></td><td><p>No task-specific training or examples</p></td><td><p>No retraining, but requires demonstrations in prompts</p></td><td><p>Large-scale pretraining plus RLHF, safety tuning, and multimodal post-training</p></td></tr><tr><td><p><strong>How Tasks Are Given</strong></p></td><td><p>Through a separate training phase</p></td><td><p>Through natural language instructions</p></td><td><p>Through instructions plus examples</p></td><td><p>Through conversational prompts, images, instructions, and contextual interaction</p></td></tr><tr><td><p><strong>Learning Process</strong></p></td><td><p>Model weights are updated during training</p></td><td><p>No weight updates</p></td><td><p>No weight updates, as learning occurs in-context</p></td><td><p>Learns through pretraining, RLHF alignment, multimodal reasoning, and contextual prompting</p></td></tr><tr><td><p><strong>Flexibility</strong></p></td><td><p>Usually specialized for one task</p></td><td><p>Highly flexible across many 
tasks</p></td><td><p>Flexible while benefiting from demonstrations</p></td><td><p>Functions as a general-purpose multimodal assistant</p></td></tr><tr><td><p><strong>Adaptability</strong></p></td><td><p>Requires retraining for new tasks</p></td><td><p>Adapts instantly through prompts</p></td><td><p>Adapts quickly from contextual examples</p></td><td><p>Adapts dynamically across domains, modalities, and interaction styles</p></td></tr><tr><td><p><strong>Data Dependency</strong></p></td><td><p>Depends heavily on labeled datasets</p></td><td><p>Depends mostly on pretraining knowledge</p></td><td><p>Depends on pretraining plus prompt examples</p></td><td><p>Depends on massive multimodal pretraining and human feedback alignment</p></td></tr><tr><td><p><strong>Performance</strong></p></td><td><p>Often strongest on narrow benchmark tasks</p></td><td><p>Usually weaker than fine-tuning</p></td><td><p>Often approaches fine-tuned performance</p></td><td><p>Often surpasses specialized systems across many reasoning and language tasks</p></td></tr><tr><td><p><strong>Scalability Across Tasks</strong></p></td><td><p>Expensive and difficult to scale</p></td><td><p>Extremely scalable</p></td><td><p>Scalable without retraining</p></td><td><p>Scales broadly across language, coding, reasoning, and multimodal tasks</p></td></tr><tr><td><p><strong>Compute Cost</strong></p></td><td><p>High because each task may require retraining</p></td><td><p>Low during usage</p></td><td><p>Low during usage</p></td><td><p>Extremely high training cost but efficient deployment across many applications</p></td></tr><tr><td><p><strong>Example</strong></p></td><td><p>Fine-tune a model on a sentiment analysis dataset</p></td><td><p>“Classify the sentiment of this sentence”</p></td><td><p>“Positive: I loved the movie. 
Negative: The film was boring...”</p></td><td><p>Upload an image and ask the model to explain a chart, solve code, or summarize a document</p></td></tr><tr><td><p><strong>Main Strength</strong></p></td><td><p>High accuracy on specialized tasks</p></td><td><p>Simplicity and broad generalization</p></td><td><p>Strong balance between flexibility and performance</p></td><td><p>Unified multimodal reasoning with aligned conversational interaction</p></td></tr><tr><td><p><strong>Main Weakness</strong></p></td><td><p>Poor scalability across many tasks</p></td><td><p>Can misunderstand task format or intent</p></td><td><p>Sensitive to prompt quality and examples</p></td><td><p>Still hallucinates, makes reasoning errors, and requires heavy safety controls</p></td></tr><tr><td><p><strong>Most Associated With</strong></p></td><td><p>Traditional NLP systems, GPT-1 era</p></td><td><p>GPT-2 style prompting</p></td><td><p>GPT-3 and in-context learning</p></td><td><p>GPT-4 and aligned multimodal foundation models</p></td></tr><tr><td><p><strong>Core Idea</strong></p></td><td><p>Train specifically for each task</p></td><td><p>Infer tasks from instructions</p></td><td><p>Infer tasks from examples in context</p></td><td><p>Combine scale, alignment, multimodality, and prompting into deployable AI systems</p></td></tr></tbody></table>
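<p>The zero-shot and few-shot columns in the table differ only in what goes into the prompt. Here's a small sketch of that difference using the sentiment example from the table. The function names and the exact prompt wording are hypothetical, and no model is called: the code only builds the two prompt strings.</p>

```python
def zero_shot_prompt(text):
    """Instruction only: the model must infer the task from the wording."""
    return (
        "Classify the sentiment of this sentence as Positive or Negative.\n"
        f"Sentence: {text}\n"
        "Sentiment:"
    )

def few_shot_prompt(text, examples):
    """Same instruction, preceded by labeled demonstrations in the prompt."""
    lines = ["Classify the sentiment of each sentence as Positive or Negative.", ""]
    for sentence, label in examples:
        lines += [f"Sentence: {sentence}", f"Sentiment: {label}", ""]
    lines += [f"Sentence: {text}", "Sentiment:"]
    return "\n".join(lines)

# Demonstrations taken from the table's few-shot example.
examples = [
    ("I loved the movie.", "Positive"),
    ("The film was boring.", "Negative"),
]
query = "The ending surprised me in the best way."
zs = zero_shot_prompt(query)
fs = few_shot_prompt(query, examples)
```

<p>In both cases the model's weights stay fixed. The few-shot version simply spends part of the context window on demonstrations, which is the "learning occurs in-context" row of the table.</p>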

<h2 id="heading-rlhf-and-alignment"><strong>RLHF and Alignment</strong></h2>
<p>One of the biggest differences between GPT-4 and earlier GPT models is how much emphasis the report places on <em>alignment</em> and <em>safety</em>.</p>
<p>GPT-3 demonstrated impressive few-shot learning abilities, but it also exposed serious weaknesses. The model could hallucinate facts, generate harmful instructions, confidently produce false information, or fail to follow user intent reliably.</p>
<p>GPT-4 was designed with these problems in mind.</p>
<p>A major part of this improvement comes from Reinforcement Learning from Human Feedback (RLHF).</p>
<p>At a high level, RLHF works by collecting human feedback about model responses and then using that feedback to train the model toward preferred behavior. Instead of learning only from internet text, the system also learns from human judgments about what kinds of answers are helpful, safe, accurate, or appropriate.</p>
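<p>The GPT-4 report doesn't spell out its RLHF recipe, so the following is only a sketch of one common ingredient: the pairwise preference loss used to train reward models in pipelines such as InstructGPT. The scores below are invented, and <code>preference_loss</code> is a hypothetical helper name.</p>

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Small when the reward model scores the human-preferred response higher,
    large when it ranks the pair the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical reward-model scores for (preferred, rejected) response pairs.
comparisons = [(2.0, -1.0), (0.5, 0.4), (-0.3, 1.2)]
losses = [preference_loss(c, r) for c, r in comparisons]
mean_loss = sum(losses) / len(losses)
```

<p>The first pair is ranked correctly with a wide margin and gets a small loss, while the third pair is ranked backwards and gets the largest one. A reward model trained this way can then score new responses during the reinforcement learning stage.</p>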
<p>According to the report, GPT-4 undergoes extensive post-training alignment designed to improve:</p>
<ul>
<li><p>factuality</p>
</li>
<li><p>instruction following</p>
</li>
<li><p>refusal behavior</p>
</li>
<li><p>harmlessness</p>
</li>
<li><p>adherence to user intent</p>
</li>
</ul>
<p>This alignment layer is a major reason GPT-4 feels different from raw pretrained language models.</p>
<p>The report repeatedly emphasizes <em>refusal behavior</em> as an important safety capability.</p>
<p>Earlier versions of GPT-4 could sometimes generate dangerous instructions, including harmful chemical synthesis advice or weapon-related content during internal testing. OpenAI used adversarial testing, domain experts, RLHF training, and additional safety pipelines to reduce these behaviors significantly.</p>
<p>The examples shown in the report are especially revealing.</p>
<p>In one case, an earlier GPT-4 version provided detailed responses about creating dangerous materials. Later aligned versions instead refuse the request and redirect the conversation safely.</p>
<p>What makes this important is that GPT-4 is not simply being made “more restrictive.”</p>
<p>The report also discusses the opposite problem: models becoming <em>too cautious</em>. OpenAI specifically worked on reducing unnecessary refusals for harmless requests while still blocking dangerous ones.</p>
<p>In practice, alignment becomes a balancing act between:</p>
<ul>
<li><p>usefulness</p>
</li>
<li><p>safety</p>
</li>
<li><p>honesty</p>
</li>
<li><p>flexibility</p>
</li>
<li><p>and reliability</p>
</li>
</ul>
<p>The paper also introduces <em>rule-based reward models</em> and model-assisted safety pipelines that help guide GPT-4 toward safer behavior during training.</p>
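<p>As a rough illustration of how a rule-based reward can encourage refusing harmful requests without refusing harmless ones, consider the toy sketch below. In the report, the classification step is itself performed by a GPT-4 classifier guided by a human-written rubric. Here the label is simply assumed to be given, and the reward values and function name are invented for the example.</p>

```python
from enum import Enum


class ResponseLabel(Enum):
    """The four categories the report describes for classifying a response."""
    DESIRED_REFUSAL = "refusal in the desired style"
    UNDESIRED_REFUSAL = "refusal in an undesired style"
    DISALLOWED_CONTENT = "contains disallowed content"
    SAFE_NON_REFUSAL = "safe non-refusal response"


def rule_based_reward(prompt_is_harmful: bool, label: ResponseLabel) -> float:
    """Return an illustrative reward; the numeric values are made up."""
    if label is ResponseLabel.DISALLOWED_CONTENT:
        return -1.0  # disallowed content is penalized for any prompt
    if prompt_is_harmful:
        # Harmful request: a well-styled refusal is the target behavior.
        rewards = {
            ResponseLabel.DESIRED_REFUSAL: 1.0,
            ResponseLabel.UNDESIRED_REFUSAL: 0.25,
            ResponseLabel.SAFE_NON_REFUSAL: 0.0,
        }
    else:
        # Harmless request: answering is the target, refusing is over-caution.
        rewards = {
            ResponseLabel.SAFE_NON_REFUSAL: 1.0,
            ResponseLabel.DESIRED_REFUSAL: -0.5,
            ResponseLabel.UNDESIRED_REFUSAL: -0.5,
        }
    return rewards[label]


print(rule_based_reward(True, ResponseLabel.DESIRED_REFUSAL))
print(rule_based_reward(False, ResponseLabel.DESIRED_REFUSAL))
```

<p>The same refusal earns a positive reward for a harmful prompt and a negative one for a harmless prompt, which is the balance between safety and usefulness described above.</p>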
<p>Historically, this section of the report marks another major transition in AI development.</p>
<p>Earlier GPT papers focused primarily on capabilities and scaling. GPT-4 treats alignment and deployment safety as core engineering problems rather than secondary concerns.</p>
<p>That shift reflects a deeper realization across the industry: once AI systems become powerful enough for real-world deployment at global scale, improving intelligence alone is no longer enough. The systems also need to behave safely, follow human intent reliably, and resist harmful misuse.</p>
<h2 id="heading-benchmarks-and-experiments"><strong>Benchmarks and Experiments</strong></h2>
<p>One of the most striking parts of the GPT-4 Technical Report is the sheer scale of the evaluation process.</p>
<p>According to the report, OpenAI tested GPT-4 across a wide range of academic exams, professional certifications, reasoning tasks, coding benchmarks, and traditional NLP evaluations.</p>
<p>The goal was not simply to show that GPT-4 could generate fluent text. The evaluations were designed to measure whether the model could reason, solve problems, follow instructions, answer questions, and generalize across many different domains.</p>
<p>The human exam results attracted enormous attention when the report was released.</p>
<p>GPT-4 achieved particularly strong scores on several well-known exams:</p>
<ul>
<li><p><a href="https://www.ncbex.org/exams/ube">Uniform Bar Exam → around the top 10% of test takers</a></p>
</li>
<li><p><a href="https://www.lsac.org/lsat">LSAT → roughly 88th percentile</a></p>
</li>
<li><p><a href="https://satsuite.collegeboard.org/sat/whats-on-the-test/reading-writing">SAT Reading &amp; Writing → around 93rd percentile</a></p>
</li>
<li><p><a href="https://www.ets.org/gre/test-takers/general-test/prepare/content/verbal-reasoning.html">GRE Verbal → around the 99th percentile</a></p>
</li>
<li><p><a href="https://apstudents.collegeboard.org/">Strong performance across many AP exams</a></p>
</li>
</ul>
<h3 id="heading-gpt-performance-on-academic-and-professional-exams">GPT Performance on Academic and Professional Exams</h3>
<p>The table below summarizes GPT-4’s performance across a wide range of academic and professional exams, showing how the model compared with GPT-3.5 on tests such as the Uniform Bar Exam, LSAT, GRE, SAT, AP exams, and coding challenges.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/f66d72a0-ce80-4ec9-acd3-ad8c3e974acd.png" alt="GPT Performance on Academic Professional Exams" style="display:block;margin:0 auto" width="752" height="812" loading="lazy">

<p>Source: <a href="https://arxiv.org/pdf/2303.08774">GPT-4 Technical Report</a> (OpenAI, 2023), Table 1.</p>
<p>The comparison with GPT-3.5 was especially dramatic in some cases. For example, the report notes that GPT-3.5 scored near the bottom 10% on the simulated bar exam, while GPT-4 reached the top 10%.</p>
<p>These results helped change public perception of large language models.</p>
<p>Earlier systems were often viewed mainly as autocomplete engines or text generators. GPT-4 demonstrated that scaling and alignment could produce systems capable of performing competitively on many tasks originally designed for humans.</p>
<p>The figure below visualizes GPT-4’s percentile rankings across multiple exams, highlighting the significant improvement over GPT-3.5 in areas such as reasoning, language understanding, mathematics, and professional testing.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/f5c4d70a-7da3-482a-bb57-688bf63bbeb2.png" alt="GPT Performance on Academic Professional Exams" style="display:block;margin:0 auto" width="881" height="825" loading="lazy">

<p>Source: <a href="https://arxiv.org/pdf/2303.08774">GPT-4 Technical Report</a> (OpenAI, 2023), Figure 4.</p>
<p>The report also evaluates GPT-4 on a wide collection of standard NLP benchmarks.</p>
<p>Some of the most important include:</p>
<ul>
<li><p><a href="https://arxiv.org/abs/2009.03300">MMLU → broad academic and professional reasoning benchmark</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1905.07830">HellaSwag → commonsense reasoning</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2107.03374">HumanEval → coding and Python synthesis tasks</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2110.14168">GSM8K → grade-school mathematics reasoning</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1803.05457">ARC → science reasoning questions</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1907.10641">WinoGrande → pronoun and commonsense reasoning</a></p>
</li>
</ul>
<p>Across most of these evaluations, GPT-4 substantially outperforms GPT-3.5 and often surpasses previous state-of-the-art language models. In several cases, it even exceeds systems that relied on benchmark-specific fine-tuning or specialized engineering pipelines.</p>
<p>One especially important benchmark is MMLU (Massive Multitask Language Understanding), which tests knowledge and reasoning across 57 different subjects. GPT-4 achieves remarkably strong performance on this benchmark, including multilingual variants translated into many languages.</p>
<p>The coding evaluations are also historically significant. On HumanEval and LeetCode-style tasks, GPT-4 demonstrates major improvements in code generation and problem solving compared to earlier GPT systems.</p>
<p>This capability eventually became one of the foundations behind modern AI coding assistants.</p>
<p>The table below compares GPT-4 with previous language models and state-of-the-art systems on major AI benchmarks such as MMLU, HellaSwag, ARC, HumanEval, and GSM8K, demonstrating the model’s strong performance across reasoning, coding, and language understanding tasks.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/77b6a129-6581-4a13-aa04-4c34d19b43f7.png" alt="GPT Performance on Academic benchmarks" style="display:block;margin:0 auto" width="981" height="826" loading="lazy">

<p>Source: <a href="https://arxiv.org/pdf/2303.08774">GPT-4 Technical Report</a> (OpenAI, 2023), Table 2.</p>
<p>What makes these experiments especially important is that GPT-4 performs well across <em>many different categories simultaneously</em>:</p>
<ul>
<li><p>reasoning</p>
</li>
<li><p>coding</p>
</li>
<li><p>mathematics</p>
</li>
<li><p>language understanding</p>
</li>
<li><p>professional exams</p>
</li>
<li><p>multilingual tasks</p>
</li>
<li><p>commonsense reasoning</p>
</li>
</ul>
<p>That breadth is part of what made GPT-4 feel qualitatively different from earlier systems.</p>
<p>Instead of excelling in one narrow benchmark, GPT-4 demonstrated increasingly general behavior across a wide variety of intellectual tasks.</p>
<h2 id="heading-coding-and-reasoning-ability"><strong>Coding and Reasoning Ability</strong></h2>
<p>One of the areas where GPT-4 shows some of its most noticeable improvements over earlier models is coding and structured reasoning.</p>
<p>While GPT-3 was already capable of generating code, GPT-4 pushes these abilities much further. According to the report, the model demonstrates substantial gains on programming benchmarks, mathematical reasoning tasks, and multi-step problem solving.</p>
<p>A key benchmark highlighted in the report is <em>HumanEval</em>, which measures the model’s ability to generate working Python functions from natural language descriptions.</p>
<p>GPT-4 achieves significantly higher performance than GPT-3.5 on this benchmark, showing much stronger code synthesis and problem-solving ability.</p>
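<p>HumanEval scores are usually reported as a pass rate: the fraction of problems for which generated code passes the hidden unit tests. The sketch below shows the unbiased pass@k estimator from the original HumanEval (Codex) paper. The sample counts are hypothetical and aren't taken from the GPT-4 report.</p>

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c pass the unit tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    set of k samples contains no passing solution. Requires k to be at most n.
    """
    # comb returns 0 when k exceeds n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 generated solutions, 3 of them pass the tests.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

<p>With k set to 1, the estimate reduces to the plain fraction of samples that pass. Averaging it over every problem gives the benchmark score.</p>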
<p>The report also includes LeetCode-style evaluations across easy, medium, and hard programming problems.</p>
<p>Although GPT-4 still struggles with many difficult competitive programming tasks, it performs substantially better than GPT-3.5, especially on easier and medium-level coding challenges.</p>
<p>These improvements became extremely important in practice.</p>
<p>Around the release of GPT-4, AI coding assistants started becoming genuinely useful for real software development workflows. Systems built on GPT-4 could help developers:</p>
<ul>
<li><p>generate functions</p>
</li>
<li><p>explain code</p>
</li>
<li><p>debug errors</p>
</li>
<li><p>refactor implementations</p>
</li>
<li><p>write documentation</p>
</li>
<li><p>solve algorithmic problems</p>
</li>
</ul>
<p>This was one of the first moments where large language models began functioning as practical engineering tools rather than experimental demos.</p>
<p>The report also relies on <em>chain-of-thought prompting</em> for some of its reasoning evaluations.</p>
<p>Instead of forcing the model to produce an immediate answer, chain-of-thought prompting encourages GPT-4 to reason step by step before reaching a conclusion.</p>
<p>For example, the report's GSM8K result (GSM8K is a dataset of grade-school mathematics problems) was obtained with 5-shot chain-of-thought prompting, where the model writes out intermediate reasoning steps before giving its final answer.</p>
<p>This became another major shift in how people interacted with large language models. Earlier systems were often treated like direct answer generators. GPT-4 demonstrated that prompting the model to “think through” a problem could significantly improve performance on reasoning-heavy tasks.</p>
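<p>Here is a small sketch of what a chain-of-thought prompt can look like in practice. The helper name and the example problems are made up for illustration (they aren't GSM8K items), and no model is called: the code only assembles the text you would send.</p>

```python
def build_cot_prompt(worked_examples, question):
    """Assemble a few-shot prompt whose examples show their reasoning."""
    parts = []
    for example in worked_examples:
        parts.append(
            f"Q: {example['question']}\n"
            f"A: {example['reasoning']} The answer is {example['answer']}."
        )
    # Leave the final answer open so the model writes its own reasoning first.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


examples = [
    {
        "question": "A box holds 4 pens. How many pens are in 3 boxes?",
        "reasoning": "Each box holds 4 pens, so 3 boxes hold 3 * 4 = 12 pens.",
        "answer": "12",
    }
]
prompt = build_cot_prompt(
    examples, "A shelf holds 6 books. How many books are on 5 shelves?"
)
print(prompt)
```

<p>Because each worked example spells out its reasoning before the answer, the model is nudged to do the same for the final question instead of jumping straight to a number.</p>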
<p>Compared to GPT-3.5, GPT-4 consistently shows stronger reasoning across many domains:</p>
<ul>
<li><p>coding</p>
</li>
<li><p>mathematics</p>
</li>
<li><p>structured problem solving</p>
</li>
<li><p>commonsense reasoning</p>
</li>
<li><p>academic evaluations</p>
</li>
</ul>
<p>Of course, the model is still far from perfect.</p>
<p>The report repeatedly notes that GPT-4 can still hallucinate, make logical mistakes, fail at complex reasoning chains, or confidently produce incorrect solutions.</p>
<p>But historically, this section of the report matters because it helped establish a new category of AI applications: large language models as interactive reasoning and coding assistants.</p>
<p>That idea quickly became one of the defining use cases of modern AI systems.</p>
<h2 id="heading-multilingual-capabilities"><strong>Multilingual Capabilities</strong></h2>
<p>One of the more underrated aspects of the GPT-4 Technical Report is how strongly the model performs across multiple languages.</p>
<p>Earlier language models were often heavily English-centric. Even when multilingual support existed, performance in lower-resource languages usually dropped significantly compared to English benchmarks.</p>
<p>GPT-4 shows noticeable progress in this area.</p>
<p>To evaluate multilingual reasoning ability, OpenAI translated the MMLU benchmark – a broad academic and professional reasoning benchmark covering 57 subjects – into a variety of languages using Azure Translate.</p>
<p>According to the report, GPT-4 performs extremely well across most tested languages and even surpasses the English-language performance of earlier models in many cases.</p>
<p>What makes this especially important is that the improvements are not limited to high-resource languages like French, German, or Spanish.</p>
<p>The report specifically highlights strong performance gains in lower-resource languages such as:</p>
<ul>
<li><p>Latvian</p>
</li>
<li><p>Welsh</p>
</li>
<li><p>Swahili</p>
</li>
<li><p>Bengali</p>
</li>
<li><p>Nepali</p>
</li>
<li><p>Marathi</p>
</li>
<li><p>Telugu</p>
</li>
</ul>
<p>This suggests something important about large-scale language modeling: as models scale and training data becomes more diverse, the learned capabilities start generalizing beyond English in a much more robust way.</p>
<p>In other words, the scaling effects observed in GPT-3 were not purely English-language phenomena.</p>
<p>GPT-4 demonstrates that many reasoning and language understanding capabilities can transfer across languages, even when available training data is far more limited.</p>
<p>This is historically significant because it moves large language models closer to becoming globally useful systems rather than tools optimized mainly for English-speaking users.</p>
<p>The multilingual results also reinforce another major theme throughout the report: GPT-4 is not narrowly specialized for a single domain or benchmark. Instead, it behaves increasingly like a general-purpose reasoning system capable of adapting across:</p>
<ul>
<li><p>languages</p>
</li>
<li><p>tasks</p>
</li>
<li><p>modalities</p>
</li>
<li><p>domains</p>
</li>
<li><p>and interaction styles</p>
</li>
</ul>
<p>Of course, multilingual performance is still uneven.</p>
<p>The report doesn't claim perfect fluency or equal reasoning quality across all languages. Lower-resource languages still present major challenges, and evaluation itself remains difficult in many multilingual settings.</p>
<p>But compared to earlier GPT systems, GPT-4 demonstrates a substantial step forward in multilingual generalization. And that became an important milestone for globally deployed AI systems.</p>
<h2 id="heading-emergent-behavior"><strong>Emergent Behavior</strong></h2>
<p>One of the most fascinating ideas surrounding GPT-4 is the concept of <em>emergent behavior</em>.</p>
<p>In the context of large language models, emergence refers to abilities that appear unexpectedly as models become larger and more capable. Instead of improving smoothly in every area, some skills seem to “switch on” once the model reaches a certain scale.</p>
<p>GPT-3 already hinted at this phenomenon through few-shot learning and in-context adaptation. GPT-4 continues that trend much more strongly.</p>
<p>According to the report, many capabilities improve nonlinearly as scale increases.</p>
<p>In simpler terms, doubling the size or compute of a model doesn't just make it slightly better at the same tasks. Sometimes, entirely new behaviors emerge that were weak or mostly absent in smaller systems.</p>
<p>This becomes especially visible in reasoning tasks.</p>
<p>GPT-4 demonstrates major improvements over GPT-3.5 in coding, mathematical reasoning, academic evaluations, instruction following, and structured problem solving.</p>
<p>The report also highlights how prompting strategies become more effective at larger scales.</p>
<p>Few-shot prompting (where the model learns from examples inside the prompt) works far more reliably in GPT-4 than in earlier systems. Similarly, chain-of-thought prompting becomes significantly more useful for reasoning-heavy tasks.</p>
<p>Instead of immediately generating an answer, GPT-4 can often improve performance by reasoning step by step through a problem.</p>
<p>What makes this important is that these abilities weren't explicitly programmed into the system. The model was still trained primarily through next-token prediction. Yet at sufficient scale, behaviors like:</p>
<ul>
<li><p>multi-step reasoning</p>
</li>
<li><p>code synthesis</p>
</li>
<li><p>contextual adaptation</p>
</li>
<li><p>multilingual generalization</p>
</li>
<li><p>instruction following</p>
</li>
<li><p>and visual-text reasoning</p>
</li>
</ul>
<p>began appearing much more robustly.</p>
<p>The report’s discussion of predictable scaling also connects directly to this idea. OpenAI explains that aspects of GPT-4’s performance, such as its final training loss and its pass rate on a subset of HumanEval, could be predicted from much smaller training runs using scaling laws.</p>
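<p>To see how this kind of extrapolation works, the sketch below fits a power law with an irreducible loss term, the form the report gives as L(C) = aC^b + c, to a handful of small runs and then predicts the loss of a much larger one. All numbers are synthetic, and the irreducible term is assumed to be known so that the fit stays a simple straight line in log-log space. That's a simplification for illustration, not the procedure OpenAI used.</p>

```python
import math


def fit_power_law(compute, loss, irreducible):
    """Fit loss = a * compute**b + irreducible by least squares in log-log space.

    The irreducible term is assumed known here, which keeps the fit linear:
    log(loss - irreducible) = log(a) + b * log(compute).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(value - irreducible) for value in loss]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - slope * mean_x)
    return a, slope


# Synthetic "small run" measurements, invented for illustration only.
IRREDUCIBLE = 1.5
small_compute = [1e0, 1e1, 1e2, 1e3, 1e4]  # arbitrary compute units
small_loss = [5.0 * c ** -0.1 + IRREDUCIBLE for c in small_compute]

a_fit, b_fit = fit_power_law(small_compute, small_loss, IRREDUCIBLE)

# Extrapolate to a run with 10,000x the compute of the largest small run.
predicted = a_fit * (1e8 ** b_fit) + IRREDUCIBLE
print(f"a={a_fit:.3f}, b={b_fit:.3f}, predicted loss={predicted:.3f}")
```

<p>Real training curves are noisy and the irreducible term has to be estimated too, so a real fit is harder than this. The principle is the same, though: measure small runs, fit the curve, and read off the prediction for the large run before training it.</p>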
<p>At the same time, some behaviors remain difficult to predict cleanly. The paper points to Hindsight Neglect, a task from the Inverse Scaling Prize on which accuracy had been falling as models grew, and shows GPT-4 reversing that trend.</p>
<p>Historically, GPT-4 reinforces one of the biggest lessons from the GPT series: large language models don't simply become more fluent as they scale. They begin exhibiting qualitatively different behaviors.</p>
<p>That realization fundamentally changed AI research. Instead of treating language models as narrow NLP systems, researchers increasingly started viewing them as general-purpose learning systems whose capabilities could continue emerging with scale, alignment, and better training methods.</p>
<h2 id="heading-limitations"><strong>Limitations</strong></h2>
<p>Despite the impressive benchmark results and multimodal capabilities, the GPT-4 Technical Report is surprisingly direct about the model’s weaknesses.</p>
<p>The paper repeatedly emphasizes that GPT-4 is still not fully reliable.</p>
<p>One of the biggest problems is still <em>hallucination</em>.</p>
<p>Like earlier GPT systems, GPT-4 can confidently generate information that's incorrect, fabricated, or misleading. The model may produce answers that sound highly convincing even when the underlying facts are wrong.</p>
<p>This becomes especially dangerous because GPT-4 is often more fluent and persuasive than previous models. In practice, stronger language generation can sometimes make mistakes harder for users to notice.</p>
<p>The report also discusses <em>reasoning failures</em>.</p>
<p>Although GPT-4 performs much better than GPT-3.5 across many benchmarks, it can still fail at relatively simple logical tasks, make arithmetic mistakes, or break down during longer reasoning chains.</p>
<p>Another important limitation is <em>overconfidence</em>.</p>
<p>GPT-4 doesn't naturally “know when it does not know.” The model can present uncertain or incorrect answers with a high degree of confidence, which creates risks in high-stakes situations like medicine, law, education, or cybersecurity.</p>
<p>The report also notes that GPT-4 has a knowledge cutoff. Most of the model’s training data ends around September 2021, meaning the system lacks reliable awareness of many events that happened afterward.</p>
<p>One particularly interesting section discusses <em>calibration</em>.</p>
<p>According to the report, the pretrained GPT-4 model was highly calibrated, meaning its predicted confidence in an answer generally matched the probability of that answer being correct. But the post-training process reduced this calibration.</p>
<p>This reveals an important tradeoff: making models more helpful and aligned doesn't automatically make them more truthful or better calibrated.</p>
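<p>One common way to put a number on calibration is the expected calibration error (ECE): group predictions by stated confidence and measure how far each group's average confidence sits from its actual accuracy. The sketch below is a minimal version with invented predictions. The function name and bin count are assumptions chosen for the example.</p>

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Average gap between stated confidence and accuracy, weighted by bin size."""
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Map confidence in [0, 1] to a bin index; 1.0 goes in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Invented predictions: the model says 90% but is right only half the time.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, True, False]
)
# Here the stated confidence matches the observed hit rate.
calibrated = expected_calibration_error(
    [0.5, 0.5, 0.5, 0.5], [True, False, True, False]
)
print(overconfident, calibrated)
```

<p>A lower value means confidence tracks correctness more closely. The overconfident case above is the pattern that makes fluent but wrong answers risky in high-stakes settings.</p>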
<p>The paper is also honest about <em>bias</em> and <em>unsafe behavior</em>.</p>
<p>Because GPT-4 learns from large internet-scale datasets, it can still reflect social biases, stereotypes, and problematic patterns present in training data.</p>
<p>OpenAI discusses extensive efforts to reduce harmful outputs, but the report explicitly acknowledges that unsafe behavior is still possible.</p>
<p>One example is <em>jailbreaking</em>: attempts to bypass safety mechanisms using adversarial prompts or clever conversational manipulation. According to the report, GPT-4’s safety systems reduce harmful behavior significantly, but determined users can still sometimes elicit dangerous or policy-violating outputs.</p>
<p>The paper also emphasizes that GPT-4 should not be blindly trusted in high-risk environments without additional safeguards, human oversight, or verification systems.</p>
<p>That honesty is one reason the report remains important: instead of presenting GPT-4 as a solved form of intelligence, OpenAI frames it as a powerful but imperfect system whose growing capabilities also create growing risks.</p>
<p>Historically, this reflects a major shift in AI research culture.</p>
<p>Earlier papers focused mostly on increasing performance. GPT-4 places equal emphasis on capability <em>and</em> failure modes, because once models become widely deployed, understanding limitations becomes just as important as demonstrating strengths.</p>
<h2 id="heading-safety-and-risks"><strong>Safety and Risks</strong></h2>
<p>One of the clearest signs that the AI field had changed by the time GPT-4 was released is how much of the report is dedicated to safety, risk analysis, and deployment concerns.</p>
<p>Earlier GPT papers focused primarily on capability improvements, scaling behavior, and benchmark performance. The GPT-4 Technical Report still discusses those topics, but safety becomes a central engineering theme rather than a secondary discussion.</p>
<p>According to the report, OpenAI conducted extensive <em>red teaming</em> and adversarial testing before deployment.</p>
<p>Red teaming involves intentionally trying to break the system, bypass safeguards, trigger unsafe outputs, or expose dangerous behaviors. OpenAI worked with external domain experts to evaluate risks across areas like cybersecurity, misinformation, chemistry, and biological threats.</p>
<p>This type of testing reflects a major shift in mindset.</p>
<p>The question was no longer simply “Can the model do impressive things?” but also “What happens if capable systems are misused at global scale?”</p>
<p>The report repeatedly discusses concerns around <em>dangerous instruction generation</em>.</p>
<p>During internal evaluations, earlier GPT-4 versions were sometimes capable of generating unsafe or harmful information related to dangerous materials, offensive content, or exploitative behavior. OpenAI used RLHF, safety fine-tuning, rule-based reward models, and policy systems to reduce these risks significantly before public deployment.</p>
<p>Cybersecurity concerns also receive substantial attention. The report discusses risks involving:</p>
<ul>
<li><p>phishing assistance</p>
</li>
<li><p>malware-related guidance</p>
</li>
<li><p>social engineering</p>
</li>
<li><p>exploit generation</p>
</li>
<li><p>automation of cyber abuse workflows</p>
</li>
</ul>
<p>Although GPT-4 isn't presented as an autonomous hacking system, OpenAI clearly recognizes that increasingly capable language models could amplify existing cybersecurity threats if deployed irresponsibly.</p>
<p>Another especially important topic is <em>biosecurity</em>.</p>
<p>The report explains that domain experts evaluated whether GPT-4 could meaningfully assist users with harmful biological or chemical knowledge. OpenAI specifically investigated whether the model could help lower the barrier for dangerous misuse.</p>
<p>This was one of the first times a major AI paper openly treated advanced language models as potential dual-use technologies with real-world security implications.</p>
<p>The report also emphasizes <em>deployment monitoring</em> and iterative safety improvement.</p>
<p>Rather than treating safety as something solved before release, OpenAI frames deployment itself as part of the learning process. Monitoring user interactions, identifying failure modes, updating safeguards, and improving refusal systems became ongoing operational responsibilities rather than one-time research tasks.</p>
<p>Historically, this section may be one of the most important parts of the entire report.</p>
<p>GPT-4 marks the moment when AI safety stopped being a niche research discussion and became a core component of flagship frontier model development.</p>
<p>That shift reflects a deeper realization across the industry: once AI systems become powerful enough for large-scale deployment, increasing capability and managing risk become inseparable engineering problems.</p>
<h2 id="heading-discussion"><strong>Discussion</strong></h2>
<p>Looking back at the GPT series, GPT-4 feels less like the release of a single research model and more like the beginning of a new computing platform.</p>
<p>GPT-1 introduced the idea of large-scale language pretraining. GPT-2 demonstrated zero-shot multitask behavior. GPT-3 showed that models could adapt through prompting and in-context learning.</p>
<p>But GPT-4 changes the conversation again.</p>
<p>According to the technical report, the focus is no longer only on making models larger or improving benchmark scores. The report repeatedly emphasizes reliability, deployment, alignment, infrastructure, multimodal interaction, and safety engineering.</p>
<p>That shift is historically important.</p>
<p>Earlier GPT papers felt like research milestones published mainly for the machine learning community. GPT-4 feels like infrastructure designed for real-world deployment at global scale.</p>
<p>This becomes especially clear through systems like ChatGPT.</p>
<p>GPT-4 was not simply released as a downloadable research artifact or benchmark model. Instead, it became part of an entire AI product ecosystem:</p>
<ul>
<li><p>conversational assistants</p>
</li>
<li><p>coding copilots</p>
</li>
<li><p>enterprise APIs</p>
</li>
<li><p>productivity tools</p>
</li>
<li><p>educational systems</p>
</li>
<li><p>multimodal interfaces</p>
</li>
</ul>
<p>In practice, GPT-4 helped transform large language models from isolated research demos into continuously deployed software platforms.</p>
<p>Another major change is the increasing secrecy surrounding frontier AI systems.</p>
<p>Unlike the GPT-2 and GPT-3 papers, the GPT-4 report intentionally omits many technical details, including model size, architecture specifics, hardware, training compute, and dataset construction.</p>
<p>OpenAI explains this partly through safety concerns and the competitive landscape, but the broader implication is significant: frontier AI models were becoming strategically valuable technologies rather than purely academic research projects.</p>
<p>This marks the beginning of a much more closed era in large-scale AI development.</p>
<p>The report also shows why <em>alignment</em> became such a central concern.</p>
<p>As language models became more capable, the risks associated with hallucinations, harmful outputs, cybersecurity misuse, misinformation, and unsafe reasoning also increased. GPT-4 treats alignment not as an optional improvement layer, but as a core engineering requirement.</p>
<p>This is another major transition in the history of AI systems.</p>
<p>Earlier models were evaluated mostly on capability:</p>
<ul>
<li><p>accuracy</p>
</li>
<li><p>perplexity</p>
</li>
<li><p>benchmark scores</p>
</li>
<li><p>scaling behavior</p>
</li>
</ul>
<p>GPT-4 expands the discussion toward:</p>
<ul>
<li><p>safety</p>
</li>
<li><p>deployment monitoring</p>
</li>
<li><p>refusal behavior</p>
</li>
<li><p>policy enforcement</p>
</li>
<li><p>human oversight</p>
</li>
<li><p>operational reliability</p>
</li>
</ul>
<p>The model is no longer judged only by what it <em>can</em> do, but also by how safely and consistently it behaves in real-world environments.</p>
<p>In many ways, GPT-4 also represents the rise of the modern <em>foundation model ecosystem</em>.</p>
<p>Instead of training separate systems for every individual task, one large aligned model can serve as a shared base for many applications:</p>
<ul>
<li><p>coding</p>
</li>
<li><p>tutoring</p>
</li>
<li><p>search</p>
</li>
<li><p>writing</p>
</li>
<li><p>research assistance</p>
</li>
<li><p>customer support</p>
</li>
<li><p>multimodal interaction</p>
</li>
<li><p>enterprise workflows</p>
</li>
</ul>
<p>That idea fundamentally changed the software industry.</p>
<p>Historically, GPT-4 may ultimately be remembered less for a single benchmark result and more for what it represented: the moment large language models became practical, continuously deployed, general-purpose AI infrastructure.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The GPT-4 Technical Report marks one of the most important turning points in the history of modern AI systems.</p>
<p>According to the report, GPT-4 is not simply a larger language model. It's a multimodal, aligned foundation model designed for real-world deployment at global scale.</p>
<p>The model combines several major ideas that evolved throughout the GPT series:</p>
<ul>
<li><p>large-scale Transformer pretraining</p>
</li>
<li><p>autoregressive next-token prediction</p>
</li>
<li><p>scaling laws</p>
</li>
<li><p>few-shot prompting</p>
</li>
<li><p>multimodal reasoning</p>
</li>
<li><p>reinforcement learning from human feedback</p>
</li>
<li><p>safety-focused post-training</p>
</li>
</ul>
<p>Together, these components produce a system that feels qualitatively different from earlier GPT models.</p>
<p>GPT-4 demonstrates that scaling alone is no longer the entire story.</p>
<p>GPT-3 showed that larger models could develop powerful emergent abilities through scale. GPT-4 shows that alignment, safety engineering, post-training refinement, and deployment infrastructure have become equally important parts of building useful AI systems.</p>
<p>This combination of scale and alignment ultimately became the dominant paradigm behind modern frontier AI development.</p>
<p>The report also reflects a broader transition happening across the industry.</p>
<p>Large language models were no longer being treated as isolated research experiments or benchmark systems. GPT-4 pushed AI toward real-world deployment through products, APIs, multimodal assistants, coding systems, enterprise tools, and globally accessible conversational interfaces like ChatGPT.</p>
<p>Historically, GPT-4 represents the moment when foundation models became practical infrastructure for everyday computing.</p>
<p>And that shift continues shaping the direction of modern AI today.</p>
<h2 id="heading-final-insight"><strong>Final Insight</strong></h2>
<p>Looking across the entire GPT series, the progression becomes remarkably clear.</p>
<p>GPT-1 introduced the idea that large-scale pretraining could produce transferable language representations. Instead of training separate NLP systems from scratch for every task, models could first learn general language patterns and then adapt through fine-tuning.</p>
<p>GPT-2 pushed this idea further by showing that sufficiently large language models could perform tasks in a zero-shot setting without explicit supervised training. The model was no longer just memorizing tasks – it was beginning to generalize from language itself.</p>
<p>GPT-3 changed the paradigm again. Few-shot prompting and in-context learning showed that models could adapt dynamically during inference simply from examples written inside the prompt. This transformed prompting into a new interface for interacting with AI systems.</p>
<p>Then GPT-4 expanded the idea into something much larger. The focus was no longer only on scaling models or improving benchmarks. GPT-4 introduced the era of aligned multimodal foundation models: systems designed not just to generate language, but to operate safely, follow instructions, reason across modalities, and function as deployable infrastructure for real-world applications.</p>
<p>Historically, that may be the most important shift of all.</p>
<p>GPT-4 was not simply a larger language model.</p>
<p>It marked the transition from experimental large language models to globally deployed AI assistants integrated into everyday computing, software development, education, productivity tools, and multimodal human-computer interaction.</p>
<p>And in many ways, we're still only at the beginning of that transition.</p>
<h2 id="heading-gpt-1-vs-gpt-2-vs-gpt-3-vs-gpt-4-key-differences">GPT-1 vs GPT-2 vs GPT-3 vs GPT-4: Key Differences</h2>
<p>A simple way to see how the GPT series evolved is by looking at what each generation introduced.</p>
<p>GPT-1 introduced modern pretraining, GPT-2 showed that large language models could perform tasks through zero-shot prompting, GPT-3 pushed few-shot prompting and in-context learning into the mainstream, and GPT-4 expanded the idea further through alignment, multimodal reasoning, and real-world deployment.</p>
<p>The comparison below shows how the focus gradually shifted from task-specific NLP models to general-purpose AI systems capable of conversation, coding, reasoning, and multimodal understanding.</p>
<table>
<thead>
<tr>
<th>Aspect</th>
<th>GPT-1</th>
<th>GPT-2</th>
<th>GPT-3</th>
<th>GPT-4</th>
</tr>
</thead>
<tbody><tr>
<td>Core Idea</td>
<td>Pre-training followed by fine-tuning</td>
<td>Pre-training alone enables zero-shot behavior</td>
<td>Large-scale pre-training enables few-shot and in-context learning</td>
<td>Aligned multimodal foundation model for general-purpose deployment</td>
</tr>
<tr>
<td>Training Approach</td>
<td>Two-stage pipeline: pretrain then fine-tune</td>
<td>Single-stage language modeling</td>
<td>Same language modeling approach, but massively scaled</td>
<td>Large-scale pretraining combined with RLHF, safety tuning, and multimodal post-training</td>
</tr>
<tr>
<td>Supervision</td>
<td>Requires labeled data for downstream tasks</td>
<td>Can perform tasks without supervised fine-tuning</td>
<td>Can adapt from prompts and examples without retraining</td>
<td>Uses alignment training and RLHF to improve instruction following and safety</td>
</tr>
<tr>
<td>Task Handling</td>
<td>Separate fine-tuning for each task</td>
<td>Tasks handled mainly through zero-shot prompts</td>
<td>Tasks handled through zero-shot, one-shot, and few-shot prompting</td>
<td>Tasks handled through conversational prompting, multimodal interaction, and aligned responses</td>
</tr>
<tr>
<td>Learning Style</td>
<td>Learns representations, then specializes</td>
<td>Learns general language patterns</td>
<td>Learns to infer tasks directly from context</td>
<td>Learns contextual reasoning, multimodal understanding, and aligned interaction behavior</td>
</tr>
<tr>
<td>Generalization</td>
<td>Limited outside fine-tuned tasks</td>
<td>Stronger cross-task generalization</td>
<td>Much stronger contextual adaptation and in-context learning</td>
<td>Broad multimodal generalization across language, vision, coding, and reasoning tasks</td>
</tr>
<tr>
<td>Prompt Usage</td>
<td>Minimal importance</td>
<td>Prompts become useful</td>
<td>Prompts become central to system behavior</td>
<td>Prompting becomes the main interaction interface for AI systems</td>
</tr>
<tr>
<td>Inference Behavior</td>
<td>Mostly static after training</td>
<td>Can generalize during inference</td>
<td>Can adapt dynamically during inference</td>
<td>Can reason interactively across text and images with aligned conversational behavior</td>
</tr>
<tr>
<td>Architecture</td>
<td>Decoder-only Transformer</td>
<td>Decoder-only Transformer</td>
<td>Decoder-only Transformer, scaled up, with alternating dense and sparse attention layers</td>
<td>Transformer-based multimodal autoregressive model</td>
</tr>
<tr>
<td>Model Size</td>
<td>~117M parameters</td>
<td>Up to 1.5B parameters</td>
<td>Up to 175B parameters</td>
<td>Undisclosed by OpenAI</td>
</tr>
<tr>
<td>Context Window</td>
<td>512 tokens</td>
<td>Up to 1024 tokens</td>
<td>2048-token context window</td>
<td>8,192 tokens at launch (a 32,768-token variant was also offered), with image inputs</td>
</tr>
<tr>
<td>Training Data</td>
<td>BooksCorpus for pre-training, plus labeled task datasets for fine-tuning</td>
<td>WebText internet dataset</td>
<td>Massive multi-source dataset including Common Crawl, WebText, Books, and Wikipedia</td>
<td>Large-scale multimodal and internet-scale datasets (details undisclosed)</td>
</tr>
<tr>
<td>Key Capability</td>
<td>Transfer learning</td>
<td>Zero-shot learning</td>
<td>Few-shot and in-context learning</td>
<td>Multimodal reasoning and aligned AI assistance</td>
</tr>
<tr>
<td>Performance Style</td>
<td>Strong after fine-tuning</td>
<td>Strong without task-specific training</td>
<td>Often competitive with fine-tuned systems using prompts alone</td>
<td>Often surpasses previous state-of-the-art systems across many benchmarks</td>
</tr>
<tr>
<td>Scaling Importance</td>
<td>Moderate</td>
<td>Important</td>
<td>Central research strategy of the paper</td>
<td>Scaling combined with alignment becomes the dominant paradigm</td>
</tr>
<tr>
<td>Main Limitation</td>
<td>Requires labeled datasets and retraining</td>
<td>Weak reasoning and inconsistent zero-shot behavior</td>
<td>Extremely expensive compute requirements and persistent reasoning limitations</td>
<td>Hallucinations, alignment tradeoffs, safety risks, and lack of transparency</td>
</tr>
<tr>
<td>Main Contribution</td>
<td>Introduced modern NLP pre-training paradigm</td>
<td>Demonstrated multitask zero-shot behavior</td>
<td>Demonstrated emergent in-context learning at scale</td>
<td>Introduced aligned multimodal foundation models for real-world deployment</td>
</tr>
<tr>
<td>Historical Impact</td>
<td>Foundation of modern Transformer NLP</td>
<td>Shift toward general-purpose language models</td>
<td>Foundation for prompt-driven AI systems and modern LLM applications</td>
<td>Transition from experimental LLMs to globally deployed AI assistants</td>
</tr>
<tr>
<td>What Changed in the Field</td>
<td>Pre-training became standard</td>
<td>Prompting became viable</td>
<td>Prompting became the primary interface for AI systems</td>
<td>AI systems became deployable multimodal infrastructure platforms</td>
</tr>
<tr>
<td>Legacy</td>
<td>Inspired modern transfer learning pipelines</td>
<td>Inspired large-scale generative models</td>
<td>Directly influenced ChatGPT, instruction tuning, and foundation models</td>
<td>Defined the modern era of aligned multimodal AI ecosystems</td>
</tr>
</tbody></table>
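<p>The zero-shot, one-shot, and few-shot entries in the table differ only in how many worked examples the prompt carries. Here's a minimal sketch of that idea in plain Python. The <code>build_prompt</code> helper and its "English:/French:" layout are hypothetical choices made for illustration, not a format prescribed by any of the papers.</p>
<pre><code class="language-python"># Hypothetical helper: builds a prompt with k worked examples before the query.
def build_prompt(instruction, examples, query):
    lines = [instruction]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
    lines.append(f"English: {query}")
    lines.append("French:")  # the model is expected to continue from here
    return "\n".join(lines)

demos = [("cheese", "fromage"), ("sea otter", "loutre de mer")]
task = "Translate English to French."

zero_shot = build_prompt(task, [], "peppermint")        # no examples
one_shot = build_prompt(task, demos[:1], "peppermint")  # one example
few_shot = build_prompt(task, demos, "peppermint")      # several examples

print(few_shot)
</code></pre>
<p>No weights change between the three cases. The only thing that varies is the text placed in front of the query, which is why prompting can act as an interface.</p>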
<h2 id="heading-pytorch-implementations-of-the-gpt-architecture-evolution">PyTorch Implementations of the GPT Architecture Evolution</h2>
<h3 id="heading-gpt-1-pre-training-fine-tuning-architecture">GPT-1: Pre-training + Fine-Tuning Architecture</h3>
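<pre><code class="language-python">import torch
import torch.nn as nn

# TransformerBlock is assumed to be defined elsewhere
# (masked self-attention followed by a feed-forward network).
class GPT1(nn.Module):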
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(512, d_model)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model)
            for _ in range(n_layers)
        ])

        self.ln_f = nn.LayerNorm(d_model)

        # Language modeling head
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        # Create position indices on the same device as the input tokens
        positions = torch.arange(input_ids.size(1), device=input_ids.device)

        x = (
            self.token_embedding(input_ids)
            + self.position_embedding(positions)
        )

        for block in self.transformer_blocks:
            x = block(x)

        x = self.ln_f(x)

        logits = self.lm_head(x)

        return logits
</code></pre>
<p><code>GPT1</code> inherits from <code>nn.Module</code>, which is the base class used to build neural networks in PyTorch. The constructor (<code>__init__</code>) defines all trainable layers used by the model.</p>
<p><code>nn.Embedding(vocab_size, d_model)</code> creates a learnable lookup table that converts token IDs into dense vectors. Each token in the vocabulary is mapped to a vector of size <code>d_model</code>.</p>
<p>The positional embedding layer adds information about token order. Since Transformers process tokens in parallel, they need explicit positional information to understand sequence structure.</p>
<p><code>nn.ModuleList([...])</code> stores multiple <code>TransformerBlock</code> modules while ensuring PyTorch properly tracks their parameters during training. Each <code>TransformerBlock</code> typically contains masked self-attention and a feed-forward network.</p>
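<p>Since the block itself isn't defined in this article, here's a rough sketch of what "masked" self-attention means, written in plain Python so it runs without PyTorch. It's a deliberately stripped-down assumption: a single head with no learned projections, where each token vector serves as its own query, key, and value.</p>
<pre><code class="language-python">import math

def causal_self_attention(x):
    """Single-head attention with queries = keys = values = x (no learned weights)."""
    d = len(x[0])
    out = []
    for i, query in enumerate(x):
        # Causal mask: position i may only look at positions 0..i.
        visible = x[: i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in visible
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the visible value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = causal_self_attention(tokens)
print(mixed[0])  # the first position can only attend to itself
</code></pre>
<p>Because of the mask, changing a later token never alters the outputs at earlier positions. That property is what lets the model be trained on next-token prediction at every position in parallel.</p>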
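<p><code>nn.LayerNorm(d_model)</code> applies layer normalization before the output projection. This helps stabilize training and improves gradient flow in deep Transformer architectures. Note that this final normalization is a simplification: the original GPT-1 normalized after each sub-layer inside the block (post-LN) and had no separate final LayerNorm, which GPT-2 later introduced.</p>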
<p>The language modeling head (<code>nn.Linear</code>) projects the hidden representations back into vocabulary space. The output size equals <code>vocab_size</code>, producing prediction scores for every possible next token.</p>
<p>Inside the <code>forward()</code> method, <code>input_ids.size(1)</code> retrieves the sequence length, and <code>torch.arange(...)</code> generates positional indices for each token position.</p>
<p>The token embeddings and positional embeddings are added together to produce the initial Transformer input representation.</p>
<p>The model then passes the representation through each Transformer block sequentially:</p>
<pre><code class="language-python">for block in self.transformer_blocks:
    x = block(x)
</code></pre>
<p>This iterative stacking is what allows GPT models to learn increasingly abstract contextual representations.</p>
<p>After normalization, the final hidden states are passed into <code>lm_head</code>, producing <code>logits</code>. These logits are unnormalized prediction scores used to compute probabilities for next-token generation.</p>
<p>The model finally returns the logits tensor, which is typically passed through <code>softmax</code> during inference or used directly with <code>CrossEntropyLoss</code> during training.</p>
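<p>To make the last step concrete, here's a small plain-Python sketch of what happens to the logits at a single sequence position. The four-token vocabulary and the logit values are made up for illustration. In PyTorch, <code>CrossEntropyLoss</code> takes the raw logits and applies the log-softmax internally, so you wouldn't call <code>softmax</code> yourself before the loss.</p>
<pre><code class="language-python">import math

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # Negative log-probability assigned to the correct next token.
    return -math.log(softmax(logits)[target_index])

# Toy logits over a 4-token vocabulary for one sequence position.
logits = [2.0, 0.5, -1.0, 0.0]

probs = softmax(logits)                                     # inference: probabilities
next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy decoding
loss = cross_entropy(logits, target_index=0)                # training: loss

print(probs, next_token, loss)
</code></pre>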
<h3 id="heading-gpt-2-zero-shot-multitask-architecture">GPT-2: Zero-Shot Multitask Architecture</h3>
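<pre><code class="language-python">import torch
import torch.nn as nn

# TransformerBlock (with a pre_layer_norm option) is assumed to be defined elsewhere.
class GPT2(nn.Module):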
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(1024, d_model)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(
                d_model=d_model,
                pre_layer_norm=True
            )
            for _ in range(n_layers)
        ])

        self.final_layer_norm = nn.LayerNorm(d_model)

        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids):
        # Create position indices on the same device as the input tokens
        positions = torch.arange(input_ids.size(1), device=input_ids.device)

        x = (
            self.token_embedding(input_ids)
            + self.position_embedding(positions)
        )

        for block in self.transformer_blocks:
            x = block(x)

        x = self.final_layer_norm(x)

        logits = self.lm_head(x)

        return logits
</code></pre>
<p>Like GPT-1, the model begins with token embeddings and positional embeddings. <code>nn.Embedding</code> converts token IDs into dense vectors, while positional embeddings provide information about token order in the sequence.</p>
<p>One noticeable difference is the larger positional embedding table (<code>1024</code> positions instead of <code>512</code>), which doubles the maximum context length GPT-2 can process.</p>
<p>The Transformer layers are stored using <code>nn.ModuleList</code>, but each <code>TransformerBlock</code> now uses:</p>
<pre><code class="language-python">pre_layer_norm=True
</code></pre>
<p>This means layer normalization is applied before attention and feed-forward operations rather than after them. This “Pre-LN” design significantly improves gradient flow and training stability in deeper Transformer models.</p>
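<p>The difference between the two orderings is easier to see side by side. The sketch below uses plain Python lists and a hypothetical <code>sublayer</code> function standing in for attention or the feed-forward network. The <code>layer_norm</code> here has no learned scale or shift, which is a simplification.</p>
<pre><code class="language-python">import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance (no learned scale/shift).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward network.
    return [2.0 * v for v in x]

def post_ln_step(x):
    # GPT-1 style: add the residual, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_step(x):
    # GPT-2 style: normalize the sublayer input; the residual path stays untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
print(post_ln_step(x))
print(pre_ln_step(x))
</code></pre>
<p>In the Pre-LN version, the input <code>x</code> reaches the output through a plain addition, so gradients have an unnormalized path straight through every block. That's the usual explanation for why deep Pre-LN stacks are easier to train.</p>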
<p>The forward pass follows the same overall pipeline:</p>
<ol>
<li><p>Generate positional indices with <code>torch.arange()</code></p>
</li>
<li><p>Add token and positional embeddings</p>
</li>
<li><p>Pass representations through stacked Transformer blocks</p>
</li>
<li><p>Apply final normalization</p>
</li>
<li><p>Project outputs into vocabulary space</p>
</li>
</ol>
<p>The sequential block processing happens here:</p>
<pre><code class="language-python">for block in self.transformer_blocks:
    x = block(x)
</code></pre>
<p>GPT-2 also introduces a small optimization in the output layer:</p>
<pre><code class="language-python">self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
</code></pre>
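<p>The bias term is removed because it provides little benefit in large language modeling setups and slightly reduces parameter count. In practice, GPT-2 implementations usually go one step further and share this matrix with the token embedding (weight tying), which this sketch doesn't show.</p>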
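<p>To get a feel for how small that saving is, here's a quick back-of-the-envelope calculation. It assumes the sizes reported for the largest GPT-2 model (<code>d_model=1600</code> and a vocabulary of 50,257 tokens).</p>
<pre><code class="language-python"># Largest GPT-2 configuration: d_model=1600, vocabulary of 50,257 tokens.
d_model, vocab_size = 1600, 50257

weight_params = d_model * vocab_size   # the projection matrix
bias_params = vocab_size               # one bias per vocabulary entry

print(weight_params)  # 80411200
print(bias_params)    # 50257
print(f"{bias_params / (weight_params + bias_params):.4%}")
</code></pre>
<p>The bias would account for well under a tenth of a percent of the output layer, so dropping it is more of a tidy-up than a meaningful reduction.</p>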
<p>Finally, the model returns <code>logits</code>, which contain prediction scores for every token in the vocabulary at each sequence position.</p>
<h3 id="heading-gpt-3-few-shot-in-context-learning-architecture">GPT-3: Few-Shot / In-Context Learning Architecture</h3>
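<pre><code class="language-python">import torch
import torch.nn as nn

# TransformerBlock (with n_heads, pre_layer_norm, and sparse_attention options)
# is assumed to be defined elsewhere.
class GPT3(nn.Module):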
    def __init__(
        self,
        vocab_size=50257,
        d_model=12288,
        n_layers=96,
        n_heads=96,
        context_length=2048
    ):
        super().__init__()

        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(context_length, d_model)

        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(
                d_model=d_model,
                n_heads=n_heads,
                pre_layer_norm=True,
                sparse_attention=True
            )
            for _ in range(n_layers)
        ])

        self.final_layer_norm = nn.LayerNorm(d_model)

        self.lm_head = nn.Linear(
            d_model,
            vocab_size,
            bias=False
        )

    def forward(self, input_ids):
        # Create position indices on the same device as the input tokens
        positions = torch.arange(input_ids.size(1), device=input_ids.device)

        x = (
            self.token_embedding(input_ids)
            + self.position_embedding(positions)
        )

        for block in self.transformer_blocks:
            x = block(x)

        x = self.final_layer_norm(x)

        logits = self.lm_head(x)

        return logits
</code></pre>
<p>Compared to earlier GPT versions, this model dramatically increases scale. The default arguments match the largest (175B-parameter) configuration reported in the GPT-3 paper: an embedding size of <code>d_model=12288</code> and <code>96</code> Transformer layers, which give the network the capacity to learn highly complex language patterns and long-range dependencies.</p>
<p>The model also uses <code>96</code> attention heads:</p>
<pre><code class="language-python">n_heads=96
</code></pre>
<p>Multi-head attention allows the model to focus on different relationships between tokens simultaneously, improving contextual understanding.</p>
<p>The positional embedding length is expanded to <code>2048</code>, enabling the model to process sequences twice as long as GPT-2's 1,024 tokens.</p>
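<p>You can sanity-check these numbers with a little arithmetic. The sketch below uses the common approximation of roughly <code>12 * n_layers * d_model**2</code> weights for the Transformer blocks, which assumes a feed-forward inner size of <code>4 * d_model</code> and ignores biases and LayerNorm parameters. Treat it as an estimate rather than an exact count.</p>
<pre><code class="language-python">vocab_size, d_model, n_layers, n_heads, context_length = 50257, 12288, 96, 96, 2048

# Width of each attention head.
head_dim = d_model // n_heads

# Per layer: about 4*d^2 for attention plus 8*d^2 for the feed-forward network.
block_params = 12 * n_layers * d_model ** 2

# Token embedding table plus learned positional embedding table.
embedding_params = (vocab_size + context_length) * d_model

total = block_params + embedding_params
print(head_dim)               # 128
print(f"{total / 1e9:.1f}B")  # 174.6B
</code></pre>
<p>The estimate lands very close to the 175 billion parameters the model is known for, and it shows that almost all of them sit in the stacked blocks rather than in the embeddings.</p>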
<p>Each Transformer block is configured with:</p>
<pre><code class="language-python">pre_layer_norm=True,
sparse_attention=True
</code></pre>
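<p>Pre-layer normalization improves training stability in very deep networks, while sparse attention reduces the computational cost of attention by limiting which earlier tokens each position attends to. According to the GPT-3 paper, the model alternates dense and locally banded sparse attention layers rather than making every layer sparse, so the single <code>sparse_attention=True</code> flag here is a simplification. This matters at GPT-3 scale, where the cost of full attention grows quadratically with sequence length.</p>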
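<p>Here's a rough way to see the saving from a locally banded pattern, where each position attends only to a fixed window of recent tokens. The window size of <code>256</code> is an assumption picked purely for illustration; this article doesn't specify the value GPT-3 used.</p>
<pre><code class="language-python">def dense_causal_pairs(seq_len):
    # Every position attends to itself and all earlier positions.
    return sum(i + 1 for i in range(seq_len))

def banded_causal_pairs(seq_len, window):
    # Every position attends only to the most recent `window` positions (itself included).
    return sum(min(i + 1, window) for i in range(seq_len))

seq_len, window = 2048, 256  # window size is an illustrative assumption

dense = dense_causal_pairs(seq_len)
banded = banded_causal_pairs(seq_len, window)
print(dense, banded, round(dense / banded, 1))
</code></pre>
<p>Dense causal attention grows with the square of the sequence length, while the banded version grows roughly linearly once the sequence is longer than the window.</p>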
<p>The forward pass follows the standard GPT pipeline:</p>
<ol>
<li><p>Convert token IDs into embeddings</p>
</li>
<li><p>Add positional information</p>
</li>
<li><p>Pass representations through stacked Transformer blocks</p>
</li>
<li><p>Apply final layer normalization</p>
</li>
<li><p>Generate vocabulary logits</p>
</li>
</ol>
<p>The core iterative processing happens here:</p>
<pre><code class="language-python">for block in self.transformer_blocks:
    x = block(x)
</code></pre>
<p>Finally, the output layer projects the hidden states into vocabulary space, producing <code>logits</code> used for next-token prediction during training and text generation.</p>
<h3 id="heading-gpt-4-aligned-multimodal-foundation-model-architecture">GPT-4: Aligned Multimodal Foundation Model Architecture</h3>
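<pre><code class="language-python">import torch
import torch.nn as nn

# TransformerBlock, VisionTransformer, and RewardModel are assumed to be
# defined elsewhere. GPT-4's real architecture is undisclosed.
class GPT4(nn.Module):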
    def __init__(
        self,
        vocab_size=50257,
        d_model=12288,
        n_layers=120,
        n_heads=96,
        context_length=8192
    ):
        super().__init__()

        # Text embeddings
        self.token_embedding = nn.Embedding(
            vocab_size,
            d_model
        )

        self.position_embedding = nn.Embedding(
            context_length,
            d_model
        )

        # Vision encoder for image inputs
        self.vision_encoder = VisionTransformer(
            embed_dim=d_model
        )

        # Multimodal projection layer
        self.image_projection = nn.Linear(
            d_model,
            d_model
        )

        # Decoder-only Transformer blocks
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(
                d_model=d_model,
                n_heads=n_heads,
                pre_layer_norm=True,
                flash_attention=True
            )
            for _ in range(n_layers)
        ])

        self.final_layer_norm = nn.LayerNorm(d_model)

        # Language modeling head
        self.lm_head = nn.Linear(
            d_model,
            vocab_size,
            bias=False
        )

        # RLHF alignment head
        self.reward_head = RewardModel(
            hidden_size=d_model
        )

    def forward(
        self,
        input_ids,
        image_inputs=None
    ):

        positions = torch.arange(
            input_ids.size(1),
            device=input_ids.device  # same device as the input tokens
        )

        text_embeddings = (
            self.token_embedding(input_ids)
            + self.position_embedding(positions)
        )

        # Encode image if provided
        if image_inputs is not None:

            image_features = self.vision_encoder(
                image_inputs
            )

            image_embeddings = self.image_projection(
                image_features
            )

            x = torch.cat(
                [image_embeddings, text_embeddings],
                dim=1
            )

        else:
            x = text_embeddings

        # Transformer decoding
        for block in self.transformer_blocks:
            x = block(x)

        x = self.final_layer_norm(x)

        logits = self.lm_head(x)

        return logits
</code></pre>
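<p>Like previous GPT models, the architecture starts with token embeddings and positional embeddings. <code>nn.Embedding</code> converts token IDs into dense vector representations, while positional embeddings preserve sequence order information. Keep in mind that OpenAI has not disclosed GPT-4's architecture, so the layer count, hidden size, vocabulary size, and component choices in this sketch are illustrative guesses rather than confirmed details.</p>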
<p>One major difference is the addition of a vision encoder:</p>
<pre><code class="language-python">self.vision_encoder = VisionTransformer(
    embed_dim=d_model
)
</code></pre>
<p>This module processes image inputs and converts them into visual feature representations that can be understood by the Transformer.</p>
<p>The image features are then passed through a projection layer:</p>
<pre><code class="language-python">self.image_projection = nn.Linear(
    d_model,
    d_model
)
</code></pre>
<p>This aligns image representations with the same embedding space used for text tokens, making multimodal processing possible.</p>
<p>The Transformer stack remains decoder-only, but now uses:</p>
<pre><code class="language-python">flash_attention=True
</code></pre>
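<p>Flash Attention is an optimized, exact attention implementation that avoids materializing the full attention matrix in GPU memory. This reduces memory usage and improves training and inference speed, especially for long context windows like <code>8192</code> tokens. Whether GPT-4 actually uses it has not been disclosed, so it appears here as a representative optimization.</p>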
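<p>To see why this matters at long context lengths, consider how much memory the attention scores would take if they were stored in full. The calculation below assumes 16-bit scores and reuses the <code>8192</code>-token context and <code>96</code> heads from the sketch above. These are illustrative figures, not measurements of any real system.</p>
<pre><code class="language-python">seq_len, n_heads, bytes_per_value = 8192, 96, 2  # 16-bit scores are an assumption

scores_per_head = seq_len * seq_len                 # one score per (query, key) pair
bytes_per_head = scores_per_head * bytes_per_value
bytes_per_layer = bytes_per_head * n_heads          # all heads, one sequence, one layer

print(bytes_per_head // 2**20, "MiB per head")      # 128 MiB per head
print(bytes_per_layer // 2**30, "GiB per layer")    # 12 GiB per layer
</code></pre>
<p>Storing that matrix for every layer quickly becomes impractical. Computing attention in small tiles, as Flash Attention does, keeps memory use growing linearly with sequence length instead of quadratically.</p>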
<p>Inside the <code>forward()</code> method, text embeddings are created first. If an image is provided, the image is encoded and projected into embeddings:</p>
<pre><code class="language-python">image_features = self.vision_encoder(
    image_inputs
)
</code></pre>
<p>The image and text embeddings are then combined using:</p>
<pre><code class="language-python">x = torch.cat(
    [image_embeddings, text_embeddings],
    dim=1
)
</code></pre>
<p><code>torch.cat()</code> concatenates tensors along the sequence dimension, allowing the Transformer to process image and text tokens together as a single sequence.</p>
<p>The combined representations pass through all Transformer blocks sequentially:</p>
<pre><code class="language-python">for block in self.transformer_blocks:
    x = block(x)
</code></pre>
<p>After normalization, the final hidden states are projected into vocabulary space to produce <code>logits</code> for next-token prediction.</p>
<p>The architecture also introduces a reward model head:</p>
<pre><code class="language-python">self.reward_head = RewardModel(
    hidden_size=d_model
)
</code></pre>
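<p>This component represents reinforcement learning from human feedback (RLHF), which is used to align model outputs with human preferences and improve response quality and safety. In practice, the reward model is a separate network used only during training to score candidate responses. It isn't called in <code>forward()</code> above and isn't part of the deployed model's inference path.</p>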
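<p>For a sense of how a reward model learns from human preferences, here's a minimal sketch of the pairwise comparison loss described in the InstructGPT paper listed in the resources below. The function name and the reward values are hypothetical, and whether GPT-4's training used exactly this form has not been disclosed.</p>
<pre><code class="language-python">import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    # -log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin)).
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

print(pairwise_preference_loss(2.0, -1.0))  # preferred response scored higher: low loss
print(pairwise_preference_loss(-1.0, 2.0))  # ranking inverted: high loss
print(pairwise_preference_loss(0.0, 0.0))   # tie: log(2)
</code></pre>
<p>Minimizing this loss pushes the reward model to score the response that human labelers preferred above the one they rejected. The language model is then fine-tuned to produce responses that earn higher scores.</p>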
<h2 id="heading-resources"><strong>Resources:</strong></h2>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD/Pytorch-Collections/tree/main/GPT">PyTorch Projects for the GPT Series</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a></p>
</li>
<li><p><a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">Improving Language Understanding by Generative Pre-Training (GPT-1)</a></p>
</li>
<li><p><a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Language Models are Unsupervised Multitask Learners (GPT-2)</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2005.14165">Language Models are Few-Shot Learners (GPT-3)</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2303.08774">GPT-4 Technical Report</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2001.08361">Scaling Laws for Neural Language Models</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2203.15556">Training Compute-Optimal Large Language Models</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2204.02311">PaLM: Scaling Language Modeling with Pathways</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2203.02155">Training Language Models to Follow Instructions with Human Feedback</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2212.08073">Constitutional AI: Harmlessness from AI Feedback</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2201.11903">Chain-of-Thought Prompting Elicits Reasoning in Large Language Models</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2203.11171">Self-Consistency Improves Chain of Thought Reasoning in Language Models</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2109.07958">TruthfulQA: Measuring How Models Mimic Human Falsehoods</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2107.03374">Evaluating Large Language Models Trained on Code (HumanEval)</a></p>
</li>
<li><p><a href="https://arxiv.org/abs/2009.03300">Measuring Massive Multitask Language Understanding (MMLU)</a></p>
</li>
</ul>
<p><strong>Contact Me</strong></p>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD"><strong>GitHub</strong></a></p>
</li>
<li><p><a href="https://x.com/programmingoce"><strong>X</strong></a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/mohammed-abrah-6435a63ba/"><strong>LinkedIn</strong></a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ AI Paper Review: Improving Language Understanding by Generative Pre-Training (GPT-1)
 ]]>
                </title>
                <description>
                    <![CDATA[ We use AI tools all the time, whether it’s asking questions, generating images, or getting help with everyday tasks. But most of these tools didn’t appear out of nowhere. They were developed based on  ]]>
                </description>
                <link>https://www.freecodecamp.org/news/ai-paper-review-improving-language-understanding-by-generative-pre-training-gpt-1/</link>
                <guid isPermaLink="false">69fb84ad50ecad45335e5367</guid>
                
                    <category>
                        <![CDATA[ AI ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ academic writing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ transformers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Mohammed Fahd Abrah ]]>
                </dc:creator>
                <pubDate>Wed, 06 May 2026 18:13:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/0998e844-4017-49b9-a68d-2d6c73fceb78.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>We use AI tools all the time, whether it’s asking questions, generating images, or getting help with everyday tasks. But most of these tools didn’t appear out of nowhere. They were developed based on research papers where the original ideas were developed and tested.</p>
<p>Now, not everyone enjoys reading research papers or has the time to comb through and digest all that (sometimes very dense) info. So I decided to do the hard work for you and share the key insights in a series of AI paper reviews.</p>
<p>The goal isn’t to turn this into a heavy academic discussion, but to explain the main ideas in a clear and practical way. You'll learn what problem the paper was trying to solve, what approach it introduced, and why it mattered.</p>
<p>In each article, you’ll get a simple breakdown of the paper, how it works, and what you should take away from it. By the end, you should understand the key idea without needing to go through the full research paper yourself.</p>
<h2 id="heading-paper-overview">Paper Overview</h2>
<p>The first paper I'll be reviewing is "Improving Language Understanding by Generative Pre-Training", by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.</p>
<p>Here's the actual paper if you want to read it yourself: <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">Read the paper</a>.</p>
<p>And here's a little infographic of what we'll cover here:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/0466e09f-c2a3-41fa-939d-f67d53f900e1.png" alt="Infographic summarizing the GPT-1 paper review" style="display:block;margin:0 auto" width="1414" height="2000" loading="lazy">

<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a href="#heading-executive-summary">Executive Summary</a></p>
</li>
<li><p><a href="#heading-goals-of-the-paper">Goals of the Paper</a></p>
</li>
<li><p><a href="#heading-methodology">Methodology</a></p>
</li>
<li><p><a href="#heading-transformer-vs-bert-vs-gpt">Transformer vs. BERT vs. GPT</a></p>
</li>
<li><p><a href="#heading-model-architecture">Model Architecture</a></p>
</li>
<li><p><a href="#heading-key-techniques">Key Techniques</a></p>
</li>
<li><p><a href="#heading-key-findings">Key Findings</a></p>
</li>
<li><p><a href="#heading-conclusions">Conclusions</a></p>
</li>
<li><p><a href="#heading-limitations">Limitations</a></p>
</li>
<li><p><a href="#heading-related-work-amp-context">Related Work &amp; Context</a></p>
</li>
<li><p><a href="#heading-final-insight">Final Insight</a></p>
</li>
<li><p><a href="#heading-resources">Resources</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To get the most out of this breakdown, it helps to be familiar with a few basic ideas:</p>
<ul>
<li><p>A general understanding of natural language processing (NLP) and how machines work with text</p>
</li>
<li><p>A high-level idea of what a Transformer model is (you don’t need deep details, just the concept)</p>
</li>
<li><p>The difference between supervised and unsupervised learning</p>
</li>
<li><p>Basic machine learning concepts like training data and models</p>
</li>
</ul>
<p>If you’re not fully comfortable with all of these, that’s okay. You can still follow along. The goal here is to keep things clear and intuitive.</p>
<h2 id="heading-executive-summary">Executive Summary</h2>
<p>Before models like GPT became what we know today, there was a key limitation: AI systems were good at specific tasks, but struggled with general understanding.</p>
<p>In this paper, the authors introduce a simple but powerful idea. Instead of training a model separately for each task, they first train it on a large amount of unlabeled text to learn the structure of language. Then, they adapt it to specific tasks using smaller labeled datasets.</p>
<p>According to the authors, this two-step approach (pre-training followed by fine-tuning) allows a single model to handle many different tasks with minimal changes.</p>
<p>In practice, this marked a major shift: rather than building a new model for every problem, we can train one general model that learns language itself and then reuse it across tasks.</p>
<h2 id="heading-goals-of-the-paper">Goals of the Paper</h2>
<p>To understand the motivation behind this work, it helps to look at the main limitations in NLP at the time.</p>
<p>Most models depended heavily on large labeled datasets, which weren’t always available. Many tasks simply didn’t have enough labeled data to train effective systems. On top of that, existing models were usually designed for a single task, making them hard to reuse or adapt.</p>
<p>Because of this, the authors aimed to reduce the reliance on labeled data and move toward a more general approach. Their goal was to build a language model that could learn from large amounts of raw text and then be applied across different tasks.</p>
<p>According to the paper, they also wanted to enable transfer learning (the ability to take knowledge learned from one task and apply it to others) and to improve performance without redesigning a model for each new task.</p>
<h2 id="heading-methodology">Methodology</h2>
<p>To understand how the authors approached this problem, let’s look at the core idea behind their method.</p>
<h3 id="heading-pre-training">Pre-Training</h3>
<p>At the heart of the paper is a simple but powerful approach built in two stages. The first stage is pre-training, where the model learns directly from raw text.</p>
<p>According to the authors, the model is trained on a large corpus of unlabeled text using a language modeling objective: predicting the next token from the tokens that come before it. Breaking the probability of a whole sequence into these per-token predictions sidesteps the otherwise intractable problem of modeling a <a href="https://en.wikipedia.org/wiki/High-dimensional_statistics">high-dimensional probability distribution</a> over entire sequences directly. Through this process, the model gradually learns important aspects of language, such as grammar, context, structure, and general patterns.</p>
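<p>Here’s a tiny sketch of that objective in plain Python. The per-token probabilities are made-up numbers standing in for what a model might assign to each correct next token. A real model would compute them from its logits.</p>
<pre><code class="language-python">import math

def sequence_log_likelihood(step_probs):
    # Chain rule: log P(u_1..u_n) = sum of log P(u_i | earlier tokens).
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities assigned to each correct next token
# in a four-token sequence such as "the cat sat down".
step_probs = [0.20, 0.05, 0.30, 0.10]

log_likelihood = sequence_log_likelihood(step_probs)
print(log_likelihood)            # pre-training maximizes this quantity
print(math.exp(log_likelihood))  # probability of the whole sequence
</code></pre>
<p>Pre-training adjusts the model’s parameters so that this sum gets as large as possible across the whole corpus, which is the same as minimizing the average cross-entropy of its next-token predictions.</p>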
<p>The paper highlights that datasets like BooksCorpus are used in this stage because they contain long, continuous text. This is important, since it helps the model understand relationships across sentences rather than just short fragments.</p>
<h3 id="heading-fine-tuning-adapting-to-tasks">Fine-Tuning (Adapting to Tasks)</h3>
<p>Once the model has learned general language patterns, the next step is fine-tuning, where it is adapted to specific tasks using labeled data.</p>
<p>According to the authors, this includes tasks like question answering, text classification, natural language inference, and semantic similarity. Instead of building a new model for each task, the same pre-trained model is reused with only small adjustments.</p>
<p>In practice, this is what makes the approach powerful: the model already understands language at a general level, so it can quickly adapt to different tasks without needing to be redesigned from scratch.</p>
<h2 id="heading-transformer-vs-bert-vs-gpt">Transformer vs. BERT vs. GPT</h2>
<p>Before diving into GPT-1, it helps to understand how modern language models are structured. Most of them are based on the Transformer architecture, but they use it in different ways: encoder-only models (like BERT), decoder-only models (like GPT), or full encoder–decoder models.</p>
<p>The original encoder–decoder Transformer was mainly used for tasks like machine translation. Encoder-only models are typically used for understanding tasks such as text classification and sentiment analysis, while decoder-only models are designed for generation tasks like text creation, powering systems such as ChatGPT, Gemini, and Claude.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/e7348479-5fa0-4adf-92e1-644ae2039b03.png" alt="e7348479-5fa0-4adf-92e1-644ae2039b03" style="display:block;margin:0 auto" width="700" height="449" loading="lazy">

<p><em>Illustration comparing Transformer, GPT, and BERT architectures, adapted from</em> <a href="https://automotivevisions.wordpress.com/2025/03/21/comparing-large-language-models-gpt-vs-bert-vs-t5/">Comparing Large Language Models: GPT vs. BERT vs. T5</a><em>, showing encoder-decoder, decoder-only, and encoder-only designs.</em></p>
<h3 id="heading-transformer-vs-bert-vs-gpt-key-differences">Transformer vs BERT vs GPT: Key Differences</h3>
<table style="min-width:100px"><colgroup><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"><col style="min-width:25px"></colgroup><tbody><tr><td><p><strong>Aspect</strong></p></td><td><p><strong>Transformer (Original)</strong></p></td><td><p><strong>BERT</strong></p></td><td><p><strong>GPT</strong></p></td></tr><tr><td><p><strong>Paper</strong></p></td><td><p>Attention Is All You Need (2017)</p></td><td><p>BERT (2018)</p></td><td><p>GPT (2018–2019)</p></td></tr><tr><td><p><strong>Architecture Type</strong></p></td><td><p>Encoder + Decoder</p></td><td><p>Encoder-only</p></td><td><p>Decoder-only</p></td></tr><tr><td><p><strong>Primary Goal</strong></p></td><td><p>Sequence-to-sequence tasks (for example, translation)</p></td><td><p>Language understanding</p></td><td><p>Language generation</p></td></tr><tr><td><p><strong>Training Objective</strong></p></td><td><p>Predict next token (seq2seq setup)</p></td><td><p>Masked language modeling (fill in blanks)</p></td><td><p>Predict next token (autoregressive)</p></td></tr><tr><td><p><strong>Directionality</strong></p></td><td><p>Bidirectional (encoder) + left-to-right (decoder)</p></td><td><p>Fully bidirectional</p></td><td><p>Left-to-right only</p></td></tr><tr><td><p><strong>Context Understanding</strong></p></td><td><p>Strong (via attention)</p></td><td><p>Very strong (full bidirectional context)</p></td><td><p>Strong (but only past context)</p></td></tr><tr><td><p><strong>Input/Output Style</strong></p></td><td><p>Input → Output sequence</p></td><td><p>Input → Representation</p></td><td><p>Input → Generated text</p></td></tr><tr><td><p><strong>Fine-tuning</strong></p></td><td><p>Required for each task</p></td><td><p>Required for each task</p></td><td><p>Optional (GPT-2+ supports zero-shot)</p></td></tr><tr><td><p><strong>Typical Tasks</strong></p></td><td><p>Translation, summarization</p></td><td><p>Classification, QA, NLI</p></td><td><p>Text generation, QA, 
chat</p></td></tr><tr><td><p><strong>Strength</strong></p></td><td><p>Flexible architecture foundation</p></td><td><p>Deep understanding of text</p></td><td><p>General-purpose generation</p></td></tr><tr><td><p><strong>Limitation</strong></p></td><td><p>Not directly usable without adaptation</p></td><td><p>Cannot generate text naturally</p></td><td><p>Limited bidirectional context</p></td></tr><tr><td><p><strong>Key Innovation</strong></p></td><td><p>Self-attention mechanism</p></td><td><p>Deep bidirectional encoding</p></td><td><p>Scaled generative pre-training</p></td></tr><tr><td><p><strong>Evolution Role</strong></p></td><td><p>Foundation of all modern LLMs</p></td><td><p>Specialized understanding models</p></td><td><p>Path to general-purpose AI</p></td></tr></tbody></table>

<h2 id="heading-model-architecture">Model Architecture</h2>
<p>To support this pre-training and fine-tuning approach, the GPT-1 model is built on a decoder-only Transformer architecture.</p>
<p>According to the authors, this choice is important for a few reasons. Unlike older models such as LSTMs, Transformers handle long-range dependencies more effectively, meaning they can better understand relationships between words that are far apart in a sentence.</p>
<p>They also rely on self-attention, a mechanism that allows the model to focus on the most relevant parts of the text when processing each word. This helps the model capture context more accurately.</p>
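<p>The following is a simplified sketch of self-attention as used in a decoder, written in plain Python. It assumes the query, key, and value vectors are given directly (in a real Transformer they come from learned projections), and the three example vectors are hypothetical. The causal restriction means each position can look only at itself and earlier positions.</p>

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees positions 0..i."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Score the query against itself and every earlier position only.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of the visible values.
        output = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for a three-token sequence.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(vectors, vectors, vectors)
```

<p>Each row of <code>weights</code> sums to 1 and shows how much the model focuses on each visible token when processing the current one.</p>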
<p>Another key advantage is that Transformers make transfer learning more effective, since the same learned representations can be reused across different tasks with minimal changes.</p>
<p>The paper highlights that, in these transfer learning scenarios, Transformers outperform LSTM-based models.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/59df10f6-d843-4db7-9def-e302594d0b7e.png" alt="59df10f6-d843-4db7-9def-e302594d0b7e" style="display:block;margin:0 auto" width="1793" height="831" loading="lazy">

<p><em>Figure 1 from</em> “Improving Language Understanding by Generative Pre-Training” <em>(Radford et al., 2018), showing the Transformer architecture and task-specific input transformations.</em></p>
<h2 id="heading-key-techniques">Key Techniques</h2>
<p>Along with the main approach, the authors introduce a few practical techniques that make the model more flexible across tasks.</p>
<p>According to the paper, different tasks are handled by converting them into text-based formats, so they can all be processed in a similar way. This makes it easier to use the same model across multiple problems without redesigning it each time.</p>
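<p>As a rough illustration of these text-based formats, the sketch below builds input strings for three kinds of tasks. The marker strings and function names are hypothetical placeholders for the start, delimiter, and end tokens the paper describes, and real inputs would be token IDs rather than plain strings.</p>

```python
# Hypothetical marker strings standing in for the special tokens.
START, DELIMITER, END = "_start_", "_delimiter_", "_end_"

def format_single(text):
    """Classification: wrap one piece of text in start and end markers."""
    return f"{START} {text} {END}"

def format_pair(first, second):
    """Entailment-style tasks: join two texts with a delimiter in between."""
    return f"{START} {first} {DELIMITER} {second} {END}"

def format_similarity(text_a, text_b):
    """Similarity has no natural ordering, so both orderings are produced."""
    return [format_pair(text_a, text_b), format_pair(text_b, text_a)]

entailment_input = format_pair("A man is sleeping.", "A person is resting.")
similarity_inputs = format_similarity("Hello there.", "Hi.")
```

<p>Since every task ends up as one sequence of tokens, the same pre-trained model can read all of them without structural changes.</p>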
<p>Another important point is that the model requires only minimal architectural changes when switching between tasks. Most of the knowledge learned during pre-training is reused as-is.</p>
<p>The authors also include an auxiliary language modeling objective during fine-tuning, which helps the model retain its general understanding of language while adapting to specific tasks.</p>
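<p>In code, the combined objective can be sketched as a weighted sum. The function and argument names here are hypothetical. The default weight of 0.5 reflects the value the paper reports using, and the two input numbers are made-up log-likelihoods.</p>

```python
def fine_tuning_objective(task_log_likelihood, lm_log_likelihood, lm_weight=0.5):
    """Task objective plus a weighted auxiliary language modeling objective."""
    return task_log_likelihood + lm_weight * lm_log_likelihood

# Hypothetical log-likelihood values for one batch of labeled examples.
combined = fine_tuning_objective(task_log_likelihood=-2.0, lm_log_likelihood=-4.0)
```

<p>Setting <code>lm_weight</code> to 0 would recover plain fine-tuning on the task alone.</p>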
<h2 id="heading-key-findings">Key Findings</h2>
<p>After training and evaluation, the results were not just competitive. In most cases, they set a new state of the art.</p>
<p>According to the authors, the model outperformed the previous state of the art on 9 of the 12 tasks studied. It also showed clear absolute improvements, including 8.9% on commonsense reasoning (Stories Cloze Test) and 5.7% on question answering (RACE).</p>
<p>Another important observation is that the model performed well across datasets of different sizes, although performance was weaker on some smaller datasets.</p>
<p>This suggests that the pre-training step helped it generalize better, even when labeled data was limited.</p>
<p>In practice, what makes these results significant is that a single model was able to compete with specialized systems that were specifically designed for each individual task.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69ce92860ff860b6de01ed93/14e5a9dd-9919-4b2a-ad42-6b011770b7fe.png" alt="14e5a9dd-9919-4b2a-ad42-6b011770b7fe" style="display:block;margin:0 auto" width="1866" height="815" loading="lazy">

<p><em>Figure 2 from</em> “Improving Language Understanding by Generative Pre-Training” <em>(Radford et al., 2018), illustrating performance gains from layer transfer and zero-shot learning behavior.</em></p>
<h2 id="heading-conclusions">Conclusions</h2>
<p>To wrap things up, this paper introduced a major shift in how AI systems are built.</p>
<p>According to the authors, instead of training a new model from scratch for every task, we can first teach a model the structure of language through pre-training, and then adapt it to specific tasks through fine-tuning. This simple idea turns out to be highly effective.</p>
<p>The key takeaway is that language models can develop a general understanding of text, especially when combined with Transformer architectures and large-scale data. This makes transfer learning practical across many different tasks.</p>
<p>In my view, this is what makes the paper so impactful. It doesn’t just improve performance on a few benchmarks. It changes the overall approach to building AI systems.</p>
<p>This idea later became the foundation for models like GPT-2, GPT-3, and ChatGPT, and continues to shape modern large language models today.</p>
<h2 id="heading-limitations">Limitations</h2>
<p>Like any approach, this method comes with its own limitations.</p>
<p>According to the paper, one of the main challenges is the need for large amounts of unlabeled data during the pre-training stage, which may not always be easy to get. The model’s performance also depends heavily on how well the fine-tuning step is done.</p>
<p>The authors also note that multi-task learning was not fully explored in this work, leaving some open questions about how well the model can handle multiple tasks at the same time.</p>
<p>In practice, another limitation is that performance can be weaker when working with very small datasets, especially if the fine-tuning process is not carefully handled.</p>
<h2 id="heading-related-work-amp-context">Related Work &amp; Context</h2>
<p>To better understand where this paper fits, it helps to look at the ideas it builds on.</p>
<p>According to the authors, earlier approaches such as word embeddings (like Word2Vec and GloVe), LSTM-based language models, and semi-supervised learning had already made progress in understanding language. But these methods were often limited to learning representations at the word level or required more task-specific design.</p>
<p>What this paper does differently is move beyond that. Instead of focusing only on individual words, it learns broader language representations that capture context and meaning across entire sequences. This shift is what enables the model to generalize better across different tasks.</p>
<h2 id="heading-final-insight">Final Insight</h2>
<p>If there’s one idea to take away from this paper, it’s this: you don’t need to teach an AI system every task separately.</p>
<p>According to the authors, once a model learns the structure of language, it can adapt to a wide range of tasks with minimal changes. That shift – from task-specific models to general language understanding – is what makes this work so important.</p>
<p>In my view, this is the moment where things really changed. What started here with GPT-1 became the foundation for the systems we use today, including ChatGPT and other modern language models.</p>
<h2 id="heading-resources">Resources:</h2>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD/Pytorch-Collections/tree/main/GPT">PyTorch Projects for GPT series</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1301.3781">Word2Vec (Mikolov et al., 2013)</a></p>
</li>
<li><p><a href="https://aclanthology.org/D14-1162.pdf">GloVe (Pennington et al., 2014)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1706.03762">Attention Is All You Need (Vaswani et al., 2017)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1511.01432">Semi-supervised Sequence Learning (Dai and Le, 2015)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1801.06146">Universal Language Model Fine-tuning for Text Classification (Howard and Ruder, 2018)</a></p>
</li>
<li><p><a href="https://aclanthology.org/N18-1202.pdf">Deep Contextualized Word Representations (Peters et al., 2018)</a></p>
</li>
<li><p><a href="https://aclanthology.org/P17-1194.pdf">Semi-supervised Multitask Learning for Sequence Labeling (Rei, 2017)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1506.06726">Skip-Thought Vectors (Kiros et al., 2015)</a></p>
</li>
<li><p><a href="https://arxiv.org/pdf/1705.02364">Supervised Learning of Universal Sentence Representations (Conneau et al., 2017)</a></p>
</li>
</ul>
<h3 id="heading-contact-me">Contact Me</h3>
<ul>
<li><p><a href="https://github.com/MOHAMMEDFAHD"><strong>GitHub</strong></a></p>
</li>
<li><p><a href="https://x.com/programmingoce"><strong>X</strong></a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/mohammed-abrah-6435a63ba/"><strong>LinkedIn</strong></a></p>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Neural Networks Work – Explained Using the Straight Line Equation y = ax + b ]]>
                </title>
                <description>
                    <![CDATA[ Did you know that every data scientist who builds a complex neural network starts with a fundamental question, “How does the output change when the input changes?” A straight line equation y = ax+b answers it in the simplest way possible. y can incre... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/neural-networks-explained-using-y-ax-b/</link>
                <guid isPermaLink="false">695ef4246f1bfe13bf31abe9</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Samyukta Hegde ]]>
                </dc:creator>
                <pubDate>Thu, 08 Jan 2026 00:02:44 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767800625537/5bb99a58-d247-4933-b60b-fd2c14651542.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Did you know that every data scientist who builds a complex neural network starts with a fundamental question, “How does the output change when the input changes?”</p>
<p>A straight line equation <code>y = ax+b</code> answers it in the simplest way possible. <code>y</code> can increase, decrease, or stay the same when <code>x</code> changes.</p>
<p>On the other hand, a deep neural network tries to answer it in a flexible way. This is possible only because multiple layers of straight line calculations are stacked on top of one another, along with non-linear adjustments that help the network adapt and produce the desired result.</p>
<p>Since a straight line is the essence of neural networks, I think it’s time we try to understand the subtle details of <code>y = ax+b</code>, which I refer to as the <strong>magical equation</strong>. We’ll also go through the basics of linear regression and classification, which should help you understand the progression of a simple straight line to a complex deep neural network.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-yaxb">y=ax+b</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-linear-regression">Linear Regression</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-linear-classification">Linear Classification</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-comparison">Comparison</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-key-additions-to-help-build-deep-neural-networks">Key Additions to Help Build Deep Neural Networks</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-modelling-a-deep-neural-network">Modelling a Deep Neural Network</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-thoughts">Final Thoughts</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A basic understanding of linear algebra, particularly <code>y=ax+b</code>.</p>
</li>
<li><p>A general idea of linear regression and classification.</p>
</li>
<li><p>Familiarity with the concept of deep neural networks.</p>
</li>
</ul>
<h2 id="heading-yaxb">y=ax+b</h2>
<p>A straight line simply means that output changes steadily as input changes. There are no surprises (that is, no non-linearity). Let’s analyze it properly.</p>
<pre><code class="lang-plaintext">y =&gt; Output variable
x =&gt; Input variable
a =&gt; Amount by which y changes when x changes (slope)
b =&gt; Value of y when x is 0 (y intercept)
</code></pre>
<p>We can take an example and model it in the same form to understand it better.</p>
<p>Ms. Poly is a math teacher who wants to formulate a study plan for her students to excel in an upcoming final exam. For simplicity, she creates a rule of thumb using only one factor: the number of hours studied per week. It has a direct impact on the marks scored by a student.</p>
<p>Before beginning, she makes certain assumptions:</p>
<ul>
<li><p>Every student is capable of scoring at least 30 without studying.</p>
</li>
<li><p>For every hour a student studies, an additional 3 marks can be scored.</p>
</li>
</ul>
<p>She then comes up with the following equation based on her ideas: <code>y = 3x+30</code></p>
<pre><code class="lang-plaintext">y =&gt; Marks scored.
x =&gt; Number of hours studied.
a=3 =&gt; Increase in marks for every hour studied
b=30 =&gt; Minimum marks
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764650083131/997f2a53-78ac-4b6f-a0c1-b995fb515075.png" alt="Plot of y=3x+30" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>In the above graph, she plots the points based on the results of the equation. As expected, it is a straight line. If she needs the marks scored for <code>9</code> hours of study, she can get it by just substituting <code>x=9</code> in <code>y=3x+30</code>. Note that the data (<code>x</code> and <code>y</code>) are just based on her hunch and aren’t real.</p>
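<p>The same substitution can be written as a tiny Python function. The function name <code>predicted_marks</code> is just an illustrative choice, and the slope and intercept are Ms. Poly’s assumed values from above.</p>

```python
def predicted_marks(hours_studied, slope=3, intercept=30):
    """Ms. Poly's rule of thumb: y = slope * x + intercept."""
    return slope * hours_studied + intercept

# Substituting x = 9 gives 3 * 9 + 30 = 57 marks.
marks_for_nine_hours = predicted_marks(9)

# With no study at all (x = 0), only the intercept remains: 30 marks.
marks_without_study = predicted_marks(0)
```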
<p>But Ms. Poly wants to guide her students on how to prepare for the final exam based on actual data. So she conducts a pop quiz and grades it. In order to formulate a study plan, she interviews her students and collects information on how many hours they study math per week. She creates a table with two columns: number of hours studied (<code>x</code>) per week and marks scored (<code>y</code>). She tries her old formula <code>y=3x+30</code>, but it doesn’t seem to work. Thus, she doesn’t have any sensible equation describing the relation between <code>x</code> and <code>y</code>.</p>
<p>Let’s assume that a new student who hasn’t taken any exam yet (no <code>y</code> available) joins the class the next day, and Ms. Poly only knows the number of hours the student dedicates per week (<code>x</code>). How can she answer the question below?</p>
<p><em>If the new student studies for a certain number of hours (</em><code>x</code><em>), what can be the marks scored (</em><code>y</code><em>) in the exam?</em></p>
<p>It’s impossible unless there’s an equation defining the sample data. So, her task is to find one that fits the given points. This process is called curve fitting or regression.</p>
<h2 id="heading-linear-regression">Linear Regression</h2>
<p>The core idea of linear regression is to find a straight line that captures the trend of the existing data to facilitate predictions for new input data. Now, let’s dive straight into the example to understand the concept better.</p>
<p>Ms. Poly is determined to arrive at a solution. She plots the collected data on a graph to get a better picture.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651274954/0aa2dfc2-d846-40e6-872d-e7d5abe598a8.png" alt="Input Data" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>She has absolutely no idea how <code>x</code> and <code>y</code> are related. So, she must figure out a formula, by trial and error, that roughly fits the points. She has to start with an intuitive guess, try to improve it in the subsequent steps and then arrive at the best possible solution.</p>
<p><strong>Trial 1</strong>: Ms. Poly begins with her previous straight line equation.</p>
<p><code>y = 3x+30</code></p>
<p>She substitutes different values of <code>x</code> and plots the resulting line alongside the collected data. This way she can get a clear picture of the differences between her assumption and reality.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651323645/a3e79765-99bc-42be-8836-82119d7fbf66.png" alt="Linear Regression-Trial 1" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Trial 2</strong>: She observes that the line needs a little more slope. This simply means that, in reality, more marks are being scored for every additional hour of study. By changing it from <code>3</code> to <code>4</code>, the equation becomes:</p>
<p><code>y = 4x+30</code></p>
<p>The following graph depicts the new line alongside the sample data:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651379913/42a8fc61-7927-46de-aadf-b691544b9a1b.png" alt="Linear Regression-Trial 2" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><strong>Trial 3:</strong> It looks better, but she feels there is a need to shift the whole line upwards. This means that higher marks are being scored even if a student doesn’t dedicate any time to math in a week. She decides to retain the previous slope but raises the starting marks by <code>10</code>, thus arriving at:</p>
<p><code>y = 4x+40</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651454435/5fea2d39-8254-48e6-be14-69c803982ec7.png" alt="Linear Regression-Trial 3" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>This line passes close to most of the points, so she takes it as the best fit among her three trials.</p>
<p>Now, if she wishes to estimate the marks scored by the new student who studied for <code>3.5</code> hours, she plugs the value into the formula and calculates the answer: <code>y = 4*(3.5)+40 = 54</code></p>
<p>We saw how Ms. Poly arrived at a straight line equation to predict the output for an unknown input. Now she can chalk out a study plan for her class based on the equation.</p>
<p>Here, an expression is formulated to ascertain the change in output when the input changes. It looks like Ms. Poly is thinking like a data scientist. She has in fact modelled a very simple neural network for regression. The equation <code>y=4x+40</code> can be considered the only neuron (processing unit) within it. She’s adjusted the parameters <code>a</code> (weight) and <code>b</code> (bias) to arrive at the final formula, which passes close to most of the points (thus reducing the loss).</p>
<p>Here’s a breakdown of the <code>y = 4x+40</code> equation:</p>
<pre><code class="lang-plaintext">y =&gt; Marks scored.
x =&gt; Number of hours studied.
a=4 =&gt; Increase in marks for every hour studied
b=40 =&gt; Minimum marks
</code></pre>
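<p>Ms. Poly judged each trial by eye, but the same comparison can be made with numbers. The sketch below scores her three trial lines using the mean squared error, one common way to measure loss. The quiz data here is hypothetical, since the article shows the real data only as a plot.</p>

```python
# Hypothetical quiz data: hours studied per week and marks scored.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
marks = [45, 47, 52, 55, 61, 63, 69, 72]

def mean_squared_error(slope, intercept):
    """Average squared gap between the line's predictions and the real marks."""
    squared_errors = [
        (y - (slope * x + intercept)) ** 2 for x, y in zip(hours, marks)
    ]
    return sum(squared_errors) / len(squared_errors)

# The (slope, intercept) guesses from Trials 1 to 3.
trials = [(3, 30), (4, 30), (4, 40)]
losses = [mean_squared_error(a, b) for a, b in trials]

# The trial with the smallest loss is the best fit among the three.
best_trial = trials[losses.index(min(losses))]
```

<p>On this made-up data, the loss shrinks with every trial, which is exactly what adjusting the weight and bias is meant to achieve.</p>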
<p>At present, her model is a rudimentary neural network with no layering or non-linearity.</p>
<p>Now let’s shift our attention to a completely different scenario. Ms. Poly, being a teacher, wants to ensure that all her students pass the exam. Suppose she’s no longer interested in predicting the exact marks scored. She just wants to know:</p>
<p><em>If a student studies for a certain number of hours (</em><code>x</code><em>), will the student pass or fail (</em><code>y</code><em>) the exam?</em></p>
<p>This leads her to the process of classification.</p>
<h2 id="heading-linear-classification">Linear Classification</h2>
<p>The linear classification process uses a simple straight line to divide the data into categories or classes. The line acts as a boundary so that the classes fall on either side of it. First, Ms. Poly defines the boundary condition for pass and fail.</p>
<p><em>If marks scored &gt;= 50, pass</em></p>
<p><em>If marks scored &lt; 50, fail</em></p>
<p>According to the data table, <code>x=3</code> corresponds to <code>y=52</code> (boundary condition). Therefore, she considers <code>x=3</code> as the classification line.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764651531018/e669ed7b-1c86-4093-b7e5-feb06464ebfe.png" alt="Linear Classification" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p><code>x=3</code> seems to segregate the points into the two categories properly. She confirms it by substituting another value: if a student studied for <code>9</code> hours, the point would lie to the right of <code>x=3</code>, so they’d pass as per the classification line.</p>
<p>Again, she’s arrived at an expression to ascertain the change in output when the input changes. But here, she has modelled a basic neural network for classification. The equation <code>x=3</code> is the only neuron within it. It can be thought of as having two parts, as explained below.</p>
<ol>
<li><p><strong>Pre-Activation Part:</strong> This portion of the neuron computes an intermediate value which is helpful in further processing. She’s figured out the parameters <code>a</code> (weight) and <code>b</code> (bias) to arrive at the following formula: <code>z = x-3</code></p>
<pre><code class="lang-plaintext"> z =&gt; Intermediate Value.
 x =&gt; Number of hours studied.
 a=1 =&gt; Influence of the number of hours studied on the marks scored
 b=-3 =&gt; Minimum number of hours to study to pass the exam = 3
</code></pre>
</li>
<li><p><strong>Activation Part:</strong> This portion triggers the neuron to make decisions based on a threshold value. The following equation segregates the points into two classes.</p>
<pre><code class="lang-plaintext"> y = 1 (Pass) if z&gt;=0
 y = 0 (Fail) if z&lt;0
</code></pre>
</li>
</ol>
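<p>The two parts of this neuron translate directly into code. Here is a minimal sketch using the weight and bias from the list above. The function names are illustrative choices.</p>

```python
def pre_activation(hours_studied, weight=1, bias=-3):
    """Linear part of the neuron: z = weight * x + bias."""
    return weight * hours_studied + bias

def activation(z):
    """Threshold part of the neuron: 1 means pass, 0 means fail."""
    return 1 if z >= 0 else 0

def predict_pass(hours_studied):
    """Run both parts of the neuron, one after the other."""
    return activation(pre_activation(hours_studied))

# Students who study 2, 3, and 9 hours per week.
predictions = {h: predict_pass(h) for h in (2, 3, 9)}
```

<p>A student exactly on the line (<code>x=3</code>) gets <code>z=0</code> and is classified as a pass, matching the <code>z&gt;=0</code> rule.</p>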
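<p>This single-neuron classifier is a very plain neural network: it has no layering, and its only non-linear element is the simple threshold in the activation part. Even so, it already has the pre-activation and activation parts found inside a neuron.</p>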
<h2 id="heading-comparison">Comparison</h2>
<p>We looked at the examples of both linear regression and classification used by Ms. Poly. <strong>Regression</strong> helps in predicting a value while <strong>Classification</strong> helps in decision making. Let’s draw a small table to summarize the differences.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764652317811/f4411011-fcd3-4a53-b116-a3c8a27c81d8.png" alt="Comparison between Linear Regression and Classification" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Upon careful observation, we notice that both answer the question of how a change in input affects the output, but at a slightly higher level of complexity than a plain straight line. That’s because, in both regression and classification, we have to figure out the equation parameters by trial and error.</p>
<p>Here, since the requirements are simple, Ms. Poly just uses a straight line to solve both. A simple linear equation can handle only one steady trend. But in real life, problems that need solving are far more challenging and unpredictable. Some examples are:</p>
<p><strong>Image Classification</strong>: An output label is produced based on the input images.</p>
<p><strong>Text Translation</strong>: An English sentence is given as input and translated into, say, Spanish.</p>
<p><strong>Chatbots</strong>: A text prompt is typed in by a user and a meaningful and relevant output is generated.</p>
<p>She would probably need a deep neural network if both the data and the task were complex. That presents another question: <strong>How does one build a deep neural network?</strong></p>
<p>We will explore it further by extending the same example to a more realistic version.</p>
<h2 id="heading-key-additions-to-help-build-deep-neural-networks">Key Additions to Help Build Deep Neural Networks</h2>
<p>In the above sections, we noted that Ms. Poly was interested in predicting the exam results of a student using just one factor: the number of hours studied. However, in practice, is that one factor sufficient to determine the marks scored or whether the student passes the exam?</p>
<p>No. It’s not enough. She needs to take into account a lot of aspects like:</p>
<ul>
<li><p>Number of hours studied</p>
</li>
<li><p>Number of hours of sleep/rest</p>
</li>
<li><p>Burnout due to over-studying</p>
</li>
<li><p>Difficulty level of topics in math</p>
</li>
<li><p>Pattern of the exam, and so on.</p>
</li>
</ul>
<p>These factors neither act independently nor have a simple linear relation with the marks scored. So, she has to solve this problem by stacking calculations on these factors one above the other in layers and also adding the element of non-linearity. Let’s take a look at each in detail.</p>
<h3 id="heading-layering">Layering</h3>
<p>Burnout leads to a lower score, whereas good sleep increases the score. But burnout can be reduced if the student is well rested. So, the impact on the final score when these two factors interact should be taken into account. This is possible only when the system solves it in layers: the first layer can deal with how they independently influence the score, and the next layer can explore the interaction between them.</p>
<h3 id="heading-non-linearity">Non-Linearity</h3>
<p>If the number of hours studied increases, the score might increase, but when burnout overpowers the effect of study hours, the score drops. The combined effect results in a non-linear graph: there is a rise and then a dip in the score as the number of hours studied grows. Clearly, the relationship is not as simple as a straight line. That’s where it becomes necessary to add non-linearity to the calculations. It helps the system respond differently according to the conditions, allowing for flexibility in dealing with real-world data.</p>
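<p>To picture such a rise and dip, consider the toy function below. Every number in it (the gain per hour, the burnout threshold, and the penalty) is a hypothetical value chosen only to show the shape of the curve.</p>

```python
def expected_score(hours_studied, burnout_starts_at=8):
    """Hypothetical score curve: study helps, but burnout eventually hurts."""
    study_gain = 4 * hours_studied
    # Burnout only kicks in past a threshold, which is what bends the line.
    burnout_penalty = 10 * max(0, hours_studied - burnout_starts_at)
    return 40 + study_gain - burnout_penalty

# The score rises from 4 to 8 hours, then dips at 12 hours.
scores = [expected_score(h) for h in (4, 8, 12)]
```

<p>No single straight line can pass through all three of these points, which is why a purely linear model falls short here.</p>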
<p>Thus, Ms. Poly would have to extend the idea of linear regression/classification by including layering and non-linearity. The result is a fully functional neural network that can help her create a practical study plan.</p>
<h2 id="heading-modelling-a-deep-neural-network">Modelling a Deep Neural Network</h2>
<p>Ms. Poly should start the work on modelling a deep neural network by following the steps mentioned below:</p>
<h3 id="heading-step-1-define-the-problem-clearly"><strong>Step #1 - Define the Problem Clearly</strong></h3>
<p>The following factors should be considered before she begins the process of modelling:</p>
<ul>
<li><p>What are the input features?</p>
</li>
<li><p>What are the output features?</p>
</li>
<li><p>What type of problem is it (regression/classification)?</p>
</li>
</ul>
<h3 id="heading-step-2-define-the-input-layer"><strong>Step #2 - Define the Input Layer</strong></h3>
<p>The input features form the first layer. There is no computation in this stage. They are represented as:</p>
<pre><code class="lang-plaintext">x1: Number of hours studied
x2: Number of hours of sleep/rest
x3: Burnout due to over-studying
x4: Difficulty level of topics in math
x5: Pattern of the exam
</code></pre>
<h3 id="heading-step-3-define-the-first-hidden-layer"><strong>Step #3 - Define the First Hidden Layer</strong></h3>
<p>This step consists of two parts:</p>
<p><strong>Apply Linear Transformation</strong>: The actual learning begins here. A straight line equation is used to understand the combined effect of the inputs. The general formula is <code>z=Wx+b</code>.</p>
<pre><code class="lang-plaintext">z: Intermediate value or Pre-activation
W: Weight matrix which consists of values corresponding to the impact of
each input feature
x: Vector consisting of input features, [x1, x2, x3, x4, x5]
b: Bias which represents the initial assumptions of the teacher (when x=0)
</code></pre>
<p>It looks similar to a linear regression/classification equation. At first, <code>W</code> and <code>b</code> are initialized to random values. Then, in the subsequent steps, they are adjusted just as in the earlier examples. Assuming we have two neurons in this layer, we can consider the following combinations:</p>
<p><strong>Neuron 1:</strong> It can focus on study hours, burnout, and rest, with other features contributing less significantly.</p>
<p><strong>Neuron 2</strong>: It can place more emphasis on the difficulty level of the topics and the exam pattern than on the other inputs.</p>
<p>It’s important to note that this layer doesn’t capture interactions between the features. Each neuron only computes a weighted sum in which every feature contributes independently, and those independent contributions are added together. We don’t yet know how one input feature influences another. For example, we know that sleep increases the score and burnout reduces it, but at this stage we don’t know whether sleep reduces burnout, which in turn could influence the final score.</p>
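<p>As a minimal sketch of the linear step, the snippet below computes <code>z = Wx + b</code> for the two neurons in plain Python. The feature values, weights, and biases are made-up numbers chosen only for illustration. In a real network they would start out random and be learned during training.</p>

```python
# Hypothetical feature values for one student (illustrative only).
# Order: x1 study hours, x2 sleep, x3 burnout, x4 difficulty, x5 exam pattern
x = [6.0, 7.0, 2.0, 3.0, 1.0]

# One row of weights per neuron. These numbers are assumptions, not learned values.
W = [
    [0.8, 0.5, -0.6, 0.1, 0.1],  # Neuron 1: study hours, rest, and burnout dominate
    [0.1, 0.1, 0.1, -0.7, 0.4],  # Neuron 2: difficulty and exam pattern dominate
]
b = [0.5, 1.0]


def linear(W, x, b):
    """Return z = Wx + b as a list with one pre-activation value per neuron."""
    return [
        sum(weight * feature for weight, feature in zip(row, x)) + bias
        for row, bias in zip(W, b)
    ]


z = linear(W, x, b)
print(z)  # roughly 8.0 for Neuron 1 and 0.8 for Neuron 2
```

<p>Each value in <code>z</code> is just a sum of independent contributions, which is why this step alone cannot express how one feature affects another.</p>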
<p><strong>Add Non-Linearity</strong>: This step, also called activation, helps in capturing the complexities in different combinations of the features. Less study results in low marks, and too much burnout also results in low marks. It means there is a curve in the score graph which can’t be represented by a linear equation. The activation function is applied to the intermediate value and can be expressed as:</p>
<p><strong>a = g(z)</strong></p>
<pre><code class="lang-plaintext">a: Activation output
g: Activation function
z: Intermediate value or Pre-activation
</code></pre>
<p>For example: <code>ReLU</code> is an activation function which outputs <code>z</code> only if <code>z</code> is positive, else <code>0</code>.</p>
<p><strong>a = ReLU(z) = max(0, z)</strong></p>
<p>Its slope is not constant (0 for negative <code>z</code> and 1 for positive <code>z</code>), which makes it a non-linear activation function. It suits this scenario because it lets a value pass through to the next layer only if the combined effect of the features is greater than 0. Neuron 1 will pass its output to the next layer only if the intermediate value (<code>z</code>) that results from study hours, burnout, and rest is positive, that is, large enough to influence the final decision. Otherwise, it’s ignored. There are many other non-linear activation functions to choose from.</p>
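<p>To make the behaviour of ReLU concrete, here is a small sketch in plain Python. The sample pre-activation values are arbitrary and only serve to show one negative, one zero, and one positive case.</p>

```python
def relu(z):
    """Return z if it is positive, otherwise 0."""
    return max(0.0, z)


# Arbitrary sample pre-activations: negative, zero, and positive.
pre_activations = [-2.0, 0.0, 3.5]
activations = [relu(z) for z in pre_activations]
print(activations)  # [0.0, 0.0, 3.5]
```

<p>Only the positive value survives unchanged. The other two are replaced with 0, so they contribute nothing to the next layer.</p>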
<h3 id="heading-step-4-stack-layers-one-above-the-other"><strong>Step #4 - Stack Layers One Above the Other</strong></h3>
<p>This step helps in learning the mutual interactions between the patterns learned in the first hidden layer. The network attempts to understand the intricate details of the influencing factors and build a stable system. It is here that details such as whether sleep reduces burnout are figured out. Every layer applies a linear and a non-linear transformation to its input, which is the output of the previous layer. Multiple layers can be stacked one over the other in this way. In this example, for illustration, we have taken two hidden layers with two neurons each. The number of layers and neurons can vary based on requirements.</p>
<h3 id="heading-step-5-define-the-output-features"><strong>Step #5 - Define the Output Feature(s)</strong></h3>
<p>This is the final stage in a deep neural network. Ms. Poly can decide what she wants as output: the marks scored by a student, or whether the student passes or fails the exam. If she wants the final marks, the neuron in the final layer only has to apply a linear transformation to produce the output. If she wants the pass/fail status, that neuron has to apply both a linear transformation and a non-linear activation (for example, a sigmoid) to produce the result.</p>
<p>The diagram below shows an abstract representation of the deep neural network.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1766153114888/1e513840-483a-43cf-b062-ce5af886a04e.png" alt="Abstract Representation of a Deep Neural Network" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
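<p>The whole forward pass (input layer, two hidden layers with two neurons each, and an output neuron) can be sketched in a few lines of plain Python. Every weight and bias below is a made-up value, and the choice of a sigmoid for the pass/fail output is an assumption made for illustration, so the printed numbers carry no real meaning.</p>

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(W, x, b, activation=None):
    """Compute z = Wx + b, then apply the activation to each value if one is given."""
    z = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
    return [activation(v) for v in z] if activation else z


# Hypothetical inputs x1..x5 for one student.
x = [6.0, 7.0, 2.0, 3.0, 1.0]

# Hidden layer 1: five inputs, two neurons (illustrative weights).
a1 = layer([[0.8, 0.5, -0.6, 0.1, 0.1], [0.1, 0.1, 0.1, -0.7, 0.4]], x, [0.5, 1.0], relu)
# Hidden layer 2: two inputs (the outputs of layer 1), two neurons.
a2 = layer([[0.5, -1.0], [-0.2, 0.3]], a1, [0.0, 0.1], relu)

# Output option 1: predict marks with a linear transformation only.
marks = layer([[10.0, 5.0]], a2, [20.0])[0]
# Output option 2: predict pass/fail with a linear transformation plus a sigmoid.
pass_probability = layer([[1.0, 1.0]], a2, [-3.0], sigmoid)[0]

print(marks, pass_probability)
```

<p>The same <code>layer</code> function is reused at every stage. Only the weights, the biases, and the choice of activation change from one layer to the next.</p>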
<p>The next steps are:</p>
<p><strong>Training the model</strong>: The network is trained in the following way:</p>
<ul>
<li><p>Random weights and biases are assigned to the linear transformation portions of the network.</p>
</li>
<li><p>Then the network makes a prediction which is compared with the expected result.</p>
</li>
<li><p>If there are gaps between the actual result and the predicted result, corrections are made in weights and biases (this step is similar to what was done in linear regression and classification).</p>
</li>
<li><p>The steps above are repeated until the results improve.</p>
</li>
</ul>
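<p>The four steps above can be sketched with the simplest possible model: a single linear neuron trained with gradient descent. This is a deliberate simplification. A real deep network repeats the same idea across all layers using backpropagation. The data, learning rate, and number of repetitions are assumptions chosen so the example converges.</p>

```python
import random

# Hypothetical data: (hours studied, marks), generated from marks = 5 * hours + 10.
data = [(1.0, 15.0), (2.0, 20.0), (3.0, 25.0), (4.0, 30.0)]

random.seed(0)
w = random.uniform(-1.0, 1.0)  # Step 1: random weight
b = random.uniform(-1.0, 1.0)  # Step 1: random bias
learning_rate = 0.05

for epoch in range(2000):  # Step 4: repeat
    grad_w = 0.0
    grad_b = 0.0
    for x, y in data:
        prediction = w * x + b  # Step 2: make a prediction
        error = prediction - y  # Step 2: compare with the expected result
        # Gradient of the mean squared error with respect to w and b.
        grad_w += 2 * error * x / len(data)
        grad_b += 2 * error / len(data)
    w -= learning_rate * grad_w  # Step 3: correct the weight
    b -= learning_rate * grad_b  # Step 3: correct the bias

print(round(w, 3), round(b, 3))
```

<p>After enough repetitions, <code>w</code> and <code>b</code> should settle close to 5 and 10, the values that were used to generate the data.</p>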
<p><strong>Using the model</strong>: After the model has been trained, it is capable of yielding results for new input values.</p>
<h2 id="heading-final-thoughts"><strong>Final Thoughts</strong></h2>
<p>In this article, we began with the basics of a straight line equation. Then we gradually worked through slightly more elaborate concepts like linear regression and classification. They laid the groundwork for delving into the seemingly mysterious deep neural networks. In fact, these networks are built by stacking layers of linear transformations and non-linear activations, which helps them capture sophisticated real-world patterns.</p>
<p>Despite all the complexities and layers, we can see that the straight line remains the foundation upon which neural networks are built. As we saw earlier, the equation that a deep neural network begins with is our <em>magical equation:</em> <code>y = ax+b</code>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Set Up CUDA and WSL2 for Windows 11 (including PyTorch and TensorFlow GPU) ]]>
                </title>
                <description>
                    <![CDATA[ If you’re working on complex Machine Learning projects, you’ll need a good Graphics Processing Unit (or GPU) to power everything. And Nvidia is a popular option these days, as it has great compatibility and widespread support. If you’re new to Machin... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-set-up-cuda-and-wsl2-for-windows-11-including-pytorch-and-tensorflow-gpu/</link>
                <guid isPermaLink="false">69309b9e8c594b8177306456</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Windows ]]>
                    </category>
                
                    <category>
                        <![CDATA[ WSL ]]>
                    </category>
                
                    <category>
                        <![CDATA[ GPU ]]>
                    </category>
                
                    <category>
                        <![CDATA[ cuda ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Md. Fahim Bin Amin ]]>
                </dc:creator>
                <pubDate>Wed, 03 Dec 2025 20:20:46 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764786287487/f0c28401-ce77-4873-b238-59fc6b737ce7.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you’re working on complex Machine Learning projects, you’ll need a good Graphics Processing Unit (or GPU) to power everything. And Nvidia is a popular option these days, as it has great compatibility and widespread support.</p>
<p>If you’re new to Machine Learning and are just getting started, then a free <a target="_blank" href="https://www.kaggle.com/">Kaggle</a> or <a target="_blank" href="https://colab.research.google.com/">Colab</a> might be enough for you. But that won’t be the case when you want to go deeper. You’ll need a GPU, which can get costly if you’re continuously using it on the cloud.</p>
<p>But there’s some good news: you can utilize your computer’s Nvidia GPU (GTX/RTX) quite easily and perform machine learning-related tasks right on your local machine. The cool thing is, it won’t cost you anything other than the electricity it uses!</p>
<p>When you’re running Machine Learning models on your local machines, the most suitable operating system is a Linux-based one, like Ubuntu. But Windows has improved a lot for this purpose. If you’re using the latest Windows 11, you can leverage Windows Subsystem for Linux (WSL) and use your GPU directly for Machine Learning-related workflows.</p>
<p>This process can be quite tricky, though, as can making two popular Machine Learning frameworks, TensorFlow and PyTorch, compatible with your system GPU in Windows 11. That’s why I have written this comprehensive guide to ease your pain.</p>
<p>In it, I’ll help you set up CUDA on Windows Subsystem for Linux 2 (WSL2) so you can leverage your Nvidia GPU for machine learning tasks.</p>
<p>By following these steps, you’ll be able to run ML frameworks like TensorFlow and PyTorch with GPU acceleration on Windows 11.</p>
<p>Keep in mind that this guide assumes you have a compatible Nvidia GPU. Make sure to check <a target="_blank" href="https://developer.nvidia.com/cuda-gpus">Nvidia's official compatibility list</a> before proceeding.</p>
<p>I have also prepared a video that walks you through the steps in this article.</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/qOJ49nkU4rY" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
<p> </p>
<p>Also, if this tutorial helps you, then don’t forget to add a star to the GitHub repository <a target="_blank" href="https://github.com/FahimFBA/CUDA-WSL2-Ubuntu-v2">CUDA-WSL2-Ubuntu-v2</a>. If you face any issues or have any suggestions/improvements, then please raise an issue in the GitHub repository. Currently, the live website is available at <a target="_blank" href="https://ml-win11-v2.fahimbinamin.com/">ml-win11-v2.fahimbinamin.com</a>.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ol>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-windows-terminal">Windows Terminal</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-windows-powershell-latest-amp-greatest">Windows PowerShell (Latest &amp; Greatest)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configure-windows-terminal">Configure Windows Terminal</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configuration-of-my-computer">Configuration of my computer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cpu-virtualization">CPU Virtualization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-wsl2">Install WSL2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-latest-lts-ubuntu-via-wsl2">Install Latest LTS Ubuntu via WSL2</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-update-amp-upgrade-ubuntu-packages">Update &amp; Upgrade Ubuntu Packages</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-and-configure-miniconda">Install and Configure Miniconda</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-jupyter-amp-ipykernel">Install Jupyter &amp; Ipykernel</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-nvidia-driver">Nvidia Driver</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-install-cuda-dependencies">Install CUDA dependencies</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cuda-toolkit">CUDA Toolkit</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-add-path-to-shell-profile-for-cuda">Add Path to Shell Profile for CUDA</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-nvcc-version">nvcc Version</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-cudnn-sdk">cuDNN SDK</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tensorflow-gpu">TensorFlow GPU</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-check-tensorflow-gpu">Check TensorFlow GPU</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-pytorch-gpu">PyTorch GPU</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-check-pytorch-gpu">Check PyTorch GPU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-check-pytorch-amp-tensorflow-gpu-inside-jupyter-notebook">Check PyTorch &amp; TensorFlow GPU inside Jupyter Notebook</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Before you begin, make sure you have the following requirements met:</p>
<ul>
<li><p>Windows 11 operating system</p>
</li>
<li><p>Nvidia GPU (GTX/RTX series)</p>
</li>
<li><p>Administrator access to your PC</p>
</li>
<li><p>At least 30 GB of free disk space</p>
</li>
<li><p>Internet connection for downloads</p>
</li>
<li><p>Latest Nvidia drivers installed</p>
</li>
</ul>
<h2 id="heading-windows-terminal">Windows Terminal</h2>
<p>First, you’ll need to ensure that you have Windows Terminal installed properly in your operating system. It is the newest terminal application for users of command-line tools and shells like Command Prompt, PowerShell, and WSL. You can download it from the <a target="_blank" href="https://apps.microsoft.com/detail/9N0DX20HK701?hl=en-us&amp;gl=BD&amp;ocid=pdpshare">Microsoft Store</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094104150/c73ae561-6888-4eea-9419-186c6659a62f.png" alt="Preview of Windows Terminal on Windows 11" class="image--center mx-auto" width="1133" height="641" loading="lazy"></p>
<p>After ensuring that it’s installed properly, you can proceed to the next steps.</p>
<h2 id="heading-windows-powershell-latest-amp-greatest">Windows PowerShell (Latest &amp; Greatest)</h2>
<p>The latest PowerShell is a modern and updated command-line shell from Microsoft. You can use some Linux-specific commands directly in it, and it comes with built-in command suggestions. You can download it from the <a target="_blank" href="https://github.com/PowerShell/PowerShell/releases/">official GitHub page</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094138179/78315197-f4f2-4df4-b022-37cb9e74cda2.png" alt="Preview of Windows PowerShell on GitHub" class="image--center mx-auto" width="1519" height="904" loading="lazy"></p>
<p>Download the latest x64 installer and install it. After ensuring that it is installed properly, you can proceed to the next steps.</p>
<h2 id="heading-configure-windows-terminal">Configure Windows Terminal</h2>
<p>Now you’ll need to configure your Windows Terminal to use PowerShell as the default shell. It’s optional and you might skip this step. But I recommend doing it for a better experience.</p>
<p>Open Windows Terminal. Click on the down arrow icon in the title bar and select "Settings".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094162440/6ea767c8-da3b-4280-84f8-0eb2b0647a46.png" alt="Preview of Windows PowerShell settings window" class="image--center mx-auto" width="1166" height="660" loading="lazy"></p>
<p>In the Settings tab, under "Startup", find the "Default profile" dropdown menu. Select "PowerShell" from the list.</p>
<p>Now for the "Default terminal application", select "Windows Terminal".</p>
<p>By default, PowerShell prints a banner with its version number every time it starts. If you want to disable it, select the "PowerShell" profile from the left sidebar. Click on the "Command Line" field and add the <code>--nologo</code> argument at the end of the command. After this, the line becomes <code>"C:\Program Files\PowerShell\7\pwsh.exe" --nologo</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094185648/3641d5f0-ba34-44b9-8a63-86b53068d02e.png" alt="Preview of Windows PowerShell --nologo setting" class="image--center mx-auto" width="1170" height="654" loading="lazy"></p>
<p>If you don’t use other shells frequently and want to hide them in the dropdown, then you’ll need to select those profiles one by one from the left sidebar. Scroll down to the bottom and find the "Hide profile from dropdown" toggle and enable it. It will hide that specific shell from the dropdown menu.</p>
<p>For example, I am hiding the <strong>Azure Cloud Shell</strong> profile as I don't use it frequently:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094214632/73add1b7-bcdd-4368-86a6-975fa2f72b54.png" alt="Preview of hiding profiles in Windows Terminal" class="image--center mx-auto" width="1151" height="657" loading="lazy"></p>
<p>Now click on the "Save" button at the bottom right corner to apply the changes. Close the Windows Terminal for now.</p>
<h2 id="heading-configuration-of-my-computer">Configuration of My Computer</h2>
<p>I figured it’d be helpful to share my current computer’s configuration so you can have a clear idea of which setup I’m using in this guide. Here are the details:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Component</strong></td><td><strong>Specification</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Processor</strong></td><td>AMD Ryzen 7 7700 8-Core Processor (8 Core 16 Threads)</td></tr>
<tr>
<td><strong>RAM</strong></td><td>64GB DDR5 6000MHz</td></tr>
<tr>
<td><strong>Storage</strong></td><td>1 TB Samsung 980 NVMe SSD, 4 TB HDD, 2 TB SATA SSD</td></tr>
<tr>
<td><strong>GPU</strong></td><td>NVIDIA GeForce RTX 3060 12GB GDDR6</td></tr>
<tr>
<td><strong>Operating System</strong></td><td>Windows 11 Pro Version 25H2</td></tr>
</tbody>
</table>
</div><p>Now that you have an idea about my computer’s configuration, we can proceed to the next steps.</p>
<h2 id="heading-cpu-virtualization">CPU Virtualization</h2>
<p>As we are going to use WSL2, we’ll need to make sure that the CPU virtualization is enabled. To check whether virtualization is enabled or not from Windows, simply open the Windows Task Manager. Go to the Performance tab and select CPU from the left sidebar. In the bottom right corner, you will see the Virtualization status. If it shows "Enabled", then you are good to go. If it shows "Disabled", then you need to enable it from the BIOS.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094252181/29efa40c-ec0a-4d99-adb7-50596348a1aa.png" alt="Preview of Virtualization enabled status in Windows Task Manager" class="image--center mx-auto" width="824" height="760" loading="lazy"></p>
<p>⚠️ You have to ensure that CPU Virtualization is enabled in your BIOS settings. Different manufacturers have different ways to access the BIOS. Usually, you can access the BIOS by pressing the Delete or F2 key during the boot process. Once in BIOS, look for settings related to "Virtualization Technology" or "Intel VT-x"/"AMD-V" and make sure it is enabled. Save the changes and exit the BIOS.</p>
<h2 id="heading-install-wsl2">Install WSL2</h2>
<p>Open the Windows Terminal or Windows PowerShell as an administrator. Run the following command to install WSL2 along with the latest Ubuntu LTS distribution:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-install</span>
</code></pre>
<p>It will install Windows Subsystem for Linux 2 (WSL2). After the installation is complete, you will be prompted to restart your computer. Do so to finalize the installation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094306994/41db30c0-ecb9-4436-a425-8a059b199c42.png" alt="Preview of WSL installation in Windows PowerShell" class="image--center mx-auto" width="1295" height="656" loading="lazy"></p>
<p>⚠️ If you encounter any issues during installation, refer to the <a target="_blank" href="https://learn.microsoft.com/en-us/windows/wsl/troubleshooting">official Microsoft documentation</a> for troubleshooting WSL installation problems.</p>
<h2 id="heading-install-latest-lts-ubuntu-via-wsl2">Install Latest LTS Ubuntu via WSL2</h2>
<p>Open the Windows Terminal or Windows PowerShell again with administrator privileges. If you want to check the available Linux distributions to install via WSL, run the following command:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-list</span> -<span class="hljs-literal">-online</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094455888/8f1f2382-41cc-410f-a7b9-a47d3bb634b6.png" alt="Preview of available WSL distributions in Windows PowerShell" class="image--center mx-auto" width="1291" height="660" loading="lazy"></p>
<p>For installing any specific distribution, run the following command:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-install</span> &lt;DistroName&gt;
</code></pre>
<p>We are going to install the latest LTS Ubuntu distribution. As of now, the latest LTS version is Ubuntu 24.04. But I prefer to install the <code>Ubuntu</code> distribution directly, as it always points to the latest LTS version. So, run the following command:</p>
<pre><code class="lang-powershell">wsl.exe -<span class="hljs-literal">-install</span> Ubuntu
</code></pre>
<p>You need to give it a default user account name. For me, I am going with <code>fahim</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094505280/9beb24de-54da-4e0c-993d-b15f985867e3.png" alt="Preview of Ubuntu installation in Windows PowerShell" class="image--center mx-auto" width="1666" height="858" loading="lazy"></p>
<p>It also comes with a nice GUI management tool for WSL.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094530944/89073fb9-881f-48bd-b5ef-a0b08f74e4c5.png" alt="Preview of WSL GUI management tool" class="image--center mx-auto" width="1114" height="724" loading="lazy"></p>
<p>From its settings window, you can configure many options, including limits on the processor cores, RAM, and disk space that WSL can use.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094551095/66aea1e1-e204-4115-80e0-b3dea2d7a2ac.png" alt="Preview of WSL GUI settings window (Memory &amp; Processor)" class="image--center mx-auto" width="1919" height="1024" loading="lazy"></p>
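<p>Once Ubuntu is installed, you may want to confirm that a script is really running under WSL2. The sketch below uses a common heuristic: WSL2 kernel release strings usually contain both <code>microsoft</code> and <code>WSL2</code>. The function name <code>is_wsl2</code> is hypothetical, and the check is a heuristic rather than an official API.</p>

```python
import platform


def is_wsl2(kernel_release):
    """Heuristic check: WSL2 kernel releases usually mention 'microsoft' and 'WSL2'."""
    release = kernel_release.lower()
    return "microsoft" in release and "wsl2" in release


if __name__ == "__main__":
    release = platform.uname().release
    verdict = "looks like WSL2" if is_wsl2(release) else "does not look like WSL2"
    print(release + ": " + verdict)
```

<p>You can also run <code>wsl.exe --list --verbose</code> from PowerShell to see which WSL version each installed distribution uses.</p>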
<h2 id="heading-update-amp-upgrade-ubuntu-packages">Update &amp; Upgrade Ubuntu Packages</h2>
<p>Open your Ubuntu terminal from Windows Terminal. First, we need to update and upgrade the existing packages to their latest versions.</p>
<p>To update the Ubuntu system, simply use the following command:</p>
<pre><code class="lang-bash">sudo apt update -y
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094594281/be41e056-7e55-4139-b84b-6b7921a2d435.png" alt="Preview of apt update command in Ubuntu terminal" class="image--center mx-auto" width="1649" height="888" loading="lazy"></p>
<p>To upgrade all the packages at once, simply use the following command:</p>
<pre><code class="lang-bash">sudo apt upgrade -y
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094627958/b1c17b1c-5290-470b-aafe-5b89bb03bd01.png" alt="Preview of apt upgrade command in Ubuntu terminal" class="image--center mx-auto" width="1659" height="934" loading="lazy"></p>
<p>⚠️ Make sure that you have a stable internet connection during the update and upgrade process to avoid any interruptions.</p>
<h2 id="heading-install-and-configure-miniconda">Install and Configure Miniconda</h2>
<p>In Machine Learning, we need to manage multiple environments with different package versions. Conda is a popular package and environment management system that makes it easy to create and manage isolated environments for different projects. We will install Miniconda, a minimal installer for Conda, to manage our Python environments. But if you prefer Anaconda, you can install it instead.</p>
<p>Go to the official website of Miniconda. Currently the Miniconda installer is inside Anaconda <a target="_blank" href="https://www.anaconda.com/docs/getting-started/miniconda/install">here</a>. If the official website gets updated, you can always search for "Miniconda installer" on Google to find the latest version. Also, you can create an issue in the <a target="_blank" href="https://github.com/FahimFBA/CUDA-WSL2-Ubuntu-v2/issues">official GitHub repository of this project</a> to notify me about it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094667031/7ee2c854-88b6-49ce-8c04-41bf0a052c90.png" alt="Preview of Miniconda official website" class="image--center mx-auto" width="1895" height="935" loading="lazy"></p>
<p>As we are installing it inside WSL, we have to select the macOS/Linux Installation. Then select Linux Terminal Installer and choose Linux x86 for downloading the installer.</p>
<pre><code class="lang-bash">wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
</code></pre>
<p>It will download the installer to your WSL directory. Then use the following command to install it properly:</p>
<pre><code class="lang-bash">bash ~/Miniconda3-latest-Linux-x86_64.sh
</code></pre>
<p>⚠️ Make sure that you are in the correct directory where the installer is downloaded. If you downloaded it to a different location, adjust the path accordingly. Also, replace bash with zsh or sh if you are using a different shell.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094706995/3a317eb9-0340-4a84-8826-45324c93dd2f.png" alt="Preview of Miniconda installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1794" height="922" loading="lazy"></p>
<p>Make sure to choose the initialization option properly. I prefer to keep the conda env active whenever I open a new shell. Therefore, I chose "Yes".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094727839/f3fc8902-0c37-432c-a912-a92810e89fd1.png" alt="Preview of Miniconda initialization option during installation" class="image--center mx-auto" width="1656" height="924" loading="lazy"></p>
<p>Make sure that the installation succeeds without any errors.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094754454/53dfd998-62c9-4c2a-a71e-0d33e123e027.png" alt="Preview of successful Miniconda installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1652" height="914" loading="lazy"></p>
<p>For the changes to take effect, you can close and reopen the current shell. But you can also do that without closing and reopening the shell by applying the command below.</p>
<pre><code class="lang-bash"><span class="hljs-built_in">source</span> ~/.bashrc
</code></pre>
<p>⚠️ If you’re using a different shell like zsh or fish, make sure to source the appropriate configuration file (e.g., ~/.zshrc for zsh).</p>
<h2 id="heading-install-jupyter-amp-ipykernel">Install Jupyter &amp; Ipykernel</h2>
<p>I prefer to use Jupyter Notebook for running my machine learning experiments. It provides an interactive environment for coding and data analysis. We’ll install Jupyter Notebook and Ipykernel to run Jupyter notebooks in our conda environment. We will do that in all conda environments starting with the <strong>base</strong> environment. It also helps us to keep the conda environment kernel inside Jupyter Notebook.</p>
<p>First, make sure that you are in the base conda environment. You will see (base) on the left side of the terminal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094812122/66ad5de8-7553-42da-b920-78d20c3bdc9a.png" alt="Preview of conda base environment in WSL Ubuntu terminal" class="image--center mx-auto" width="1917" height="1027" loading="lazy"></p>
<p>Now install both Jupyter and Ipykernel with the following command:</p>
<pre><code class="lang-bash">conda install jupyter ipykernel -y
</code></pre>
<p>Make sure that you accept the terms of service of Conda.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094839808/90fe3dcf-053d-4bc7-a031-22f81eb706ca.png" alt="Preview of Jupyter and Ipykernel installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1659" height="927" loading="lazy"></p>
<p>Now, I will create a separate conda environment to hold both TensorFlow and PyTorch with GPU support. You can also install them directly in the base environment or in any other environment as per your preference. I am not pinning a specific Python version while creating the environment, so conda will install the latest stable version of Python available to it.</p>
<pre><code class="lang-bash">conda create --name ml python -y
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094865498/ac9ef1f1-4494-4221-8376-5e257c4f9243.png" alt="Preview of creating a new conda environment named 'ml' in WSL Ubuntu terminal" class="image--center mx-auto" width="1659" height="925" loading="lazy"></p>
<p>To activate any specific conda environment, you have to use the following command:</p>
<pre><code class="lang-bash">conda activate &lt;conda-env-name&gt;
</code></pre>
<p>For example, if I want to activate my newly created <strong>ml</strong> environment, I will use this command:</p>
<pre><code class="lang-bash">conda activate ml
</code></pre>
<p>If you’re not sure which conda environments exist on your system, you can list all of them by running the following command:</p>
<pre><code class="lang-bash">conda env list
</code></pre>
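<p>If you want to check for an environment from a script, <code>conda env list</code> also accepts a <code>--json</code> flag. The following is a minimal sketch that assumes the JSON output contains an <code>envs</code> list of environment paths. The helper name <code>env_names</code> is hypothetical.</p>

```python
import json
import os
import shutil
import subprocess


def env_names(conda_json):
    """Return the folder name of each path listed under "envs" in the JSON text."""
    paths = json.loads(conda_json).get("envs", [])
    return [os.path.basename(path.rstrip("/")) for path in paths]


if __name__ == "__main__":
    if shutil.which("conda") is None:
        print("conda was not found on PATH")
    else:
        result = subprocess.run(
            ["conda", "env", "list", "--json"], capture_output=True, text=True
        )
        if result.returncode == 0:
            print(env_names(result.stdout))
```

<p>Note that the base environment shows up under the name of its install folder (for example, <code>miniconda3</code>), not as <code>base</code>.</p>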
<h2 id="heading-nvidia-driver">Nvidia Driver</h2>
<p>Ensure that you have the latest Nvidia drivers installed on Windows. WSL2 uses the Windows driver, so no separate driver installation is needed in Ubuntu. You can download the latest drivers from the <a target="_blank" href="https://www.nvidia.com/Download/index.aspx">official Nvidia website</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094915617/cd9b0bfc-77a1-45f1-9dab-4349c8f489ef.png" alt="Preview of Nvidia driver download page" class="image--center mx-auto" width="1750" height="916" loading="lazy"></p>
<p>After installing the latest GPU driver, restart your computer to ensure the changes take effect. You can use either the GeForce Game Ready Driver or the NVIDIA Studio Driver, but I recommend the Studio Driver for better stability with creative and ML applications.</p>
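<p>To confirm which driver the system sees, you can query <code>nvidia-smi</code>. The sketch below assumes <code>nvidia-smi</code> is available on your <code>PATH</code> and parses its CSV output. The helper name <code>parse_gpu_lines</code> is hypothetical.</p>

```python
import shutil
import subprocess

# Ask nvidia-smi for the GPU name, driver version, and total memory as CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_lines(output):
    """Turn each 'name, driver, memory' line into a dictionary."""
    gpus = []
    for line in output.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({"name": name, "driver": driver, "memory": memory})
    return gpus


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi was not found on PATH")
    else:
        result = subprocess.run(QUERY, capture_output=True, text=True)
        if result.returncode == 0:
            print(parse_gpu_lines(result.stdout))
```

<p>If the script reports that <code>nvidia-smi</code> is missing, the Windows driver is usually the first thing to check.</p>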
<h2 id="heading-install-cuda-dependencies">Install CUDA Dependencies</h2>
<p>You might face some issues if you do not have the CUDA dependencies installed properly. I recommend that you install the required dependencies before proceeding further:</p>
<pre><code class="lang-bash">sudo apt install gcc g++ build-essential
</code></pre>
<p>After installing the dependencies, retry the CUDA installation and verify it again if you ran into any issues earlier.</p>
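<p>A quick way to confirm that the build tools are in place is to look them up on <code>PATH</code>. This sketch checks for <code>gcc</code>, <code>g++</code>, and <code>make</code> (which <code>build-essential</code> provides). The helper name <code>missing_tools</code> is hypothetical.</p>

```python
import shutil

REQUIRED_TOOLS = ["gcc", "g++", "make"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required build tools were found")
```

<p>The <code>which</code> parameter exists only so the function can be tested without touching the real system.</p>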
<h2 id="heading-cuda-toolkit">CUDA Toolkit</h2>
<p>TensorFlow GPU is very picky about the CUDA version. So we need to install a specific version of CUDA Toolkit that is compatible with the TensorFlow version we are going to install.</p>
<p>To understand exactly which CUDA version is compatible with which TensorFlow version, you can check the official TensorFlow GPU support matrix <a target="_blank" href="https://www.tensorflow.org/install/pip">here</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095089103/87a44961-9426-4d20-95ac-cde06961b41a.png" alt="Preview of TensorFlow GPU support in official docs" class="image--center mx-auto" width="1879" height="931" loading="lazy"></p>
<p>At the time I’m writing this article, the TensorFlow GPU documentation says that we should have CUDA Toolkit 12.3, so I will install exactly that version. You can click on that version link in the official docs, and it will redirect you to the official Nvidia CUDA Toolkit download page. If the link changes in the future, you can search for "Nvidia CUDA Toolkit archive" to find the version you need.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095106589/19689d63-5ebd-4783-8da4-e3dedd277efb.png" alt="Preview of Nvidia CUDA Toolkit official website" class="image--center mx-auto" width="1620" height="925" loading="lazy"></p>
<p>Since TensorFlow GPU asks for version 12.3 exactly, I will select version 12.3.0.</p>
<p>On the CUDA Toolkit download page, choose Linux as the Operating System, x86_64 as the Architecture, WSL-Ubuntu as the Distribution, 2.0 as the Version, and runfile (local) as the Installer Type.</p>
<p>⚠️ As we are using Ubuntu in WSL2, you can also choose Ubuntu as the distribution. But I prefer WSL-Ubuntu for better compatibility.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095151533/b6996611-d4ce-4e07-9c73-30bdc93dbf19.png" alt="Preview of CUDA Toolkit 12.3 download page for WSL-Ubuntu" class="image--center mx-auto" width="1311" height="898" loading="lazy"></p>
<p>After you select those options, the page shows the download commands. Run them in order. During installation, make sure that you <strong>uncheck "Kernel Objects" in the CUDA installer</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095169368/c2f81594-536f-4788-b765-1aab3b040fa7.png" alt="Preview of CUDA Toolkit 12.3 download commands for WSL-Ubuntu" class="image--center mx-auto" width="1895" height="1001" loading="lazy"></p>
<p>⚠️ Make sure to copy and paste the commands one by one in your WSL Ubuntu terminal to download and install the CUDA Toolkit properly. If you face any issues related to CUDA dependency, then quickly go through the <a class="post-section-overview" href="#heading-install-cuda-dependencies">Install CUDA dependencies</a> section, where I have explained how to install the CUDA dependencies properly.</p>
<h2 id="heading-add-path-to-shell-profile-for-cuda">Add Path to Shell Profile for CUDA</h2>
<p>After installing CUDA Toolkit, we need to add the CUDA binaries to our shell profile for easy access. This will allow us to run CUDA commands from any directory in the terminal.</p>
<p>Note that, depending on the shell you are using (bash, zsh, and so on), you need to add the CUDA path to the appropriate configuration file. Make sure to replace <strong>.bashrc</strong> with <strong>.zshrc</strong> or other configuration files if you are using a different shell.</p>
<p>To add the CUDA binary path, run the command below:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">'export PATH=/usr/local/cuda-12.3/bin:$PATH'</span> &gt;&gt; ~/.bashrc
</code></pre>
<p>Use the path where CUDA was actually installed on your machine. Your terminal shows it after the CUDA installation finishes:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095215437/15768563-c956-472e-9633-95b3dd1cb7a3.png" alt="Preview of CUDA installation path in WSL Ubuntu terminal" class="image--center mx-auto" width="1912" height="1011" loading="lazy"></p>
<p>Next, add the CUDA library directory to <code>LD_LIBRARY_PATH</code>. Again, use the exact path where you installed CUDA, as listed in your terminal.</p>
<pre><code class="lang-bash"><span class="hljs-built_in">echo</span> <span class="hljs-string">'export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH'</span> &gt;&gt; ~/.bashrc
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095242744/3c708db4-d267-4043-aa11-d04d890904f9.png" alt="Preview of CUDA library path in WSL Ubuntu terminal" class="image--center mx-auto" width="1284" height="693" loading="lazy"></p>
<p>After adding those paths, you need to source the shell profile for the changes to take effect. You can do that by running the following command:</p>
<pre><code class="lang-bash"><span class="hljs-built_in">source</span> ~/.bashrc
</code></pre>
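<p>If you want to confirm that the two exports took effect, a small Python check can help. This is only a sketch: the helper name <code>dir_on_path</code> is my own, and <code>/usr/local/cuda-12.3</code> is assumed to be your install location, so adjust it to whatever your terminal reported.</p>

```python
import os


def dir_on_path(directory, path_value):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    wanted = os.path.normpath(directory)
    entries = [os.path.normpath(p) for p in path_value.split(os.pathsep) if p]
    return wanted in entries


# Assumption: CUDA 12.3 was installed to the default location.
cuda_home = "/usr/local/cuda-12.3"

print("bin on PATH:",
      dir_on_path(cuda_home + "/bin", os.environ.get("PATH", "")))
print("lib64 on LD_LIBRARY_PATH:",
      dir_on_path(cuda_home + "/lib64", os.environ.get("LD_LIBRARY_PATH", "")))
```

<p>Run it from a new terminal (or after sourcing your profile). If either line prints <code>False</code>, re-check the path you appended to your shell profile.</p>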
<h2 id="heading-nvcc-version">nvcc Version</h2>
<p>NVCC stands for Nvidia CUDA Compiler. It is the compiler driver for the CUDA platform, which lets developers write parallel programs that run on Nvidia GPUs. Now that the CUDA Toolkit is installed, we need to confirm that the compiler is available on the path.</p>
<p>Verify this by checking the version:</p>
<pre><code class="lang-bash">nvcc --version
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095277858/2d1ded0a-01ac-4f78-9f6c-ac499d623207.png" alt="Preview of nvcc version check in WSL Ubuntu terminal" class="image--center mx-auto" width="1839" height="946" loading="lazy"></p>
<p>If the output shows the correct CUDA version, then you have successfully installed CUDA Toolkit in your WSL2 Ubuntu environment.</p>
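<p>If you would rather compare the version in a script than by eye, the release number can be pulled out of the <code>nvcc --version</code> text with a regular expression. This is a minimal sketch: the function name is hypothetical, and the sample line below is only an illustration of the usual output format, not text copied from your machine.</p>

```python
import re


def parse_nvcc_release(output):
    """Extract the 'major.minor' release from `nvcc --version` text, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", output)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


# Illustrative sample line (assumed format); your build number will differ.
sample = "Cuda compilation tools, release 12.3, V12.3.52"
print(parse_nvcc_release(sample))
```

<p>To use it on real output, you could pass in the text returned by <code>subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout</code> and compare the result with the version TensorFlow asks for.</p>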
<h2 id="heading-cudnn-sdk">cuDNN SDK</h2>
<p>The cuDNN (CUDA Deep Neural Network) SDK is a <a target="_blank" href="https://developer.nvidia.com/cudnn">GPU accelerated library of primitives for deep neural networks</a>, developed by Nvidia. It provides highly optimized building blocks for common deep learning operations, significantly speeding up the training and inference processes of AI models on Nvidia GPUs.</p>
<p>Note: Even though TensorFlow GPU suggests a specific cuDNN version, it’s often compatible with multiple versions. Because of this, I recommend downloading the latest cuDNN version that is compatible with your installed CUDA version. You can find the cuDNN download page <a target="_blank" href="https://developer.nvidia.com/cudnn-downloads">here</a>.</p>
<p>Select the Operating System as Linux, Architecture as x86_64, Distribution as Ubuntu, Version as 24.04, Installer Type as deb (local), Configuration as FULL. After selecting those, it will give you the download commands. You have to apply them sequentially.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095312370/1fca5959-f492-4160-8027-deec0674863b.png" alt="Preview of cuDNN download commands for Ubuntu 24.04" class="image--center mx-auto" width="1543" height="938" loading="lazy"></p>
<p>⚠️ Make sure to copy and paste the commands one by one in your WSL Ubuntu terminal to download and install the cuDNN SDK properly. If you face any issues related to CUDA dependency, then quickly go through the <a class="post-section-overview" href="#heading-install-cuda-dependencies">Install CUDA dependencies</a> section, where I have explained how to install the CUDA dependencies properly.</p>
<h2 id="heading-tensorflow-gpu">TensorFlow GPU</h2>
<p>Now, we are going to install TensorFlow GPU in our conda environment. Make sure that you have activated the conda environment where you want to install it. I’m going to install it in my previously created <strong>ml</strong> environment. To activate it, I’ll use the following command:</p>
<pre><code class="lang-bash">conda activate ml
</code></pre>
<p>⚠️ Make sure that you have activated the correct conda environment before installing TensorFlow GPU. You will see the environment name in the terminal prompt.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095398777/0c7d8813-eb6c-4e2e-bad9-1fc7d344d7a2.png" alt="Preview of activating 'ml' conda environment in WSL Ubuntu terminal" class="image--center mx-auto" width="1227" height="692" loading="lazy"></p>
<p>I will install ipykernel and jupyter in this new environment.</p>
<pre><code class="lang-bash">conda install jupyter ipykernel -y
</code></pre>
<p>Now, to install TensorFlow GPU, I will simply use the following command:</p>
<pre><code class="lang-bash">pip install <span class="hljs-string">"tensorflow[and-cuda]"</span>
</code></pre>
<p>It might take a couple of minutes depending on the internet speed you have. Just have patience and wait for it to finish the installation.</p>
<h3 id="heading-check-tensorflow-gpu">Check TensorFlow GPU</h3>
<p>After installing TensorFlow GPU, we need to verify that it is working properly with GPU support. Run the following command in your Ubuntu terminal:</p>
<pre><code class="lang-bash">python3 -c <span class="hljs-string">"import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"</span>
</code></pre>
<p>If the output shows a list of available GPU devices, then TensorFlow GPU is successfully installed and working properly.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095453933/ccda58fc-9ae9-4185-9c78-6196c98d8b7c.png" alt="Preview of TensorFlow GPU check in WSL Ubuntu terminal" width="1903" height="1029" loading="lazy"></p>
<h2 id="heading-pytorch-gpu">PyTorch GPU</h2>
<p>Now, we’re going to install PyTorch GPU in our conda environment. Make sure that you have activated the conda environment where you want to install it. I’m going to install it in my previously created ml environment. To activate it, I will use the following command:</p>
<pre><code class="lang-bash">conda activate ml
</code></pre>
<p>Installing PyTorch GPU is very straightforward. You can use the official PyTorch installation command generator <a target="_blank" href="https://pytorch.org/get-started/locally/">here</a>.</p>
<p>Select the latest Stable release as the PyTorch Build, Linux as Your OS, Pip as the Package, and Python as the Language. For the Compute Platform, select the CUDA version that matches your installed CUDA Toolkit. For me, that is CUDA 12.3. If you cannot find the exact version, choose the closest one. CUDA 12.3 is not available for me right now, so I am choosing CUDA 12.6.</p>
<p>After selecting those, it will give you the installation command. You have to apply it in your WSL Ubuntu terminal.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095511862/6f631369-c8db-4681-9d1c-669ad88df69d.png" alt="Preview of PyTorch installation command generator" class="image--center mx-auto" width="1618" height="911" loading="lazy"></p>
<p>It might take a couple of minutes depending on the internet speed you have. Just have patience and wait for it to finish the installation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095532246/56232263-36ea-4043-9881-df162965c514.png" alt="Preview of PyTorch GPU installation in WSL Ubuntu terminal" class="image--center mx-auto" width="1280" height="689" loading="lazy"></p>
<h3 id="heading-check-pytorch-gpu">Check PyTorch GPU</h3>
<p>After installing PyTorch GPU, verify that it is working properly with GPU support. Run the following command in your Ubuntu terminal (it pipes a short script into Python):</p>
<pre><code class="lang-bash">python3 - &lt;&lt; <span class="hljs-string">'EOF'</span>
import torch
<span class="hljs-built_in">print</span>(torch.cuda.is_available())
<span class="hljs-built_in">print</span>(torch.cuda.device_count())
<span class="hljs-built_in">print</span>(torch.cuda.current_device())
<span class="hljs-built_in">print</span>(torch.cuda.device(0))
<span class="hljs-built_in">print</span>(torch.cuda.get_device_name(0))
EOF
</code></pre>
<p>The output should look similar to the screenshot, showing:</p>
<ul>
<li><p><strong>True</strong>: GPU is available for PyTorch</p>
</li>
<li><p><strong>1</strong>: Number of detected CUDA devices</p>
</li>
<li><p><strong>0</strong>: Index of the current active CUDA device</p>
</li>
<li><p>A device object representation</p>
</li>
<li><p><strong>NVIDIA GeForce RTX 3060</strong> (or your GPU name)</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095584921/69269152-7ea6-404b-b1ca-8534b51f2491.png" alt="Preview of PyTorch GPU check in WSL Ubuntu terminal" class="image--center mx-auto" width="1917" height="937" loading="lazy"></p>
<h3 id="heading-check-pytorch-amp-tensorflow-gpu-inside-jupyter-notebook">Check PyTorch &amp; TensorFlow GPU inside Jupyter Notebook</h3>
<p>Now that the environment is fully configured, we will verify GPU support directly inside Jupyter Notebook. This ensures both PyTorch and TensorFlow can successfully detect and use your GPU.</p>
<h4 id="heading-1-test-pytorch-gpu">1. Test PyTorch GPU</h4>
<p>Create a new Jupyter Notebook and run the following commands one by one:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.device(<span class="hljs-number">0</span>))
print(torch.cuda.get_device_name(<span class="hljs-number">0</span>))
</code></pre>
<p>If everything is configured correctly, you will see your GPU (for example <strong>NVIDIA GeForce RTX 3060</strong>) detected properly:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095624229/f94c97a0-2e44-45ad-a2a8-52f40c922482.png" alt="Preview of PyTorch GPU check inside Jupyter Notebook" class="image--center mx-auto" width="1861" height="743" loading="lazy"></p>
<h4 id="heading-2-test-tensorflow-gpu">2. Test TensorFlow GPU</h4>
<p>Next, run the following code to check whether TensorFlow detects your GPU:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf

print(tf.config.list_physical_devices(<span class="hljs-string">'GPU'</span>))
</code></pre>
<p>You can also check the number of GPUs detected:</p>
<pre><code class="lang-python">print(<span class="hljs-string">"Num GPUs Available:"</span>, len(tf.config.list_physical_devices(<span class="hljs-string">'GPU'</span>)))
</code></pre>
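<p>Finally, run TensorFlow’s built-in GPU checks. Warnings are normal here: <code>tf.test.is_gpu_available()</code> is deprecated in favor of <code>tf.config.list_physical_devices('GPU')</code>, so TensorFlow prints a deprecation notice:</p>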
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf

<span class="hljs-keyword">assert</span> tf.test.is_gpu_available()
<span class="hljs-keyword">assert</span> tf.test.is_built_with_cuda()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095666216/f9017979-b5c9-4b86-9f60-d9aaa2fe8ac1.png" alt="TensorFlow GPU initialization and CUDA validation output" class="image--center mx-auto" width="1638" height="935" loading="lazy"></p>
<p>If TensorFlow logs show your GPU model (such as <strong>RTX 3060</strong>), then TensorFlow GPU is successfully installed and fully working inside Jupyter Notebook.</p>
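<p>If you want a single cell that you can re-run after driver or package updates, the two checks can be combined. The sketch below is my own convenience helper (the name <code>gpu_summary</code> is hypothetical). It reports <code>None</code> for a framework that is not installed in the active environment instead of raising an error.</p>

```python
import importlib.util


def gpu_summary():
    """Map each framework to True/False (GPU visible) or None (not installed)."""
    summary = {"torch": None, "tensorflow": None}

    if importlib.util.find_spec("torch") is not None:
        import torch
        summary["torch"] = bool(torch.cuda.is_available())

    if importlib.util.find_spec("tensorflow") is not None:
        import tensorflow as tf
        summary["tensorflow"] = bool(tf.config.list_physical_devices("GPU"))

    return summary


print(gpu_summary())
```

<p>On a correctly configured machine, both values should be <code>True</code>.</p>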
<h2 id="heading-conclusion">Conclusion</h2>
<p>Thank you so much for reading all the way through. I hope you have been able to configure your Windows 11 computer properly for running almost any kind of Machine Learning-based experiments.</p>
<p>To get more content like this, you can follow me on <a target="_blank" href="https://www.linkedin.com/in/fahimfba/">LinkedIn</a> and <a target="_blank" href="https://x.com/Fahim_FBA">X</a>. You can also check <a target="_blank" href="https://www.fahimbinamin.com/">my website</a> and follow me on <a target="_blank" href="https://github.com/FahimFBA">GitHub</a> if you are into open source and development.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build End-to-End Machine Learning Lineage ]]>
                </title>
                <description>
                    <![CDATA[ Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance. While many services for tracking ML lineage exist, creating a comprehensive and manageabl... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-end-to-end-machine-learning-lineage/</link>
                <guid isPermaLink="false">68f0f6719ac2ae80d4c5be03</guid>
                
                    <category>
                        <![CDATA[ mlops ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Thu, 16 Oct 2025 13:43:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760622158648/b990ff01-06f0-495d-8554-f832813609ab.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Machine learning lineage is critical in any robust ML system. It lets you track data and model versions, ensuring reproducibility, auditability, and compliance.</p>
<p>While many services for tracking ML lineage exist, creating a comprehensive and manageable lineage often proves complicated.</p>
<p>In this article, I’ll walk you through integrating a comprehensive ML lineage solution for an ML application deployed on serverless AWS Lambda, covering the end-to-end pipeline stages:</p>
<ul>
<li><p>ETL pipeline</p>
</li>
<li><p>Data drift detection</p>
</li>
<li><p>Preprocessing</p>
</li>
<li><p>Model tuning</p>
</li>
<li><p>Risk and fairness evaluation</p>
</li>
</ul>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ol>
<li><p><a class="post-section-overview" href="#heading-what-is-machine-learning-lineage">What is Machine Learning Lineage?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-well-build">What We’ll Build</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-the-system-architecture-ai-pricing-for-retailers">The System Architecture - AI Pricing for Retailers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-ml-lineage">The ML Lineage</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-workflow-in-action">Workflow in Action</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-1-initiating-a-dvc-project">Step 1: Initiating a DVC Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-the-ml-lineage">Step 2: The ML Lineage</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-stage-1-the-etl-pipeline">Stage 1: The ETL Pipeline</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-2-the-data-drift-check">Stage 2: The Data Drift Check</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-3-preprocessing">Stage 3: Preprocessing</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-4-tuning-the-model">Stage 4: Tuning the Model</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-5-performing-inference">Stage 5: Performing Inference</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-stage-6-assessing-model-risk-and-fairness">Stage 6: Assessing Model Risk and Fairness</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-test-in-local">Test in Local</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-deploying-the-dvc-project">Step 3: Deploying the DVC Project</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-configuring-scheduled-run-with-prefect">Step 4: Configuring Scheduled Run with Prefect</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-configuring-the-docker-image-registry">Configuring the Docker Image Registry</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-configure-prefect-tasks-and-flows">Configure Prefect Tasks and Flows</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-test-in-local-1">Test in Local</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-deploying-the-application">Step 5: Deploying the Application</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-test-in-local-2">Test in Local</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ol>
<h3 id="heading-prerequisites">Prerequisites:</h3>
<ul>
<li><p>Knowledge of key Machine Learning / Deep Learning concepts including the full lifecycle: data handling, model training, tuning, and validation.</p>
</li>
<li><p>Proficiency in Python, with experience using major ML libraries.</p>
</li>
<li><p>Basic understanding of DevOps principles.</p>
</li>
</ul>
<h3 id="heading-tools-well-use">Tools we’ll use:</h3>
<p>Here is a summary of the tools we’re going to use to track the ML lineage:</p>
<ul>
<li><p><strong>DVC</strong>: An open-source version control system for data. Used to track the ML lineage.</p>
</li>
<li><p><strong>AWS S3</strong>: A secure object storage service from AWS. Used as remote storage.</p>
</li>
<li><p><strong>Evidently AI</strong>: An open-source ML and LLM observability framework. Used to detect data drift.</p>
</li>
<li><p><strong>Prefect</strong>: A workflow orchestration engine. Used to manage the scheduled runs of the lineage.</p>
</li>
</ul>
<h2 id="heading-what-is-machine-learning-lineage">What is Machine Learning Lineage?</h2>
<p><strong>Machine learning (ML) lineage</strong> is a framework for tracking and understanding the complete lifecycle of a machine learning model.</p>
<p>It contains information at different levels such as:</p>
<ul>
<li><p><strong>Code:</strong> The scripts, libraries, and configurations for model training.</p>
</li>
<li><p><strong>Data:</strong> The original data, transformations, and features.</p>
</li>
<li><p><strong>Experiments:</strong> Training runs, hyperparameter tuning results.</p>
</li>
<li><p><strong>Models:</strong> The trained models and their versions.</p>
</li>
<li><p><strong>Predictions:</strong> The outputs of deployed models.</p>
</li>
</ul>
<p>ML lineage is essential for multiple reasons:</p>
<ul>
<li><p><strong>Reproducibility:</strong> Recreate the same model and prediction for validation.</p>
</li>
<li><p><strong>Root cause analysis:</strong> Trace back to the data, code, or configuration change when a model fails in production.</p>
</li>
<li><p><strong>Compliance:</strong> Some regulated industries require proof of model training to ensure fairness, transparency, and adherence to laws like GDPR and the EU AI Act.</p>
</li>
</ul>
<h2 id="heading-what-well-build">What We’ll Build</h2>
<p>In this project, I’ll integrate an ML lineage into <a target="_blank" href="https://levelup.gitconnected.com/building-a-dynamic-pricing-system-with-a-multi-layered-neural-network-c2a4c70bfcec">this price prediction system built on AWS Lambda architecture</a> using DVC, an open-source version control system for ML applications.</p>
<p>The below diagram illustrates the system architecture and the ML lineage we’ll integrate:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759825040233/5027e5dd-a2fc-4d35-b7a3-4d9184f5f179.png" alt="Figure A. A comprehensive ML lineage for an ML application on serverless Lambda (Created by Kuriko IWAI)" class="image--center mx-auto" width="25020" height="7926" loading="lazy"></p>
<p><strong>Figure A:</strong> A comprehensive ML lineage for an ML application on serverless Lambda (Created by <a target="_blank" href="https://kuriko-iwai.vercel.app/">Kuriko IWAI</a>)</p>
<h3 id="heading-the-system-architecture-ai-pricing-for-retailers">The System Architecture: AI Pricing for Retailers</h3>
<p>The system operates as a containerized, serverless microservice designed to provide optimal price recommendations to maximize retailer sales.</p>
<p>Its core intelligence comes from AI models trained on historical purchase data to predict the quantity of the product sold at various prices, allowing sellers to determine the best price.</p>
<p>For consistent deployment, the prediction logic and its dependencies are packaged into a Docker container image and stored in AWS ECR (Elastic Container Registry).</p>
<p>The prediction is then served by an AWS Lambda function, which retrieves and runs the container from ECR and exposes the result via AWS API Gateway for the Flask application to consume.</p>
<p>If you want to see how to build this from the ground up, you can follow along with my tutorial <a target="_blank" href="https://www.freecodecamp.org/news/how-to-build-a-machine-learning-system-on-serverless-architecture/">How to Build a Machine Learning System on Serverless Architecture</a>.</p>
<h3 id="heading-the-ml-lineage">The ML Lineage</h3>
<p>In the system, GitHub handles the code lineage, while DVC captures the lineage of:</p>
<ul>
<li><p><strong>Data</strong> (blue boxes): ETL and preprocessing.</p>
</li>
<li><p><strong>Experiments</strong> (light orange): Hyperparameter tuning and validation.</p>
</li>
<li><p><strong>Models</strong> and <strong>Prediction</strong> (dark orange): Final model artifacts and prediction results.</p>
</li>
</ul>
<p><strong>DVC</strong> tracks the lineage through separate stages, from data extraction to fairness testing (yellow rows in Figure A).</p>
<p>For each stage, DVC uses an <strong>MD5</strong> or <strong>SHA256 hash</strong> to track and push metadata like artifacts, metrics, and reports to its remote on <strong>AWS S3</strong>.</p>
<p>The pipeline incorporates <strong>Evidently AI</strong> to handle data drift tests, which are essential for identifying shifts in data distributions that could compromise the model's generalization capabilities in production.</p>
<p>Only models that successfully pass both the data drift and fairness tests can serve predictions via the AWS API gateway (red box in Figure A).</p>
<p>Lastly, this entire lineage process is triggered weekly by the open-source workflow scheduler, <strong>Prefect</strong>.</p>
<p>Prefect prompts DVC to check for updates in data and scripts, and executes the full lineage process if changes are detected.</p>
<h2 id="heading-workflow-in-action">Workflow in Action</h2>
<p>The building process involves five main steps:</p>
<ol>
<li><p>Initiate a DVC project</p>
</li>
<li><p>Define the lineage stages in the DVC file <code>dvc.yaml</code> and the corresponding Python scripts</p>
</li>
<li><p>Deploy the DVC project</p>
</li>
<li><p>Configure scheduled run with Prefect</p>
</li>
<li><p>Deploy the application</p>
</li>
</ol>
<p>Let’s walk through each step together.</p>
<h2 id="heading-step-1-initiating-a-dvc-project">Step 1: Initiating a DVC Project</h2>
<p>The first step is to initiate a DVC project:</p>
<pre><code class="lang-bash">dvc init
</code></pre>
<p>This command automatically creates a <code>.dvc</code> directory at the root of the project folder:</p>
<pre><code class="lang-bash">.
.dvc/
│
└── cache/         <span class="hljs-comment"># [.gitignore] store dvc caches (cached actual data files)</span>
└── tmp/           <span class="hljs-comment"># [.gitignore]</span>
└── .gitignore     <span class="hljs-comment"># gitignore cache, tmp, and config.local</span>
└── config         <span class="hljs-comment"># dvc config for production</span>
└── config.local   <span class="hljs-comment"># [.gitignore] dvc config for local</span>
</code></pre>
<p>DVC maintains a fast, lightweight Git repository by separating the original data in large files from the repository.</p>
<p>The process involves caching the original data in the local <code>.dvc/cache</code> directory, creating a small <code>.dvc</code> metadata file which contains an MD5 hash and a link to the original data file path, pushing <em>only</em> the small metadata files to Git, and pushing the original data to the DVC remote.</p>
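<p>To make the idea of hash-based caching concrete, here is a minimal sketch of content addressing with MD5. It is not DVC's implementation: the helper names are hypothetical, and splitting the digest into a two-character directory plus the remainder is an assumption about the cache layout, which can vary between DVC versions.</p>

```python
import hashlib
from pathlib import Path


def md5_of_bytes(data):
    """Return the hex MD5 digest that identifies this content."""
    return hashlib.md5(data).hexdigest()


def cache_path(cache_dir, digest):
    """Place content under <cache_dir>/<first 2 hex chars>/<remaining chars>."""
    return Path(cache_dir) / digest[:2] / digest[2:]


digest = md5_of_bytes(b"hello")
print(digest)                            # the content's identifier
print(cache_path(".dvc/cache", digest))  # where a cached copy could live
```

<p>Because the identifier depends only on the file's content, two identical files map to the same cache entry, and any change to the data produces a new hash that Git can track through the small metadata file.</p>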
<h2 id="heading-step-2-the-ml-lineage">Step 2: The ML Lineage</h2>
<p>Next, we’ll configure the ML lineage with the following stages:</p>
<ol>
<li><p><code>etl_pipeline</code>: Extract, clean, and impute the original data, then perform feature engineering.</p>
</li>
<li><p><code>data_drift_check</code>: Run data drift tests. If they fail, the system exits.</p>
</li>
<li><p><code>preprocess</code>: Create training, validation, and test datasets.</p>
</li>
<li><p><code>tune_primary_model</code>: Tune hyperparameters and train the model.</p>
</li>
<li><p><code>inference_primary_model</code>: Perform inference on the test dataset.</p>
</li>
<li><p><code>assess_model_risk</code>: Run risk and fairness tests.</p>
</li>
</ol>
<p>Each stage requires defining the DVC command and its corresponding Python script.</p>
<p>Let’s get started.</p>
<h3 id="heading-stage-1-the-etl-pipeline">Stage 1: The ETL Pipeline</h3>
<p>The first stage extracts, cleans, and imputes the original data, then performs feature engineering.</p>
<h4 id="heading-dvc-configuration"><strong>DVC Configuration</strong></h4>
<p>We’ll create the <code>dvc.yaml</code> file at the root of the project directory and add the <code>etl_pipeline</code> stage:</p>
<p><code>dvc.yaml</code></p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment"># the main command dvc will run in this stage</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">python</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>

    <span class="hljs-comment"># dependencies necessary to run the main command</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>

    <span class="hljs-comment"># output paths for dvc to track</span>
    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/original_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/processed_df.parquet</span>
</code></pre>
<p>The <code>dvc.yaml</code> file defines a sequence of steps (stages) using sections like:</p>
<ul>
<li><p><code>cmd</code>: The shell command to be executed for that stage</p>
</li>
<li><p><code>deps</code>: Dependencies needed to run the <code>cmd</code></p>
</li>
<li><p><code>params</code>: Default parameters for the <code>cmd</code> defined in the <code>params.yaml</code> file</p>
</li>
<li><p><code>metrics</code>: The metrics files to track</p>
</li>
<li><p><code>reports</code>: The report files to track</p>
</li>
<li><p><code>plots</code>: The DVC plot files for visualization</p>
</li>
<li><p><code>outs</code>: The output files produced by the <code>cmd</code>, which DVC will track</p>
</li>
</ul>
<p>The configuration helps DVC ensure reproducibility by explicitly listing dependencies, outputs, and the commands of each stage. It also helps it manage the lineage by establishing a <strong>Directed Acyclic Graph (DAG)</strong> of the workflow, linking each stage to the next.</p>
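<p>The DAG idea can be sketched in a few lines of Python: a stage depends on another stage whenever one of its <code>deps</code> is listed in the other's <code>outs</code>. This is only an illustration of the concept, not how DVC is implemented. The stage names come from this article, but the file paths attached to each stage here are hypothetical.</p>

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical deps/outs per stage, for illustration only.
stages = {
    "etl_pipeline": {"deps": [], "outs": ["data/processed_df.parquet"]},
    "data_drift_check": {"deps": ["data/processed_df.parquet"],
                         "outs": ["reports/drift.json"]},
    "preprocess": {"deps": ["data/processed_df.parquet", "reports/drift.json"],
                   "outs": ["data/train.parquet"]},
    "tune_primary_model": {"deps": ["data/train.parquet"],
                           "outs": ["models/primary.pth"]},
}


def execution_order(stages):
    """Order stages so that each one runs after the stages producing its deps."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, collect the upstream stages it depends on.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))
```

<p>When a dependency changes, only the stages downstream of it in this graph need to run again, which is what keeps reruns of the lineage cheap.</p>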
<h4 id="heading-python-scripts"><strong>Python Scripts</strong></h4>
<p>Next, let’s add Python scripts, ensuring the data is stored using the file paths specified in the <code>outs</code> section of the <code>dvc.yaml</code> file:</p>
<p><code>src/data_handling/etl_pipeline.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

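<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">etl_pipeline</span>(<span class="hljs-params">stockcode: str = <span class="hljs-string">''</span>, impute_stockcode: bool = False</span>):</span>
    <span class="hljs-comment"># note: stockcode and impute_stockcode are accepted to match the CLI call below; this excerpt does not use them</span>
    <span class="hljs-comment"># extract the entire data</span>
    df = scripts.extract_original_dataframe()

    <span class="hljs-comment"># store the original data as a parquet file</span>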
    ORIGINAL_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'original_df.parquet'</span>)
    df.to_parquet(ORIGINAL_DF_PATH, index=<span class="hljs-literal">False</span>) <span class="hljs-comment"># dvc tracked</span>

    <span class="hljs-comment"># transform</span>
    df = scripts.structure_missing_values(df=df)
    df = scripts.handle_feature_engineering(df=df)

    PROCESSED_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'processed_df.parquet'</span>)
    df.to_parquet(PROCESSED_DF_PATH, index=<span class="hljs-literal">False</span>) <span class="hljs-comment"># dvc tracked</span>
    <span class="hljs-keyword">return</span> df

<span class="hljs-comment"># for dvc execution</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:  
    parser = argparse.ArgumentParser(description=<span class="hljs-string">"run etl pipeline"</span>)
    parser.add_argument(<span class="hljs-string">'--stockcode'</span>, type=str, default=<span class="hljs-string">''</span>, help=<span class="hljs-string">"specific stockcode to process. empty runs full pipeline."</span>)
    parser.add_argument(<span class="hljs-string">'--impute'</span>, action=<span class="hljs-string">'store_true'</span>, help=<span class="hljs-string">"flag to create imputation values"</span>)
    args = parser.parse_args()

    etl_pipeline(stockcode=args.stockcode, impute_stockcode=args.impute)
</code></pre>
<h4 id="heading-outputs"><strong>Outputs</strong></h4>
<p>The original and processed DataFrames are saved as Parquet files, which DVC tracks in its cache:</p>
<ul>
<li><p><code>data/original_df.parquet</code></p>
</li>
<li><p><code>data/processed_df.parquet</code></p>
</li>
</ul>
<h3 id="heading-stage-2-the-data-drift-check">Stage 2: The Data Drift Check</h3>
<p>Before jumping into preprocessing, we’ll run data drift tests to check whether the data has drifted notably. To do this, we’ll use <strong>Evidently AI</strong>, an open-source ML and LLM observability framework.</p>
<h4 id="heading-what-is-data-drift">What is Data Drift?</h4>
<p>Data drift refers to any change in the statistical properties of the data, such as its mean, variance, or distribution, relative to the data the model was trained on.</p>
<p>There are three main types of data drift:</p>
<ul>
<li><p><strong>Covariate Drift</strong> (Feature Drift): A change in the input feature distribution.</p>
</li>
<li><p><strong>Prior Probability Drift</strong> (Label Drift): A change in the target variable distribution.</p>
</li>
<li><p><strong>Concept Drift</strong>: A change in the relationship between the input data and the target variable.</p>
</li>
</ul>
<p>Data drift compromises the model's generalization capabilities over time, making its detection after deployment crucial.</p>
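<p>As a rough illustration of what a covariate drift check computes, here’s a small sketch that scores the shift of a single numeric feature with the Population Stability Index (PSI). This is a swapped-in technique chosen because it fits in a few lines of standard-library Python; Evidently selects its own statistical tests. The function name and the synthetic data are assumptions made for this example.</p>
<pre><code class="lang-python">import math
import random
from bisect import bisect_right
from statistics import quantiles


def population_stability_index(reference, current, n_bins=10):
    """Compare two numeric samples; larger values suggest a bigger distribution shift."""
    # bin edges come from the reference sample's quantiles
    edges = quantiles(reference, n=n_bins)

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# synthetic feature values: same distribution vs. a shifted mean
rng = random.Random(42)
reference = [rng.gauss(10.0, 2.0) for _ in range(5000)]
same_dist = [rng.gauss(10.0, 2.0) for _ in range(5000)]
shifted = [rng.gauss(13.0, 2.0) for _ in range(5000)]

psi_same = population_stability_index(reference, same_dist)
psi_shifted = population_stability_index(reference, shifted)
print(round(psi_same, 4), round(psi_shifted, 4))
</code></pre>
<p>The score stays close to zero when both samples come from the same distribution and grows as the current sample moves away from the reference.</p>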
<h4 id="heading-dvc-configuration-1">DVC Configuration</h4>
<p>We’ll add the <code>data_drift_check</code> stage right after the <code>etl_pipeline</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
     <span class="hljs-comment"># the main command dvc will run in this stage</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/data_handling/report_data_drift.py
      data/processed/processed_df.csv 
      data/processed_df_${params.stockcode}.parquet
      reports/data_drift_report_${params.stockcode}.html
      metrics/data_drift_${params.stockcode}.json
      ${params.stockcode}
</span>
    <span class="hljs-comment"># default values for the parameters (defined in the params.yaml file)</span>
    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>

    <span class="hljs-comment"># dependencies necessary to run the main command</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/report_data_drift.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/</span>

    <span class="hljs-comment"># output file paths for dvc to track</span>
    <span class="hljs-attr">plots:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/data_drift_report_${params.stockcode}.html:</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/data_drift_${params.stockcode}.json:</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>
</code></pre>
<p>Then, add default values to the parameters passed to the DVC command:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">stockcode:</span> <span class="hljs-string">&lt;STOCKCODE</span> <span class="hljs-string">OF</span> <span class="hljs-string">CHOICE&gt;</span>
</code></pre>
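<p>To illustrate how a placeholder such as <code>${params.stockcode}</code> is resolved against the values in <code>params.yaml</code>, here’s a small sketch. It is not DVC’s implementation: the <code>lookup</code> and <code>interpolate</code> helpers are hypothetical, and the stock code is just an example value.</p>
<pre><code class="lang-python">import re

# Hypothetical parsed contents of params.yaml (the stock code is an example value).
params = {"params": {"stockcode": "85123A"}}

# matches ${some.dotted.key}
PLACEHOLDER = re.compile(r"\$\{([\w.]+)\}")


def lookup(values, dotted_key):
    """Resolve a dotted key such as 'params.stockcode' in a nested dict."""
    node = values
    for part in dotted_key.split("."):
        node = node[part]
    return node


def interpolate(template, values):
    """Replace every ${dotted.key} placeholder with its value."""
    return PLACEHOLDER.sub(lambda m: str(lookup(values, m.group(1))), template)


resolved = interpolate("metrics/data_drift_${params.stockcode}.json", params)
print(resolved)  # metrics/data_drift_85123A.json
</code></pre>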
<h4 id="heading-python-scripts-1">Python Scripts</h4>
<p>After <a target="_blank" href="https://docs.evidentlyai.com/quickstart_ml#1-1-set-up-evidently-cloud">generating an API token from the Evidently AI workspace</a>, we’ll add a Python script to detect data drift and store the results in the <code>metrics</code> variable:</p>
<p><code>src/data_handling/report_data_drift.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">from</span> evidently <span class="hljs-keyword">import</span> Dataset, DataDefinition, Report
<span class="hljs-keyword">from</span> evidently.presets <span class="hljs-keyword">import</span> DataDriftPreset
<span class="hljs-keyword">from</span> evidently.ui.workspace <span class="hljs-keyword">import</span> CloudWorkspace

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># initiate the evidently cloud workspace</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    ws = CloudWorkspace(token=os.getenv(<span class="hljs-string">'EVENTLY_API_TOKEN'</span>), url=<span class="hljs-string">'https://app.evidently.cloud'</span>)

    <span class="hljs-comment"># retrieve the evidently project</span>
    project = ws.get_project(<span class="hljs-string">'EVENTLY AI PROJECT ID'</span>)

    <span class="hljs-comment"># retrieve paths from the command line args</span>
    REFERENCE_DATA_PATH = sys.argv[<span class="hljs-number">1</span>]
    CURRENT_DATA_PATH = sys.argv[<span class="hljs-number">2</span>]
    REPORT_OUTPUT_PATH = sys.argv[<span class="hljs-number">3</span>]
    METRICS_OUTPUT_PATH = sys.argv[<span class="hljs-number">4</span>]
    STOCKCODE = sys.argv[<span class="hljs-number">5</span>]

    <span class="hljs-comment"># create folders if not exist</span>
    os.makedirs(os.path.dirname(REPORT_OUTPUT_PATH), exist_ok=<span class="hljs-literal">True</span>)
    os.makedirs(os.path.dirname(METRICS_OUTPUT_PATH), exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># extract datasets</span>
    reference_data_full = pd.read_csv(REFERENCE_DATA_PATH)
    reference_data_stockcode = reference_data_full[reference_data_full[<span class="hljs-string">'stockcode'</span>] == STOCKCODE]
    current_data_stockcode = pd.read_parquet(CURRENT_DATA_PATH)

    <span class="hljs-comment"># define data schema</span>
    nums, cats = scripts.categorize_num_cat_cols(df=reference_data_stockcode)
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> nums: current_data_stockcode[col] = pd.to_numeric(current_data_stockcode[col], errors=<span class="hljs-string">'coerce'</span>)

    schema = DataDefinition(numerical_columns=nums, categorical_columns=cats)

    <span class="hljs-comment"># define evidently datasets w/ the data schema</span>
    eval_data_1 = Dataset.from_pandas(reference_data_stockcode, data_definition=schema)
    eval_data_2 = Dataset.from_pandas(current_data_stockcode, data_definition=schema)

    <span class="hljs-comment"># execute drift detection</span>
    report = Report(metrics=[DataDriftPreset()])
    data_eval = report.run(reference_data=eval_data_1, current_data=eval_data_2)
    data_eval.save_html(REPORT_OUTPUT_PATH)

    <span class="hljs-comment"># create metrics for dvc tracking</span>
    report_dict = json.loads(data_eval.json())
    num_drifts = report_dict[<span class="hljs-string">'metrics'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'value'</span>][<span class="hljs-string">'count'</span>]
    shared_drifts = report_dict[<span class="hljs-string">'metrics'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'value'</span>][<span class="hljs-string">'share'</span>]
    metrics = dict(
        drift_detected=bool(num_drifts &gt; <span class="hljs-number">0.0</span>), num_drifts=num_drifts, shared_drifts=shared_drifts,
        num_cols=nums,
        cat_cols=cats,
        stockcode=STOCKCODE,
        timestamp=datetime.datetime.now().isoformat(),
    )

    <span class="hljs-comment"># write the metrics file</span>
    <span class="hljs-keyword">with</span> open(METRICS_OUTPUT_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        json.dump(metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... drift metrics saved to <span class="hljs-subst">{METRICS_OUTPUT_PATH}</span>... '</span>)

    <span class="hljs-comment"># stop the system if data drift is found</span>
    <span class="hljs-keyword">if</span> num_drifts &gt; <span class="hljs-number">0.0</span>: sys.exit(<span class="hljs-string">'❌ FATAL: data drift detected. stopping pipeline'</span>)
</code></pre>
<p>If data drift is found, the script exits immediately via the final <code>sys.exit</code> call, which stops the pipeline.</p>
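<p>The reason this halts the pipeline is that <code>sys.exit</code> with a string message prints the message to stderr and ends the process with exit code 1, and a non-zero exit code marks the stage’s command as failed. The short check below demonstrates that behavior with a throwaway child process (the message text is shortened for the example):</p>
<pre><code class="lang-python">import subprocess
import sys

# Run a tiny child script that exits the same way the drift check does.
child_code = "import sys; sys.exit('data drift detected')"
result = subprocess.run([sys.executable, "-c", child_code], capture_output=True, text=True)

print(result.returncode)       # 1
print(result.stderr.strip())   # data drift detected
</code></pre>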
<h4 id="heading-outputs-1">Outputs</h4>
<p>The script generates two files that DVC will track:</p>
<ul>
<li><p><code>reports/data_drift_report.html</code>: The data drift report as an HTML file.</p>
</li>
<li><p><code>metrics/data_drift.json</code>: The data drift metrics as a JSON file, including the drift results along with the feature columns and a timestamp:</p>
</li>
</ul>
<p><code>metrics/data_drift.json</code>:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"drift_detected"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">"num_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-attr">"shared_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-attr">"num_cols"</span>: [
        <span class="hljs-string">"invoiceno"</span>,
        <span class="hljs-string">"invoicedate"</span>,
        <span class="hljs-string">"unitprice"</span>,
        <span class="hljs-string">"product_avg_quantity_last_month"</span>,
        <span class="hljs-string">"product_max_price_all_time"</span>,
        <span class="hljs-string">"unitprice_vs_max"</span>,
        <span class="hljs-string">"unitprice_to_avg"</span>,
        <span class="hljs-string">"unitprice_squared"</span>,
        <span class="hljs-string">"unitprice_log"</span>
    ],
    <span class="hljs-attr">"cat_cols"</span>: [
        <span class="hljs-string">"stockcode"</span>,
        <span class="hljs-string">"customerid"</span>,
        <span class="hljs-string">"country"</span>,
        <span class="hljs-string">"year"</span>,
        <span class="hljs-string">"year_month"</span>,
        <span class="hljs-string">"day_of_week"</span>,
        <span class="hljs-string">"is_registered"</span>
    ],
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:24:29.899495"</span>
}
</code></pre>
<p>The drift test results are also available on the Evidently workspace dashboard for further analysis:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/0*2C1ICzvVazAUH7fk.png" alt="Figure B. Screenshot of the Evidently workspace dashboard" width="600" height="400" loading="lazy"></p>
<p><strong>Figure B.</strong> Screenshot of the Evidently workspace dashboard</p>
<h3 id="heading-stage-3-preprocessing">Stage 3: Preprocessing</h3>
<p>If no data drift is detected, the lineage moves on to the preprocessing stage.</p>
<h4 id="heading-dvc-configuration-2">DVC Configuration</h4>
<p>We’ll add the <code>preprocess</code> stage right after the <code>data_drift_check</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/data_handling/preprocess.py --target_col ${params.target_col} --should_scale ${params.should_scale} --verbose ${params.verbose}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/preprocess.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils</span>

    <span class="hljs-comment"># params from params.yaml</span>
    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.target_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.should_scale</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.verbose</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-comment"># train, val, test datasets</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_train_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_val_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_test_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_train_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_val_df.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/y_test_df.parquet</span>

      <span class="hljs-comment"># preprocessed input datasets</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_train_processed.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_val_processed.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/x_test_processed.parquet</span>

      <span class="hljs-comment"># trained preprocessor and human readable feature names for shap analysis</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">preprocessors/column_transformer.pkl</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">preprocessors/feature_names.json</span>
</code></pre>
<p>Then add default values for the parameters used in the <code>cmd</code>:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>
</code></pre>
<h4 id="heading-python-scripts-2">Python Scripts</h4>
<p>Next, we’ll add a Python script to create training, validation, and test datasets and preprocess input data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess</span>(<span class="hljs-params">stockcode: str = <span class="hljs-string">''</span>, target_col: str = <span class="hljs-string">'quantity'</span>, should_scale: bool = True, verbose: bool = False</span>):</span>
    <span class="hljs-comment"># initiate metrics to track (dvc)</span>
    DATA_DRIFT_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'data_drift_<span class="hljs-subst">{stockcode}</span>.json'</span>)

    <span class="hljs-keyword">if</span> os.path.exists(DATA_DRIFT_METRICS_PATH):
        <span class="hljs-keyword">with</span> open(DATA_DRIFT_METRICS_PATH, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
            metrics = json.load(f)
    <span class="hljs-keyword">else</span>: metrics = dict()

    <span class="hljs-comment"># load processed df from dvc cache</span>
    PROCESSED_DF_PATH = os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">'processed_df.parquet'</span>)
    df = pd.read_parquet(PROCESSED_DF_PATH)

    <span class="hljs-comment"># categorize num and cat columns</span>
    num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
    <span class="hljs-keyword">if</span> verbose: main_logger.info(<span class="hljs-string">f'num_cols: <span class="hljs-subst">{num_cols}</span> \ncat_cols: <span class="hljs-subst">{cat_cols}</span>'</span>)

    <span class="hljs-comment"># structure cat cols</span>
    <span class="hljs-keyword">if</span> cat_cols:
        <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cat_cols: df[col] = df[col].astype(<span class="hljs-string">'string'</span>)

    <span class="hljs-comment"># initiate preprocessor (either load from the dvc cache or create from scratch)</span>
    PREPROCESSOR_PATH = os.path.join(<span class="hljs-string">'preprocessors'</span>, <span class="hljs-string">'column_transformer.pkl'</span>)
    <span class="hljs-keyword">try</span>:
        preprocessor = joblib.load(PREPROCESSOR_PATH)
    <span class="hljs-keyword">except</span> Exception:
        preprocessor = scripts.create_preprocessor(num_cols=num_cols <span class="hljs-keyword">if</span> should_scale <span class="hljs-keyword">else</span> [], cat_cols=cat_cols)

    <span class="hljs-comment"># creates train, val, test datasets</span>
    y = df[target_col]
    X = df.copy().drop(target_col, axis=<span class="hljs-string">'columns'</span>)

    <span class="hljs-comment"># split</span>
    test_size, random_state = <span class="hljs-number">50000</span>, <span class="hljs-number">42</span>
    X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, shuffle=<span class="hljs-literal">False</span>)
    X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=test_size, random_state=random_state, shuffle=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># store train, val, test datasets (dvc track)</span>
    X_train.to_parquet(<span class="hljs-string">'data/x_train_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    X_val.to_parquet(<span class="hljs-string">'data/x_val_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    X_test.to_parquet(<span class="hljs-string">'data/x_test_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_train.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_train_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_val.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_val_df.parquet'</span>, index=<span class="hljs-literal">False</span>)
    y_test.to_frame(name=target_col).to_parquet(<span class="hljs-string">'data/y_test_df.parquet'</span>, index=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># preprocess</span>
    X_train = preprocessor.fit_transform(X_train)
    X_val = preprocessor.transform(X_val)
    X_test = preprocessor.transform(X_test)

    <span class="hljs-comment"># store preprocessed input data (dvc track)</span>
    pd.DataFrame(X_train).to_parquet(<span class="hljs-string">f'data/x_train_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)
    pd.DataFrame(X_val).to_parquet(<span class="hljs-string">f'data/x_val_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)
    pd.DataFrame(X_test).to_parquet(<span class="hljs-string">f'data/x_test_processed.parquet'</span>, index=<span class="hljs-literal">False</span>)

    <span class="hljs-comment"># save feature names (dvc track) for shap</span>
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">'preprocessors/feature_names.json'</span>, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        feature_names = preprocessor.get_feature_names_out()
        json.dump(feature_names.tolist(), f)

    <span class="hljs-keyword">return</span>  X_train, X_val, X_test, y_train, y_val, y_test, preprocessor


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    parser = argparse.ArgumentParser(description=<span class="hljs-string">'run data preprocessing'</span>)
    parser.add_argument(<span class="hljs-string">'--stockcode'</span>, type=str, default=<span class="hljs-string">''</span>, help=<span class="hljs-string">'specific stockcode'</span>)
    parser.add_argument(<span class="hljs-string">'--target_col'</span>, type=str, default=<span class="hljs-string">'quantity'</span>, help=<span class="hljs-string">'the target column name'</span>)
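    <span class="hljs-comment"># argparse's type=bool treats any non-empty string (even 'False') as True, so parse the text explicitly</span>
    str_to_bool = <span class="hljs-keyword">lambda</span> s: str(s).strip().lower() <span class="hljs-keyword">in</span> (<span class="hljs-string">'true'</span>, <span class="hljs-string">'1'</span>, <span class="hljs-string">'yes'</span>)
    parser.add_argument(<span class="hljs-string">'--should_scale'</span>, type=str_to_bool, default=<span class="hljs-literal">True</span>, help=<span class="hljs-string">'flag to scale numerical features'</span>)
    parser.add_argument(<span class="hljs-string">'--verbose'</span>, type=str_to_bool, default=<span class="hljs-literal">False</span>, help=<span class="hljs-string">'flag for verbose logging'</span>)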
    args = parser.parse_args()

    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = preprocess(
        target_col=args.target_col,
        should_scale=args.should_scale,
        verbose=args.verbose,
        stockcode=args.stockcode,
    )
</code></pre>
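<p>Because both <code>train_test_split</code> calls use <code>shuffle=False</code> with an integer <code>test_size</code>, the rows keep their original order: the last block becomes the test set and the block before it becomes the validation set. The plain-Python sketch below mirrors that behavior on a toy list; <code>ordered_split</code> is a hypothetical helper written for this illustration, not part of scikit-learn.</p>
<pre><code class="lang-python">def ordered_split(rows, test_size):
    """Split without shuffling: the last `test_size` rows become the hold-out set."""
    cut = len(rows) - test_size
    return rows[:cut], rows[cut:]


rows = list(range(10))            # stand-in for 10 time-ordered records
test_size = 2                     # the script above uses 50000

train_val, test = ordered_split(rows, test_size)
train, val = ordered_split(train_val, test_size)

print(train)  # [0, 1, 2, 3, 4, 5]
print(val)    # [6, 7]
print(test)   # [8, 9]
</code></pre>
<p>Keeping the order intact means the model is validated and tested on the rows that come last in the file, not on a random sample drawn from the whole range.</p>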
<h4 id="heading-outputs-2">Outputs</h4>
<p>This stage generates the necessary datasets for both model training and inference:</p>
<p>Input features:</p>
<ul>
<li><p><code>data/x_train_df.parquet</code></p>
</li>
<li><p><code>data/x_val_df.parquet</code></p>
</li>
<li><p><code>data/x_test_df.parquet</code></p>
</li>
</ul>
<p>Preprocessed input features:</p>
<ul>
<li><p><code>data/x_train_processed.parquet</code></p>
</li>
<li><p><code>data/x_val_processed.parquet</code></p>
</li>
<li><p><code>data/x_test_processed.parquet</code></p>
</li>
</ul>
<p>Target variables:</p>
<ul>
<li><p><code>data/y_train_df.parquet</code></p>
</li>
<li><p><code>data/y_val_df.parquet</code></p>
</li>
<li><p><code>data/y_test_df.parquet</code></p>
</li>
</ul>
<p>The preprocessor and human-readable feature names are also stored in cache for inference and SHAP feature impact analysis later:</p>
<ul>
<li><p><code>preprocessors/column_transformer.pkl</code></p>
</li>
<li><p><code>preprocessors/feature_names.json</code></p>
</li>
</ul>
<p>Lastly, the <code>preprocess_status</code>, <code>x_train_processed_path</code>, and <code>preprocessor_path</code> fields are added to the data summary metrics file <code>data.json</code> created in Stage 2, so that it tracks the end-to-end process of Stages 2 and 3:</p>
<p><code>metrics/data.json</code>:</p>
<pre><code class="lang-python">{
    <span class="hljs-string">"drift_detected"</span>: false,
    <span class="hljs-string">"num_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-string">"shared_drifts"</span>: <span class="hljs-number">0.0</span>,
    <span class="hljs-string">"num_cols"</span>: [
        <span class="hljs-string">"invoiceno"</span>,
        <span class="hljs-string">"invoicedate"</span>,
        <span class="hljs-string">"unitprice"</span>,
        <span class="hljs-string">"product_avg_quantity_last_month"</span>,
        <span class="hljs-string">"product_max_price_all_time"</span>,
        <span class="hljs-string">"unitprice_vs_max"</span>,
        <span class="hljs-string">"unitprice_to_avg"</span>,
        <span class="hljs-string">"unitprice_squared"</span>,
        <span class="hljs-string">"unitprice_log"</span>
    ],
    <span class="hljs-string">"cat_cols"</span>: [
        <span class="hljs-string">"stockcode"</span>,
        <span class="hljs-string">"customerid"</span>,
        <span class="hljs-string">"country"</span>,
        <span class="hljs-string">"year"</span>,
        <span class="hljs-string">"year_month"</span>,
        <span class="hljs-string">"day_of_week"</span>,
        <span class="hljs-string">"is_registered"</span>
    ],
    <span class="hljs-string">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:24:29.899495"</span>,

    <span class="hljs-comment"># updates</span>
    <span class="hljs-string">"preprocess_status"</span>: <span class="hljs-string">"completed"</span>,
    <span class="hljs-string">"x_train_processed_path"</span>: <span class="hljs-string">"data/x_train_processed_85123A.parquet"</span>,
    <span class="hljs-string">"preprocessor_path"</span>: <span class="hljs-string">"preprocessors/column_transformer.pkl"</span>
}
</code></pre>
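<p>The preprocessing script shown earlier doesn’t include the code that writes these extra fields, so here’s a minimal sketch of one way such an update could be done: read the existing JSON, merge in the new keys, and write it back. The <code>update_metrics_file</code> helper is hypothetical, and the demo writes to a temporary folder so that it doesn’t touch any real metrics file.</p>
<pre><code class="lang-python">import json
import os
import tempfile


def update_metrics_file(path, updates):
    """Merge `updates` into the JSON file at `path`, creating it if needed."""
    metrics = {}
    if os.path.exists(path):
        with open(path, "r") as f:
            metrics = json.load(f)
    metrics.update(updates)
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=4)
    return metrics


# demo in a temporary folder so no real metrics file is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "metrics", "data.json")
    update_metrics_file(demo_path, {"drift_detected": False, "num_drifts": 0.0})
    merged = update_metrics_file(demo_path, {"preprocess_status": "completed"})

print(sorted(merged))
</code></pre>
<p>Merging into the existing file keeps the drift results from the earlier stage next to the preprocessing status in one place.</p>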
<p>Next, let’s move on to the model/experiment lineage.</p>
<h3 id="heading-stage-4-tuning-the-model">Stage 4: Tuning the Model</h3>
<p>Now that we’ve created the datasets, we’ll tune and train the primary model: a multi-layer feedforward network built with <strong>PyTorch</strong>, using the training and validation datasets created in the <code>preprocess</code> stage.</p>
<h4 id="heading-dvc-configuration-3">DVC Configuration</h4>
<p>First, we’ll add the <code>tune_primary_model</code> stage right after the <code>preprocess</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/main.py
      data/x_train_processed_${params.stockcode}.parquet
      data/x_val_processed_${params.stockcode}.parquet
      data/y_train_df_${params.stockcode}.parquet
      data/y_val_df_${params.stockcode}.parquet
      ${tuning.should_local_save}
      ${tuning.grid}
      ${tuning.n_trials}
      ${tuning.num_epochs}
      ${params.stockcode}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/main.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.n_trials</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.grid</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tuning.should_local_save</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">models/production/dfn_best_${params.stockcode}.pth</span> <span class="hljs-comment"># dvc track</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_val_${params.stockcode}.json:</span> <span class="hljs-comment"># dvc track</span>
</code></pre>
<p>Then we’ll add default values to the parameters:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>

<span class="hljs-attr">tuning:</span>
  <span class="hljs-attr">n_trials:</span> <span class="hljs-number">100</span>
  <span class="hljs-attr">num_epochs:</span> <span class="hljs-number">3000</span>
  <span class="hljs-attr">should_local_save:</span> <span class="hljs-literal">False</span>
  <span class="hljs-attr">grid:</span> <span class="hljs-literal">False</span>
</code></pre>
<h4 id="heading-python-scripts-3">Python Scripts</h4>
<p>Next, we’ll add the Python scripts to tune the model using <strong>Bayesian optimization</strong> and then train the optimal model on the complete <code>X_train</code> and <code>y_train</code> datasets created in the <code>preprocess</code> stage.</p>
<p><code>src/model/torch_model/main.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn

<span class="hljs-keyword">import</span> src.model.torch_model.scripts <span class="hljs-keyword">as</span> scripts


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">tune_and_train</span>(<span class="hljs-params">
        X_train, X_val, y_train, y_val,
        stockcode: str = <span class="hljs-string">''</span>,
        should_local_save: bool = True,
        grid: bool = False,
        n_trials: int = <span class="hljs-number">50</span>,
        num_epochs: int = <span class="hljs-number">3000</span>
    </span>) -&gt; tuple[nn.Module, dict]:</span>

    <span class="hljs-comment"># perform bayesian optimization</span>
    best_dfn, best_optimizer, best_batch_size, best_checkpoint = scripts.bayesian_optimization(
        X_train, X_val, y_train, y_val, n_trials=n_trials, num_epochs=num_epochs
    )

    <span class="hljs-comment"># save the model artifact (dvc track)</span>
    DFN_FILE_PATH = os.path.join(<span class="hljs-string">'models'</span>, <span class="hljs-string">'production'</span>, <span class="hljs-string">f'dfn_best_<span class="hljs-subst">{stockcode}</span>.pth'</span> <span class="hljs-keyword">if</span> stockcode <span class="hljs-keyword">else</span> <span class="hljs-string">'dfn_best.pth'</span>)
    os.makedirs(os.path.dirname(DFN_FILE_PATH), exist_ok=<span class="hljs-literal">True</span>)
    torch.save(best_checkpoint, DFN_FILE_PATH)

    <span class="hljs-keyword">return</span> best_dfn, best_checkpoint



<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">track_metrics_by_stockcode</span>(<span class="hljs-params">X_val, y_val, best_model, checkpoint: dict, stockcode: str</span>):</span>
    MODEL_VAL_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'dfn_val_<span class="hljs-subst">{stockcode}</span>.json'</span>)
    os.makedirs(os.path.dirname(MODEL_VAL_METRICS_PATH), exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># validate the tuned model</span>
    _, mse, exp_mae, rmsle = scripts.perform_inference(model=best_model, X=X_val, y=y_val)
    model_version = <span class="hljs-string">f"dfn_<span class="hljs-subst">{stockcode}</span>_<span class="hljs-subst">{os.getpid()}</span>"</span>
    metrics = dict(
        stockcode=stockcode,
        mse_val=mse,
        mae_val=exp_mae,
        rmsle_val=rmsle,
        model_version=model_version,
        hparams=checkpoint[<span class="hljs-string">'hparams'</span>],
        optimizer=checkpoint[<span class="hljs-string">'optimizer_name'</span>],
        batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>],
        lr=checkpoint[<span class="hljs-string">'lr'</span>],
        timestamp=datetime.datetime.now().isoformat()
    )
    <span class="hljs-comment"># store the validation results (dvc track)</span>
    <span class="hljs-keyword">with</span> open(MODEL_VAL_METRICS_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
        json.dump(metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... validation metrics saved to <span class="hljs-subst">{MODEL_VAL_METRICS_PATH}</span> ...'</span>)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># fetch command arg values</span>
    X_TRAIN_PATH = sys.argv[<span class="hljs-number">1</span>]
    X_VAL_PATH = sys.argv[<span class="hljs-number">2</span>]
    Y_TRAIN_PATH = sys.argv[<span class="hljs-number">3</span>]
    Y_VAL_PATH = sys.argv[<span class="hljs-number">4</span>]
    SHOULD_LOCAL_SAVE = sys.argv[<span class="hljs-number">5</span>] == <span class="hljs-string">'True'</span>
    GRID = sys.argv[<span class="hljs-number">6</span>] == <span class="hljs-string">'True'</span>
    N_TRIALS = int(sys.argv[<span class="hljs-number">7</span>])
    NUM_EPOCHS = int(sys.argv[<span class="hljs-number">8</span>])
    STOCKCODE = str(sys.argv[<span class="hljs-number">9</span>])

    <span class="hljs-comment"># extract training and validation datasets from dvc cache</span>
    X_train, X_val = pd.read_parquet(X_TRAIN_PATH), pd.read_parquet(X_VAL_PATH)
    y_train, y_val = pd.read_parquet(Y_TRAIN_PATH), pd.read_parquet(Y_VAL_PATH)

    <span class="hljs-comment"># tuning</span>
    best_model, checkpoint = tune_and_train(
        X_train, X_val, y_train, y_val,
        stockcode=STOCKCODE, should_local_save=SHOULD_LOCAL_SAVE, grid=GRID, n_trials=N_TRIALS, num_epochs=NUM_EPOCHS
    )

    <span class="hljs-comment"># metrics tracking</span>
    track_metrics_by_stockcode(X_val, y_val, best_model=best_model, checkpoint=checkpoint, stockcode=STOCKCODE)
</code></pre>
<h4 id="heading-outputs-3">Outputs</h4>
<p>The stage generates two files:</p>
<ul>
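<li><p><code>models/production/dfn_best.pth</code>: Stores the model artifact as a checkpoint, including the optimal hyperparameter set. When a stock code is passed, the script names the file <code>dfn_best_{stockcode}.pth</code>.</p>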
</li>
<li><p><code>metrics/dfn_val.json</code>: Contains tuning results, model version, timestamp, and validation results for MSE, MAE, and RMSLE:</p>
</li>
</ul>
<p><code>metrics/dfn_val.json</code>:</p>
<pre><code class="lang-yaml">{
    <span class="hljs-attr">"stockcode":</span> <span class="hljs-string">"85123A"</span>,
    <span class="hljs-attr">"mse_val":</span> <span class="hljs-number">0.6137686967849731</span>,
    <span class="hljs-attr">"mae_val":</span> <span class="hljs-number">9.092489242553711</span>,
    <span class="hljs-attr">"rmsle_val":</span> <span class="hljs-number">0.6953299045562744</span>,
    <span class="hljs-attr">"model_version":</span> <span class="hljs-string">"dfn_85123A_35604"</span>,
    <span class="hljs-attr">"hparams":</span> {
        <span class="hljs-attr">"num_layers":</span> <span class="hljs-number">4</span>,
        <span class="hljs-attr">"batch_norm":</span> <span class="hljs-literal">false</span>,
        <span class="hljs-attr">"dropout_rate_layer_0":</span> <span class="hljs-number">0.13765888061300502</span>,
        <span class="hljs-attr">"n_units_layer_0":</span> <span class="hljs-number">184</span>,
        <span class="hljs-attr">"dropout_rate_layer_1":</span> <span class="hljs-number">0.5509872409359128</span>,
        <span class="hljs-attr">"n_units_layer_1":</span> <span class="hljs-number">122</span>,
        <span class="hljs-attr">"dropout_rate_layer_2":</span> <span class="hljs-number">0.2408753527744403</span>,
        <span class="hljs-attr">"n_units_layer_2":</span> <span class="hljs-number">35</span>,
        <span class="hljs-attr">"dropout_rate_layer_3":</span> <span class="hljs-number">0.03451842588822594</span>,
        <span class="hljs-attr">"n_units_layer_3":</span> <span class="hljs-number">224</span>,
        <span class="hljs-attr">"learning_rate":</span> <span class="hljs-number">0.026240673135104406</span>,
        <span class="hljs-attr">"optimizer":</span> <span class="hljs-string">"adamax"</span>,
        <span class="hljs-attr">"batch_size":</span> <span class="hljs-number">64</span>
    },
    <span class="hljs-attr">"optimizer":</span> <span class="hljs-string">"adamax"</span>,
    <span class="hljs-attr">"batch_size":</span> <span class="hljs-number">64</span>,
    <span class="hljs-attr">"lr":</span> <span class="hljs-number">0.026240673135104406</span>,
    <span class="hljs-attr">"timestamp":</span> <span class="hljs-string">"2025-10-07T00:31:08.700294"</span>
}
</code></pre>
<h3 id="heading-stage-5-performing-inference">Stage 5: Performing Inference</h3>
<p>After the model tuning phase is complete, we’ll configure the test inference for a final evaluation.</p>
<p>The final evaluation uses the MSE, MAE, and RMSLE metrics, as well as SHAP for feature impact and interpretability analysis.</p>
<p><strong>SHAP</strong> <strong>(SHapley Additive exPlanations)</strong> is a framework for quantifying how much each feature contributes to a model’s prediction by using the concept of Shapley values from game theory.</p>
<p>We’ll save the SHAP values so they can inform future EDA and feature engineering.</p>
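<p>To make the additive idea concrete, here is a minimal sketch in plain Python. It assumes a hypothetical linear model, for which the Shapley value of each feature reduces to its weight times the feature’s deviation from the background mean (treating features as independent). The weights, bias, and data points are made up for illustration and are not taken from this project:</p>
<pre><code class="lang-python"># Hypothetical linear model: f(x) = bias + sum(w_i * x_i). Values are illustrative only.
weights = [2.0, -1.0, 0.5]
bias = 3.0


def predict(x):
    return bias + sum(w * v for w, v in zip(weights, x))


# Background data used to estimate the baseline (expected) prediction.
background = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 4.0],
    [2.0, 4.0, 0.0],
]
n = len(background)
feature_means = [sum(row[i] for row in background) / n for i in range(len(weights))]
base_value = sum(predict(row) for row in background) / n


def linear_shap(x):
    # For a linear model with independent features: phi_i = w_i * (x_i - E[x_i]).
    return [w * (v - m) for w, v, m in zip(weights, x, feature_means)]


x = [4.0, 1.0, 2.0]
phi = linear_shap(x)

# Additivity: the baseline plus all feature contributions recovers the prediction.
print(phi, base_value + sum(phi), predict(x))
</code></pre>
<p>The deep feedforward network in this project is not linear, so <code>shap.DeepExplainer</code> approximates these contributions instead of computing them in closed form, but the same additive reading applies: each value is one feature’s share of the gap between a prediction and the baseline.</p>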
<h4 id="heading-dvc-configuration-4">DVC Configuration</h4>
<p>First, we’ll add the <code>inference_primary_model</code> stage to the DVC configuration.</p>
<p>This stage includes a <code>plots</code> section, where DVC tracks and versions the visualization files generated from the SHAP values.</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">inference_primary_model:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/inference.py
      data/x_test_processed_${params.stockcode}.parquet
      data/y_test_df_${params.stockcode}.parquet
      models/production/dfn_best_${params.stockcode}.pth
      ${params.stockcode}
      ${tracking.sensitive_feature_col}
      ${tracking.privileged_group}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/inference.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">models/production/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.sensitive_feature_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.privileged_group</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_inf_${params.stockcode}.json:</span> <span class="hljs-comment"># dvc track</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>

    <span class="hljs-attr">plots:</span>
      <span class="hljs-comment"># shap summary / beeswarm plot for global interpretability</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_shap_summary_${params.stockcode}.json:</span>
          <span class="hljs-attr">template:</span> <span class="hljs-string">simple</span>
          <span class="hljs-attr">x:</span> <span class="hljs-string">shap_value</span>
          <span class="hljs-attr">y:</span> <span class="hljs-string">feature_name</span>
          <span class="hljs-attr">title:</span> <span class="hljs-string">SHAP</span> <span class="hljs-string">Beeswarm</span> <span class="hljs-string">Plot</span>

      <span class="hljs-comment"># shap mean absolute vals - feature importance bar plot</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_shap_mean_abs_${params.stockcode}.json:</span>
          <span class="hljs-attr">template:</span> <span class="hljs-string">bar</span>
          <span class="hljs-attr">x:</span> <span class="hljs-string">mean_abs_shap</span>
          <span class="hljs-attr">y:</span> <span class="hljs-string">feature_name</span>
          <span class="hljs-attr">title:</span> <span class="hljs-string">Mean</span> <span class="hljs-string">Absolute</span> <span class="hljs-string">SHAP</span> <span class="hljs-string">Importance</span>

    <span class="hljs-attr">outs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/dfn_inference_results_${params.stockcode}.parquet</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">reports/dfn_raw_shap_values_${params.stockcode}.parquet</span> <span class="hljs-comment"># save raw shap vals for detailed analysis later</span>
</code></pre>
<h4 id="heading-python-scripts-4"><strong>Python Scripts</strong></h4>
<p>Next, we’ll add scripts where the trained model performs inference:</p>
<p><code>src/model/torch_model/inference.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> shap

<span class="hljs-keyword">import</span> src.model.torch_model.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># load test dataset</span>
    X_TEST_PATH = sys.argv[<span class="hljs-number">1</span>]
    Y_TEST_PATH = sys.argv[<span class="hljs-number">2</span>]
    X_test, y_test = pd.read_parquet(X_TEST_PATH), pd.read_parquet(Y_TEST_PATH)

    <span class="hljs-comment"># create X_test w/ column names for shap analysis and sensitive feature tracking</span>
    X_test_with_col_names = X_test.copy()
    FEATURE_NAMES_PATH = os.path.join(<span class="hljs-string">'preprocessors'</span>, <span class="hljs-string">'feature_names.json'</span>)
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">with</span> open(FEATURE_NAMES_PATH, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f: feature_names = json.load(f)
    <span class="hljs-keyword">except</span> FileNotFoundError: feature_names = X_test.columns.tolist()
    <span class="hljs-keyword">if</span> len(X_test_with_col_names.columns) == len(feature_names): X_test_with_col_names.columns = feature_names

    <span class="hljs-comment"># reconstruct the optimal model tuned in the previous stage</span>
    MODEL_PATH = sys.argv[<span class="hljs-number">3</span>]
    checkpoint = torch.load(MODEL_PATH)
    model = scripts.load_model(checkpoint=checkpoint)

    <span class="hljs-comment"># perform inference</span>
    y_pred, mse, exp_mae, rmsle = scripts.perform_inference(model=model, X=X_test, y=y_test, batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>])

    <span class="hljs-comment"># create result df w/ y_pred, y_true, and sensitive features</span>
    STOCKCODE = sys.argv[<span class="hljs-number">4</span>]
    SENSITIVE_FEATURE = sys.argv[<span class="hljs-number">5</span>]
    PRIVILEGED_GROUP = sys.argv[<span class="hljs-number">6</span>]
    inference_df = pd.DataFrame(y_pred.cpu().numpy().flatten(), columns=[<span class="hljs-string">'y_pred'</span>])
    inference_df[<span class="hljs-string">'y_true'</span>] = y_test
    inference_df[SENSITIVE_FEATURE] = X_test_with_col_names[<span class="hljs-string">f'cat__<span class="hljs-subst">{SENSITIVE_FEATURE}</span>_<span class="hljs-subst">{str(PRIVILEGED_GROUP)}</span>'</span>].astype(bool)
    inference_df.to_parquet(path=os.path.join(<span class="hljs-string">'data'</span>, <span class="hljs-string">f'dfn_inference_results_<span class="hljs-subst">{STOCKCODE}</span>.parquet'</span>))

    <span class="hljs-comment"># record inference metrics</span>
    MODEL_INF_METRICS_PATH = os.path.join(<span class="hljs-string">'metrics'</span>, <span class="hljs-string">f'dfn_inf_<span class="hljs-subst">{STOCKCODE}</span>.json'</span>)
    os.makedirs(os.path.dirname(MODEL_INF_METRICS_PATH), exist_ok=<span class="hljs-literal">True</span>)
    model_version = <span class="hljs-string">f"dfn_<span class="hljs-subst">{STOCKCODE}</span>_<span class="hljs-subst">{os.getpid()}</span>"</span>
    inf_metrics = dict(
        stockcode=STOCKCODE,
        mse_inf=mse,
        mae_inf=exp_mae,
        rmsle_inf=rmsle,
        model_version=model_version,
        hparams=checkpoint[<span class="hljs-string">'hparams'</span>],
        optimizer=checkpoint[<span class="hljs-string">'optimizer_name'</span>],
        batch_size=checkpoint[<span class="hljs-string">'batch_size'</span>],
        lr=checkpoint[<span class="hljs-string">'lr'</span>],
        timestamp=datetime.datetime.now().isoformat()
    )
    <span class="hljs-keyword">with</span> open(MODEL_INF_METRICS_PATH, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f: <span class="hljs-comment"># dvc track</span>
        json.dump(inf_metrics, f, indent=<span class="hljs-number">4</span>)
        main_logger.info(<span class="hljs-string">f'... inference metrics saved to <span class="hljs-subst">{MODEL_INF_METRICS_PATH}</span> ...'</span>)


    <span class="hljs-comment">## shap analysis</span>
    <span class="hljs-comment"># compute shap vals</span>
    model.eval()

    <span class="hljs-comment"># prepare background data on the same device as the model</span>
    device_type = next(model.parameters()).device
    X_test_tensor = torch.from_numpy(X_test.values.astype(np.float32)).to(device_type)

    <span class="hljs-comment"># take the small samples from x_test as background</span>
    background = X_test_tensor[np.random.choice(X_test_tensor.shape[<span class="hljs-number">0</span>], <span class="hljs-number">100</span>, replace=<span class="hljs-literal">False</span>)].to(device_type)

    <span class="hljs-comment"># define deepexplainer</span>
    explainer = shap.DeepExplainer(model, background)

    <span class="hljs-comment"># compute shap vals</span>
    shap_values = explainer.shap_values(X_test_tensor) <span class="hljs-comment"># outputs = numpy array or tensor</span>

    <span class="hljs-comment"># convert shap array to pandas df</span>
    <span class="hljs-keyword">if</span> isinstance(shap_values, list): shap_values = shap_values[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">if</span> isinstance(shap_values, torch.Tensor): shap_values = shap_values.cpu().numpy()
    shap_values = shap_values.squeeze(axis=<span class="hljs-number">-1</span>) <span class="hljs-comment"># type: ignore</span>
    shap_df = pd.DataFrame(shap_values, columns=feature_names)

    <span class="hljs-comment"># shap raw data (dvc track)</span>
    RAW_SHAP_OUT_PATH = os.path.join(<span class="hljs-string">'reports'</span>, <span class="hljs-string">f'dfn_raw_shap_values_<span class="hljs-subst">{STOCKCODE}</span>.parquet'</span>)
    os.makedirs(os.path.dirname(RAW_SHAP_OUT_PATH), exist_ok=<span class="hljs-literal">True</span>)
    shap_df.to_parquet(RAW_SHAP_OUT_PATH, index=<span class="hljs-literal">False</span>)
    main_logger.info(<span class="hljs-string">f'... shap values saved to <span class="hljs-subst">{RAW_SHAP_OUT_PATH}</span> ...'</span>)

    <span class="hljs-comment"># bar plot of mean abs shap vals (dvc report)</span>
    mean_abs_shap = shap_df.abs().mean().sort_values(ascending=<span class="hljs-literal">False</span>)
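    <span class="hljs-comment"># take names from the sorted index so each feature stays paired with its own value</span>
    shap_mean_abs_df = pd.DataFrame({<span class="hljs-string">'feature_name'</span>: mean_abs_shap.index, <span class="hljs-string">'mean_abs_shap'</span>: mean_abs_shap.values})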
    MEAN_ABS_SHAP_PATH = os.path.join(<span class="hljs-string">'reports'</span>, <span class="hljs-string">f'dfn_shap_mean_abs_<span class="hljs-subst">{STOCKCODE}</span>.json'</span>)
    shap_mean_abs_df.to_json(MEAN_ABS_SHAP_PATH, orient=<span class="hljs-string">'records'</span>, indent=<span class="hljs-number">4</span>)
</code></pre>
<h4 id="heading-outputs-4"><strong>Outputs</strong></h4>
<p>This stage generates five output files:</p>
<ul>
<li><p><code>data/dfn_inference_results_${params.stockcode}.parquet</code>: Stores prediction results, labeled targets, and any columns with sensitive features like gender, age, income, and more. We’ll use this file for the fairness test in the last stage.</p>
</li>
<li><p><code>metrics/dfn_inf.json</code>: Stores evaluation metrics and tuning results:</p>
</li>
</ul>
<pre><code class="lang-json">{
    <span class="hljs-attr">"stockcode"</span>: <span class="hljs-string">"85123A"</span>,
    <span class="hljs-attr">"mse_inf"</span>: <span class="hljs-number">0.6841545701026917</span>,
    <span class="hljs-attr">"mae_inf"</span>: <span class="hljs-number">11.5866117477417</span>,
    <span class="hljs-attr">"rmsle_inf"</span>: <span class="hljs-number">0.7423332333564758</span>,
    <span class="hljs-attr">"model_version"</span>: <span class="hljs-string">"dfn_85123A_35834"</span>,
    <span class="hljs-attr">"hparams"</span>: {
        <span class="hljs-attr">"num_layers"</span>: <span class="hljs-number">4</span>,
        <span class="hljs-attr">"batch_norm"</span>: <span class="hljs-literal">false</span>,
        <span class="hljs-attr">"dropout_rate_layer_0"</span>: <span class="hljs-number">0.13765888061300502</span>,
        <span class="hljs-attr">"n_units_layer_0"</span>: <span class="hljs-number">184</span>,
        <span class="hljs-attr">"dropout_rate_layer_1"</span>: <span class="hljs-number">0.5509872409359128</span>,
        <span class="hljs-attr">"n_units_layer_1"</span>: <span class="hljs-number">122</span>,
        <span class="hljs-attr">"dropout_rate_layer_2"</span>: <span class="hljs-number">0.2408753527744403</span>,
        <span class="hljs-attr">"n_units_layer_2"</span>: <span class="hljs-number">35</span>,
        <span class="hljs-attr">"dropout_rate_layer_3"</span>: <span class="hljs-number">0.03451842588822594</span>,
        <span class="hljs-attr">"n_units_layer_3"</span>: <span class="hljs-number">224</span>,
        <span class="hljs-attr">"learning_rate"</span>: <span class="hljs-number">0.026240673135104406</span>,
        <span class="hljs-attr">"optimizer"</span>: <span class="hljs-string">"adamax"</span>,
        <span class="hljs-attr">"batch_size"</span>: <span class="hljs-number">64</span>
    },
    <span class="hljs-attr">"optimizer"</span>: <span class="hljs-string">"adamax"</span>,
    <span class="hljs-attr">"batch_size"</span>: <span class="hljs-number">64</span>,
    <span class="hljs-attr">"lr"</span>: <span class="hljs-number">0.026240673135104406</span>,
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:31:12.946405"</span>
}
</code></pre>
<ul>
<li><code>reports/dfn_shap_mean_abs.json</code>: Stores the mean absolute SHAP value of each feature:</li>
</ul>
<pre><code class="lang-json">[
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__invoicedate"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.219255722</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__unitprice"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.1069829418</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__product_avg_quantity_last_month"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.1021453096</span>
    },
    {
        <span class="hljs-attr">"feature_name"</span>:<span class="hljs-string">"num__product_max_price_all_time"</span>,
        <span class="hljs-attr">"mean_abs_shap"</span>:<span class="hljs-number">0.0855356899</span>
    },
...
]
</code></pre>
<ul>
<li><p><code>reports/dfn_shap_summary.json</code>: Contains the data points needed to draw the SHAP beeswarm plot.</p>
</li>
<li><p><code>reports/dfn_raw_shap_values.parquet</code>: Stores raw SHAP values.</p>
</li>
</ul>
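<p>The inference script shown above does not include the step that writes the summary file. One possible approach, assuming DVC reads a flat list of JSON records carrying the <code>feature_name</code> and <code>shap_value</code> fields named in the <code>plots</code> section, is to reshape the SHAP matrix into long format. The helper name <code>to_long_records</code> and the sample values below are hypothetical:</p>
<pre><code class="lang-python">import json


def to_long_records(shap_rows, feature_names):
    # One record per (sample, feature) pair, matching the x/y fields declared in dvc.yaml.
    return [
        {'feature_name': name, 'shap_value': float(value)}
        for row in shap_rows
        for name, value in zip(feature_names, row)
    ]


# Illustrative values only: two samples, two features.
feature_names = ['num__unitprice', 'num__invoicedate']
shap_rows = [[0.12, -0.30], [-0.05, 0.21]]

records = to_long_records(shap_rows, feature_names)
summary_json = json.dumps(records, indent=4)
print(summary_json)
</code></pre>
<p>With the <code>shap_df</code> DataFrame from the script, the same reshaping could be done with <code>shap_df.melt(var_name='feature_name', value_name='shap_value')</code> followed by <code>to_json(path, orient='records')</code>.</p>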
<h3 id="heading-stage-6-assessing-model-risk-and-fairness">Stage 6: Assessing Model Risk and Fairness</h3>
<p>The last stage assesses the risk and fairness of the final inference results.</p>
<h4 id="heading-the-fairness-testing">Fairness Testing</h4>
<p>Fairness testing in ML is the process of systematically evaluating a model’s predictions to ensure they are not unfairly biased toward specific groups defined by sensitive attributes like race and gender.</p>
<p>In this project, we’ll use the registration status <code>is_registered</code> column as a sensitive feature and make sure the <strong>Mean Outcome Difference (MOD)</strong> is within the specified threshold of <code>0.1</code>.</p>
<p>The MOD is calculated as the absolute difference between the mean prediction values of the privileged (registered) and unprivileged (unregistered) groups.</p>
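<p>As a quick worked example of that calculation, the sketch below computes the MOD for four made-up predictions using only the standard library. The function name <code>mean_outcome_difference</code> and the sample rows are illustrative assumptions, not part of the project code:</p>
<pre><code class="lang-python">from statistics import mean


def mean_outcome_difference(rows, sensitive_key='is_registered', prediction_key='y_pred'):
    # Split predictions by the sensitive flag, then compare the two group means.
    privileged = [r[prediction_key] for r in rows if r[sensitive_key]]
    unprivileged = [r[prediction_key] for r in rows if not r[sensitive_key]]
    return abs(mean(unprivileged) - mean(privileged))


# Illustrative predictions only.
rows = [
    {'is_registered': True, 'y_pred': 2.0},
    {'is_registered': True, 'y_pred': 3.0},
    {'is_registered': False, 'y_pred': 2.25},
    {'is_registered': False, 'y_pred': 2.5},
]

mod = mean_outcome_difference(rows)
print(mod)
</code></pre>
<p>Here the registered group averages 2.5 and the unregistered group averages 2.375, so the MOD is 0.125. That is above the <code>0.1</code> threshold, so these predictions would be flagged as unacceptable.</p>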
<h4 id="heading-dvc-configuration-5">DVC Configuration</h4>
<p>First, we’ll add the <code>assess_model_risk</code> stage right after the <code>inference_primary_model</code> stage:</p>
<p><code>dvc.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">data_drift_check:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">preprocess:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">tune_primary_model:</span>
    <span class="hljs-comment">### </span>
  <span class="hljs-attr">inference_primary_model:</span>
    <span class="hljs-comment">###</span>
  <span class="hljs-attr">assess_model_risk:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">&gt;
      python src/model/torch_model/assess_risk_and_fairness.py
      data/dfn_inference_results_${params.stockcode}.parquet
      metrics/dfn_risk_fairness_${params.stockcode}.json
      ${tracking.sensitive_feature_col}
      ${params.stockcode}
      ${tracking.privileged_group}
      ${tracking.mod_threshold}
</span>
    <span class="hljs-attr">deps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/model/torch_model/assess_risk_and_fairness.py</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">src/_utils/</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">data/dfn_inference_results_${params.stockcode}.parquet</span> <span class="hljs-comment"># ensure the result df as dependency</span>

    <span class="hljs-attr">params:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">params.stockcode</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.sensitive_feature_col</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.privileged_group</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">tracking.mod_threshold</span>

    <span class="hljs-attr">metrics:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">metrics/dfn_risk_fairness_${params.stockcode}.json:</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">json</span>
</code></pre>
<p>Then we’ll add default values to the parameters:</p>
<p><code>params.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">params:</span>
  <span class="hljs-attr">target_col:</span> <span class="hljs-string">"quantity"</span>
  <span class="hljs-attr">should_scale:</span> <span class="hljs-literal">True</span>
  <span class="hljs-attr">verbose:</span> <span class="hljs-literal">False</span>

<span class="hljs-attr">tuning:</span>
  <span class="hljs-attr">n_trials:</span> <span class="hljs-number">100</span>
  <span class="hljs-attr">num_epochs:</span> <span class="hljs-number">3000</span>
  <span class="hljs-attr">should_local_save:</span> <span class="hljs-literal">False</span>
  <span class="hljs-attr">grid:</span> <span class="hljs-literal">False</span>

<span class="hljs-comment"># adding default values to the tracking metrics</span>
<span class="hljs-attr">tracking:</span>
  <span class="hljs-attr">sensitive_feature_col:</span> <span class="hljs-string">"is_registered"</span>
  <span class="hljs-attr">privileged_group:</span> <span class="hljs-number">1</span> <span class="hljs-comment"># member</span>
  <span class="hljs-attr">mod_threshold:</span> <span class="hljs-number">0.1</span>
</code></pre>
<h4 id="heading-python-script">Python Script</h4>
<p>The corresponding Python script contains the <code>calculate_fairness_metrics</code> function which performs the risk and fairness assessment:</p>
<p><code>src/model/torch_model/assess_risk_and_fairness.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_absolute_error, mean_squared_error, root_mean_squared_log_error

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">calculate_fairness_metrics</span>(<span class="hljs-params">
        df: pd.DataFrame,
        sensitive_feature_col: str,
        label_col: str = <span class="hljs-string">'y_true'</span>,
        prediction_col: str = <span class="hljs-string">'y_pred'</span>,
        privileged_group: int = <span class="hljs-number">1</span>,
        mod_threshold: float = <span class="hljs-number">0.1</span>,
    </span>) -&gt; dict:</span>

    metrics = dict()
    unprivileged_group = <span class="hljs-number">0</span> <span class="hljs-keyword">if</span> privileged_group == <span class="hljs-number">1</span> <span class="hljs-keyword">else</span> <span class="hljs-number">1</span>

    <span class="hljs-comment">## 1. risk assessment - predictive performance metrics by group</span>
    <span class="hljs-keyword">for</span> group, name <span class="hljs-keyword">in</span> zip([unprivileged_group, privileged_group], [<span class="hljs-string">'unprivileged'</span>, <span class="hljs-string">'privileged'</span>]):
        subset = df[df[sensitive_feature_col] == group]
        <span class="hljs-keyword">if</span> len(subset) == <span class="hljs-number">0</span>: <span class="hljs-keyword">continue</span>

        y_true = subset[label_col].values
        y_pred = subset[prediction_col].values

        metrics[<span class="hljs-string">f'mse_<span class="hljs-subst">{name}</span>'</span>] = float(mean_squared_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>
        metrics[<span class="hljs-string">f'mae_<span class="hljs-subst">{name}</span>'</span>] = float(mean_absolute_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>
        metrics[<span class="hljs-string">f'rmsle_<span class="hljs-subst">{name}</span>'</span>] = float(root_mean_squared_log_error(y_true, y_pred)) <span class="hljs-comment"># type: ignore</span>

        <span class="hljs-comment"># mean prediction (outcome disparity component)</span>
        metrics[<span class="hljs-string">f'mean_prediction_<span class="hljs-subst">{name}</span>'</span>] = float(y_pred.mean()) <span class="hljs-comment"># type: ignore</span>

    <span class="hljs-comment">## 2. bias assessment - fairness metrics</span>
    <span class="hljs-comment"># signed difference in mean absolute error (unprivileged - privileged)</span>
    mae_diff = metrics.get(<span class="hljs-string">'mae_unprivileged'</span>, <span class="hljs-number">0</span>) - metrics.get(<span class="hljs-string">'mae_privileged'</span>, <span class="hljs-number">0</span>)
    metrics[<span class="hljs-string">'mae_diff'</span>] = float(mae_diff)

    <span class="hljs-comment"># mean outcome difference</span>
    mod = metrics.get(<span class="hljs-string">'mean_prediction_unprivileged'</span>, <span class="hljs-number">0</span>) - metrics.get(<span class="hljs-string">'mean_prediction_privileged'</span>, <span class="hljs-number">0</span>)
    metrics[<span class="hljs-string">'mean_outcome_difference'</span>] = float(mod)
    metrics[<span class="hljs-string">'is_mod_acceptable'</span>] = <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> abs(mod) &lt;= mod_threshold <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>

    <span class="hljs-keyword">return</span> metrics


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    parser = argparse.ArgumentParser(description=<span class="hljs-string">'assess bias and fairness metrics on model inference results.'</span>)
    parser.add_argument(<span class="hljs-string">'inference_file_path'</span>, type=str, help=<span class="hljs-string">'parquet file path to the inference results w/ y_true, y_pred, and sensitive feature cols.'</span>)
    parser.add_argument(<span class="hljs-string">'metrics_output_path'</span>, type=str, help=<span class="hljs-string">'json file path to save the metrics output.'</span>)
    parser.add_argument(<span class="hljs-string">'sensitive_feature_col'</span>, type=str, help=<span class="hljs-string">'column name of sensitive features'</span>)
    parser.add_argument(<span class="hljs-string">'stockcode'</span>, type=str)
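    parser.add_argument(<span class="hljs-string">'privileged_group'</span>, type=int, nargs=<span class="hljs-string">'?'</span>, default=<span class="hljs-number">1</span>)  <span class="hljs-comment"># nargs='?' lets the default apply to a positional arg</span>
    parser.add_argument(<span class="hljs-string">'mod_threshold'</span>, type=float, nargs=<span class="hljs-string">'?'</span>, default=<span class="hljs-number">.1</span>)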
    args = parser.parse_args()

    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># load inf df</span>
        df_inference = pd.read_parquet(args.inference_file_path)
        LABEL_COL = <span class="hljs-string">'y_true'</span>
        PREDICTION_COL = <span class="hljs-string">'y_pred'</span>
        SENSITIVE_COL = args.sensitive_feature_col

        <span class="hljs-comment"># compute fairness metrics</span>
        metrics = calculate_fairness_metrics(
            df=df_inference,
            sensitive_feature_col=SENSITIVE_COL,
            label_col=LABEL_COL,
            prediction_col=PREDICTION_COL,
            privileged_group=args.privileged_group,
            mod_threshold=args.mod_threshold,
        )

        <span class="hljs-comment"># add items to metrics</span>
        metrics[<span class="hljs-string">'model_version'</span>] = <span class="hljs-string">f'dfn_<span class="hljs-subst">{args.stockcode}</span>_<span class="hljs-subst">{os.getpid()}</span>'</span>
        metrics[<span class="hljs-string">'sensitive_feature'</span>] = args.sensitive_feature_col
        metrics[<span class="hljs-string">'privileged_group'</span>] = args.privileged_group
        metrics[<span class="hljs-string">'mod_threshold'</span>] = args.mod_threshold
        metrics[<span class="hljs-string">'stockcode'</span>] = args.stockcode
        metrics[<span class="hljs-string">'timestamp'</span>] = datetime.datetime.now().isoformat()

        <span class="hljs-comment"># save metrics to json (tracked by dvc)</span>
        <span class="hljs-keyword">with</span> open(args.metrics_output_path, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f:
            json_metrics = { k: (v <span class="hljs-keyword">if</span> pd.notna(v) <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>) <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> metrics.items() }
            json.dump(json_metrics, f, indent=<span class="hljs-number">4</span>)

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        main_logger.error(<span class="hljs-string">f'... an error occurred during risk and fairness assessment: <span class="hljs-subst">{e}</span> ...'</span>)
        exit(<span class="hljs-number">1</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    main()
</code></pre>
<h4 id="heading-outputs-5">Outputs</h4>
<p>The final stage generates a metrics file which contains test results and model version:</p>
<p><code>metrics/dfn_risk_fairness.json</code>:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"mse_unprivileged"</span>: <span class="hljs-number">3.5370739412593575</span>,
    <span class="hljs-attr">"mae_unprivileged"</span>: <span class="hljs-number">1.48263614013523</span>,
    <span class="hljs-attr">"rmsle_unprivileged"</span>: <span class="hljs-number">0.6080000224747837</span>,
    <span class="hljs-attr">"mean_prediction_unprivileged"</span>: <span class="hljs-number">1.8507767915725708</span>,
    <span class="hljs-attr">"mae_diff"</span>: <span class="hljs-number">1.48263614013523</span>,
    <span class="hljs-attr">"mean_outcome_difference"</span>: <span class="hljs-number">1.8507767915725708</span>,
    <span class="hljs-attr">"is_mod_acceptable"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-attr">"model_version"</span>: <span class="hljs-string">"dfn_85123A_35971"</span>,
    <span class="hljs-attr">"sensitive_feature"</span>: <span class="hljs-string">"is_registered"</span>,
    <span class="hljs-attr">"privileged_group"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-attr">"mod_threshold"</span>: <span class="hljs-number">0.1</span>,
    <span class="hljs-attr">"timestamp"</span>: <span class="hljs-string">"2025-10-07T00:31:15.998590"</span>
}
</code></pre>
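<p>The two fairness numbers in this stage are plain differences between per-group statistics. As a minimal sketch, the snippet below recomputes them in plain Python on made-up data. The group values are hypothetical and only stand in for the real inference results:</p>
<pre><code class="lang-python"># toy illustration of mae_diff and mean_outcome_difference (all data is made up)

def mean(values):
    return sum(values) / len(values)

def mean_absolute_error(y_true, y_pred):
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])

# hypothetical labels and predictions for the two groups of a binary sensitive feature
groups = {
    'privileged':   {'y_true': [2.0, 4.0, 6.0], 'y_pred': [2.5, 4.0, 5.5]},
    'unprivileged': {'y_true': [1.0, 3.0, 5.0], 'y_pred': [2.0, 3.0, 4.0]},
}

metrics = {}
for name, data in groups.items():
    metrics[f'mae_{name}'] = mean_absolute_error(data['y_true'], data['y_pred'])
    metrics[f'mean_prediction_{name}'] = mean(data['y_pred'])

# same subtraction order as the script: unprivileged minus privileged
mae_diff = metrics['mae_unprivileged'] - metrics['mae_privileged']
mod = metrics['mean_prediction_unprivileged'] - metrics['mean_prediction_privileged']

print('mae_diff:', mae_diff)
print('mean_outcome_difference:', mod)
</code></pre>
<p>The script then compares <code>abs(mod)</code> with <code>mod_threshold</code> to set <code>is_mod_acceptable</code>. Note that its <code>.get(..., 0)</code> fallback means that if the metrics for one group are missing, each difference collapses to the other group’s value, which is consistent with the sample output above where <code>mae_diff</code> equals <code>mae_unprivileged</code>.</p>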
<p>That’s all for the lineage configuration. Now, we’ll test it locally.</p>
<h3 id="heading-test-in-local">Test in Local</h3>
<p>We’ll run the entire ML lineage with this command:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> repro -f
</code></pre>
<p><code>-f</code> forces DVC to rerun all the stages, whether or not anything has changed.</p>
<p>The command will automatically create the <code>dvc.lock</code> file at the root of the project directory:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">schema:</span> <span class="hljs-string">'2.0'</span>
<span class="hljs-attr">stages:</span>
  <span class="hljs-attr">etl_pipeline_full:</span>
    <span class="hljs-attr">cmd:</span> <span class="hljs-string">python</span> <span class="hljs-string">src/data_handling/etl_pipeline.py</span>
    <span class="hljs-attr">deps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">src/_utils/</span>
      <span class="hljs-attr">hash:</span> <span class="hljs-string">md5</span>
      <span class="hljs-attr">md5:</span> <span class="hljs-string">ae41392532188d290395495f6827ed00.dir</span>
      <span class="hljs-attr">size:</span> <span class="hljs-number">15870</span>
      <span class="hljs-attr">nfiles:</span> <span class="hljs-number">10</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">src/data_handling/</span>
      <span class="hljs-attr">hash:</span> <span class="hljs-string">md5</span>
      <span class="hljs-attr">md5:</span> <span class="hljs-string">a8a61a4b270581a7c387d51e416f4e86.dir</span>
      <span class="hljs-attr">size:</span> <span class="hljs-number">95715</span>
<span class="hljs-string">...</span>
</code></pre>
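<p>The <code>md5</code> values recorded in <code>dvc.lock</code> are what a plain <code>dvc repro</code> (without <code>-f</code>) compares against to decide whether a stage needs to rerun. The snippet below is a simplified sketch of that idea for a single file. It is not DVC’s actual implementation (DVC also hashes whole directories, as the <code>.dir</code> suffix above indicates), and the file name is hypothetical:</p>
<pre><code class="lang-python">import hashlib
import tempfile
from pathlib import Path

def md5_of_file(path):
    # hash the content in chunks so large files do not need to fit in memory
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            digest.update(chunk)
    return digest.hexdigest()

def is_stale(path, recorded_md5):
    # a dependency is stale when its current hash differs from the recorded one
    return md5_of_file(path) != recorded_md5

with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / 'etl_pipeline.py'    # hypothetical dependency file
    dep.write_text('print("v1")')
    recorded = md5_of_file(dep)            # what a lock file would store
    unchanged = is_stale(dep, recorded)    # content untouched since recording
    dep.write_text('print("v2")')
    changed = is_stale(dep, recorded)      # content edited after recording

print(unchanged, changed)
</code></pre>
<p>Because the comparison is on content rather than timestamps, touching a file without editing it would not mark it as stale in this sketch.</p>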
<p>The <code>dvc.lock</code> file must be committed to Git so that DVC loads the latest versions of the tracked files:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$git</span> add dvc.lock .dvc dvc.yaml params.yaml
<span class="hljs-variable">$git</span> commit -m<span class="hljs-string">'updated dvc config'</span>
<span class="hljs-variable">$git</span> push
</code></pre>
<h2 id="heading-step-3-deploying-the-dvc-project">Step 3: Deploying the DVC Project</h2>
<p>Next, we’ll deploy the DVC project to ensure the AWS Lambda function can access the cached files in production.</p>
<p>We’ll start by configuring the DVC remote where the cached files are stored.</p>
<p>DVC offers <a target="_blank" href="https://dvc.org/doc/user-guide/data-management/remote-storage#supported-storage-types">various storage types</a> like AWS S3 and Google Cloud. We’ll use AWS S3 for this project, but your choice depends on the project ecosystem, your familiarity with the tool, and any resource constraints.</p>
<p>First, we’ll create a new S3 bucket in the selected AWS region:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$aws</span> s3 mb s3://&lt;PROJECT NAME&gt;/&lt;BUCKET NAME&gt;  --region &lt;AWS REGION&gt;
</code></pre>
<p>Make sure the IAM role has the following permissions: <code>s3:ListBucket</code>, <code>s3:GetObject</code>, <code>s3:PutObject</code>, and <code>s3:DeleteObject</code>.</p>
<p>Then, add the URI of the S3 bucket to the DVC remote:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> remote add -d &lt;DVC REMOTE NAME&gt; s3://&lt;PROJECT NAME&gt;/&lt;BUCKET NAME&gt;
</code></pre>
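<p>A mistyped scheme in the remote URI (for example <code>ss3://</code>) only surfaces later, when pushing fails. As a small, optional sanity check, the sketch below validates an S3 URI before it is used. The helper name <code>parse_s3_uri</code> and the bucket and prefix names are hypothetical and not part of DVC:</p>
<pre><code class="lang-python">from urllib.parse import urlparse

def parse_s3_uri(uri):
    # split an S3 URI into (bucket, prefix); reject anything that is not s3://
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'not a valid S3 URI: {uri!r}')
    return parsed.netloc, parsed.path.lstrip('/')

# hypothetical bucket and prefix names
bucket, prefix = parse_s3_uri('s3://my-project/dvc-cache')
print(bucket, prefix)
</code></pre>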
<p>Next, push the cache files to the DVC remote:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$dvc</span> push
</code></pre>
<p>Now, all cache files are stored in the S3 bucket:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/0*yl9N4P8LNI7d_G_z.png" alt="Figure C. Screenshot of the DVC remote in AWS S3 bucket" width="600" height="400" loading="lazy"></p>
<p><strong>Figure C.</strong> Screenshot of the DVC remote in AWS S3 bucket</p>
<p>As shown in <strong>Figure A,</strong> this deployment step is necessary for the AWS Lambda function to access the DVC cache in production.</p>
<h2 id="heading-step-4-configuring-scheduled-run-with-prefect"><strong>Step 4: Configuring Scheduled Run with Prefect</strong></h2>
<p>The next step is to configure the scheduled run of the entire lineage with Prefect.</p>
<p>Prefect is an open-source workflow orchestration tool for building, scheduling, and monitoring pipelines. It uses a concept called a work pool to effectively decouple the orchestration logic from the execution infrastructure.</p>
<p>The work pool then serves as a standardized base configuration: it runs a Docker container image, which guarantees a consistent execution environment for all flows.</p>
<h3 id="heading-configuring-the-docker-image-registry">Configuring the Docker Image Registry</h3>
<p>The first step is to configure the Docker image registry for the Prefect work pool:</p>
<ul>
<li><p>For local deployment: <strong>A container registry on Docker Hub.</strong></p>
</li>
<li><p>For production deployment: <strong>AWS ECR</strong>.</p>
</li>
</ul>
<p>For local deployment, we’ll first authenticate the Docker client:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> login
</code></pre>
<p>On macOS, we’ll also grant the current user permission to run Docker commands without <code>sudo</code>:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$sudo</span> dscl . -append /Groups/docker GroupMembership <span class="hljs-variable">$USER</span>
</code></pre>
<p>For production deployment, we’ll create a new ECR repository:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$aws</span> ecr create-repository --repository-name &lt;REPOSITORY NAME&gt; --region &lt;AWS REGION&gt;
</code></pre>
<p>(Make sure the IAM role has access to this new ECR URI.)</p>
<h3 id="heading-configure-prefect-tasks-and-flows">Configure Prefect Tasks and Flows</h3>
<p>Next, we’ll configure the Prefect <code>task</code> and <code>flow</code> in the project:</p>
<ul>
<li><p>The Prefect <code>task</code> executes the <code>dvc repro</code> and <code>dvc push</code> commands.</p>
</li>
<li><p>The Prefect <code>flow</code> executes the Prefect <code>task</code> weekly.</p>
</li>
</ul>
<p><code>src/prefect_flows.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> subprocess
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> timedelta, datetime
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> prefect <span class="hljs-keyword">import</span> flow, task
<span class="hljs-keyword">from</span> prefect.schedules <span class="hljs-keyword">import</span> Schedule
<span class="hljs-keyword">from</span> prefect_aws <span class="hljs-keyword">import</span> AwsCredentials

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># add project root to the python path - enabling prefect to find the script</span>
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), <span class="hljs-string">'..'</span>)))

<span class="hljs-comment"># define the prefect task</span>
<span class="hljs-meta">@task(retries=3, retry_delay_seconds=30)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_dvc_pipeline</span>():</span>
    <span class="hljs-comment"># execute the dvc pipeline </span>
    result = subprocess.run([<span class="hljs-string">"dvc"</span>, <span class="hljs-string">"repro"</span>], capture_output=<span class="hljs-literal">True</span>, text=<span class="hljs-literal">True</span>, check=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># push the updated data</span>
    subprocess.run([<span class="hljs-string">"dvc"</span>, <span class="hljs-string">"push"</span>], check=<span class="hljs-literal">True</span>)


<span class="hljs-comment"># define the prefect flow</span>
<span class="hljs-meta">@flow(name="Weekly Data Pipeline")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">weekly_data_flow</span>():</span>
    run_dvc_pipeline()

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    <span class="hljs-comment"># docker image registry (either docker hub or aws ecr)</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    ENV = os.getenv(<span class="hljs-string">'ENV'</span>, <span class="hljs-string">'production'</span>)
    DOCKER_HUB_REPO = os.getenv(<span class="hljs-string">'DOCKER_HUB_REPO'</span>)
    ECR_FOR_PREFECT_PATH = os.getenv(<span class="hljs-string">'S3_BUCKET_FOR_PREFECT_PATH'</span>)
    image_repo = <span class="hljs-string">f'<span class="hljs-subst">{DOCKER_HUB_REPO}</span>:ml-sales-pred-data-latest'</span> <span class="hljs-keyword">if</span> ENV == <span class="hljs-string">'local'</span> <span class="hljs-keyword">else</span> <span class="hljs-string">f'<span class="hljs-subst">{ECR_FOR_PREFECT_PATH}</span>:latest'</span>

    <span class="hljs-comment"># define weekly schedule</span>
    weekly_schedule = Schedule(
        interval=timedelta(weeks=<span class="hljs-number">1</span>),
        anchor_date=datetime(<span class="hljs-number">2025</span>, <span class="hljs-number">9</span>, <span class="hljs-number">29</span>, <span class="hljs-number">9</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>),
        active=<span class="hljs-literal">True</span>,
    )

    <span class="hljs-comment"># aws credentials to access ecr</span>
    AwsCredentials(
        aws_access_key_id=os.getenv(<span class="hljs-string">'AWS_ACCESS_KEY_ID'</span>),
        aws_secret_access_key=os.getenv(<span class="hljs-string">'AWS_SECRET_ACCESS_KEY'</span>),
        region_name=os.getenv(<span class="hljs-string">'AWS_REGION_NAME'</span>),
    ).save(<span class="hljs-string">'aws'</span>, overwrite=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># deploy the prefect flow</span>
    weekly_data_flow.deploy(
        name=<span class="hljs-string">'weekly-data-flow'</span>,
        schedule=weekly_schedule, <span class="hljs-comment"># schedule</span>
        work_pool_name=<span class="hljs-string">"wp-ml-sales-pred"</span>, <span class="hljs-comment"># work pool where the docker image (flow) runs</span>
        image=image_repo, <span class="hljs-comment"># create a docker image at docker hub (local) or ecr (production)</span>
        concurrency_limit=<span class="hljs-number">3</span>,
        push=<span class="hljs-literal">True</span> <span class="hljs-comment"># push the docker image to the image_repo</span>
    )
</code></pre>
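<p>With an interval schedule, the anchor date is not a start time but a fixed reference point: runs fall on the anchor plus whole multiples of the interval. The sketch below illustrates that arithmetic with the standard library only. It is not Prefect’s implementation, the helper name <code>next_runs</code> and the value of <code>now</code> are hypothetical, and time zones are ignored:</p>
<pre><code class="lang-python">from datetime import datetime, timedelta

def next_runs(anchor, interval, now, count=3):
    # runs fall on anchor + k * interval; pick the smallest k that is not before `now`
    elapsed = now - anchor
    k = -(-elapsed // interval)  # ceiling division (timedelta // timedelta gives an int)
    return [anchor + (k + i) * interval for i in range(count)]

anchor = datetime(2025, 9, 29, 9, 0, 0)   # same anchor as the schedule above
now = datetime(2025, 10, 7, 0, 31)        # hypothetical current time
runs = next_runs(anchor, timedelta(weeks=1), now)

for run in runs:
    print(run.isoformat())
</code></pre>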
<h3 id="heading-test-in-local-1">Test in Local</h3>
<p>Next, we’ll test the workflow locally with the Prefect server:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$uv</span> run prefect server start

<span class="hljs-variable">$export</span> PREFECT_API_URL=<span class="hljs-string">"http://127.0.0.1:4200/api"</span>
</code></pre>
<p>Run the <code>prefect_flows.py</code> script:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$uv</span> run src/prefect_flows.py
</code></pre>
<p>After the script runs successfully, the Prefect dashboard shows that the workflow is scheduled to run:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*pUJppTJ4MloU2DVr.png" alt="Figure D. The screenshot of the Prefect dashboard" width="1260" height="586" loading="lazy"></p>
<p><strong>Figure D.</strong> Screenshot of the Prefect dashboard</p>
<h2 id="heading-step-5-deploying-the-application">Step 5: Deploying the Application</h2>
<p>The final step is to deploy the entire application as a containerized Lambda by configuring the <code>Dockerfile</code> and the Flask application scripts.</p>
<p>The specific process in this final deployment step depends on the infrastructure.</p>
<p>But the common point is that DVC eliminates the need to store the large Parquet or CSV files directly in the feature store or model store because it caches them as lightweight hashed files.</p>
<p>So, first, we’ll simplify the loading logic of the Flask application script by using the <code>dvc.api</code> module:</p>
<p><code>app.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment">### ... the remaining components stay the same  ...</span>

<span class="hljs-keyword">import</span> dvc.api

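DVC_REMOTE_NAME = <span class="hljs-string">'&lt;REMOTE NAME IN .dvc/config file&gt;'</span>  <span class="hljs-comment"># placeholder: replace with your remote name</span>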


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">configure_dvc_for_lambda</span>():</span>
    <span class="hljs-comment"># set dvc directories to /tmp</span>
    os.environ.update({
        <span class="hljs-string">'DVC_CACHE_DIR'</span>: <span class="hljs-string">'/tmp/dvc-cache'</span>,
        <span class="hljs-string">'DVC_DATA_DIR'</span>: <span class="hljs-string">'/tmp/dvc-data'</span>,
        <span class="hljs-string">'DVC_CONFIG_DIR'</span>: <span class="hljs-string">'/tmp/dvc-config'</span>,
        <span class="hljs-string">'DVC_GLOBAL_CONFIG_DIR'</span>: <span class="hljs-string">'/tmp/dvc-global-config'</span>,
        <span class="hljs-string">'DVC_SITE_CACHE_DIR'</span>: <span class="hljs-string">'/tmp/dvc-site-cache'</span>
    })
    <span class="hljs-keyword">for</span> dir_path <span class="hljs-keyword">in</span> [<span class="hljs-string">'/tmp/dvc-cache'</span>, <span class="hljs-string">'/tmp/dvc-data'</span>, <span class="hljs-string">'/tmp/dvc-config'</span>]:
        os.makedirs(dir_path, exist_ok=<span class="hljs-literal">True</span>)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_x_test</span>():</span>
    <span class="hljs-keyword">global</span> X_test
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.environ.get(<span class="hljs-string">'PYTEST_RUN'</span>, <span class="hljs-literal">False</span>):
        main_logger.info(<span class="hljs-string">"... loading x_test ..."</span>)

        <span class="hljs-comment"># config dvc directories</span>
        configure_dvc_for_lambda()
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">with</span> dvc.api.open(X_TEST_PATH, remote=DVC_REMOTE_NAME, mode=<span class="hljs-string">'rb'</span>) <span class="hljs-keyword">as</span> fd:
                X_test = pd.read_parquet(fd)
                main_logger.info(<span class="hljs-string">'✅ successfully loaded x_test via dvc api'</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f'❌ general loading error: <span class="hljs-subst">{e}</span>'</span>, exc_info=<span class="hljs-literal">True</span>)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_preprocessor</span>():</span>
    <span class="hljs-keyword">global</span> preprocessor
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.environ.get(<span class="hljs-string">'PYTEST_RUN'</span>, <span class="hljs-literal">False</span>):
        main_logger.info(<span class="hljs-string">"... loading preprocessor ..."</span>)
        configure_dvc_for_lambda()
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">with</span> dvc.api.open(PREPROCESSOR_PATH, remote=DVC_REMOTE_NAME, mode=<span class="hljs-string">'rb'</span>) <span class="hljs-keyword">as</span> fd:
                preprocessor = joblib.load(fd)
                main_logger.info(<span class="hljs-string">'✅ successfully loaded preprocessor via dvc api'</span>)

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f'❌ general loading error: <span class="hljs-subst">{e}</span>'</span>, exc_info=<span class="hljs-literal">True</span>)

<span class="hljs-comment">### ... the remaining components stay the same  ...</span>
</code></pre>
<p>Then, update the Dockerfile to enable Docker to correctly reference the DVC components:</p>
<p><code>Dockerfile.lambda.production</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment"># use an official python runtime</span>
FROM public.ecr.aws/<span class="hljs-keyword">lambda</span>/python:<span class="hljs-number">3.12</span>

<span class="hljs-comment"># set environment variables (adding dvc related env variables)</span>
ENV JOBLIB_MULTIPROCESSING=<span class="hljs-number">0</span>
ENV DVC_HOME=<span class="hljs-string">"/tmp/.dvc"</span>
ENV DVC_CACHE_DIR=<span class="hljs-string">"/tmp/.dvc/cache"</span>
ENV DVC_REMOTE_NAME=<span class="hljs-string">"storage"</span>
ENV DVC_GLOBAL_SITE_CACHE_DIR=<span class="hljs-string">"/tmp/dvc_global"</span>

<span class="hljs-comment"># copy requirements file and install dependencies</span>
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN python -m pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir dvc dvc-s3

<span class="hljs-comment"># setup dvc</span>
RUN dvc init --no-scm
RUN dvc config core.no_scm true

<span class="hljs-comment"># copy the code to the lambda task root</span>
COPY . ${LAMBDA_TASK_ROOT}
CMD [ <span class="hljs-string">"app.handler"</span> ]
</code></pre>
<p>Lastly, make sure the large files are excluded from the Docker container image:</p>
<p><code>.dockerignore</code>:</p>
<pre><code class="lang-bash"><span class="hljs-comment">### ... the rest components remain the same  ...</span>

<span class="hljs-comment"># dvc cache contains large files</span>
.dvc/cache
.dvcignore

<span class="hljs-comment"># add all folders that DVC will track</span>
data/
preprocessors/
models/
reports/
metrics/
</code></pre>
<h3 id="heading-test-in-local-2">Test in Local</h3>
<p>Finally, we’ll build and test the Docker image:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> build -t my-app -f Dockerfile.lambda.local .
<span class="hljs-variable">$docker</span> run -p 5002:5002 -e ENV=<span class="hljs-built_in">local</span> my-app app.py
</code></pre>
<p>If everything is configured correctly, the waitress server will run the Flask application.</p>
<p>After confirming the changes, push the code to Git:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$git</span> add .
<span class="hljs-variable">$git</span> commit -m<span class="hljs-string">'updated dockerfiles and flask app scripts'</span>
<span class="hljs-variable">$git</span> push
</code></pre>
<p>This <code>push</code> command triggers the CI/CD pipeline via GitHub Actions, which generates a Docker container image and pushes it to AWS ECR.</p>
<p>After the pipeline succeeds and the result is verified, we can manually run the deployment workflow in GitHub Actions.</p>
<p>And that’s it!</p>
<p>You can learn more here: <a target="_blank" href="https://medium.com/towards-artificial-intelligence/integrating-ci-cd-pipelines-to-machine-learning-applications-f5657c7fa164">Integrating the infrastructure CI/CD pipeline to an ML application</a></p>
<p>All code is available in <a target="_blank" href="https://github.com/krik8235/ml-sales-prediction">my GitHub repository</a>.</p>
<p>The mock app is available <a target="_blank" href="https://kuriko-iwai.vercel.app/online-commerce-intelligence-hub">here</a>.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Building robust ML applications requires comprehensive ML lineage to ensure reliability and traceability.</p>
<p>In this article, you learned how to build an ML lineage by integrating open-source services like DVC and Prefect.</p>
<p>In practice, initial planning matters. Specifically, defining which metrics are tracked and at which stages leads directly to a cleaner, more maintainable code structure that is easier to extend later.</p>
<p>Moving forward, we can consider adding more stages to the lineage and integrating advanced logic for data drift detection or fairness tests.</p>
<p>This will further ensure continued model performance and data integrity in the production environment.</p>
<p><strong>You can check out my</strong> <a target="_blank" href="https://kuriko-iwai.vercel.app/"><strong>Portfolio</strong></a> <strong>/</strong> <a target="_blank" href="https://github.com/krik8235"><strong>GitHub</strong></a><strong>.</strong></p>
<p><em>All images, unless otherwise noted, are by the author.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Transformers for Real-Time Gesture Recognition ]]>
                </title>
                <description>
                    <![CDATA[ Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are no... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/using-transformers-for-real-time-gesture-recognition/</link>
                <guid isPermaLink="false">68e3c692aa82abf4b593114c</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ transformers ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pytorch ]]>
                    </category>
                
                    <category>
                        <![CDATA[ ONNX ]]>
                    </category>
                
                    <category>
                        <![CDATA[ gradio ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Gesture Recognition ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Accessibility ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Tutorial ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ OMOTAYO OMOYEMI ]]>
                </dc:creator>
                <pubDate>Mon, 06 Oct 2025 13:39:30 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759757931295/5f19fd4e-93c0-4bd7-a75c-a7858e061ecd.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Gesture and sign recognition is a growing field in computer vision, powering accessibility tools and natural user interfaces. Most beginner projects rely on hand landmarks or small CNNs, but these often miss the bigger picture because gestures are not static images. Rather, they unfold over time. To build more robust, real-time systems, we need models that can capture both spatial details and temporal context.</p>
<p>This is where Transformers come in. Originally built for language, they’ve become state-of-the-art in vision tasks thanks to models like the Vision Transformer (ViT) and video-focused variants such as TimeSformer.</p>
<p>In this tutorial, we’ll use a Transformer backbone to create a lightweight real-time gesture recognition tool, optimized for small datasets and deployable on a regular laptop webcam.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-why-transformers-for-gestures">Why Transformers for Gestures?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-youll-learn">What You’ll Learn</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-project-setup">Project Setup</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-generate-a-gesture-dataset">Generate a Gesture Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-training-script-trainpy">Training Script: train.py</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-export-the-model-to-onnx">Export the Model to ONNX</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-next-steps">Next Steps</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-why-transformers-for-gestures">Why Transformers for Gestures?</h2>
<p>Transformers are powerful because they use self-attention to model relationships across a sequence. For gestures, this means the model doesn’t just see isolated frames, but also learns how movements evolve over time. A wave, for example, looks different from a raised hand only when viewed as a sequence.</p>
<p>Vision Transformers process images as patches, while video Transformers extend this to multiple frames with temporal attention. Even a simple approach, like applying ViT to each frame and pooling across time, can outperform traditional CNN-based methods for small datasets.</p>
<p>Combined with Hugging Face’s pre-trained models and ONNX Runtime for optimization, Transformers make it possible to train on a modest dataset and still achieve smooth real-time recognition.</p>
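<p>The “apply the model to each frame, then pool across time” idea can be sketched without any framework. In the snippet below, each frame is represented by a made-up feature vector (in a real model these would be ViT embeddings), and mean pooling is assumed as the pooling operation. The function name <code>temporal_mean_pool</code> is hypothetical:</p>
<pre><code class="lang-python">def temporal_mean_pool(frame_features):
    # frame_features: one feature vector (list of floats) per video frame
    num_frames = len(frame_features)
    # average each feature dimension across all frames
    return [sum(column) / num_frames for column in zip(*frame_features)]

# hypothetical 2-dimensional features for a 3-frame clip
clip = [[1.0, 0.0], [2.0, 3.0], [3.0, 6.0]]
clip_feature = temporal_mean_pool(clip)
print(clip_feature)
</code></pre>
<p>The pooled vector has the same size as a single frame’s features, so one classification head can handle clips of any length. The trade-off is that plain averaging discards frame order, which is why video-specific models add temporal attention.</p>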
<h2 id="heading-what-youll-learn">What You’ll Learn</h2>
<p>In this tutorial, you’ll build a gesture recognition system using Transformers. By the end, you’ll know how to:</p>
<ul>
<li><p>Create (or record) a tiny gesture dataset</p>
</li>
<li><p>Train a Vision Transformer (ViT) with temporal pooling</p>
</li>
<li><p>Export the model to ONNX for faster inference</p>
</li>
<li><p>Build a real-time Gradio app that classifies gestures from your webcam</p>
</li>
<li><p>Evaluate your model’s accuracy and latency with simple scripts</p>
</li>
<li><p>Understand the accessibility potential and ethical limits of gesture recognition</p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along, you should have:</p>
<ul>
<li><p>Basic Python knowledge (functions, scripts, virtual environments)</p>
</li>
<li><p>Familiarity with PyTorch (tensors, datasets, training loops) – helpful but not required</p>
</li>
<li><p>Python 3.8+ installed on your system</p>
</li>
<li><p>A webcam (for the live demo in Gradio)</p>
</li>
<li><p>Optionally: GPU access (training on CPU works, but is slower)</p>
</li>
</ul>
<h2 id="heading-project-setup">Project Setup</h2>
<p>Create a new project folder and install the required libraries.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Create a new project directory and navigate into it</span>
mkdir transformer-gesture &amp;&amp; <span class="hljs-built_in">cd</span> transformer-gesture

<span class="hljs-comment"># Set up a Python virtual environment</span>
python -m venv .venv

<span class="hljs-comment"># Activate the virtual environment</span>
<span class="hljs-comment"># Windows PowerShell</span>
.venv\Scripts\Activate.ps1

<span class="hljs-comment"># macOS/Linux</span>
<span class="hljs-built_in">source</span> .venv/bin/activate
</code></pre>
<p>These commands set up a new Python project with a virtual environment. Here's a breakdown of each part:</p>
<ol>
<li><p><code>mkdir transformer-gesture &amp;&amp; cd transformer-gesture</code>: This command creates a new directory named "transformer-gesture" and then navigates into it.</p>
</li>
<li><p><code>python -m venv .venv</code>: This command creates a new virtual environment in the current directory. The virtual environment is stored in a folder named ".venv".</p>
</li>
<li><p>Activating the virtual environment:</p>
<ul>
<li><p>For Windows PowerShell, you can use <code>.venv\Scripts\Activate.ps1</code> to activate the virtual environment.</p>
</li>
<li><p>For macOS/Linux, use <code>source .venv/bin/activate</code> to activate the virtual environment.</p>
</li>
</ul>
</li>
</ol>
<p>Activating a virtual environment ensures that the Python interpreter and any packages you install are isolated to this specific project, preventing conflicts with other projects or system-wide packages.</p>
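<p>If you're unsure whether activation worked, a quick check from Python can help. This is a minimal sketch that relies on the standard library only; <code>in_virtualenv</code> is a helper name chosen for this example.</p>
<pre><code class="lang-python">import sys

def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("Virtual environment active:", in_virtualenv())
print("Interpreter:", sys.executable)
</code></pre>
<p>When the environment is active, the interpreter path should sit inside your project's <code>.venv</code> folder.</p>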
<p>Create a <code>requirements.txt</code> file:</p>
<pre><code class="lang-plaintext">torch&gt;=2.0
torchvision
torchaudio
timm
huggingface_hub

onnx
onnxruntime

gradio

numpy
opencv-python
pillow

matplotlib
seaborn
scikit-learn
</code></pre>
<p>These are the project's dependencies. Here's a brief explanation of each package:</p>
<ol>
<li><p><strong>torch&gt;=2.0</strong>: PyTorch is a popular open-source deep learning framework that provides a flexible and efficient platform for building and training neural networks. Version 2.0 added <code>torch.compile</code> alongside other performance improvements.</p>
</li>
<li><p><strong>torchvision</strong>: This library is part of the PyTorch ecosystem and provides tools for computer vision tasks, including datasets, model architectures, and image transformations.</p>
</li>
<li><p><strong>torchaudio</strong>: Also part of the PyTorch ecosystem, Torchaudio provides audio processing tools and datasets, making it easier to work with audio data in deep learning projects.</p>
</li>
<li><p><strong>timm</strong>: The PyTorch Image Models (timm) library offers a collection of pre-trained models and utilities for computer vision tasks, facilitating quick experimentation and deployment.</p>
</li>
<li><p><strong>huggingface_hub</strong>: This library allows easy access to models and datasets hosted on the Hugging Face Hub, a platform for sharing and collaborating on machine learning models and datasets.</p>
</li>
<li><p><strong>onnx</strong>: The Open Neural Network Exchange (ONNX) format is used to represent machine learning models, enabling interoperability between different frameworks.</p>
</li>
<li><p><strong>onnxruntime</strong>: This is a high-performance runtime for executing ONNX models, allowing for efficient deployment across various platforms.</p>
</li>
<li><p><strong>gradio</strong>: Gradio is a library for creating user interfaces for machine learning models, making them accessible through a web interface for easy interaction and testing.</p>
</li>
<li><p><strong>numpy</strong>: A fundamental package for numerical computing in Python, providing support for arrays and a wide range of mathematical functions.</p>
</li>
<li><p><strong>opencv-python</strong>: OpenCV is a library for computer vision and image processing tasks, widely used for real-time applications.</p>
</li>
<li><p><strong>pillow</strong>: A Python Imaging Library (PIL) fork, Pillow provides tools for opening, manipulating, and saving many different image file formats.</p>
</li>
<li><p><strong>matplotlib</strong>: A plotting library for Python, Matplotlib is used for creating static, interactive, and animated visualizations.</p>
</li>
<li><p><strong>seaborn</strong>: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.</p>
</li>
<li><p><strong>scikit-learn</strong>: A machine learning library in Python that provides simple and efficient tools for data analysis and modeling, including classification, regression, clustering, and dimensionality reduction.</p>
</li>
</ol>
<p>Install dependencies:</p>
<pre><code class="lang-bash">pip install -r requirements.txt
</code></pre>
<p>The command <code>pip install -r requirements.txt</code> tells <code>pip</code>, the Python package installer, to read <code>requirements.txt</code> and install every package listed there. Each line names a package and, optionally, a version constraint.</p>
<p>Keeping dependencies in one file like this is a common way to manage and share them across machines.</p>
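<p>After installing, you may want to confirm that everything can be found by Python. The sketch below is one way to do that with the standard library. Note that some import names differ from the pip package names (<code>opencv-python</code> is imported as <code>cv2</code>, <code>pillow</code> as <code>PIL</code>, and <code>scikit-learn</code> as <code>sklearn</code>). <code>missing_modules</code> is a helper name invented for this example.</p>
<pre><code class="lang-python">import importlib.util

# Import names for the packages listed in requirements.txt
MODULES = ["torch", "torchvision", "torchaudio", "timm", "huggingface_hub",
           "onnx", "onnxruntime", "gradio", "numpy", "cv2", "PIL",
           "matplotlib", "seaborn", "sklearn"]

def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_modules(MODULES)
print("Missing:", ", ".join(missing) if missing else "none")
</code></pre>
<p>This only checks that each module can be located. It doesn't import the modules or verify their versions.</p>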
<h2 id="heading-generate-a-gesture-dataset">Generate a Gesture Dataset</h2>
<p>To train our Transformer-based gesture recognizer, we need some data. Instead of downloading a huge dataset, we’ll start with a tiny synthetic dataset you can generate in seconds. This makes the tutorial lightweight and ensures that everyone can follow along without dealing with multi-gigabyte downloads.</p>
<h2 id="heading-option-1-generate-a-synthetic-dataset">Option 1: Generate a Synthetic Dataset</h2>
<p>We’ll use a small Python script that creates short <code>.mp4</code> clips of a moving (or still) coloured box. Each class represents a gesture:</p>
<ul>
<li><p><strong>swipe_left</strong> – box moves from right to left</p>
</li>
<li><p><strong>swipe_right</strong> – box moves from left to right</p>
</li>
<li><p><strong>stop</strong> – box stays still in the center</p>
</li>
</ul>
<p>Save this script as <code>generate_synthetic_gestures.py</code> in your project root:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, cv2, numpy <span class="hljs-keyword">as</span> np, random, argparse

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ensure_dir</span>(<span class="hljs-params">p</span>):</span> os.makedirs(p, exist_ok=<span class="hljs-literal">True</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">make_clip</span>(<span class="hljs-params">mode, out_path, seconds=<span class="hljs-number">1.5</span>, fps=<span class="hljs-number">16</span>, size=<span class="hljs-number">224</span>, box_size=<span class="hljs-number">60</span>, seed=<span class="hljs-number">0</span>, codec=<span class="hljs-string">"mp4v"</span></span>):</span>
    rng = random.Random(seed)
    frames = int(seconds * fps)
    H = W = size

    <span class="hljs-comment"># background + box color</span>
    bg_val = rng.randint(<span class="hljs-number">160</span>, <span class="hljs-number">220</span>)
    bg = np.full((H, W, <span class="hljs-number">3</span>), bg_val, dtype=np.uint8)
    color = (rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>), rng.randint(<span class="hljs-number">20</span>, <span class="hljs-number">80</span>))

    <span class="hljs-comment"># path of motion</span>
    y = rng.randint(<span class="hljs-number">40</span>, H - <span class="hljs-number">40</span> - box_size)
    <span class="hljs-keyword">if</span> mode == <span class="hljs-string">"swipe_left"</span>:
        x_start, x_end = W - <span class="hljs-number">20</span> - box_size, <span class="hljs-number">20</span>
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"swipe_right"</span>:
        x_start, x_end = <span class="hljs-number">20</span>, W - <span class="hljs-number">20</span> - box_size
    <span class="hljs-keyword">elif</span> mode == <span class="hljs-string">"stop"</span>:
        x_start = x_end = (W - box_size) // <span class="hljs-number">2</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"Unknown mode: <span class="hljs-subst">{mode}</span>"</span>)

    fourcc = cv2.VideoWriter_fourcc(*codec)
    vw = cv2.VideoWriter(out_path, fourcc, fps, (W, H))
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> vw.isOpened():
        <span class="hljs-keyword">raise</span> RuntimeError(
            <span class="hljs-string">f"Could not open VideoWriter with codec '<span class="hljs-subst">{codec}</span>'. "</span>
            <span class="hljs-string">"Try --codec XVID and use .avi extension, e.g. out.avi"</span>
        )

    <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> range(frames):
        alpha = t / max(<span class="hljs-number">1</span>, frames - <span class="hljs-number">1</span>)
        x = int((<span class="hljs-number">1</span> - alpha) * x_start + alpha * x_end)
        <span class="hljs-comment"># small jitter to avoid being too synthetic</span>
        jitter_x, jitter_y = rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>), rng.randint(<span class="hljs-number">-2</span>, <span class="hljs-number">2</span>)
        frame = bg.copy()
        cv2.rectangle(frame, (x + jitter_x, y + jitter_y),
                      (x + jitter_x + box_size, y + jitter_y + box_size),
                      color, thickness=<span class="hljs-number">-1</span>)
        <span class="hljs-comment"># overlay text</span>
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>), <span class="hljs-number">2</span>, cv2.LINE_AA)
        cv2.putText(frame, mode, (<span class="hljs-number">8</span>, <span class="hljs-number">24</span>), cv2.FONT_HERSHEY_SIMPLEX, <span class="hljs-number">0.7</span>, (<span class="hljs-number">255</span>, <span class="hljs-number">255</span>, <span class="hljs-number">255</span>), <span class="hljs-number">1</span>, cv2.LINE_AA)
        vw.write(frame)

    vw.release()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_labels</span>(<span class="hljs-params">labels, out_dir</span>):</span>
    <span class="hljs-keyword">with</span> open(os.path.join(out_dir, <span class="hljs-string">"labels.txt"</span>), <span class="hljs-string">"w"</span>, encoding=<span class="hljs-string">"utf-8"</span>) <span class="hljs-keyword">as</span> f:
        <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> labels:
            f.write(c + <span class="hljs-string">"\n"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    ap = argparse.ArgumentParser(description=<span class="hljs-string">"Generate a tiny synthetic gesture dataset."</span>)
    ap.add_argument(<span class="hljs-string">"--out"</span>, default=<span class="hljs-string">"data"</span>, help=<span class="hljs-string">"Output directory (default: data)"</span>)
    ap.add_argument(<span class="hljs-string">"--classes"</span>, nargs=<span class="hljs-string">"+"</span>,
                    default=[<span class="hljs-string">"swipe_left"</span>, <span class="hljs-string">"swipe_right"</span>, <span class="hljs-string">"stop"</span>],
                    help=<span class="hljs-string">"Class names (default: swipe_left swipe_right stop)"</span>)
    ap.add_argument(<span class="hljs-string">"--clips"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Clips per class (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--seconds"</span>, type=float, default=<span class="hljs-number">1.5</span>, help=<span class="hljs-string">"Seconds per clip (default: 1.5)"</span>)
    ap.add_argument(<span class="hljs-string">"--fps"</span>, type=int, default=<span class="hljs-number">16</span>, help=<span class="hljs-string">"Frames per second (default: 16)"</span>)
    ap.add_argument(<span class="hljs-string">"--size"</span>, type=int, default=<span class="hljs-number">224</span>, help=<span class="hljs-string">"Frame size WxH (default: 224)"</span>)
    ap.add_argument(<span class="hljs-string">"--box"</span>, type=int, default=<span class="hljs-number">60</span>, help=<span class="hljs-string">"Box size (default: 60)"</span>)
    ap.add_argument(<span class="hljs-string">"--codec"</span>, default=<span class="hljs-string">"mp4v"</span>, help=<span class="hljs-string">"Codec fourcc (mp4v or XVID)"</span>)
    ap.add_argument(<span class="hljs-string">"--ext"</span>, default=<span class="hljs-string">".mp4"</span>, help=<span class="hljs-string">"File extension (.mp4 or .avi)"</span>)
    args = ap.parse_args()

    ensure_dir(args.out)
    write_labels(args.classes, <span class="hljs-string">"."</span>)  <span class="hljs-comment"># writes labels.txt to project root</span>

    print(<span class="hljs-string">f"Generating synthetic dataset -&gt; <span class="hljs-subst">{args.out}</span>"</span>)
    <span class="hljs-keyword">for</span> cls <span class="hljs-keyword">in</span> args.classes:
        cls_dir = os.path.join(args.out, cls)
        ensure_dir(cls_dir)
        mode = <span class="hljs-string">"stop"</span> <span class="hljs-keyword">if</span> cls == <span class="hljs-string">"stop"</span> <span class="hljs-keyword">else</span> (<span class="hljs-string">"swipe_left"</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"left"</span> <span class="hljs-keyword">in</span> cls <span class="hljs-keyword">else</span> (<span class="hljs-string">"swipe_right"</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"right"</span> <span class="hljs-keyword">in</span> cls <span class="hljs-keyword">else</span> <span class="hljs-string">"stop"</span>))
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(args.clips):
            filename = os.path.join(cls_dir, <span class="hljs-string">f"<span class="hljs-subst">{cls}</span>_<span class="hljs-subst">{i+<span class="hljs-number">1</span>:<span class="hljs-number">03</span>d}</span><span class="hljs-subst">{args.ext}</span>"</span>)
            make_clip(
                mode=mode,
                out_path=filename,
                seconds=args.seconds,
                fps=args.fps,
                size=args.size,
                box_size=args.box,
                seed=i + <span class="hljs-number">1</span>,
                codec=args.codec
            )
        print(<span class="hljs-string">f"  <span class="hljs-subst">{cls}</span>: <span class="hljs-subst">{args.clips}</span> clips"</span>)

    print(<span class="hljs-string">"Done. You can now run: python train.py, python export_onnx.py, python app.py"</span>)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
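<p>The script generates a synthetic gesture dataset: short video clips of a moving or stationary coloured box that simulate "swipe left," "swipe right," and "stop," saved in the output directory you specify. Note that it also draws the class name in the top-left corner of every frame, so the label is visible in the pixels themselves. Remove the two <code>cv2.putText</code> calls if you want the model to rely on motion alone.</p>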
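<p>To see how the box moves, here is a stripped-down sketch of the interpolation used in <code>make_clip</code>, with the script's default values (1.5 seconds, 16 fps, 224-pixel frames, 60-pixel box). It leaves out the random jitter, and <code>box_positions</code> is a name chosen for this example.</p>
<pre><code class="lang-python">def box_positions(x_start, x_end, frames):
    """Linearly interpolate the box's x coordinate across the clip."""
    xs = []
    for t in range(frames):
        alpha = t / max(1, frames - 1)   # 0.0 on the first frame, 1.0 on the last
        xs.append(int((1 - alpha) * x_start + alpha * x_end))
    return xs

frames = int(1.5 * 16)                   # seconds * fps = 24 frames
size, box = 224, 60
left = box_positions(size - 20 - box, 20, frames)    # swipe_left: 144 down to 20
right = box_positions(20, size - 20 - box, frames)   # swipe_right: 20 up to 144
print(frames, left[0], left[-1], right[0], right[-1])  # 24 144 20 20 144
</code></pre>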
<p>Now run it inside your virtual environment:</p>
<pre><code class="lang-bash">python generate_synthetic_gestures.py --out data --clips 16 --seconds 1.5
</code></pre>
<p>This generates 16 clips per gesture, each 1.5 seconds long (24 frames at the default 16 fps), and saves them in the <code>data</code> directory.</p>
<p>This creates a dataset like:</p>
<pre><code class="lang-plaintext">data/
  swipe_left/*.mp4
  swipe_right/*.mp4
  stop/*.mp4
labels.txt
</code></pre>
<p>Each folder contains short clips of a moving (or still) box that simulate gestures. This is perfect for testing the pipeline.</p>
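<p>Before training, it can be worth confirming that the folders contain what you expect. The following is a small sketch using only the standard library. It assumes the <code>data/&lt;class&gt;/</code> layout shown above, and <code>count_clips</code> is a hypothetical helper, not part of the tutorial's scripts.</p>
<pre><code class="lang-python">from pathlib import Path

def count_clips(root="data", exts=(".mp4", ".avi")):
    """Map each class folder under root to the number of video files in it."""
    counts = {}
    for cls_dir in sorted(Path(root).iterdir()):
        if cls_dir.is_dir():
            counts[cls_dir.name] = sum(
                1 for p in cls_dir.iterdir() if p.suffix.lower() in exts
            )
    return counts

# Only runs the check when a data/ folder exists in the current directory
if Path("data").is_dir():
    print(count_clips("data"))
</code></pre>
<p>With the default settings, each of the three classes should report 16 clips.</p>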
<h3 id="heading-training-script-trainpy">Training Script: <code>train.py</code></h3>
<p>Now that we have our dataset, let’s fine-tune a Vision Transformer with temporal pooling. This model applies ViT frame-by-frame, averages embeddings across time, and trains a classification head on your gestures.</p>
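<p>To make "averages embeddings across time" concrete, here is a toy sketch in plain Python. The per-frame vectors are invented numbers standing in for ViT embeddings, and <code>mean_pool</code> is a name chosen for this example. It also shows a property worth keeping in mind: a plain mean ignores frame order, so a sequence and its reverse pool to the same vector.</p>
<pre><code class="lang-python">import math

def mean_pool(frame_feats):
    """Average a list of per-frame feature vectors over the time axis."""
    steps = len(frame_feats)
    dim = len(frame_feats[0])
    # math.fsum returns an exactly rounded sum, so frame order cannot change it
    return [math.fsum(f[d] for f in frame_feats) / steps for d in range(dim)]

# Toy "embeddings" for a 4-frame clip (made-up values, 2 dimensions each)
clip = [[0.1, 1.0], [0.3, 0.8], [0.6, 0.4], [0.9, 0.2]]
forward = mean_pool(clip)
backward = mean_pool(list(reversed(clip)))
print(forward == backward)
</code></pre>
<p>This matters for direction-sensitive gestures: a left swipe played backwards looks like a right swipe, and mean pooling alone cannot tell the two orders apart.</p>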
<p>Here’s the full training script:</p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
<span class="hljs-keyword">import</span> torch, torch.nn <span class="hljs-keyword">as</span> nn, torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader
<span class="hljs-keyword">import</span> timm
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ViTTemporal</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-string">"""Frame-wise ViT encoder -&gt; mean pool over time -&gt; linear head."""</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, num_classes, vit_name=<span class="hljs-string">"vit_tiny_patch16_224"</span></span>):</span>
        super().__init__()
        self.vit = timm.create_model(vit_name, pretrained=<span class="hljs-literal">True</span>, num_classes=<span class="hljs-number">0</span>, global_pool=<span class="hljs-string">"avg"</span>)
        feat_dim = self.vit.num_features
        self.head = nn.Linear(feat_dim, num_classes)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>  <span class="hljs-comment"># x: (B,T,C,H,W)</span>
        B, T, C, H, W = x.shape
        x = x.view(B * T, C, H, W)
        feats = self.vit(x)                  <span class="hljs-comment"># (B*T, D)</span>
        feats = feats.view(B, T, <span class="hljs-number">-1</span>).mean(dim=<span class="hljs-number">1</span>)  <span class="hljs-comment"># (B, D)</span>
        <span class="hljs-keyword">return</span> self.head(feats)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>():</span>
    device = <span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>
    labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
    n_classes = len(labels)

    train_ds = GestureClips(train=<span class="hljs-literal">True</span>)
    val_ds   = GestureClips(train=<span class="hljs-literal">False</span>)
    print(<span class="hljs-string">f"Train clips: <span class="hljs-subst">{len(train_ds)}</span> | Val clips: <span class="hljs-subst">{len(val_ds)}</span>"</span>)

    <span class="hljs-comment"># Windows/CPU friendly</span>
    train_dl = DataLoader(train_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">True</span>,  num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)
    val_dl   = DataLoader(val_ds,   batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>, num_workers=<span class="hljs-number">0</span>, pin_memory=<span class="hljs-literal">False</span>)

    model = ViTTemporal(num_classes=n_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=<span class="hljs-number">3e-4</span>, weight_decay=<span class="hljs-number">0.05</span>)

    best_acc = <span class="hljs-number">0.0</span>
    epochs = <span class="hljs-number">5</span>
    <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, epochs + <span class="hljs-number">1</span>):
        <span class="hljs-comment"># ---- Train ----</span>
        model.train()
        total, correct, loss_sum = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0.0</span>
        <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> train_dl:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()

            loss_sum += loss.item() * x.size(<span class="hljs-number">0</span>)
            correct += (logits.argmax(<span class="hljs-number">1</span>) == y).sum().item()
            total += x.size(<span class="hljs-number">0</span>)

        train_acc = correct / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>
        train_loss = loss_sum / total <span class="hljs-keyword">if</span> total <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        <span class="hljs-comment"># ---- Validate ----</span>
        model.eval()
        vtotal, vcorrect = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
        <span class="hljs-keyword">with</span> torch.no_grad():
            <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
                x, y = x.to(device), y.to(device)
                vcorrect += (model(x).argmax(<span class="hljs-number">1</span>) == y).sum().item()
                vtotal += x.size(<span class="hljs-number">0</span>)
        val_acc = vcorrect / vtotal <span class="hljs-keyword">if</span> vtotal <span class="hljs-keyword">else</span> <span class="hljs-number">0.0</span>

        print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch:<span class="hljs-number">02</span>d}</span> | train_loss <span class="hljs-subst">{train_loss:<span class="hljs-number">.4</span>f}</span> "</span>
              <span class="hljs-string">f"| train_acc <span class="hljs-subst">{train_acc:<span class="hljs-number">.3</span>f}</span> | val_acc <span class="hljs-subst">{val_acc:<span class="hljs-number">.3</span>f}</span>"</span>)

        <span class="hljs-keyword">if</span> val_acc &gt; best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(), <span class="hljs-string">"vit_temporal_best.pt"</span>)

    print(<span class="hljs-string">"Best val acc:"</span>, best_acc)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    train()
</code></pre>
<p>Running the command <code>python train.py</code> initiates the training process for your gesture recognition model. Here's a breakdown of what happens:</p>
<ol>
<li><p><strong>Load your dataset from data/</strong>: The script will access and load the gesture dataset stored in the "data" directory. This dataset is used to train the model.</p>
</li>
<li><p><strong>Fine-tune a pre-trained Vision Transformer</strong>: The training script will take a Vision Transformer model that has been pre-trained on a larger dataset and fine-tune it using your specific gesture dataset. Fine-tuning helps the model adapt to the nuances of your data, improving its performance on the specific task of gesture recognition.</p>
</li>
<li><p><strong>Save the best checkpoint as vit_temporal_best.pt</strong>: After each epoch, the script evaluates the model on a validation set. Whenever validation accuracy improves on the best value so far, the model's weights are saved to "vit_temporal_best.pt". This file can later be used for inference or further training.</p>
</li>
</ol>
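<p>Both <code>train.py</code> and <code>export_onnx.py</code> import <code>read_labels</code> from a <code>dataset</code> module and unpack its result as <code>labels, _</code>. That module isn't shown in this part of the tutorial, so the following is only a guess at a compatible helper. It assumes one class name per line in <code>labels.txt</code> (which is what the generator script writes) and assumes the second return value is a name-to-index mapping.</p>
<pre><code class="lang-python">def read_labels(path="labels.txt"):
    """Hypothetical helper: return (class names, name-to-index mapping)."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines and strip trailing newlines
        labels = [line.strip() for line in f if line.strip()]
    label_to_idx = {name: i for i, name in enumerate(labels)}
    return labels, label_to_idx
</code></pre>
<p>With the default classes, this would give <code>["swipe_left", "swipe_right", "stop"]</code> and indices 0, 1, and 2 in that order.</p>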
<h4 id="heading-what-training-looks-like">What Training Looks Like</h4>
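<p>You should see logs similar to this (the excerpt shows three epochs; with <code>epochs = 5</code> in the script, you'll see five epoch lines):</p>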
<pre><code class="lang-plaintext">Train clips: 38 | Val clips: 10
Epoch 01 | train_loss 1.4508 | train_acc 0.395 | val_acc 0.200
Epoch 02 | train_loss 1.2466 | train_acc 0.263 | val_acc 0.200
Epoch 03 | train_loss 1.1361 | train_acc 0.368 | val_acc 0.200
Best val acc: 0.200
</code></pre>
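<p>Don’t worry if your accuracy is low at first. With such a tiny synthetic dataset, that’s normal. With three classes, random guessing scores about 0.33, so the numbers above are at or below chance. The key is proving that the Transformer pipeline works end to end. You can boost results later by:</p>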
<ul>
<li><p>Adding more clips per class</p>
</li>
<li><p>Training for more epochs</p>
</li>
<li><p>Switching to real recorded gestures</p>
</li>
</ul>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/training-logs.png?raw=true" alt="Training logs" width="600" height="400" loading="lazy"></p>
<p>Figure 1. Example training logs from <code>train.py</code>, where the Vision Transformer with temporal pooling is fine-tuned on a tiny synthetic dataset.</p>
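<p>To judge logs like these, it helps to know what random guessing would score. Here is a small sketch, assuming balanced classes and the natural-log cross-entropy that <code>nn.CrossEntropyLoss</code> computes. <code>chance_baseline</code> is a name made up for this example.</p>
<pre><code class="lang-python">import math

def chance_baseline(num_classes):
    """Accuracy and cross-entropy of a uniform guess over num_classes."""
    accuracy = 1.0 / num_classes
    loss = math.log(num_classes)       # -log(1 / num_classes)
    return accuracy, loss

acc, loss = chance_baseline(3)
print(f"chance accuracy {acc:.3f} | chance loss {loss:.4f}")
# chance accuracy 0.333 | chance loss 1.0986
</code></pre>
<p>A training loss that hovers near 1.10 with three classes suggests the model hasn't yet learned much beyond guessing.</p>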
<h3 id="heading-export-the-model-to-onnx">Export the Model to ONNX</h3>
<p>To make our model easier to run in real time (and lighter on CPU), we’ll export it to the ONNX format.</p>
<p><strong>Note:</strong> ONNX, which stands for Open Neural Network Exchange, is an open-source format designed to facilitate the interchange of deep learning models between different frameworks. It lets you train a model in one framework, such as PyTorch or TensorFlow, and then run it with a different runtime, such as ONNX Runtime or TensorRT, without needing to rewrite the model. This interoperability is achieved by providing a standardized representation of the model's architecture and parameters.</p>
<p>ONNX supports a wide range of operators and is continually updated to include new features, making it a versatile choice for deploying machine learning models across various platforms and devices.</p>
<p>Create a file called <code>export_onnx.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

<span class="hljs-comment"># Dummy input: batch=1, 16 frames, 3x224x224</span>
dummy = torch.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>)

<span class="hljs-comment"># Export</span>
torch.onnx.export(
    model, dummy, <span class="hljs-string">"vit_temporal.onnx"</span>,
    input_names=[<span class="hljs-string">"video"</span>], output_names=[<span class="hljs-string">"logits"</span>],
    dynamic_axes={<span class="hljs-string">"video"</span>: {<span class="hljs-number">0</span>: <span class="hljs-string">"batch"</span>}},
    opset_version=<span class="hljs-number">13</span>
)

print(<span class="hljs-string">"Exported vit_temporal.onnx"</span>)
</code></pre>
<p>Run it with <code>python export_onnx.py</code>.</p>
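<p>This generates a file <code>vit_temporal.onnx</code> in your project folder. ONNX lets us run the model with <code>onnxruntime</code>, which is often faster than eager PyTorch for CPU inference, though the gain depends on your hardware and model.</p>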
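<p>Rather than taking the speed-up on faith, you can measure it. The helper below is a generic timing sketch built on the standard library. <code>median_latency_ms</code> is a name chosen for this example, and the workload is a stand-in so the snippet runs on its own.</p>
<pre><code class="lang-python">import statistics
import time

def median_latency_ms(fn, warmup=3, runs=20):
    """Call fn repeatedly and return the median wall-clock time in milliseconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        fn()
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(times)

# Stand-in workload; swap in your own inference call
ms = median_latency_ms(lambda: sum(range(10000)))
print(f"{ms:.3f} ms")
</code></pre>
<p>With the exported model, you could pass something like <code>lambda: session.run(None, {"video": clip})</code> for ONNX Runtime and an equivalent PyTorch forward pass for comparison, where <code>session</code> and <code>clip</code> are objects you've already created.</p>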
<p>Next, create the Gradio app in a file called <code>app.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os, tempfile, cv2, torch, onnxruntime, numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

T = <span class="hljs-number">16</span>
SIZE = <span class="hljs-number">224</span>
MODEL_PATH = <span class="hljs-string">"vit_temporal.onnx"</span>

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

<span class="hljs-comment"># --- ONNX session + auto-detect names ---</span>
ort_session = onnxruntime.InferenceSession(MODEL_PATH, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
<span class="hljs-comment"># detect first input and first output names to avoid mismatches</span>
INPUT_NAME = ort_session.get_inputs()[<span class="hljs-number">0</span>].name   <span class="hljs-comment"># e.g. "input" or "video"</span>
OUTPUT_NAME = ort_session.get_outputs()[<span class="hljs-number">0</span>].name <span class="hljs-comment"># e.g. "logits" or something else</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_clip</span>(<span class="hljs-params">frames_rgb</span>):</span>
    <span class="hljs-keyword">if</span> len(frames_rgb) == <span class="hljs-number">0</span>:
        frames_rgb = [np.zeros((SIZE, SIZE, <span class="hljs-number">3</span>), dtype=np.uint8)]
    <span class="hljs-keyword">if</span> len(frames_rgb) &lt; T:
        frames_rgb = frames_rgb + [frames_rgb[<span class="hljs-number">-1</span>]] * (T - len(frames_rgb))
    frames_rgb = frames_rgb[:T]
    clip = [cv2.resize(f, (SIZE, SIZE), interpolation=cv2.INTER_AREA) <span class="hljs-keyword">for</span> f <span class="hljs-keyword">in</span> frames_rgb]
    clip = np.stack(clip, axis=<span class="hljs-number">0</span>)                                    <span class="hljs-comment"># (T,H,W,3)</span>
    clip = np.transpose(clip, (<span class="hljs-number">0</span>, <span class="hljs-number">3</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>)).astype(np.float32) / <span class="hljs-number">255</span> <span class="hljs-comment"># (T,3,H,W)</span>
    clip = (clip - <span class="hljs-number">0.5</span>) / <span class="hljs-number">0.5</span>
    clip = np.expand_dims(clip, <span class="hljs-number">0</span>)                                   <span class="hljs-comment"># (1,T,3,H,W)</span>
    <span class="hljs-keyword">return</span> clip

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_extract_path_from_gradio_video</span>(<span class="hljs-params">inp</span>):</span>
    <span class="hljs-keyword">if</span> isinstance(inp, str) <span class="hljs-keyword">and</span> os.path.exists(inp):
        <span class="hljs-keyword">return</span> inp
    <span class="hljs-keyword">if</span> isinstance(inp, dict):
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"video"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"path"</span>, <span class="hljs-string">"filepath"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, str) <span class="hljs-keyword">and</span> os.path.exists(v):
                <span class="hljs-keyword">return</span> v
        <span class="hljs-keyword">for</span> key <span class="hljs-keyword">in</span> (<span class="hljs-string">"data"</span>, <span class="hljs-string">"video"</span>):
            v = inp.get(key)
            <span class="hljs-keyword">if</span> isinstance(v, (bytes, bytearray)):
                tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>)
                tmp.write(v); tmp.flush(); tmp.close()
                <span class="hljs-keyword">return</span> tmp.name
    <span class="hljs-keyword">if</span> isinstance(inp, (list, tuple)) <span class="hljs-keyword">and</span> inp <span class="hljs-keyword">and</span> isinstance(inp[<span class="hljs-number">0</span>], str) <span class="hljs-keyword">and</span> os.path.exists(inp[<span class="hljs-number">0</span>]):
        <span class="hljs-keyword">return</span> inp[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_read_uniform_frames</span>(<span class="hljs-params">video_path</span>):</span>
    cap = cv2.VideoCapture(video_path)
    frames = []
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) <span class="hljs-keyword">or</span> <span class="hljs-number">1</span>
    idxs = np.linspace(<span class="hljs-number">0</span>, total - <span class="hljs-number">1</span>, max(T, <span class="hljs-number">1</span>)).astype(int)
    want = set(int(i) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> idxs.tolist())
    j = <span class="hljs-number">0</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        ok, bgr = cap.read()
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
        <span class="hljs-keyword">if</span> j <span class="hljs-keyword">in</span> want:
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            frames.append(rgb)
        j += <span class="hljs-number">1</span>
    cap.release()
    <span class="hljs-keyword">return</span> frames

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_video</span>(<span class="hljs-params">gradio_video</span>):</span>
    video_path = _extract_path_from_gradio_video(gradio_video)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> video_path <span class="hljs-keyword">or</span> <span class="hljs-keyword">not</span> os.path.exists(video_path):
        <span class="hljs-keyword">return</span> {}
    frames = _read_uniform_frames(video_path)

    <span class="hljs-comment"># If OpenCV choked on the codec (common with recorded webm), re-encode once:</span>
    <span class="hljs-keyword">if</span> len(frames) == <span class="hljs-number">0</span>:
        tmp = tempfile.NamedTemporaryFile(delete=<span class="hljs-literal">False</span>, suffix=<span class="hljs-string">".mp4"</span>); tmp_name = tmp.name; tmp.close()
        cap = cv2.VideoCapture(video_path)
        fourcc = cv2.VideoWriter_fourcc(*<span class="hljs-string">"mp4v"</span>)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) <span class="hljs-keyword">or</span> <span class="hljs-number">640</span>
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) <span class="hljs-keyword">or</span> <span class="hljs-number">480</span>
        out = cv2.VideoWriter(tmp_name, fourcc, <span class="hljs-number">20.0</span>, (w, h))
        <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
            ok, frame = cap.read()
            <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> ok: <span class="hljs-keyword">break</span>
            out.write(frame)
        cap.release(); out.release()
        frames = _read_uniform_frames(tmp_name)

    clip = preprocess_clip(frames)
    <span class="hljs-comment"># &gt;&gt;&gt; use the detected ONNX input/output names &lt;&lt;&lt;</span>
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]  <span class="hljs-comment"># (1, C)</span>
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_from_image</span>(<span class="hljs-params">image</span>):</span>
    <span class="hljs-keyword">if</span> image <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">return</span> {}
    clip = preprocess_clip([image] * T)
    logits = ort_session.run([OUTPUT_NAME], {INPUT_NAME: clip})[<span class="hljs-number">0</span>]
    probs = torch.softmax(torch.from_numpy(logits), dim=<span class="hljs-number">1</span>)[<span class="hljs-number">0</span>].numpy().tolist()
    <span class="hljs-keyword">return</span> {labels[i]: float(probs[i]) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(labels))}

<span class="hljs-keyword">with</span> gr.Blocks() <span class="hljs-keyword">as</span> demo:
    gr.Markdown(<span class="hljs-string">"# Gesture Classifier (ONNX)\nRecord or upload a short video, then click **Classify Video**."</span>)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Video (record or upload)"</span>):
        vid_in = gr.Video(label=<span class="hljs-string">"Record from webcam or upload a short clip"</span>)
        vid_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Video"</span>).click(fn=predict_from_video, inputs=vid_in, outputs=vid_out)
    <span class="hljs-keyword">with</span> gr.Tab(<span class="hljs-string">"Single Image (fallback)"</span>):
        img_in = gr.Image(label=<span class="hljs-string">"Upload an image frame"</span>, type=<span class="hljs-string">"numpy"</span>)
        img_out = gr.Label(num_top_classes=<span class="hljs-number">3</span>, label=<span class="hljs-string">"Prediction"</span>)
        gr.Button(<span class="hljs-string">"Classify Image"</span>).click(fn=predict_from_image, inputs=img_in, outputs=img_out)

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    demo.launch()
</code></pre>
<p>Running <code>python app.py</code> starts a local Gradio server and prints a URL that you can open in your web browser. The app does not stream predictions live. Instead, it classifies one clip at a time:</p>
<ol>
<li><p><strong>Record or upload a clip</strong>: The Video tab accepts a short webcam recording or an uploaded video file.</p>
</li>
<li><p><strong>Classify on demand</strong>: When you click <strong>Classify Video</strong>, the app samples frames uniformly from the clip, preprocesses them, and runs a single ONNX inference.</p>
</li>
<li><p><strong>Top 3 gesture classes displayed with probabilities</strong>: The application displays the top three predicted gesture classes along with their probabilities, giving you an idea of the model's confidence in its predictions.</p>
</li>
</ol>
<p>When you open the app in your browser, you'll find two tabs. In the <strong>Video tab</strong>, you can click <em>Record from webcam</em> to capture a short clip of your gesture, typically lasting 2–4 seconds. After recording, click <strong>Classify Video</strong>. The app then runs the sampled frames through the exported Transformer model and displays the predicted gesture probabilities. The <strong>Single Image tab</strong> is a fallback: it repeats one uploaded frame to fill the clip length and classifies that. This setup allows for interactive testing and demonstration of the gesture recognition system.</p>
<p>Here’s an example where I raised my hand for a <strong>stop</strong> gesture, and the model predicts “stop” as the top class:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/realtime-demo.png?raw=true" alt="Gradio demo output" width="600" height="400" loading="lazy"></p>
<p>Figure 2. The Gradio app running locally. After recording a short clip, the Transformer model predicts the gesture with class probabilities.</p>
<h3 id="heading-evaluate-accuracy-latency">Evaluate Accuracy + Latency</h3>
<p>Now that the model runs in a demo app, let’s check how well it performs. There are two sides to this:</p>
<ul>
<li><p><strong>Accuracy</strong>: does the model predict the right gesture class?</p>
</li>
<li><p><strong>Latency</strong>: how fast does it respond, especially on CPU vs GPU?</p>
</li>
</ul>
<h4 id="heading-1-quick-accuracy-check">1. Quick Accuracy Check</h4>
<p>Save this as <code>eval.py</code> in the same folder as your other scripts:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> GestureClips, read_labels
<span class="hljs-keyword">from</span> train <span class="hljs-keyword">import</span> ViTTemporal

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)
n_classes = len(labels)

<span class="hljs-comment"># Load validation data</span>
val_ds = GestureClips(train=<span class="hljs-literal">False</span>)
val_dl = torch.utils.data.DataLoader(val_ds, batch_size=<span class="hljs-number">2</span>, shuffle=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># Load trained model</span>
model = ViTTemporal(num_classes=n_classes)
model.load_state_dict(torch.load(<span class="hljs-string">"vit_temporal_best.pt"</span>, map_location=<span class="hljs-string">"cpu"</span>))
model.eval()

correct, total = <span class="hljs-number">0</span>, <span class="hljs-number">0</span>
all_preds, all_labels = [], []

<span class="hljs-keyword">with</span> torch.no_grad():
    <span class="hljs-keyword">for</span> x, y <span class="hljs-keyword">in</span> val_dl:
        logits = model(x)
        preds = logits.argmax(dim=<span class="hljs-number">1</span>)
        correct += (preds == y).sum().item()
        total += y.size(<span class="hljs-number">0</span>)
        all_preds.extend(preds.tolist())
        all_labels.extend(y.tolist())

print(<span class="hljs-string">f"Validation accuracy: <span class="hljs-subst">{correct/total:<span class="hljs-number">.2</span>%}</span>"</span>)
</code></pre>
<h4 id="heading-2-confusion-matrix">2. Confusion Matrix</h4>
<p>Let’s also visualize which gestures are confused. Add this snippet at the bottom of <code>eval.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix

cm = confusion_matrix(all_labels, all_preds)

plt.figure(figsize=(<span class="hljs-number">6</span>,<span class="hljs-number">6</span>))
sns.heatmap(cm, annot=<span class="hljs-literal">True</span>, fmt=<span class="hljs-string">"d"</span>, xticklabels=labels, yticklabels=labels, cmap=<span class="hljs-string">"Blues"</span>)
plt.xlabel(<span class="hljs-string">"Predicted"</span>)
plt.ylabel(<span class="hljs-string">"True"</span>)
plt.title(<span class="hljs-string">"Confusion Matrix"</span>)
plt.tight_layout()
plt.show()
</code></pre>
<p>When you run <code>python eval.py</code>, a heatmap like this will pop up:</p>
<p><img src="https://github.com/tayo4christ/transformer-gesture/blob/07c7071bdb17bc08585baeb60d787eadc3936ef5/images/confusion-matrix.png?raw=true" alt="Confusion matrix" width="600" height="400" loading="lazy"></p>
<p>Figure 3. Confusion matrix on the validation set. Correct predictions appear along the diagonal. Off-diagonal counts show gesture confusions.</p>
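<p>The heatmap is easiest to read alongside a per-class number. The sketch below computes per-class recall (the share of clips of each true class that were predicted correctly) using only the standard library. The function name <code>per_class_recall</code> and the toy lists are illustrative assumptions; in <code>eval.py</code> you would pass <code>all_labels</code>, <code>all_preds</code>, and <code>labels</code> instead.</p>

```python
from collections import Counter


def per_class_recall(true_ids, pred_ids, class_names):
    """Return {class_name: recall} for every class that appears in true_ids."""
    support = Counter(true_ids)  # number of clips per true class
    hits = Counter(t for t, p in zip(true_ids, pred_ids) if t == p)
    return {class_names[c]: hits[c] / support[c] for c in sorted(support)}


# Toy values standing in for all_labels / all_preds from eval.py (assumption).
demo_names = ["swipe_left", "swipe_right", "stop"]
demo_true = [0, 0, 1, 1, 2, 2]
demo_pred = [0, 1, 1, 1, 2, 0]

recall = per_class_recall(demo_true, demo_pred, demo_names)
for name, value in recall.items():
    print(f"{name}: {value:.0%}")
```

<p>A class with low recall corresponds to a row in the confusion matrix whose counts sit mostly off the diagonal.</p>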
<h4 id="heading-3-latency-benchmark">3. Latency Benchmark</h4>
<p>Finally, let’s see how fast inference runs. Save the following as <code>benchmark.py</code>:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time, numpy <span class="hljs-keyword">as</span> np, onnxruntime
<span class="hljs-keyword">from</span> dataset <span class="hljs-keyword">import</span> read_labels

labels, _ = read_labels(<span class="hljs-string">"labels.txt"</span>)

ort = onnxruntime.InferenceSession(<span class="hljs-string">"vit_temporal.onnx"</span>, providers=[<span class="hljs-string">"CPUExecutionProvider"</span>])
INPUT_NAME = ort.get_inputs()[<span class="hljs-number">0</span>].name
OUTPUT_NAME = ort.get_outputs()[<span class="hljs-number">0</span>].name

dummy = np.random.randn(<span class="hljs-number">1</span>, <span class="hljs-number">16</span>, <span class="hljs-number">3</span>, <span class="hljs-number">224</span>, <span class="hljs-number">224</span>).astype(np.float32)

<span class="hljs-comment"># Warmup</span>
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">3</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})

<span class="hljs-comment"># Benchmark</span>
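<span class="hljs-comment"># perf_counter() is a monotonic, high-resolution clock intended for timing</span>
t0 = time.perf_counter()
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(<span class="hljs-number">50</span>):
    ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})
t1 = time.perf_counter()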

print(<span class="hljs-string">f"Average latency: <span class="hljs-subst">{(t1 - t0)/<span class="hljs-number">50</span>:<span class="hljs-number">.3</span>f}</span> seconds per clip"</span>)
</code></pre>
<p>Run: <code>python benchmark.py</code></p>
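<p>On CPU, you might see roughly 0.05–0.15 s per clip. The script above only measures the CPU provider. To measure GPU latency, install <code>onnxruntime-gpu</code> and pass <code>"CUDAExecutionProvider"</code> in <code>providers</code>; it is typically much faster.</p>
<p><strong>Note</strong>: If latency is high, you can apply post-training <strong>quantization</strong> with ONNX Runtime (for example, dynamic INT8 quantization via <code>onnxruntime.quantization.quantize_dynamic</code>) to shrink the model and often speed up CPU inference. Re-run the accuracy check afterwards, since quantization can cost some accuracy.</p>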
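<p>An average can hide occasional slow runs, so it can help to look at the median and a high percentile as well. The following is a minimal sketch using only the standard library. The helper names <code>time_calls</code> and <code>summarize</code> are hypothetical, and the workload is a stand-in; in <code>benchmark.py</code> you would pass <code>lambda: ort.run([OUTPUT_NAME], {INPUT_NAME: dummy})</code> instead.</p>

```python
import math
import statistics
import time


def time_calls(fn, warmup=3, runs=50):
    """Call fn() repeatedly and return the per-call latencies in seconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return the mean, median, and nearest-rank 95th percentile."""
    ordered = sorted(latencies)
    p95_index = min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
    }


# Stand-in workload (assumption); replace with the ONNX Runtime call.
stats = summarize(time_calls(lambda: sum(range(10_000)), runs=20))
print({name: f"{value * 1000:.3f} ms" for name, value in stats.items()})
```

<p>If the 95th percentile is far above the median, the app will feel slower than the average suggests.</p>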
<h2 id="heading-option-2-use-small-samples-from-public-gesture-datasets">Option 2: Use Small Samples from Public Gesture Datasets</h2>
<p>If you’d prefer to see your model trained on <em>real</em> gesture clips instead of synthetic moving boxes, you can grab a handful of videos from open datasets. You don’t need to download the entire dataset, which can be several GB. A few <code>.mp4</code> samples are enough to follow along.</p>
<h3 id="heading-recommended-sources">Recommended sources</h3>
<ul>
<li><p><strong>20BN Jester Dataset</strong>: Contains short clips of hand gestures like swiping, clapping, and pointing.</p>
</li>
<li><p><strong>WLASL</strong>: A large-scale dataset of isolated sign language words.</p>
</li>
</ul>
<p>Both projects provide small <code>.mp4</code> videos you can use as realistic training examples. I’ve linked them below.</p>
<h3 id="heading-setting-up-your-dataset-folder">Setting up your dataset folder</h3>
<p>Once you download a few clips, place them in the <code>data/</code> folder under subfolders named after each gesture class. For example:</p>
<pre><code class="lang-plaintext">data/
├── swipe_left/
│   ├── clip1.mp4
│   └── clip2.mp4
├── swipe_right/
│   ├── clip1.mp4
│   └── clip2.mp4
└── stop/
    ├── clip1.mp4
    └── clip2.mp4
</code></pre>
<p>And update <code>labels.txt</code> to match the folder names:</p>
<pre><code class="lang-plaintext">swipe_left
swipe_right
stop
</code></pre>
<p>Now your dataset is ready, and the same training scripts from earlier (<code>train.py</code>, <code>eval.py</code>) will work without modification.</p>
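<p>Before training, it can be worth checking that the folder names and <code>labels.txt</code> really agree. Here is a small sketch that does so with the standard library only. It assumes one label per line in <code>labels.txt</code> and <code>.mp4</code> clips, as shown above; the function name <code>check_dataset</code> is hypothetical and not part of the tutorial's scripts.</p>

```python
from pathlib import Path

VIDEO_SUFFIXES = {".mp4"}  # assumption: clips are stored as .mp4 files


def check_dataset(data_dir, labels_path):
    """Compare the class folders under data_dir with the names in labels_path."""
    data_dir = Path(data_dir)
    lines = Path(labels_path).read_text().splitlines()
    labels = [line.strip() for line in lines if line.strip()]
    folders = sorted(p.name for p in data_dir.iterdir() if p.is_dir())
    clip_counts = {
        name: sum(
            1
            for f in (data_dir / name).iterdir()
            if f.suffix.lower() in VIDEO_SUFFIXES
        )
        for name in folders
    }
    return {
        "missing_folders": [n for n in labels if n not in folders],
        "unlisted_folders": [n for n in folders if n not in labels],
        "clip_counts": clip_counts,
    }


if __name__ == "__main__":
    if Path("data").is_dir() and Path("labels.txt").is_file():
        print(check_dataset("data", "labels.txt"))
    else:
        print("Run this from the folder that contains data/ and labels.txt")
```

<p>Empty lists for <code>missing_folders</code> and <code>unlisted_folders</code>, plus a non-zero clip count for every class, suggest the layout matches what the training script expects.</p>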
<h3 id="heading-why-choose-this-option">Why choose this option?</h3>
<ul>
<li><p>Gives more realistic results than synthetic coloured boxes</p>
</li>
<li><p>Lets you see how the model handles <em>actual human hand movements</em></p>
</li>
<li><p>The trade-off is a bit more effort (downloading clips, trimming them if needed)</p>
</li>
</ul>
<p><strong>Tip:</strong> If downloading from these datasets feels too heavy, you can also record your own short gestures using your laptop webcam. Just save them as <code>.mp4</code> files and organize them in the same folder structure.</p>
<h2 id="heading-accessibility-notes-amp-ethical-limits">Accessibility Notes &amp; Ethical Limits</h2>
<p>While this project shows the technical workflow for gesture recognition with Transformers, it’s important to step back and consider the <strong>human context</strong>:</p>
<ul>
<li><p><strong>Accessibility first</strong>: Tools like this can help students with speech or motor difficulties, but they should always be co-designed with the people who will use them. Don’t assume one-size-fits-all.</p>
</li>
<li><p><strong>Dataset sensitivity</strong>: Using publicly available sign or gesture datasets is fine for prototyping, but deploying such a system requires careful consideration of consent and representation.</p>
</li>
<li><p><strong>Error tolerance</strong>: Even small misclassifications can have big consequences in accessibility contexts (for example, confusing <em>stop</em> with <em>go</em>). Always plan for fallback options (like manual input or confirmation).</p>
</li>
<li><p><strong>Bias and inclusivity</strong>: Models trained on narrow datasets may fail for different skin tones, lighting conditions, or cultural gesture variations. Broad and diverse training data is essential for fairness.</p>
</li>
</ul>
<p>In other words: this demo is a <strong>teaching scaffold</strong>, not a production-ready accessibility tool. Responsible deployment requires collaboration with educators, therapists, and end users.</p>
<h2 id="heading-next-steps">Next Steps</h2>
<p>If you’d like to push this project further, here are some directions to explore:</p>
<ul>
<li><p><strong>Better models</strong>: Try video-focused Transformers like <a target="_blank" href="https://arxiv.org/abs/2102.05095">TimeSformer</a> or <a target="_blank" href="https://arxiv.org/abs/2203.12602">VideoMAE</a> for stronger temporal reasoning.</p>
</li>
<li><p><strong>Larger vocabularies</strong>: Add more gesture classes, build your own dataset, or use portions of public datasets like <a target="_blank" href="https://www.kaggle.com/datasets/toxicmender/20bn-jester">20BN Jester</a> or <a target="_blank" href="https://www.kaggle.com/datasets/risangbaskoro/wlasl-processed">WLASL</a>.</p>
</li>
<li><p><strong>Pose fusion</strong>: Combine gesture video with human pose keypoints from <a target="_blank" href="https://mediapipe.readthedocs.io/en/latest/solutions/hands.html">MediaPipe</a> or <a target="_blank" href="https://github.com/CMU-Perceptual-Computing-Lab/openpose">OpenPose</a> for more robust predictions.</p>
</li>
<li><p><strong>Real-time smoothing</strong>: Implement temporal smoothing or debounce logic in the app so predictions are more stable during live use.</p>
</li>
<li><p><strong>Quantization + edge devices</strong>: Convert your ONNX model to an INT8 quantized version and deploy it on a Raspberry Pi or Jetson Nano for classroom-ready prototypes.</p>
</li>
</ul>
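<p>As a starting point for the smoothing idea above, here is a minimal sketch that averages the last few probability dictionaries and only switches the reported label after the same class has led for several consecutive updates. The class name <code>PredictionSmoother</code> and its default parameters are illustrative assumptions, not part of the tutorial's code. It expects dictionaries shaped like the output of <code>predict_from_video</code>.</p>

```python
from collections import deque


class PredictionSmoother:
    """Average recent predictions and switch labels only after a stable streak."""

    def __init__(self, window=5, min_streak=3, threshold=0.6):
        self.history = deque(maxlen=window)  # most recent probability dicts
        self.min_streak = min_streak
        self.threshold = threshold
        self.candidate = None  # label currently building a streak
        self.streak = 0
        self.current = None  # label reported to the user

    def update(self, probs):
        """Add one {label: probability} dict and return (stable_label, averages)."""
        self.history.append(probs)
        averages = {
            label: sum(p.get(label, 0.0) for p in self.history) / len(self.history)
            for label in probs
        }
        top = max(averages, key=averages.get)
        if averages[top] >= self.threshold:
            if top == self.candidate:
                self.streak += 1
            else:
                self.candidate, self.streak = top, 1
        else:
            # Not confident enough: drop the candidate and start over.
            self.candidate, self.streak = None, 0
        if self.streak >= self.min_streak:
            self.current = self.candidate
        return self.current, averages


smoother = PredictionSmoother(window=3, min_streak=2)
stream = [
    {"stop": 0.9, "go": 0.1},
    {"stop": 0.8, "go": 0.2},
    {"stop": 0.2, "go": 0.8},  # a single noisy prediction
    {"stop": 0.1, "go": 0.9},
    {"stop": 0.1, "go": 0.9},
]
stable_labels = [smoother.update(p)[0] for p in stream]
print(stable_labels)
```

<p>A larger <code>window</code> or <code>min_streak</code> makes the output steadier but slower to react, so both values are worth tuning against real clips.</p>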
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you built a gesture recognition system around a Transformer model. You prepared a small dataset, trained a Vision Transformer with temporal pooling, exported the model to ONNX for efficient inference, and deployed it in a Gradio app that classifies recorded clips. You then evaluated its accuracy and latency to see how well and how quickly it responds.</p>
<p>This project illustrates how you can leverage advanced ML methods to enhance accessibility and communication, paving the way for more inclusive learning environments.</p>
<p>Remember: while this demo works with small datasets, real-world applications need larger, more diverse data and careful consideration of accessibility, inclusivity, and ethics.</p>
<p>Here’s the GitHub repo for full source code: <a target="_blank" href="https://github.com/tayo4christ/transformer-gesture">transformer-gesture</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Machine Learning vs Deep Learning vs Generative AI - What are the Differences? ]]>
                </title>
                <description>
                    <![CDATA[ When I started using LLMs for work and personal use, I picked up on some technical terms, such as "machine learning" and "deep learning," which are the main technologies behind these LLMs. I've always been interested in learning about the differences... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/machine-learning-vs-deep-learning-vs-generative-ai/</link>
                <guid isPermaLink="false">68de98a534a379d15102109e</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ generative ai ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nitheesh Poojary ]]>
                </dc:creator>
                <pubDate>Thu, 02 Oct 2025 15:22:13 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759006391065/3cd87534-e2e9-49df-a9c7-1b636e491032.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When I started using LLMs for work and personal use, I picked up on some technical terms, such as "machine learning" and "deep learning," which are the main technologies behind these LLMs. I've always been interested in learning about the differences between these technologies. Most companies in the industry are now developing their own AI tools, which makes MLOps necessary for managing and utilizing them.</p>
<p>Before I began learning about MLOps, I tried to understand the technologies behind LLMs and how they work. In this article, I’ll share my understanding of machine learning, deep learning, and generative AI, along with their potential applications.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-artificial-intelligence-ai">Artificial Intelligence (AI)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-machine-learning-ml-the-foundation">Machine Learning (ML): The Foundation</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-deep-learning-adding-complexity">Deep Learning: Adding Complexity</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-generative-ai-write-new">Generative AI: Write New</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-summary-of-differences-between-machine-learning-vs-deep-learning-vs-generative-ai">Summary of Differences Between Machine Learning vs Deep Learning vs Generative AI</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759006565108/9698f88c-7d81-40b6-b902-c3d75b054728.jpeg" alt="how AI works" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-artificial-intelligence-ai">Artificial Intelligence (AI)</h2>
<p>Artificial Intelligence (AI) is technology that lets machines solve problems in ways that resemble how people do it. It helps businesses make better decisions at scale by recognizing images, creating content, and making predictions based on data. AI is the umbrella term that includes machine learning, deep learning, and generative AI.</p>
<h2 id="heading-machine-learning-ml-the-foundation">Machine Learning (ML): The Foundation</h2>
<p>When we give computers many examples, they learn how to make their own decisions or guesses. It's like teaching a kid to tell the difference between animals. You show them a lot of pictures of cats and dogs and say things like "This is a cat" and "This is a dog." In the end, they learn to tell the difference between cats and dogs on their own. Machine learning is similar in that you give a computer a lot of data with examples, and it learns how to make predictions about new data.</p>
<h3 id="heading-how-does-machine-learning-work">How Does Machine Learning Work?</h3>
<p>Machine Learning (ML) is the process of teaching computers to find patterns in data and make decisions or predictions without being explicitly programmed with rules. There are usually six main steps in this process:</p>
<p><strong>Data Collection:</strong> Gather many examples, like thousands of emails, photos, or sales records. More representative training data generally leads to more accurate predictions.</p>
<p><strong>Data Preparation</strong>: At this stage, you clean the data by getting rid of mistakes and adding missing labels.</p>
<p><strong>Selecting an Algorithm (Model):</strong> It's like choosing the right tool for the job. Different models suit different tasks, such as finding patterns in data or making predictions. You can browse common machine learning algorithms <a target="_blank" href="https://www.ibm.com/think/topics/machine-learning-algorithms">here</a>.</p>
<p><strong>Training Phase:</strong> After you pick the right model for your cleaned-up data, you teach it. This is like getting ready for a test.</p>
<p><strong>Evaluation</strong>: Use the test data to assess the model's performance and see if it can make accurate predictions on unseen data.</p>
<p><strong>Deployment</strong>: Put the trained model to work in the real world.</p>
<p>Here is how training and prediction look for a house-price model:</p>
<p><strong>Training Phase</strong>: Show the computer 10,000 past house sales, each with details like size (2,000 sq ft), number of bedrooms (3), and location (downtown), along with the sale price ($300,000).</p>
<p><strong>Learning</strong>: The algorithm finds patterns, such as bigger houses, houses with more bedrooms, and houses near the city center selling for more.</p>
<p><strong>Prediction</strong>: Given a new house with 1,800 square feet, two bedrooms, and a location in the suburbs, the model estimates a price based on what it has learned.</p>
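<p>To make the house-price example concrete, here is a minimal sketch of the train-then-predict loop using simple linear regression in plain Python. The four sales are made-up numbers, and the model uses only size (it ignores bedrooms and location), so treat it as an illustration of the idea and not as a realistic pricing model.</p>

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: house size in sq ft and sale price in dollars.
sizes = [1000, 1500, 2000, 2500]
prices = [150_000, 225_000, 300_000, 375_000]

# "Training": learn the pattern that links size to price.
slope, intercept = fit_line(sizes, prices)

# "Prediction": estimate the price of a house the model has never seen.
new_size = 1800
estimate = slope * new_size + intercept
print(f"Estimated price for {new_size} sq ft: ${estimate:,.0f}")
```

<p>Real models follow the same pattern with more features and far more examples.</p>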
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759006771594/12afae06-9d72-4d65-af81-c10fda1e2099.png" alt="how machine learning works" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h3 id="heading-types-of-machine-learning">Types of Machine Learning</h3>
<ol>
<li><p><strong>Supervised Learning</strong>: Algorithms learn from labeled training data, where each example comes with the expected output. For instance, millions of X-rays would first need to be labeled as healthy or sick. A machine learning program could then use this training data to estimate whether a new X-ray shows signs of illness.</p>
</li>
<li><p><strong>Unsupervised Learning</strong>: Algorithms that use unsupervised learning learn from data that doesn't have labels. The algorithm must find patterns in untagged data without outside help. For instance, finding groups of people on Facebook or Twitter who have similar interests.</p>
</li>
<li><p><strong>Reinforcement Learning</strong>: An agent learns how to make choices by interacting with its environment. The agent earns rewards for doing things right and penalties for doing things wrong, and its goal is to maximize its total reward. For instance, self-driving systems can learn to drive safely through trial and error in simulations. They get rewards for staying in their lane, following traffic rules, and not hitting other cars.</p>
</li>
</ol>
<h3 id="heading-machine-learningreal-world-examples">Machine Learning—Real-World Examples</h3>
<p><strong>Email Spam Detection</strong></p>
<p>You show the computer thousands of emails labeled "spam" or "not spam." It learns patterns, like how emails containing "FREE MONEY" are usually spam. It can then sort your inbox automatically.</p>
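<p>As a rough illustration of how a spam filter can learn from labeled emails, here is a tiny naive Bayes classifier in plain Python. The four training emails are invented, and real filters use far more data and features, so this is only a sketch of the idea.</p>

```python
import math
from collections import Counter


def train(examples):
    """Count words per label from (text, label) pairs; label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    email_counts = Counter()
    for text, label in examples:
        email_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, email_counts


def classify(text, word_counts, email_counts):
    """Return the label with the higher naive Bayes log-score."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Start from how common the label is among the training emails.
        score = math.log(email_counts[label] / sum(email_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training emails (assumption).
training_emails = [
    ("free money now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting at noon", "ham"),
    ("lunch tomorrow with the team", "ham"),
]
word_counts, email_counts = train(training_emails)

print(classify("free money for you", word_counts, email_counts))
print(classify("team meeting tomorrow", word_counts, email_counts))
```

<p>The classifier never sees a rule like "FREE MONEY means spam"; it picks that up from the word counts in the labeled examples.</p>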
<p><strong>Photo Recognition</strong></p>
<p>Give the computer millions of pictures with labels that say what's in them. It learns that apples are likely to be round and have stems. Your phone can now tell what things are in your pictures.</p>
<p><strong>Movie Recommendations</strong></p>
<p>Netflix keeps track of the movies you've seen and rated. It finds people who like the same things you do, and then suggests movies that those people enjoyed.</p>
<h2 id="heading-deep-learning-adding-complexity">Deep Learning: Adding Complexity</h2>
<p>Deep learning is a subset of machine learning. It helps computers process data in ways loosely inspired by how humans do. Deep learning can identify complex patterns in images, text, sound, and other data to make accurate predictions. It uses artificial neural networks, which are layers of connected nodes that process information and are loosely modeled on the human brain.</p>
<h3 id="heading-how-does-deep-learning-work">How Does Deep Learning Work?</h3>
<p>Artificial neural networks are used in deep learning to learn from data. These networks consist of interconnected layers of nodes. Each node learns a different thing about the data.</p>
<p>For instance, when you show a computer a picture of a cat, the picture goes through a lot of steps. The first layer looks for shapes and edges. The second layer puts these shapes together to make ears, eyes, and whiskers. The last layers combine these features into a decision, such as "This picture looks like a cat." A deep learning model makes a lot of mistakes early in training, but it adjusts its internal weights after each piece of feedback and gradually gets better.</p>
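<p>The layer-by-layer flow described above can be sketched in a few lines of plain Python. The input values and weights below are hand-picked for illustration and not trained, so the output is meaningless as a real prediction. The point is only to show how each layer transforms the output of the previous one.</p>

```python
import math


def dense(inputs, weights, biases):
    """One layer: each node computes a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + bias
        for node_weights, bias in zip(weights, biases)
    ]


def relu(values):
    """Keep positive values and replace negative ones with zero."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-value))


# Hypothetical input features and hand-picked weights (assumptions, not trained).
pixels = [0.2, 0.8, 0.5]
hidden_weights = [[0.5, -0.4, 0.1], [0.3, 0.8, -0.2]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.2, 0.7]]
output_biases = [-0.3]

# Layer 1 extracts simple features; the output layer turns them into one score.
hidden = relu(dense(pixels, hidden_weights, hidden_biases))
cat_score = sigmoid(dense(hidden, output_weights, output_biases)[0])

print(f"hidden activations: {hidden}")
print(f"cat score: {cat_score:.3f}")
```

<p>Training consists of nudging the weights and biases so that scores like this one move closer to the correct labels.</p>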
<h3 id="heading-deep-learningreal-world-examples">Deep Learning—Real-World Examples</h3>
<ul>
<li><p><strong>Tesla Autopilot</strong>: Processes eight cameras simultaneously to navigate roads, recognize traffic signs, and avoid obstacles.</p>
</li>
<li><p><strong>Google's DeepMind</strong>: Detects over fifty eye diseases from retinal scans with 94% accuracy.</p>
</li>
<li><p><strong>ChatGPT</strong>: Helps with writing, coding, and problem-solving.</p>
</li>
</ul>
<h2 id="heading-generative-ai-write-new">Generative AI: Write New</h2>
<p>Generative AI is a subset of deep learning that makes new things, like stories, pictures, music, or code, instead of just looking at or sorting through things that are already there. Generative AI systems learn patterns from a lot of training data and then use those patterns to make new content.</p>
<h3 id="heading-real-world-examples">Real-World Examples</h3>
<ul>
<li><p>Chatbots that help institutions give better customer service by making product suggestions and answering questions.</p>
</li>
<li><p>Tools that automatically generate technical documentation from source code.</p>
</li>
<li><p>Tools that auto-generate quizzes, practice problems, and explanations.</p>
</li>
</ul>
<h2 id="heading-summary-of-differences-between-machine-learning-vs-deep-learning-vs-generative-ai">Summary of Differences Between Machine Learning vs Deep Learning vs Generative AI</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Feature</strong></td><td><strong>Machine Learning (ML)</strong></td><td><strong>Deep Learning (DL)</strong></td><td><strong>Generative AI (GenAI)</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Definition</strong></td><td>Subset of AI where machines learn from data to make predictions or decisions.</td><td>Subset of ML using artificial neural networks with multiple layers to model complex patterns.</td><td>Subset of deep learning that creates new content (text, images, code, etc.) similar to human-created content.</td></tr>
<tr>
<td><strong>Data Requirements</strong></td><td>Small-to-medium datasets.</td><td>Large amounts of data (structured and unstructured)</td><td>Massive datasets for training, varying amounts for generation</td></tr>
<tr>
<td><strong>Computational Power</strong></td><td>Works on CPUs, moderate hardware.</td><td>Needs GPUs/TPUs for training.</td><td>Requires large-scale GPU/TPU clusters.</td></tr>
<tr>
<td><strong>Use Cases</strong></td><td>Predictions and classification.</td><td>Recognize complex data like speech, images, and language.</td><td>Generate new, original content.</td></tr>
<tr>
<td><strong>When NOT to Use</strong></td><td>Data is very complex or unstructured; accuracy is critical (medical, legal); you need to handle images, audio, or video.</td><td>The dataset is small (&lt;1000 samples), and computational resources are limited.</td><td>Copyright/IP restrictions apply.</td></tr>
<tr>
<td><strong>Cost Comparison</strong></td><td>Low ($1K-$10K) (standard servers)</td><td>Medium ($10K-$100K)</td><td>High ($100K-$1M+)</td></tr>
<tr>
<td><strong>Real-World Examples</strong></td><td>Netflix recommendations, fraud detection, spam filters.</td><td>Face recognition, self-driving cars, Siri/Alexa.</td><td>Original creative outputs (text, images, code, video).</td></tr>
</tbody>
</table>
</div><h2 id="heading-conclusion">Conclusion</h2>
<p>To sum it up, anyone who is keen to learn more about artificial intelligence needs to know the differences between machine learning, deep learning, and generative AI.</p>
<p>Machine learning is the basis for this because it lets computers learn from data and make predictions. Deep learning takes this a step further by using neural networks to process complicated data patterns in a way that is similar to how humans understand things.</p>
<p>Generative AI goes a step further by making new things, which shows how creative AI can be. As these technologies get better, they open up a lot of new opportunities in many fields, such as improving customer service, making medical diagnoses more accurate, and making new content. To maximize AI's benefits in your life, stay current on new developments.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Build a Machine Learning System on Serverless Architecture ]]>
                </title>
                <description>
                    <![CDATA[ Let’s say you’ve built a fantastic machine learning model that performs beautifully in notebooks. But a model isn’t truly valuable until it’s in production, serving real users and solving real problems. In this article, you’ll learn how to ship a pro... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-build-a-machine-learning-system-on-serverless-architecture/</link>
                <guid isPermaLink="false">68addf802314e8b22eae4655</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ coding ]]>
                    </category>
                
                    <category>
                        <![CDATA[ serverless ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Tue, 26 Aug 2025 16:23:28 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756225357023/04572f1b-b9a7-43e0-aabc-2842faa2703f.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Let’s say you’ve built a fantastic machine learning model that performs beautifully in notebooks.</p>
<p>But a model isn’t truly valuable until it’s in production, serving real users and solving real problems.</p>
<p>In this article, you’ll learn how to ship a production-ready ML application built on serverless architecture.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-were-building">What We’re Building</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-ai-pricing-for-retailers">AI Pricing for Retailers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-models">The Models</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tuning-and-training">Tuning and Training</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-prediction">The Prediction</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-performance-validation">Performance Validation</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-system-architecture">The System Architecture</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-core-aws-resources-in-the-architecture">Core AWS Resources in the Architecture</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-the-deployment-workflow-in-action">The Deployment Workflow in Action</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-step-1-draft-python-scripts">Step 1: Draft Python Scripts</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-2-configure-featuremodel-stores-in-s3">Step 2: Configure Feature/Model Stores in S3</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-3-create-a-flask-application-with-api-endpoints">Step 3: Create a Flask Application with API Endpoints</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-4-publish-a-docker-image-to-ecr">Step 4: Publish a Docker Image to ECR</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-5-create-a-lambda-function">Step 5: Create a Lambda Function</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-step-6-configure-aws-resources">Step 6: Configure AWS Resources</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-building-a-client-application-optional">Building a Client Application (Optional)</a></p>
<ul>
<li><a class="post-section-overview" href="#heading-the-react-application">The React Application</a></li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-final-results">Final Results</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<p>This project requires some basic experience with:</p>
<ul>
<li><p><strong>Machine Learning / Deep Learning:</strong> The full lifecycle, including data handling, model training, tuning, and validation.</p>
</li>
<li><p><strong>Coding:</strong> Proficiency in Python, with experience using major ML libraries such as PyTorch and Scikit-Learn.</p>
</li>
<li><p><strong>Full-stack deployment:</strong> Experience deploying applications using RESTful APIs.</p>
</li>
</ul>
<h2 id="heading-what-were-building">What We’re Building</h2>
<h3 id="heading-ai-pricing-for-retailers">AI Pricing for Retailers</h3>
<p>This project aims to help a mid-sized retailer compete with large players like Amazon.</p>
<p>Smaller companies often can’t afford significant price discounts, so they can face challenges finding optimal price points as they expand their product lines.</p>
<p>Our goal is to leverage AI models to recommend the best price for a selected product to maximize sales for the retailer, and display it on a client-side user interface (UI):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755873936847/ecf696ef-e161-4453-a6ad-e97d92ac1677.png" alt="What the UI will look like" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>You can explore the UI from <a target="_blank" href="https://kuriko-iwai.vercel.app/online-commerce-intelligence-hub">here</a>.</p>
<h3 id="heading-the-models">The Models</h3>
<p>I’ll train and tune multiple models so that when the primary model fails, a backup model gets loaded to serve predictions.</p>
<ul>
<li><p><strong>Primary Model</strong>: Multi-layered feedforward network (on the <strong>PyTorch</strong> library)</p>
</li>
<li><p><strong>Backup Models (Backups)</strong>: LightGBM, SVR, and Elastic Net (on the <strong>Scikit-Learn</strong> library)</p>
</li>
</ul>
<p>The backup models are prioritized based on learning capabilities.</p>
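<p>The fallback behavior described above could be sketched roughly as follows. This is a minimal illustration, not the project’s actual loading code: the function <code>load_first_available</code>, the two loader functions, and the model names are hypothetical stand-ins.</p>
<pre><code class="lang-python">def load_first_available(loaders):
    """Try each (name, loader) pair in priority order and return the first model that loads."""
    errors = {}
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # a real system might catch narrower exceptions
            errors[name] = str(exc)
    raise RuntimeError(f'no model could be loaded: {errors}')


def load_primary():
    # hypothetical stand-in for deserializing the primary model; fails on purpose in this demo
    raise FileNotFoundError('primary checkpoint is missing')


def load_backup():
    # hypothetical stand-in for deserializing a backup model
    return 'backup-model-object'


# loaders are listed in priority order: primary first, then backups
name, model = load_first_available([('dfn', load_primary), ('gbm', load_backup)])
print(name, model)
</code></pre>
<p>Keeping the priority order in a plain list makes it easy to add or reorder backups without touching the loading logic.</p>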
<h3 id="heading-tuning-and-training">Tuning and Training</h3>
<p>The primary model was trained on a dataset of around 500,000 samples (<a target="_blank" href="https://archive.ics.uci.edu/dataset/352/online+retail">source</a>) and tuned using <code>Optuna</code>'s Bayesian optimization, with grid search available for further refinement.</p>
<p>The backups are trained on the same samples and tuned using the <code>Scikit-Optimize</code> framework.</p>
<h3 id="heading-the-prediction">The Prediction</h3>
<p>All models serve predictions on <strong>log-transformed quantity values.</strong></p>
<p>Logarithmic transformations of the quantity data make the distribution denser, which helps models learn patterns more effectively. This is because logarithms reduce the impact of extreme values, or outliers, and can help normalize skewed data.</p>
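<p>As a minimal sketch of this idea, the target can be transformed before training and mapped back with the inverse function after prediction. The sample quantities below are made up for illustration, and the use of <code>np.log1p</code> is an assumption rather than the project’s confirmed transform:</p>
<pre><code class="lang-python">import numpy as np

# hypothetical order quantities with one extreme value
quantity = np.array([1.0, 2.0, 5.0, 12.0, 4800.0])

# log1p computes log(1 + x), which stays defined when a quantity is zero
logged_quantity = np.log1p(quantity)

# expm1 is the inverse of log1p, so predictions can be mapped back to real units
restored_quantity = np.expm1(logged_quantity)

print(logged_quantity.round(2))
print(np.allclose(restored_quantity, quantity))
</code></pre>
<p>After the transform, the extreme value sits only a few units away from the rest of the data instead of thousands of units away.</p>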
<h3 id="heading-performance-validation">Performance Validation</h3>
<p>We’ll evaluate model performance using different metrics for the transformed and original data, with a lower value always indicating better performance.</p>
<ul>
<li><p><strong>Logged values</strong>: Mean Squared Error (MSE)</p>
</li>
<li><p><strong>Actual values</strong>: Root Mean Squared Log Error (RMSLE) and Mean Absolute Error (MAE)</p>
</li>
</ul>
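<p>The three metrics can be written out in a few lines of NumPy. This is an illustrative sketch with made-up values, and it assumes the logged values come from a <code>log1p</code> transform, which the article does not confirm:</p>
<pre><code class="lang-python">import numpy as np


def mse(y_true, y_pred):
    """Mean squared error, applied here to the log-transformed values."""
    return float(np.mean((y_true - y_pred) ** 2))


def rmsle(y_true, y_pred):
    """Root mean squared log error, computed from values in the original scale."""
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error in the original units."""
    return float(np.mean(np.abs(y_true - y_pred)))


# hypothetical actual and predicted quantities in the original scale
y_true = np.array([3.0, 10.0, 25.0])
y_pred = np.array([4.0, 8.0, 30.0])

print('MSE on logged values:', mse(np.log1p(y_true), np.log1p(y_pred)))
print('RMSLE on actual values:', rmsle(y_true, y_pred))
print('MAE on actual values:', mae(y_true, y_pred))
</code></pre>
<p>Under the <code>log1p</code> assumption, RMSLE on the actual values is the square root of the MSE on the logged values, so the two metrics rank models the same way. MAE adds a view of the error in real units.</p>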
<h2 id="heading-the-system-architecture">The System Architecture</h2>
<p>We’re going to build a complete ecosystem around an <strong>AWS Lambda function</strong> to create a scalable ML system:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:4680/0*ulcNtwJeU5EOfhTg.png" alt="Fig. The system architecture (Created by Kuriko IWAI)" width="600" height="400" loading="lazy"></p>
<p>Fig. The system architecture (Created by <a target="_blank" href="https://kuriko-iwai.vercel.app/">Kuriko IWAI</a>)</p>
<p><strong>AWS Lambda</strong> is a <strong>serverless compute service</strong> that lets you run an application without managing servers. Once you upload the code, AWS takes on the responsibility of managing the underlying infrastructure.</p>
<p>In a serverless environment, the code is deployed as <strong>a stateless function</strong> that runs only when it’s triggered by an event such as an HTTP request or a scheduled task.</p>
<p>This event-driven nature makes serverless deployment efficient in resource allocation because:</p>
<ul>
<li><p><strong>There’s no server management</strong>: The cloud provider takes care of operational tasks.</p>
</li>
<li><p><strong>You have automatic scaling</strong>: Serverless applications automatically scale up or down based on demand.</p>
</li>
<li><p><strong>You have pay-per-use billing</strong>: You’re charged only for the compute resources the application consumes.</p>
</li>
</ul>
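<p>To make the idea of a stateless, event-triggered function concrete, here is a minimal handler sketch that is independent of the Flask setup used later in this article. It assumes the event shape of an API Gateway proxy integration. The <code>stockcode</code> query parameter and the response body are hypothetical placeholders:</p>
<pre><code class="lang-python">import json


def lambda_handler(event, context):
    """Entry point invoked once per event; no state is kept between invocations."""
    # with an API Gateway proxy integration, query parameters arrive in the event
    params = event.get('queryStringParameters') or {}
    stockcode = params.get('stockcode')  # hypothetical parameter name

    if stockcode is None:
        return {'statusCode': 400, 'body': json.dumps({'error': 'stockcode is required'})}

    # placeholder for the real inference step
    body = {'stockcode': stockcode, 'message': 'prediction would be computed here'}
    return {'statusCode': 200, 'body': json.dumps(body)}


# local simulation of an HTTP-triggered invocation
response = lambda_handler({'queryStringParameters': {'stockcode': '85123A'}}, None)
print(response)
</code></pre>
<p>Because the handler receives everything it needs through the <code>event</code> argument, the platform can run as many copies in parallel as the traffic requires.</p>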
<p>Note that other cloud ecosystems like Google Cloud Platform (GCP) and Microsoft Azure offer comprehensive alternatives to AWS. Which one you choose depends on your budget, project type, and familiarity with each ecosystem.</p>
<h3 id="heading-core-aws-resources-in-the-architecture">Core AWS Resources in the Architecture</h3>
<p>The system architecture focuses on the following points:</p>
<ul>
<li><p>The application is fully containerized with Docker for portability across environments.</p>
</li>
<li><p>The container image is stored in AWS Elastic Container Registry (ECR).</p>
</li>
<li><p>The API Gateway’s REST API endpoints trigger an event to invoke the Lambda function.</p>
</li>
<li><p>The Lambda function loads the container image from ECR and performs inference.</p>
</li>
<li><p>Trained models, processors, and input features are stored in AWS S3 buckets.</p>
</li>
<li><p>A Redis client serves cached analytical data and past predictions stored in the ElastiCache.</p>
</li>
</ul>
<p>And to build the system, we’ll use the following AWS resources:</p>
<ul>
<li><p><strong>Lambda</strong>: Serves a function to perform inference.</p>
</li>
<li><p><strong>API Gateway:</strong> Routes API calls to the Lambda function.</p>
</li>
<li><p><strong>S3 Storage</strong>: Serves feature store and model store.</p>
</li>
<li><p><strong>ElastiCache:</strong> Stores cached predictions and analytical data.</p>
</li>
<li><p><strong>ECR</strong>: Stores Docker container images to allow Lambda to pull the image.</p>
</li>
</ul>
<p>Each resource requires configuration. I’ll explore those details in the next section.</p>
<h2 id="heading-the-deployment-workflow-in-action">The Deployment Workflow in Action</h2>
<p>The deployment workflow involves the following steps:</p>
<ol>
<li><p>Draft data preparation, model training, and serialization scripts</p>
</li>
<li><p>Configure designated feature store and model store in S3</p>
</li>
<li><p>Create a Flask application with API endpoints</p>
</li>
<li><p>Publish a Docker image to ECR</p>
</li>
<li><p>Create a Lambda function</p>
</li>
<li><p>Configure related AWS resources</p>
</li>
</ol>
<p>We’ll now walk through each of these steps to help you fully understand the process.</p>
<p>For your reference, here is the repository structure:</p>
<pre><code class="lang-markdown">.
.venv/                  [.gitignore]    # stores uv venv
│
└── data/               [.gitignore]
│     └──raw/                           # stores raw data
│     └──preprocessed/                  # stores processed data after imputation and engineering
│
└── models/             [.gitignore]    # stores serialized model after training and tuning
│     └──dfn/                           # deep feedforward network
│     └──gbm/                           # light gbm
│     └──en/                            # elastic net
│     └──production/                    # models to be stored in S3 for production use
|
└── notebooks/                          # stores experimentation notebooks
│
└── src/                                # core functions
│     └──_utils/                        # utility functions
│     └──data_handling/                 # functions to engineer features
│     └──model/                         # functions to train, tune, validate models
│     │     └── sklearn_model
│     │     └── torch_model
│     │     └── ...
│     └──main.py                        # main script to run the inference locally
│
└──app.py                               # Flask application (API endpoints)
└──pyproject.toml                       # project configuration
└──.env                [.gitignore]     # environment variables
└──uv.lock                              # dependency locking
└──Dockerfile                           # for Docker container image
└──.dockerignore
└──requirements.txt
└──.python-version                      # python version locking (3.12)
</code></pre>
<h3 id="heading-step-1-draft-python-scripts">Step 1: Draft Python Scripts</h3>
<p>The first step is to draft Python scripts for data preparation, model training and tuning.</p>
<p>We’ll run these scripts in a <strong>batch process</strong> because these are resource-intensive and stateful tasks that aren’t suitable for serverless functions optimized for short-lived, stateless, and event-driven tasks.</p>
<p>Serverless functions can also experience <a target="_blank" href="https://www.freecodecamp.org/news/cold-start-problem-in-recommender-systems/"><strong>cold starts</strong></a>. With heavy tasks in the function, API Gateway could time out before the function serves predictions.</p>
<p><code>src/main.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> warnings
<span class="hljs-keyword">import</span> pickle
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> lightgbm <span class="hljs-keyword">as</span> lgb
<span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> ElasticNet
<span class="hljs-keyword">from</span> sklearn.svm <span class="hljs-keyword">import</span> SVR
<span class="hljs-keyword">from</span> skopt.space <span class="hljs-keyword">import</span> Real, Integer, Categorical
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">import</span> src.data_handling <span class="hljs-keyword">as</span> data_handling
<span class="hljs-keyword">import</span> src.model.torch_model <span class="hljs-keyword">as</span> t
<span class="hljs-keyword">import</span> src.model.sklearn_model <span class="hljs-keyword">as</span> sk


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>: 
    load_dotenv(override=<span class="hljs-literal">True</span>)
    os.makedirs(PRODUCTION_MODEL_FOLDER_PATH, exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-comment"># create train, validation, test datasets</span>
    X_train, X_val, X_test, y_train, y_val, y_test, preprocessor = data_handling.main_script()

    <span class="hljs-comment"># store the trained preprocessor in local storage</span>
    joblib.dump(preprocessor, PREPROCESSOR_PATH)

    <span class="hljs-comment"># model tuning and training</span>
    best_dfn_full_trained, checkpoint = t.main_script(X_train, X_val, y_train, y_val)

    <span class="hljs-comment"># serialize the trained model</span>
    torch.save(checkpoint, DFN_FILE_PATH)

    <span class="hljs-comment"># svr</span>
    best_svr_trained, best_hparams_svr = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[<span class="hljs-number">1</span>]
    )
    <span class="hljs-keyword">if</span> best_svr_trained <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">with</span> open(SVR_FILE_PATH, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
            pickle.dump({ <span class="hljs-string">'best_model'</span>: best_svr_trained, <span class="hljs-string">'best_hparams'</span>: best_hparams_svr }, f)

    <span class="hljs-comment"># elastic net</span>
    best_en_trained, best_hparams_en = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[<span class="hljs-number">0</span>]
    )
    <span class="hljs-keyword">if</span> best_en_trained <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">with</span> open(EN_FILE_PATH, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
            pickle.dump({ <span class="hljs-string">'best_model'</span>: best_en_trained, <span class="hljs-string">'best_hparams'</span>: best_hparams_en }, f)

    <span class="hljs-comment"># light gbm</span>
    best_gbm_trained, best_hparams_gbm = sk.main_script(
        X_train, X_val, y_train, y_val, **sklearn_models[<span class="hljs-number">2</span>]
    )

    <span class="hljs-keyword">if</span> best_gbm_trained <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
        <span class="hljs-keyword">with</span> open(GBM_FILE_PATH, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> f:
            pickle.dump({<span class="hljs-string">'best_model'</span>: best_gbm_trained, <span class="hljs-string">'best_hparams'</span>: best_hparams_gbm }, f)
</code></pre>
<p>Run the script to train and serialize the models using the <code>uv</code> package manager:</p>
<pre><code class="lang-bash">$ uv venv
$ source .venv/bin/activate
$ uv run src/main.py
</code></pre>
<p>The <code>main.py</code> script includes several key components.</p>
<h4 id="heading-scripts-for-data-handling">Scripts for Data Handling</h4>
<p>These scripts load the original data, structure missing values, and engineer the features needed for prediction.</p>
<p><code>src/data_handling/main.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">import</span> src.data_handling.scripts <span class="hljs-keyword">as</span> scripts
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger


<span class="hljs-comment"># load and save the original data frame in parquet</span>
df = scripts.load_original_dataframe()
df.to_parquet(ORIGINAL_DF_PATH, index=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># imputation</span>
df = scripts.structure_missing_values(df=df)

<span class="hljs-comment"># feature engineering</span>
df = scripts.handle_feature_engineering(df=df)

<span class="hljs-comment"># save processed df in csv and parquet</span>
scripts.save_df_to_csv(df=df)
df.to_parquet(PROCESSED_DF_PATH, index=<span class="hljs-literal">False</span>)


<span class="hljs-comment"># for preprocessing, classify numerical and categorical columns</span>
num_cols, cat_cols = scripts.categorize_num_cat_cols(df=df, target_col=target_col)
<span class="hljs-keyword">if</span> cat_cols:
    <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cat_cols: df[col] = df[col].astype(<span class="hljs-string">'string'</span>)

<span class="hljs-comment"># creates training, validation, and test datasets (test dataset is for inference only)</span>
y = df[target_col]
X = df.copy().drop(target_col, axis=<span class="hljs-string">'columns'</span>)
test_size, random_state = <span class="hljs-number">50000</span>, <span class="hljs-number">42</span>
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=test_size, random_state=random_state
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=test_size, random_state=random_state
)

<span class="hljs-comment"># transform the input datasets</span>
X_train, X_val, X_test, preprocessor = scripts.transform_input(
    X_train, X_val, X_test, num_cols=num_cols, cat_cols=cat_cols
)

<span class="hljs-comment"># retrain and serialize the preprocessor</span>
<span class="hljs-keyword">if</span> preprocessor <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>: preprocessor.fit(X)
joblib.dump(preprocessor, PREPROCESSOR_PATH)
</code></pre>
<h4 id="heading-scripts-for-model-training-and-tuning-pytorch-model">Scripts for Model Training and Tuning (PyTorch Model)</h4>
<p>These scripts initialize the model, search for the optimal neural architecture and hyperparameters, and serialize the fully trained model so that the system can load it when performing inference.</p>
<p>Because the primary model is built on PyTorch and the backups use Scikit-Learn, we’re drafting the scripts separately.</p>
<h4 id="heading-1-pytorch-models">1. PyTorch Models</h4>
<p><strong>The training script</strong> trains the model and validates it on a held-out subset of the training data.</p>
<p>It also includes early stopping logic, which halts training when the validation loss hasn’t improved for a given number of consecutive epochs (for example, 10 epochs).</p>
<p><code>src/model/torch_model/scripts/training.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> optuna <span class="hljs-comment"># type: ignore</span>
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># device</span>
device_type = device_type <span class="hljs-keyword">if</span> device_type <span class="hljs-keyword">else</span> <span class="hljs-string">'cuda'</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'mps'</span> <span class="hljs-keyword">if</span> torch.backends.mps.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'cpu'</span>
device = torch.device(device_type)

<span class="hljs-comment"># gradient scaler for stability (only applicable for cuda)</span>
scaler = torch.GradScaler(device=device_type) <span class="hljs-keyword">if</span> device_type == <span class="hljs-string">'cuda'</span> <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>

<span class="hljs-comment"># start training</span>
best_val_loss = float(<span class="hljs-string">'inf'</span>)
epochs_no_improve = <span class="hljs-number">0</span>
<span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(num_epochs):
    model.train()
    <span class="hljs-keyword">for</span> batch_X, batch_y <span class="hljs-keyword">in</span> train_data_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()

        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># pytorch's AMP system automatically handles the casting of tensors to Float16 or Float32</span>
            <span class="hljs-keyword">with</span> torch.autocast(device_type=device_type):
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)

                <span class="hljs-comment"># break the training loop when models return nan or inf</span>
                <span class="hljs-keyword">if</span> torch.any(torch.isnan(outputs)) <span class="hljs-keyword">or</span> torch.any(torch.isinf(outputs)):
                    main_logger.error(
                        <span class="hljs-string">'pytorch model returns nan or inf. break the training loop.'</span>
                    )
                    <span class="hljs-keyword">break</span>

            <span class="hljs-comment"># create scaled gradients of losses</span>
            <span class="hljs-keyword">if</span> scaler <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
                scaler.scale(loss).backward()
                scaler.unscale_(optimizer)  <span class="hljs-comment"># unscale the gradients before clipping</span>
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=<span class="hljs-number">1.0</span>)  <span class="hljs-comment"># clip gradients</span>
                scaler.step(optimizer)  <span class="hljs-comment"># step the optimizer (skipped if gradients contain inf or nan)</span>
                scaler.update()  <span class="hljs-comment"># updates the scale</span>

            <span class="hljs-keyword">else</span>:
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=<span class="hljs-number">1.0</span>)  <span class="hljs-comment"># clip gradients</span>
                optimizer.step()

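        <span class="hljs-keyword">except</span> Exception:
            <span class="hljs-comment"># fall back to full precision; reset any gradients left by the failed attempt</span>
            optimizer.zero_grad()
            outputs = model(batch_X)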
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()


    <span class="hljs-comment"># run validation on a subset of the training dataset</span>
    model.eval()
    val_loss = <span class="hljs-number">0.0</span>

    <span class="hljs-comment"># switch the torch mode</span>
    <span class="hljs-keyword">with</span> torch.inference_mode():
        <span class="hljs-keyword">for</span> batch_X_val, batch_y_val <span class="hljs-keyword">in</span> val_data_loader:
            batch_X_val, batch_y_val = batch_X_val.to(device), batch_y_val.to(device)
            outputs_val = model(batch_X_val)
            val_loss += criterion(outputs_val, batch_y_val).item()

    val_loss /= len(val_data_loader)

    <span class="hljs-comment"># check if early stop</span>
    <span class="hljs-keyword">if</span> val_loss &lt; best_val_loss - min_delta:
        best_val_loss = val_loss
        epochs_no_improve = <span class="hljs-number">0</span>
    <span class="hljs-keyword">else</span>:
        epochs_no_improve += <span class="hljs-number">1</span>
        <span class="hljs-keyword">if</span> epochs_no_improve &gt;= patience:
            main_logger.info(<span class="hljs-string">f'early stopping at epoch <span class="hljs-subst">{epoch + <span class="hljs-number">1</span>}</span>'</span>)
            <span class="hljs-keyword">break</span>
</code></pre>
<p><strong>The tuning script</strong> uses the <code>study</code> component from the <code>Optuna</code> library to run Bayesian optimization.</p>
<p>The <code>study</code> component chooses a neural architecture and hyperparameter set to test from the global search space.</p>
<p>Then, it builds, trains, and validates the model to find the neural architecture that minimizes the loss (MSE, for instance).</p>
<p><code>src/model/torch_model/scripts/tuning.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> itertools
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> optuna
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader, TensorDataset
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split

<span class="hljs-keyword">from</span> src.model.torch_model.scripts.pretrained_base <span class="hljs-keyword">import</span> DFN
<span class="hljs-keyword">from</span> src.model.torch_model.scripts.training <span class="hljs-keyword">import</span> train_model
<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># device</span>
device_type = <span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"mps"</span> <span class="hljs-keyword">if</span> torch.backends.mps.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>
device = torch.device(device_type)

<span class="hljs-comment"># loss function</span>
criterion = nn.MSELoss()

<span class="hljs-comment"># define objective function for optuna</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">objective</span>(<span class="hljs-params">trial</span>):</span>
    <span class="hljs-comment"># model</span>
    num_layers = trial.suggest_int(<span class="hljs-string">'num_layers'</span>, <span class="hljs-number">1</span>, <span class="hljs-number">20</span>)
    batch_norm = trial.suggest_categorical(<span class="hljs-string">'batch_norm'</span>, [<span class="hljs-literal">True</span>, <span class="hljs-literal">False</span>])
    dropout_rates = []
    hidden_units_per_layer = []
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(num_layers):
        dropout_rates.append(trial.suggest_float(<span class="hljs-string">f'dropout_rate_layer_<span class="hljs-subst">{i}</span>'</span>, <span class="hljs-number">0.0</span>, <span class="hljs-number">0.6</span>))
        hidden_units_per_layer.append(trial.suggest_int(<span class="hljs-string">f'n_units_layer_<span class="hljs-subst">{i}</span>'</span>, <span class="hljs-number">8</span>, <span class="hljs-number">256</span>)) <span class="hljs-comment"># hidden units per layer</span>

    model = DFN(
        input_dim=X_train.shape[<span class="hljs-number">1</span>],
        num_layers=num_layers,
        dropout_rates=dropout_rates,
        batch_norm=batch_norm,
        hidden_units_per_layer=hidden_units_per_layer
    ).to(device)

    <span class="hljs-comment"># optimizer</span>
    learning_rate = trial.suggest_float(<span class="hljs-string">'learning_rate'</span>, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1e-1</span>, log=<span class="hljs-literal">True</span>)
    optimizer_name = trial.suggest_categorical(<span class="hljs-string">'optimizer'</span>, [<span class="hljs-string">'adam'</span>, <span class="hljs-string">'rmsprop'</span>, <span class="hljs-string">'sgd'</span>, <span class="hljs-string">'adamw'</span>, <span class="hljs-string">'adamax'</span>, <span class="hljs-string">'adadelta'</span>, <span class="hljs-string">'radam'</span>])
    optimizer = _handle_optimizer(optimizer_name=optimizer_name, model=model, lr=learning_rate)

    <span class="hljs-comment"># data loaders</span>
    batch_size = trial.suggest_categorical(<span class="hljs-string">'batch_size'</span>, [<span class="hljs-number">32</span>, <span class="hljs-number">64</span>, <span class="hljs-number">128</span>, <span class="hljs-number">256</span>])
    test_size = <span class="hljs-number">10000</span> <span class="hljs-keyword">if</span> len(X_train) &gt; <span class="hljs-number">15000</span> <span class="hljs-keyword">else</span> int(len(X_train) * <span class="hljs-number">0.2</span>)
    X_train_search, X_val_search, y_train_search, y_val_search = train_test_split(X_train, y_train, test_size=test_size, random_state=<span class="hljs-number">42</span>)
    train_data_loader = create_torch_data_loader(X=X_train_search, y=y_train_search, batch_size=batch_size)
    val_data_loader = create_torch_data_loader(X=X_val_search, y=y_val_search, batch_size=batch_size)

    <span class="hljs-comment"># training</span>
    num_epochs = <span class="hljs-number">3000</span> <span class="hljs-comment"># ensure enough epochs (early stopping would stop the loop when overfitting)</span>
    _, best_val_loss = train_model(
        train_data_loader=train_data_loader,
        val_data_loader=val_data_loader,
        model=model,
        optimizer=optimizer,
        criterion = criterion,
        num_epochs=num_epochs,
        trial=trial,
    )
    <span class="hljs-keyword">return</span> best_val_loss


<span class="hljs-comment"># start to optimize hyperparameters and architecture</span>
study = optuna.create_study(direction=<span class="hljs-string">'minimize'</span>, sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=<span class="hljs-number">50</span>, timeout=<span class="hljs-number">600</span>)

<span class="hljs-comment"># best trial and its hyperparameters</span>
best_trial = study.best_trial
best_hparams = best_trial.params

<span class="hljs-comment"># construct the model based on the tuning results</span>
best_lr = best_hparams[<span class="hljs-string">'learning_rate'</span>]
best_batch_size = best_hparams[<span class="hljs-string">'batch_size'</span>]
input_dim = X_train.shape[<span class="hljs-number">1</span>]
best_model = DFN(
    input_dim=input_dim,
    num_layers=best_hparams[<span class="hljs-string">'num_layers'</span>],
    hidden_units_per_layer=[v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> best_hparams.items() <span class="hljs-keyword">if</span> <span class="hljs-string">'n_units_layer_'</span> <span class="hljs-keyword">in</span> k],
    batch_norm=best_hparams[<span class="hljs-string">'batch_norm'</span>],
    dropout_rates=[v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> best_hparams.items() <span class="hljs-keyword">if</span> <span class="hljs-string">'dropout_rate_layer_'</span> <span class="hljs-keyword">in</span> k],
).to(device)

<span class="hljs-comment"># construct an optimizer based on the tuning results</span>
best_optimizer_name = best_hparams[<span class="hljs-string">'optimizer'</span>]
best_optimizer = _handle_optimizer(
    optimizer_name=best_optimizer_name, model=best_model, lr=best_lr
)

<span class="hljs-comment"># create torch data loaders</span>
train_data_loader = create_torch_data_loader(
    X=X_train, y=y_train, batch_size=best_batch_size
)
val_data_loader = create_torch_data_loader(
    X=X_val, y=y_val, batch_size=best_batch_size
)

<span class="hljs-comment"># retrain the best model with full training dataset applying the optimal batch size and optimizer</span>
best_model, _ = train_model(
    train_data_loader=train_data_loader,
    val_data_loader=val_data_loader,
    model=best_model,
    optimizer=best_optimizer,
    criterion = criterion,
    num_epochs=<span class="hljs-number">1000</span>
)

<span class="hljs-comment"># create a checkpoint for serialization (reconstruct the model using the checkpoint)</span>
checkpoint = {
    <span class="hljs-string">'state_dict'</span>: best_model.state_dict(),
    <span class="hljs-string">'hparams'</span>: best_hparams,
    <span class="hljs-string">'input_dim'</span>: X_train.shape[<span class="hljs-number">1</span>],
    <span class="hljs-string">'optimizer'</span>: best_optimizer,
    <span class="hljs-string">'batch_size'</span>: best_batch_size
}

<span class="hljs-comment"># serialize the model w/ checkpoint</span>
torch.save(checkpoint, FILE_PATH)
</code></pre>
<h4 id="heading-2-scikit-learn-models-backups">2. Scikit-Learn Models (Backups)</h4>
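<p>For Scikit-Learn models, we’ll run <strong>k-fold cross-validation</strong> during tuning to reduce the risk of overfitting to a single train/validation split.</p>
<p>K-fold cross-validation is a technique for evaluating a machine learning model's performance by splitting the training data into k subsets (folds), then training on k-1 folds and validating on the remaining one, rotating until every fold has served as the validation set once.</p>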
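<p>To make the idea concrete, here is a minimal pure-Python sketch of how 5-fold splitting partitions sample indices. The helper name <code>kfold_indices</code> is hypothetical and exists only for illustration. It follows the fold-size rule documented for Scikit-Learn's unshuffled <code>KFold</code>, but it is not the library's implementation:</p>

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_indices, val_indices) pairs for unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds receive one additional sample
        size = base + (1 if extra > fold else 0)
        stop = start + size
        val_indices = list(range(start, stop))
        # everything outside the validation window is used for training
        train_indices = list(range(0, start)) + list(range(stop, n_samples))
        yield train_indices, val_indices
        start = stop


splits = list(kfold_indices(n_samples=10, n_splits=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: train={train_idx} val={val_idx}")
```

<p>Each sample lands in exactly one validation fold, so averaging the per-fold errors uses every training row for validation once.</p>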
<p>We define the <code>run_kfold_validation</code> function where the model is trained and validated using <strong>5-fold cross-validation</strong>.</p>
<p><code>src/model/sklearn_model/scripts/tuning.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> KFold
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_squared_error

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_kfold_validation</span>(<span class="hljs-params">
        X_train,
        y_train,
        base_model,
        hparams: dict,
        n_splits: int = <span class="hljs-number">5</span>, <span class="hljs-comment"># the number of folds </span>
        early_stopping_rounds: int = <span class="hljs-number">10</span>,
        max_iters: int = <span class="hljs-number">200</span>
    </span>) -&gt; float:</span>

    mses = <span class="hljs-number">0.0</span>

    <span class="hljs-comment"># create k-fold component</span>
    kf = KFold(n_splits=n_splits, shuffle=<span class="hljs-literal">True</span>, random_state=<span class="hljs-number">42</span>)

    <span class="hljs-keyword">for</span> fold, (train_index, val_index) <span class="hljs-keyword">in</span> enumerate(kf.split(X_train)):
        <span class="hljs-comment"># create a subset of training and validation datasets from the entire training data</span>
        X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
        y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

        <span class="hljs-comment"># reconstruct a model</span>
        model = base_model(**hparams)

        <span class="hljs-comment"># start the cross validation</span>
        best_val_mse = float(<span class="hljs-string">'inf'</span>)
        patience_counter = <span class="hljs-number">0</span>
        best_model_state = <span class="hljs-literal">None</span>
        best_iteration = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> iteration <span class="hljs-keyword">in</span> range(max_iters):
            <span class="hljs-comment"># train on a subset of the training data</span>
            <span class="hljs-keyword">try</span>:
                model.train_one_step(X_train_fold, y_train_fold, iteration)
            <span class="hljs-keyword">except</span> AttributeError:  <span class="hljs-comment"># the model has no incremental training method</span>
                model.fit(X_train_fold, y_train_fold)

            <span class="hljs-comment"># make a prediction on validation data </span>
            y_pred_val_kf = model.predict(X_val_fold)

            <span class="hljs-comment"># compute validation loss (MSE)</span>
            current_val_mse = mean_squared_error(y_val_fold, y_pred_val_kf)

            <span class="hljs-comment"># check if epochs should be stopped (early stopping)</span>
            <span class="hljs-keyword">if</span> current_val_mse &lt; best_val_mse:
                best_val_mse = current_val_mse
                patience_counter = <span class="hljs-number">0</span>
                best_model_state = model.get_params()
                best_iteration = iteration
            <span class="hljs-keyword">else</span>:
                patience_counter += <span class="hljs-number">1</span>

            <span class="hljs-comment"># execute early stopping when patience_counter reaches early_stopping_rounds</span>
            <span class="hljs-keyword">if</span> patience_counter &gt;= early_stopping_rounds:
                main_logger.info(<span class="hljs-string">f"Fold <span class="hljs-subst">{fold}</span>: Early stopping triggered at iteration <span class="hljs-subst">{iteration}</span> (best at <span class="hljs-subst">{best_iteration}</span>). Best MSE: <span class="hljs-subst">{best_val_mse:<span class="hljs-number">.4</span>f}</span>"</span>)
                <span class="hljs-keyword">break</span>


        <span class="hljs-comment"># after training epochs, reconstruct the best performing model </span>
        <span class="hljs-keyword">if</span> best_model_state: model.set_params(**best_model_state)

        <span class="hljs-comment"># make prediction</span>
        y_pred_val_kf = model.predict(X_val_fold)

        <span class="hljs-comment"># add MSEs</span>
        mses += mean_squared_error(y_pred_val_kf, y_val_fold)

    <span class="hljs-comment"># compute the final loss (average of MSEs across folds)</span>
    ave_mse = mses / n_splits
    <span class="hljs-keyword">return</span> ave_mse
</code></pre>
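<p>The patience-based early stopping inside the loop can be easier to follow in isolation. The sketch below replays a made-up sequence of validation losses through the same counter logic. The function name <code>find_stopping_point</code> and the loss values are assumptions for illustration, not part of the project code:</p>

```python
def find_stopping_point(val_losses, patience=10):
    """Return (stop_iteration, best_iteration, best_loss) for a loss sequence."""
    best_loss = float("inf")
    best_iteration = 0
    patience_counter = 0
    stop_iteration = len(val_losses) - 1  # default: ran through every iteration
    for iteration, loss in enumerate(val_losses):
        if best_loss > loss:
            # improvement: remember it and reset the counter
            best_loss, best_iteration = loss, iteration
            patience_counter = 0
        else:
            patience_counter += 1
        if patience_counter >= patience:
            # no improvement for `patience` consecutive iterations
            stop_iteration = iteration
            break
    return stop_iteration, best_iteration, best_loss


losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.64, 0.70, 0.71, 0.72]
print(find_stopping_point(losses, patience=3))  # (8, 5, 0.64)
```

<p>With a patience of 3, the loop tolerates the small bump at iterations 3 and 4, picks up the new best at iteration 5, and stops at iteration 8 after three iterations without improvement.</p>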
<p>Then, for the <strong>tuning script</strong>, we use the <code>gp_minimize</code> function from the <code>Scikit-Optimize</code> library.</p>
<p>The <code>gp_minimize</code> function is used to tune hyperparameters with Bayesian optimization.</p>
<p>Rather than trying every combination, it uses the results of earlier trials to choose which hyperparameter set to evaluate next, searching for the set that minimizes the model's error as calculated by the <code>run_kfold_validation</code> function defined earlier.</p>
<p>The best-performing hyperparameters are then used to reconstruct and train the final model.</p>
<p><code>src/model/sklearn_model/scripts/tuning.py</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> functools <span class="hljs-keyword">import</span> partial
<span class="hljs-keyword">from</span> skopt <span class="hljs-keyword">import</span> gp_minimize


<span class="hljs-comment"># define the objective function for Bayesian Optimization using Scikit-Optimize</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">objective</span>(<span class="hljs-params">params, X_train, y_train, base_model, hparam_names</span>):</span>
    hparams = {item: params[i] <span class="hljs-keyword">for</span> i, item <span class="hljs-keyword">in</span> enumerate(hparam_names)}
    ave_mse = run_kfold_validation(X_train=X_train, y_train=y_train, base_model=base_model, hparams=hparams)
    <span class="hljs-keyword">return</span> ave_mse

<span class="hljs-comment"># create the search space</span>
hparam_names = [s.name <span class="hljs-keyword">for</span> s <span class="hljs-keyword">in</span> space]
objective_partial = partial(objective, X_train=X_train, y_train=y_train, base_model=base_model, hparam_names=hparam_names)

<span class="hljs-comment"># search the optimal hyperparameters</span>
results = gp_minimize(
    func=objective_partial,
    dimensions=space,
    n_calls=n_calls,
    random_state=<span class="hljs-number">42</span>,
    verbose=<span class="hljs-literal">False</span>,
    n_initial_points=<span class="hljs-number">10</span>,
)
<span class="hljs-comment"># results</span>
best_hparams = dict(zip(hparam_names, results.x))
best_mse = results.fun

<span class="hljs-comment"># reconstruct the model with the best hyperparameters</span>
best_model = base_model(**best_hparams)

<span class="hljs-comment"># retrain the model with full training dataset</span>
best_model.fit(X_train, y_train)
</code></pre>
<h3 id="heading-step-2-configure-featuremodel-stores-in-s3">Step 2: Configure Feature/Model Stores in S3</h3>
<p>The trained models and processed data are stored in the S3 bucket: the models as serialized <code>.pth</code> files and the processed data as <strong>Parquet files</strong>.</p>
<p>We’ll draft the <code>s3_upload</code> function where the <strong>Boto3 client</strong>, a low-level interface to an AWS service, initiates the connection to S3:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">s3_upload</span>(<span class="hljs-params">file_path: str</span>):</span>
    <span class="hljs-comment"># initiate the boto3 client</span>
    load_dotenv(override=<span class="hljs-literal">True</span>)
    S3_BUCKET_NAME = os.environ.get(<span class="hljs-string">'S3_BUCKET_NAME'</span>) <span class="hljs-comment"># the bucket created in s3</span>
    s3_client = boto3.client(<span class="hljs-string">'s3'</span>, region_name=os.environ.get(<span class="hljs-string">'AWS_REGION_NAME'</span>)) <span class="hljs-comment"># your default region</span>

    <span class="hljs-keyword">if</span> s3_client:
        <span class="hljs-comment"># create s3 key and upload the file to the bucket</span>
        s3_key = file_path <span class="hljs-keyword">if</span> file_path[<span class="hljs-number">0</span>] != <span class="hljs-string">'/'</span> <span class="hljs-keyword">else</span> file_path[<span class="hljs-number">1</span>:]
        s3_client.upload_file(file_path, S3_BUCKET_NAME, s3_key)
        main_logger.info(<span class="hljs-string">f"file uploaded to s3://<span class="hljs-subst">{S3_BUCKET_NAME}</span>/<span class="hljs-subst">{s3_key}</span>"</span>)
    <span class="hljs-keyword">else</span>:
        main_logger.error(<span class="hljs-string">'failed to create an S3 client.'</span>)
</code></pre>
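<p>At inference time the application needs the reverse operation. Here is a minimal sketch of a download counterpart, assuming a hypothetical <code>s3_download</code> helper that is not part of the original project. It accepts any client object exposing Boto3's <code>download_file(Bucket, Key, Filename)</code> method, so the example runs against a stand-in client instead of real AWS credentials (the bucket name and key are made up):</p>

```python
import os
import tempfile


def s3_download(s3_client, bucket_name, s3_key, local_dir):
    """Download s3://bucket_name/s3_key into local_dir and return the local path."""
    key = s3_key.lstrip("/")  # mirror the key normalization used on upload
    local_path = os.path.join(local_dir, key)
    # create intermediate folders so nested keys such as "models/dfn.pth" work
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3_client.download_file(bucket_name, key, local_path)
    return local_path


class FakeS3Client:
    """Stand-in for boto3.client('s3') so the sketch runs without AWS access."""

    def download_file(self, Bucket, Key, Filename):
        with open(Filename, "w") as f:
            f.write(f"contents of s3://{Bucket}/{Key}")


tmp_dir = tempfile.mkdtemp()
path = s3_download(FakeS3Client(), "my-bucket", "models/dfn.pth", tmp_dir)
print(path)
```

<p>In a real deployment you would pass <code>boto3.client('s3')</code> instead of the fake client. On Lambda, <code>local_dir</code> would typically sit under <code>/tmp</code>, which is the writable part of the filesystem.</p>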
<h4 id="heading-model-store">Model Store</h4>
<p>Trained PyTorch models are serialized (converted) into <code>.pth</code> files.</p>
<p>Then, these files are uploaded to the S3 bucket, enabling the system to load the trained model when it performs inference in production.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> s3_upload

<span class="hljs-comment"># model serialization, store in local</span>
torch.save(trained_model.state_dict(), MODEL_FILE_PATH)

<span class="hljs-comment"># upload to s3 model store</span>
s3_upload(file_path=MODEL_FILE_PATH)
</code></pre>
<h4 id="heading-feature-store">Feature Store</h4>
<p>The processed data is saved in both CSV and Parquet file formats.</p>
<p>Then, the Parquet files are uploaded to the S3 bucket, enabling the system to load the lightweight data when it creates prediction data to perform inference in production.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> s3_upload

<span class="hljs-comment"># store csv and parquet files in local</span>
df.to_csv(file_path, index=<span class="hljs-literal">False</span>)
df.to_parquet(DATA_FILE_PATH, index=<span class="hljs-literal">False</span>)

<span class="hljs-comment"># store in s3 feature store</span>
s3_upload(file_path=DATA_FILE_PATH)

<span class="hljs-comment"># trained preprocessor is also stored to transform the prediction data</span>
s3_upload(file_path=PROCESSOR_PATH)
</code></pre>
<h3 id="heading-step-3-create-a-flask-application-with-api-endpoints">Step 3: Create a Flask Application with API Endpoints</h3>
<p>Next, we’ll create a Flask application with API endpoints.</p>
<p>The Flask application is configured in the <code>app.py</code> file located at the root of the project repository.</p>
<p>As shown in the code snippet below, the <code>app.py</code> file needs to contain the following components, in this order:</p>
<ol>
<li><p>AWS Boto3 client setup,</p>
</li>
<li><p>Flask app configuration and API endpoint setup,</p>
</li>
<li><p>Loading the trained preprocessor, processed input data <code>X_test</code>, and trained models,</p>
</li>
<li><p>The Lambda handler, invoked via API Gateway, and</p>
</li>
<li><p>The local test section.</p>
</li>
</ol>
<p>Note that <code>X_test</code> should never be used during model training to avoid data leakage.</p>
<p><code>app.py</code></p>
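<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json

<span class="hljs-keyword">import</span> awsgi
<span class="hljs-keyword">import</span> boto3
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> flask <span class="hljs-keyword">import</span> Flask, jsonify, request
<span class="hljs-keyword">from</span> flask_cors <span class="hljs-keyword">import</span> cross_origin
<span class="hljs-keyword">from</span> waitress <span class="hljs-keyword">import</span> serve
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv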

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> main_logger

<span class="hljs-comment"># global variables (will be loaded from the S3 buckets)</span>
_redis_client = <span class="hljs-literal">None</span>
X_test = <span class="hljs-literal">None</span>
preprocessor = <span class="hljs-literal">None</span>
model = <span class="hljs-literal">None</span>
backup_model = <span class="hljs-literal">None</span>

<span class="hljs-comment"># load env if local else skip (lambda refers to env in production)</span>
AWS_LAMBDA_RUNTIME_API = os.environ.get(<span class="hljs-string">'AWS_LAMBDA_RUNTIME_API'</span>, <span class="hljs-literal">None</span>)
<span class="hljs-keyword">if</span> AWS_LAMBDA_RUNTIME_API <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>: load_dotenv(override=<span class="hljs-literal">True</span>)


<span class="hljs-comment">#### &lt;---- 1. AWS BOTO3 CLIENT ----&gt;</span>
<span class="hljs-comment"># boto3 client </span>
S3_BUCKET_NAME = os.environ.get(<span class="hljs-string">'S3_BUCKET_NAME'</span>, <span class="hljs-string">'ml-sales-pred'</span>)
s3_client = boto3.client(<span class="hljs-string">'s3'</span>, region_name=os.environ.get(<span class="hljs-string">'AWS_REGION_NAME'</span>, <span class="hljs-string">'us-east-1'</span>))
<span class="hljs-keyword">try</span>:
    <span class="hljs-comment"># test connection to boto3 client</span>
    sts_client = boto3.client(<span class="hljs-string">'sts'</span>)
    identity = sts_client.get_caller_identity()
    main_logger.info(<span class="hljs-string">f"Lambda is using role: <span class="hljs-subst">{identity[<span class="hljs-string">'Arn'</span>]}</span>"</span>)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    main_logger.error(<span class="hljs-string">f"Lambda credentials/permissions error: <span class="hljs-subst">{e}</span>"</span>)

<span class="hljs-comment">#### &lt;---- 2. FLASK CONFIGURATION &amp; API ENDPOINTS ----&gt;</span>
<span class="hljs-comment"># configure the flask app</span>
app = Flask(__name__)
app.config[<span class="hljs-string">'CORS_HEADERS'</span>] = <span class="hljs-string">'Content-Type'</span>

<span class="hljs-comment"># add a simple API endpoint to serve the prediction by price point to test</span>
<span class="hljs-meta">@app.route('/v1/predict-price/&lt;string:stockcode&gt;', methods=['GET', 'OPTIONS'])</span>
<span class="hljs-meta">@cross_origin(origins=origins, methods=['GET', 'OPTIONS'], supports_credentials=True)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_price</span>(<span class="hljs-params">stockcode</span>):</span>
    df_stockcode = <span class="hljs-literal">None</span>

    <span class="hljs-comment"># fetch request params</span>
    data = request.args.to_dict()

    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># fetch cache</span>
        <span class="hljs-keyword">if</span> _redis_client <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            <span class="hljs-comment"># returns cached prediction results if any without performing inference</span>
            cached_prediction_result = _redis_client.get(cache_key_prediction_result_by_stockcode)
            <span class="hljs-keyword">if</span> cached_prediction_result: 
                <span class="hljs-keyword">return</span> jsonify(json.loads(cached_prediction_result))

            <span class="hljs-comment"># historical data of the selected product</span>
            cached_df_stockcode = _redis_client.get(cache_key_df_stockcode)
            <span class="hljs-keyword">if</span> cached_df_stockcode: df_stockcode = json.loads(cached_df_stockcode)


        <span class="hljs-comment"># define the price range to make predictions. can be a request param, or historical min/max prices</span>
        min_price = float(data.get(<span class="hljs-string">'unitprice_min'</span>, df_stockcode[<span class="hljs-string">'unitprice_min'</span>][<span class="hljs-number">0</span>]))
        max_price = float(data.get(<span class="hljs-string">'unitprice_max'</span>, df_stockcode[<span class="hljs-string">'unitprice_max'</span>][<span class="hljs-number">0</span>]))

        <span class="hljs-comment"># create bins in the price range. more bins give a smoother prediction curve at a higher computational cost</span>
        NUM_PRICE_BINS = int(data.get(<span class="hljs-string">'num_price_bins'</span>, <span class="hljs-number">100</span>))
        price_range = np.linspace(min_price, max_price, NUM_PRICE_BINS)

        <span class="hljs-comment"># create a prediction dataset by merging X_test (dataset never used in model training) and df_stockcode</span>
        price_range_df = pd.DataFrame({ <span class="hljs-string">'unitprice'</span>: price_range })
        test_sample = X_test.sample(n=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">42</span>)
        test_sample_merged = test_sample.merge(price_range_df, how=<span class="hljs-string">'cross'</span>) <span class="hljs-keyword">if</span> X_test <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">else</span> price_range_df
        test_sample_merged.drop(<span class="hljs-string">'unitprice_x'</span>, axis=<span class="hljs-number">1</span>, inplace=<span class="hljs-literal">True</span>)
        test_sample_merged.rename(columns={<span class="hljs-string">'unitprice_y'</span>: <span class="hljs-string">'unitprice'</span>}, inplace=<span class="hljs-literal">True</span>)

        <span class="hljs-comment"># preprocess the dataset</span>
        X = preprocessor.transform(test_sample_merged) <span class="hljs-keyword">if</span> preprocessor <span class="hljs-keyword">else</span> test_sample_merged

        <span class="hljs-comment"># perform inference</span>
        y_pred_actual = <span class="hljs-literal">None</span>
        epsilon = <span class="hljs-number">0</span>
        <span class="hljs-comment"># try using the primary model</span>
        <span class="hljs-keyword">if</span> model:
            input_tensor = torch.tensor(X, dtype=torch.float32)
            model.eval()
            <span class="hljs-keyword">with</span> torch.inference_mode():
                y_pred = model(input_tensor)
                y_pred = y_pred.cpu().numpy().flatten()
                y_pred_actual = np.exp(y_pred + epsilon)

        <span class="hljs-comment"># if not, use backups</span>
        <span class="hljs-keyword">elif</span> backup_model:
            y_pred = backup_model.predict(X)
            y_pred_actual = np.exp(y_pred + epsilon)


        <span class="hljs-comment"># finalize the outcome for client app</span>
        df_ = test_sample_merged.copy()
        df_[<span class="hljs-string">'quantity'</span>] = np.floor(y_pred_actual) <span class="hljs-comment"># quantity must be an integer</span>
        df_[<span class="hljs-string">'sales'</span>] = df_[<span class="hljs-string">'quantity'</span>] * df_[<span class="hljs-string">'unitprice'</span>] <span class="hljs-comment"># compute sales</span>
        df_ = df_.sort_values(by=<span class="hljs-string">'unitprice'</span>)

        <span class="hljs-comment"># aggregate the results by the unitprice in the price range</span>
        df_results = df_.groupby(<span class="hljs-string">'unitprice'</span>).agg(
            quantity=(<span class="hljs-string">'quantity'</span>, <span class="hljs-string">'median'</span>),
            quantity_min=(<span class="hljs-string">'quantity'</span>, <span class="hljs-string">'min'</span>),
            quantity_max=(<span class="hljs-string">'quantity'</span>, <span class="hljs-string">'max'</span>),
            sales=(<span class="hljs-string">'sales'</span>, <span class="hljs-string">'median'</span>),
        ).reset_index()

        <span class="hljs-comment"># find the optimal price point</span>
        optimal_row = df_results.loc[df_results[<span class="hljs-string">'sales'</span>].idxmax()]
        optimal_price = optimal_row[<span class="hljs-string">'unitprice'</span>]
        optimal_quantity = optimal_row[<span class="hljs-string">'quantity'</span>]
        best_sales = optimal_row[<span class="hljs-string">'sales'</span>]

        all_outputs = []
        <span class="hljs-keyword">for</span> _, row <span class="hljs-keyword">in</span> df_results.iterrows():
            current_output = {
                <span class="hljs-string">"stockcode"</span>: stockcode,
                <span class="hljs-string">"unit_price"</span>: float(row[<span class="hljs-string">'unitprice'</span>]),
                <span class="hljs-string">'quantity'</span>: int(row[<span class="hljs-string">'quantity'</span>]),
                <span class="hljs-string">'quantity_min'</span>: int(row[<span class="hljs-string">'quantity_min'</span>]),
                <span class="hljs-string">'quantity_max'</span>: int(row[<span class="hljs-string">'quantity_max'</span>]),
                <span class="hljs-string">"predicted_sales"</span>: float(row[<span class="hljs-string">'sales'</span>]),
            }
            all_outputs.append(current_output)

        <span class="hljs-comment"># store the prediction results in cache</span>
        <span class="hljs-keyword">if</span> all_outputs <span class="hljs-keyword">and</span> _redis_client <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            serialized_data = json.dumps(all_outputs)
            _redis_client.set(
                cache_key_prediction_result_by_stockcode, 
                serialized_data,
                ex=<span class="hljs-number">3600</span>     <span class="hljs-comment"># expire in an hour</span>
            )

        <span class="hljs-comment"># return a list of all outputs</span>
        <span class="hljs-keyword">return</span> jsonify(all_outputs)

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e: <span class="hljs-keyword">return</span> jsonify([])


<span class="hljs-comment"># response header management (for responses returned through API Gateway from the Lambda)</span>
<span class="hljs-meta">@app.after_request</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">add_header</span>(<span class="hljs-params">response</span>):</span>
    response.headers[<span class="hljs-string">'Cache-Control'</span>] = <span class="hljs-string">'public, max-age=0'</span>
    response.headers[<span class="hljs-string">'Access-Control-Allow-Origin'</span>] = CLIENT_A
    response.headers[<span class="hljs-string">'Access-Control-Allow-Headers'</span>] = <span class="hljs-string">'Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token,Origin'</span>
    response.headers[<span class="hljs-string">'Access-Control-Allow-Methods'</span>] = <span class="hljs-string">'GET, POST, OPTIONS'</span>
    response.headers[<span class="hljs-string">'Access-Control-Allow-Credentials'</span>] = <span class="hljs-string">'true'</span>
    <span class="hljs-keyword">return</span> response

<span class="hljs-comment">#### &lt;---- 3. LOADING PROCESSOR, DATASET, AND MODELS ----&gt;</span>
load_processor()
load_x_test()
load_model()

<span class="hljs-comment">#### &lt;---- 4. INVOKE LAMBDA ----&gt;</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handler</span>(<span class="hljs-params">event, context</span>):</span>
    main_logger.info(<span class="hljs-string">"lambda handler invoked."</span>)
    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># connecting the redis client after the lambda is invoked</span>
        get_redis_client()
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        main_logger.critical(<span class="hljs-string">f"failed to establish initial Redis connection in handler: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">'statusCode'</span>: <span class="hljs-number">500</span>,
            <span class="hljs-string">'body'</span>: json.dumps({<span class="hljs-string">'error'</span>: <span class="hljs-string">'Failed to initialize Redis client. Check environment variables and network config.'</span>})
        }

    <span class="hljs-comment"># use the awsgi package to convert JSON to WSGI</span>
    <span class="hljs-keyword">return</span> awsgi.response(app, event, context)


<span class="hljs-comment">#### &lt;---- 5. FOR LOCAL TEST ----&gt;</span>
<span class="hljs-comment"># serve the application locally on WSGI server, waitress</span>
<span class="hljs-comment"># lambda will ignore this section.</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:   
    <span class="hljs-keyword">if</span> os.getenv(<span class="hljs-string">'ENV'</span>) == <span class="hljs-string">'local'</span>:
        main_logger.info(<span class="hljs-string">"...start the operation (local)..."</span>)
        serve(app, host=<span class="hljs-string">'0.0.0.0'</span>, port=<span class="hljs-number">5002</span>)
    <span class="hljs-keyword">else</span>:
        app.run(host=<span class="hljs-string">'0.0.0.0'</span>, port=<span class="hljs-number">8080</span>)
</code></pre>
<p>I’ll test the endpoint locally using the <code>uv</code> package manager:</p>
<pre><code class="lang-bash">$ uv run app.py --cache-clear

$ curl http://localhost:5002/v1/predict-price/{STOCKCODE}
</code></pre>
<p>The system provided a list of sales predictions for each price point:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755607075000/e0e8cbcb-8817-4aa5-b3d1-37b76cc684fb.png" alt="Fig. Screenshot of the Flask app local response" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the Flask app local response</p>
<h4 id="heading-key-points-on-flask-app-configuration">Key Points on Flask App Configuration</h4>
<p>There are various points you should take into consideration when configuring a Flask application with Lambda. Let’s go over them now:</p>
<h5 id="heading-1-a-few-api-endpoints-per-container"><strong>1. A Few API Endpoints Per Container</strong></h5>
<p>Adding many API endpoints to a single serverless instance can lead to the <strong>monolithic function</strong> problem, where issues in one endpoint impact the others.</p>
<p>In this project, we’ll focus on a single endpoint per container – and if needed, we can add separate Lambda functions to the system.</p>
<h5 id="heading-2-understanding-the-handler-function-and-the-role-of-awsgi"><strong>2. Understanding the</strong> <code>handler</code> <strong>Function and the Role of AWSGI</strong></h5>
<p>The <code>handler</code> function is invoked every time the Lambda function receives a client request from the API Gateway.</p>
<p>The function takes the <code>event</code> argument that includes the request details in a <strong>JSON dictionary</strong> and passes it to the Flask application.</p>
<p><strong>AWSGI</strong> acts as an adapter, translating a Lambda event in JSON format into a WSGI request that a Flask application can understand, and converts the application’s response back into a JSON format that Lambda and API Gateway can process.</p>
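<p>To make the adapter’s job concrete, here is a minimal sketch of the translation step. It is not the AWSGI implementation: the function name <code>event_to_environ</code> and the sample event are illustrative assumptions, and only a handful of WSGI keys are filled in.</p>
<pre><code class="lang-python">from io import BytesIO
from urllib.parse import urlencode


def event_to_environ(event):
    """Build a partial WSGI environ from an API Gateway proxy event (illustrative only)."""
    body = (event.get("body") or "").encode("utf-8")
    query = event.get("queryStringParameters") or {}
    environ = {
        "REQUEST_METHOD": event.get("httpMethod", "GET"),
        "PATH_INFO": event.get("path", "/"),
        "QUERY_STRING": urlencode(query),
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": BytesIO(body),
        "wsgi.url_scheme": "https",
    }
    # WSGI exposes request headers as HTTP_* keys in upper snake case
    for name, value in (event.get("headers") or {}).items():
        environ["HTTP_" + name.upper().replace("-", "_")] = value
    return environ


# hypothetical event, shaped like an API Gateway REST proxy request
sample_event = {
    "httpMethod": "GET",
    "path": "/v1/predict-price/85123A",
    "headers": {"Accept": "application/json"},
    "queryStringParameters": None,
    "body": None,
}
environ = event_to_environ(sample_event)
print(environ["REQUEST_METHOD"], environ["PATH_INFO"])
</code></pre>
<p>A real adapter also handles base64-encoded bodies, multi-value headers, and the reverse conversion from the WSGI response back to the JSON structure that API Gateway expects.</p>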
<h5 id="heading-3-using-cache-storage"><strong>3. Using Cache Storage</strong></h5>
<p>The <code>get_redis_client</code> function is called each time API Gateway invokes the <code>handler</code> function. It creates the Redis client on the first call and reuses it afterward, which lets the Flask application store and fetch cached values:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> redis
<span class="hljs-keyword">import</span> redis.cluster
<span class="hljs-keyword">from</span> redis.cluster <span class="hljs-keyword">import</span> ClusterNode

_redis_client = <span class="hljs-literal">None</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_redis_client</span>():</span>
    <span class="hljs-keyword">global</span> _redis_client
    <span class="hljs-keyword">if</span> _redis_client <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        REDIS_HOST = os.environ.get(<span class="hljs-string">"REDIS_HOST"</span>)
        REDIS_PORT = int(os.environ.get(<span class="hljs-string">"REDIS_PORT"</span>, <span class="hljs-number">6379</span>))
        REDIS_TLS = os.environ.get(<span class="hljs-string">"REDIS_TLS"</span>, <span class="hljs-string">"true"</span>).lower() == <span class="hljs-string">"true"</span>
        <span class="hljs-keyword">try</span>:
            startup_nodes = [ClusterNode(host=REDIS_HOST, port=REDIS_PORT)]
            _redis_client = redis.cluster.RedisCluster(
                startup_nodes=startup_nodes,
                decode_responses=<span class="hljs-literal">True</span>,
                skip_full_coverage_check=<span class="hljs-literal">True</span>,
                ssl=REDIS_TLS,                  <span class="hljs-comment"># elasticache has encryption in transit: enabled -&gt; must be true</span>
                ssl_cert_reqs=<span class="hljs-literal">None</span>,
                socket_connect_timeout=<span class="hljs-number">5</span>,
                socket_timeout=<span class="hljs-number">5</span>,
                health_check_interval=<span class="hljs-number">30</span>,
                retry_on_timeout=<span class="hljs-literal">True</span>,
                retry_on_error=[
                    redis.exceptions.ConnectionError,
                    redis.exceptions.TimeoutError
                ],
                max_connections=<span class="hljs-number">10</span>,            <span class="hljs-comment"># limit connections for Lambda</span>
                max_connections_per_node=<span class="hljs-number">2</span>     <span class="hljs-comment"># limit per node</span>
            )
            _redis_client.ping()
            main_logger.info(<span class="hljs-string">"successfully connected to ElastiCache Redis Cluster (Configuration Endpoint)"</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            main_logger.error(<span class="hljs-string">f"an unexpected error occurred during Redis Cluster connection: <span class="hljs-subst">{e}</span>"</span>, exc_info=<span class="hljs-literal">True</span>)
            _redis_client = <span class="hljs-literal">None</span>
    <span class="hljs-keyword">return</span> _redis_client
</code></pre>
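<p>Once a client is available, a route can follow a cache-aside pattern: look the key up first and only run the model on a miss. The sketch below is a minimal illustration under assumptions, not the project’s code. <code>InMemoryCache</code> is a hypothetical stand-in that mimics the <code>get</code> and <code>set(..., ex=...)</code> calls of a redis-py client so the example runs without a cluster, and the key format and <code>fake_predict</code> function are made up.</p>
<pre><code class="lang-python">import json


class InMemoryCache:
    """Hypothetical stand-in for a Redis client; stores strings in a dict."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value, ex=None):
        # a real Redis server would expire the key after `ex` seconds
        self._store[key] = value
        return True


def get_or_compute(client, key, compute, ttl_seconds=3600):
    """Return the cached value for key, computing and caching it on a miss."""
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    value = compute()
    client.set(key, json.dumps(value), ex=ttl_seconds)
    return value


calls = []


def fake_predict():
    # placeholder for the expensive model inference
    calls.append(1)
    return [{"unit_price": 2.5, "predicted_sales": 120.0}]


cache = InMemoryCache()
first = get_or_compute(cache, "prediction:85123A", fake_predict)
second = get_or_compute(cache, "prediction:85123A", fake_predict)
print(len(calls))
</code></pre>
<p>The second call is served from the cache, so the expensive function runs only once per key until the entry expires.</p>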
<h5 id="heading-4-handling-heavy-tasks-outside-of-the-handler-function"><strong>4. Handling Heavy Tasks Outside of the</strong> <code>handler</code> <strong>Function</strong></h5>
<p>Serverless functions can incur a <strong>cold start</strong>: extra latency while a new execution environment is initialized before it can serve its first request.</p>
<p>While a Lambda function can run for up to 15 minutes, its associated API Gateway has a default timeout of 29 seconds (29,000 ms) for a REST API.</p>
<p>So, any heavy tasks like loading preprocessors, input data, or models should be performed once outside of the <code>handler</code> function, ensuring they are ready <em>before</em> the API endpoint is called.</p>
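<p>The pattern can be sketched in isolation. The example below uses made-up names (<code>load_heavy_resource</code>, <code>LOAD_COUNT</code>) rather than the project’s loaders: the module-level call runs once when the execution environment starts, and every later invocation of <code>handler</code> reuses the loaded object.</p>
<pre><code class="lang-python">LOAD_COUNT = 0
_resource = None


def load_heavy_resource():
    """Stand-in for loading a model or preprocessor (hypothetical)."""
    global LOAD_COUNT, _resource
    LOAD_COUNT += 1
    _resource = {"name": "demo-model"}


# runs once per execution environment, at import time (the cold start)
load_heavy_resource()


def handler(event, context):
    # warm invocations reuse the already-loaded resource
    return {"statusCode": 200, "body": _resource["name"]}


responses = [handler({}, None) for _ in range(3)]
print(LOAD_COUNT)
</code></pre>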
<p>Here are the loading functions called in <code>app.py</code>.</p>
<p><code>app.py</code></p>
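<pre><code class="lang-python"><span class="hljs-keyword">import</span> os

<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> torch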

<span class="hljs-keyword">from</span> src._utils <span class="hljs-keyword">import</span> s3_load, s3_load_to_temp_file

preprocessor = <span class="hljs-literal">None</span>
X_test = <span class="hljs-literal">None</span>
model = <span class="hljs-literal">None</span>
backup_model = <span class="hljs-literal">None</span>


<span class="hljs-comment"># load processor</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_preprocessor</span>():</span>
    <span class="hljs-keyword">global</span> preprocessor
    preprocessor_tempfile_path = s3_load_to_temp_file(PREPROCESSOR_PATH)
    preprocessor = joblib.load(preprocessor_tempfile_path)
    os.remove(preprocessor_tempfile_path)


<span class="hljs-comment"># load input data</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_x_test</span>():</span>
    <span class="hljs-keyword">global</span> X_test
    x_test_io = s3_load(file_path=X_TEST_PATH)
    X_test = pd.read_parquet(x_test_io)


<span class="hljs-comment"># load model</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_model</span>():</span>
    <span class="hljs-keyword">global</span> model, backup_model
    <span class="hljs-comment"># try loading &amp; reconstructing the primary model</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># first load io file from the s3 bucket</span>
        model_data_bytes_io_ = s3_load(file_path=DFN_FILE_PATH)
        <span class="hljs-comment"># convert to checkpoint dictionary (containing hyperparameter set)</span>
        checkpoint_ = torch.load(
            model_data_bytes_io_, 
            weights_only=<span class="hljs-literal">False</span>, 
            map_location=device
        )
        <span class="hljs-comment"># reconstruct the model</span>
        model = t.scripts.load_model(checkpoint=checkpoint_, file_path=DFN_FILE_PATH)
        <span class="hljs-comment"># set the model evaluation mode</span>
        model.eval()

    <span class="hljs-comment"># otherwise, fall back to the backup model</span>
    <span class="hljs-keyword">except</span> Exception:
        load_artifacts_backup_model()
</code></pre>
<h3 id="heading-step-4-publish-a-docker-image-to-ecr">Step 4: Publish a Docker Image to ECR</h3>
<p>After configuring the Flask application, we’ll containerize the entire application on <strong>Docker</strong>.</p>
<p>Containerization packages the application, together with its models, dependencies, and configuration, into a single container.</p>
<p>Docker creates a container image based on the instructions defined in a Dockerfile, and the Docker engine uses the image to run the isolated container.</p>
<p>In this project, we’ll upload the Docker container image to ECR, so the Lambda function can access it in production.</p>
<p>First, we’ll define the <code>.dockerignore</code> file to keep unneeded files out of the container image:</p>
<p><code>.dockerignore</code></p>
<pre><code class="lang-plaintext"># any irrelevant data
__pycache__/
.ruff_cache/
.DS_Store/
.venv/
dist/
.vscode
*.psd
*.pdf
[a-f]*.log
tmp/
awscli-bundle/

# add any experimental models, unnecessary data
dfn_bayesian/
dfn_grid/
data/
notebooks/
</code></pre>
<p><code>Dockerfile</code></p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># serve from aws ecr </span>
<span class="hljs-keyword">FROM</span> public.ecr.aws/lambda/python:<span class="hljs-number">3.12</span>

<span class="hljs-comment"># define a working directory in the container</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app</span>

<span class="hljs-comment"># copy the entire repository (except paths listed in .dockerignore) into the container at /app</span>
<span class="hljs-keyword">COPY</span><span class="bash"> . /app/</span>

<span class="hljs-comment"># install dependencies defined in the requirements.txt</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install --no-cache-dir -r requirements.txt</span>

<span class="hljs-comment"># define commands</span>
<span class="hljs-keyword">ENTRYPOINT</span><span class="bash"> [ <span class="hljs-string">"python"</span> ]</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [ <span class="hljs-string">"-m"</span>, <span class="hljs-string">"awslambdaric"</span>, <span class="hljs-string">"app.handler"</span> ]</span>
</code></pre>
<h4 id="heading-test-in-local">Test Locally</h4>
<p>Next, we’ll test the Docker image locally. First, build the image and tag it <code>my-app</code>:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> build -t my-app -f Dockerfile .
</code></pre>
<p>Then, we’ll run the container locally with the <code>waitress</code> server:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$docker</span> run -p 5002:5002 -e ENV=<span class="hljs-built_in">local</span> my-app app.py
</code></pre>
<p>The <code>-e ENV=local</code> flag sets the environment variable inside the container, which triggers the <code>waitress.serve()</code> call in <code>app.py</code>.</p>
<p>In the terminal, you’ll find a message saying the following:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*zu8mamgKMKOUxwCA.png" alt="Flask app response" width="600" height="400" loading="lazy"></p>
<p>You can also call the endpoint created to see the results returned:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$uv</span> run app.py --cache-clear

<span class="hljs-variable">$curl</span> http://localhost:5002/v1/predict-price/{STOCKCODE}
</code></pre>
<h4 id="heading-publish-the-docker-image-to-ecr">Publish the Docker Image to ECR</h4>
<p>To publish the Docker image, we first need to configure the default AWS credentials and region:</p>
<ul>
<li><p>From the AWS console, create an access key and check the default region.</p>
</li>
<li><p>Store them in the <code>~/.aws/credentials</code> and <code>~/.aws/config</code> files:</p>
</li>
</ul>
<p><code>~/.aws/credentials</code></p>
<pre><code class="lang-plaintext">[default] 
aws_secret_access_key=
aws_access_key_id=
</code></pre>
<p><code>~/.aws/config</code></p>
<pre><code class="lang-plaintext">[default]
region=
</code></pre>
<p>After the configuration, we’ll publish the Docker image to ECR.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># authenticate the docker client to ECR</span>
<span class="hljs-variable">$aws</span> ecr get-login-password --region &lt;your-aws-region&gt; | docker login --username AWS --password-stdin &lt;your-aws-account-id&gt;.dkr.ecr.&lt;your-aws-region&gt;.amazonaws.com

<span class="hljs-comment"># create repository</span>
<span class="hljs-variable">$aws</span> ecr create-repository --repository-name &lt;your-repo-name&gt; --region &lt;your-aws-region&gt;

<span class="hljs-comment"># tag the docker image</span>
<span class="hljs-variable">$docker</span> tag &lt;your-repo-name&gt;:&lt;your-app-version&gt; &lt;your-aws-account-id&gt;.dkr.ecr.&lt;your-aws-region&gt;.amazonaws.com/&lt;your-repo-name&gt;:&lt;your-app-version&gt;

<span class="hljs-comment"># push</span>
<span class="hljs-variable">$docker</span> push &lt;your-aws-account-id&gt;.dkr.ecr.&lt;your-aws-region&gt;.amazonaws.com/&lt;your-repo-name&gt;:&lt;your-app-version&gt;
</code></pre>
<p>Replace the placeholders as follows:</p>
<ul>
<li><p><code>&lt;your-aws-region&gt;</code>: Your default AWS region (for example, <code>us-east-1</code>).</p>
</li>
<li><p><code>&lt;your-aws-account-id&gt;</code>: 12-digit AWS account ID.</p>
</li>
<li><p><code>&lt;your-repo-name&gt;</code>: Your desired repository name.</p>
</li>
<li><p><code>&lt;your-app-version&gt;</code>: Your desired tag name (for example, <code>v1.0</code>).</p>
</li>
</ul>
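<p>Since the same registry address appears in the login, tag, and push commands, it can help to assemble it in one place. The helper below is a small sketch. The function name and the sample values are placeholders, not real account details.</p>
<pre><code class="lang-python">def ecr_image_uri(account_id, region, repo_name, tag):
    """Compose the full ECR image URI used by docker tag and docker push."""
    if len(account_id) != 12 or not account_id.isdigit():
        raise ValueError("AWS account IDs are 12 digits")
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repo_name}:{tag}"


# placeholder values for illustration
uri = ecr_image_uri("123456789012", "us-east-1", "my-app", "v1.0")
print(uri)
</code></pre>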
<p>Now, the Docker image is stored in ECR with the tag:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*tUQkbDW-uAmrjBfx.png" alt="Fig. Screenshot of the AWS ECR console" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the AWS ECR console</p>
<h3 id="heading-step-5-create-a-lambda-function">Step 5: Create a Lambda Function</h3>
<p>Next, we’ll create a Lambda function.</p>
<p>From the Lambda console, choose:</p>
<ul>
<li><p>The <code>Container Image</code> option,</p>
</li>
<li><p>The container image URL from the drop-down list,</p>
</li>
<li><p>A function name of our choice, and</p>
</li>
<li><p>An architecture type (arm64 is recommended for better price-performance).</p>
</li>
</ul>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*3b-wIEUzRooQcvN_.png" alt="Fig. Screenshot of AWS Lambda function configuration" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of AWS Lambda function configuration</p>
<p>The Lambda function <code>my-app</code> was successfully launched.</p>
<h4 id="heading-connect-the-lambda-function-to-api-gateway">Connect the Lambda function to API Gateway</h4>
<p>Next, we’ll add API Gateway as an event trigger to the Lambda function.</p>
<p>First, visit the API Gateway console and create <strong>REST API methods</strong> using the ARN of the Lambda function:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*60TP64gdSjhKfiO8.png" alt="Fig. Screenshot of the AWS API Gateway configuration" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the AWS API Gateway configuration</p>
<p>Then, add resources to the created API gateway to create an endpoint:<br><code>API Gateway &gt; APIs &gt; Resources &gt; Create Resource</code></p>
<ul>
<li><p>Align the resource endpoint with the API endpoint defined in <code>app.py</code>.</p>
</li>
<li><p>Configure CORS (for example, accept specific origins).</p>
</li>
<li><p>Deploy the resource to the stage.</p>
</li>
</ul>
<p>Going back to the Lambda console, you’ll find the API Gateway is connected as an event trigger:<br><code>Lambda &gt; Function &gt; my-app (your function name)</code></p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/0*DlfiEieZArmYlOuT.png" alt="Fig. Screenshot of the AWS Lambda dashboard" width="600" height="400" loading="lazy"></p>
<p>Fig. Screenshot of the AWS Lambda dashboard</p>
<h3 id="heading-step-6-configure-aws-resources">Step 6: Configure AWS Resources</h3>
<p>Lastly, we’ll configure the related AWS resources to make the system work in production.</p>
<p>This process involves the following steps:</p>
<h4 id="heading-1-the-iam-role-controls-who-to-access-resources">1. The IAM Role: Controls Who Can Access Resources</h4>
<p>AWS uses <strong>IAM roles</strong> to grant temporary, scoped permissions to users and services, which mitigates the security risks of long-term credentials like passwords.</p>
<p>An IAM role relies on policies to grant access to selected services. Policies can be managed by AWS or customized by the user as an inline policy.</p>
<p>It is important to avoid overly permissive access rights for the IAM role.</p>
<ol>
<li><p>In the Lambda function console, check the execution role:<br> <code>Lambda &gt; Function &gt; &lt;FUNCTION&gt; &gt; Permission &gt; The execution role</code>.</p>
</li>
<li><p>Set up the following policies to allow the Lambda’s IAM role to handle necessary operations:</p>
<ul>
<li><p><strong>Lambda</strong> <code>AWSLambdaExecute</code>: Allows executing the function.</p>
</li>
<li><p><strong>EC2</strong> <code>Inline policy</code>: Allows controlling the security group and the VPC of the Lambda function.</p>
</li>
<li><p><strong>ECR</strong> <code>AmazonElasticContainerRegistryPublicFullAccess</code> + <code>Inline policy</code>: Allows storing and pulling the Docker image.</p>
</li>
<li><p><strong>ElastiCache</strong> <code>AmazonElastiCacheFullAccess</code> + <code>Inline policy</code>: Allows storing and pulling caches.</p>
</li>
<li><p><strong>S3</strong> <code>AmazonS3ReadOnlyAccess</code> + <code>Inline policy</code>: Allows reading and storing contents.</p>
</li>
</ul>
</li>
</ol>
<p>Now, the IAM role can access these resources and perform the allowed actions.</p>
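<p>As an illustration of a narrowly scoped inline policy, the sketch below builds a policy document that only allows listing and reading objects in a single bucket. The bucket name is a placeholder and the helper function is hypothetical. The JSON it prints is in the standard IAM policy document format.</p>
<pre><code class="lang-python">import json


def s3_read_only_policy(bucket_name):
    """Build an IAM policy document limited to reading one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
        ],
    }


# "my-model-artifacts" is a placeholder bucket name
policy = s3_read_only_policy("my-model-artifacts")
print(json.dumps(policy, indent=2))
</code></pre>
<p>Scoping the <code>Resource</code> to one bucket, rather than <code>*</code>, keeps the role from reading unrelated data if the function is ever compromised.</p>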
<h4 id="heading-2-the-security-group-controls-network-traffic">2. The Security Group: Controls Network Traffic</h4>
<p>A <strong>security group</strong> is a virtual firewall that controls inbound and outbound network traffic for AWS resources.</p>
<p>Its rules are stateful (return traffic is allowed automatically) and allow-only: each rule matches on protocol, port, and IP address, and traffic that no rule allows is denied.</p>
<p>Create a new security group for the Lambda function:<br><code>EC2 &gt; Security Groups &gt; &lt;YOUR SECURITY GROUP&gt;</code></p>
<p>Now, we’ll set up the inbound and outbound traffic rules.</p>
<p>The inbound rules:</p>
<ul>
<li><p><strong>S3 → Lambda</strong>: <strong>Type</strong>: HTTPS / <strong>Protocol</strong>: TCP / <strong>Port range</strong>: 443 / <strong>Source</strong>: Custom*</p>
</li>
<li><p><strong>ElastiCache → Lambda</strong>: <strong>Type</strong>: Custom TCP / <strong>Port range</strong>: 6379 / <strong>Source</strong>: Custom*</p>
</li>
</ul>
<p>*Choose the created security group for the Lambda function as a custom source.</p>
<p>The outbound rules:</p>
<ul>
<li><p><strong>Lambda → Internet</strong>: <strong>Type</strong>: HTTPS / <strong>Protocol</strong>: TCP / <strong>Port range</strong>: 443 / <strong>Destination</strong>: 0.0.0.0/0</p>
</li>
<li><p><strong>ElastiCache → Internet</strong>: <strong>Type</strong>: All Traffic / <strong>Destination</strong>: 0.0.0.0/0</p>
</li>
</ul>
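<p>The same rules can be expressed as data. The sketch below builds rule entries in the <code>IpPermissions</code> shape that the EC2 API accepts (for example, through boto3’s <code>authorize_security_group_ingress</code>). It only constructs dictionaries and makes no AWS calls. The helper names and the group ID are placeholders.</p>
<pre><code class="lang-python">def tcp_rule_from_group(port, source_group_id):
    """Allow TCP traffic on one port from members of a security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


def tcp_rule_to_anywhere(port):
    """Allow TCP traffic on one port to any IPv4 address."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }


lambda_sg = "sg-0123456789abcdef0"  # placeholder security group ID

inbound_rules = [
    tcp_rule_from_group(443, lambda_sg),   # HTTPS
    tcp_rule_from_group(6379, lambda_sg),  # Redis
]
outbound_rules = [tcp_rule_to_anywhere(443)]
print(inbound_rules[1]["FromPort"])
</code></pre>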
<h4 id="heading-3-the-virtual-private-cloud-vpc">3. The Virtual Private Cloud (VPC)</h4>
<p>A <strong>Virtual Private Cloud (VPC)</strong> provides a logically isolated private network for the AWS resources, acting as our own private data center within AWS.</p>
<p>AWS can create a <strong>Hyperplane ENI</strong> (Elastic Network Interface) for the Lambda function and its connected resources in the subnets of the VPC.</p>
<p>Though it’s optional, we’ll use the VPC to connect the Lambda function to the S3 storage and ElastiCache.</p>
<p>This process involves:</p>
<ol>
<li><p>Creating a VPC from the VPC console: <code>VPC &gt; Create VPC</code>.</p>
</li>
<li><p>Creating an STS (Security Token Service) endpoint:<br> <code>VPC &gt; PrivateLink and Lattice &gt; Endpoints &gt; Create Endpoint &gt;</code></p>
<ul>
<li><p><strong>Type</strong>: AWS Service</p>
</li>
<li><p><strong>Service name</strong>: com.amazonaws.&lt;YOUR REGION&gt;.sts</p>
</li>
<li><p><strong>Type</strong>: Interface</p>
</li>
<li><p><strong>VPC</strong>: Select the VPC created earlier.</p>
</li>
<li><p><strong>Subnets</strong>: Select all subnets.</p>
</li>
<li><p><strong>Security groups</strong>: Select the security group of the Lambda function.</p>
</li>
<li><p><strong>Policy</strong>: Full access</p>
</li>
<li><p><strong>Enable DNS names</strong></p>
</li>
</ul>
</li>
</ol>
<p>The VPC needs a dedicated STS endpoint so that resources inside it can obtain temporary credentials from STS.</p>
<ol start="3">
<li><p>Creating an S3 endpoint in the VPC:<br> <code>VPC &gt; PrivateLink and Lattice &gt; Endpoints &gt; Create Endpoint &gt;</code></p>
<ul>
<li><p><strong>Type</strong>: AWS Service</p>
</li>
<li><p><strong>Service name</strong>: com.amazonaws.&lt;YOUR REGION&gt;.s3</p>
</li>
<li><p><strong>Type</strong>: Gateway</p>
</li>
<li><p><strong>VPC</strong>: Select the VPC created earlier.</p>
</li>
<li><p><strong>Subnets</strong>: Select all subnets.</p>
</li>
<li><p><strong>Security groups</strong>: Select the security group of the Lambda function.</p>
</li>
<li><p><strong>Policy</strong>: Full access</p>
</li>
</ul>
</li>
</ol>
<p>Lastly, check the security group of the Lambda function and ensure that its VPC ID points to the VPC created earlier: <code>EC2 &gt; Security Group &gt; &lt;YOUR SECURITY GROUP FOR THE LAMBDA FUNCTION&gt; &gt; VPC ID</code>.</p>
<p>That’s all for the deployment flow.</p>
<p>We can now test the API endpoint in production. Copy the <strong>Invoke URL</strong> of the deployed API endpoint: <code>API Gateway &gt; APIs &gt; Stages &gt; Invoke URL</code>. Then call the API endpoint and check that it returns predictions:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$curl</span> -H <span class="hljs-string">'Authorization: Bearer YOUR_API_TOKEN'</span> -H <span class="hljs-string">'Accept: application/json'</span> \
     <span class="hljs-string">'&lt;INVOKE URL&gt;/&lt;ENDPOINT&gt;'</span>
</code></pre>
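<p>The same request can be prepared from Python with the standard library. This is a sketch: the invoke URL, endpoint path, and token below are placeholders, and the final <code>urlopen</code> call is left commented out so the snippet runs without network access.</p>
<pre><code class="lang-python">import urllib.request


def build_request(invoke_url, endpoint, api_token):
    """Prepare a GET request for the deployed endpoint."""
    url = invoke_url.rstrip("/") + "/" + endpoint.lstrip("/")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", "Bearer " + api_token)
    request.add_header("Accept", "application/json")
    return request


# placeholder values; replace with your Invoke URL and endpoint
request = build_request(
    "https://example.execute-api.us-east-1.amazonaws.com/prod",
    "/v1/predict-price/85123A",
    "YOUR_API_TOKEN",
)
print(request.full_url)

# to send it:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.read().decode("utf-8"))
</code></pre>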
<p>For logging and debugging, we’ll use the LiveTail of CloudWatch: <code>CloudWatch &gt; LiveTail</code>.</p>
<h2 id="heading-building-a-client-application-optional">Building a Client Application (Optional)</h2>
<p>For full-stack deployment, we’ll build a simple React application to display the prediction using the <a target="_blank" href="https://recharts.org/en-US">recharts</a> library for visualization.</p>
<p>Other options for quick frontend deployment include <a target="_blank" href="https://streamlit.io/">Streamlit</a> or <a target="_blank" href="https://www.gradio.app/">Gradio</a>.</p>
<h3 id="heading-the-react-application">The React Application</h3>
<p>The React application creates a web page that fetches and visualizes sales predictions from an external API, recommending an optimal price point.</p>
<p>The app uses <code>useState</code> to manage its data and state, including the selected product, the list of sales predictions, and the loading/error status.</p>
<p>When the user initiates a request, a <code>useEffect</code> hook sends a <code>fetch</code> request to the Flask backend. It parses the JSON response and stores the predictions in state, falling back to placeholder values if the request fails.</p>
<p>The <code>AreaChart</code> from the <code>recharts</code> library then visualizes this data. The X-axis represents the <code>price</code> and the Y-axis represents the <code>sales</code>. Once the predictions are received, the chart re-renders and the app displays the optimal price.</p>
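<p>The selection of the optimal price is a small piece of logic. Expressed in Python for consistency with the backend code (the React component does the equivalent in JavaScript), and using made-up sample values, it amounts to picking the price with the highest predicted sales:</p>
<pre><code class="lang-python">def optimal_price(predictions):
    """Return the unit price with the highest predicted sales, or 0 if empty."""
    if not predictions:
        return 0
    best = max(predictions, key=lambda item: item["predicted_sales"])
    return best["unit_price"]


# made-up sample predictions
sample = [
    {"unit_price": 2.0, "predicted_sales": 150.0},
    {"unit_price": 2.5, "predicted_sales": 180.0},
    {"unit_price": 3.0, "predicted_sales": 165.0},
]
print(optimal_price(sample))
</code></pre>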
<p><code>App.js</code>: (in a separate React app)</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> { useState, useEffect } <span class="hljs-keyword">from</span> <span class="hljs-string">"react"</span>
<span class="hljs-keyword">import</span> { AreaChart, Area, XAxis, YAxis, CartesianGrid, Tooltip, ResponsiveContainer, ReferenceLine } <span class="hljs-keyword">from</span> <span class="hljs-string">'recharts'</span>


<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">App</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-comment">// state</span>
  <span class="hljs-keyword">const</span> [predictions, setPredictions] = useState([])
  <span class="hljs-keyword">const</span> [start, setStart] = useState(<span class="hljs-literal">false</span>)
  <span class="hljs-keyword">const</span> [isLoading, setIsLoading] = useState(<span class="hljs-literal">false</span>)

  <span class="hljs-comment">// product data</span>
  <span class="hljs-keyword">let</span> selectedStockcode = <span class="hljs-string">'85123A'</span>
  <span class="hljs-keyword">let</span> selectedProduct = productOptions.filter(<span class="hljs-function"><span class="hljs-params">item</span> =&gt;</span> item.id === selectedStockcode)[<span class="hljs-number">0</span>]

  <span class="hljs-comment">// api endpoint</span>
  <span class="hljs-keyword">const</span> flaskBackendUrl = <span class="hljs-string">"YOUR FLASK BACKEND URL"</span>

  <span class="hljs-comment">// create chart data to display</span>
  <span class="hljs-keyword">const</span> chartDataSales = predictions &amp;&amp; predictions.length &gt; <span class="hljs-number">0</span>
    ? predictions
      .map(<span class="hljs-function"><span class="hljs-params">item</span> =&gt;</span> ({
        <span class="hljs-attr">price</span>: item.unit_price,
        <span class="hljs-attr">sales</span>: item.predicted_sales,
        <span class="hljs-attr">volume</span>: item.unit_price !== <span class="hljs-number">0</span> ? item.predicted_sales / item.unit_price : <span class="hljs-number">0</span>
      }))
      .sort(<span class="hljs-function">(<span class="hljs-params">a, b</span>) =&gt;</span> a.price - b.price)
    : [...selectedProduct[<span class="hljs-string">'histPrices'</span>]]

  <span class="hljs-comment">// optimal price to display</span>
  <span class="hljs-keyword">const</span> optimalPrice = predictions.length &gt; <span class="hljs-number">0</span>
    ? [...predictions].sort(<span class="hljs-function">(<span class="hljs-params">a, b</span>) =&gt;</span> b.predicted_sales - a.predicted_sales)[<span class="hljs-number">0</span>][<span class="hljs-string">'unit_price'</span>]
    : <span class="hljs-number">0</span>

  <span class="hljs-comment">// fetch prediction results</span>
  useEffect(<span class="hljs-function">() =&gt;</span> {
    <span class="hljs-keyword">const</span> handlePrediction = <span class="hljs-keyword">async</span> () =&gt; {
      setIsLoading(<span class="hljs-literal">true</span>)
      setPredictions([])
      <span class="hljs-keyword">const</span> errorPrices = selectedProduct[<span class="hljs-string">'errorPrices'</span>]

      <span class="hljs-keyword">await</span> fetch(flaskBackendUrl)
        .then(<span class="hljs-function"><span class="hljs-params">res</span> =&gt;</span> {
          <span class="hljs-keyword">if</span> (res.status !== <span class="hljs-number">200</span>) { setPredictions(errorPrices); setIsLoading(<span class="hljs-literal">false</span>); setStart(<span class="hljs-literal">false</span>) }
          <span class="hljs-keyword">else</span> <span class="hljs-keyword">return</span> <span class="hljs-built_in">Promise</span>.resolve(res.clone().json())
        })
        .then(<span class="hljs-function"><span class="hljs-params">res</span> =&gt;</span> {
          <span class="hljs-keyword">if</span> (res &amp;&amp; res.length &gt; <span class="hljs-number">0</span>) setPredictions(res)
          <span class="hljs-keyword">else</span> setPredictions(errorPrices)
          setIsLoading(<span class="hljs-literal">false</span>); setStart(<span class="hljs-literal">false</span>)
        })
        .catch(<span class="hljs-function"><span class="hljs-params">err</span> =&gt;</span> { setPredictions(errorPrices); setIsLoading(<span class="hljs-literal">false</span>); setStart(<span class="hljs-literal">false</span>) })
        .finally(<span class="hljs-function">() =&gt;</span> setStart(<span class="hljs-literal">false</span>))
    }

    <span class="hljs-keyword">if</span> (start) handlePrediction()
    <span class="hljs-keyword">if</span> (predictions &amp;&amp; predictions.length &gt; <span class="hljs-number">0</span>) setStart(<span class="hljs-literal">false</span>)
  }, [flaskBackendUrl, start])


  <span class="hljs-comment">// render</span>
  <span class="hljs-keyword">if</span> (isLoading) <span class="hljs-keyword">return</span> <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">Loading</span> /&gt;</span></span>
  <span class="hljs-keyword">return</span> (
    <span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">div</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">ResponsiveContainer</span> <span class="hljs-attr">width</span>=<span class="hljs-string">"100%"</span> <span class="hljs-attr">height</span>=<span class="hljs-string">"100%"</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">AreaChart</span>
          <span class="hljs-attr">key</span>=<span class="hljs-string">{chartDataSales.length}</span>
          <span class="hljs-attr">data</span>=<span class="hljs-string">{chartDataSales}</span></span>
          margin={{ top: 10, right: 30, left: 0, bottom: 0 }}
        &gt;
          <span class="hljs-tag">&lt;<span class="hljs-name">CartesianGrid</span> <span class="hljs-attr">strokeDasharray</span>=<span class="hljs-string">"3 3"</span> <span class="hljs-attr">strokeOpacity</span>=<span class="hljs-string">{0.6}</span> /&gt;</span>

          <span class="hljs-tag">&lt;<span class="hljs-name">XAxis</span>
            <span class="hljs-attr">dataKey</span>=<span class="hljs-string">"price"</span>
            <span class="hljs-attr">label</span>=<span class="hljs-string">{{</span> <span class="hljs-attr">value:</span> "<span class="hljs-attr">Unit</span> <span class="hljs-attr">Price</span> ($)", <span class="hljs-attr">position:</span> "<span class="hljs-attr">insideBottom</span>", <span class="hljs-attr">offset:</span> <span class="hljs-attr">0</span>, <span class="hljs-attr">fontSize:</span> <span class="hljs-attr">12</span>, <span class="hljs-attr">marginTop:</span> <span class="hljs-attr">10</span> }}
            <span class="hljs-attr">tickFormatter</span>=<span class="hljs-string">{(tick)</span> =&gt;</span> `$${parseFloat(tick).toFixed(2)}`}
            tick={{ fontSize: 12 }}
            padding={{ left: 20, right: 20 }}
          /&gt;

          <span class="hljs-tag">&lt;<span class="hljs-name">YAxis</span>
            <span class="hljs-attr">label</span>=<span class="hljs-string">{{</span> <span class="hljs-attr">value:</span> "<span class="hljs-attr">Predicted</span> <span class="hljs-attr">Sales</span> ($)", <span class="hljs-attr">angle:</span> <span class="hljs-attr">-90</span>, <span class="hljs-attr">position:</span> "<span class="hljs-attr">insideLeft</span>", <span class="hljs-attr">fontSize:</span> <span class="hljs-attr">12</span> }}
            <span class="hljs-attr">tick</span>=<span class="hljs-string">{{</span> <span class="hljs-attr">fontSize:</span> <span class="hljs-attr">12</span> }}
            <span class="hljs-attr">tickFormatter</span>=<span class="hljs-string">{(tick)</span> =&gt;</span> `$${tick.toLocaleString()}`}
          /&gt;

          {/* tooltips with the prediction result data */}
          <span class="hljs-tag">&lt;<span class="hljs-name">Tooltip</span>
            <span class="hljs-attr">contentStyle</span>=<span class="hljs-string">{{</span>
              <span class="hljs-attr">borderRadius:</span> '<span class="hljs-attr">8px</span>',
              <span class="hljs-attr">padding:</span> '<span class="hljs-attr">10px</span>',
              <span class="hljs-attr">boxShadow:</span> '<span class="hljs-attr">0px</span> <span class="hljs-attr">0px</span> <span class="hljs-attr">15px</span> <span class="hljs-attr">rgba</span>(<span class="hljs-attr">0</span>,<span class="hljs-attr">0</span>,<span class="hljs-attr">0</span>,<span class="hljs-attr">0.5</span>)'
            }}
            <span class="hljs-attr">formatter</span>=<span class="hljs-string">{(value,</span> <span class="hljs-attr">name</span>) =&gt;</span> {
              if (name === 'sales') {
                return [`$${value.toFixed(4)}`, 'Predicted Sales']
              }
              if (name === 'volume') {
                return [`${value.toFixed(0)}`, 'Volume']
              }
              return value
            }}
            labelFormatter={(label) =&gt; `Price: $${label.toFixed(2)}`}
          /&gt;

          {/* chart area = sales */}
          <span class="hljs-tag">&lt;<span class="hljs-name">Area</span>
            <span class="hljs-attr">type</span>=<span class="hljs-string">"monotone"</span>
            <span class="hljs-attr">dataKey</span>=<span class="hljs-string">"sales"</span>
            <span class="hljs-attr">fillOpacity</span>=<span class="hljs-string">{1}</span>
            <span class="hljs-attr">fill</span>=<span class="hljs-string">"url(#colorSales)"</span>
          /&gt;</span>

          {/* vertical line for the optimal price */}
          {optimalPrice &amp;&amp;
            <span class="hljs-tag">&lt;<span class="hljs-name">ReferenceLine</span>
              <span class="hljs-attr">x</span>=<span class="hljs-string">{optimalPrice}</span>
              <span class="hljs-attr">strokeDasharray</span>=<span class="hljs-string">"4 4"</span>
              <span class="hljs-attr">ifOverflow</span>=<span class="hljs-string">"visible"</span>
              <span class="hljs-attr">label</span>=<span class="hljs-string">{{</span>
                <span class="hljs-attr">value:</span> `<span class="hljs-attr">Optimal</span> <span class="hljs-attr">Price:</span> $${<span class="hljs-attr">optimalPrice</span> !== <span class="hljs-string">null</span> &amp;&amp; <span class="hljs-attr">optimalPrice</span> &gt;</span> 0 ? Math.ceil(optimalPrice * 10000) / 10000 : ''}`,
                position: "right",
                fontSize: 12,
                offset: 10
              }}
            /&gt;
          }
        <span class="hljs-tag">&lt;/<span class="hljs-name">AreaChart</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-name">ResponsiveContainer</span>&gt;</span>

      {optimalPrice &amp;&amp; <span class="hljs-tag">&lt;<span class="hljs-name">p</span>&gt;</span>Optimal Price: $ {Math.ceil(optimalPrice * 10000) / 10000}<span class="hljs-tag">&lt;/<span class="hljs-name">p</span>&gt;</span>}

    <span class="hljs-tag">&lt;/<span class="hljs-name">div</span>&gt;</span></span>
  )
}

<span class="hljs-keyword">export</span> <span class="hljs-keyword">default</span> App
</code></pre>
<h2 id="heading-final-results">Final Results</h2>
<p>Now, the application is ready to serve.</p>
<p>You can explore the UI from <a target="_blank" href="https://kuriko-iwai.vercel.app/online-commerce-intelligence-hub">here</a>.</p>
<p>All the backend code is available in <a target="_blank" href="https://github.com/krik8235/ml-sales-prediction">my GitHub repo</a>.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building a machine learning system requires thoughtful project scoping and architecture design.</p>
<p>In this article, we built a dynamic pricing system with a single, simple interface on a containerized serverless architecture.</p>
<p>Moving forward, we’d need to consider potential drawbacks of this minimal architecture:</p>
<ul>
<li><p><strong>Increase in cold start duration</strong>: The <code>awsgi</code> WSGI adapter layer adds a small overhead, and a larger container image takes longer to load.</p>
</li>
<li><p><strong>Monolithic function:</strong> Adding endpoints to the Lambda function can lead to a monolithic function where an issue in one endpoint impacts others.</p>
</li>
<li><p><strong>Less granular observability</strong>: AWS CloudWatch cannot provide individual invocation/error metrics per API endpoint without custom instrumentation.</p>
</li>
</ul>
<p>To scale the application effectively, extracting functionality into separate microservices can be a good next step.</p>
<p>I’m Kuriko IWAI, and you can find more of my work and learn more about me here:</p>
<p><a target="_blank" href="https://kuriko-iwai.vercel.app/"><strong>Portfolio</strong></a> <strong>/</strong> <a target="_blank" href="https://www.linkedin.com/in/k-i-i/"><strong>LinkedIn</strong></a> <strong>/</strong> <a target="_blank" href="https://github.com/krik8235"><strong>Github</strong></a></p>
<p><em>All images, unless otherwise noted, are by the author. This application uses a synthetic dataset licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.</em></p>
<p><em>This information about AWS is current as of August 2025 and is subject to change.</em></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Deep Reinforcement Learning in Natural Language Understanding ]]>
                </title>
                <description>
                    <![CDATA[ Language is messy, subtle, and full of meaning that shifts with context. Teaching machines to truly understand it is one of the hardest problems in artificial intelligence. That challenge is what natural language understanding (NLU) sets out to solve... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/deep-reinforcement-learning-in-natural-language-understanding/</link>
                <guid isPermaLink="false">689f4b8b1694c0dba616a0d0</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oyedele Tioluwani ]]>
                </dc:creator>
                <pubDate>Fri, 15 Aug 2025 15:00:27 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1755270013761/005fd330-7f59-4753-ba14-8852f4240f3c.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Language is messy, subtle, and full of meaning that shifts with context. Teaching machines to truly understand it is one of the hardest problems in artificial intelligence.</p>
<p>That challenge is what natural language understanding (NLU) sets out to solve. From voice assistants that follow instructions to support systems that interpret user intent, NLU sits at the core of many real-world AI applications.</p>
<p>Most systems today are trained using labeled data and supervised techniques. But there's growing interest in something more adaptive: deep reinforcement learning (DRL). Instead of learning from fixed examples, DRL allows a model to improve through trial, error, and feedback, much like a person learning through experience.</p>
<p>This article looks at where DRL fits into the modern NLU landscape. We'll explore how it's being used to fine-tune responses, guide conversation flow, and align models with human values.</p>
<h3 id="heading-what-well-cover">What we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-overview-of-deep-reinforcement-learning">Overview of Deep Reinforcement Learning</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-natural-language-understanding-nlu">What is Natural Language Understanding (NLU)?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-challenges-in-nlu-and-how-to-address-them">Challenges in NLU and How to Address Them</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-where-drl-adds-value-in-nlu">Where DRL Adds Value in NLU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-modern-architectures-in-nlu-from-bert-to-claude">Modern Architectures in NLU from BERT to Claude</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-the-niche-role-of-drl-in-modern-nlu">The Niche Role of DRL in Modern NLU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-reinforcement-learning-from-human-feedback-rlhf">Reinforcement Learning from Human Feedback (RLHF)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-ecosystem-and-tools-for-drl-in-nlp">Ecosystem and Tools for DRL in NLP</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-hands-on-demo-simulating-drl-feedback-in-nlu">Hands-On Demo: Simulating DRL Feedback in NLU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-case-studies-of-drl-in-nlu">Case Studies of DRL in NLU</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-wrapping-up">Wrapping Up</a></p>
</li>
</ul>
<h2 id="heading-overview-of-deep-reinforcement-learning">Overview of Deep Reinforcement Learning</h2>
<p>Reinforcement learning is a subfield of machine learning inspired by behavioral psychology. An agent learns to maximize its cumulative reward by taking actions in an environment and observing the results.</p>
<p>Traditionally, reinforcement learning techniques have been used to solve simple problems with discrete state and action spaces. But the development of deep learning has opened the door to applying these techniques to more complicated, high-dimensional environments, like computer vision, natural language processing (NLP), and robotics.</p>
<p>DRL uses deep neural networks to approximate complex functions that map observations to actions, allowing agents to learn from raw sensory data. Because deep neural networks represent knowledge in many layers of abstraction, they can capture detailed patterns and relationships in data, which allows for more effective decision-making.</p>
<p>Imagine you’re playing a video game where you’re controlling a character, and your goal is to get the highest score possible. Now, when you first start playing, you might not know the best way to play, right? You might try different things like jumping, running, or shooting, and you see what works and what doesn’t.</p>
<p>DRL lets a computer or robot learn in the same way. The computer, like the player, tries different actions in its environment and receives feedback based on how well it does: a reward when it performs well and a penalty when it fails.</p>
<p>The computer’s job is to figure out the best possible actions to take in different situations to maximize rewards. This is still trial and error, but DRL adds deep neural networks, which can pick up patterns in vast amounts of data and generalize from past experience to situations the agent has not seen before. These networks help the computer make better decisions over time, and it can eventually become very good at the game, sometimes even better than humans.</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*7UeewswDEpqTALIvwkNNAw.png" alt="Deep reinforcement learning approach" width="600" height="400" loading="lazy"></p>
<p><a target="_blank" href="https://www.researchgate.net/publication/333909668_Demand_Response_Management_for_Industrial_Facilities_A_Deep_Reinforcement_Learning_Approach">Image Source</a></p>
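<p>The loop described above (try an action, observe a reward, adjust) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a DRL implementation: the environment is a hypothetical three-option bandit, and a plain lookup table stands in for the deep neural network that a real DRL agent would use. The function name <code>run_bandit</code> and the reward probabilities are made up for this example.</p>
<pre><code class="lang-python">import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy multi-armed bandit (hypothetical environment)."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    values = [0.0] * n_actions  # running estimate of each action's average reward
    counts = [0] * n_actions    # number of times each action has been tried

    for _ in range(steps):
        # Explore a random action with probability epsilon, otherwise exploit.
        explore = rng.choices([True, False], weights=[epsilon, 1 - epsilon])[0]
        if explore:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: values[a])

        # The environment pays a reward of 1 with the action's hidden probability.
        p = reward_probs[action]
        reward = rng.choices([1.0, 0.0], weights=[p, 1 - p])[0]

        # Move the estimate toward the observed reward (incremental average).
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]

    best_action = max(range(n_actions), key=lambda a: values[a])
    return values, best_action

values, best_action = run_bandit([0.2, 0.5, 0.8])
print("Estimated values:", [round(v, 2) for v in values])
print("Best action:", best_action)
</code></pre>
<p>With enough steps, the estimates should approach the hidden reward probabilities, and the agent should come to favor the highest-paying action. In DRL, the <code>values</code> table is replaced by a neural network so the same idea can scale to inputs like images or text.</p>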
<h2 id="heading-what-is-natural-language-understanding-nlu">What is Natural Language Understanding (NLU)?</h2>
<p>NLU is a subfield of artificial intelligence (AI), and its aim is to help computers understand, interpret, and respond to human language in meaningful ways. It involves creating algorithms and models that can process and analyze text to extract meaningful information, determine the intent behind it, and provide appropriate replies.</p>
<p>NLU is a core part of many AI applications, such as chatbots, virtual assistants, and personalized recommendation systems, which require the ability to interpret and respond to human language.</p>
<p>Its key components include:</p>
<ul>
<li><p><strong>Text processing:</strong> NLU systems must be able to process and interpret text, which includes tokenization (splitting it into words or phrases), part-of-speech tagging, and named entity recognition.</p>
</li>
<li><p><strong>Sentiment analysis:</strong> Identifying the sentiment communicated in a piece of text (positive, negative, or neutral) is a common task in NLU.</p>
</li>
<li><p><strong>Intent recognition:</strong> Identifying the goal or objective of a user’s input, such as booking a flight or requesting a weather forecast.</p>
</li>
<li><p><strong>Language generation</strong> (technically part of Natural Language Generation, or NLG): While NLU focuses on understanding text, NLG is about producing coherent, contextually appropriate text. Many AI systems combine both, first interpreting the input through NLU, then generating an appropriate response using NLG.</p>
</li>
<li><p><strong>Entity extraction:</strong> Identifying and categorizing essential details in the text, such as dates, locations, and people.</p>
</li>
</ul>
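<p>To make a few of these components concrete, here is a toy sketch of tokenization, intent recognition, and entity extraction using only Python’s standard library. It is a rule-based illustration, not how production NLU systems work: the keyword sets, the regular expressions, and the function names are all assumptions made up for this example, and real systems learn these mappings from data.</p>
<pre><code class="lang-python">import re

# Hypothetical keyword sets, chosen only for this illustration.
INTENT_KEYWORDS = {
    "book_flight": {"flight", "fly", "ticket"},
    "get_weather": {"weather", "forecast", "rain"},
}

def tokenize(text):
    """Text processing: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def recognize_intent(tokens):
    """Intent recognition: pick the intent whose keywords overlap the tokens most."""
    scores = {name: len(words.intersection(tokens)) for name, words in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else "unknown"

def extract_entities(text):
    """Entity extraction: find ISO dates and capitalized words after a preposition."""
    return {
        "dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
        "locations": re.findall(r"\b(?:to|in|from)\s+([A-Z][a-z]+)", text),
    }

def analyze(text):
    """Run all three steps and collect the results in one dictionary."""
    tokens = tokenize(text)
    return {"tokens": tokens, "intent": recognize_intent(tokens), "entities": extract_entities(text)}

result = analyze("Book me a flight to Paris on 2025-09-01")
print(result["intent"])
print(result["entities"])
</code></pre>
<p>Hand-written rules like these break down quickly on ambiguous or unusual phrasing, which is exactly the kind of limitation the next section discusses.</p>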
<h2 id="heading-challenges-in-nlu-and-how-to-address-them"><strong>Challenges in NLU and How to Address Them</strong></h2>
<p>NLU aims to help machines interpret, understand, and respond to human language in ways that make sense. While it has made great progress, there are still challenges that limit how well it works in practice.</p>
<p>Below are some of these challenges and how Deep Reinforcement Learning (DRL) can play a supportive role. DRL is not a replacement for large-scale pretraining or instruction tuning, but it can complement them by helping models adapt through interaction and feedback.</p>
<h3 id="heading-ambiguity"><strong>Ambiguity</strong></h3>
<p>Naturally, words can have more than one meaning, and a single sentence or phrase might be understood in different ways. This makes it hard for NLU systems to always pinpoint what the speaker or writer intends.</p>
<p>DRL can help reduce ambiguity by allowing models to learn from feedback. If a certain interpretation gets positive results, the model can prioritize it. If not, it can try a different approach. While this does not remove ambiguity entirely, it can improve a model’s ability to make better choices over time, especially when combined with a strong pretrained foundation.</p>
<h3 id="heading-contextual-understanding"><strong>Contextual understanding</strong></h3>
<p>Understanding language often depends on context such as cultural references, sarcasm, or the tone behind certain words. These are straightforward for people but challenging for machines to recognize.</p>
<p>By learning from interaction signals such as whether a user is satisfied with a response, DRL can help a model adapt to context more effectively. However, the core ability to understand context still comes from large-scale pretraining. DRL mainly fine-tunes and adjusts this behavior during use.</p>
<h3 id="heading-language-variation"><strong>Language variation</strong></h3>
<p>Human language comes in many forms including different dialects, slang, colloquialisms, and regional expressions. This variety can challenge NLU systems that have not seen enough examples of these patterns during training.</p>
<p>With DRL, models can adapt to new language styles when exposed to them repeatedly in real-world use. This makes them more flexible and responsive, although their base understanding still relies on the diversity of the data used during pretraining.</p>
<h3 id="heading-scalability"><strong>Scalability</strong></h3>
<p>As text data continues to grow, NLU systems must be able to process large volumes quickly and efficiently, especially in real-time applications such as chatbots and virtual assistants.</p>
<p>DRL can contribute by helping models optimize certain processing steps through trial and feedback. While it will not replace architectural or infrastructure improvements, it can help fine-tune performance for specific high-traffic tasks.</p>
<h3 id="heading-computational-complexity"><strong>Computational complexity</strong></h3>
<p>Training advanced NLU models is resource-intensive, which can be a challenge for mobile devices, edge computing, or other resource-limited environments.</p>
<p>DRL can make the learning process more efficient by reusing past experiences through techniques such as off-policy learning and reward modeling. Combined with smaller, distilled model architectures, this can make it easier to deploy capable NLU systems even with limited computing power.</p>
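<p>One common way off-policy methods reuse past experience is a replay buffer, which stores earlier interactions so they can be sampled again during training. The sketch below is a minimal version under stated assumptions: the class name <code>ReplayBuffer</code>, the tuple layout, and the prompt and response strings are illustrative and not taken from any particular library.</p>
<pre><code class="lang-python">import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions that can be sampled repeatedly."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # the oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        """Record one interaction as a (state, action, reward, next_state) tuple."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Return a random batch, never larger than what the buffer holds."""
        batch_size = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3)
for turn in range(5):
    reward = 1 if turn % 2 == 0 else -1
    buffer.add(f"prompt {turn}", f"response {turn}", reward, f"prompt {turn + 1}")

print(len(buffer))
print(buffer.sample(2))
</code></pre>
<p>Because each stored interaction can be drawn many times, the model gets more learning signal out of the same amount of real-world feedback.</p>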
<h2 id="heading-where-drl-adds-value-in-nlu"><strong>Where DRL Adds Value in NLU</strong></h2>
<p>DRL is not a primary training method for most NLU models. Its main value comes when interaction, feedback, or rewards can be used to improve how a system behaves after it has already been pretrained. When applied selectively, DRL can help refine and personalize model performance in ways that matter for specific use cases.</p>
<p>Here are some areas where DRL has shown potential:</p>
<ol>
<li><p><strong>Dialogue systems</strong><br> DRL can help chatbots and virtual assistants manage conversations more smoothly. It can be used to refine turn-taking, handle vague questions better, or adjust responses to improve user satisfaction during longer conversations.</p>
</li>
<li><p><strong>Text summarization</strong><br> Most summarization models rely on supervised learning. DRL can be added as a fine-tuning step to focus on factors such as relevance or fluency, especially when custom reward signals are linked to specific goals or user preferences.</p>
</li>
<li><p><strong>Response generation and language modeling</strong><br> DRL can guide language generation toward outputs that are more useful, aligned with user intent, or better suited to certain tone and safety requirements.</p>
</li>
<li><p><strong>Reward-based optimization in parsing or classification</strong><br> In certain cases, DRL has been used to improve outputs based on downstream objectives such as increasing label confidence or enhancing the quality of supporting explanations, alongside accuracy.</p>
</li>
<li><p><strong>Interactive machine translation</strong><br> DRL can help translation systems adapt over time by learning from reinforcement signals like human corrections or post-editing feedback, leading to gradual improvements in quality.</p>
</li>
</ol>
<p>In short, DRL works best as a targeted enhancement. It is not used to build general-purpose NLU systems from scratch, but it can make existing systems more adaptable, aligned, and responsive when feedback loops are part of the application.</p>
<h2 id="heading-modern-architectures-in-nlu-from-bert-to-claude"><strong>Modern Architectures in NLU from BERT to Claude</strong></h2>
<p>Early NLU systems used Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), but most modern systems use transformers.</p>
<p>These models use a mechanism called self-attention to capture long-range dependencies. <strong>Self-attention</strong> allows each word to “attend” to every other word in the input, assigning weights that determine relevance for understanding the current word. <strong>Long-range dependencies</strong> occur when the meaning of one word depends on another far away in the text (like linking “he” to “the president” from earlier sentences). This helps maintain context over large spans of text.</p>
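<p>The weighting described above can be sketched as scaled dot-product attention in plain Python. This is a simplified illustration: real transformers first pass each token through learned query, key, and value projections and use several attention heads, whereas here the token vectors are used directly. The function names and the three example vectors are assumptions made up for this sketch.</p>
<pre><code class="lang-python">import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with the inputs used as queries, keys and values."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score the current token against every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted mix of all token vectors.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors: the first two are similar, the third is different.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
</code></pre>
<p>Each printed row shows how much one token attends to every token in the sequence. The first token gives more weight to the second (similar) token than to the third, which is the mechanism that lets a model link related words regardless of how far apart they are.</p>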
<p>Here’s how the main types of transformer models are used today:</p>
<h3 id="heading-encoder-only-models">Encoder-only models</h3>
<p>Examples: BERT, RoBERTa, ALBERT, DeBERTa</p>
<p>These models process text input and create rich contextual representations without generating new text. They are excellent for classification, entity extraction, and tasks that require understanding rather than producing language. The encoder reads the whole input and encodes it into a vector representation, which is then used by a task-specific head for predictions.  </p>
<p>They're often fine-tuned for specific tasks and perform especially well in structured language understanding.</p>
<h3 id="heading-encoder-decoder-models">Encoder-decoder models</h3>
<p>Examples: T5, FLAN-T5</p>
<p>These models have two components: an encoder that reads and encodes the input text, and a decoder that generates an output sequence based on that encoded representation. They are ideal for sequence-to-sequence tasks such as summarization, translation, and instruction following. The encoder captures the meaning of the input, while the decoder produces coherent output in the target form.  </p>
<p>They’re flexible and particularly useful in multi-task learning setups.</p>
<h3 id="heading-decoder-only-models">Decoder-only models</h3>
<p>Examples: GPT-4, Claude 3, Gemini</p>
<p>These models generate text one token at a time, predicting the next token based on all previous tokens in the sequence. They excel in open-ended text generation, creative writing, and reasoning tasks. Because they are trained to predict the next word given any context, they can perform many tasks simply by being prompted, without additional training.  </p>
<p>They’re typically aligned with human preferences using techniques like Reinforcement Learning from Human Feedback (RLHF).</p>
<p>These models are now widely used in real-world applications, such as chatbots, enterprise tools, and multilingual digital assistants, and many can handle new tasks with just a prompt, requiring no additional training.</p>
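<p>The token-by-token generation used by decoder-only models can be sketched with a toy lookup table. This is a deliberate simplification: a real decoder conditions on all previous tokens and computes probabilities with a neural network, while this hypothetical table (<code>NEXT_TOKEN_PROBS</code>) only looks at the last token. The vocabulary and probabilities are invented for the example.</p>
<pre><code class="lang-python"># Hypothetical next-token probabilities standing in for a trained model.
NEXT_TOKEN_PROBS = {
    "START": {"the": 0.6, "a": 0.4},
    "the": {"model": 0.7, "prompt": 0.3},
    "a": {"model": 0.5, "prompt": 0.5},
    "model": {"answers": 0.8, "END": 0.2},
    "prompt": {"END": 1.0},
    "answers": {"END": 1.0},
}

def generate(max_tokens=10):
    """Greedy decoding: repeatedly append the most likely next token."""
    tokens = ["START"]
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(candidates, key=candidates.get)
        if next_token == "END":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the START marker

print(" ".join(generate()))
</code></pre>
<p>Greedy decoding always picks the single most likely token. Production systems usually sample from the distribution instead (controlled by settings such as temperature), which is why the same prompt can produce different responses.</p>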
<h2 id="heading-the-niche-role-of-drl-in-modern-nlu"><strong>The Niche Role of DRL in Modern NLU</strong></h2>
<p>DRL is not a general-purpose solution for most NLU challenges, such as handling ambiguity or understanding context. These problems are typically addressed using large-scale pretraining and supervised or instruction-based fine-tuning.</p>
<p>That said, DRL still plays a valuable role in specific areas where feedback and long-term optimization are useful. It is commonly applied in:</p>
<ul>
<li><p><strong>Improving dialogue strategy:</strong> DRL helps conversational agents manage turn-taking, adjust tone, and adapt to user preferences across multiple interactions.</p>
</li>
<li><p><strong>Aligning model behavior using RLHF:</strong> Reinforcement learning from human feedback (RLHF – more on this below) uses DRL to train models that respond in ways people find more helpful, safe, or contextually appropriate.</p>
</li>
<li><p><strong>Reward modeling for alignment and safety:</strong> DRL enables the training of reward models that guide language systems toward ethical, culturally aware, or domain-specific behavior.</p>
</li>
</ul>
<p>Looking ahead, DRL is likely to grow in importance for applications that involve real-time interaction, long-horizon reasoning, or agent-driven workflows. For now, it serves as a targeted enhancement alongside more widely used training methods.</p>
<h3 id="heading-reinforcement-learning-from-human-feedback-rlhf">Reinforcement Learning from Human Feedback (RLHF)</h3>
<p>Let’s talk a bit more about RLHF, as it’s pretty important here. It’s also currently the primary way DRL is applied in large-scale language models such as GPT‑4, Claude, and Gemini.  </p>
<p>It works in three main steps:</p>
<ol>
<li><p><strong>Reward model training</strong> – Human annotators rank model outputs for the same prompt. These rankings are used to train a reward model that scores outputs based on how helpful, safe, or relevant they are.</p>
</li>
<li><p><strong>Policy optimization</strong> – Using algorithms such as PPO (Proximal Policy Optimization), the base language model is fine-tuned to maximize the reward model’s score.</p>
</li>
<li><p><strong>Iteration and safety</strong> – RLHF loops are often combined with safety-focused reward modeling, constitutional AI (following explicit guidelines for safe behavior), refusal strategies for harmful requests, and red‑teaming to probe weaknesses.</p>
</li>
</ol>
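<p>The first step, training a reward model from human rankings, can be sketched with a pairwise (Bradley-Terry style) loss: the model is nudged so that the preferred response scores higher than the rejected one. This is a minimal sketch under stated assumptions. The hand-made <code>features</code> function, the example pairs, and the hyperparameters are all hypothetical, and a real reward model is a neural network rather than three weights.</p>
<pre><code class="lang-python">import math

def features(text):
    """Hypothetical hand-made features; a real reward model learns its own."""
    words = text.lower().split()
    return [
        1.0 if "sorry" in words else 0.0,    # apologetic filler
        1.0 if "because" in words else 0.0,  # gives a reason
        min(len(words), 20) / 20.0,          # response length, capped at 20 words
    ]

def score(weights, text):
    """Reward assigned to a response: a weighted sum of its features."""
    return sum(w * x for w, x in zip(weights, features(text)))

def train_reward_model(pairs, epochs=200, lr=0.5):
    """Fit weights so that each preferred response outscores the rejected one."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = score(weights, chosen) - score(weights, rejected)
            # Gradient step on the loss -log(sigmoid(diff)).
            push = 1.0 - 1.0 / (1.0 + math.exp(-diff))
            f_chosen, f_rejected = features(chosen), features(rejected)
            weights = [w + lr * push * (c - r) for w, c, r in zip(weights, f_chosen, f_rejected)]
    return weights

# Each pair is (response the annotator preferred, response the annotator rejected).
PAIRS = [
    ("Use a list because order matters here", "sorry no idea"),
    ("It fails because the key is missing", "sorry I cannot say"),
]

weights = train_reward_model(PAIRS)
for chosen, rejected in PAIRS:
    print(round(score(weights, chosen), 2), round(score(weights, rejected), 2))
</code></pre>
<p>In the second step, an algorithm such as PPO would use scores like these as the reward signal when fine-tuning the language model. That part is not shown here.</p>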
<p>Data‑efficient variants are increasingly common, such as offline RL, replay buffers, and leveraging implicit feedback like click‑through logs.</p>
<p>In practice, RLHF has significantly improved the ability of models to follow instructions, avoid harmful outputs, and align with human values.</p>
<h2 id="heading-ecosystem-and-tools-for-drl-in-nlp"><strong>Ecosystem and Tools for DRL in NLP</strong></h2>
<p>If you're looking to explore DRL in NLU, you don't have to start from scratch. There’s a solid ecosystem of tools that make it easier to test ideas, build prototypes, and fine-tune models using rewards and feedback.</p>
<p>Here are a few go-to libraries:</p>
<ol>
<li><p><code>trl</code> by Hugging Face: A lightweight framework built specifically for applying reinforcement learning to transformer models. It's widely used for RLHF, reward modeling, and steering model outputs based on human preferences.</p>
</li>
<li><p>Stable-Baselines3: A simple, well-documented library for classic DRL algorithms like PPO, A2C, and DQN. It’s great for testing DRL setups in smaller or custom environments.</p>
</li>
<li><p>RLlib (part of Ray): Designed for scaling up. If you're working on distributed training or combining DRL with larger pipelines, RLlib helps manage the complexity.</p>
</li>
</ol>
<p>These libraries pair well with open-source large language models like LLaMA, Mistral, Gemma, and Command R+. Together, they give you everything you need to experiment with DRL-backed language systems, whether you're tuning responses in a chatbot or building a reward model for alignment.</p>
<h2 id="heading-hands-on-demo-simulating-drl-feedback-in-nlu">Hands-On Demo: Simulating DRL Feedback in NLU</h2>
<p>You don’t need a full reinforcement learning pipeline to understand reward signals. This notebook demonstrates how you can simulate <strong>preference-based feedback</strong> using GPT-3.5. Users interact with the model, provide binary feedback (good or bad), and the system logs each interaction with a corresponding reward. It mirrors the principles behind techniques like RLHF.</p>
<h3 id="heading-setup-and-authentication">Setup and Authentication</h3>
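<p>First, you’ll need to install the required packages and set up your API key. The code in this demo uses the legacy <code>openai.ChatCompletion</code> interface, which was removed in version 1.0 of the <code>openai</code> package, so pin an earlier release:</p>
<pre><code class="lang-python">pip install openai==0.28 ipywidgets pandas matplotlib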
</code></pre>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> openai
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> ipywidgets <span class="hljs-keyword">as</span> widgets
<span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> display, Markdown, clear_output
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

<span class="hljs-comment"># Load your OpenAI API key</span>
openai.api_key = os.getenv(<span class="hljs-string">"OPENAI_API_KEY"</span>) <span class="hljs-keyword">or</span> input(<span class="hljs-string">"Enter your OpenAI API key: "</span>)
</code></pre>
<p><strong>What this does</strong>:</p>
<ul>
<li><p>Installs and loads required libraries</p>
</li>
<li><p>Reads your OpenAI key from an environment variable or prompts for it interactively</p>
</li>
</ul>
<h3 id="heading-step-1-generate-a-gpt-35-response">Step 1: Generate a GPT-3.5 Response</h3>
<p>Now, try sending a prompt and seeing what response you get:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_gpt_response</span>(<span class="hljs-params">prompt</span>):</span>
    <span class="hljs-keyword">try</span>:
        response = openai.ChatCompletion.create(
            model=<span class="hljs-string">"gpt-3.5-turbo"</span>,
            messages=[{<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: prompt}],
            temperature=<span class="hljs-number">0.7</span>
        )
        <span class="hljs-keyword">return</span> response[<span class="hljs-string">'choices'</span>][<span class="hljs-number">0</span>][<span class="hljs-string">'message'</span>][<span class="hljs-string">'content'</span>].strip()
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"Error: <span class="hljs-subst">{e}</span>"</span>
</code></pre>
<p><strong>What this does</strong>:</p>
<ul>
<li><p>Uses OpenAI’s GPT-3.5 to generate a response</p>
</li>
<li><p>Handles errors if the API call fails</p>
</li>
</ul>
<h3 id="heading-step-2-store-feedback-history">Step 2: Store Feedback History</h3>
<p>You can now track user responses and simulated reward signals like this:</p>
<pre><code class="lang-python">history = []
</code></pre>
<p>This code initializes a list to store logs of each interaction.</p>
<h3 id="heading-step-3-run-feedback-interaction">Step 3: Run Feedback Interaction</h3>
<p>Now you can capture the prompt, display the response, and accept feedback.</p>
<pre><code class="lang-python"><span class="hljs-comment">#  Main interaction logic</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_interaction</span>(<span class="hljs-params">prompt</span>):</span>
    clear_output(wait=<span class="hljs-literal">True</span>)
    response = get_gpt_response(prompt)
    display(Markdown(<span class="hljs-string">f"### Prompt\n`<span class="hljs-subst">{prompt}</span>`"</span>))
    display(Markdown(<span class="hljs-string">f"### GPT-3.5 Response\n&gt; <span class="hljs-subst">{response}</span>"</span>))

    <span class="hljs-comment"># Feedback buttons</span>
    good_btn = widgets.Button(description=<span class="hljs-string">"👍 Good"</span>, button_style=<span class="hljs-string">'success'</span>)
    bad_btn = widgets.Button(description=<span class="hljs-string">"👎 Bad"</span>, button_style=<span class="hljs-string">'danger'</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_feedback</span>(<span class="hljs-params">feedback</span>):</span>
        reward = <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> feedback == <span class="hljs-string">'good'</span> <span class="hljs-keyword">else</span> <span class="hljs-number">-1</span>
        history.append({
            <span class="hljs-string">"prompt"</span>: prompt,
            <span class="hljs-string">"response"</span>: response,
            <span class="hljs-string">"feedback"</span>: feedback,
            <span class="hljs-string">"reward"</span>: reward
        })
        display(Markdown(
            <span class="hljs-string">f"**Feedback Recorded:** `<span class="hljs-subst">{feedback}</span>` — Reward = `<span class="hljs-subst">{reward}</span>`"</span>
        ))
        display(Markdown(<span class="hljs-string">"---"</span>))
        display(Markdown(<span class="hljs-string">"### Reward History"</span>))
        df = pd.DataFrame(history)
        display(df.tail(<span class="hljs-number">5</span>))
        plot_rewards()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_good</span>(<span class="hljs-params">_</span>):</span> on_feedback(<span class="hljs-string">'good'</span>)
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_bad</span>(<span class="hljs-params">_</span>):</span> on_feedback(<span class="hljs-string">'bad'</span>)

    display(widgets.HBox([good_btn, bad_btn]))
    good_btn.on_click(on_good)
    bad_btn.on_click(on_bad)
</code></pre>
<p><strong>What this does</strong>:</p>
<ul>
<li><p>Shows GPT-3.5’s response to the user’s prompt</p>
</li>
<li><p>Displays feedback buttons</p>
</li>
<li><p>Logs reward and shows feedback history</p>
</li>
</ul>
<h3 id="heading-step-4-plot-reward-history">Step 4: Plot Reward History</h3>
<p>You can also visualize reward trends:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">plot_rewards</span>():</span>
    df = pd.DataFrame(history)
    plt.figure(figsize=(<span class="hljs-number">6</span>,<span class="hljs-number">3</span>))
    plt.plot(df[<span class="hljs-string">'reward'</span>], marker=<span class="hljs-string">'o'</span>)
    plt.title(<span class="hljs-string">"Reward Over Time"</span>)
    plt.xlabel(<span class="hljs-string">"Interaction"</span>)
    plt.ylabel(<span class="hljs-string">"Reward"</span>)
    plt.grid(<span class="hljs-literal">True</span>)
    plt.show()
</code></pre>
<p>This plots the user’s reward signals over time to simulate policy shaping.</p>
<h3 id="heading-step-5-build-input-interface">Step 5: Build Input Interface</h3>
<p>You can also allow users to type and submit prompts.</p>
<pre><code class="lang-python">prompt_input = widgets.Textarea(
    placeholder=<span class="hljs-string">"Ask something..."</span>,
    description=<span class="hljs-string">"Prompt:"</span>,
    layout=widgets.Layout(width=<span class="hljs-string">'100%'</span>, height=<span class="hljs-string">'80px'</span>),
    style={<span class="hljs-string">'description_width'</span>: <span class="hljs-string">'initial'</span>}
)

generate_btn = widgets.Button(
    description=<span class="hljs-string">"Generate Response"</span>, button_style=<span class="hljs-string">'primary'</span>
)

output_area = widgets.Output()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">on_generate_click</span>(<span class="hljs-params">_</span>):</span>
    <span class="hljs-keyword">with</span> output_area:
        run_interaction(prompt_input.value)

generate_btn.on_click(on_generate_click)

display(prompt_input)
display(generate_btn)
display(output_area)
</code></pre>
<p>This sets up a simple form to collect prompts and connects the generate button to the main interaction logic.</p>
<p>This gives the output:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753736920176/35079f63-2ca0-4bd4-aea6-3de3589b0c9f.png" alt="Demo result" class="image--center mx-auto" width="2492" height="751" loading="lazy"></p>
<p>This demo captures the fundamentals of preference-based learning using GPT-3.5. It doesn’t update model weights but shows how feedback can be structured as a reward signal. This is the foundation of reinforcement learning in modern LLM pipelines.</p>
<p><strong>Note:</strong> This demo only logs feedback. In true RLHF, a second phase fine-tunes the model weights based on it.</p>
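<p>Before any fine-tuning happens, logged feedback like this is usually aggregated in some way. As a minimal sketch (the <code>summarize_feedback</code> helper and the sample entries are hypothetical and not part of the demo), this is one way to summarize records that have the same shape as the <code>history</code> list:</p>
<pre><code class="lang-python"># Hypothetical log entries in the same shape as the demo's `history` list
sample_history = [
    {"prompt": "Explain RLHF", "response": "...", "feedback": "good", "reward": 1},
    {"prompt": "Explain PPO", "response": "...", "feedback": "bad", "reward": -1},
    {"prompt": "Define a reward model", "response": "...", "feedback": "good", "reward": 1},
]

def summarize_feedback(log):
    """Return the interaction count, mean reward, and share of 'good' votes."""
    n = len(log)
    if n == 0:
        return {"n": 0, "mean_reward": 0.0, "good_share": 0.0}
    total = sum(entry["reward"] for entry in log)
    good = sum(1 for entry in log if entry["feedback"] == "good")
    return {"n": n, "mean_reward": total / n, "good_share": good / n}

summary = summarize_feedback(sample_history)
print(summary)
</code></pre>
<p>A summary like this only describes the collected signal. It does not change the model in any way.</p>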
<p>A real-world example of this is <a target="_blank" href="https://openai.com/index/instruction-following/"><strong>InstructGPT</strong></a>. This is a version of OpenAI’s GPT models that’s trained to follow instructions written by people. Instead of just predicting the next word, it tries to really figure out and then do what you’ve asked, the way you asked it.</p>
<p>In the InstructGPT paper, human labelers preferred outputs from the 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3, even though it has over 100× fewer parameters. At equal size (175B), InstructGPT outputs were preferred about <strong>85%</strong> of the time. One of the key reasons was that it uses RLHF. This made it safer, more truthful, and better at following complex instructions, showing how reward signals like the one simulated here can greatly improve real-world model performance.</p>
<h2 id="heading-case-studies-of-drl-in-nlu">Case Studies of DRL in NLU</h2>
<p>While DRL is not the default approach for most NLU tasks, it has shown promising results in targeted use cases, especially where learning from interaction or adapting over time adds value. Below are a few examples that illustrate how DRL can enhance language understanding in practice:</p>
<h3 id="heading-1-welocalize-amp-global-e-commerce-giant-drl-powered-multilingual-nlu">1. Welocalize &amp; Global E-Commerce Giant – DRL-Powered Multilingual NLU</h3>
<p>A global e-commerce platform partnered with Welocalize to <a target="_blank" href="https://www.welocalize.com/insights/case-study-transforming-global-customer-interactions-with-nlu/">launch a DRL-powered multilingual NLU system</a> capable of interpreting customer intent across 30+ languages and domains. This system used reinforcement learning to adapt to cultural nuances and refine predictions through user interaction. The project delivered over 13 million high-quality utterances to support culturally adaptive, accurate customer support and product recommendations.</p>
<h3 id="heading-2-reinforcement-learning-with-label-sensitive-reward-acl-2024">2. Reinforcement Learning with Label-Sensitive Reward (ACL 2024)</h3>
<p>Researchers introduced a framework called <a target="_blank" href="https://aclanthology.org/anthology-files/pdf/acl/2024.acl-long.231.pdf">RLLR (Reinforcement Learning with Label-Sensitive Reward)</a> to improve NLU tasks like sentiment classification, topic labeling, and intent detection. By incorporating label-sensitive reward signals and optimizing via Proximal Policy Optimization (PPO), the model aligned its predictions with both rationale quality and true label accuracy.</p>
<p>These examples show how DRL, when paired with specific feedback signals or interactive goals, can be a useful layer on top of traditional NLU systems. Though still niche, the approach continues to evolve through research and industry experimentation.</p>
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>The integration of DRL with NLU has shown promising results in niche but growing areas. Adaptive learning through various interactions and feedback allows DRL to enhance NLU models’ ability to handle ambiguity, context, and linguistic differences. </p>
<p>As research progresses, the link between DRL and NLU is expected to drive advancements in AI-powered language applications, making them more efficient, scalable, and context-aware.</p>
<p>I hope this was helpful!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code ]]>
                </title>
                <description>
                    <![CDATA[ The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design. In this tutorial, I’ll show you how to build both single layer and multi-layer perceptrons (MLPs) across three frameworks: Custom class... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/build-a-multilayer-perceptron-with-examples-and-python-code/</link>
                <guid isPermaLink="false">6839f729798ea464918cffe8</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ neural networks ]]>
                    </category>
                
                    <category>
                        <![CDATA[ binary classification ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MLP (Multi-Layer Perceptrons) ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kuriko ]]>
                </dc:creator>
                <pubDate>Fri, 30 May 2025 18:21:29 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748616370600/01903917-4be7-476b-90d1-18295d19edef.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The <strong>perceptron</strong> is a fundamental concept in deep learning, with many algorithms stemming from its original design.</p>
<p>In this tutorial, I’ll show you how to build both single layer and multi-layer perceptrons (MLPs) across three frameworks:</p>
<ul>
<li><p>Custom classifier</p>
</li>
<li><p>Scikit-learn’s MLPClassifier</p>
</li>
<li><p>Keras Sequential classifier using SGD and Adam optimizers.</p>
</li>
</ul>
<p>This will help you learn about their various use cases and how they work.</p>
<h3 id="heading-table-of-contents">Table of Contents</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-is-a-perceptron">What is a Perceptron?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-a-single-layered-classifier">How to Build a Single-Layered Classifier</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-a-multi-layer-perceptron">What is a Multi-Layer Perceptron?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-multi-layered-perceptrons">How to Build Multi-Layered Perceptrons</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-understanding-optimizers">Understanding Optimizers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-an-mlp-classifier-with-sgd-optimizer">How to Build an MLP Classifier with SGD Optimizer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-build-an-mlp-classifier-with-adam-optimizer">How to Build an MLP Classifier with Adam Optimizer</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-final-results-generalization">Final Results: Generalization</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ul>
<li><p>Mathematics (Calculus, Linear Algebra, Statistics)</p>
</li>
<li><p>Coding in Python</p>
</li>
<li><p>Basic understanding of Machine Learning concepts</p>
</li>
</ul>
<h2 id="heading-what-is-a-perceptron">What is a Perceptron?</h2>
<p>A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.</p>
<p>A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.</p>
<p>But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.</p>
<p>The perceptron consists of four main parts:</p>
<ul>
<li><p><strong>Input layer</strong>: Takes the initial numerical values into the system for further processing.</p>
</li>
<li><p><strong>Weights</strong>: Combines input values with weights (and bias terms).</p>
</li>
<li><p><strong>Activation function</strong>: Determines whether the neuron should fire based on the threshold value.</p>
</li>
<li><p><strong>Output layer</strong>: Produces classification result.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748438698612/5b2920db-4ec1-455b-840e-7b5e9d6c2e75.png" alt="Image: Organization of a perceptron. Source: Rosenblatt 1958" class="image--center mx-auto" width="2100" height="746" loading="lazy"></p>
<p>It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.</p>
<p>So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.</p>
<h3 id="heading-applications-of-perceptrons">Applications of Perceptrons</h3>
<p>Perceptrons are applied to tasks such as:</p>
<ul>
<li><p><strong>Image classification:</strong> Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.</p>
</li>
<li><p><strong>Linear regression:</strong> With a linear (identity) activation in place of the step function, a perceptron can predict continuous outputs from input features, which makes it usable for linear regression problems.</p>
</li>
</ul>
<h3 id="heading-how-the-activation-function-works">How the Activation Function Works</h3>
<p>For a single perceptron used for binary classification, the most common activation function is the <strong>step function</strong> (also known as the threshold function):</p>
<p>$$\phi(z) = \begin{cases} 1 &amp;\text{if } z \geq \theta \\ \\ 0 &amp;\text{if } z &lt; \theta \end{cases}$$</p><p>where:</p>
<ul>
<li><p><code>ϕ(z)</code>: the output of the activation function.</p>
</li>
<li><p><code>z</code>: the weighted sum of the inputs plus the bias:</p>
</li>
</ul>
<p>$$z = \sum_{i=1}^m w_i x_i + b$$</p><p>(<code>x_i</code>: input values, <code>w_i</code>: weight associated with each input, <code>b</code>: bias term)</p>
<p><code>θ</code> is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.</p>
<p>In that case, the formula becomes:</p>
<p>$$\phi(z) = \begin{cases} 1 &amp;\text{if } z \geq 0 \\ \\ 0 &amp;\text{if } z &lt; 0 \end{cases}$$</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748439460839/e74f1c1c-4e89-419b-aa9e-24a297d81ff5.png" alt="Image: Step Function (Author)" class="image--center mx-auto" width="1526" height="410" loading="lazy"></p>
<p>When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.</p>
<p>This occurs <strong>when the weighted sum reaches or exceeds the threshold (zero here),</strong> leading the perceptron to predict that the input belongs to this class.</p>
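<p>To make the formula concrete, here is a minimal NumPy sketch of one perceptron decision using the z ≥ θ convention from the formula. The weights, bias, and input are hypothetical values chosen only for illustration:</p>
<pre><code class="lang-python">import numpy as np

def step(z, theta=0.0):
    """Return 1 where z reaches the threshold theta, else 0."""
    return np.where(np.greater_equal(z, theta), 1, 0)

# Hypothetical weights, bias, and input chosen only for illustration
w = np.array([0.5, -0.2])
b = 0.1
x = np.array([1.0, 2.0])

z = np.dot(w, x) + b   # 0.5*1.0 - 0.2*2.0 + 0.1, roughly 0.2
print(step(z))         # 1, because z is above the threshold of 0
</code></pre>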
<p>While the step function is conceptually the original activation for a perceptron, its discontinuity at zero causes computational challenges.</p>
<p>In modern implementations, we can use other activation functions like the <strong>sigmoid</strong> function:</p>
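<p>$$\sigma (z) = \frac {1} {1 + e^{-z}}$$</p><p>Unlike the step function, the sigmoid outputs a continuous value between zero and one that depends on the weighted sum (z). Thresholding that value (for example, at 0.5) yields the binary class label.</p>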
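<p>As a quick illustration, the sketch below evaluates the sigmoid on a few hypothetical weighted sums and turns the resulting probabilities into labels with a 0.5 cutoff (the cutoff is a common convention, assumed here):</p>
<pre><code class="lang-python">import numpy as np

def sigmoid(z):
    """Map any real-valued weighted sum to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

z_values = np.array([-4.0, 0.0, 4.0])              # hypothetical weighted sums
probs = sigmoid(z_values)                          # roughly [0.018, 0.5, 0.982]
labels = np.greater_equal(probs, 0.5).astype(int)  # apply the 0.5 cutoff

print(probs)
print(labels)
</code></pre>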
<h3 id="heading-how-the-loss-function-works">How the Loss Function Works</h3>
<p>The <strong>loss function</strong> is a crucial concept in machine learning that quantifies the error or discrepancy between the model's predictions and the actual target values.</p>
<p>Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model's parameters in a way that minimizes this error and improves performance.</p>
<p>In a binary classification task, the model may adopt the <strong>hinge loss function</strong> to penalize misclassifications by incurring an additional cost for incorrect predictions:</p>
<p>$$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$</p><p>(h(x): the model’s raw prediction score, y: true label encoded as −1 or +1)</p>
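<p>Here is a small sketch of the hinge loss in NumPy. It assumes the usual convention that labels are encoded as −1 or +1 and that h(x) is a raw score rather than a 0/1 label. The sample values are hypothetical:</p>
<pre><code class="lang-python">import numpy as np

def hinge_loss(y_true, scores):
    """Mean hinge loss; y_true holds -1 or +1, scores are raw model outputs."""
    return float(np.mean(np.maximum(0.0, 1.0 - y_true * scores)))

y_true = np.array([1, 1, -1])
scores = np.array([2.0, 0.3, 0.3])

# Per-sample losses: 0.0 (correct, outside the margin),
# 0.7 (correct, but inside the margin), 1.3 (wrong side of the boundary)
print(hinge_loss(y_true, scores))
</code></pre>
<p>Note that a correct prediction can still incur a loss when its score falls inside the margin, which is what pushes the model toward more confident decisions.</p>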
<h2 id="heading-how-to-build-a-single-layered-classifier">How to Build a Single-Layered Classifier</h2>
<p>Now, let’s build a simple single-layer perceptron for binary classification.</p>
<h3 id="heading-1-custom-classifier">1. Custom Classifier</h3>
<h4 id="heading-initialize-the-classifier">Initialize the classifier</h4>
<p>We’ll first initialize the classifier with <code>weights</code>, <code>bias</code>, the number of epochs (<code>n_iterations</code>), and the <code>learning_rate</code>.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
    self.learning_rate = learning_rate
    self.n_iterations = n_iterations
    self.weights = <span class="hljs-literal">None</span>
    self.bias = <span class="hljs-literal">None</span>
</code></pre>
<h4 id="heading-define-the-activation-function">Define the activation function</h4>
<p>Use a step function that returns zero if input (x) ≤ 0, else 1. By default, the <code>threshold</code> is set to zero.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
    <span class="hljs-keyword">return</span> np.where(x &gt; threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
</code></pre>
<h4 id="heading-train-the-model">Train the model</h4>
<p>Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: <code>weights</code> and <code>bias</code>.</p>
<p>This process is controlled by a specified number of training epochs defined by <code>n_iterations</code>.</p>
<p>In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined <code>learning_rate</code>.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
    n_samples, n_features = X.shape

    self.weights = np.zeros(n_features)
    self.bias = <span class="hljs-number">0</span>

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
            <span class="hljs-comment"># compute weighted sum (z)</span>
            z = np.dot(X[i], self.weights) + self.bias

            <span class="hljs-comment"># apply the activation function</span>
            y_pred = self._step_function(z)

            <span class="hljs-comment"># update weights and bias</span>
            self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
            self.bias += self.learning_rate * (y[i] - y_pred)
</code></pre>
<h4 id="heading-how-the-weights-work-in-the-iteration-loop">How the weights work in the iteration loop</h4>
<p>The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.</p>
<p>Their iterative update in the <code>for</code> loop aims to reduce classification errors such that:</p>
<p>$$\begin{align*} w_j &amp;:= w_j + \Delta w_j \\ &amp;:= w_j + \eta (y_i - \hat y_i)x_{ij} \\ &amp;= \begin{cases} w_j &amp;\text{(a) } y_i - \hat y_i = 0\\ w_j + \eta x_{ij} &amp;\text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} &amp;\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$</p><p>(<code>w_j</code>: j-th weight, <code>η</code>: learning rate, <code>y_i − ŷ_i</code>: error, <code>x_ij</code>: j-th feature of the i-th sample)</p>
<p>This means that:</p>
<ol>
<li><p>When the prediction is <strong>correct</strong>, the error is zero, so the weight is unchanged.</p>
</li>
<li><p>When the prediction is <strong>too low</strong> (y_i = 1 and ŷ_i = 0), the weight moves in the direction of the input, which increases the weighted sum.</p>
</li>
<li><p>When the prediction is <strong>too high</strong> (y_i = 0 and ŷ_i = 1), the weight moves in the opposite direction of the input, which pulls the weighted sum lower.</p>
</li>
</ol>
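<p>The three cases can be traced with a single update step. The sketch below uses one hypothetical sample and follows the same update rule and strict <code>x &gt; threshold</code> step function as the <code>fit</code> method:</p>
<pre><code class="lang-python">import numpy as np

eta = 0.1                      # learning rate
w = np.zeros(2)                # weights start at zero, as in fit()
b = 0.0
x_i = np.array([2.0, 1.0])     # one hypothetical training sample
y_i = 1                        # its true label

z = np.dot(x_i, w) + b                 # weighted sum: 0.0
y_hat = int(np.greater(z, 0))          # step function gives 0, since z is not above 0
error = y_i - y_hat                    # 1, so case (b) applies

w = w + eta * error * x_i              # [0.2, 0.1]
b = b + eta * error                    # 0.1
print(w, b)
</code></pre>
<p>After this step, the same sample produces a weighted sum of 0.2 · 2 + 0.1 · 1 + 0.1 = 0.6, so it would now be classified as one.</p>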
<h4 id="heading-how-the-bias-terms-work-in-the-iteration-loop">How the bias terms work in the iteration loop</h4>
<p>The bias determines the decision boundary’s intercept (position from the origin).</p>
<p>Similar to weights, we adjust the bias terms in each epoch to position the decision boundary:</p>
<p>$$\begin {align*} b &amp;:= b + \Delta b \\ &amp; := b + \eta (y_i - \hat y_i) \\ &amp;= \begin{cases} b &amp;\text{(a) } y_i - \hat y_i = 0\\ b + \eta &amp;\text{(b) } y_i - \hat y_i = 1 \\ b - \eta &amp;\text{(c) } y_i - \hat y_i = -1 \\ \end{cases} \end{align*}$$</p><p>This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.</p>
<h4 id="heading-make-a-prediction">Make a prediction</h4>
<p>Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
    linear_output = np.dot(X, self.weights) + self.bias
    predictions = self._step_function(linear_output)
    <span class="hljs-keyword">return</span> predictions
</code></pre>
<p><strong>The entire classifier looks like this:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Perceptron</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = <span class="hljs-literal">None</span>
        self.bias = <span class="hljs-literal">None</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
        <span class="hljs-keyword">return</span> np.where(x &gt; threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
                linear_output = np.dot(X[i], self.weights) + self.bias
                y_pred = self._step_function(linear_output)
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
        <span class="hljs-keyword">return</span> self

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
        linear_output = np.dot(X, self.weights) + self.bias
        y_pred = self._step_function(linear_output)
        <span class="hljs-keyword">return</span> y_pred
</code></pre>
<h4 id="heading-simulate-with-synthetic-datasets">Simulate with synthetic datasets</h4>
<p>First, we generate a synthetic linearly separable dataset using <code>make_blobs</code>, then train the classifier we created and compute its decision boundary.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> make_blobs
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># create a mock dataset</span>
X, y = make_blobs(n_features=<span class="hljs-number">2</span>, centers=<span class="hljs-number">2</span>, n_samples=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">12</span>)

<span class="hljs-comment"># split</span>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># train the model</span>
perceptron = Perceptron(learning_rate=<span class="hljs-number">0.1</span>, n_iterations=<span class="hljs-number">1000</span>).fit(X_train, y_train)

<span class="hljs-comment"># make a prediction</span>
y_pred_train = perceptron.predict(X_train)
y_pred_test = perceptron.predict(X_test)

<span class="hljs-comment"># evaluate the results</span>
acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(<span class="hljs-string">f"Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>}</span> \nAccuracy (Test): <span class="hljs-subst">{acc_test:<span class="hljs-number">.3</span>}</span>"</span>)
</code></pre>
<h4 id="heading-results">Results</h4>
<p>The classifier generated a clear, highly accurate linear decision boundary.</p>
<ul>
<li><p><em>Accuracy (Train): 0.981</em></p>
</li>
<li><p><em>Accuracy (Test): 0.975</em></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440470195/0a01c5ad-124e-4f59-b4d5-9ee5dd5b23ce.png" alt="Decision boundary of single-layer perceptron (Custom classifier)" class="image--center mx-auto" width="1478" height="732" loading="lazy"></p>
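<p>For two features, the decision boundary is the line where the weighted sum equals zero: w1·x1 + w2·x2 + b = 0. As a rough sketch (the <code>boundary_x2</code> helper and the sample weights are hypothetical, and this is not necessarily how the plot above was produced), you can solve that equation for x2:</p>
<pre><code class="lang-python">import numpy as np

def boundary_x2(weights, bias, x1):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (assumes w2 is non-zero)."""
    w1, w2 = weights
    return -(w1 * x1 + bias) / w2

# Hypothetical parameters chosen only for illustration
w = np.array([1.0, 2.0])
b = -4.0
x1 = np.array([0.0, 2.0])

print(boundary_x2(w, b, x1))   # x2 values on the boundary: 2.0 and 1.0
</code></pre>
<p>With the trained model, the same helper could be called as <code>boundary_x2(perceptron.weights, perceptron.bias, x1)</code> to get points for plotting the line.</p>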
<h3 id="heading-2-leverage-sckitlearns-mcp-classifier">2. Leverage Scikit-learn’s MLP Classifier</h3>
<p>For convenience, we’ll use scikit-learn’s built-in classifier (<code>MLPClassifier</code>) to build a similar, yet more robust classifier:</p>
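<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier

model = MLPClassifier(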
    hidden_layer_sizes=(), <span class="hljs-comment"># intentionally set empty to create a single layer perceptron</span>
    activation=<span class="hljs-string">'logistic'</span>, <span class="hljs-comment"># choosing a sigmoid function as an activation function</span>
    solver=<span class="hljs-string">'sgd'</span>, <span class="hljs-comment"># choosing SGD optimizer</span>
    max_iter=<span class="hljs-number">1000</span>,
    random_state=<span class="hljs-number">42</span>, 
    learning_rate=<span class="hljs-string">'constant'</span>, 
    learning_rate_init=<span class="hljs-number">0.1</span>
).fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

acc_train = np.mean(y_pred_train == y_train)
acc_test = np.mean(y_pred_test == y_test)
print(<span class="hljs-string">f"MLPClassifier\nAccuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>}</span> \nAccuracy (Test): <span class="hljs-subst">{acc_test:<span class="hljs-number">.3</span>}</span>"</span>)
</code></pre>
<h4 id="heading-results-1">Results</h4>
<p>The MLPClassifier generated a clear linear decision boundary with slightly better accuracy scores.</p>
<ul>
<li><p><em>Accuracy (Train): 0.985</em></p>
</li>
<li><p><em>Accuracy (Test): 0.995</em></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440118956/f5391f47-711a-4948-b956-1a76dbd7ca92.png" alt="Decision boundary of single-layer perceptron (MLPClassifier)" class="image--center mx-auto" width="1720" height="850" loading="lazy"></p>
<h3 id="heading-limitations-of-single-layer-perceptrons">Limitations of Single-Layer Perceptrons</h3>
<p>Now, let’s talk about the key differences between scikit-learn’s MLPClassifier and our custom single-layer perceptron.</p>
<p>Unlike more general neural networks, single-layer perceptrons use a <strong>step function</strong> as their activation.</p>
<p>Due to its discontinuity at x=0, the step function is not differentiable over its entire domain (−∞ to ∞), and its derivative is zero everywhere else.</p>
<p>This fundamental property precludes the use of <strong>gradient-based optimization algorithms</strong> such as SGD or Adam, as these methods depend on gradients, that is, the partial derivatives of the cost function with respect to the model’s parameters.</p>
<p>In contrast, most neural networks employ differentiable activation functions (for example, <strong>sigmoid</strong>, <strong>ReLU</strong>) and loss functions (for example, <strong>MSE</strong>, <strong>Cross-Entropy</strong>) for effective optimization.</p>
<p>Other challenges of a single-layer perceptron include:</p>
<ul>
<li><p><strong>Limited to linear separability:</strong> Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.</p>
</li>
<li><p><strong>Lack of depth:</strong> Being single-layered, they cannot learn complex hierarchical representations.</p>
</li>
<li><p><strong>Limited optimizer options:</strong> As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.</p>
</li>
</ul>
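<p>The linear separability limit is easy to see on the XOR problem. The sketch below trains a single perceptron with the same update rule as our custom classifier (re-implemented inline so the snippet is self-contained) on the four XOR points. Since no straight line separates the two classes, it can classify at most three of the four points correctly:</p>
<pre><code class="lang-python">import numpy as np

# XOR: the label is 1 only when exactly one input is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

w = np.zeros(2)
b = 0.0
eta = 0.1

for _ in range(100):                      # epochs
    for x_i, y_i in zip(X, y):
        y_hat = int(np.greater(np.dot(x_i, w) + b, 0))   # step function
        w += eta * (y_i - y_hat) * x_i
        b += eta * (y_i - y_hat)

preds = np.greater(X @ w + b, 0).astype(int)
accuracy = float(np.mean(preds == y))
print(accuracy)   # never reaches 1.0, no matter how many epochs you run
</code></pre>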
<p>So, in the next section, you’ll learn about multi-layer perceptrons, which overcome these limitations.</p>
<h2 id="heading-what-is-a-multi-layer-perceptron">What is a Multi-Layer Perceptron?</h2>
<p>An MLP is a class of feedforward artificial neural network that consists of at least <strong>three layers</strong> of nodes:</p>
<ul>
<li><p>an input layer,</p>
</li>
<li><p>one or more hidden layers, and</p>
</li>
<li><p>an output layer.</p>
</li>
</ul>
<p>Except for the input nodes, each node is a neuron that uses a <strong>nonlinear</strong> activation function.​</p>
<p>MLPs are used for both classification and regression:</p>
<ul>
<li><p><strong>Classification tasks:</strong> MLPs are widely used for classification problems, such as handwriting recognition and speech recognition.​</p>
</li>
<li><p><strong>Regression analysis:</strong> They are also applied in regression problems where the relationship between input and output is complex.​</p>
</li>
</ul>
<h2 id="heading-how-to-build-multi-layered-perceptrons">How to Build Multi-Layered Perceptrons</h2>
<p>Let’s handle a binary classification task using a standard MLP architecture.</p>
<h3 id="heading-outline-of-the-project">Outline of the Project</h3>
<h4 id="heading-objective">Objective</h4>
<ul>
<li>Detect fraudulent transactions</li>
</ul>
<h4 id="heading-evaluation-metrics">Evaluation Metrics</h4>
<ul>
<li><p>Considering the cost of misclassification, we’ll prioritize improving <strong>Recall</strong> and <strong>Precision scores</strong></p>
</li>
<li><p>Then check overall correctness with the <strong>Accuracy</strong> score: (TP + TN) / (TP + TN + FP + FN)</p>
</li>
</ul>
<p><strong>Cost of Misclassification (from high to low):</strong></p>
<ul>
<li><p><strong>False Negative (FN):</strong> The model incorrectly identifies a fraudulent transaction as legitimate (missing actual fraud).</p>
</li>
<li><p><strong>False Positive (FP):</strong> The model incorrectly identifies a legitimate transaction as fraudulent (blocking legitimate customers).</p>
</li>
<li><p><strong>True Positive (TP):</strong> The model correctly identifies a fraudulent transaction as fraud.</p>
</li>
<li><p><strong>True Negative (TN):</strong>  The model correctly identifies a non-fraudulent transaction as non-fraud.</p>
</li>
</ul>
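<p>The three metrics can be computed directly from the four confusion-matrix counts. The sketch below uses made-up counts for a 400-sample balanced set, and <code>classification_metrics</code> is a hypothetical helper name.</p>

```python
def classification_metrics(tp, tn, fp, fn):
    recall = tp / (tp + fn)      # share of actual fraud that was caught
    precision = tp / (tp + fp)   # share of fraud alerts that were correct
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, precision, accuracy


# hypothetical counts, not results from this project
recall, precision, accuracy = classification_metrics(tp=180, tn=170, fp=30, fn=20)
print(f"recall={recall:.3f}, precision={precision:.3f}, accuracy={accuracy:.3f}")
```

<p>Recall falls as false negatives grow and precision falls as false positives grow, which is why both are tracked alongside accuracy here.</p>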
<h3 id="heading-planning-an-mlp-architecture">Planning an MLP Architecture</h3>
<p>In the network, 19 input features feed into a hidden layer of 30 neurons, each using a ReLU activation function.</p>
<p>The hidden layer’s outputs are then passed to the output layer, which applies a sigmoid function to produce the final prediction.</p>
<p>During training, each forward pass computes predictions and each backward pass computes gradients, which the optimizer (SGD or Adam) uses to adjust the parameters.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440761512/37753a4c-f7f8-44bc-bea9-c50360830456.png" alt="Standard MLP Architecture for Binary Classification Tasks" class="image--center mx-auto" width="1384" height="752" loading="lazy"></p>
<p>Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using <a target="_blank" href="https://www.researchgate.net/publication/355148120_SS-MLP_A_Novel_Spectral-Spatial_MLP_Architecture_for_Hyperspectral_Image_Classification">image source</a>)</p>
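<p>As a quick sanity check on the planned architecture, the sketch below counts the trainable parameters of a 19-30-1 network, assuming fully connected layers with one bias per neuron.</p>

```python
layer_sizes = [19, 30, 1]  # input features, hidden neurons, output neuron

n_params = 0
for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_weights = fan_in * fan_out  # one weight per connection
    n_biases = fan_out            # one bias per neuron
    n_params += n_weights + n_biases
    print(f"{fan_in} -> {fan_out}: {n_weights} weights, {n_biases} biases")

print(f"total trainable parameters: {n_params}")
```

<p>The hidden layer contributes 600 parameters and the output layer 31, so this is a small model by deep learning standards.</p>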
<p>Especially in deeper networks, <strong>ReLU</strong> helps mitigate the <a target="_blank" href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem#:~:text=In%20machine%20learning%2C%20the%20vanishing,derivative%20of%20the%20loss%20function">vanishing gradient problem</a>, where gradients become extremely small as they are backpropagated from the output layer toward the earlier layers.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440797954/ba19bf66-cdb9-4bfb-9b92-e1e3f72e9fc7.png" alt="Comparison of major activation functions: From left to right: Sigmoid, Tanh, ReLU" class="image--center mx-auto" width="1694" height="578" loading="lazy"></p>
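<p>The effect can be sketched numerically. Backpropagation multiplies one activation derivative per layer, so the snippet below raises each derivative to the power of the depth. It is a simplified illustration that ignores the weight matrices and assumes the same pre-activation value in every layer.</p>

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # never exceeds 0.25


def relu_derivative(z):
    return 1.0 if z > 0 else 0.0


depth = 10
z = 0.5  # hypothetical pre-activation, identical in every layer for simplicity

sigmoid_factor = sigmoid_derivative(z) ** depth
relu_factor = relu_derivative(z) ** depth
print(f"sigmoid chain factor over {depth} layers: {sigmoid_factor:.2e}")
print(f"ReLU chain factor over {depth} layers: {relu_factor:.1f}")
```

<p>With sigmoid, the factor shrinks below one millionth after ten layers, while an active ReLU unit passes the gradient through unchanged.</p>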
<p><a target="_blank" href="https://medium.com/data-science-collective/a-comprehensive-guide-on-neural-network-in-deep-learning-442ba9f1f0e5">Learn More: A Comprehensive Guide on Neural Network in Deep Learning</a></p>
<h3 id="heading-preprocessing-the-datasets">Preprocessing the Datasets</h3>
<p>First, we consolidate <a target="_blank" href="https://www.kaggle.com/datasets/computingvictor/transactions-fraud-datasets">three datasets  –  transaction, customer, and credit card</a>  –  into a single DataFrame, independently sanitizing numerical and categorical data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler, OneHotEncoder
<span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
<span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline

<span class="hljs-comment"># download the raw data to local</span>
<span class="hljs-keyword">import</span> kagglehub
path = kagglehub.dataset_download(<span class="hljs-string">"computingvictor/transactions-fraud-datasets"</span>)
dir = <span class="hljs-string">f'<span class="hljs-subst">{path}</span>/gd_card_flaud_demo'</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">sanitize_df</span>(<span class="hljs-params">amount_str</span>):</span>
    <span class="hljs-string">"""Removes '$' and converts the string to a float."""</span>
    <span class="hljs-keyword">if</span> isinstance(amount_str, str):
        <span class="hljs-keyword">return</span> float(amount_str.replace(<span class="hljs-string">'$'</span>, <span class="hljs-string">''</span>))
    <span class="hljs-keyword">return</span> amount_str

<span class="hljs-comment"># load transaction data</span>
trx_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/transactions_data.csv'</span>)

<span class="hljs-comment"># sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)</span>
trx_df = trx_df[trx_df[<span class="hljs-string">'errors'</span>].isna()]
trx_df = trx_df.drop(columns=[<span class="hljs-string">'merchant_city'</span>,<span class="hljs-string">'merchant_state'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'mcc'</span>, <span class="hljs-string">'errors'</span>], axis=<span class="hljs-string">'columns'</span>)
trx_df[<span class="hljs-string">'amount'</span>] = trx_df[<span class="hljs-string">'amount'</span>].apply(sanitize_df)

<span class="hljs-comment"># merge the dataframe with fraud transaction flag.</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/train_fraud_labels.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> fp:
    fraud_labels_json = json.load(fp=fp)

fraud_labels_dict = fraud_labels_json.get(<span class="hljs-string">'target'</span>, {})
fraud_labels_series = pd.Series(fraud_labels_dict, name=<span class="hljs-string">'is_fraud'</span>)
fraud_labels_series.index = fraud_labels_series.index.astype(int) <span class="hljs-comment"># convert the datatype from string to integer</span>
merged_df = pd.merge(trx_df, fraud_labels_series, left_on=<span class="hljs-string">'id'</span>, right_index=<span class="hljs-literal">True</span>, how=<span class="hljs-string">'left'</span>)
merged_df.fillna({<span class="hljs-string">'is_fraud'</span>: <span class="hljs-string">'No'</span>}, inplace=<span class="hljs-literal">True</span>)
merged_df[<span class="hljs-string">'is_fraud'</span>] = merged_df[<span class="hljs-string">'is_fraud'</span>].map({<span class="hljs-string">'Yes'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'No'</span>: <span class="hljs-number">0</span>})

<span class="hljs-comment"># load card data</span>
card_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/cards_data.csv'</span>)
card_df = card_df.drop(columns=[<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'acct_open_date'</span>, <span class="hljs-string">'card_number'</span>, <span class="hljs-string">'expires'</span>, <span class="hljs-string">'cvv'</span>], axis=<span class="hljs-string">'columns'</span>)
card_df[<span class="hljs-string">'credit_limit'</span>] = card_df[<span class="hljs-string">'credit_limit'</span>].apply(sanitize_df)

<span class="hljs-comment"># merge transaction and card data</span>
merged_df = pd.merge(left=merged_df, right=card_df, left_on=<span class="hljs-string">'card_id'</span>, right_on=<span class="hljs-string">'id'</span>, how=<span class="hljs-string">'inner'</span>)
merged_df = merged_df.drop(columns=[<span class="hljs-string">'id_y'</span>, <span class="hljs-string">'card_id'</span>], axis=<span class="hljs-string">'columns'</span>)

<span class="hljs-comment"># converts categorical variables into a new binary column (0 or 1)</span>
categorical_cols = merged_df.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns
df = merged_df.copy()
df = pd.get_dummies(df, columns=categorical_cols, dummy_na=<span class="hljs-literal">False</span>, dtype=float) 
df = df.dropna().drop([<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'id_x'</span>], axis=<span class="hljs-number">1</span>)
print(<span class="hljs-string">'\nDataFrame: \n'</span>, df.head(n=<span class="hljs-number">3</span>))
</code></pre>
<p>DataFrame:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440856826/ba79bdaf-e0a1-457f-ab19-fda3e0f08141.png" alt="Base DataFrame" class="image--center mx-auto" width="1546" height="810" loading="lazy"></p>
<p>Our DataFrame shows an extremely <strong>skewed data distribution</strong> with:</p>
<ul>
<li><p>Fraud samples: 1,191</p>
</li>
<li><p>Non-fraud samples: 11,477,397</p>
</li>
</ul>
<p>For classification tasks, <strong>it's crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact</strong> on classification model performance, especially regarding the minority class.</p>
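<p>A short calculation with the sample counts above shows why accuracy alone is misleading on this data. The "model" here is a deliberately trivial baseline that labels every transaction as non-fraud.</p>

```python
n_fraud = 1_191
n_non_fraud = 11_477_397

# a trivial baseline that labels every transaction as non-fraud
tp, fn = 0, n_fraud
tn, fp = n_non_fraud, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(f"accuracy: {accuracy:.4%}, recall: {recall:.0%}")
```

<p>The baseline scores above 99.98% accuracy while catching no fraud at all, so the split strategy below balances the classes before training.</p>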
<p>For our data, we’ll:</p>
<ol>
<li><p>split the 1,191 fraud samples into training, validation, and test sets,</p>
</li>
<li><p>add an equal number of randomly chosen non-fraud samples from the DataFrame, and</p>
</li>
<li><p>adjust split balances later if generalization challenges arise.</p>
</li>
</ol>
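<pre><code class="lang-python"><span class="hljs-comment"># separate the fraud and non-fraud rows (assumed step: df_fraud and df_non_fraud are not defined earlier)</span>
df_fraud = df[df[<span class="hljs-string">'is_fraud'</span>] == <span class="hljs-number">1</span>]
df_non_fraud = df[df[<span class="hljs-string">'is_fraud'</span>] == <span class="hljs-number">0</span>]

<span class="hljs-comment"># define the number of samples per class for the validation and test sets</span>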
val_size_per_class = <span class="hljs-number">200</span>
test_size_per_class = <span class="hljs-number">200</span>

<span class="hljs-comment"># create test sets</span>
X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)
X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># combine to form the balanced test set</span>
X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_test = X_test[<span class="hljs-string">'is_fraud'</span>]
X_test = X_test.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)

<span class="hljs-comment"># remove sampled rows from the original dataframes to avoid data leakage</span>
df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)


<span class="hljs-comment"># create validation sets</span>
X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)
X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)

<span class="hljs-comment"># combine to form the balanced validation set</span>
X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_val = X_val[<span class="hljs-string">'is_fraud'</span>]
X_val = X_val.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)

<span class="hljs-comment"># remove sampled rows from the remaining dataframes</span>
df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)


<span class="hljs-comment"># create training sets</span>
min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))

X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)
X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)

X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
y_train = X_train[<span class="hljs-string">'is_fraud'</span>]
X_train = X_train.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)


print(<span class="hljs-string">"\n--- Final Dataset Shapes and Distributions ---"</span>)
print(<span class="hljs-string">f"X_train shape: <span class="hljs-subst">{X_train.shape}</span>, y_train distribution: <span class="hljs-subst">{np.unique(y_train, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
print(<span class="hljs-string">f"X_val shape: <span class="hljs-subst">{X_val.shape}</span>, y_val distribution: <span class="hljs-subst">{np.unique(y_val, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
print(<span class="hljs-string">f"X_test shape: <span class="hljs-subst">{X_test.shape}</span>, y_test distribution: <span class="hljs-subst">{np.unique(y_test, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
</code></pre>
<p>After the operation, we secured 1,582 training, 400 validation, and 400 test samples, each dataset maintaining a <strong>50:50 split between fraud and non-fraud transactions</strong>:</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*IZtK3l0hSqmkOrm9h_d9Jw.png" alt="X, y datasets shape" width="600" height="400" loading="lazy"></p>
<p>Considering the high-dimensional feature space with 19 input features, we’ll apply <strong>SMOTE</strong> to oversample the training data only (applying SMOTE to the validation or test sets would cause data leakage):</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> imblearn.over_sampling <span class="hljs-keyword">import</span> SMOTE
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter

train_target = <span class="hljs-number">2000</span>

smote_train = SMOTE(
  sampling_strategy={<span class="hljs-number">0</span>: train_target, <span class="hljs-number">1</span>: train_target},  <span class="hljs-comment"># increase sample size to 2,000</span>
  random_state=<span class="hljs-number">12</span>
)
X_train, y_train = smote_train.fit_resample(X_train, y_train)

print(<span class="hljs-string">f"\nAfter SMOTE with custom sampling_strategy (target train: <span class="hljs-subst">{train_target}</span>):"</span>)
print(<span class="hljs-string">f"X_train_oversampled shape: <span class="hljs-subst">{X_train.shape}</span>"</span>)
print(<span class="hljs-string">f"y_train_oversampled distribution: <span class="hljs-subst">{Counter(y_train)}</span>"</span>)
</code></pre>
<p>We’ve secured 4,000 training samples, maintaining a 50:50 split between fraud and non-fraud transactions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748440986995/ed079321-3972-4226-b1a8-244010445162.png" alt="Training sample shape after SMOTE" class="image--center mx-auto" width="1578" height="218" loading="lazy"></p>
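<p>SMOTE builds each synthetic sample by interpolating between an existing sample and one of its nearest neighbors from the same class. The sketch below shows only that interpolation step on made-up 2D points. It is a simplified illustration, not the <code>imblearn</code> implementation, and <code>smote_like_sample</code> is a hypothetical helper.</p>

```python
import random


def smote_like_sample(x, neighbor, rng):
    # pick a random point on the segment between x and its neighbor
    lam = rng.random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(12)
x = [1.0, 2.0]         # hypothetical sample
neighbor = [3.0, 6.0]  # one of its nearest same-class neighbors
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

<p>Every synthetic point lies on the line segment between two real samples, so the new data stays inside the region the class already occupies.</p>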
<p>Lastly, we’ll apply <strong>column transformers</strong> to numerical and categorical features separately.</p>
<p>Column transformers are useful for datasets with mixed data types because they apply different transformations to different subsets of columns. Fitting them on the training set only, and then reusing the fitted transformer on the validation and test sets, also prevents data leakage.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
<span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline

categorical_features = X_train.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns.tolist()
categorical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'most_frequent'</span>)),(<span class="hljs-string">'onehot'</span>, OneHotEncoder(handle_unknown=<span class="hljs-string">'ignore'</span>))])

numerical_features = X_train.select_dtypes(include=[<span class="hljs-string">'int64'</span>, <span class="hljs-string">'float64'</span>]).columns.tolist()
numerical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)), (<span class="hljs-string">'scaler'</span>, StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        (<span class="hljs-string">'num'</span>, numerical_transformer, numerical_features),
        (<span class="hljs-string">'cat'</span>, categorical_transformer, categorical_features)
    ]
)

X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)
X_test_processed = preprocessor.transform(X_test)
</code></pre>
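<p>The fit-on-train, transform-everywhere pattern used above can be sketched without any library. The values below are made up, and the population standard deviation is used because that is what <code>StandardScaler</code> computes.</p>

```python
import statistics

train_amounts = [10.0, 20.0, 30.0, 40.0]  # hypothetical training feature values
val_amounts = [25.0, 100.0]               # hypothetical validation feature values

# fit: learn the scaling statistics from the training split only
mean = statistics.fmean(train_amounts)
std = statistics.pstdev(train_amounts)

# transform: reuse the training statistics on every split
train_scaled = [(v - mean) / std for v in train_amounts]
val_scaled = [(v - mean) / std for v in val_amounts]
print(mean, std, val_scaled)
```

<p>The validation values never influence <code>mean</code> or <code>std</code>, which is the property that keeps information from leaking out of the held-out sets.</p>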
<h2 id="heading-understanding-optimizers">Understanding Optimizers</h2>
<p>In deep learning, an optimizer is the algorithm that adjusts a neural network’s parameters during training. Its primary role is to minimize the model’s loss function.</p>
<p>Different optimizers use different strategies to converge efficiently toward good parameter values.</p>
<p>In this article, we’ll use the SGD and Adam optimizers.</p>
<h3 id="heading-1-how-a-sgd-stochastic-gradient-descent-optimizer-works">1. How an SGD (Stochastic Gradient Descent) Optimizer Works</h3>
<p>SGD is a widely used optimization algorithm that computes the gradient (the partial derivatives of the cost function) on a small mini-batch of examples at each update step, rather than on the full dataset:</p>
<p>$$\begin{align*} w_j &amp;:= w_j - \eta \frac {\partial J} {\partial w_j} \\ \\ b &amp;:= b - \eta \frac {\partial J} {\partial b} \end{align*}$$</p><p>(w: weight, b: bias, J: cost function, <em>η</em>: learning rate)</p>
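<p>As a worked example of this update rule, the sketch below applies one SGD step to a model with a single weight. It assumes the sigmoid output and cross-entropy cost used in this article, for which the per-sample gradient with respect to the weight is (ŷ − y)·x. The mini-batch values and the learning rate are made up.</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# hypothetical mini-batch of (feature, label) pairs with one input feature
batch = [(0.5, 1), (-1.0, 0), (2.0, 1)]
w, b, eta = 0.0, 0.0, 0.1  # initial weight, bias, and learning rate

# average the per-sample gradients over the mini-batch
grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in batch) / len(batch)
grad_b = sum((sigmoid(w * x + b) - y) for x, y in batch) / len(batch)

# apply the update rule once
w = w - eta * grad_w
b = b - eta * grad_b
print(f"w={w:.4f}, b={b:.4f}")
```

<p>Both gradients are negative here, so the step increases <code>w</code> and <code>b</code>, nudging the predictions for the two positive samples upward.</p>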
<p>In binary classification, the cost function (J) is the binary cross-entropy of the prediction ŷ = σ(z), where σ is the sigmoid function and z is the weighted sum of the inputs plus the bias term:</p>
<p>$$\begin{align*} J(y, \hat y) &amp;= -\left[ y \log(\hat y) + (1-y) \log(1-\hat y) \right] \\ \\ \hat y &amp;= \sigma (z) = \frac {1} {1+e^{-z}} \\ \\ z &amp;= \sum_{i=1}^m w_i x_i + b \end{align*}$$</p><h3 id="heading-2-how-adam-adaptive-moment-estimation-optimizer-works">2. How Adam (Adaptive Moment Estimation) Optimizer Works</h3>
<p>Adam is an optimization algorithm that computes <strong>individual adaptive learning rates</strong> for different parameters from estimates of first and second moments of the gradients.</p>
<p>Adam optimizer combines the advantages of <a target="_blank" href="https://keras.io/api/optimizers/rmsprop/"><strong>RMSprop</strong></a> (using squared gradients to scale the learning rate) and <a target="_blank" href="https://optimization.cbe.cornell.edu/index.php?title=Momentum"><strong>Momentum</strong></a> (using past gradients to accelerate convergence):</p>
<p>$$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$</p><p>where:</p>
<ul>
<li><p><code>α</code>: The learning rate (default is 0.001)</p>
</li>
<li><p><code>ϵ</code>: A small positive constant used to avoid division by zero</p>
</li>
<li><p><code>m^</code>: First moment (mean) estimate with a bias correction, leveraging <strong>Momentum</strong>:</p>
</li>
</ul>
<p>$$\begin{align*} \hat m_t &amp;= \frac {m_t} {1 - \beta_1^t} \\ \\ m_t &amp;= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{ \frac {\partial L} {\partial w_t}}_{\text{gradient}} \end{align*}$$</p><p>(<code>β1</code>: <strong>decay rate</strong> of the first moment, typically set to β1 = 0.9)</p>
<p><code>v^</code>: Second moment (variance) estimate with a bias correction, leveraging <strong>RMSprop</strong>:</p>
<p>$$\begin{align*} \hat v_t &amp;= \frac {v_t} {1 - \beta_2^t} \\ \\ v_t &amp;=\beta_2 v_{t-1} + (1- \beta_2) (\frac {\partial L} {\partial w_t})^2 \end {align*}$$</p><p>(<code>β2</code>: <strong>decay rate</strong> of the second moment, typically set to β2 = 0.999)</p>
<p>Since both <code>m</code> and <code>v</code> are initialized at zero, their early estimates are biased toward zero. Adam divides them by the bias-correction terms above to compensate.</p>
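<p>A single update step shows what the bias correction does in practice. The sketch below follows the formulas above for one parameter at step t = 1, using the typical hyperparameters and a made-up gradient value.</p>

```python
import math

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
w, m, v = 1.0, 0.0, 0.0  # parameter and both moment estimates start at zero
grad = 0.5               # hypothetical gradient at step t = 1
t = 1

m = beta1 * m + (1 - beta1) * grad       # first moment
v = beta2 * v + (1 - beta2) * grad ** 2  # second moment
m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
print(f"m={m:.4f}, m_hat={m_hat:.4f}, w={w:.6f}")
```

<p>The raw estimate <code>m</code> is only about a tenth of the actual gradient, while <code>m_hat</code> recovers it. At the first step the update size is therefore close to the learning rate itself, regardless of the gradient’s scale.</p>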
<p>Learn More: <a target="_blank" href="https://medium.com/@kuriko-iwai/a-comprehensive-guide-on-neural-network-in-deep-learning-9c795a1f1648">A Comprehensive Guide on Neural Network in Deep Learning</a></p>
<h2 id="heading-how-to-build-an-mlp-classifier-with-sgd-optimizer">How to Build an MLP Classifier with SGD Optimizer</h2>
<h3 id="heading-custom-classifier">Custom Classifier</h3>
<p>Training involves a <strong>forward pass</strong> followed by <strong>backpropagation</strong>, during which SGD uses the gradients to update the weights and biases:</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
    <span class="hljs-comment"># SGD starts with randomly selected mini-batch for the epoch</span>
    X_batch = X_shuffled[i : i + self.batch_size]
    y_batch = y_shuffled[i : i + self.batch_size]

    <span class="hljs-comment"># A. forward pass</span>
    activations, zs = self._forward_pass(X_batch)
    y_pred = activations[<span class="hljs-number">-1</span>]  <span class="hljs-comment"># final output of the network</span>

    <span class="hljs-comment"># B. backpropagation</span>
    <span class="hljs-comment"># 1) calculate gradients for the output layer</span>
    delta = y_pred - y_batch
    dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
    db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

    <span class="hljs-comment"># propagate the error with the pre-update weights before changing them</span>
    delta_prev = np.dot(delta, self.weights[<span class="hljs-number">-1</span>].T)

    <span class="hljs-comment"># 2) update output layer parameters</span>
    self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
    self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db

    <span class="hljs-comment"># 3) iterate backward from last hidden layer to the input layer</span>
    <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
        delta = delta_prev * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
        dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
        db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

        delta_prev = np.dot(delta, self.weights[l].T) <span class="hljs-comment"># error passed to the layer below, using pre-update weights</span>
        self.weights[l] -= self.learning_rate * dW
        self.biases[l] -= self.learning_rate * db
</code></pre>
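<p>The output-layer line <code>delta = y_pred - y_batch</code> follows from applying the chain rule to the cross-entropy cost and the sigmoid output defined earlier. As a short derivation for a single sample:</p>
<p>$$\begin{align*} \frac{\partial J}{\partial \hat y} &amp;= -\frac{y}{\hat y} + \frac{1-y}{1-\hat y} \\ \\ \frac{\partial \hat y}{\partial z} &amp;= \hat y (1-\hat y) \\ \\ \frac{\partial J}{\partial z} &amp;= \frac{\partial J}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial z} = -y(1-\hat y) + (1-y)\hat y = \hat y - y \end{align*}$$</p>
<p>The sigmoid derivative cancels against the denominators of the cross-entropy gradient, which is why the code never calls a sigmoid derivative for the output layer. Dividing by <code>X_batch.shape[0]</code> then averages the gradient over the mini-batch.</p>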
<p>During the forward pass, the network computes the weighted sum of each layer’s inputs plus the bias (z), applies the ReLU activation function in each hidden layer, and then computes the predicted output (y_pred) using a sigmoid function.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
    activations = [X]
    zs = []

    <span class="hljs-comment"># forward through hidden layers</span>
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
        z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
        zs.append(z)
        a = self._relu(z) <span class="hljs-comment"># using ReLU for hidden layers</span>
        activations.append(a)

    <span class="hljs-comment"># forward through output layer</span>
    z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
    zs.append(z_output)

    <span class="hljs-comment"># computes the final output using sigmoid function</span>
    y_pred = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(z_output, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))
    activations.append(y_pred)
    <span class="hljs-keyword">return</span> activations, zs
</code></pre>
<p>So the final classifier looks like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_SGD</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.01</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span></span>):</span>
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.weights = []
        self.biases = []
        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        s = self._sigmoid(x)
        <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> (x &gt; <span class="hljs-number">0</span>).astype(float)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]
        self.weights = []
        self.biases = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
            limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))
            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
        activations = [X]
        zs = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
            z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        <span class="hljs-keyword">return</span> activations, zs

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
        y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
        loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
        <span class="hljs-keyword">return</span> loss

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
        X = np.asarray(X)
        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
        self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
        self.loss_history.append(initial_loss)

        <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
            <span class="hljs-comment"># shuffle datasets</span>
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            <span class="hljs-comment"># mini-batch loop</span>
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[<span class="hljs-number">-1</span>]

                delta = y_pred - y_batch
                dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
                <span class="hljs-comment"># propagate the error with the pre-update weights before changing them</span>
                delta_prev = np.dot(delta, self.weights[<span class="hljs-number">-1</span>].T)
                self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
                self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db

                <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                    delta = delta_prev * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                    dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                    db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                    delta_prev = np.dot(delta, self.weights[l].T) <span class="hljs-comment"># error passed to the layer below, using pre-update weights</span>
                    self.weights[l] -= self.learning_rate * dW
                    self.biases[l] -= self.learning_rate * db

            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(epoch_loss)

            <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
        <span class="hljs-keyword">return</span> self

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
        activations, _ = self._forward_pass(X)
        <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
        probabilities = self.predict_proba(X)
        <span class="hljs-keyword">return</span> (probabilities &gt;= threshold).astype(int).flatten() <span class="hljs-comment"># for 1D output</span>
</code></pre>
<h3 id="heading-training-prediction">Training / Prediction</h3>
<p>Train the model and make a prediction using training and validation datasets:</p>
<pre><code class="lang-python"><span class="hljs-comment"># 1. define the model</span>
mlp_sgd = MLP_SGD(
  hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>, ), <span class="hljs-comment"># 2 hidden layers with 30 neurons each</span>
  learning_rate=<span class="hljs-number">0.001</span>,           <span class="hljs-comment"># a step size</span>
  n_epochs=<span class="hljs-number">1000</span>,                 <span class="hljs-comment"># number of epochs</span>
  batch_size=<span class="hljs-number">32</span>                  <span class="hljs-comment"># mini-batch size</span>
)

<span class="hljs-comment"># 2. train the model</span>
mlp_sgd.fit(X_train_processed, y_train)

<span class="hljs-comment"># 3. make a prediction with training and validation datasets</span>
y_pred_train = mlp_sgd.predict(X_train_processed)
y_pred_val = mlp_sgd.predict(X_val_processed)

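<span class="hljs-comment"># 4. compute evaluation metrics on both datasets</span>
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

conf_matrix_val = confusion_matrix(y_val, y_pred_val)
acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)
precision_train = precision_score(y_train, y_pred_train, pos_label=<span class="hljs-number">1</span>)
precision_val = precision_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)
recall_train = recall_score(y_train, y_pred_train, pos_label=<span class="hljs-number">1</span>)
recall_val = recall_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)
f1_val = f1_score(y_val, y_pred_val, pos_label=<span class="hljs-number">1</span>)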

print(<span class="hljs-string">f"\nMLP (Custom SGD) Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"MLP (Custom SGD) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-results-2">Results</h3>
<ul>
<li><p>Recall: <em>0.7930 — 0.6650 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.7790 — 0.6786 (from training to validation)</em></p>
</li>
</ul>
<p>The model learned the patterns effectively, achieving a <strong>Recall of 79.3%</strong> on the training set (it identified roughly 80% of the fraudulent transactions), with a drop of about 13 points (to 66.5%) on the validation set.</p>
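<p>To make the two metrics concrete, here is a minimal sketch that computes Precision and Recall by hand. The helper name <code>precision_recall</code> and the labels are made up for illustration and are not part of the article's pipeline.</p>

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # fraud correctly flagged
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # fraud that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# made-up labels: 4 fraud cases, 3 of them caught, plus 1 false alarm
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0]
precision_demo, recall_demo = precision_recall(y_true_demo, y_pred_demo)
print(f"precision={precision_demo:.2f}, recall={recall_demo:.2f}")
```

<p>Recall answers "how much of the actual fraud did we catch?", while Precision answers "how many of our fraud alerts were right?", which is why the results report both.</p>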
<p><strong>Loss history:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441103897/088deb38-846d-4026-a706-701be93036ca.png" alt="Loss by epoch, weight history, bias history (Source: Kuriko Iwai)" class="image--center mx-auto" width="1770" height="460" loading="lazy"></p>
<p>We visualized the <strong>decision boundary</strong> using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442430297/032ee809-1b7e-4bb1-81c0-8715361658a5.png" alt="Image: Decision Boundary of MLP Classifier with SGD optimizer (Source: Kuriko Iwai)" class="image--center mx-auto" width="1508" height="754" loading="lazy"></p>
<h3 id="heading-leverage-sckitlearns-mcp-classifier">Leverage Scikit-learn’s MLP Classifier</h3>
<p>We can use scikit-learn’s <code>MLPClassifier</code> to define a similar model, incorporating:</p>
<ul>
<li><p><strong>Early stopping</strong> using internal validation to prevent overfitting and</p>
</li>
<li><p><strong>L2 regularization</strong> with a small penalty strength (<code>alpha</code>) and a small optimization tolerance (<code>tol</code>).</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier

<span class="hljs-comment"># define a model</span>
model_sklearn_mlp_sgd = MLPClassifier(
    hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
    activation=<span class="hljs-string">'relu'</span>,
    solver=<span class="hljs-string">'sgd'</span>,
    learning_rate_init=<span class="hljs-number">0.001</span>,
    learning_rate=<span class="hljs-string">'constant'</span>,
    momentum=<span class="hljs-number">0.9</span>,
    nesterovs_momentum=<span class="hljs-literal">True</span>,
    alpha=<span class="hljs-number">0.00001</span>,           <span class="hljs-comment"># L2 regularization strength</span>
    max_iter=<span class="hljs-number">3000</span>,           <span class="hljs-comment"># max epochs (keep it high)</span>
    batch_size=<span class="hljs-number">16</span>,           <span class="hljs-comment"># mini-batch size</span>
    random_state=<span class="hljs-number">42</span>,
    early_stopping=<span class="hljs-literal">True</span>,     <span class="hljs-comment"># apply early stopping</span>
    n_iter_no_change=<span class="hljs-number">50</span>,     <span class="hljs-comment"># stop the iteration if internal validation score doesn't improve for 50 epochs</span>
    validation_fraction=<span class="hljs-number">0.1</span>, <span class="hljs-comment"># proportion of training data for internal validation (default is 0.1)</span>
    tol=<span class="hljs-number">1e-4</span>,                <span class="hljs-comment"># tolerance for optimization</span>
    verbose=<span class="hljs-literal">False</span>,
)

<span class="hljs-comment"># training</span>
model_sklearn_mlp_sgd.fit(X_train_processed, y_train)

<span class="hljs-comment"># make a prediction</span>
y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)
</code></pre>
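<p>The interplay between <code>n_iter_no_change</code> and <code>tol</code> can be sketched in plain Python. This is a simplified illustration of patience-based early stopping, not scikit-learn's internal implementation; the function name <code>early_stopping_epoch</code> and the scores are hypothetical.</p>

```python
def early_stopping_epoch(val_scores, patience=50, tol=1e-4):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            # the score improved by more than tol: remember it and reset the counter
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return None

# made-up validation scores: improvement stalls after the third epoch
scores_demo = [0.60, 0.65, 0.70, 0.70, 0.70, 0.70]
stop_epoch = early_stopping_epoch(scores_demo, patience=3, tol=1e-4)
print(stop_epoch)
```

<p>A larger <code>patience</code> tolerates longer plateaus before stopping, while a larger <code>tol</code> demands a bigger gain before an epoch counts as an improvement.</p>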
<h3 id="heading-results-3">Results</h3>
<ul>
<li><p>Recall: <em>0.7830 - 0.6200 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.8208 - 0.6703 (from training to validation)</em></p>
</li>
</ul>
<p>The model showed strong performance during training, achieving a <strong>Recall of 78.30%</strong>. On the validation set, Recall declined by about 16 points to 62.00%, and Precision by about 15 points to 67.03%.</p>
<p>This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.</p>
<h3 id="heading-leverage-keras-sequential-classifier">Leverage Keras Sequential Classifier</h3>
<p>For the sequential classifier, we can further enhance the classifier by:</p>
<ul>
<li><p>Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (<code>y_train</code>) to address dataset imbalance and promote faster convergence,</p>
</li>
<li><p>Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,</p>
</li>
<li><p>Including Precision and Recall in the model’s compilation metrics to track classification performance during training,</p>
</li>
<li><p>Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and</p>
</li>
<li><p>Utilizing a separate validation dataset for monitoring performance during training to help detect overfitting and guide hyperparameter tuning.</p>
</li>
</ul>
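<p>The bias initialization and class weighting described above can be checked with plain NumPy. The sketch below uses a hypothetical label array (<code>y_demo</code>) with 10% positives. It assumes the "balanced" weighting formula <code>n_samples / (n_classes * class_count)</code>, which is what scikit-learn's <code>compute_class_weight</code> applies.</p>

```python
import numpy as np

# hypothetical imbalanced labels: 10 positives out of 100 samples
y_demo = np.array([1] * 10 + [0] * 90)

n_pos = np.sum(y_demo == 1)
n_neg = np.sum(y_demo == 0)

# log-odds of the positive class, used as the output layer's initial bias
initial_bias_demo = np.log(n_pos / n_neg)

# passing the log-odds through a sigmoid recovers the positive-class rate,
# so the untrained network already predicts the base rate
initial_prediction = 1 / (1 + np.exp(-initial_bias_demo))

# "balanced" class weights: n_samples / (n_classes * class_count)
classes, counts = np.unique(y_demo, return_counts=True)
balanced_weights = {
    int(c): len(y_demo) / (len(classes) * n) for c, n in zip(classes, counts)
}

print(f"initial bias: {initial_bias_demo:.4f}")
print(f"initial prediction: {initial_prediction:.4f}")
print(f"class weights: {balanced_weights}")
```

<p>With this starting point the network does not have to spend its first epochs learning that fraud is rare, and the minority class contributes as much to the loss as the majority class.</p>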
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
<span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
<span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> SGD
<span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
<span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight


<span class="hljs-comment"># calculates an initial bias for the output layer </span>
initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])


<span class="hljs-comment"># defines the model</span>
model_keras_sgd = Sequential([
    Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>), <span class="hljs-comment"># 10% of the neurons in that layer randomly dropped out</span>
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, <span class="hljs-comment"># binary classification</span>
          bias_initializer=tf.keras.initializers.Constant(initial_bias)) <span class="hljs-comment"># to address the imbalanced datasets</span>
])



<span class="hljs-comment"># compiles the model with the SGD optimizer</span>
opt = SGD(learning_rate=<span class="hljs-number">0.001</span>)
model_keras_sgd.compile(
    optimizer=opt, 
    loss=<span class="hljs-string">'binary_crossentropy'</span>,
    metrics=[
        <span class="hljs-string">'accuracy'</span>, <span class="hljs-comment"># add several metrics to return</span>
        tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
        tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
        tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
    ]
)


<span class="hljs-comment"># defines early stopping to prevent overfitting</span>
early_stopping_callback = EarlyStopping(
    monitor=<span class="hljs-string">'val_recall'</span>,  <span class="hljs-comment"># monitor recall </span>
    mode=<span class="hljs-string">'max'</span>,         <span class="hljs-comment"># maximize recall</span>
    patience=<span class="hljs-number">50</span>,        <span class="hljs-comment"># stop after 50 epochs without validation recall improvement</span>
    min_delta=<span class="hljs-number">1e-4</span>,     <span class="hljs-comment"># minimum change to be considered an improvement (tol)</span>
    verbose=<span class="hljs-number">0</span>
)


<span class="hljs-comment"># compute the class weight</span>
class_weights = class_weight.compute_class_weight(
    class_weight=<span class="hljs-string">'balanced'</span>,
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))


<span class="hljs-comment"># train the model</span>
history = model_keras_sgd.fit(
    X_train_processed, y_train,
    epochs=<span class="hljs-number">1000</span>,
    batch_size=<span class="hljs-number">32</span>,
    validation_data=(X_val_processed, y_val), <span class="hljs-comment"># use our external val set</span>
    callbacks=[early_stopping_callback], <span class="hljs-comment"># early stopping to prevent overfitting</span>
    class_weight=class_weights_dict, <span class="hljs-comment"># penalize misclassification of the minority class more heavily</span>
    verbose=<span class="hljs-number">0</span>
)

<span class="hljs-comment"># evaluate</span>
loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Train) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)

loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Validation) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)

<span class="hljs-comment"># display model summary</span>
model_keras_sgd.summary()
</code></pre>
<h3 id="heading-results-4">Results</h3>
<ul>
<li><p>Recall: <em>0.7125 — 0.7250 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.7607 — 0.7545 (from training to validation)</em></p>
</li>
</ul>
<p>Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.</p>
<p>It suggests that the regularization techniques are likely effective in preventing significant overfitting.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441165170/4e0528e3-514a-454c-b52a-2a0318ba405a.png" alt="Image: Summary of the Keras Sequential Model with SGD Optimizer" class="image--center mx-auto" width="1668" height="512" loading="lazy"></p>
<h2 id="heading-how-to-build-an-mlp-classifier-with-adam-optimizer">How to Build an MLP Classifier with Adam Optimizer</h2>
<h3 id="heading-custom-classifier-1">Custom Classifier</h3>
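<p>Adam keeps an exponential moving average of each gradient (<code>m</code>) and of its square (<code>v</code>), corrects both for their bias toward zero at early time steps <code>t</code>, and scales the step accordingly. This update runs inside the mini-batch loop for every weight and bias; for the output layer it looks like this:</p>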
<pre><code class="lang-python"><span class="hljs-comment"># apply Adam updates for output layer parameters</span>
<span class="hljs-comment"># 1) weights (w)</span>
self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

<span class="hljs-comment"># 2) bias (b)</span>
self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
</code></pre>
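<p>To see the same update rule in isolation, here is a minimal sketch that applies it to a toy one-parameter problem. The function <code>adam_step</code> and the quadratic objective are assumptions made for illustration; the update formulas mirror the snippet above.</p>

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# toy problem: minimize f(w) = (w - 3)^2 starting from w = 0
w = np.array([0.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w_after_first_step = None

for t in range(1, 5001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
    if t == 1:
        w_after_first_step = w.copy()

print(w_after_first_step, w)
```

<p>Thanks to the bias correction, the very first step has a magnitude of roughly the learning rate regardless of the gradient's scale, because <code>m_hat / sqrt(v_hat)</code> reduces to the sign of the gradient at <code>t = 1</code>.</p>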
<p>Following the same forward and backward passes, we construct the final classifier on top of the <code>MLP_SGD</code> architecture, adding the Adam hyperparameters <code>beta1</code>, <code>beta2</code>, and <code>epsilon</code>:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_Adam</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span>,
                 beta1=<span class="hljs-number">0.9</span>, beta2=<span class="hljs-number">0.999</span>, epsilon=<span class="hljs-number">1e-8</span></span>):</span>
        self.hidden_layer_sizes = hidden_layer_sizes
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

        self.weights = [] 
        self.biases = []

        <span class="hljs-comment"># Adam optimizer internal states for each parameter (weights and biases)</span>
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        self.weights_history = []
        self.biases_history = []
        self.loss_history = []

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        s = self._sigmoid(x)
        <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> (x &gt; <span class="hljs-number">0</span>).astype(float)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
        layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]

        self.weights = []
        self.biases = []
        self.m_weights = []
        self.v_weights = []
        self.m_biases = []
        self.v_biases = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
            limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))

            self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
            self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))

            self.m_weights.append(np.zeros((fan_in, fan_out)))
            self.v_weights.append(np.zeros((fan_in, fan_out)))
            self.m_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
            self.v_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
        activations = [X]
        zs = []

        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
            z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z)
            activations.append(a)

        z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
        zs.append(z_output)
        y_pred = self._sigmoid(z_output)
        activations.append(y_pred)

        <span class="hljs-keyword">return</span> activations, zs

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
        y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
        loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
        <span class="hljs-keyword">return</span> loss

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
        y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
        X = np.asarray(X)

        self._initialize_parameters(n_features)
        self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
        self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
        activations, _ = self._forward_pass(X)
        initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
        self.loss_history.append(initial_loss)

        <span class="hljs-comment"># global time step for Adam bias correction</span>
        t = <span class="hljs-number">0</span>

        <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
            permutation = np.random.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]

            <span class="hljs-comment"># Mini-batch loop</span>
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                X_batch = X_shuffled[i : i + self.batch_size]
                y_batch = y_shuffled[i : i + self.batch_size]

                t += <span class="hljs-number">1</span>

                <span class="hljs-comment"># 1. forward pass</span>
                activations, zs = self._forward_pass(X_batch)
                y_pred = activations[<span class="hljs-number">-1</span>] <span class="hljs-comment"># Output of the network</span>

                <span class="hljs-comment"># 2. backpropagation</span>
                delta = y_pred - y_batch
                grad_w_output = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>] <span class="hljs-comment"># Average over batch</span>
                grad_b_output = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                <span class="hljs-comment"># apply Adam updates to weights</span>
                self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
                self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
                m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                W_next = self.weights[<span class="hljs-number">-1</span>].copy() <span class="hljs-comment"># pre-update weights, needed to backpropagate the error</span>
                self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                <span class="hljs-comment"># apply Adam updates to bias</span>
                self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
                self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
                m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


                <span class="hljs-comment"># Propagate gradients backward through hidden layers</span>
                <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                    delta = np.dot(delta, W_next.T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                    grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                    grad_b_hidden = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]

                    <span class="hljs-comment"># apply Adam updates to weights</span>
                    self.m_weights[l] = self.beta1 * self.m_weights[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_hidden
                    self.v_weights[l] = self.beta2 * self.v_weights[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_hidden ** <span class="hljs-number">2</span>)
                    m_w_hat = self.m_weights[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_w_hat = self.v_weights[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                    W_next = self.weights[l].copy() <span class="hljs-comment"># keep this layer's pre-update weights for the next step back</span>
                    self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)

                    <span class="hljs-comment"># apply Adam updates to bias</span>
                    self.m_biases[l] = self.beta1 * self.m_biases[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_hidden
                    self.v_biases[l] = self.beta2 * self.v_biases[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_hidden ** <span class="hljs-number">2</span>)
                    m_b_hat = self.m_biases[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_b_hat = self.v_biases[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)


            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])

            activations, _ = self._forward_pass(X)
            epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(epoch_loss)

            <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
        <span class="hljs-keyword">return</span> self


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
        activations, _ = self._forward_pass(X)
        <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
        probabilities = self.predict_proba(X)
        <span class="hljs-keyword">return</span> (probabilities &gt;= threshold).astype(int).flatten()
</code></pre>
<h3 id="heading-training-prediction-1">Training / Prediction</h3>
<p>Train the model and make a prediction using training and validation datasets:</p>
<pre><code class="lang-python">mlp_adam = MLP_Adam(hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">10</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">500</span>, batch_size=<span class="hljs-number">32</span>)
mlp_adam.fit(X_train_processed, y_train)

y_pred_train = mlp_adam.predict(X_train_processed)
y_pred_val = mlp_adam.predict(X_val_processed)

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)

print(<span class="hljs-string">f"\nMLP (Custom Adam) Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"MLP (Custom Adam) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-results-5">Results</h3>
<ul>
<li><p>Recall: <em>0.9870–0.6150 (from training to validation)</em></p>
</li>
<li><p>Precision: <em>0.9811–0.6474 (from training to validation)</em></p>
</li>
</ul>
<p>While the Adam optimizer outperformed SGD, the model exhibited significant overfitting: Recall fell by about 37 points and Precision by about 33 points between training and validation.</p>
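<p>The size of the train-to-validation gap can be checked directly from the figures reported above. The following is a minimal sketch; the helper name <code>generalization_gap</code> is hypothetical and not part of the original code:</p>
<pre><code class="lang-python"># Train / validation scores reported above for the custom Adam MLP
scores = {
    "recall": (0.9870, 0.6150),
    "precision": (0.9811, 0.6474),
}


def generalization_gap(train_score, val_score):
    """Return the train-to-validation drop in percentage points (hypothetical helper)."""
    return round((train_score - val_score) * 100, 2)


gaps = {name: generalization_gap(*pair) for name, pair in scores.items()}
print(gaps)
</code></pre>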
<p><strong>Loss History</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442341394/3183a9b1-5df0-4f74-9473-6b5b595dc9c0.png" alt="Left: loss by epoch, middle: weights history by epoch, right: bias history by epoch (source: Kuriko Iwai)" class="image--center mx-auto" width="1676" height="456" loading="lazy"></p>
<p>We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748442311514/34f004c9-bf1d-41e5-a0af-08c62802b78c.png" alt="Decision Boundary of MLP with Adam Optimizer (source: Kuriko Iwai)" class="image--center mx-auto" width="1770" height="916" loading="lazy"></p>
<h3 id="heading-leverage-sckitlearns-mcp-classifier-1">Leverage Scikit-learn’s MLPClassifier</h3>
<p>We’ve switched the optimizer from SGD to Adam, keeping all other settings constant:</p>
<pre><code class="lang-python">model_sklearn_mlp_adam = MLPClassifier(
    hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
    activation=<span class="hljs-string">'relu'</span>,
    solver=<span class="hljs-string">'adam'</span>,             <span class="hljs-comment"># update the optimizer from SGD to Adam</span>
    learning_rate_init=<span class="hljs-number">0.001</span>,
    learning_rate=<span class="hljs-string">'constant'</span>,
    alpha=<span class="hljs-number">0.0001</span>,
    max_iter=<span class="hljs-number">3000</span>,
    batch_size=<span class="hljs-number">16</span>,
    random_state=<span class="hljs-number">42</span>,
    early_stopping=<span class="hljs-literal">True</span>,
    n_iter_no_change=<span class="hljs-number">50</span>,
    validation_fraction=<span class="hljs-number">0.1</span>,
    tol=<span class="hljs-number">1e-4</span>,
    verbose=<span class="hljs-literal">False</span>,
)

model_sklearn_mlp_adam.fit(X_train_processed, y_train)

y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)
</code></pre>
<h3 id="heading-results-6">Results</h3>
<ul>
<li><p><em>Recall: 0.8975–0.6400 (from training to validation)</em></p>
</li>
<li><p><em>Precision: 0.8864–0.6305 (from training to validation)</em></p>
</li>
</ul>
<p>Despite a performance improvement compared to the SGD optimizer, the significant drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation data indicates that the model is still overfitting.</p>
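<p>The Recall and Precision figures reported in these sections are derived from the predicted and true labels. As a dependency-free sketch of the two definitions (the function name and the toy labels are made up for illustration; in practice, <code>sklearn.metrics</code> provides <code>precision_score</code> and <code>recall_score</code>):</p>
<pre><code class="lang-python">def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels for illustration only, not the article's dataset
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 0, 1, 1, 0, 1]
precision, recall = precision_recall(toy_true, toy_pred)
print(precision, recall)
</code></pre>
<p>Running the same function on the training and validation predictions gives the pair of numbers compared in each Results list.</p>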
<h3 id="heading-leverage-keras-sequential-classifier-1">Leverage Keras Sequential Classifier</h3>
<p>Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
<span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
<span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
<span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> Adam
<span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
<span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight


initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])
model_keras_adam = Sequential([
    Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
    Dropout(<span class="hljs-number">0.1</span>),
    Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, 
          bias_initializer=tf.keras.initializers.Constant(initial_bias))
])


optimizer_keras = Adam(learning_rate=<span class="hljs-number">0.001</span>)
model_keras_adam.compile(
    optimizer=optimizer_keras, 
    loss=<span class="hljs-string">'binary_crossentropy'</span>, 
    metrics=[
        <span class="hljs-string">'accuracy'</span>,
        tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
        tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
        tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
    ]
)

early_stopping_callback = EarlyStopping(
    monitor=<span class="hljs-string">'val_recall'</span>,
    mode=<span class="hljs-string">'max'</span>,
    patience=<span class="hljs-number">50</span>,
    min_delta=<span class="hljs-number">1e-4</span>,
    verbose=<span class="hljs-number">0</span>
)

class_weights = class_weight.compute_class_weight(
    class_weight=<span class="hljs-string">'balanced'</span>,
    classes=np.unique(y_train),
    y=y_train
)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

model_keras_adam.fit(
    X_train_processed, y_train,
    epochs=<span class="hljs-number">1000</span>,
    batch_size=<span class="hljs-number">32</span>,
    validation_data=(X_val_processed, y_val),
    callbacks=[early_stopping_callback],
    class_weight=class_weights_dict,
    verbose=<span class="hljs-number">0</span>
)


loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Train) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)


loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
print(<span class="hljs-string">f"\n--- Keras Model Accuracy (Validation) ---"</span>)
print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)


model_keras_adam.summary()
</code></pre>
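<p>Two settings in the block above follow simple formulas: the <code>'balanced'</code> class weights and the initial bias of the output layer. The sketch below reproduces both in plain Python so the arithmetic is visible. The function names and the toy labels are assumptions for illustration, not part of the original code:</p>
<pre><code class="lang-python">import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


def initial_output_bias(labels):
    """Log-odds of the positive class: log(positives / negatives)."""
    counts = Counter(labels)
    return math.log(counts[1] / counts[0])


toy_labels = [0] * 8 + [1] * 2  # made-up labels, not the article's dataset
weights = balanced_class_weights(toy_labels)
bias = initial_output_bias(toy_labels)
print(weights, round(bias, 4))
</code></pre>
<p>With this bias, the sigmoid output of the untrained network starts at the positive-class base rate (here 2 out of 10), while the class weights make errors on the rarer class count more heavily in the loss.</p>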
<h3 id="heading-results-7">Results</h3>
<ul>
<li><p><em>Recall: 0.7995–0.7500 (from training to validation)</em></p>
</li>
<li><p><em>Precision: 0.8409–0.8065 (from training to validation)</em></p>
</li>
</ul>
<p>The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).</p>
<p>This indicates good generalization, with only minor performance degradation on unseen data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748441767800/fe43f181-4323-461f-b56a-125fc78e9c84.png" alt="Image: Keras Sequential Model with Adam Optimizer (Source: Kuriko Iwai)" class="image--center mx-auto" width="1484" height="542" loading="lazy"></p>
<h2 id="heading-final-results-generalization">Final Results: Generalization</h2>
<p>Finally, we’ll evaluate the model’s ultimate performance on the test dataset, which has remained completely separate from all prior training and validation processes.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Custom classifiers</span>
y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

<span class="hljs-comment"># MLPClassifier</span>
y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)

<span class="hljs-comment"># Keras Sequential</span>
_, accuracy_test_sgd, precision_test_sgd, recall_test_sgd, auc_test_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=<span class="hljs-number">0</span>)
_, accuracy_test_adam, precision_test_adam, recall_test_adam, auc_test_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=<span class="hljs-number">0</span>)
</code></pre>
<p>Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an <strong>AUPRC (Area Under Precision-Recall Curve) of 0.72.</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748874699534/f0f008c4-9067-4e2a-b070-4bb5cbae8f23.png" alt="Precision-Recall Curves for Six Classifier Models: Comparing Custom, MLP, and Keras Sequential Classifiers with SGD and Adam Optimizers (Source: Kuriko Iwai)" class="image--center mx-auto" width="2160" height="426" loading="lazy"></p>
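<p>AUPRC summarizes a precision-recall curve as a single number. One common estimator is average precision, sketched below in plain Python. This is an assumption about how such a score can be computed, not necessarily the exact method used for the figure above, and the toy scores are made up:</p>
<pre><code class="lang-python">def average_precision(y_true, y_score):
    """Average precision: mean of the precision values at the rank of each positive.

    Assumes binary labels (0/1) and no tied scores.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(y_true)
    if n_positive == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / n_positive


# Toy predicted probabilities for illustration only
toy_true = [1, 0, 1, 1, 0]
toy_score = [0.9, 0.8, 0.7, 0.6, 0.5]
ap = average_precision(toy_true, toy_score)
print(round(ap, 4))
</code></pre>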
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.</p>
<p>Our findings underscore that effective machine learning hinges on three critical factors:</p>
<ol>
<li><p><strong>robust data preprocessing</strong> (tailored to objectives and data distribution),</p>
</li>
<li><p><strong>judicious model selection</strong>, and</p>
</li>
<li><p><strong>strategic framework or library choices</strong>.</p>
</li>
</ol>
<h3 id="heading-choosing-the-right-framework"><strong>Choosing the right framework</strong></h3>
<p>Generally speaking, choose <code>MLPClassifier</code> when:</p>
<ul>
<li><p>You’re primarily working with <strong>tabular data,</strong></p>
</li>
<li><p>You want to prioritize <strong>simplicity, quick iteration, and seamless integration,</strong></p>
</li>
<li><p>You have simple, shallow architectures, and</p>
</li>
<li><p>You have a moderate dataset size (manageable on a CPU).</p>
</li>
</ul>
<p>Choose Keras <code>Sequential</code> when:</p>
<ul>
<li><p>You’re dealing with <strong>image, text, audio, or sequential data,</strong></p>
</li>
<li><p>You’re building <strong>deep learning models</strong> such as CNNs, RNNs, LSTMs,</p>
</li>
<li><p>You need <strong>fine-grained control</strong> over the model architecture, training process, or custom components,</p>
</li>
<li><p>You need to leverage <strong>GPU acceleration</strong>,</p>
</li>
<li><p>You’re planning for <strong>production deployment</strong>, and</p>
</li>
<li><p>You want to experiment with more advanced deep learning techniques.</p>
</li>
</ul>
<h3 id="heading-limitation-of-mlps">Limitations of MLPs</h3>
<p>While Multilayer Perceptrons (MLPs) proved valuable, their computational cost and susceptibility to overfitting emerged as key challenges.</p>
<p>Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.</p>
<p>You can find more info about me on my <a target="_blank" href="https://kuriko.vercel.app/">Portfolio</a> / <a target="_blank" href="https://www.linkedin.com/in/k-i-i">LinkedIn</a> / <a target="_blank" href="https://github.com/versionhq/multi-agent-system">GitHub</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Create a DeepSeek R1 API in R with Plumber ]]>
                </title>
                <description>
                    <![CDATA[ To create an AI chatbot and integrate it with another platform, you have to communicate with a large language model using an API. This API receives prompts from the client and sends them to the model to generate answers. In this tutorial, you will lear... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-create-a-deepseek-r1-api-in-r-with-plumber/</link>
                <guid isPermaLink="false">67b7b91a534c03e678009324</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ APIs ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Adejumo Ridwan Suleiman ]]>
                </dc:creator>
                <pubDate>Thu, 20 Feb 2025 23:22:01 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740093558918/453118b9-3c93-4e57-a1ad-7471e1046ef1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>To create an AI chatbot and integrate it with another platform, you have to communicate with a large language model using an API. This API receives prompts from the client and sends them to the model to generate answers.</p>
<p>In this tutorial, you will learn how to create such an API using the DeepSeek R1 large language model so external applications can call it. We will use the <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-R1">DeepSeek R1 model</a>, available on HuggingFace, and the Plumber R package to deploy it as an API.</p>
<p>HuggingFace is an open-source platform for building, training, and deploying machine learning models, while Plumber is an R package that exposes R code as RESTful APIs accessible to other applications through HTTP requests.</p>
<p>With this API, you can:</p>
<ul>
<li><p>Build AI applications</p>
</li>
<li><p>Connect to external data and extract meaningful insights</p>
</li>
<li><p>Integrate into existing applications to provide customer support, create documentation, and so on.</p>
</li>
</ul>
<h2 id="heading-what-is-the-deepseek-r1-model">What is the DeepSeek R1 Model?</h2>
<p>DeepSeek R1 is the latest large language model from the Chinese company <a target="_blank" href="https://www.deepseek.com/">DeepSeek</a>. It was designed to enhance the problem-solving and analytic capabilities of AI systems.</p>
<p>DeepSeek-R1 uses reinforcement learning and supervised fine-tuning to handle complex reasoning tasks. Unlike proprietary models, DeepSeek R1 is open-source and free to use.</p>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>Sign up for a <a target="_blank" href="https://huggingface.co/">HuggingFace account</a> if you don’t already have one</p>
</li>
<li><p>Install <a target="_blank" href="https://posit.co/download/rstudio-desktop/">R and RStudio</a></p>
</li>
<li><p>Install the <a target="_blank" href="https://www.rplumber.io/"><code>plumber</code></a> R package to build the API endpoint</p>
</li>
<li><p>Install the <a target="_blank" href="https://httr2.r-lib.org/"><code>httr2</code></a> R package to work with HTTP requests and interact with the Hugging Face API</p>
</li>
</ul>
<h2 id="heading-step-1-create-your-project-repository">Step 1: Create Your Project Repository</h2>
<p>To build an API application in R, start by creating an R project. This ensures that all the files needed to keep your API working are kept together under the same directory. RStudio already has a template provided for API projects, so you can follow the steps below to create yours.</p>
<p>In your RStudio IDE, click on the File menu and go to New Project to open the New Project Wizard. Once in the wizard, select New Directory, then click New Plumber API Project. In the directory name field, give it a name (for example <code>DeepSeek-R1 API</code>), and then click on Create Project.</p>
<p>You will see a file called <code>plumber.R</code> with a sample API template. This is where you’ll write the code to connect to the DeepSeek R1 model on HuggingFace. Make sure that you clear this template before proceeding.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738503866976/60b959cd-b564-486d-8b65-c9ca0278e239.gif" alt="GIF showing how to create a new Plumber project in R" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Next, go to your terminal and create a <code>.env</code> file. This is where you will store the Hugging Face API key.</p>
<pre><code class="lang-plaintext">touch .env
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738504109388/6ce9bda3-305a-4f2e-87b8-adbe4c245861.png" alt="Image showing how to create a .env variable on the terminal" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Create a <code>.gitignore</code> file and add the <code>.env</code> file to it. This ensures that sensitive information like access tokens and API keys is not pushed to your Git repository.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738504889229/0d433bcb-2a7d-4379-a0c7-e09fb53e288f.png" alt="Image showing the .env file in the .gitignore file" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<h2 id="heading-step-2-create-a-hugging-face-access-token">Step 2: Create a Hugging Face Access Token</h2>
<p>We need to create an access token to connect to Hugging Face models. Go to your profile, click Settings, and click Create New Token to create your access token for the Hugging Face repository.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738504360986/077a2778-d790-4ff9-94e1-c2c372b2efef.png" alt="Image showing the access tokens page, with options to create a new token " class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Copy the access token and paste it into your <code>.env</code> file, and give it the name <code>HUGGINGFACE_ACCESS_TOKEN</code>.</p>
<pre><code class="lang-plaintext">HUGGINGFACE_ACCESS_TOKEN="&lt;your-access-token&gt;"
</code></pre>
<p>Next, install the <code>dotenv</code> package and paste the following code at the top of your <code>plumber.R</code> file:</p>
<pre><code class="lang-r"><span class="hljs-comment"># Load environment variables from .env</span>
dotenv::load_dot_env()
</code></pre>
<p><code>dotenv::load_dot_env()</code> loads all environment variables in the <code>.env</code> file, making them available to the <code>plumber.R</code> script.</p>
<h2 id="heading-step-3-build-the-deepseek-api-endpoint">Step 3: Build the DeepSeek API Endpoint</h2>
<p>Now that we have our project environment set up and API token ready, we’ll write the code to build the API application by connecting to the DeepSeek R1 model on HuggingFace.</p>
<p>Go to the <code>plumber.R</code> file and load the following libraries:</p>
<pre><code class="lang-r"><span class="hljs-keyword">library</span>(plumber)
<span class="hljs-keyword">library</span>(httr2)
</code></pre>
<p>Copy and paste the following code into <code>plumber.R</code>:</p>
<pre><code class="lang-r">
api_key &lt;- Sys.getenv(<span class="hljs-string">"HUGGINGFACE_ACCESS_TOKEN"</span>)



<span class="hljs-comment">#* @post /deepseek_chat</span>
<span class="hljs-keyword">function</span>(prompt) {
  url &lt;- <span class="hljs-string">"https://huggingface.co/api/inference-proxy/together/v1/chat/completions"</span>

  <span class="hljs-comment"># Create a request object</span>
  req &lt;- request(url) |&gt;
    req_auth_bearer_token(api_key) |&gt;
    req_body_json(list(
      model = <span class="hljs-string">"deepseek-ai/DeepSeek-R1"</span>,
      messages = list(
        list(role = <span class="hljs-string">"user"</span>, content = prompt)
      ),
      max_tokens = <span class="hljs-number">500</span>,
      stream = <span class="hljs-literal">FALSE</span>
    ))

  <span class="hljs-comment"># Perform the request and capture the response</span>
  res &lt;- req_perform(req)

  <span class="hljs-comment"># Parse the JSON response</span>
  parsed_data &lt;- res |&gt;
    resp_body_json()

  <span class="hljs-comment"># Extract the content from the response</span>
  content &lt;- parsed_data$choices
  <span class="hljs-keyword">return</span>(content)
}
</code></pre>
<p>Here’s what’s going on in the above code:</p>
<ul>
<li><p><code>Sys.getenv</code> gets the HuggingFace access token and stores it in the variable <code>api_key</code>.</p>
</li>
<li><p>The <code>url</code> variable contains the API link to access the DeepSeek model on HuggingFace. You can get this by searching the model name <code>deepseek-ai/DeepSeek-R1</code> on HuggingFace. Go to the <strong>View Code</strong> button, and under the <strong>cURL</strong> tab, copy the API URL</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739177037117/0781bce2-7bf8-411d-ad71-cb2bf11fe1bb.gif" alt="GIF showing how to copy the API url you are going to use in your plumber API code" width="600" height="400" loading="lazy"></p>
</li>
<li><p><code>#* @post /deepseek_chat</code> means that the endpoint accepts POST requests at the path <code>/deepseek_chat</code>.</p>
</li>
<li><p>This endpoint takes one argument, <code>prompt</code>: the text or question supplied by the user.</p>
</li>
<li><p>The <code>req</code> object is a chain of operations: it makes a <code>request()</code> to the <code>url</code>, then passes the <code>api_key</code> to the <code>req_auth_bearer_token()</code> function. Request properties such as the <code>model</code> name, the message <code>role</code> and <code>content</code> (the <code>prompt</code>), and <code>max_tokens</code> are added to the <code>req</code> object through the <code>req_body_json()</code> function.</p>
</li>
<li><p><code>req_auth_bearer_token()</code> adds the <code>Authorization</code> header required to make a request to the HuggingFace API, so no separate headers variable is needed.</p>
</li>
<li><p>The request is performed and captured in a response object <code>res</code> using the <code>req_perform()</code> function.</p>
</li>
<li><p>The <code>res</code> object holds a JSON response, which is parsed into an R list using the <code>resp_body_json()</code> function.</p>
</li>
<li><p>The <code>choices</code> element of <code>parsed_data</code> is stored in <code>content</code> and returned, so the application calling the API can extract the information it needs.</p>
</li>
</ul>
<h2 id="heading-step-4-test-the-api-endpoint">Step 4: Test the API Endpoint</h2>
<p>Let’s run the API endpoint to see how the application performs. Click on Run API. This will automatically open the API documentation in your browser at the URL <a target="_blank" href="http://127.0.0.1:8634/docs/"><code>http://127.0.0.1:8634/docs/</code></a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739282692303/82a029ea-31f5-4088-9e72-2fe1b69d0f7d.png" alt="Image showing the Run API button" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
<p>Click on the API endpoint dropdown, provide a prompt, and click the Execute button. You should receive a reply in a few minutes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739282620577/b1a52679-b397-4d82-af56-0f81ebc5888e.gif" alt="Image showing how the API endpoint returns a response when a prompt is given" class="image--center mx-auto" width="600" height="400" loading="lazy"></p>
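<p>Outside the browser, any HTTP client can call the endpoint. The following is a minimal sketch of a Python client using only the standard library. It assumes the API is running locally on the port shown above (yours may differ) and that Plumber's default JSON body parsing maps the <code>prompt</code> field to the function argument. The helper names <code>build_chat_request</code> and <code>ask</code> are hypothetical:</p>
<pre><code class="lang-python">import json
import urllib.request

# Assumption: the API runs locally on this port; adjust it to match your session.
BASE_URL = "http://127.0.0.1:8634"


def build_chat_request(prompt, base_url=BASE_URL):
    """Build a POST request for the /deepseek_chat endpoint with a JSON body."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/deepseek_chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url=BASE_URL, timeout=300):
    """Send the prompt and return the decoded JSON reply (the API must be running)."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network; call ask(...) to actually send it.
request = build_chat_request("What is R Plumber?")
print(request.get_method(), request.full_url)
</code></pre>
<p>The reply mirrors the <code>choices</code> list returned by the endpoint, so the client still has to pick out the message text it needs.</p>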
<h2 id="heading-conclusion">Conclusion</h2>
<p>With your API, you can send inference requests to the Hugging Face model and build AI applications in R or other programming languages. To make the API accessible to clients online, you need to host it. There are various <a target="_blank" href="https://www.rplumber.io/articles/hosting.html">ways of hosting an R Plumber application</a>: you can use Docker or host it on DigitalOcean using the plumberDeploy R package. However, the simplest way is to use <a target="_blank" href="https://posit.co/products/enterprise/connect/">Posit Connect</a>.</p>
<p>You can use the same approach to try out other HuggingFace models, for example to build an API that generates images or translates between languages. R Plumber is easy to use, and the documentation provides many resources.</p>
<p>If you are interested in model deployment using R Plumber, you can check out <a target="_blank" href="https://learndata.xyz/posts/forecasting%20time%20series%20data%20with%20facebook%20prophet/">this article</a> on how to deploy a Time Series model built on Prophet using R Plumber.</p>
<p>If you find this article interesting, please check my other articles on <a target="_blank" href="https://learndata.xyz/blog">learndata.xyz</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Understanding Deep Learning Research Tutorial - Theory, Code, and Math ]]>
                </title>
                <description>
                    <![CDATA[ Understanding deep learning research can often feel like unraveling a dense and intricate puzzle. From decoding mathematical notation to navigating complex code bases, the process can be daunting, especially for newcomers. But with the right guidance... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/understanding-deep-learning-research-tutorial-theory-code-and-math/</link>
                <guid isPermaLink="false">678921f0a510ee899ead3f7c</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 16 Jan 2025 15:12:48 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737040354258/4ad88afd-82ee-4b59-bc6a-cdc9a5537c59.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Understanding deep learning research can often feel like unraveling a dense and intricate puzzle. From decoding mathematical notation to navigating complex code bases, the process can be daunting, especially for newcomers. But with the right guidance, you can build the skills necessary to break down cutting-edge AI research and make it accessible.</p>
<p>We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to read, understand, and implement deep learning research. Taught by Yacine, a published researcher and machine learning practitioner, this tutorial provides a step-by-step approach to mastering essential skills like interpreting technical papers, understanding advanced mathematics, and navigating research codebases. With practical examples and a focus on recent AI papers, this course empowers you to confidently engage with the latest developments in machine learning.</p>
<h3 id="heading-what-youll-learn-in-this-course">What You’ll Learn in This Course</h3>
<p>The course is structured to address the key challenges that aspiring researchers and practitioners face when diving into deep learning:</p>
<h4 id="heading-1-how-to-read-research-papers"><strong>1. How to Read Research Papers</strong></h4>
<p>This section provides a comprehensive framework for effectively breaking down research papers:</p>
<ul>
<li><p>Learn how to <strong>get external context</strong> and perform an initial casual read to grasp the paper’s main idea.</p>
</li>
<li><p>Dive deeper into <strong>filling knowledge gaps</strong> and achieving conceptual understanding.</p>
</li>
<li><p>Explore how to conduct a <strong>code deep dive</strong> and meticulously analyze the paper’s methods and results.</p>
</li>
<li><p>Develop strategies to identify and address <strong>weird gaps</strong> or inconsistencies in the paper.</p>
</li>
</ul>
<h4 id="heading-2-understanding-deep-learning-math"><strong>2. Understanding Deep Learning Math</strong></h4>
<p>Many papers rely heavily on mathematical notation, which can be intimidating. Yacine simplifies this process by teaching:</p>
<ul>
<li><p>Techniques to <strong>relax and approach formulas systematically</strong>.</p>
</li>
<li><p>How to <strong>translate symbols into meaning</strong> and build intuition around complex equations (e.g., the QHAdam optimizer).</p>
</li>
<li><p>Methods to summarize mathematical insights for practical application.</p>
</li>
</ul>
<h4 id="heading-3-learning-math-efficiently"><strong>3. Learning Math Efficiently</strong></h4>
<p>Mastering the mathematics behind deep learning doesn’t have to be overwhelming. This section focuses on:</p>
<ul>
<li><p>Selecting the <strong>right subfields of math</strong> to study based on your goals.</p>
</li>
<li><p>Leveraging <strong>exercise-rich resources</strong> to reinforce learning.</p>
</li>
<li><p>Using the <strong>Green-Yellow-Red method</strong> to identify strengths and weaknesses.</p>
</li>
<li><p>Fixing gaps in understanding through targeted study of theory.</p>
</li>
</ul>
<h4 id="heading-4-navigating-deep-learning-codebases"><strong>4. Navigating Deep Learning Codebases</strong></h4>
<p>Research codebases are often sprawling and complex. Yacine walks you through:</p>
<ul>
<li><p>How to <strong>map the structure of a codebase</strong> after reading the related research paper.</p>
</li>
<li><p>Strategies to <strong>run, debug, and understand the code</strong>.</p>
</li>
<li><p>Methods to elucidate unclear components and take detailed notes for clarity.</p>
</li>
</ul>
<h4 id="heading-5-segment-anything-model-sam-deep-dive"><strong>5. Segment Anything Model (SAM) Deep Dive</strong></h4>
<p>The course culminates in an in-depth exploration of the Segment Anything Model (SAM), a groundbreaking approach to segmentation in computer vision. You’ll learn about:</p>
<ul>
<li><p>The <strong>task and testing</strong> process for SAM.</p>
</li>
<li><p>Its <strong>theoretical underpinnings</strong> and key model components, including the image encoder, prompt encoder, and mask decoder.</p>
</li>
<li><p>How the <strong>data pipeline and engine</strong> are structured.</p>
</li>
<li><p>Insights into SAM’s <strong>zero-shot results</strong> and limitations.</p>
</li>
</ul>
<h3 id="heading-why-this-course">Why This Course</h3>
<p>Whether you're a beginner curious about deep learning or an experienced developer aiming to engage with AI research, this course equips you with practical tools and methodologies to demystify deep learning research. By combining theory, hands-on practice, and real-world examples, Yacine ensures that you’ll walk away with actionable insights and confidence in your ability to tackle even the most complex papers.</p>
<p>Check out the <a target="_blank" href="https://youtu.be/onU5Hbb3qao">Deep Learning Research Tutorial</a> now on the freeCodeCamp.org YouTube channel (2-hour course).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/onU5Hbb3qao" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How Do Generative Models Work in Deep Learning? Generative Models For Data Augmentation Explained ]]>
                </title>
                <description>
                    <![CDATA[ Data is at the heart of model training in the world of deep learning. The quantity and quality of training data determine the effectiveness of machine learning algorithms. On the other hand, obtaining massive amounts of precisely categorized data is ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/generative-models-for-data-augmentation/</link>
                <guid isPermaLink="false">66d4608ac7632f8bfbf1e469</guid>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oyedele Tioluwani ]]>
                </dc:creator>
                <pubDate>Fri, 26 Jul 2024 12:22:23 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/07/glenn-carstens-peters-1F4MukO0UNg-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Data is at the heart of model training in the world of deep learning. The quantity and quality of training data determine the effectiveness of machine learning algorithms.</p>
<p>However, obtaining massive amounts of accurately labeled data is difficult and resource-intensive. This is where data augmentation comes in as an appealing solution, with generative models at its forefront.</p>
<p>In this article, we'll look at the role of generative models, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), in data augmentation for deep learning.</p>
<h2 id="heading-what-are-generative-models">What are Generative Models?</h2>
<p>Generative models are a type of machine learning model that creates new data samples resembling those in a given dataset. They learn the underlying patterns and structure of the data, which allows them to generate synthetic data points that look like the real ones.</p>
<p>These models are used in a variety of applications, such as image generation, text generation, data augmentation, and others. For example, in an image generation project, a generative model could be trained on images of cats and dogs to learn how to generate new images of cats and dogs.</p>
<p>They learn patterns and styles from existing data and apply that knowledge to create similar things. It’s like your computer having a creative engine that generates fresh ideas after studying the techniques used in earlier ones.</p>
<h2 id="heading-what-is-data-augmentation">What is Data Augmentation?</h2>
<p>Data augmentation is a machine learning and deep learning technique that applies transformations and adjustments to existing data to improve the quality and quantity of a training dataset. In practice, this means generating new data samples from existing ones to expand the size and diversity of the dataset.</p>
<p>The basic purpose of data augmentation is to improve a machine learning model’s performance, generalization, and robustness, notably in computer vision tasks and other data-driven areas.</p>
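<p>To make the idea of "transformations applied to existing data" concrete, here is a minimal sketch in plain Python. It is an illustration only: the function names, the toy image, and the choice of transformations (a horizontal flip and small random noise) are assumptions made for this example, not something prescribed by a particular library.</p>

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2D image (represented as a list of rows)."""
    return [list(reversed(row)) for row in image]

def add_noise(image, scale=0.05, rng=None):
    """Add small uniform noise to every pixel, clamped to the range [0, 1]."""
    rng = rng or random.Random(0)
    return [[min(1.0, max(0.0, px + rng.uniform(-scale, scale))) for px in row]
            for row in image]

def augment(dataset, rng=None):
    """Return each original image plus one flipped and one noisy copy."""
    rng = rng or random.Random(0)
    augmented = []
    for image in dataset:
        augmented.append(image)
        augmented.append(horizontal_flip(image))
        augmented.append(add_noise(image, rng=rng))
    return augmented

# A hypothetical 2x3 grayscale "image" with pixel values in [0, 1].
toy_image = [[0.0, 0.5, 1.0],
             [0.2, 0.4, 0.6]]
augmented_set = augment([toy_image])
print(len(augmented_set))  # one original image becomes three training samples
```

<p>Classical transformations like these only rearrange or perturb existing samples. Generative models go a step further by learning the data distribution and sampling entirely new examples from it.</p>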
<p>Data augmentation can be used to improve datasets for a wide range of machine-learning applications, such as image classification, object detection, and natural language processing. Data augmentation, for example, can be used to create synthetic photos of faces, which can then be used to train a deep-learning model to detect faces in real-world images.</p>
<p>Data augmentation is an important method in the data world because it addresses the underlying concerns of data quantity and quality. Access to large amounts of diverse, well-labeled data is required for building strong and accurate models in many machine learning and deep learning applications.</p>
<p>Data augmentation is a beneficial method for expanding limited datasets by creating new samples, which improves model generalization and performance. Furthermore, it improves the ability of machine learning algorithms to manage real-world fluctuations, resulting in more trustworthy and flexible AI systems.</p>
<h2 id="heading-why-use-generative-models-for-data-augmentation">Why Use Generative Models for Data Augmentation?</h2>
<p>There are several reasons why generative models are employed for data augmentation in machine learning:</p>
<ol>
<li><p><strong>Increased Data Diversity:</strong> Generative models can help boost dataset variety, making machine learning models more resilient to real-world fluctuations. A generative model could be used to generate synthetic images of faces with various expressions, ages, and ethnicities. This could help a machine learning model learn to detect faces more reliably in a wide range of real-world scenarios.</p>
</li>
<li><p><strong>Improved Model Generalization:</strong> Using generative models to augment data exposes machine learning models to a broader range of data variations during training. This improves the model’s ability to generalize to new, previously unseen data, as well as its overall performance. This is particularly relevant for deep learning models, which require large volumes of data to train adequately.</p>
</li>
<li><p><strong>Overcoming Data Scarcity:</strong> Obtaining a large and diverse labeled dataset can be a substantial challenge in many machine learning applications. By generating synthetic data, generative models can ease data scarcity and lower reliance on limited real data.</p>
</li>
<li><p><strong>Reduction of Bias:</strong> By generating new data samples for underrepresented categories, generative models can help reduce bias in training data and improve balance in AI applications.</p>
</li>
</ol>
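<p>As a rough illustration of the bias-reduction point, the sketch below counts how many synthetic samples each class would need to match the largest class. The class names and counts are made up for this example, and actually producing the samples would be the job of a trained generative model, which is not shown here.</p>

```python
from collections import Counter

def synthetic_samples_needed(labels):
    """For each class, count how many synthetic samples would bring it
    up to the size of the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - count for label, count in counts.items()}

# Hypothetical, imbalanced label list.
labels = ["cat"] * 90 + ["dog"] * 60 + ["fox"] * 10
print(synthetic_samples_needed(labels))  # {'cat': 0, 'dog': 30, 'fox': 80}
```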
<h2 id="heading-generative-models-for-data-augmentation">Generative Models for Data Augmentation</h2>
<p>Two main types of generative models can be used for data augmentation:</p>
<ul>
<li><p>Generative Adversarial Networks (GANs)</p>
</li>
<li><p>Variational AutoEncoders (VAEs)</p>
</li>
</ul>
<h3 id="heading-generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>
<p>GANs are neural network architectures used to create new data samples that resemble the training data. In other words, they learn to construct new items that appear to be drawn from a given dataset. For example, GANs can be trained on a set of photos and then used to produce new images that look like they came from the original set.</p>
<p>Here’s a short explanation of how GANs work:</p>
<ul>
<li><p>A new data sample is generated by the generator. The discriminator is provided with both new and real data samples.</p>
</li>
<li><p>The discriminator attempts to determine which samples are real and which are fabricated.</p>
</li>
<li><p>The output of the discriminator is used to update both the generator and the discriminator.</p>
</li>
</ul>
<p>The generator creates a synthetic image by taking random noise as input. The discriminator tries to correctly classify both the generator’s fake image and a real image from the training set.</p>
<p>The generator adjusts its parameters to produce a more convincing fake image that can fool the discriminator. The discriminator, in turn, adjusts its parameters to better distinguish real images from fake ones. The two networks continue to compete and improve until the generator produces data that closely resembles real data.</p>
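<p>The training loop described above can be sketched with a deliberately tiny, one-dimensional GAN written in plain Python. Everything here is an assumption made for illustration: the generator is a linear function <code>G(z) = a * z + b</code>, the discriminator is a logistic classifier <code>D(x) = sigmoid(w * x + c)</code>, the "real" data comes from a normal distribution centered at 4, and the gradients are derived by hand. Real GANs use deep networks and an automatic differentiation framework, and even toy setups like this one can oscillate rather than settle, so treat it as a way to see the alternating updates and not as a recipe.</p>

```python
import math
import random

def sigmoid(t):
    """Logistic function, with the input clamped to avoid overflow."""
    t = max(-60.0, min(60.0, t))
    return 1.0 / (1.0 + math.exp(-t))

def train_toy_gan(steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters: G(z) = a * z + b
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = rng.gauss(4.0, 0.5)  # one sample of "real" data
        z = rng.gauss(0.0, 1.0)     # random noise fed to the generator
        fake = a * z + b            # the generator's synthetic sample

        # Discriminator step: minimize -[log D(real) + log(1 - D(fake))].
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        grad_w = -(1.0 - d_real) * real + d_fake * fake
        grad_c = -(1.0 - d_real) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimize -log D(fake), i.e. try to fool D.
        d_fake = sigmoid(w * fake + c)
        grad_fake = -(1.0 - d_fake) * w  # gradient w.r.t. the fake sample
        a -= lr * grad_fake * z
        b -= lr * grad_fake
    return a, b, w, c

a, b, w, c = train_toy_gan()
print("generator parameters:", a, b)
print("discriminator parameters:", w, c)
```

<p>Each iteration follows the three steps listed earlier: the generator produces a sample, the discriminator scores a real and a fake sample, and the discriminator’s output drives the update of both sets of parameters.</p>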
<p>GANs are well suited to data augmentation because a well-trained generator can produce synthetic samples that are hard to distinguish from real ones. This matters because machine learning models learn from data, and more training data generally leads to better performance. Collecting enough real-world data, however, can be costly and time-consuming.</p>
<p>GANs can help reduce the cost and time required to collect data by producing synthetic data that is similar to real-world data. This is especially beneficial for applications where collecting real-world data is difficult or expensive, such as medical imaging or video surveillance.</p>
<p>GANs are also useful for the variety they add. They can produce data samples that did not exist in the original dataset, which can help make machine learning models more robust to real-world variations.</p>
<h3 id="heading-variational-autoencoders-vaes">Variational AutoEncoders (VAEs)</h3>
<p>VAEs are a variation of autoencoders used in machine learning and deep learning. Like GANs, they are generative models that can produce new data samples resembling the data on which they were trained.</p>
<p>VAEs are probabilistic models grounded in variational Bayesian inference, which means they use probability distributions to represent uncertainty in the data. This lets them generate varied samples by drawing from a learned distribution rather than reproducing a single fixed output.</p>
<p>VAEs work by learning a representation of the data in a latent space. The latent space is a compressed representation that captures the data’s most relevant qualities. By sampling from the latent space and decoding the samples back into the original data space, VAEs can then be used to produce new data samples.</p>
<p>Here’s a simple illustration of how a VAE works:</p>
<ul>
<li><p>As input, the encoder receives a data sample, such as an image of an animal.</p>
</li>
<li><p>The encoder generates a latent space representation of the data, which is a compressed version of the image that captures the animal’s most relevant characteristics, such as shape, size, and fur color.</p>
</li>
<li><p>The latent space representation is fed into the decoder.</p>
</li>
<li><p>The decoder generates a reconstructed data sample, which is a new image of an animal that resembles the original image.</p>
</li>
</ul>
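<p>The encoder and decoder are trained together to minimize the difference between the reconstructed and original images. This is done with a loss function that measures how far the reconstruction is from the original, combined with a regularization term (the KL divergence) that keeps the latent space close to a simple prior distribution.</p>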
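<p>As a hedged sketch of what such a loss can look like, the plain-Python snippet below combines a reconstruction error with the KL divergence term used in the standard VAE formulation, and shows how a latent vector is sampled. The function names, the use of mean squared error, and the <code>beta</code> weight are choices made for this illustration, and the encoder and decoder networks themselves are left out.</p>

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def reconstruction_loss(original, reconstructed):
    """Mean squared error between the input and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

def kl_divergence(mu, log_var):
    """KL divergence between N(mu, sigma^2) and the standard normal prior."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def vae_loss(original, reconstructed, mu, log_var, beta=1.0):
    """Reconstruction error plus a weighted KL regularization term."""
    return (reconstruction_loss(original, reconstructed)
            + beta * kl_divergence(mu, log_var))

# Hypothetical values standing in for an encoder's and decoder's outputs.
original = [0.0, 1.0]
reconstructed = [0.0, 0.5]
mu, log_var = [0.0, 0.0], [0.0, 0.0]
z = sample_latent(mu, log_var, random.Random(0))
print(vae_loss(original, reconstructed, mu, log_var))  # 0.125
```

<p>With <code>mu</code> and <code>log_var</code> at zero, the encoder’s distribution already matches the prior, so the KL term vanishes and only the reconstruction error remains.</p>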
<p>VAEs are a strong generative modeling tool that can be used for image production, text generation, data compression, and data denoising. They provide a probabilistic framework for modeling and producing complex data distributions while preserving a structured latent space for data production and interpolation.</p>
<p>The ability to generate data that is similar to real-world data also makes VAEs a good fit for data augmentation. Because the generated samples follow the underlying data distribution, they can serve as realistic additions to a training set, which is required for effective data augmentation.</p>
<p>Each point in the structured latent space of a VAE represents a meaningful variation of the data. This enables controlled data creation. Users can build new data instances with specific attributes or variants by sampling different points in the latent space, which makes VAEs well suited to targeted data augmentation.</p>
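<p>One common way to explore a latent space is to interpolate between the latent vectors of two known samples. The sketch below does this with simple linear interpolation in plain Python. The function name and the example vectors are hypothetical, and in a real pipeline each interpolated vector would be passed to a trained decoder (not shown) to obtain a new sample.</p>

```python
def interpolate_latents(z_start, z_end, num_points):
    """Return num_points latent vectors evenly spaced from z_start to z_end."""
    if 2 > num_points:
        raise ValueError("num_points must be at least 2")
    points = []
    for i in range(num_points):
        t = i / (num_points - 1)  # goes from 0.0 to 1.0
        points.append([(1.0 - t) * s + t * e for s, e in zip(z_start, z_end)])
    return points

# Hypothetical 2D latent codes for two encoded images.
path = interpolate_latents([0.0, 0.0], [1.0, 2.0], 3)
print(path)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 2.0]]
```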
<p>VAEs can address data scarcity issues by generating synthetic data when real data is limited. This is particularly valuable in scenarios where collecting more real data is impractical or expensive.</p>
<p>As VAEs continue to improve, they will likely play an increasingly important role in training machine learning models.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Generative models play a significant part in the practice of data augmentation in machine learning.</p>
<p>For instance, GANs have been used to generate synthetic images of faces, which were then used to train machine learning models to detect faces in real-world images.</p>
<p>VAEs have also been used to create synthetic images of cars, which were then used to train machine learning models to recognize cars in real-world photographs.</p>
<p>These are all real-life applications of generative models in data augmentation.</p>
<p>I hope this article was helpful.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
