<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Rakshath Naik - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Rakshath Naik - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 11 May 2026 16:01:28 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/rakshath1/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Data Science Insights: Why the Mean Lies When Handling Messy Retail Data ]]>
                </title>
                <description>
                    <![CDATA[ In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on. Let's take the case of a retail shop. If we're looking at the average order value to u ]]>
                </description>
                <link>https://www.freecodecamp.org/news/data-science-insights-why-the-mean-lies-when-handling-messy-retail-data/</link>
                <guid isPermaLink="false">69fa21e5a386d7f121b5fe8c</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ statistics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rakshath Naik ]]>
                </dc:creator>
                <pubDate>Tue, 05 May 2026 16:59:17 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/4441dcfc-d100-4613-9937-9c62449c6780.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In our daily life, we use the word "average" all the time: average salary, average marks, average age, and so on.</p>
<p>Let's take the case of a retail shop. If we're looking at the average order value to understand customer spending, we'd load the data, run the code, and get a result of $20 per order.</p>
<p>Done.</p>
<p>Except something looks odd.</p>
<p>When we take a closer look, we see that most customers are buying items worth $8 to $15. So where's $20 coming from?</p>
<p>In that case, the problem isn’t the data – it’s the average. This is a classic textbook trap: the calculation works perfectly on tidy textbook examples, but real-world data doesn’t behave that nicely.</p>
<p>Some customers buy in bulk (very large orders), some return orders (negative quantities), and a few anomalies distort the entire picture.</p>
<p>In this article, we'll use the Online Retail Dataset to answer a simple but tricky question: What does “average” really mean in the real world?</p>
<h2 id="heading-table-of-contents">Table Of Contents</h2>
<ul>
<li><p><a href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-the-dataset">The Dataset</a></p>
</li>
<li><p><a href="#heading-mean-the-sensitive-giant">Mean: The Sensitive Giant</a></p>
</li>
<li><p><a href="#heading-median-the-robust-middle">Median: The Robust Middle</a></p>
</li>
<li><p><a href="#heading-beyond-averages-understanding-spread-with-quartiles">Beyond Averages: Understanding Spread with Quartiles</a></p>
</li>
<li><p><a href="#heading-applying-iqr-to-our-dataset">Applying IQR to Our Dataset</a></p>
</li>
<li><p><a href="#heading-final-comparison-and-insights">Final Comparison and Insights</a></p>
</li>
<li><p><a href="#heading-conclusion">Conclusion</a></p>
</li>
<li><p><a href="#heading-connect-with-me">Connect with me</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along here, you'll need:</p>
<p><strong>Basic Python knowledge:</strong> Understanding of variables and functions.</p>
<p><strong>The Pandas library:</strong> Familiarity with loading data and basic DataFrame operations.</p>
<p><strong>A development environment:</strong> Access to a tool like Jupyter Notebook, VS Code, or Google Colab.</p>
<p><strong>A Dataset:</strong> For this analysis, I used the Online Retail Dataset, which is available for download <a href="https://archive.ics.uci.edu/dataset/352/online+retail">here</a>.</p>
<h2 id="heading-the-dataset"><strong>The Dataset</strong></h2>
<p>We'll work with the Online Retail Dataset, a real-world transactional dataset containing purchase records from a UK-based online retail store.</p>
<ol>
<li><p><strong>Source:</strong> UCI Machine Learning Repository</p>
</li>
<li><p><strong>Collected by:</strong> UK-based online retail company (2010–2011)</p>
</li>
<li><p><strong>Size:</strong> 541,909 transactions</p>
</li>
<li><p><strong>Features:</strong> 8 attributes (InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country)</p>
</li>
<li><p><strong>Ownership:</strong> Public dataset hosted by UCI</p>
</li>
<li><p><strong>License:</strong> Open for research and educational use</p>
</li>
</ol>
<h2 id="heading-mean-the-sensitive-giant">Mean: The Sensitive Giant</h2>
<p>In statistics and data analysis, the terms "<strong>average</strong>" and "<strong>arithmetic mean</strong>" are often used interchangeably. Our goal is to find the mean total price in our dataset. For the Online Retail Dataset, the mean is defined as:</p>
<p>$$\text{Average Order Value} = \frac{\text{Sum of all TotalPrice values}}{\text{Number of transactions}}$$</p>
<p>In our dataset, the mean is calculated by summing all transaction values (including bulk purchases and returns) and dividing by the total number of transactions. This means every value, including unusually high or negative ones, directly influences the final average.</p>
<pre><code class="language-python">import pandas as pd  # reading .xlsx files also requires the openpyxl package

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
df = pd.read_excel(url, engine='openpyxl')

# Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate the Mean (Average Order Value)
mean_value = df['TotalPrice'].mean()
print(f"Average Order Value (Mean): {mean_value:.2f}")
</code></pre>
<p>The results are as follows:</p>
<pre><code class="language-python">Average Order Value (Mean): 20.40
</code></pre>
<p>At first glance, the result looks reasonable: every transaction contributes equally. But that’s exactly where the problem lies. A few extremely high or low transactions can shift the mean away from the range where most customers actually spend.</p>
<p>Take a look at the graph for the mean below.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/583bebff-0e5e-44b8-80cb-48e4662b9abf.png" alt="The graph shows the calculated mean for the Online Retail Dataset, where we get a mean of 20.40" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The graph shows the mean Total Price for the Online Retail Dataset. We get a mean of 20.40. (Image by Author)</p>
<p>The graph shows <strong>a right-skewed distribution</strong> in which the calculated mean of 20.40 is a textbook trap. The tallest bar clearly shows that the majority of transactions lie in the $8 to $15 range, but the <strong>red line</strong> is dragged to the right by the <strong>long tail</strong> of high-value bulk orders from a few customers.</p>
<p>In this scenario, the average price is well above what a typical customer actually spends because it's highly sensitive to outliers – and in reality, the bulk of the data lives in the lower price range.</p>
<p>In simple words, the mean is being pulled by some extreme values to the right, especially by some lying in the range of 200–300, which is noticeable in the graph.</p>
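<p>To see this sensitivity on a small scale, here is a minimal sketch using made-up order values (they are illustrative assumptions, not rows from the Online Retail Dataset). It shows how a single bulk order or a single large return moves the mean of an otherwise tight group of orders.</p>

```python
from statistics import mean

# Hypothetical order values in the common $8 to $15 range
typical_orders = [8, 9, 10, 11, 12, 13, 15]

# The same orders plus one extreme transaction
with_bulk_order = typical_orders + [300]   # one large bulk purchase
with_return = typical_orders + [-120]      # one large return (negative value)

print(round(mean(typical_orders), 2))   # 11.14
print(round(mean(with_bulk_order), 2))  # 47.25
print(round(mean(with_return), 2))      # -5.25
```

<p>Seven of the eight orders are identical in each case, yet one extreme value is enough to move the mean far outside the range where those orders lie.</p>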
<h2 id="heading-median-the-robust-middle">Median: The Robust Middle</h2>
<p>When the mean is distorted by extreme values, we need a metric that remains unaffected by such outliers. This is where the median comes into play.</p>
<p>Median is defined as the <strong>middle value after sorting the data.</strong></p>
<p>In our dataset, we sort all the transactions and pick the middle one.</p>
<p>The formula for calculating the median is:</p>
<p>$$\text{Median} = \begin{cases} X_{\left[ \frac{n+1}{2} \right]} &amp; \text{if } n \text{ is odd} \\ \frac{X_{\left[ \frac{n}{2} \right]} + X_{\left[ \frac{n}{2} + 1 \right]}}{2} &amp; \text{if } n \text{ is even} \end{cases}$$</p>
<p>Unlike the mean, the median doesn't depend on extreme values, and it cares only about the position of the data, not the magnitude.</p>
<pre><code class="language-python"># Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Calculate only the Median
median_value = df['TotalPrice'].median()
print(f"Typical Order Value (Median): {median_value:.2f}")
</code></pre>
<p>The results are as follows:</p>
<pre><code class="language-python">Typical Order Value (Median): 11.10
</code></pre>
<p>Now you'll notice that the result lies in the $8 to $15 range, where most of the transactions lie.</p>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/d89a4912-0e44-485e-8ea0-ff559cea6eba.png" alt="The figure demonstrates the graph for the median, where we get an accurate value of the transactions by the customers." style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The figure demonstrates the graph for the median, where we get an accurate value of the transactions by the customers. (Image by Author)</p>
<p>In the previous graph, the mean was pulled to the right by large orders, but the median just asks what the middle customer spends. So even if someone spends $300 or some transactions are negative, the median stays stable.</p>
<p>In the figure above, the <strong>median line</strong> falls in the range where most customers lie.</p>
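<p>The median formula from this section can also be written out by hand. The sketch below is a hypothetical helper (<code>median_by_formula</code> is not part of Pandas) applied to made-up order values, and it assumes a non-empty list of numbers.</p>

```python
def median_by_formula(values):
    """Return the median using the odd/even case formula."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: X[(n+1)/2] in 1-based terms is ordered[middle] in 0-based terms
        return ordered[middle]
    # Even n: average of X[n/2] and X[n/2 + 1] (1-based)
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical order values, for illustration only
orders = [8, 9, 10, 11, 12, 13, 15]

print(median_by_formula(orders))           # 11
print(median_by_formula(orders + [300]))   # 11.5
```

<p>Adding a 300 order moves the median from 11 to 11.5, because only the position of the middle values matters, not the size of the extreme one.</p>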
<h2 id="heading-beyond-averages-understanding-spread-with-quartiles"><strong>Beyond Averages: Understanding Spread with Quartiles</strong></h2>
<p>So far, we've looked at the mean and the median, but knowing the center is not enough.</p>
<p>To truly understand customer spending, we need to understand how the data is spread, and this is where quartiles come into play.</p>
<p>Quartiles divide the dataset into the following parts:</p>
<ol>
<li><p><strong>Q1 (25th percentile):</strong> 25% of transactions are below this.</p>
</li>
<li><p><strong>Q2 (50th percentile):</strong> Median</p>
</li>
<li><p><strong>Q3 (75th percentile):</strong> 75% of transactions are below this.</p>
</li>
</ol>
<p>The distance between Q3 and Q1 is called the Interquartile Range (IQR):</p>
<p>$$IQR = Q_3 - Q_1$$</p>
<h3 id="heading-the-iqr-detecting-outliers"><strong>The IQR: Detecting Outliers</strong></h3>
<p>The IQR measures the spread of the middle 50%.</p>
<p>If the IQR is small, then the data is concentrated. If it's large, the data is spread out. The IQR also helps us identify outliers mathematically.</p>
<p>Outlier Rule:</p>
<ol>
<li><p><strong>Lower Bound = Q1 - 1.5 * IQR</strong></p>
</li>
<li><p><strong>Upper Bound = Q3 + 1.5 * IQR</strong></p>
</li>
</ol>
<h4 id="heading-a-simple-example-to-understand-iqr">A Simple Example to Understand IQR</h4>
<p>Consider the following transaction values:</p>
<p>$$\left[ 5, 8, 10, 12, 15, 18, 20 \right]$$</p>
<h4 id="heading-step-1-find-the-median-q2">Step 1: Find the Median (Q2):</h4>
<p>The middle value is:</p>
<p>$$Q_2 = 12$$</p>
<h4 id="heading-step-2-find-q1-lower-quartile">Step 2: Find Q1 (Lower Quartile):</h4>
<p>The lower half is [5, 8, 10]. The median of the lower half is:</p>
<p>$$Q_1 = 8$$</p>
<h4 id="heading-step-3-find-q3-upper-quartile">Step 3: Find Q3 (Upper Quartile):</h4>
<p>The upper half is [15, 18, 20]. The median of the upper half is:</p>
<p>$$Q_3 = 18$$</p>
<h4 id="heading-step-4-calculate-iqr">Step 4: Calculate IQR:</h4>
<p>$$IQR = Q_3 - Q_1 = 18 - 8 = 10$$</p>
<h4 id="heading-step-5-find-outlier-bounds">Step 5: Find Outlier Bounds:</h4>
<p>$$\begin{aligned} \text{Lower Bound} &amp;= Q_1 - 1.5 \times IQR = 8 - 15 = -7 \\ \text{Upper Bound} &amp;= Q_3 + 1.5 \times IQR = 18 + 15 = 33 \end{aligned}$$</p>
<p>Any value <strong>below -7 or above 33</strong> is an outlier (but in this demo problem, no outliers exist).</p>
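<p>The five steps above can be checked with a short sketch. It uses the same "median of each half" approach as the worked example. The helper name <code>quartiles_median_of_halves</code> is hypothetical and written only for this illustration.</p>

```python
from statistics import median

def quartiles_median_of_halves(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For odd n, the middle value itself is left out of both halves
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)

data = [5, 8, 10, 12, 15, 18, 20]
q1, q2, q3 = quartiles_median_of_halves(data)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(q1, q2, q3)                # 8 12 18
print(iqr)                       # 10
print(lower_bound, upper_bound)  # -7.0 33.0
```

<p>Note that Pandas' <code>quantile()</code> uses linear interpolation by default, so for this same list it would return Q1 = 9 and Q3 = 16.5 instead of 8 and 18. Both conventions are common, and on large datasets the difference is usually small.</p>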
<h2 id="heading-applying-iqr-to-our-dataset"><strong>Applying IQR to Our Dataset</strong></h2>
<p>In our retail dataset, instead of neat values, we have bulk values and even negative returns.</p>
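<pre><code class="language-python"># 1. Calculate IQR and Bounds
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# 2. Count the transactions that fall outside the bounds
outliers = df[(df['TotalPrice'] &lt; lower_bound) | (df['TotalPrice'] &gt; upper_bound)]

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")
print(f"Number of Outliers: {len(outliers)}")
</code></pre>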
<p>When we calculate IQR for our dataset, we get:</p>
<pre><code class="language-python">Lower Bound: -18.75
Upper Bound: 42.45
Number of Outliers: 33180
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/e528db9b-57f9-4ee4-b331-143c2b1947fb.png" alt="The figure demonstrates the outlier range for our dataset" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>The graph demonstrates outliers, which are any values falling outside the range of -18.75 to 42.45. (Image by Author)</p>
<p>As the graph shows, the values outside the range -18.75 to 42.45 are considered outliers. These values will be removed.</p>
<h3 id="heading-revisiting-the-mean-after-removing-outliers">Revisiting the Mean After Removing Outliers</h3>
<p>Using the IQR method, we can now remove the extreme transactions that fall outside the typical spending range and recompute the mean.</p>
<pre><code class="language-python"># Clean and Feature Engineering
df = df.dropna(subset=['CustomerID'])
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Original Mean
mean_value = df['TotalPrice'].mean()
print(f"Original Mean: {mean_value:.2f}")

# IQR Calculation
Q1 = df['TotalPrice'].quantile(0.25)
Q3 = df['TotalPrice'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")

# Remove Outliers
df_no_outliers = df[(df['TotalPrice'] &gt;= lower_bound) &amp; (df['TotalPrice'] &lt;= upper_bound)]

# New Mean after removing outliers
new_mean = df_no_outliers['TotalPrice'].mean()
print(f"Mean after removing outliers: {new_mean:.2f}")
</code></pre>
<p>After recomputing, we get:</p>
<pre><code class="language-python">Original Mean: 20.40
Lower Bound: -18.75
Upper Bound: 42.45
Mean after removing outliers: 11.63
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/17e6c2d0-883f-4e48-b45b-d1bf93164c63.png" alt="The graph demonstrates that the mean improves significantly after all outliers are removed. (Image by Author)" style="display:block;margin:0 auto" width="600" height="400" loading="lazy">

<p>Removing outliers significantly shifts the mean toward the region where most transactions occur. We now have a much better mean of 11.63 as opposed to the right-stretched mean of 20.40 we got with outliers.</p>
<h2 id="heading-final-comparison-and-insights"><strong>Final Comparison and Insights</strong></h2>
<p>Looking at the results from all the graphs, we get a complete understanding of the dataset. The original mean was 20.40, which is significantly higher than the value of most transactions. The mean was pulled upward by a small number of high-value transactions, that is, it was distorted by outliers.</p>
<p>The median, on the other hand, was 11.10, which lies within the range where most transactions are concentrated. This shows that the median is a much better representation of what a typical customer spends, as it's not affected by extreme values.</p>
<p>After removing the outliers using the IQR, the mean dropped to 11.63, bringing it very close to the median. This confirms that the earlier mean was not inherently wrong, but was simply influenced by extreme values in the data. Once those values were handled, the mean became a much more reliable measure of central tendency.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>The results show that the mean can be misleading when data contains outliers. In our dataset, the original mean of 20.40 overstated customer spending, while the median (11.10) gave a more realistic picture. After removing outliers, the mean shifted to 11.63, aligning closely with the median.</p>
<p>This highlights a key lesson: <strong>The mean isn't wrong, but it must be used with an understanding of the data.</strong></p>
<p>Choosing the right measure of average depends on the dataset, and in messy real-world scenarios, the median or a cleaned mean often tells the true story.</p>
<h2 id="heading-connect-with-me"><strong>Connect with me</strong></h2>
<ol>
<li><p><a href="https://medium.com/@rakshathnaik62">Medium</a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/rakshath-/">LinkedIn</a></p>
</li>
</ol>
<p>If you want to dive deeper, you can visit: <a href="https://qubrica.com/mean-median-mode-python-guide/"><strong>Mean vs Median vs Mode: Understanding Central Tendency in Data Analysis</strong></a><strong>.</strong></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Deploy a Serverless Spam Classifier Using Scikit-Learn, AWS Lambda, & API Gateway ]]>
                </title>
                <description>
                    <![CDATA[ In today's digital world, spam is no longer just an annoyance - it's a growing security threat. To combat this, developers often turn to machine learning to build intelligent filters that can distingu ]]>
                </description>
                <link>https://www.freecodecamp.org/news/deploying-serverless-spam-classifier/</link>
                <guid isPermaLink="false">69f2e347b18c978233780179</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ serverless ]]>
                    </category>
                
                    <category>
                        <![CDATA[ AWS ]]>
                    </category>
                
                    <category>
                        <![CDATA[ MathJax ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Architecture ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Rakshath Naik ]]>
                </dc:creator>
                <pubDate>Thu, 30 Apr 2026 05:06:15 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/08672d22-a4df-4b99-8ef7-fffd18f5dc07.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In today's digital world, spam is no longer just an annoyance - it's a growing security threat. To combat this, developers often turn to machine learning to build intelligent filters that can distinguish legitimate emails from malicious ones.</p>
<p>While building a machine learning model in a notebook is relatively straightforward, the real challenge lies in the last mile: deploying that model into a scalable, production-ready system that users can actually interact with.</p>
<p>In this project, I built an end-to-end serverless spam classifier, combining Scikit-learn for model development with AWS Lambda, Amazon S3, and Amazon API Gateway for deployment. The result is a lightweight, scalable API that can classify messages in real time.</p>
<p>The system is designed to be modular and cost-efficient, allowing the model to be retrained and updated independently without affecting the live API. From detecting "free iPhone" scams to identifying phishing attempts, this project demonstrates how to bridge the gap between machine learning experimentation and real-world deployment.</p>
<h3 id="heading-table-of-contents">Table of&nbsp;Contents</h3>
<ul>
<li><p><a href="#heading-1-prerequisites">Prerequisites</a></p>
</li>
<li><p><a href="#heading-2-building-the-brain-the-model">Building the Brain: The Model</a></p>
</li>
<li><p><a href="#heading-3-deploying-the-model-to-aws">Deploying the Model to AWS</a></p>
</li>
<li><p><a href="#heading-4-how-to-run-the-project-locally">How to Run The Project Locally</a></p>
</li>
<li><p><a href="#heading-5-our-project-architecture">Our Project Architecture</a></p>
</li>
<li><p><a href="#heading-6-conclusion-the-power-of-serverless-ai">Conclusion: The Power of Serverless AI</a></p>
</li>
<li><p><a href="#heading-7-acknowledgment-references">Acknowledgment / References</a></p>
</li>
</ul>
<h2 id="heading-1-prerequisites">1. Prerequisites</h2>
<ol>
<li><p><strong>Fundamental skills:</strong> Basic proficiency in Python and understanding of Machine Learning concepts like classification.</p>
</li>
<li><p><strong>AWS account:</strong> Access to an AWS account with permissions for Lambda, S3, and API Gateway.</p>
</li>
<li><p><strong>Environment:</strong> Python 3.11 installed, along with libraries like scikit-learn, pandas, and joblib.</p>
</li>
<li><p><strong>AWS CLI:</strong> Configured on your local machine for file uploads.</p>
</li>
<li><p><strong>Hugging Face account:</strong> You can download the trained model directly from my account.</p>
</li>
</ol>
<h2 id="heading-2-building-the-brain-the-model">2. Building the Brain: The&nbsp;Model</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/b43af198-1472-4914-9469-6cd5ca5384e2.png" alt="Demonstrational image to show the brain of AI." style="display:block;margin:0 auto" width="1000" height="563" loading="lazy">

<p><em>Photo by</em> <a href="https://unsplash.com/@steve_j?utm_source=medium&amp;utm_medium=referral"><em>Steve A Johnson</em></a> <em>on</em> <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral"><em>Unsplash</em></a></p>
<p>At the heart of this project lies a supervised learning approach. Instead of simply specifying which words are considered spam, we'll provide the computer with a dataset and an algorithm, enabling it to learn and identify spam patterns on its own.</p>
<h3 id="heading-1-vectorization-turning-text-into-math">1. Vectorization: Turning Text into&nbsp;Math</h3>
<p>Machine Learning models can't <strong>read</strong> text. They require numerical input. To solve this, we used the <a href="https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/">TF-IDF</a> (Term Frequency-Inverse Document Frequency) Vectorizer.</p>
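<pre><code class="language-python">from sklearn.feature_extraction.text import TfidfVectorizer
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)  # X_train: raw message text from the training split
X_test_features = feature_extraction.transform(X_test)  # reuse the fitted vocabulary for the test split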
</code></pre>
<p>Here's the mathematical formula:</p>
<p>$$w_{i,j} = tf_{i,j} \times \log \left( \frac{N}{df_i} \right)$$</p>
<p>TF-IDF term definitions:</p>
<ul>
<li><p><strong>wᵢ,ⱼ (Weight):</strong> The final importance score of a specific word in a document.</p>
</li>
<li><p><strong>tfᵢ,ⱼ (Term Frequency):</strong> How often a word appears in a single email.</p>
</li>
<li><p><strong>N (Total Documents):</strong> The total count of all emails in your dataset.</p>
</li>
<li><p><strong>dfᵢ (Document Frequency):</strong> The number of different emails that contain this specific word.</p>
</li>
<li><p><strong>log(N/dfᵢ) (IDF):</strong> A penalty that lowers the score of common words like <strong>the</strong> or <strong>is</strong> that appear everywhere.</p>
</li>
</ul>
<p>The vectorizer cleans the data by removing common English stop words, converts all text to lowercase for consistency, and gives rare, meaningful words more weight than frequently used ones.</p>
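<p>To make the formula concrete, here is a minimal sketch that computes the weights by hand for four made-up messages. The messages, the whitespace tokenization, and the use of the natural logarithm are assumptions for illustration. Scikit-learn's <code>TfidfVectorizer</code> applies a smoothed IDF and L2 normalization by default, so its numbers will differ from this textbook version.</p>

```python
import math
from collections import Counter

# Hypothetical mini-corpus, for illustration only
docs = [
    "win a free prize now",
    "free entry win cash",
    "are we meeting for lunch",
    "call me when you are free",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)  # N in the formula

# df_i: number of documents that contain each word at least once
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return {word: tf * log(N / df)} for one tokenized document."""
    term_freq = Counter(tokens)
    return {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{word}: {weight:.3f}")
```

<p>In the first message, "free" receives the lowest weight because it appears in three of the four messages, while "prize" appears in only one and is weighted highest along with the other single-document words.</p>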
<h3 id="heading-2-training-the-logistic-regression-engine">2. Training: The Logistic Regression Engine</h3>
<p>We'll use <strong>Logistic Regression</strong> here, a classification algorithm that predicts the probability of an outcome.</p>
<p>In this stage, we feed our vectorized training data into the Logistic Regression algorithm. The goal is to establish a mathematical relationship between specific word weights and the <strong>Spam</strong> or <strong>Ham</strong> label.</p>
<p>During training, the model iteratively adjusts its internal parameters to minimize error, eventually learning that words like winner or free correlate highly with spam, while conversational language correlates with legitimate messages.</p>
<pre><code class="language-python">from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_features, Y_train)
</code></pre>
<p>In our case, it calculates the probability that an email is spam or ham.</p>
<p>The algorithm uses the Sigmoid function to map any real-valued number into a value between 0 and 1.</p>
<p>$$P(y=1|x) = \frac{1}{1 + e^{-(z)}}$$</p>
<p>where z = β₀ + β₁x₁ +&nbsp;… + βₙxₙ.</p>
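<p>As a rough illustration of the two formulas above, the sketch below computes z and passes it through the Sigmoid function. The bias, weights, and feature values are hypothetical numbers chosen for the example, not coefficients from the trained model.</p>

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(bias, weights, features):
    # z = b0 + b1*x1 + ... + bn*xn
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients and TF-IDF feature values, for illustration only
bias = -0.5
weights = [2.0, -1.0]
features = [0.5, 0.5]

probability = predict_probability(bias, weights, features)
print(probability)  # 0.5, because z works out to exactly 0
```

<p>This naive version can overflow for very large negative z. Scikit-learn handles that internally, so the sketch is only meant to show the shape of the calculation.</p>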
<h3 id="heading-3-evaluation-testing-the-intelligence">3. Evaluation: Testing the Intelligence</h3>
<p>After training, we need to verify if the brain actually works on data it hasn't seen before.</p>
<pre><code class="language-python">from sklearn.metrics import accuracy_score

prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
</code></pre>
<p>By comparing the model’s predictions against the actual labels in our test set, we calculate an Accuracy Score. This gives us the confidence that the model is ready for the real world (achieving ~94% accuracy in our tests).</p>
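<p>The accuracy score itself is just the share of predictions that match the true labels. Here is a minimal hand-written sketch with made-up labels, assuming the encoding 1 = ham and 0 = spam used by the Lambda function in this article.</p>

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)

# Hypothetical labels: 1 = ham, 0 = spam
y_true = [1, 1, 0, 1, 0, 1, 1, 1]
y_pred = [1, 1, 0, 1, 1, 1, 1, 1]

print(accuracy(y_true, y_pred))  # 0.875
```

<p>In this toy example the score is 0.875 even though one of the two spam messages was missed, so it can be worth looking at the individual errors as well as the overall score.</p>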
<h3 id="heading-4-exporting-the-logic-serialization">4. Exporting the Logic (Serialization)</h3>
<p>To move this brain from our local Python environment to the AWS Cloud, we'll use Joblib to save our work into binary files (.pkl).</p>
<pre><code class="language-python">import joblib

joblib.dump(model, 'spam_model.pkl')
joblib.dump(feature_extraction, 'vectorizer.pkl')
</code></pre>
<p>We use the Pickle format because it allows us to freeze complex Python objects (mathematical weights and word mappings) into a portable binary format that can be instantly re-animated in the cloud.</p>
<p>We need the Vectorizer to translate new user text into the exact numerical coordinates the Model was trained to understand. Using one without the other is like having a key but no lock.</p>
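<p>The save-and-restore round trip can be sketched with Python's built-in <code>pickle</code> module, used here as a stand-in for Joblib so the example has no dependencies. The two dictionaries are hypothetical placeholders for the fitted vectorizer and model, and the in-memory buffers play the role of the two .pkl files.</p>

```python
import pickle
from io import BytesIO

# Hypothetical stand-ins for the fitted vectorizer and model
vectorizer_state = {"vocabulary": {"free": 0, "winner": 1, "lunch": 2}}
model_state = {"weights": [1.8, 2.1, -1.2], "bias": -0.4}

# "Freeze" each object into bytes, as if writing two .pkl files
vectorizer_bytes = pickle.dumps(vectorizer_state)
model_bytes = pickle.dumps(model_state)

# "Re-animate" them from bytes, as a loader reading a file-like object would
restored_vectorizer = pickle.load(BytesIO(vectorizer_bytes))
restored_model = pickle.load(BytesIO(model_bytes))

print(restored_vectorizer == vectorizer_state)  # True
print(restored_model == model_state)            # True
```

<p>Both objects have to be restored as a pair: the word-to-column mapping in the vectorizer decides which weight in the model each word lines up with.</p>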
<p>The trained Logistic Regression model and TF-IDF vectorizer are openly available for the community on Hugging Face here: <a href="https://huggingface.co/rakshath1/mail-spam-detector">Get the model on HuggingFace</a>.</p>
<h2 id="heading-3-deploying-the-model-to-aws">3. Deploying the Model to&nbsp;AWS</h2>
<p>Training a model is science, while deploying it is engineering. To make this classifier accessible to the world, we'll use a serverless stack that scales automatically and requires almost no maintenance.</p>
<h3 id="heading-1-model-storage-amazon-s3">1. Model Storage: Amazon&nbsp;S3</h3>
<p>First, we'll upload our .pkl files to an S3 bucket. By decoupling the model from the code, we can update the AI's intelligence (simply by overwriting the file in S3) without redeploying the backend code. This makes the system highly maintainable.</p>
<h3 id="heading-2-the-production-backend-aws-lambda">2. The Production Backend: AWS&nbsp;Lambda</h3>
<p>To make the AI accessible, we'll move from a local script to a Serverless Cloud Architecture. This ensures the model is always available without the cost of a 24/7 server.</p>
<p>The deployment environment is AWS Lambda (Python 3.11). Since Lambda is a lightweight environment, it doesn't include Scikit-Learn or Joblib. To provide these, we'll package them as a ZIP file, store it in our S3 bucket, and attach it to the function as a Lambda layer.</p>
<p><strong>Commands in AWS CLI:</strong></p>
<pre><code class="language-python">
# 1. Create a workspace
mkdir ml_layer &amp;&amp; cd ml_layer

# 2. Install scikit-learn and its dependencies into a folder
pip install \
    --platform manylinux2014_x86_64 \
    --target=python/lib/python3.11/site-packages \
    --implementation cp \
    --python-version 3.11 \
    --only-binary=:all: \
    scikit-learn joblib

# 3. Zip the folder
zip -r sklearn_lib.zip python

# 4. Upload to S3 (Using AWS CLI)
aws s3 cp sklearn_lib.zip s3://YOUR-BUCKET-NAME/
</code></pre>
<p>We store the Scikit-Learn library as a ZIP in S3 because the package is too large to upload directly to Lambda. Creating the layer from S3 keeps these heavy dependencies out of the function's own code package, so the core code stays small.</p>
<p><strong>The Lambda Function:</strong></p>
<pre><code class="language-python">
import json
import boto3
import os
import sys
from io import BytesIO

# Make sure the custom Lambda layer (containing sklearn/joblib) is on the import path
sys.path.append('/opt/python')

try:
    import joblib
except ImportError:
    # Fallback for specific Scikit-Learn distributions
    from sklearn.utils import _joblib as joblib

# Initialize S3 client
s3 = boto3.client('s3')

# Use placeholders for the article so readers can insert their own values
BUCKET_NAME = 'YOUR_S3_BUCKET_NAME' 
MODEL_KEY = 'spam_model.pkl'
VECTORIZER_KEY = 'vectorizer.pkl'

# Global variables for 'Warm Start' caching (improves performance by keeping model in RAM)
model = None
vectorizer = None

def load_model():
    """Downloads model files from S3 only if they aren't already in RAM"""
    global model, vectorizer
    if model is None or vectorizer is None:
        try:
            # 1. Load the Logistic Regression Model from S3
            m_obj = s3.get_object(Bucket=BUCKET_NAME, Key=MODEL_KEY)
            model = joblib.load(BytesIO(m_obj['Body'].read()))
            
            # 2. Load the TF-IDF Vectorizer directly from S3
            v_obj = s3.get_object(Bucket=BUCKET_NAME, Key=VECTORIZER_KEY)
            vectorizer = joblib.load(BytesIO(v_obj['Body'].read()))
        except Exception as e:
            raise Exception(f"Failed to load .pkl files from S3: {str(e)}")

def lambda_handler(event, context):
    try:
        # Ensure model and vectorizer are ready before processing
        load_model()
        
        # Handles both direct Lambda tests and API Gateway POST requests
        body = event.get('body', event)
        if isinstance(body, str):
            body = json.loads(body)
            
        text = body.get('text', '')
            
        if not text:
            return {
                'statusCode': 400,
                'body': json.dumps({'error': 'No text provided.'})
            }

        # 1. Transform input text to numeric features using the trained Vectorizer
        data_vec = vectorizer.transform([text])
        
        # 2. Predict using the Logistic Regression Model 
        prediction = int(model.predict(data_vec)[0])
        
        # 3. Map numeric result to human-readable label
        result_label = "HAM" if prediction == 1 else "SPAM"
        
        # RESPONSE WITH CORS
        return {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*' # needed for cross-domain web integration
            },
            'body': json.dumps({
                'status': 'success',
                'classification': result_label,
                'input_text': text
            })
        }
        
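    except Exception as e:
        return {
            'statusCode': 500,
            # Include the CORS header so the browser can read the error details
            'headers': {'Access-Control-Allow-Origin': '*'},
            'body': json.dumps({'error_message': f"Inference Error: {str(e)}"})
        }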
</code></pre>
<p>Key features of the Lambda function:</p>
<ol>
<li><p><strong>Warm start caching:</strong> By defining the model and vectorizer variables outside the lambda_handler, we keep them in the container's memory between invocations. The first (cold) request still pays for the S3 download, but subsequent warm requests skip it, which noticeably lowers their latency.</p>
</li>
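<li><p><strong>Dynamic dependency loading:</strong> The <strong>sys.path.append('/opt/python')</strong> line adds the layer directory to Python's import path. This lets us import heavy libraries (such as Scikit-Learn) that are supplied through S3/Layers instead of being bundled into the function's own upload package, which keeps that package under the upload limit.</p>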
</li>
<li><p><strong>Bimodal input handling:</strong> The function is designed to handle both direct JSON testing from the AWS console and stringified payloads sent via API Gateway.</p>
</li>
</ol>
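<p>To see the bimodal input handling in isolation, here is a minimal sketch of the body-parsing step. The helper name extract_text and the sample events are illustrative assumptions rather than part of the deployed function, and the None check is an extra guard for requests that arrive without a body.</p>

```python
import json


def extract_text(event):
    """Return the 'text' field from a console test event or an API Gateway proxy event."""
    # API Gateway (proxy integration) wraps the payload in a 'body' key as a JSON string;
    # a console test event passes the payload dict directly.
    body = event.get("body", event)
    if body is None:  # assumption: a proxy request sent without any body
        body = {}
    if isinstance(body, str):
        body = json.loads(body)
    return body.get("text", "")


# Shape 1: direct test event from the Lambda console
console_event = {"text": "WIN free iPhone now"}

# Shape 2: API Gateway proxy event, where the body arrives as a JSON string
gateway_event = {"httpMethod": "POST", "body": json.dumps({"text": "WIN free iPhone now"})}

print(extract_text(console_event))
print(extract_text(gateway_event))
```

<p>Both event shapes resolve to the same text, which is why the same handler works from the console test tab and from the browser.</p>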
<h3 id="heading-3-the-api-gateway-the-bridge-to-the-web">3. The API Gateway - The Bridge to the&nbsp;Web</h3>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/8aa3e8d7-569a-4dd5-a6ac-184922474952.png" alt="Demonstrational image to show the API Gateway." style="display:block;margin:0 auto" width="1000" height="563" loading="lazy">

<p>Photo by <a href="https://unsplash.com/@growtika?utm_source=medium&amp;utm_medium=referral">Growtika</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></p>
<h4 id="heading-creating-the-rest-api">Creating the REST API</h4>
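<p>Next, we'll create a REST API with a single POST method. Why POST, you might be wondering? Because we need to send a JSON payload containing the user’s text message in the request body, and POST is the method designed to carry one. The Invoke URL uses HTTPS, so the payload is encrypted in transit.</p>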
<ol>
<li><p>First, navigate to the Amazon API Gateway console and select Create API -&gt; REST API.</p>
</li>
<li><p>Give your API a name, such as EmailSpamPredictor-API, and set the Endpoint Type to Regional.</p>
</li>
<li><p>Then, in the left sidebar, click Resources and create a resource with a name of your choice (for example, <strong>/predict</strong>, which is what I used).</p>
</li>
<li><p>Next, click Create method, select POST, and choose Lambda Function as the integration type.</p>
</li>
<li><p>Ensure Lambda Proxy integration is enabled (this allows the full request to pass through to your code).</p>
</li>
</ol>
<p><strong>The CORS Configuration (The Troubleshooting Hub)</strong><br>This is where many developers encounter the dreaded <strong>Connection Error</strong>. Our API is hosted on AWS, so if your front-end is served from a different origin (a separate website, for example), the browser’s Same-Origin Policy will block the request by default.</p>
<p>To fix this, we'll enable <strong>CORS:</strong></p>
<ol>
<li><p><strong>Access-Control-Allow-Origin:</strong> Set to * (or specifically to your domain) to tell the browser that the API is allowed to talk to your front-end.</p>
</li>
<li><p><strong>The OPTIONS method:</strong> When you enable CORS, API Gateway creates an OPTIONS method automatically. This handles the preflight request, where the browser asks the API, “Am I allowed to send you this request?” before sending the actual text.</p>
</li>
<li><p><strong>Access-Control-Allow-Headers:</strong> In the screenshot, you'll notice headers like Content-Type and Authorization are allowed. This ensures that when our JavaScript fetch() call sets the content type to application/json, the preflight check passes and the browser doesn't block the request.</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/cf5c87c6-f374-4dda-8001-77a0aab52672.png" alt="Image illustrates the CORS configuration for our project. " style="display:block;margin:0 auto" width="1487" height="617" loading="lazy">

<p>Image illustrates the CORS configuration for our project. (Image by author)</p>
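<p>With Lambda Proxy integration, the headers on the actual POST response come from the function itself, so it can help to build every response, including errors, through one helper. The sketch below assumes a hypothetical build_response helper and CORS_HEADERS constant. The allowed headers and methods are example values that you should match to your own API Gateway settings.</p>

```python
import json

# Assumption: "*" allows any origin; replace it with your site's origin to restrict access.
CORS_HEADERS = {
    "Content-Type": "application/json",
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Headers": "Content-Type,Authorization",
    "Access-Control-Allow-Methods": "OPTIONS,POST",
}


def build_response(status_code, payload):
    """Build a Lambda proxy response that always carries the CORS headers."""
    return {
        "statusCode": status_code,
        "headers": dict(CORS_HEADERS),  # copy so callers cannot mutate the shared constant
        "body": json.dumps(payload),
    }


ok = build_response(200, {"status": "success", "classification": "SPAM"})
bad = build_response(400, {"error": "No text provided."})
print(ok["headers"]["Access-Control-Allow-Origin"])
print(bad["statusCode"], bad["body"])
```

<p>Routing error responses through the same helper means a 400 or 500 reaches the page as a readable JSON error instead of surfacing as an opaque CORS failure.</p>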
<h4 id="heading-deployment-stages">Deployment Stages</h4>
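<p>Once the API is deployed to a production stage, AWS generates a permanent Invoke URL. This acts as the public gateway to our model and typically follows this structure: <a href="https://%5Bapi-id%5D.execute-api.%5Bregion%5D.amazonaws.com/prod/classify">https://[api-id].execute-api.[region].amazonaws.com/prod/classify</a>. Here, prod is the stage name and classify is the resource path, so substitute your own stage and resource names (for example, predict if you used the resource name from the steps above).</p>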
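<p>Before wiring up the front-end, you may want to smoke-test the endpoint from Python. The following is a minimal sketch using only the standard library. The function names and the sample values (abc123, us-east-1) are placeholders for illustration, not real identifiers from this project.</p>

```python
import json
import urllib.request


def build_invoke_url(api_id, region, stage, resource):
    """Assemble an Invoke URL from its parts (all four values are placeholders)."""
    return f"https://{api_id}.execute-api.{region}.amazonaws.com/{stage}/{resource}"


def build_request(url, text):
    """Create a POST request carrying the JSON payload the Lambda function expects."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(url, text, timeout=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(url, text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


url = build_invoke_url("abc123", "us-east-1", "prod", "classify")
print(url)
# classify(url, "WIN free iPhone now")  # uncomment once your own API is deployed
```

<p>Calling the API from a script like this is not subject to the browser's CORS checks, so it is a quick way to tell a backend problem apart from a CORS misconfiguration.</p>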
<h4 id="heading-connecting-the-frontend-the-javascript-layer">Connecting the Frontend (The JavaScript Layer)</h4>
<p>With the API live, we can now write a simple JavaScript function to talk to our model. This script runs whenever a user clicks the <strong>Analyze</strong> button on your site.</p>
<pre><code class="language-javascript">
async function checkSpam() {
    const message = document.getElementById("userInput").value;
    const apiUrl = "YOUR_API_GATEWAY_INVOKE_URL";

    try {
        const response = await fetch(apiUrl, {
            method: "POST",
            headers: {
                "Content-Type": "application/json"
            },
            body: JSON.stringify({ "text": message })
        });

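        const data = await response.json();

        // Treat non-2xx responses (e.g. the 400 for empty input) as errors
        if (!response.ok) {
            throw new Error(data.error || data.error_message || `HTTP ${response.status}`);
        }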
        
        // Display result on the webpage
        const resultElement = document.getElementById("result");
        resultElement.innerText = `Prediction: ${data.classification}`;
        resultElement.style.color = data.classification === "SPAM" ? "red" : "green";

    } catch (error) {
        console.error("Error:", error);
        alert("Could not connect to the Spam Detector API.");
    }
}
</code></pre>
<h2 id="heading-4-how-to-run-the-project-locally">4. How to Run the Project&nbsp;Locally</h2>
<p>You can save the front-end as an HTML file. Once it's ready, don’t just double-click the .html file: a page opened from a <strong>file://</strong> URL has no regular web origin, so the browser may apply security restrictions that block the API request. Instead, serve it from a simple local server.</p>
<p><strong>Step 1:</strong> Open the terminal or Command Prompt.</p>
<p><strong>Step 2:</strong> Navigate to your project folder</p>
<pre><code class="language-shell">cd [PATH_TO_YOUR_FOLDER]
</code></pre>
<p><strong>Step 3:</strong> Start a local Python web server.</p>
<pre><code class="language-shell">python -m http.server 8000
</code></pre>
<p><strong>Step 4:</strong> Access the application.</p>
<p>Open your browser and navigate to:<br><a href="http://localhost:8000/your-file-name.html">http://localhost:8000/your-file-name.html</a></p>
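<p>If you prefer to start the server from a script instead of the command line, the sketch below does roughly the same job with the standard library's http.server module. The start_local_server helper is a hypothetical name, and unlike the command above it binds only to 127.0.0.1.</p>

```python
import functools
import http.server
import threading


def start_local_server(directory=".", port=8000):
    """Serve `directory` on localhost in a background thread and return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Daemon thread: the server stops automatically when the main program exits
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Usage (assumption: run from your project folder):
#   server = start_local_server(".", 8000)
#   input("Serving on http://localhost:8000/ - press Enter to stop")
#   server.shutdown()
```

<p>Passing port 0 lets the operating system pick a free port, which you can read back from server.server_address.</p>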
<p><strong>Watch the Demo:</strong></p>
<div class="embed-wrapper"><iframe width="560" height="315" src="https://www.youtube.com/embed/q2X_azntmzY" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>

<h2 id="heading-5-our-project-architecture">5. Our Project Architecture</h2>
<img src="https://cdn.hashnode.com/uploads/covers/6942c2903c5d674e359eaf1e/c17673d4-5dd0-43dc-8e8d-3015bcd31864.png" alt="Image showing the Architecture Diagram of our Project." style="display:block;margin:0 auto" width="1000" height="563" loading="lazy">

<p>The image illustrates the architecture of our project (Building a Serverless Spam Classifier). It shows the process that takes place from the client input to the final model output. (Image by Author)</p>
<ol>
<li><p><strong>Client Front-End Interaction:</strong> The process starts on the far left. A user interacts with the web interface (for example, a website or a desktop app). They input text like <strong>WIN free iPhone now</strong> and trigger a request.</p>
</li>
<li><p><strong>The Entry Point: API Gateway:</strong> The request hits the Amazon API Gateway, which acts as the <strong>security guard</strong> and translator.&nbsp;<br><strong>(a)</strong> CORS OPTIONS handles the pre-flight handshake to ensure the browser has permission to talk to the AWS cloud.&nbsp;<br><strong>(b)</strong> Classification Request (POST) routes the actual message data to your backend logic.</p>
</li>
<li><p><strong>The Engine: AWS Lambda (Python 3.11):</strong>&nbsp;The central “<strong>lightbulb</strong>” represents your Lambda function. This is where the code you wrote lives. It doesn’t run 24/7 – it only wakes up when a request arrives.</p>
</li>
<li><p><strong>Storage &amp; Retrieval: S3 Bucket:</strong> Since Lambda is lightweight, it doesn’t store your heavy Machine Learning files internally.<br><strong>Dependency and Model Download:</strong> The function reaches out to the S3 Bucket to pull in the sklearn_lib.zip (the engine) and the .pkl files (the intelligence).<br><strong>Required Dependency and Model:</strong> These assets are loaded into the Lambda’s temporary memory to prepare for the prediction.</p>
</li>
<li><p><strong>The Inference Pipeline:</strong>&nbsp;Inside the Lambda, a three-step mathematical cycle occurs:<br><strong>(a) Text Vectorizer:</strong> Translates the words into numbers.<br><strong>(b) Logistic Regression:</strong> Calculates the probability of spam based on those numbers.<br><strong>(c) Label:</strong> Assigns a final result (Spam or Ham).</p>
</li>
<li><p><strong>The Result Delivery:</strong> The result is sent back through the API Gateway, including the necessary CORS Headers to ensure the browser accepts it. The front-end then updates to show the “<strong>Result: SPAM</strong>” with a visual indicator.</p>
</li>
</ol>
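<p>The three-step inference pipeline from step 5 can be illustrated with a toy, pure-Python sketch. Note the assumptions: the weights and bias are made-up values rather than the trained model's parameters, and plain word counts stand in for the TF-IDF vectorizer used in the real project.</p>

```python
import math

# Hypothetical toy weights for illustration only; a trained model learns these from data.
WEIGHTS = {"win": 1.8, "free": 1.5, "now": 0.6, "meeting": -1.4, "tomorrow": -0.9}
BIAS = -1.0


def vectorize(text):
    """Step (a): turn text into numbers (here, word counts over a fixed vocabulary)."""
    counts = {word: 0 for word in WEIGHTS}
    for token in text.lower().split():
        if token in counts:
            counts[token] += 1
    return counts


def spam_probability(features):
    """Step (b): logistic regression is the sigmoid of the weighted sum of features."""
    z = BIAS + sum(WEIGHTS[word] * count for word, count in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def label(probability, threshold=0.5):
    """Step (c): map the probability to a human-readable label."""
    return "SPAM" if probability >= threshold else "HAM"


for message in ["WIN free iPhone now", "Meeting moved to tomorrow"]:
    p = spam_probability(vectorize(message))
    print(f"{message}: {label(p)} (p={p:.2f})")
```

<p>In the deployed function, vectorizer.transform and model.predict perform steps (a) and (b) using the parameters stored in the .pkl files.</p>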
<h2 id="heading-6-conclusion-the-power-of-serverless-ai">6. Conclusion: The Power of Serverless AI</h2>
<p>By merging the mathematical simplicity of Logistic Regression with the industrial strength of AWS Serverless Architecture, we have transformed a static Python script into a globally accessible, scalable API.</p>
<p>This project demonstrates that you don’t need a massive budget or a 24/7 dedicated server to deploy high-quality Machine Learning.</p>
<p>Using the S3-to-Lambda workaround allowed us to bypass common storage hurdles, ensuring that our Brain (the model) and its Muscle (Scikit-Learn) could function seamlessly within the cloud’s ephemeral environment. It bridges the gap between experimentation and real-world applications, making AI systems practical, efficient, and accessible.</p>
<h2 id="heading-7-acknowledgment-references">7. Acknowledgment / References</h2>
<ul>
<li><p>Pre-trained spam classification model: <a href="https://huggingface.co/rakshath1/mail-spam-detector"><strong>rakshath1/mail-spam-detector</strong></a> on Hugging Face</p>
</li>
<li><p>Scikit-learn <a href="https://scikit-learn.org/stable/api/index.html?utm_source=chatgpt.com">Documentation</a></p>
</li>
<li><p>AWS Lambda <a href="https://docs.aws.amazon.com/lambda/latest/api/welcome.html?utm_source=chatgpt.com">Documentation</a></p>
</li>
<li><p>Amazon S3 <a href="https://aws.amazon.com/documentation-overview/s3/">Documentation</a></p>
</li>
<li><p>Amazon API Gateway <a href="https://docs.aws.amazon.com/apigateway/">Documentation</a></p>
</li>
</ul>
<h3 id="heading-connect-with-me">Connect With Me</h3>
<ul>
<li><p><a href="https://medium.com/@rakshathnaik62">Medium</a></p>
</li>
<li><p><a href="https://www.linkedin.com/in/rakshath-/">LinkedIn</a></p>
</li>
</ul>
<p><strong>You may also like</strong></p>
<ol>
<li><p><a href="https://qubrica.com/python-polars-v-s-pandas-libraries-comparison/">How Polars overtook Pandas</a></p>
</li>
<li><p><a href="https://qubrica.com/devops-is-dead-platform-engineering-2026/"><strong>DevOps is Dead. Long Live Platform Engineering</strong></a></p>
</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
