<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ Ibrahim Ogunbiyi - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ Ibrahim Ogunbiyi - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 24 May 2026 22:24:02 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/author/IbrahimOgunbiyi/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ What is Semantic Matching? How to Find Words in a Document Using NLP ]]>
                </title>
                <description>
                    <![CDATA[ Have you ever found yourself searching a document for a specific word or phrase just to discover that the term you're looking for isn't there? It can be frustrating, right? Sometimes, even though you might not see the exact term you’re looking for, t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-semantic-matching-find-words-in-a-document-using-nlp/</link>
                <guid isPermaLink="false">67802329a9edea9df0053dd7</guid>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Thu, 09 Jan 2025 19:27:37 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/Dh7gzpVpdWQ/upload/4e1e504663acda31b980e6fba0c2d661.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Have you ever found yourself searching a document for a specific word or phrase just to discover that the term you're looking for isn't there? It can be frustrating, right?</p>
<p>Sometimes, even though you might not see the exact term you’re looking for, the document might contain similar words or phrases that have the same meaning or context but don’t have the exact same form (such as differences in spelling).</p>
<p>Traditional NLP search approaches have relied on using exact forms to search for words or phrases in a particular document. But this fails at finding words based on semantic or contextual meaning.</p>
<p>To solve this, semantic matching comes into play. It’s an advanced way of searching that takes advantage of traditional search methods while also focusing more on locating or matching words or phrases based on their meaning or context (rather than solely on their exact form).</p>
<p>In this article, you will learn how to perform semantic matching using NLP. Without further ado, let’s get started.</p>
<h2 id="heading-requirements">Requirements</h2>
<p>To make sure that you can reproduce the experiment in this tutorial, you’ll need to have a few things.</p>
<p>First, you’ll need to have Python 3.x (preferably Python 3.10) installed on your PC. You’ll also need some libraries, which you can install using the Pip package manager.</p>
<p>You should also have basic knowledge of NLP such as text preprocessing and text representation techniques. You can learn more <a target="_blank" href="https://www.freecodecamp.org/news/natural-language-processing-techniques-for-beginners/">here</a>.</p>
<p>You can also <a target="_blank" href="https://github.com/ibrahim-ogunbiyi/Semantic_Matching">fork the repo</a> which contains all the code in this article so you can follow along.</p>
<p>To install everything using Pip, type the following command:</p>
<pre><code class="lang-bash"># to install with pip
pip install pypdf2 keybert sentence-transformers
</code></pre>
<h2 id="heading-problem-definition">Problem Definition</h2>
<p>Suppose you’re a data scientist who’s part of a curriculum development team and want to know if a particular concept (word or phrase), say <strong>birth control</strong>, is being taught in a curriculum that’s in a PDF document.</p>
<p>One way you could do this is to open the PDF using a PDF tool and then use the Ctrl + F (find) method to check if the phrase birth control is in the PDF.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736408224052/2e6dacef-ef92-4113-a574-ec355e99e6f6.png" alt="The PDF we're working with here" class="image--center mx-auto" width="1600" height="853" loading="lazy"></p>
<p>You could also do it programmatically, as shown below: </p>
<pre><code class="lang-python"><span class="hljs-comment"># import library</span>
<span class="hljs-keyword">import</span> PyPDF2

<span class="hljs-comment"># use PDFreader from PyPDF2 to read pdf content.</span>
pdf_reader = PyPDF2.PdfReader(<span class="hljs-string">"Relationships_Education_RSE_and_Health_Education.pdf"</span>)

<span class="hljs-comment"># join all the content in the pdf pages together and lowercase the letters</span>
pdf_document = <span class="hljs-string">" "</span>.join([page.extract_text().lower() <span class="hljs-keyword">for</span> page <span class="hljs-keyword">in</span> pdf_reader.pages])

<span class="hljs-comment"># check if the string 'birth control' is in the document [Returns False]</span>
<span class="hljs-string">"birth control"</span> <span class="hljs-keyword">in</span> pdf_document
</code></pre>
<p>Below is the output of the above code:</p>
<pre><code class="lang-python"><span class="hljs-literal">False</span>
</code></pre>
<p>As shown above, both the programmatic way of searching and the PDF tool say that the phrase “birth control” doesn't exist in the PDF document.</p>
<p>Well, this might be true, but because this is a traditional way of NLP searching (that matches word for word in exact form) let’s not fully trust it. As I explained earlier, some words might be in different forms or have a different spelling, but they might mean the same thing contextually or semantically.</p>
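<p>There is also a purely mechanical reason to be cautious. Text extracted from a PDF can contain line breaks in the middle of a phrase, which is enough to defeat an exact match. The sketch below uses a made-up string (not the actual PDF content) and a hypothetical helper named <code>normalize_whitespace</code> to show the effect:</p>

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical extracted text: the phrase is split across a line break.
extracted = "pupils should learn about birth\ncontrol and other choices"

print("birth control" in extracted)                        # False
print("birth control" in normalize_whitespace(extracted))  # True
```

<p>Even with whitespace cleaned up, an exact match still can't find a synonym of the phrase.</p>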
<p>So how do we solve this issue? This is where semantic matching comes into play.</p>
<h2 id="heading-what-is-semantic-matching">What is Semantic Matching?</h2>
<p>Semantic Matching is a technique used to determine if two elements have the same meaning. An element can be a word, phrase, sentence, document, or even a corpus. It refers to matching elements based on meaning or context and not just matching based on exact form.</p>
<p>In order to perform semantic matching in NLP, there are certain things you need to know and do. Let’s go through them now:</p>
<h3 id="heading-what-is-word-embedding">What is Word Embedding?</h3>
<p>Word embedding is an advanced text representation technique that represents words as lower-dimensional vectors. These vectors capture inter-word semantic and syntactic information. This means that words that have similar meanings – even though they might be spelled differently – will have similar vector representations.</p>
<h4 id="heading-what-does-lower-dimensional-vector-representation-mean">What does Lower-Dimensional Vector representation mean?</h4>
<p>In NLP, traditional ways of representing text in a form machines can understand (that is, as numerical vectors) include Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF), and one-hot encoding. But these techniques produce high-dimensional vectors (usually the size of the vocabulary) that are sparse (meaning most of the values are zeros).</p>
<p>So, for example, if a word is represented as a numerical vector and the document or corpus it belongs to has a vocabulary of 10,000 words, that word's vector would have 10,000 dimensions.</p>
<p>The disadvantages of these techniques are high dimensionality, sparsity, and their inability to capture semantic information. So, advancements in NLP led to the development of word embedding techniques, which create lower-dimensional (dense) vector representations of words and can capture inter-word semantic information.</p>
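<p>To make the sparsity problem concrete, here is a minimal sketch of one-hot encoding over a toy corpus. The corpus and the helper names (<code>one_hot</code>, <code>dot</code>) are made up for illustration. Each vector is as long as the vocabulary, and the dot product between two different words is always zero, so the representation carries no information about meaning:</p>

```python
# Toy corpus (hypothetical); real vocabularies have thousands of words.
corpus = ["the man walks", "the woman walks", "the cat sleeps"]

# Build a sorted vocabulary and give each word a position.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector as long as the vocabulary with a single 1."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


man, woman = one_hot("man"), one_hot("woman")
print(len(vocabulary), man)  # 6 [0, 1, 0, 0, 0, 0]
print(dot(man, woman))       # 0
```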
<p>Word embedding is a cornerstone of NLP and language technology, serving as the foundation for advanced language representation models such as GPT (Generative Pre-trained Transformer).</p>
<p>There is also sentence embedding that represents sentences in a lower-dimension vector representation.</p>
<h3 id="heading-how-do-we-measure-if-two-vectors-are-similar">How do we measure if two vectors are similar?</h3>
<p>This is where cosine similarity comes into play. Cosine similarity is a mathematical technique that we use to know how similar two vectors are to each other.</p>
<p>Mathematically, cosine similarity ranges from -1 to 1. In NLP, the scores you'll see for text embeddings usually fall between 0 and 1, and a value close to 1 means that the two vectors are highly similar.</p>
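<p>As a rough sketch of what the computation involves, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is a plain-Python illustration, and the 2-dimensional vectors are made up only to show the range of possible values:</p>

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-dimensional vectors, chosen only to show the range.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction: close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # perpendicular: 0.0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: close to -1
```

<p>Notice that only the direction of the vectors matters, not their length.</p>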
<p>For example, to understand how cosine similarity works, let’s create a word embedding vector representation for three words: Man, Woman, and Cat. Then we’ll use cosine similarity to figure out which vectors are similar.</p>
<p>Based on our own instincts, we know that Man should be closer to Woman than Cat. So, let’s use NLP to help us validate this.</p>
<p>Thanks to advancements in NLP, there are numerous models we can use to create word embeddings, which you can find on the Hugging Face repository. In this article, we are going to use the <code>all-mpnet-base-v2</code> model from the <code>SentenceTransformer</code> library. According to <code>SentenceTransformer</code>, it provides the best quality performance in terms of sentence embedding, and you can also use it to create word embeddings.</p>
<p>The below code allows us to validate our claim using NLP. First, we initialize the <code>SentenceTransformer</code> with <code>all-mpnet-base-v2</code> and then use the <code>encode</code> method to get the embedding of each word. Then, finally, we’ll use the <code>cos_sim</code> function, from the <code>sentence_transformers.util</code> module, to determine which vectors are similar.</p>
<pre><code class="lang-python"><span class="hljs-comment"># import library</span>
<span class="hljs-keyword">from</span> sentence_transformers <span class="hljs-keyword">import</span> SentenceTransformer <span class="hljs-comment"># sentence transformer</span>
<span class="hljs-keyword">from</span> sentence_transformers.util <span class="hljs-keyword">import</span> cos_sim <span class="hljs-comment"># cosine similarity</span>

<span class="hljs-comment"># initialize sentence transformer with the 'all-mpnet-base-v2' model</span>
model = SentenceTransformer(<span class="hljs-string">"all-mpnet-base-v2"</span>)
</code></pre>
<pre><code class="lang-python"><span class="hljs-comment"># get the embedding vector of the man, woman, and cat words.</span>
man_vector = model.encode(<span class="hljs-string">"man"</span>)
woman_vector = model.encode(<span class="hljs-string">"woman"</span>)
cat_vector = model.encode(<span class="hljs-string">"cat"</span>)

<span class="hljs-comment"># get the similarity between man and woman</span>
similarity = cos_sim(man_vector, woman_vector)

<span class="hljs-comment"># get the similarity between man and cat</span>
cat_similarity = cos_sim(man_vector, cat_vector)

print(<span class="hljs-string">"The Similarity between Man vector and Woman Vector:"</span>, similarity, <span class="hljs-string">"\n"</span>)

print(<span class="hljs-string">"The Similarity between Man vector and Cat Vector:"</span>, cat_similarity)
</code></pre>
<p>Below is the output of the above code:</p>
<pre><code class="lang-plaintext">The Similarity between Man vector and Woman Vector: tensor([[0.3501]]) 

The Similarity between Man vector and Cat Vector: tensor([[0.2553]])
</code></pre>
<p>As you can see, the similarity score between man and woman (0.35) is higher than that of man and cat (0.26). This shows the beauty of word embedding and cosine similarity together.</p>
<p>Now let’s get back to our business.</p>
<h2 id="heading-how-to-perform-semantic-matching-on-a-pdf-document">How to Perform Semantic Matching on a PDF Document</h2>
<p>Now we are going to use semantic matching to look for a word or phrase in the document that matches the <strong>birth control</strong> phrase.</p>
<h3 id="heading-how-to-get-words-from-the-pdf-using-keybert">How to Get Words from the PDF using KeyBERT</h3>
<p>Word embedding generates embeddings for individual words. Our PDF document contains a <strong>large volume of textual components</strong>, including digits, special characters, symbols, stopwords, and the actual words we want to match. So, to save time on preprocessing, we are going to utilize <code>KeyBERT</code>. This is a library that lets us extract meaningful keywords (words or phrases) from a document with minimal code.</p>
<p>Keep in mind that by default, <code>KeyBERT</code> extracts single keywords – but we can also tell it to extract phrases with two or more words. We’ll use it here to extract single-word and 2-word phrases. Below is the implementation of using <code>KeyBERT</code> to extract keywords from our document:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> keybert <span class="hljs-keyword">import</span> KeyBERT
<span class="hljs-comment"># initialize model</span>
keybert_model =  KeyBERT()

<span class="hljs-comment"># extract all keywords (single word and 2 word phrase) from the pdf</span>
all_keywords = keybert_model.extract_keywords(docs=pdf_document, top_n=<span class="hljs-number">-1</span>, keyphrase_ngram_range=(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>))
<span class="hljs-comment"># print length of keywords extracted                                             </span>
print(len(all_keywords))
<span class="hljs-comment"># show the first 5 keywords</span>
print(all_keywords[:<span class="hljs-number">5</span>])
</code></pre>
<p>The above code imports <code>KeyBERT</code> from the <code>keybert</code> library. It then initializes <code>KeyBERT</code> and extracts all keywords (that is, single words and 2-word phrases) from the document. The next line prints the number of keywords extracted. Lastly, the code prints the first five keywords out of all the keywords extracted from the PDF.</p>
<p>Below is the output of the above code:</p>
<pre><code class="lang-python"><span class="hljs-number">8669</span>
[(<span class="hljs-string">'education guidance'</span>, <span class="hljs-number">0.5954</span>),
 (<span class="hljs-string">'schools guidance'</span>, <span class="hljs-number">0.5542</span>),
 (<span class="hljs-string">'education policies'</span>, <span class="hljs-number">0.5405</span>),
 (<span class="hljs-string">'sex education'</span>, <span class="hljs-number">0.5228</span>),
 (<span class="hljs-string">'education safeguarding'</span>, <span class="hljs-number">0.5001</span>)]
</code></pre>
<p>As you can see above, KeyBERT extracted 8,669 keywords from the PDF. Also, the <code>KeyBERT</code> model usually returns the keywords extracted along with a score of each word. We don’t need the score, so we will only extract each keyword from the tuple it is enclosed in.</p>
<pre><code class="lang-python"><span class="hljs-comment"># remove score from each keyword</span>

all_keywords = [keyword[<span class="hljs-number">0</span>] <span class="hljs-keyword">for</span> keyword <span class="hljs-keyword">in</span> all_keywords]
all_keywords[:<span class="hljs-number">5</span>]
</code></pre>
<p>Below is the output of the above code:</p>
<pre><code class="lang-python">[<span class="hljs-string">'education guidance'</span>,
 <span class="hljs-string">'schools guidance'</span>,
 <span class="hljs-string">'education policies'</span>,
 <span class="hljs-string">'sex education'</span>,
 <span class="hljs-string">'education safeguarding'</span>]
</code></pre>
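<p>To make the <code>keyphrase_ngram_range=(1, 2)</code> setting used earlier more concrete, here is a rough illustration of what single-word and 2-word candidates look like. This is a simplified sketch with a hypothetical helper named <code>ngram_candidates</code>, not KeyBERT's actual code, which also handles stop words and scoring:</p>

```python
def ngram_candidates(text, ngram_range=(1, 2)):
    """Return every run of n consecutive words for each n in the range."""
    words = text.lower().split()
    low, high = ngram_range
    candidates = []
    for n in range(low, high + 1):
        for start in range(len(words) - n + 1):
            candidates.append(" ".join(words[start:start + n]))
    return candidates


print(ngram_candidates("sex education guidance"))
# ['sex', 'education', 'guidance', 'sex education', 'education guidance']
```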
<h3 id="heading-embedding-of-the-birth-control-phrase-and-the-keywords-extracted-from-the-pdf">Embedding of the Birth Control Phrase and the Keywords Extracted from the PDF</h3>
<p>Now that we’ve extracted these keywords from the document, the next step is to get the embedding of our phrase and the keywords from the document.</p>
<p>The below code lets us do this:</p>
<pre><code class="lang-python"><span class="hljs-comment"># initialize sentence transformer with the 'all-mpnet-base-v2' model</span>
model = SentenceTransformer(<span class="hljs-string">"all-mpnet-base-v2"</span>)

<span class="hljs-comment"># get the embedding of the 'birth control' phrase</span>
birth_control_embedding = model.encode(<span class="hljs-string">"birth control"</span>)

<span class="hljs-comment"># get the embedding of all the keywords in the document</span>
keywords_embedding =  model.encode(all_keywords)
</code></pre>
<h3 id="heading-cosine-similarity-of-birth-control-phrase-and-keywords-in-pdf">Cosine Similarity of Birth Control Phrase and Keywords in PDF</h3>
<p>After getting the embedding of the phrase and the keywords, the next step is to get the similarity score of the phrase and the keywords. This will help us know which keyword in the document is highly similar to the phrase.</p>
<p>The below code allows us to get the cosine similarity of the phrase and the keywords’ embedding vector.</p>
<pre><code class="lang-python"><span class="hljs-comment"># calculate the cosine similarity of the birth control word and each word in the document</span>
cosine_similarity_result = cos_sim(birth_control_embedding, keywords_embedding)
<span class="hljs-comment"># print the shape (equal to the number of keywords)</span>
print(cosine_similarity_result.shape)
<span class="hljs-comment"># show the similarity scores (the tensor has a single row, so [:5] returns that whole row)</span>
print(cosine_similarity_result[:<span class="hljs-number">5</span>])
</code></pre>
<p>Below is the output of the above code:</p>
<pre><code class="lang-python">torch.Size([<span class="hljs-number">1</span>, <span class="hljs-number">2034</span>])
tensor([[<span class="hljs-number">0.2166</span>, <span class="hljs-number">0.1977</span>, <span class="hljs-number">0.0998</span>,  ..., <span class="hljs-number">0.1634</span>, <span class="hljs-number">0.1082</span>, <span class="hljs-number">0.2194</span>]])
</code></pre>
<p>The resulting tensor holds one similarity score per keyword, so its second dimension equals the number of keywords, as shown above. We can use the <code>argmax()</code> method to get the index of the highest score in the tensor. We can then use this index to look up the matching keyword in the <code>all_keywords</code> list. The below code achieves this:</p>
<pre><code class="lang-python"><span class="hljs-comment"># return the index of the highest similarity score</span>
index = cosine_similarity_result.argmax()
print(index)
</code></pre>
<p>Below is the output of the above code. It tells us that the keyword with the highest similarity to the <strong>Birth Control phrase</strong> is at index 1490.</p>
<pre><code class="lang-python">tensor(<span class="hljs-number">1490</span>)
</code></pre>
<p>Now, let’s look at the keyword at index 1490 in the <code>all_keywords</code> variable.</p>
<pre><code class="lang-python"><span class="hljs-comment"># print the keyword at index 1490 </span>
print(all_keywords[index])
</code></pre>
<p>Below is the output of the above code:</p>
<pre><code class="lang-python">contraceptive
</code></pre>
<p>After examining it, we found that "contraceptive" was the word with the highest similarity, which makes sense because "birth control" and "contraceptive" mean the same thing. This demonstrates the elegance of semantic matching in finding similar words.</p>
<h3 id="heading-lets-also-explore-top-5-keywords-in-the-pdf-that-match-with-the-phrase-birth-control">Let’s Also Explore Top 5 Keywords in the PDF that Match with the Phrase “Birth Control”</h3>
<p>Let’s explore the 5 top keywords with the highest similarity score to “birth control” to see what the result would look like.</p>
<p>To do that, we can use the <code>topk()</code> method to get the top 5 indices. Then we can loop through these indices to get the actual keywords:</p>
<pre><code class="lang-python"><span class="hljs-comment"># extract the top 5 indices</span>
top_5_indices = cosine_similarity_result.topk(<span class="hljs-number">5</span>)[<span class="hljs-number">1</span>].tolist()[<span class="hljs-number">0</span>]

print(top_5_indices)
</code></pre>
<p>Below is the result of the above code:</p>
<pre><code class="lang-python">[<span class="hljs-number">1490</span>, <span class="hljs-number">1972</span>, <span class="hljs-number">871</span>, <span class="hljs-number">1199</span>, <span class="hljs-number">1944</span>]
</code></pre>
<pre><code class="lang-python"><span class="hljs-comment"># get top 5 keywords</span>
top_5_keywords = [all_keywords[index] <span class="hljs-keyword">for</span> index <span class="hljs-keyword">in</span> top_5_indices]
print(top_5_keywords)
</code></pre>
<p>Below is the output of the above code:</p>
<pre><code class="lang-python">[<span class="hljs-string">'contraceptive'</span>, <span class="hljs-string">'contraception'</span>, <span class="hljs-string">'contraceptive choices'</span>, <span class="hljs-string">'range contraceptive'</span>, <span class="hljs-string">'cover contraception'</span>]
</code></pre>
<p>There, we can see that the top five results relate to contraception and contraceptives. This demonstrates that semantic matching is an effective way to find related elements in a document.</p>
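<p>It can also be useful to see each keyword next to its score rather than the keywords alone. The sketch below shows one way to do that in plain Python. The helper name <code>top_matches</code>, the keywords, and the scores are all made up for illustration and are not results from the PDF:</p>

```python
import heapq


def top_matches(keywords, scores, k=5):
    """Return the k (keyword, score) pairs with the highest scores."""
    return heapq.nlargest(k, zip(keywords, scores), key=lambda pair: pair[1])


# Hypothetical keywords and similarity scores, for illustration only.
keywords = ["education guidance", "contraceptive", "online safety", "contraception"]
scores = [0.22, 0.71, 0.10, 0.68]

for keyword, score in top_matches(keywords, scores, k=2):
    print(keyword, score)
# contraceptive 0.71
# contraception 0.68
```

<p>With the variables from this article, the scores could be taken from <code>cosine_similarity_result[0].tolist()</code> and paired with <code>all_keywords</code>, assuming both have the same length.</p>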
<h2 id="heading-wrapping-up">Wrapping Up</h2>
<p>In this article, you learned what semantic matching is and its advantages compared to traditional NLP search methods. You also encountered concepts such as word embeddings and cosine similarity and learned how they help us perform semantic matching. Then we implemented semantic matching by finding a phrase in a document.</p>
<p>Thank you for reading this article, and I will see you in the next one.</p>
<h3 id="heading-references">References</h3>
<ol>
<li><p><a target="_blank" href="https://sbert.net/">https://sbert.net/</a></p>
</li>
<li><p><a target="_blank" href="https://maartengr.github.io/KeyBERT/guides/quickstart.html">https://maartengr.github.io/KeyBERT/guides/quickstart.html</a></p>
</li>
<li><p><a target="_blank" href="https://huggingface.co/spaces/mteb/leaderboard">https://huggingface.co/spaces/mteb/leaderboard</a></p>
</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Natural Language Processing Techniques for Topic Identification – Explained with Examples ]]>
                </title>
                <description>
                    <![CDATA[ There's a lot of textual information available these days. It ranges from articles to social media posts and research papers. So our ability to distill meaningful insights is key. This helps us make informed decisions in a wide array of contexts. For... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/topic-identification-using-natural-language-processing/</link>
                <guid isPermaLink="false">66d45f44052ad259f07e4af0</guid>
                
                    <category>
                        <![CDATA[ natural language processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ nlp ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Thu, 25 Jan 2024 16:16:15 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/01/pexels-wallace-chuck-3109168.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>There's a lot of textual information available these days. It ranges from articles to social media posts and research papers. So our ability to distill meaningful insights is key. This helps us make informed decisions in a wide array of contexts.</p>
<p>For example, you can analyze a large volume of textual content to extract a common theme. Companies and businesses utilize this technique to understand public opinion about their brand. This lets them make informed decisions and improve their services.</p>
<p>The ability to extract themes from a large amount of textual data is referred to as topic identification.</p>
<p>In this article, you will learn how to utilize NLP techniques for topic identification, enhancing your skillset as a data scientist. So sit back, because it's gonna be an interesting journey.</p>
<h2 id="heading-what-is-topic-identification">What is Topic Identification?</h2>
<p>Topic identification, simply put, is a sub-field under natural language processing. It involves the process of automatically discovering and organizing the main themes or topics present in a collection of textual data.</p>
<p>There are several Natural Language Processing (NLP) techniques you can use to identify themes in text, from simple ones to more algorithm-based techniques. In this article, we will look at the common NLP techniques used for topic identification.</p>
<p>I recently tweeted about the essence of NLP. It really is purely statistics, because there are different manipulations you can do to ensure that numbers serve as representations for text (since computers don't understand text).</p>
<div class="embed-wrapper">
        <blockquote class="twitter-tweet">
          <a href="https://twitter.com/Ibrahim_Geek/status/1742877290227187989?s=20"></a>
        </blockquote>
        <script defer="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></div>
<h2 id="heading-requirements-for-this-project">Requirements for this Project</h2>
<p>In order for you to be able to follow along and get hands-on practical experience while learning, you should have Python 3.x installed on your machine.</p>
<p>We'll also use the following libraries: Gensim, Scikit-Learn, and NLTK. You can install them using the Pip package installer with the following command:</p>
<pre><code class="lang-bash">pip install gensim nltk scikit-learn
</code></pre>
<h2 id="heading-techniques-used-in-nlp-for-topic-identification">Techniques Used in NLP for Topic Identification</h2>
<p>There are various techniques you can use for topic identification. In this article, you will learn about some common NLP techniques that work quite well, from simple and effective methods to more advanced ones.</p>
<h3 id="heading-bag-of-words">Bag of Words</h3>
<p>Bag of Words (BoW) is a common representation used in NLP for textual data. You can use it to count the frequency at which each word occurs in a document.</p>
<p>BoW, in the context of topic identification, is based on the assumption that the more frequently a word occurs in a document, the more important it is. Then you can use those more common words to infer what the document is all about.</p>
<p>Bag of Words is one of the simplest techniques used to identify topics in NLP. While it is simple and efficient, it is highly affected by stop words, which are common words in text data (like "the," "and," "is," and so on).</p>
<p>But once you remove stop words and apply basic text preprocessing (using techniques like normalization), BoW can still prove effective in identifying some main topics.</p>
<p>Let's look at how you can use BoW to identify the topic below.</p>
<h4 id="heading-how-to-implement-of-bag-of-words-in-python">How to Implement Bag of Words in Python</h4>
<p>A bit of background about the example article we'll use here: I got it from the BBC, and it's titled "US lifts ban on imports of latest Apple watch." The article discusses the lifted ban on Apple's latest watches, Ultra 2 and Series 9.</p>
<p>Now let's go over how to implement the bag of words in Python. I'll break this code block up into sections and explain each part as I go to make it a bit easier to digest.</p>
<pre><code class="lang-python"><span class="hljs-comment">#import necessary libraries</span>
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter
<span class="hljs-keyword">from</span> nltk.tokenize <span class="hljs-keyword">import</span> word_tokenize
<span class="hljs-keyword">from</span> nltk.corpus <span class="hljs-keyword">import</span> stopwords

article = <span class="hljs-string">"Apple's latest smart watches can resume being sold in the US after the tech company filed an emergency appeal with authorities.\
Sales of the Series 9 and Ultra 2 watches had been halted in the US over a patent row.\
The US's trade body had barred imports and sales of Apple watches with technology for reading blood-oxygen level.\
Device maker Masimo had accused Apple of poaching its staff and technology. \
It comes after the White House declined to overturn a ban on sales and imports of the Series 9 and Ultra 2 watches which came into effect this week.\
Apple had said it strongly disagrees with the ruling.\
The iPhone maker made an emergency request to the US Court of Appeals, which proved successful in getting the ban lifted."</span>
</code></pre>
<p>In the above code, we're importing the necessary libraries that we'll use to implement the BoW.</p>
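<p>We'll use the <code>Counter</code> class to count the frequency of each word, and the <code>word_tokenize</code> function to split the document into individual word tokens so they can be counted. Lastly, the <code>stopwords</code> corpus provides the list of stop words to remove from the document. Both NLTK helpers rely on data files that you may need to fetch once with <code>nltk.download()</code>.</p>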
<pre><code class="lang-python">
<span class="hljs-comment"># Initialize english stopwords</span>
english_stopwords = stopwords.words(<span class="hljs-string">"english"</span>)

<span class="hljs-comment">#convert article to tokens</span>
tokens = word_tokenize(article)

<span class="hljs-comment">#extract alpha words and convert to lowercase</span>
alpha_lower_tokens = [word.lower() <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> tokens <span class="hljs-keyword">if</span> word.isalpha()]

<span class="hljs-comment">#remove stopwords</span>
alpha_no_stopwords = [word <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> alpha_lower_tokens <span class="hljs-keyword">if</span> word <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> english_stopwords]

<span class="hljs-comment">#Count word</span>
BoW = Counter(alpha_no_stopwords)

<span class="hljs-comment">#3 Most common words</span>
BoW.most_common(<span class="hljs-number">3</span>)
</code></pre>
<p>In the above code, the first line loads the list of English stop words. The second line tokenizes the article string into individual words. The third line keeps only alphabetic tokens and converts them to lowercase. The fourth line removes the stop words. The last two lines count the frequency of each word and select the three most common words.</p>
<p>Below is the output of the BoW model:</p>
<pre><code class="lang-javascript">[(<span class="hljs-string">'watches'</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">'us'</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">'apple'</span>, <span class="hljs-number">3</span>), (<span class="hljs-string">'emergency'</span>, <span class="hljs-number">2</span>)]
</code></pre>
<p>From this, we can infer that the article is all about "Apple's watches in the US". As you can see, even though the reasoning behind the bag of words is simple, it is still possible to infer a bit of knowledge about the article.</p>
<h3 id="heading-latent-dirichlet-allocation">Latent Dirichlet Allocation</h3>
<p>Latent Dirichlet Allocation, or LDA for short, is a popular probabilistic model used in NLP and machine learning for topic modeling (using algorithms to identify topics). It is based on the assumption that documents are mixtures of topics, and topics are mixtures of words.</p>
<p>Simply put, LDA is an NLP technique used to identify the topic to which a document belongs based on the words contained in the document.</p>
<p>LDA operates on the bag-of-words representation of documents, where each document is represented as a vector of word frequencies. You can implement LDA using the Gensim library in Python (which is an open source library used for topic modelling and document similarity analysis).</p>
<p>Steps for implementing LDA include:</p>
<ul>
<li><p><strong>Import Libraries:</strong> First step is to import the necessary libraries you will be utilizing.</p>
</li>
<li><p><strong>Data Preparation:</strong> Convert raw data to a document format then tokenize, remove stop words, and optionally perform stemming or lemmatization.</p>
</li>
<li><p><strong>Create Dictionary and Corpus</strong>: Build a dictionary with unique word IDs. Then form a bag of words corpus representing document-word frequency.</p>
</li>
<li><p><strong>Train LDA Model</strong>: Use the document-word frequency and dictionary to train the LDA model, setting the desired number of topics.</p>
</li>
<li><p><strong>Print Topics</strong>: Explore and print the discovered topics.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Import the necessary libraries</span>
<span class="hljs-keyword">from</span> gensim.corpora.dictionary <span class="hljs-keyword">import</span> Dictionary
<span class="hljs-keyword">from</span> gensim.models <span class="hljs-keyword">import</span> LdaModel
<span class="hljs-keyword">from</span> nltk <span class="hljs-keyword">import</span> sent_tokenize, word_tokenize
<span class="hljs-keyword">from</span> nltk.corpus <span class="hljs-keyword">import</span> stopwords

article = <span class="hljs-string">"Apple's latest smart watches can resume being sold in the US after the tech company filed an emergency appeal with authorities. \
Sales of the Series 9 and Ultra 2 watches had been halted in the US over a patent row. \
The US's trade body had barred imports and sales of Apple watches with technology for reading blood-oxygen level. \
Device maker Masimo had accused Apple of poaching its staff and technology. \
It comes after the White House declined to overturn a ban on sales and imports of the Series 9 and Ultra 2 watches which came into effect this week. \
Apple had said it strongly disagrees with the ruling. \
The iPhone maker made an emergency request to the US Court of Appeals, which proved successful in getting the ban lifted."</span>
</code></pre>
<p>The above lines of code include the necessary libraries that we'll use to implement the LDA.</p>
<p>The first line imports the <code>Dictionary</code> class. The second line imports the LDA model, and the third line imports <code>sent_tokenize</code>, which we'll use to split the article into documents (one per sentence), along with <code>word_tokenize</code>, which will tokenize each document into individual words. Lastly, we import the <code>stopwords</code> corpus.</p>
<pre><code class="lang-python"><span class="hljs-comment"># convert article to documents</span>
documents = sent_tokenize(article)

<span class="hljs-comment"># tokenize and normalize the documents</span>
tokenized_words = [word_tokenize(doc.lower()) <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> documents]

<span class="hljs-comment"># load the English stop words (also defined in the BoW example above)</span>
english_stopwords = stopwords.words(<span class="hljs-string">"english"</span>)

<span class="hljs-comment"># remove stop words and keep only alphabetic tokens</span>
cleaned_token = [[word <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> sentence <span class="hljs-keyword">if</span> word <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> english_stopwords <span class="hljs-keyword">and</span> word.isalpha()]
                 <span class="hljs-keyword">for</span> sentence <span class="hljs-keyword">in</span> tokenized_words]

<span class="hljs-comment"># create a dictionary</span>
dictionary = Dictionary(cleaned_token)

<span class="hljs-comment"># Create a corpus from the document</span>
corpus = [dictionary.doc2bow(text) <span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> cleaned_token]
</code></pre>
<p>The above lines of code include the preprocessing steps that will be performed on the article, including converting the article to a document, normalizing, and tokenizing the document into individual words.</p>
<p>The next part removes stop words from the text and keeps only the alphabetic tokens in each document. After that, we create a dictionary, which is a map between each word and its numerical identifier. The last line of code then creates a corpus: for each document, a list of (word ID, frequency) pairs.</p>
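<p>To make the dictionary and corpus structures concrete, here is a minimal pure-Python sketch of the same idea. The toy documents and the names <code>word_to_id</code> and <code>to_bow</code> are hypothetical and only for illustration. Gensim's <code>Dictionary</code> may assign IDs in a different order, so treat this as a picture of the data shapes rather than a reimplementation.</p>

```python
from collections import Counter

# Toy tokenized documents (hypothetical data, already cleaned)
cleaned_docs = [
    ["apple", "watches", "us", "watches"],
    ["apple", "ban", "us"],
]

# Dictionary: map each unique word to an integer id.
# Assumption: ids are assigned in alphabetical order for readability.
vocabulary = sorted({word for doc in cleaned_docs for word in doc})
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}


def to_bow(doc, word_to_id):
    """Return a sorted list of (word_id, count) pairs for one document."""
    counts = Counter(word_to_id[word] for word in doc)
    return sorted(counts.items())


# Corpus: one bag-of-words list per document
corpus = [to_bow(doc, word_to_id) for doc in cleaned_docs]

print(word_to_id)  # {'apple': 0, 'ban': 1, 'us': 2, 'watches': 3}
print(corpus)      # [[(0, 1), (2, 1), (3, 2)], [(0, 1), (1, 1), (2, 1)]]
```

<p>Each inner list plays the role of one <code>doc2bow</code> result: the word itself is gone, and only its ID and its count in that document remain.</p>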
<pre><code class="lang-javascript"># Build the LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=<span class="hljs-number">3</span>)

# Print the topics
print(<span class="hljs-string">"Identified Topics:"</span>)
<span class="hljs-keyword">for</span> idx, topic <span class="hljs-keyword">in</span> lda_model.print_topics():
    print(f<span class="hljs-string">"Topic {idx + 1}: {topic}"</span>)
</code></pre>
<p>The above code trains the model on the corpus and then prints the 3 topics it identified in the article, each shown as a weighted list of its most probable words.</p>
<p>Below is the output of the LDA Model:</p>
<pre><code class="lang-javascript">Identified Topics:
Topic <span class="hljs-number">1</span>: <span class="hljs-number">0.045</span>*<span class="hljs-string">"9"</span> + <span class="hljs-number">0.045</span>*<span class="hljs-string">"ultra"</span> + <span class="hljs-number">0.044</span>*<span class="hljs-string">"sales"</span> + <span class="hljs-number">0.044</span>*<span class="hljs-string">"2"</span> + <span class="hljs-number">0.043</span>*<span class="hljs-string">"series"</span> + <span class="hljs-number">0.043</span>*<span class="hljs-string">"watches"</span> + <span class="hljs-number">0.029</span>*<span class="hljs-string">"apple"</span> + <span class="hljs-number">0.028</span>*<span class="hljs-string">"ruling"</span> + <span class="hljs-number">0.028</span>*<span class="hljs-string">"disagrees"</span> + <span class="hljs-number">0.028</span>*<span class="hljs-string">"said"</span>
Topic <span class="hljs-number">2</span>: <span class="hljs-number">0.051</span>*<span class="hljs-string">"maker"</span> + <span class="hljs-number">0.035</span>*<span class="hljs-string">"ban"</span> + <span class="hljs-number">0.035</span>*<span class="hljs-string">"us"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"emergency"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"made"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"successful"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"court"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"lifted"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"request"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"proved"</span>
Topic <span class="hljs-number">3</span>: <span class="hljs-number">0.055</span>*<span class="hljs-string">"apple"</span> + <span class="hljs-number">0.054</span>*<span class="hljs-string">"us"</span> + <span class="hljs-number">0.054</span>*<span class="hljs-string">"watches"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"sales"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"technology"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"imports"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"authorities"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"barred"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"appeal"</span> + <span class="hljs-number">0.031</span>*<span class="hljs-string">"filed"</span>
</code></pre>
<p>The LDA technique shows some improvement compared to the BoW method. We can obtain more information: the article is about a ban related to Apple's Ultra and Series watches in the US.</p>
<h3 id="heading-non-negative-matrix-factorization">Non-Negative Matrix Factorization</h3>
<p>Non-Negative Matrix Factorization (NMF), just like LDA, is another topic modeling technique that uncovers latent topics in a collection of documents.</p>
<p>But instead of relying on BoW, it relies on the Term Frequency-Inverse Document Frequency (TF-IDF) representation to capture and retrieve hidden themes or topics from the documents.</p>
<p>By incorporating TF-IDF information, NMF is able to weigh the importance of terms, thereby identifying more hidden patterns. You can perform NMF using the Scikit-learn library.</p>
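<p>To see why TF-IDF lets less frequent words stand out, here is a minimal sketch of the textbook formula, term frequency multiplied by <code>log(N / document frequency)</code>. The toy documents and the function name <code>tf_idf</code> are hypothetical. Scikit-learn's <code>TfidfVectorizer</code> uses a smoothed and normalized variant, so its numbers will differ from these.</p>

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical data)
docs = [
    ["apple", "watches", "us"],
    ["apple", "masimo", "accused"],
    ["apple", "us", "ban"],
]


def tf_idf(docs):
    """Compute tf * log(N / df) for every word in every document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each word
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {word: (tf[word] / len(doc)) * math.log(n_docs / df[word]) for word in tf}
        )
    return weights


weights = tf_idf(docs)
for word, weight in weights[1].items():
    print(f"{word}: {weight:.3f}")
```

<p>In this sketch, "apple" appears in every document, so its weight is zero, while "masimo" appears in only one document and receives the highest weight in that document.</p>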
<h3 id="heading-steps-for-performing-nmf">Steps for performing NMF</h3>
<ul>
<li><p>Import necessary libraries</p>
</li>
<li><p>Data Preparation: Convert the text into documents, then perform any necessary data preparation, such as removing stop words. The TF-IDF vectorizer in Scikit-learn has an argument (<code>stop_words</code>) that does that.</p>
</li>
<li><p>Convert the document to a TF-IDF matrix using the TF-IDF vectorizer in Scikit-learn</p>
</li>
<li><p>Apply NMF to the TF-IDF matrix, specifying the number of topics you want, then choose how many words to display for each topic</p>
</li>
<li><p>Lastly, interpret your result.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># import the necessary libraries</span>
<span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> TfidfVectorizer
<span class="hljs-keyword">from</span> sklearn.decomposition <span class="hljs-keyword">import</span> NMF

article = <span class="hljs-string">"Apple's latest smart watches can resume being sold in the US after the tech company filed an emergency appeal with authorities. \
Sales of the Series 9 and Ultra 2 watches had been halted in the US over a patent row. \
The US's trade body had barred imports and sales of Apple watches with technology for reading blood-oxygen level. \
Device maker Masimo had accused Apple of poaching its staff and technology. \
It comes after the White House declined to overturn a ban on sales and imports of the Series 9 and Ultra 2 watches which came into effect this week. \
Apple had said it strongly disagrees with the ruling. \
The iPhone maker made an emergency request to the US Court of Appeals, which proved successful in getting the ban lifted."</span>
</code></pre>
<p>The above code contains the libraries that we'll use to implement NMF and the article itself.</p>
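<pre><code class="lang-python"><span class="hljs-comment"># sent_tokenize comes from NLTK (imported in the LDA example above)</span>
<span class="hljs-keyword">from</span> nltk <span class="hljs-keyword">import</span> sent_tokenize

<span class="hljs-comment"># convert article to documents</span>
documents = sent_tokenize(article)

<span class="hljs-comment"># Create a TF-IDF vectorizer and build the TF-IDF matrix</span>
tfidf_vectorizer = TfidfVectorizer(stop_words=<span class="hljs-string">'english'</span>)
tfidf = tfidf_vectorizer.fit_transform(documents)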

<span class="hljs-comment"># Apply NMF</span>
num_topics = <span class="hljs-number">5</span>  <span class="hljs-comment"># Set the number of topics you want to identify</span>
nmf_model = NMF(n_components=num_topics, init=<span class="hljs-string">'random'</span>, random_state=<span class="hljs-number">42</span>)
nmf_matrix = nmf_model.fit_transform(tfidf)
</code></pre>
<p>The above code converts the article into documents. Then it creates a Term-Frequency Inverse Document Frequency matrix of the article document. The last three lines of code then define the number of topics and create the topics from the document matrix using the NMF.</p>
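<p>The article does not show how the topic words were printed, so here is one possible sketch (not necessarily the method used to produce the output). The helper name <code>top_words_per_topic</code> and the toy weights are hypothetical. With a fitted model, you would pass <code>nmf_model.components_</code> as the weights and the feature names returned by the fitted <code>TfidfVectorizer</code> object's <code>get_feature_names_out()</code> method.</p>

```python
def top_words_per_topic(components, feature_names, n_words=3):
    """Return the n_words highest-weighted terms for each topic row."""
    topics = []
    for row in components:
        # Rank term indices by weight, highest first
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        topics.append([feature_names[i] for i in ranked[:n_words]])
    return topics


# Toy topic-term weights (hypothetical numbers: 2 topics x 4 terms)
components = [
    [0.9, 0.1, 0.0, 0.5],
    [0.0, 0.7, 0.8, 0.1],
]
feature_names = ["watches", "masimo", "accused", "sales"]

for idx, words in enumerate(top_words_per_topic(components, feature_names, n_words=2)):
    print(f"Topic #{idx + 1}: {', '.join(words)}")
# Topic #1: watches, sales
# Topic #2: accused, masimo
```

<p>Each row of the weight matrix corresponds to one topic and each column to one term, so sorting a row by weight gives the words that best describe that topic.</p>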
<p>Below is the output of the NMF Model:</p>
<pre><code class="lang-javascript">Topic #<span class="hljs-number">1</span>: ultra, series, sales, watches, row, halted, patent, white, house, effect
Topic #<span class="hljs-number">2</span>: lifted, court, iphone, getting, request, successful, proved, appeals, ban, maker
Topic #<span class="hljs-number">3</span>: disagrees, strongly, ruling, said, apple, body, blood, level, trade, oxygen
Topic #<span class="hljs-number">4</span>: filed, resume, appeal, latest, tech, authorities, sold, smart, company, emergency
Topic #<span class="hljs-number">5</span>: technology, apple, accused, masimo, device, staff, poaching, maker, trade, level
</code></pre>
<p>You can see that NMF reveals more insights concerning the themes of the document. For example, you can tell that another company called Masimo is accusing Apple of a patent infringement in their Ultra series watches.</p>
<h2 id="heading-how-to-choose-which-technique-to-use">How to Choose Which Technique to Use?</h2>
<p>I recommend experimenting with all the approaches in order to gain different perspectives concerning the contents of your document.</p>
<p>Bag of Words and LDA are based on how frequently words occur, making these techniques useful for inferring the biggest/most general themes about the document.</p>
<p>On the other hand, when using NMF, which is based on TF-IDF, less frequent words can be used to infer additional topics and provide a different perspective on the document.</p>
<p>For example, NMF was able to identify key terms like "Masimo" and "accused," whereas LDA was not able to do this. So depending on your needs, go ahead and experiment with all the approaches to see which one is able to yield better results.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this article, you've learned about topic identification and how you can use it to extract themes or topics from a large document.</p>
<p>We covered some different techniques you can use to identify topics, including simple ones like BoW and more advanced ones like LDA and NMF.</p>
<p>Happy learning, and see you in the next one.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Write Common Date Functions in SQL with Examples ]]>
                </title>
                <description>
                    <![CDATA[ When querying data from a database, you will frequently encounter the date datatype. Depending on what you want to achieve, you may need to extract subset information from the date column, perform some operation, and so on. SQL provides a variety of ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/common-date-functions-in-sql-with-examples/</link>
                <guid isPermaLink="false">66d45f31052ad259f07e4ade</guid>
                
                    <category>
                        <![CDATA[ database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SQL ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Mon, 13 Mar 2023 16:49:25 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/pexels-bich-tran-760710.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When querying data from a database, you will frequently encounter the date datatype. Depending on what you want to achieve, you may need to extract subset information from the date column, perform some operation, and so on.</p>
<p>SQL provides a variety of date functions that can assist you with your task. In this tutorial, we will look at various common date functions in SQL and some examples to show how they work. Without further ado let's get started.</p>
<p>Note: There are numerous SQL flavors available, and the functions for completing a specific task may differ between flavors. This tutorial will concentrate on three of the most popular SQL flavors: <strong>PostgreSQL, MySQL, and SQL Server</strong>. We will start with PostgreSQL functions and then present the variants of the other flavors if they differ from PostgreSQL.</p>
<h2 id="heading-date-data-types">Date Data types</h2>
<p>Date data types are among the built-in data types in SQL that you use to store date values. Across database management systems or flavors, a date-and-time value is usually written in the timestamp format, that is <code>YYYY-MM-DD HH:MM:SS</code> – for example <code>2022-01-01 10:08:56</code>.</p>
<p>Before we get started, we will be using this table we created to explain the function we will be talking about later in the article. You can create it using the following query. Note the SQL flavor we are using is PostgreSQL.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">DROP</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">EXISTS</span> student;

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> student (
  student_id <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
  student_name <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">30</span>),
  admitted_date <span class="hljs-built_in">DATE</span>
);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">11</span>, <span class="hljs-string">'Ibrahim'</span>, <span class="hljs-string">'2012-10-01'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">7</span>, <span class="hljs-string">'Taiwo'</span>, <span class="hljs-string">'2013-12-01'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">9</span>, <span class="hljs-string">'Nurain'</span>, <span class="hljs-string">'2012-11-21'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">8</span>, <span class="hljs-string">'Joel'</span>, <span class="hljs-string">'2012-10-31'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">10</span>, <span class="hljs-string">'Mustapha'</span>, <span class="hljs-string">'2015-11-01'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">5</span>, <span class="hljs-string">'Muritadoh'</span>, <span class="hljs-string">'2011-09-01'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">2</span>, <span class="hljs-string">'Yusuf'</span>, <span class="hljs-string">'2022-05-03'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">3</span>, <span class="hljs-string">'Habeebah'</span>, <span class="hljs-string">'2012-11-01'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>, <span class="hljs-string">'Tomiwa'</span>, <span class="hljs-string">'2013-04-01'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">4</span>, <span class="hljs-string">'Gbadebo'</span>, <span class="hljs-string">'2008-10-01'</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">12</span>, <span class="hljs-string">'Tolu'</span>, <span class="hljs-string">'2009-11-21'</span>);


<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> student;
</code></pre>
<h2 id="heading-common-sql-date-functions">Common SQL Date Functions</h2>
<p>Let's look at the common date functions you will work with on a daily basis.</p>
<h3 id="heading-how-to-use-the-now-function">How to use the <code>Now()</code> function</h3>
<p>You use the <code>Now()</code> function to return the current timestamp (date + time) of the computer system where the database management system is hosted. In PostgreSQL, the result also includes the time zone, as shown below.</p>
<pre><code class="lang-javascript">SELECT NOW();
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/image-58.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The function for getting the current timestamp in MySQL is the same as in PostgreSQL – <code>Now()</code>. But in SQL Server, you use the function <code>CURRENT_TIMESTAMP</code>.</p>
<h3 id="heading-how-to-use-the-currentdate-function">How to use the <code>current_date</code> function</h3>
<p>This function, as the name implies, gets the current date of the computer system on which the SQL database is running. When retrieving the current date in PostgreSQL, you do not need to use parentheses, as you can see below:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">current_date</span>;
</code></pre>
<p>In MySQL, you use the <a target="_blank" href="https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html#function_curdate"><code>CURDATE()</code></a> function to get the current date, while SQL Server uses <a target="_blank" href="https://learn.microsoft.com/en-us/sql/t-sql/functions/getdate-transact-sql?view=sql-server-ver16"><code>GETDATE()</code></a>, which returns the current date together with the time.</p>
<h3 id="heading-how-to-use-the-extract-or-datepart-functions">How to use the <code>Extract()</code> or <code>Date_Part()</code> functions</h3>
<p>You use the Extract or date part functions to extract a certain part or unit of a date or date column.</p>
<p>Let's start with the Extract function. Its syntax looks like this:</p>
<pre><code class="lang-sql">EXTRACT(unit FROM date/date_column)
</code></pre>
<p>The unit part of the Extract function is a unit you can extract from a date such as <code>DAY</code>, <code>WEEK</code> , <code>YEAR</code> , <code>QUARTER</code> , and so on. Click <a target="_blank" href="https://dev.mysql.com/doc/refman/8.0/en/expressions.html#temporal-intervals">here</a> to see the list of units that you can extract from a date or date column in SQL.</p>
<p>Say, for instance, that you wish to extract the year each student was admitted from the <code>admitted_date</code> column of the student table we created earlier. You can achieve that using the <code>EXTRACT()</code> function, as shown below.</p>
<pre><code class="lang-javascript">SELECT 
    *,
    EXTRACT(YEAR FROM admitted_date) As <span class="hljs-string">"Year of Admission"</span>
FROM student;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/image-59.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The <code>EXTRACT()</code> function is available only in PostgreSQL and MySQL, and it works similarly in both. Another function that works like <code>EXTRACT()</code> is the date part function, which is available in PostgreSQL as <code>DATE_PART()</code> and in SQL Server as <code>DATEPART()</code>. Let's look at how it works.</p>
<p>The syntax in PostgreSQL looks a little different from the one SQL Server uses in that it has an underscore between the date and part. You also need to pass in the unit in single quotes, as shown below:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> DATE_PART(<span class="hljs-string">'Year'</span>, admitted_date)
<span class="hljs-keyword">FROM</span> student;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/image-60.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>For SQLServer there won't be any underscore between the date and part, and the unit will not be enclosed in single quotes. For example the above result can be generated in SQLServer as shown below.</p>
<pre><code class="lang-javascript">SELECT DATEPART(YEAR, admitted_date)
FROM student;
</code></pre>
<h3 id="heading-how-to-add-intervals-or-parts-to-dates">How to add intervals or parts to dates</h3>
<p>Intervals are units that you can add to a date – for example a days interval, time interval, and so on.</p>
<p>For example, say you want to add 1 day interval to all the dates in a particular table. In PostgreSQL there is no dedicated function you can use to add an interval to a particular date. Instead, you can do this using arithmetic operations.</p>
<p>The syntax for achieving that is shown below:</p>
<pre><code class="lang-javascript">SELECT date/date_column + INTERVAL <span class="hljs-string">'# unit'</span>
</code></pre>
<p>Where # is an integer such as 3, 4, and so on, and unit can be Days, Year, and so on. Click <a target="_blank" href="https://dev.mysql.com/doc/refman/8.0/en/expressions.html#temporal-intervals">here</a> for a list of units that can be passed as an interval.</p>
<p>Say, for instance, that you want to add an interval of 3 days to the <code>admitted_date</code> column in the student table. You can do this in PostgreSQL using the following query:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
    *,
    admitted_date + <span class="hljs-built_in">INTERVAL</span> <span class="hljs-string">'3 Days'</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"3_daysadded"</span>
<span class="hljs-keyword">FROM</span> student;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/image-93.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now that you've seen how to add intervals to dates in PostgreSQL, let's see how it is done in MySQL and SQLServer. In MySQL and SQLServer there are functions that you can use to add intervals to dates.</p>
<p>In MySQL, the function is called <code>DATE_ADD()</code> and the syntax is shown below:</p>
<pre><code class="lang-javascript">DATE_ADD(date/date_column, INTERVAL value unit)
</code></pre>
<p>For example, you can get the above table using MySQL by typing the following code:</p>
<pre><code class="lang-javascript">SELECT *,
    DATE_ADD(admitted_date, INTERVAL <span class="hljs-number">3</span> DAY) AS <span class="hljs-string">"3_daysadded"</span>
FROM student;
</code></pre>
<p>In SQLServer, the function you use is similar to the one in MySQL but with a small difference. The syntax for the function used is shown below:</p>
<pre><code class="lang-javascript">DATEADD (datepart/unit , number , date/date_column)
</code></pre>
<p>You can replicate the above table in SQLServer like this:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> *,
    <span class="hljs-keyword">DATEADD</span> (<span class="hljs-keyword">day</span> , <span class="hljs-number">3</span> , admitted_date) <span class="hljs-keyword">AS</span> <span class="hljs-string">"3_daysadded"</span>
<span class="hljs-keyword">FROM</span> student;
</code></pre>
<h3 id="heading-how-to-subtract-intervals-from-dates">How to subtract intervals from dates</h3>
<p>Subtracting intervals from dates in PostgreSQL works like adding intervals, except that the operator changes from plus to minus. For example, say you want to subtract 3 days from the admitted_date column. You can do this using the below code:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
    *,
    admitted_date - <span class="hljs-built_in">INTERVAL</span> <span class="hljs-string">'3 Days'</span> <span class="hljs-keyword">AS</span> <span class="hljs-string">"3_dayssubtracted"</span>
<span class="hljs-keyword">FROM</span> student;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/image-95.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>In MySQL, you use the <code>DATE_SUB()</code> function to subtract intervals from a date. You can replicate the above table in MySQL using the following query:</p>
<pre><code class="lang-javascript">SELECT *,
    DATE_SUB(admitted_date, INTERVAL <span class="hljs-number">3</span> DAY) AS <span class="hljs-string">"3_dayssubtracted"</span>
FROM student;
</code></pre>
<p>In SQLServer, you still use the DATEADD function, but instead of specifying a positive value in the function parameter, you use a negative value. It looks like this:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> *,
    <span class="hljs-keyword">DATEADD</span> (<span class="hljs-keyword">day</span> , <span class="hljs-number">-3</span> , admitted_date) <span class="hljs-keyword">AS</span> <span class="hljs-string">"3_dayssubtracted"</span>
<span class="hljs-keyword">FROM</span> student;
</code></pre>
<h3 id="heading-how-to-subtract-two-dates">How to subtract two dates</h3>
<p>To subtract two dates in PostgreSQL, there is also no dedicated function. But you can use arithmetic operators to achieve your desired result.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> <span class="hljs-string">'2012-10-31'</span>::<span class="hljs-built_in">date</span> -<span class="hljs-string">'2012-05-01'</span>::<span class="hljs-built_in">date</span> <span class="hljs-keyword">AS</span> <span class="hljs-keyword">days</span>;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/03/image-96.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>In MySQL, there is a function called <a target="_blank" href="https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html#function_datediff"><code>DATEDIFF()</code></a> that you can use to achieve this. SQL Server has a function of the same name, <a target="_blank" href="https://learn.microsoft.com/en-us/sql/t-sql/functions/datediff-transact-sql?view=sql-server-ver16"><code>DATEDIFF()</code></a>, which additionally takes the unit to count as its first argument.</p>
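<p>If you want to sanity-check date arithmetic like this outside the database, the same calculations can be sketched with Python's standard <code>datetime</code> module. This is only a cross-check under the assumption that plain calendar dates are involved (no time zones or time-of-day parts), and it is not a substitute for the SQL functions above.</p>

```python
from datetime import date, timedelta

# One of the admitted_date values from the student table
admitted = date(2012, 10, 31)

# Adding and subtracting a 3-day interval
three_days_added = admitted + timedelta(days=3)
three_days_subtracted = admitted - timedelta(days=3)
print(three_days_added)       # 2012-11-03
print(three_days_subtracted)  # 2012-10-28

# Difference between two dates, in days
days_between = (date(2012, 10, 31) - date(2012, 5, 1)).days
print(days_between)           # 183
```

<p>The last value matches what the PostgreSQL subtraction returns for the same two dates: the number of days between them.</p>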
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you've learned some common date functions you will use when working with dates in SQL.</p>
<p>You learned how to get the current timestamp, get the current date, extract parts from a date, and how to add or subtract dates. Also you learned how each date function differs across different SQL flavors.</p>
<p>Thank you for reading. You can check out the resources below to learn more about date functions across the three SQL flavors discussed in this article.</p>
<ol>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/sql/t-sql/functions/functions?view=sql-server-ver16">Microsoft SQL database functions</a></p>
</li>
<li><p><a target="_blank" href="https://www.postgresql.org/docs/current/functions-datetime.html">Postgres date/time functions and operators</a></p>
</li>
<li><p><a target="_blank" href="https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html">MySQL date and time functions reference manual</a></p>
</li>
</ol>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Window Functions in SQL – with Example Queries ]]>
                </title>
                <description>
                    <![CDATA[ Window functions are an advanced type of function in SQL. They let you work with observations more easily. Window functions give you access to features like advanced analytics and data manipulation without the need to write complex queries. In this l... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/window-functions-in-sql/</link>
                <guid isPermaLink="false">66d45f48706b9fb1c166b969</guid>
                
                    <category>
                        <![CDATA[ database ]]>
                    </category>
                
                    <category>
                        <![CDATA[ SQL ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Thu, 09 Feb 2023 21:47:41 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/02/windows-image.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Window functions are an advanced type of function in SQL. They let you work with observations more easily.</p>
<p>Window functions give you access to features like advanced analytics and data manipulation without the need to write complex queries.</p>
<p>In this lesson you will learn about what window functions are and how they work. Without further ado let's get started.</p>
<h2 id="heading-what-is-a-window-function">What is a Window Function?</h2>
<p>Before learning exactly what a window function is, let's define the meaning of a term that will appear frequently in this article: result set.</p>
<p>In SQL, a result set is the data or result that is returned from a query. That is, it's the result (table) of running the code of a select statement.</p>
<p>For you to understand what a window function is, let's break the words down into pieces.</p>
<h3 id="heading-what-exactly-is-a-window-in-sql">What exactly is a window in SQL?</h3>
<p>A window is basically a set of rows or observations in a table or result set. In a table you may have more than one window depending on how you specify the query – you will learn about this shortly. A window is defined using the <code>OVER()</code> clause in SQL.</p>
<p>You will learn how to determine the number of windows in a result set later in this article.</p>
<h3 id="heading-what-is-a-function">What is a Function?</h3>
<p>Functions are predefined in SQL and you use them to perform operations on data. They let you do things like aggregating data, formatting strings, extracting dates, and so on.</p>
<p>So window functions are SQL functions that enable us to perform operations on a window, that is, a set of records.</p>
<p>The interesting thing about window functions is that with them you can specify the windows you want to apply the function on. For example, we can partition the full result set into various groups/windows.</p>
<p>Before we go into the syntax of Window functions, let's have a look at the categories of window functions.</p>
<h2 id="heading-different-types-of-window-functions">Different Types of Window Functions</h2>
<p>There are a lot of window functions that exist in SQL but they are primarily categorized into 3 different types:</p>
<ul>
<li><p>Aggregate window functions</p>
</li>
<li><p>Value window functions</p>
</li>
<li><p>Ranking window functions</p>
</li>
</ul>
<p>Aggregate window functions are used to perform calculations across the set of rows in a window. They include <code>SUM()</code>, <code>MAX()</code>, <code>COUNT()</code>, and others.</p>
<p>Ranking window functions are used to rank rows in a window. They include <code>RANK()</code>, <code>DENSE_RANK()</code>, <code>ROW_NUMBER()</code>, and others.</p>
<p>Value window functions return a value from another row in the window (such as the previous, next, or first row) rather than an aggregate of all its rows. They include <code>LAG()</code>, <code>LEAD()</code>, <code>FIRST_VALUE()</code>, and others. We will see their usefulness later in this article.</p>
<h2 id="heading-sample-table">Sample Table</h2>
<p>In this tutorial you will be working with a table called <code>student_score</code> which contains data such as <code>student_id</code>, <code>student_name</code>, <code>dep_name</code> and <code>score</code>.</p>
<p>You can create the table using the following code:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">DROP</span> <span class="hljs-keyword">TABLE</span> <span class="hljs-keyword">IF</span> <span class="hljs-keyword">EXISTS</span> student_score;

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> student_score (
  student_id <span class="hljs-built_in">SERIAL</span> PRIMARY <span class="hljs-keyword">KEY</span>,
  student_name <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">30</span>),
  dep_name <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">40</span>),
  score <span class="hljs-built_in">INT</span>
);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">11</span>, <span class="hljs-string">'Ibrahim'</span>, <span class="hljs-string">'Computer Science'</span>, <span class="hljs-number">80</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">7</span>, <span class="hljs-string">'Taiwo'</span>, <span class="hljs-string">'Microbiology'</span>, <span class="hljs-number">76</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">9</span>, <span class="hljs-string">'Nurain'</span>, <span class="hljs-string">'Biochemistry'</span>, <span class="hljs-number">80</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">8</span>, <span class="hljs-string">'Joel'</span>, <span class="hljs-string">'Computer Science'</span>, <span class="hljs-number">90</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">10</span>, <span class="hljs-string">'Mustapha'</span>, <span class="hljs-string">'Industrial Chemistry'</span>, <span class="hljs-number">78</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">5</span>, <span class="hljs-string">'Muritadoh'</span>, <span class="hljs-string">'Biochemistry'</span>, <span class="hljs-number">85</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">2</span>, <span class="hljs-string">'Yusuf'</span>, <span class="hljs-string">'Biochemistry'</span>, <span class="hljs-number">70</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">3</span>, <span class="hljs-string">'Habeebah'</span>, <span class="hljs-string">'Microbiology'</span>, <span class="hljs-number">80</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">1</span>, <span class="hljs-string">'Tomiwa'</span>, <span class="hljs-string">'Microbiology'</span>, <span class="hljs-number">65</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">4</span>, <span class="hljs-string">'Gbadebo'</span>, <span class="hljs-string">'Computer Science'</span>, <span class="hljs-number">80</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> student_score <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">12</span>, <span class="hljs-string">'Tolu'</span>, <span class="hljs-string">'Computer Science'</span>, <span class="hljs-number">67</span>);
</code></pre>
<h3 id="heading-syntax-for-window-functions">Syntax for Window Functions</h3>
<p>In a simple expression, a window function looks like this:</p>
<pre><code class="lang-sql">function(expression|column) OVER(
    [ PARTITION BY expr_list optional]
    [ ORDER BY order_list optional]
)
</code></pre>
<p>Let's go over the syntax piece by piece:</p>
<p><code>function(expression|column)</code> is the window function such as <code>SUM()</code> or <code>RANK()</code>.</p>
<p><code>OVER()</code> specifies that the function before it is a window function, not an ordinary one. So when the SQL engine sees the <code>OVER()</code> clause, it knows that the function before it is a window function.</p>
<p>The <code>OVER()</code> clause has some parameters which are optional depending on what you want to achieve. The first one is <code>PARTITION BY</code>.</p>
<p>The <code>PARTITION BY</code> clause divides the result set into different partitions/windows. For example, if you partition by one or more columns, the result set is divided into one window for each distinct value (or combination of values) of those columns.</p>
<p>The <code>expr_list</code> in the <code>PARTITION BY</code> clause is:</p>
<pre><code class="lang-javascript">expression | column_name [, expr_list ]
</code></pre>
<p>This means that <code>PARTITION BY</code> can take an expression, a column, or several expressions or columns separated by commas. For example, <code>PARTITION BY column1, column2</code>.</p>
<p>The next parameter <code>ORDER BY</code> is used to sort the observations in a window. The <code>ORDER BY</code> clause takes <code>order_list</code> which is:</p>
<pre><code class="lang-sql">expression | column_name [ ASC | DESC ]
[ NULLS FIRST | NULLS LAST ][, order_list ]
</code></pre>
<p>Here, <code>order_list</code> can be an expression or a column name. You can also specify the sort order (either ascending or descending) and whether null values are sorted first or last. Like <code>PARTITION BY</code>, the <code>ORDER BY</code> clause can take several expressions or column names.</p>
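<p>To make the <code>order_list</code> grammar more concrete, here is a minimal sketch that sorts the sample <code>student_score</code> table by two keys. It assumes PostgreSQL, since some engines, such as MySQL and SQL Server, do not accept <code>NULLS FIRST | NULLS LAST</code>. The <code>score_position</code> alias is just an illustrative name.</p>
<pre><code class="lang-sql">SELECT
    *,
    -- Highest score first, nulls at the end, ties broken alphabetically by name
    ROW_NUMBER() OVER(
        ORDER BY score DESC NULLS LAST, student_name ASC
    ) AS score_position
FROM student_score;
</code></pre>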
<p>As stated earlier, the <code>OVER()</code> clause is used to specify the window in a result set. One thing to note is that if you don't specify any parameters in the <code>OVER()</code> clause, the whole result set is treated as a single window.</p>
<p>You use <code>PARTITION BY</code> to determine the number of windows, and <code>ORDER BY</code> to sort the rows within each window. Let's go over an example.</p>
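<p>One quick way to see how many rows each window contains is to count them. The sketch below uses the sample <code>student_score</code> table, and the two aliases are illustrative names only.</p>
<pre><code class="lang-sql">SELECT
    *,
    -- Empty OVER(): a single window that covers the whole result set
    COUNT(*) OVER() AS rows_in_whole_result_set,
    -- One window per department
    COUNT(*) OVER(PARTITION BY dep_name) AS rows_in_department
FROM student_score;
</code></pre>
<p>With the sample data, the first count should be <code>11</code> on every row, while the second should vary by department (for example, <code>4</code> for Computer Science and <code>1</code> for Industrial Chemistry).</p>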
<h2 id="heading-how-to-use-a-window-function-example">How to Use a Window Function – Example</h2>
<p>Let's go over an example of how to use a window function. Say for instance you want to compare the minimum score and maximum score from all the records in the table we created earlier. You can do that using a window function as shown below.</p>
<p>Remember that when you don't specify a <code>PARTITION BY</code> clause in the <code>OVER()</code> clause, the window spans the entire result set.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
    *,
    <span class="hljs-keyword">MAX</span>(score) <span class="hljs-keyword">OVER</span>() <span class="hljs-keyword">AS</span> maximum_score,
    <span class="hljs-keyword">MIN</span>(score) <span class="hljs-keyword">OVER</span>() <span class="hljs-keyword">AS</span> minimum_score

<span class="hljs-keyword">FROM</span> student_score;
</code></pre>
<p>As you can see, we have the minimum and maximum score across the entire dataset.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/image-43.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Table showing result of window function</em></p>
<p>Also, note that the above query can be also achieved using subqueries like this:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> *,
    (<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">MAX</span>(score) <span class="hljs-keyword">FROM</span> student_score) <span class="hljs-keyword">AS</span> maximum_score,
    (<span class="hljs-keyword">SELECT</span> <span class="hljs-keyword">MIN</span>(score) <span class="hljs-keyword">FROM</span> student_score) <span class="hljs-keyword">AS</span> minimum_score
<span class="hljs-keyword">FROM</span> student_score;
</code></pre>
<p>As you can see, the window function version is shorter and easier to read than the subquery version, which needs a separate query for each aggregate.</p>
<h2 id="heading-how-to-use-a-window-function-with-partition-by">How to Use a Window Function with <code>PARTITION BY</code></h2>
<p>Say, for instance, that you want to split the dataset into different partitions. Then you want to compare each record in each partition with an aggregate value or a calculated value of that partition. You can do this by specifying the <code>PARTITION BY</code> clause in the <code>OVER()</code> clause.</p>
<p>For example, say you want to compare the maximum score and average score in each department with each individual score. You can do this by specifying the <code>PARTITION BY</code> clause in the <code>OVER()</code> clause and combining it with the aggregate functions you need.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> 
    *,
    <span class="hljs-keyword">MAX</span>(score)<span class="hljs-keyword">OVER</span>(<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> dep_name) <span class="hljs-keyword">AS</span> dep_maximum_score,
    <span class="hljs-keyword">ROUND</span>(<span class="hljs-keyword">AVG</span>(score)<span class="hljs-keyword">OVER</span>(<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> dep_name), <span class="hljs-number">2</span>) <span class="hljs-keyword">AS</span> dep_average_score
<span class="hljs-keyword">FROM</span> student_score;
</code></pre>
<p>You can see that the <code>PARTITION BY</code> clause specified in the <code>OVER()</code> clause split the result set into 4 different partitions. This is because there are 4 different departments in the <code>dep_name</code> column (which are <code>Biochemistry, Computer Science, Industrial Chemistry, and Microbiology</code>).</p>
<p>Now after the <code>PARTITION BY</code> clause, you can then calculate the aggregate function for each record in the different departments.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/image-26.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can see from the above image that the aggregate functions <code>MAX()</code> and <code>AVG()</code> are calculated for each partition.</p>
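<p>Because the partition-level aggregate is available on every row, you can also use it directly in an expression. The sketch below, which assumes PostgreSQL and uses the illustrative alias <code>diff_from_dep_average</code>, shows how far each student's score is from their department's average.</p>
<pre><code class="lang-sql">SELECT
    student_name,
    dep_name,
    score,
    -- Positive values are above the department average, negative values below it
    ROUND(score - AVG(score) OVER(PARTITION BY dep_name), 2) AS diff_from_dep_average
FROM student_score;
</code></pre>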
<h2 id="heading-other-examples-of-window-functions">Other Examples of Window Functions</h2>
<p>Let's go over some of the common window functions you will work with in SQL.</p>
<h3 id="heading-how-to-use-the-rownumber-function">How to Use the <code>ROW_NUMBER</code> Function</h3>
<p>You use <code>ROW_NUMBER()</code> to assign sequential numbers to the records in a window. For example, say you want to number every record in the dataset based on the student names in alphabetical order. You can do that using the following code:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    *,
    ROW_NUMBER() <span class="hljs-keyword">OVER</span>(<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> student_name) <span class="hljs-keyword">AS</span> name_serial_number
<span class="hljs-keyword">FROM</span> student_score;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/image-29.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>As you can see from the above image, the <code>student_name</code> with the smallest value (that is, the one that falls earliest in the alphabet) is <code>Gbadebo</code>, since it starts with <code>G</code>. It gets row number 1, followed by the name that begins with <code>H</code>, and so on.</p>
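<p>If you want the numbering to restart for each group instead, you could add a <code>PARTITION BY</code> clause. Here is a minimal sketch using the sample table. The <code>dep_serial_number</code> alias is an illustrative name.</p>
<pre><code class="lang-sql">SELECT
    *,
    -- Numbering starts again at 1 for each department
    ROW_NUMBER() OVER(PARTITION BY dep_name ORDER BY student_name) AS dep_serial_number
FROM student_score;
</code></pre>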
<h3 id="heading-how-to-use-the-rank-function">How to Use the <code>RANK</code> Function</h3>
<p><code>RANK()</code>, as the name implies, lets you rank observations in a window. When rows tie, it leaves gaps in the ranking. Let's see what this means:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    *,
    <span class="hljs-keyword">RANK</span>()<span class="hljs-keyword">OVER</span>(<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> dep_name <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> score <span class="hljs-keyword">DESC</span>)    
<span class="hljs-keyword">FROM</span> student_score;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/Untitled-design--11-.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>As you can see in the above code, the result set was partitioned into different windows based on the department column. Then we used the <code>ORDER BY</code> clause to sort the student records based on their score in descending order in each partition. After that, we applied the <code>RANK</code> function.</p>
<p>Now concerning the gaps: as you can see in the highlighted part of the above image, two records in the Computer Science department have the same score (<code>80</code>), so both are ranked <code>2</code>. Because two rows share rank 2, the next record is ranked <code>4</code> rather than <code>3</code>. That skipped rank is the gap: <code>RANK</code> gives tied rows the same rank and then skips the ranks those extra rows would have taken.</p>
<p>You can avoid this scenario using another window function called <code>DENSE_RANK</code> that ranks observations in a window without these gaps.</p>
<h3 id="heading-how-to-use-the-denserank-function">How to Use the <code>DENSE_RANK</code> Function</h3>
<p><code>DENSE_RANK</code> is similar to <code>RANK</code> except that it ranks observations in a window without gaps.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    *,
    <span class="hljs-keyword">DENSE_RANK</span>()<span class="hljs-keyword">OVER</span>(<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> dep_name <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> score <span class="hljs-keyword">DESC</span>)    
<span class="hljs-keyword">FROM</span> student_score;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/Untitled-design--10-.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>As you can see in the output above, when using <code>DENSE_RANK</code>, the next rank number (which is <code>3</code>) was assigned to <code>Tolu</code> (unlike when using <code>RANK</code> which assigned Tolu a rank of <code>4</code>, skipping 3 because of the tie).</p>
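<p>To compare the three ranking functions side by side, you could run them in a single query. The sketch below limits the rows to the Computer Science department so the tie is easy to spot. The column aliases are illustrative names.</p>
<pre><code class="lang-sql">SELECT
    student_name,
    score,
    ROW_NUMBER() OVER(ORDER BY score DESC) AS row_num,          -- always unique
    RANK() OVER(ORDER BY score DESC) AS score_rank,             -- ties share a rank, then a gap
    DENSE_RANK() OVER(ORDER BY score DESC) AS score_dense_rank  -- ties share a rank, no gap
FROM student_score
WHERE dep_name = 'Computer Science';
</code></pre>
<p>With the sample data, the two students who scored <code>80</code> should both get rank <code>2</code> from <code>RANK</code> and <code>DENSE_RANK</code>, while <code>ROW_NUMBER</code> gives them <code>2</code> and <code>3</code> in an order that is not guaranteed.</p>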
<h3 id="heading-how-to-use-the-lag-function">How to Use the <code>LAG</code> Function</h3>
<p><code>LAG</code> returns a value from a row that comes a given number of rows (the offset) before the current row within a window. By default the offset is 1, so it returns the value from the previous row.</p>
<p>You typically use <code>LAG</code> when you want to compare the value of a previous row with the current row. It's commonly applied in <a target="_blank" href="https://www.tableau.com/learn/articles/time-series-analysis#:~:text=Time%20series%20analysis%20is%20a,data%20points%20intermittently%20or%20randomly.">time-series analysis</a>. For example:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    *,
    LAG(score) <span class="hljs-keyword">OVER</span>(<span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">BY</span> dep_name <span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> score)    
<span class="hljs-keyword">FROM</span> student_score;
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/image-32.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>As shown in the first partition, the first record in the Biochemistry partition (Yusuf's) has no record before it, which is why null is returned. The next record, Nurain's, does have a previous record, so <code>LAG</code> returns that record's score, which is <code>70</code>.</p>
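<p><code>LAG</code> also accepts an optional offset and a default value to return instead of null, and <code>LEAD</code> works the same way but looks at the rows after the current row. A minimal sketch, with illustrative aliases and an arbitrarily chosen default of <code>0</code>:</p>
<pre><code class="lang-sql">SELECT
    *,
    -- Offset of 1 row back; return 0 instead of null when there is no previous row
    LAG(score, 1, 0) OVER(PARTITION BY dep_name ORDER BY score) AS previous_score,
    -- Value from the next row; null for the last row in each department
    LEAD(score) OVER(PARTITION BY dep_name ORDER BY score) AS next_score
FROM student_score;
</code></pre>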
<h2 id="heading-how-to-use-the-frame-clause-in-order-by">How to Use the Frame Clause in <code>ORDER BY</code></h2>
<p>Now you've learned some common window functions you might work with on a daily basis. So let's move on to learning another key concept related to the <code>ORDER BY</code> clause called the frame clause.</p>
<p>A frame clause, as the name implies, provides the frame (that is, the subset of rows in a window) on which the function is to be applied. You use it to specify which rows before or after the current row are included in the calculation along with the current row. Keep in mind that the SQL engine processes the rows one after the other, so the frame moves as the current row changes.</p>
<p>Now before we look into how to specify a frame clause, let's look at some of the frame clause's assumptions:</p>
<ol>
<li><p>First, a frame clause does not apply to ranking functions. The ranking function only ranks the observation in the window based on the <code>ORDER BY</code> clause.</p>
</li>
<li><p>When you use an aggregate window function, the <code>ORDER BY</code> clause is optional. If you leave it out, you don't need a frame clause, because the function is applied to the whole window. But if you do order the observations in the window, it's best practice to specify the frame clause explicitly. Without one, most SQL engines fall back to a default frame that runs from the start of the window to the current row and also includes every row that ties with the current row on the <code>ORDER BY</code> value, which may not be what you expect.</p>
</li>
</ol>
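<p>To illustrate the second point, the sketch below compares an ordered aggregate without a frame clause against the same aggregate with an explicit <code>ROWS</code> frame. It assumes an engine such as PostgreSQL, where the default frame with <code>ORDER BY</code> is <code>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</code>. The aliases are illustrative names.</p>
<pre><code class="lang-sql">SELECT
    student_name,
    score,
    -- No frame clause: rows that tie on score are all included in the sum
    SUM(score) OVER(ORDER BY score) AS default_frame_sum,
    -- Explicit ROWS frame: the sum stops exactly at the current row
    SUM(score) OVER(ORDER BY score ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame_sum
FROM student_score;
</code></pre>
<p>In the sample data four students scored <code>80</code>, so <code>default_frame_sum</code> should show the same value (<code>676</code>) on all four of those rows, while <code>rows_frame_sum</code> keeps increasing row by row.</p>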
<p>You can specify a frame clause using two things – <code>ROWS</code> and <code>RANGE</code>. But in this part you will learn how to use the <code>ROWS</code> keyword since it is commonly used to specify a frame clause. The <code>RANGE</code> keyword is beyond the scope of this article.</p>
<p>The <code>ROWS</code> clause defines the frame in terms of a physical offset of rows from the current row. That is, it specifies which rows are used together with the current row in the calculation.</p>
<p>For example the following frame clause <code>ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING</code> defines a frame that includes the current row, 1 row preceding it and 1 row following it.</p>
<p>Let's look at the keywords that you can use in conjunction with the <code>ROWS</code> clause:</p>
<ol>
<li><p><code>N PRECEDING</code> specifies the N rows before the current row. For example, <code>3 PRECEDING</code> means the 3 rows preceding the current row.</p>
</li>
<li><p><code>N FOLLOWING</code> works like <code>N PRECEDING</code> except that it works in the opposite direction. It specifies the N rows after the current row.</p>
</li>
<li><p><code>UNBOUNDED PRECEDING</code> means all rows before the current row.</p>
</li>
<li><p><code>UNBOUNDED FOLLOWING</code> means all rows after the current row.</p>
</li>
<li><p><code>CURRENT ROW</code> is used to specify the current row.</p>
</li>
</ol>
<p>For example, let's look at the below frame clause:</p>
<p><code>ROWS BETWEEN 2 PRECEDING AND CURRENT ROW</code> uses up to 2 rows before the current row, along with the current row itself, for the calculation. It uses fewer than 2 preceding rows only when the current row is the first or second row in the window.</p>
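<p>A frame like this is one way to compute a moving average. The following is a minimal sketch over the sample table, assuming PostgreSQL. The <code>moving_average</code> alias is an illustrative name.</p>
<pre><code class="lang-sql">SELECT
    *,
    -- Average of the current row and up to 2 rows before it, ordered by student_id
    ROUND(AVG(score) OVER(ORDER BY student_id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2) AS moving_average
FROM student_score;
</code></pre>
<p>For the first row the frame contains only that row, for the second row it contains two rows, and from the third row onward it always contains three.</p>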
<h3 id="heading-frame-clause-example">Frame clause example</h3>
<p>Let's look at an example. Say for instance you want to get the cumulative sum of all the student scores. You can do that by using a frame clause.</p>
<p>To do this, you first need to decide which keywords to use in the frame clause.</p>
<p>Since you want to sum up all rows before the current row and the current row itself, you can use the <code>UNBOUNDED PRECEDING</code> keyword as the start of the frame and <code>CURRENT ROW</code> as its end. Together they cover all rows before the current row as well as the current row itself.</p>
<p>So the code to achieve that task is shown below:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span>
    *,
    <span class="hljs-keyword">SUM</span>(score)<span class="hljs-keyword">OVER</span>(<span class="hljs-keyword">ORDER</span> <span class="hljs-keyword">BY</span> student_id <span class="hljs-keyword">ROWS</span> <span class="hljs-keyword">BETWEEN</span> <span class="hljs-keyword">UNBOUNDED</span> <span class="hljs-keyword">PRECEDING</span> <span class="hljs-keyword">AND</span> <span class="hljs-keyword">CURRENT</span> <span class="hljs-keyword">ROW</span>) <span class="hljs-keyword">AS</span> cummulative_sum
<span class="hljs-keyword">FROM</span> student_score
</code></pre>
<p>Let's break down the window function code:</p>
<pre><code class="lang-sql">SUM(score)OVER(ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cummulative_sum
</code></pre>
<p>First, in the <code>OVER()</code> clause, we sort the entire window (which is the whole dataset) by the student id.</p>
<p>Then we specify the frame clause, which is <code>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</code>. This means that all rows before the current row, plus the current row itself, are used in the calculation.</p>
<p>The result is shown in the below image:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2023/02/image-6.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The first row in the dataset does not have any row before it. But since we also specified <code>CURRENT ROW</code> as the end of the frame, the SQL engine sums just that row, which gives <code>65</code>.</p>
<p>Then it moves to the second row, which has 1 row before it. So the SQL engine adds the score of the first row (<code>65</code>) to the score of the current row (<code>70</code>). That is why the result is <code>135</code>, and so on down the table.</p>
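<p>The same frame clause can be combined with <code>PARTITION BY</code> if you want the running total to restart for each group. A minimal sketch, with <code>dep_cumulative_sum</code> as an illustrative alias:</p>
<pre><code class="lang-sql">SELECT
    *,
    -- Running total of scores that restarts for every department
    SUM(score) OVER(
        PARTITION BY dep_name
        ORDER BY student_id
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS dep_cumulative_sum
FROM student_score;
</code></pre>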
<h3 id="heading-when-to-use-a-window-function">When to Use a Window Function</h3>
<p>You've learned what window functions are in this tutorial. Some practical cases where you can use them are:</p>
<ol>
<li><p>When you want to compare an aggregate value in a window with individual records in that window.</p>
</li>
<li><p>When you want to do things like ranking, percentile, cumulative sum or running total, moving average, and so on.</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, you've learned what window functions are, and you've also looked at some of the clauses you can use with them. One example is the PARTITION BY clause, which divides the result set into separate partitions or windows.</p>
<p>You also learned how to utilize the ORDER BY clause to order observations in a window and you saw various common examples of window functions.</p>
<p>Finally, you learned about a more advanced clause that you can use with window functions, the frame clause, which lets you control exactly which rows in a window a function is applied to.</p>
<p>Thank you for reading all the way to the end. You can use the tutorial listed below to learn about more SQL window functions.</p>
<div class="embed-wrapper"><a class="embed-card" href="https://www.postgresql.org/docs/current/functions-window.html">https://www.postgresql.org/docs/current/functions-window.html</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ What is Stratified Random Sampling? Definition and Python Example ]]>
                </title>
                <description>
                    <![CDATA[ When we wish to conduct an experiment on a population – for example, the entire population of a country – it is not always practical or realistic to include every subject (citizen) in the experiment. Instead, we rely on a sample, which is a subset of... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-stratified-random-sampling-definition-and-python-example/</link>
                <guid isPermaLink="false">66d45f46f855545810e93474</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ statistics ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Tue, 15 Nov 2022 16:33:52 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/11/pexels-viktorya-sergeeva--------10275085.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When we wish to conduct an experiment on a population – for example, the entire population of a country – it is not always practical or realistic to include every subject (citizen) in the experiment.</p>
<p>Instead, we rely on a sample, which is a subset of the population, and then draw conclusions about the population based on the sample's results.</p>
<p>Now, drawing a sample from a population is known as sampling, and the manner in which the sample is drawn is essential to the result.</p>
<p>There are a lot of sampling techniques out there, but in this tutorial we will look at one of them called stratified random sampling and how it works. Without further ado, let's get started.</p>
<h2 id="heading-what-is-stratified-random-sampling">What is Stratified Random Sampling?</h2>
<p>Before we go into the details of stratified random sampling, let's break the term down into bits so we can grasp it better. Let's start with stratified.</p>
<p>In the context of sampling, <strong>stratified</strong> means splitting the population into smaller groups or strata based on a characteristic. To put it another way, you divide a population into groups based on their features.</p>
<p><strong>Random</strong> <strong>sampling</strong> entails randomly selecting subjects (entities) from a population. Each subject has an equal probability of being chosen from the population to form a sample (subpopulation) of the overall population.</p>
<p>Therefore, <strong>stratified random sampling</strong> is a sampling approach in which the population is separated into groups or strata based on a particular characteristic. Then subjects from each stratum (the singular of strata) are randomly sampled.</p>
<h2 id="heading-types-of-stratified-random-sampling">Types of Stratified Random Sampling</h2>
<p>Stratified sampling is divided into two categories, which are:</p>
<ul>
<li><p>Proportionate stratified random sampling.</p>
</li>
<li><p>Disproportionate stratified random sampling.</p>
</li>
</ul>
<p><strong>Proportionate stratified random sampling</strong> is a type of sampling in which the size of the random sample obtained from each stratum is proportionate to the size of that stratum in the population.</p>
<p>In other words, each stratum makes up the same share of the sample as it does of the population. Consider the following example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

students = {

    <span class="hljs-string">"Name"</span>: [<span class="hljs-string">"Ibrahim"</span>, <span class="hljs-string">"Ganiyat"</span>, <span class="hljs-string">"Joel"</span>, <span class="hljs-string">"Elijah"</span>, <span class="hljs-string">"Yusuf"</span>, <span class="hljs-string">"Nurain"</span>, 
            <span class="hljs-string">"Dayo"</span>, <span class="hljs-string">"David"</span>, <span class="hljs-string">"Olu"</span>, <span class="hljs-string">"Tobi"</span>],

    <span class="hljs-string">"ID"</span>:  [<span class="hljs-string">'001'</span>, <span class="hljs-string">'002'</span>, <span class="hljs-string">'003'</span>, <span class="hljs-string">'004'</span>, <span class="hljs-string">'005'</span>, <span class="hljs-string">'006'</span>,<span class="hljs-string">'007'</span>, <span class="hljs-string">'008'</span>, <span class="hljs-string">'009'</span>, <span class="hljs-string">'010'</span>],

    <span class="hljs-string">"Grade"</span>: [<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'A'</span>, <span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'A'</span>],

    <span class="hljs-string">"Category"</span>: [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">2</span>, <span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">3</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">3</span>]
}
df = pd.DataFrame(students)
df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/11/image-35.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The above dataframe contains students' names, IDs, grades, and categories. Assume we wish to stratify students based on their grade characteristics and sample 60% of students from each group. That means we will have three strata in the above dataframe, because we have three different grades.</p>
<p>We can sample it by typing the following:</p>
<pre><code class="lang-python">df_sample = df.groupby(<span class="hljs-string">"Grade"</span>, group_keys=<span class="hljs-literal">False</span>).apply(<span class="hljs-keyword">lambda</span> x:x.sample(frac=<span class="hljs-number">0.6</span>))
</code></pre>
<p>What we did above is group the dataframe into strata by passing the <code>Grade</code> feature to the <code>groupby()</code> method. Then, for each group (stratum), we randomly sampled <code>0.6</code> (60%) of its observations.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/11/image-36.png" alt="Image" width="600" height="400" loading="lazy"></p>
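<p>Now if we look at the grade proportions for <code>df_sample</code> and <code>df</code>, we will see that they are the same, or very nearly so. With a dataset this small, each stratum's sample size is rounded to a whole number, so tiny differences are possible.</p>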
<p><img src="https://www.freecodecamp.org/news/content/images/2022/11/image-37.png" alt="Image" width="600" height="400" loading="lazy"></p>
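<p>The comparison in the screenshot can be reproduced with a short sketch like the one below. It rebuilds only the <code>Grade</code> column of the toy data, and it uses <code>groupby(...).sample()</code> with a fixed <code>random_state</code>. Both of those are choices made for this illustration, not something taken from the original code.</p>

```python
import pandas as pd

# Same grades as the toy dataframe above, reduced to the stratification column.
df = pd.DataFrame({"Grade": ["A", "B", "C", "A", "B", "C", "A", "A", "B", "A"]})

# Sample 60% of each stratum; random_state is added here only for repeatability.
df_sample = df.groupby("Grade").sample(frac=0.6, random_state=0)

# Share of each grade in the population and in the sample, side by side.
proportions = pd.DataFrame({
    "population": df["Grade"].value_counts(normalize=True),
    "sample": df_sample["Grade"].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

<p>Because the strata hold 5, 3, and 2 students, 60% of each rounds to 3, 2, and 1 students, so the sample shares are close to the population shares but not exactly identical.</p>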
<p><strong>Disproportionate stratified random sampling</strong>, on the other hand, involves randomly sampling from each stratum without regard for its proportion in the population. In other words, sampling is done based on a specified number of subjects per stratum. Let's look at an example.</p>
<pre><code class="lang-python">df.groupby(<span class="hljs-string">'Grade'</span>, group_keys=<span class="hljs-literal">False</span>).apply(<span class="hljs-keyword">lambda</span> x: x.sample(n=<span class="hljs-number">2</span>))
</code></pre>
<p>In this code, you can see that we only specified the actual number of samples we want to achieve.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/11/image-38.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Most of the time, you'll use proportionate stratified sampling. Disproportionate sampling requires more expert knowledge, because you have to decide how many subjects to draw from each stratum.</p>
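<p>As a rough sketch of what "deciding per stratum" can look like, the snippet below draws a different number of students from each grade. The <code>sizes</code> dictionary and the fixed <code>random_state</code> are hypothetical choices for illustration only.</p>

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"],
    "Grade": ["A", "B", "C", "A", "B", "C", "A", "A", "B", "A"],
})

# Hypothetical per-stratum sample sizes chosen by the analyst.
sizes = {"A": 2, "B": 2, "C": 1}

# Draw the requested number of rows from each grade, then stack the pieces.
parts = [
    group.sample(n=sizes[grade], random_state=0)
    for grade, group in df.groupby("Grade")
]
df_disproportionate = pd.concat(parts)
print(df_disproportionate["Grade"].value_counts().sort_index())
```

<p>Here the sample sizes are set by hand rather than derived from the size of each stratum, which is what makes the design disproportionate.</p>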
<h2 id="heading-applications-of-stratified-random-sampling">Applications of Stratified Random Sampling</h2>
<h3 id="heading-1-sampling-based-on-shared-characteristic">1. Sampling Based on Shared Characteristic:</h3>
<p>When one or more subjects in an experiment share characteristics, it suggests they are members of the same group (one subject can only be in a particular group).</p>
<p>For example, suppose 50 students take a test, and the grade range for the examination is merely A-E. So we can have students who are in the same grade group, for example, students who received an A (and it is impossible for a student to have two grades). As a result, they share the same characteristic or feature, which is grade.</p>
<p>So when you want to sample subjects based on shared characteristics, you should use stratified random sampling. This ensures that a member of a specific group will be included.</p>
<p>This is where stratified random sampling differs from simple random sampling, which is also a sampling technique. Simple random sampling draws from the whole population without considering any characteristic (that is, each subject of the population has an equal chance of being picked).</p>
<p>As a result, simple random sampling cannot guarantee that a certain member of a particular group will be included in the sample.</p>
<p>Let's have a look at an example to see what we're talking about. Let's say we want to sample out 60% of students using both stratified and simple random sampling.</p>
<p>We can see the result for stratified random sampling below:</p>
<pre><code class="lang-python">df.groupby(<span class="hljs-string">'Grade'</span>, group_keys=<span class="hljs-literal">False</span>).apply(<span class="hljs-keyword">lambda</span> x: x.sample(frac=<span class="hljs-number">0.6</span>))
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/11/image-39.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>And this is the result of simple random sampling:</p>
<pre><code class="lang-python">df.sample(frac= <span class="hljs-number">0.6</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/11/image-40.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can see that students with C grades are not included in the sample. This is because in simple random sampling every observation has an equal chance of being chosen, regardless of its group. As a result, there is a chance that a small group is left out of the sample entirely.</p>
<p>In stratified random sampling, on the other hand, we consider all the groups we want to sample and then randomly sample from each group.</p>
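<p>To get a feel for how often simple random sampling can miss a group, here is a small sketch based on the ten grades in the toy dataframe. It computes the exact probability that a 60% simple random sample contains no C student and then checks it with a quick simulation. The seed and the number of trials are arbitrary choices.</p>

```python
from math import comb
import random

grades = ["A", "B", "C", "A", "B", "C", "A", "A", "B", "A"]
sample_size = 6  # 60% of the 10 students

# Exact probability that the sample has no "C" student: all 6 picks must come
# from the 8 non-C students, out of every way to choose 6 students from 10.
n_c = grades.count("C")
p_miss_c = comb(len(grades) - n_c, sample_size) / comb(len(grades), sample_size)
print(round(p_miss_c, 3))  # 0.133

# Simulation as a sanity check.
rng = random.Random(0)
trials = 20_000
misses = sum("C" not in rng.sample(grades, sample_size) for _ in range(trials))
print(misses / trials)
```

<p>So with this toy data, roughly one simple random sample in eight would contain no C students at all, while the stratified version always includes at least one.</p>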
<h3 id="heading-2-imbalanced-dataset">2. Imbalanced Dataset:</h3>
<p>An imbalanced dataset is a classification dataset in which the class labels in the target variable are not equally represented. In other words, one class has a much higher count than the other, resulting in an imbalance.</p>
<p>In machine learning, stratified sampling is also used to keep the same class proportions in the train and test sets when the dataset is imbalanced.</p>
<p>For example, a chronic kidney disease dataset has an imbalanced label, as shown below. You can click <a target="_blank" href="https://www.kaggle.com/datasets/mansoordaku/ckdisease/download?datasetVersionNumber=1">here</a> to download the dataset.</p>
<pre><code class="lang-python">df = pd.read_csv(<span class="hljs-string">"kidney_disease.csv"</span>)
df.head()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/11/image-41.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>If we check the proportions of the label feature, which is <code>classification</code>, we can see that it is imbalanced.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/11/image-42.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Now let's say we want to split the data into train and test sets using simple random sampling. The train and test sets are then not guaranteed to have the same label proportions as the population.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
X = df.drop(columns = [<span class="hljs-string">"classification"</span>])
y = df[<span class="hljs-string">"classification"</span>]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=<span class="hljs-number">0.8</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/11/image-43.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can see that the label proportion for both <code>y_train</code> and <code>y_test</code> is not the same as the population proportion. To achieve the same proportion we can make use of the <code>stratify</code> parameter in <code>train_test_split</code> as shown below:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
X = df.drop(columns = [<span class="hljs-string">"classification"</span>])
y = df[<span class="hljs-string">"classification"</span>]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=<span class="hljs-number">0.8</span>, stratify=y)
</code></pre>
<p>In the above code, the split is stratified on the label. With that, the train and test sets keep the same label proportions as the population.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/11/image-44.png" alt="Image" width="600" height="400" loading="lazy"></p>
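<p>The effect of <code>stratify</code> can be checked without the kidney dataset. The sketch below uses synthetic labels (an assumed 80/20 split between two made-up classes) and compares the label proportions in the population, the train set, and the test set.</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption: 80% "pos", 20% "neg"), not the kidney data.
y = pd.Series(["pos"] * 160 + ["neg"] * 40, name="label")
X = pd.DataFrame({"feature": range(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42
)

# Label proportions in the full data and in each split.
summary = pd.DataFrame({
    "population": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(summary)
```

<p>All three columns should show the same proportions, which is what the <code>stratify</code> parameter is for.</p>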
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial, we looked at stratified sampling and how you can use it in statistics and machine learning. We also looked at the types of stratified sampling.</p>
<p>Thank you for your time.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Perform Customer Segmentation in Python – Machine Learning Tutorial ]]>
                </title>
                <description>
                    <![CDATA[ Before I get into what this post is all about, I'd like to share the motivation that prompted me to write it. I'm writing this article because I recall the first time I learned about customer segmentation or clustering. I didn't fully grasp what I wa... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/customer-segmentation-python-machine-learning/</link>
                <guid isPermaLink="false">66d45f33f855545810e9345e</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Wed, 02 Nov 2022 18:56:39 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/11/-GetPaidStock.com--635e3fa0c561f.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Before I get into what this post is all about, I'd like to share the motivation that prompted me to write it.</p>
<p>I'm writing this article because I recall the first time I learned about customer segmentation or clustering. I didn't fully grasp what I was doing back then.</p>
<p>All I remembered was dumping all the features into <code>KMeans</code> and <strong>voilà</strong> – I'd developed a customer segmentation. I didn't understand the model's attributes for each segment.</p>
<p>For that reason, I'm sharing how I've come to understand customer segmentation, so that hopefully you can benefit from it.</p>
<p>In this tutorial, you will learn how to build an effective customer segmentation as well as how to perform effective Exploratory Data Analysis (EDA). These are the ingredients that will make your customer segmentation result delicious to eat 😋. Without further ado let's get started.</p>
<h2 id="heading-what-is-customer-segmentation">What is Customer Segmentation?</h2>
<p>We've been talking about customer segmentation since the beginning of the article – but you might not know what it means.</p>
<p>Note that it is important to understand this theoretical part before we move into the coding part of the tutorial. This foundation will help you build the segmentation model effectively.</p>
<p>Ok, back to defining what segmentation is:</p>
<p>Segmentation means grouping entities together based on similar properties. Entities could be customers, products, and so on.</p>
<p><strong>Customer segmentation</strong>, in particular, means grouping customers together based on similar features or properties.</p>
<p>There's one thing to note when grouping customers: the properties you choose must be relevant to the criteria you want to group them by.</p>
<p>For example, assume you want to categorize customers depending on what they buy. In this scenario, the customer's gender attribute may not be optimal or relevant for segmentation.</p>
<p>Knowing how to select appropriate attributes for customer segmentation is crucial.</p>
<p>Let's look at the different types of Customer Segmentation:</p>
<ul>
<li><p>Demographic Segmentation.</p>
</li>
<li><p>Behavioral Segmentation.</p>
</li>
<li><p>Geographic Segmentation.</p>
</li>
<li><p>Psychographic Segmentation.</p>
</li>
<li><p>Technographic Segmentation.</p>
</li>
<li><p>Needs-based Segmentation.</p>
</li>
<li><p>Value-based Segmentation.</p>
</li>
</ul>
<p>The most typical types of consumer segmentation you will work on when performing segmentation revolve around Demographic and Behavioral segmentation.</p>
<p><strong>Demographic Segmentation</strong> is the process of grouping customers based on their demography – that is, grouping customers based on their age, income, education, marital status, and so on.</p>
<p><strong>Behavioral Segmentation</strong> means grouping customers based on their behavior. For example, how frequently they purchase, the total amount they spend on goods, when they last bought a product, and so on.</p>
<p>To learn more about other types of Customer Segmentation, you can read <a target="_blank" href="https://blog.hubspot.com/service/customer-segmentation">this article</a>.</p>
<h2 id="heading-criteria-for-customer-segmentation">Criteria for Customer Segmentation</h2>
<p>When grouping customers, you should select relevant features that are tailored to what you want to segment them on. But in some circumstances, it makes sense to combine features from several types of customer segmentation to create a new one.</p>
<p>For example, you can combine features from demographic and behavioral segmentation to create a new segmentation. That is precisely what you will learn in this article – we will build a customer segmentation using demographic features and behavioral features.</p>
<p>Now enough talking – let's get down to business.</p>
<h2 id="heading-understanding-the-business-problem">Understanding the Business Problem</h2>
<p>The business problem is to segment customers based on their personalities (demographic) and the amount they spend on products (behavioral). This will help the company gain a better understanding of their customers' personalities and habits.</p>
<h3 id="heading-tools-well-use-for-this-project">Tools We'll Use for this Project</h3>
<p>Of course we're using Python to build our project – but these are the tools and libraries that we will also be using to help us out.</p>
<ol>
<li><p>Jupyter environment (Jupyter Lab or Jupyter notebook) – for experimenting with our project.</p>
</li>
<li><p>Pandas – for loading data as a dataframe and wrangling the data.</p>
</li>
<li><p>Numpy and Scipy – for performing some basic mathematical computations.</p>
</li>
<li><p>Scikit-Learn – for building our Customer Segmentation Model.</p>
</li>
<li><p>Seaborn, Matplotlib and Plotly Express – for data visualization.</p>
</li>
</ol>
<p>If you don't have some or any of these libraries, you can check out their official documentation online to see how to install them.</p>
<h3 id="heading-dataset-well-use-for-this-project">Dataset We'll Use for this Project</h3>
<p>The dataset we'll use in this project comes from Kaggle. You can go <a target="_blank" href="https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis/download?datasetVersionNumber=1">here</a> to download it.</p>
<p>Here's a little information about the dataset:</p>
<p>To put it simply, the dataset contains the demographics of customers and their behavior as it relates to the company. The features of the dataset are:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/10/Customer-Personality-Features.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-customer-personality-analysis-features">Customer Personality Analysis Features</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>People</td><td>Promotion</td><td>Product</td><td>Place</td></tr>
</thead>
<tbody>
<tr>
<td>Year Birth</td><td>NumberDealPurchase</td><td>MntWines</td><td>NumWebPurchases</td></tr>
<tr>
<td>Title</td><td>AcceptedCmp1</td><td>MntFruits</td><td>NumCatalogPurchases</td></tr>
<tr>
<td>Education</td><td>AcceptedCmp2</td><td>MntMeatProducts</td><td>NumStorePurchases</td></tr>
<tr>
<td>Marital_Status</td><td>AcceptedCmp3</td><td>MntFishProducts</td><td>NumWebVisitsMonth</td></tr>
<tr>
<td>Income</td><td>AcceptedCmp4</td><td>MntSweetProducts</td><td></td></tr>
<tr>
<td>Kidhome</td><td>AcceptedCmp5</td><td>MntGoldProds</td><td></td></tr>
<tr>
<td>Teenhome</td><td>Response</td><td></td><td></td></tr>
<tr>
<td>Dt_customer</td><td></td><td></td><td></td></tr>
<tr>
<td>Recency</td><td></td><td></td><td></td></tr>
<tr>
<td>Complain</td><td></td><td></td><td></td></tr>
</tbody>
</table>
</div><p>To get the most out of this tutorial, you can download the entire Jupyter notebook beforehand so you can follow along easily. You can go <a target="_blank" href="https://github.com/ibrahim-ogunbiyi/Customer-Segmentation">here</a> to fork the repo.</p>
<h2 id="heading-exploratory-data-analysis-eda">Exploratory Data Analysis (EDA)</h2>
<p>As you might know, EDA is the key to performing well as a data analyst or data scientist. It gives you first-hand information about the whole dataset, and it helps you understand all the relationships between the features in your dataset.</p>
<p>We will perform the three phases of EDA in this tutorial which are:</p>
<ol>
<li><p>Univariate Analysis.</p>
</li>
<li><p>Bivariate Analysis.</p>
</li>
<li><p>Multivariate Analysis</p>
</li>
</ol>
<p>Firstly we need to import all the necessary libraries we will use in this project. We also need to load the dataset into a dataframe so we can see all the features that are present in it.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> plotly.express <span class="hljs-keyword">as</span> px
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> scipy.stats <span class="hljs-keyword">import</span> iqr
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler
<span class="hljs-keyword">from</span> sklearn.cluster <span class="hljs-keyword">import</span> KMeans


df = pd.read_csv(<span class="hljs-string">"data/marketing_campaign.csv"</span>, sep=<span class="hljs-string">"\t"</span>)
df.head()
</code></pre>
<p>To begin, there are many features in the dataset – but because we want to focus on customer demographics and behavior, we will only perform EDA on features related to those categories.</p>
<p>Keep in mind that the EDA conducted in this article is simply a subset of the one in the Jupyter Notebook. I did it this way to keep the article from becoming too long. To find the entire EDA in the notebook, fork the repo by clicking this <a target="_blank" href="https://github.com/ibrahim-ogunbiyi/Customer-Segmentation">link</a>.</p>
<p>Age, income, marital status, education, total children, and amount spent on products are the attributes that belong to this category.</p>
<p>First, since the segmentation is based on the total amount customers have spent, we'll sum the amounts spent on each product category into a new <code>TotalAmountSpent</code> column:</p>
<pre><code class="lang-python">df[<span class="hljs-string">"TotalAmountSpent"</span>] = df[<span class="hljs-string">"MntFishProducts"</span>] + df[<span class="hljs-string">"MntFruits"</span>] + df[<span class="hljs-string">"MntGoldProds"</span>] + df[<span class="hljs-string">"MntSweetProducts"</span>] + df[<span class="hljs-string">"MntMeatProducts"</span>] + df[<span class="hljs-string">"MntWines"</span>]
</code></pre>
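<p>If you'd rather not list every column by hand, the same total can be computed by selecting the spending columns by prefix and summing across each row. The sketch below uses a tiny made-up frame with the dataset's six <code>Mnt*</code> column names. It assumes that every column starting with <code>Mnt</code> is a spending column.</p>

```python
import pandas as pd

# Tiny made-up frame with the same spending columns as the dataset.
df = pd.DataFrame({
    "MntWines": [100, 5], "MntFruits": [10, 0], "MntMeatProducts": [50, 3],
    "MntFishProducts": [20, 1], "MntSweetProducts": [5, 0], "MntGoldProds": [15, 2],
})

# Select every column whose name starts with "Mnt" and add them up row by row.
spend_cols = [col for col in df.columns if col.startswith("Mnt")]
df["TotalAmountSpent"] = df[spend_cols].sum(axis=1)
print(df["TotalAmountSpent"].tolist())  # [200, 11]
```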
<p>After that's done we can now begin our EDA. An effective EDA always has three stages, as I mentioned above. Again, they are as follows:</p>
<ol>
<li><p>Univariate Analysis</p>
</li>
<li><p>Bivariate Analysis.</p>
</li>
<li><p>Multivariate Analysis.</p>
</li>
</ol>
<h3 id="heading-univariate-analysis">Univariate analysis</h3>
<p>Univariate analysis entails evaluating a single feature in order to get insights about it. So, the initial step in performing EDA is to undertake univariate analysis, which includes evaluating descriptive or summary statistics about the feature.</p>
<p>For example you might check a feature distribution, proportion of a feature, and so on.</p>
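<p>In our case, we will check the distribution of customers' ages in the dataset. Note that <code>Age</code> is not one of the raw columns. It has to be derived from the year of birth before plotting. We can then plot the distribution by typing the following:</p>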
<pre><code class="lang-python">sns.histplot(data=df, x=<span class="hljs-string">"Age"</span>, bins = list(range(<span class="hljs-number">10</span>, <span class="hljs-number">150</span>, <span class="hljs-number">10</span>)))
plt.title(<span class="hljs-string">"Distribution of Customer's Age"</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/10/Age-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can see from the above histogram that most of the customers fall in the age range of <code>40-60</code>.</p>
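<p>The histogram needs an <code>Age</code> column. One way to derive it, sketched below on a few made-up birth years, is to subtract <code>Year_Birth</code> from a fixed reference year. The reference year used here is an assumption for illustration and may differ from what the notebook uses.</p>

```python
import pandas as pd

# Made-up birth years standing in for the dataset's Year_Birth column.
df = pd.DataFrame({"Year_Birth": [1957, 1984, 1996]})

# Assumption: age is measured relative to a fixed reference year.
REFERENCE_YEAR = 2022
df["Age"] = REFERENCE_YEAR - df["Year_Birth"]
print(df["Age"].tolist())  # [65, 38, 26]
```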
<h3 id="heading-bivariate-analysis">Bivariate Analysis</h3>
<p>After you've performed univariate analysis on all your features of interest, the next step is to perform bivariate analysis. This involves comparing two attributes at the same time.</p>
<p>A typical bivariate analysis is determining the correlation between two features.</p>
<p>In our case, the bivariate analyses we'll perform in the project include observing the average total spent across different customer age groups, determining the correlation between customer income and total amount spent, and so on.</p>
<p>For example, we want to check the relationship between a customer's <code>Income</code> and <code>TotalAmountSpent</code>. We can do that by typing the following (<code>df_cut</code> is a dataframe prepared in the accompanying notebook, it is not created in this article):</p>
<pre><code class="lang-python">fig = px.scatter(data_frame=df_cut, x=<span class="hljs-string">"Income"</span>,
                 y=<span class="hljs-string">"TotalAmountSpent"</span>,
                 title=<span class="hljs-string">"Relationship Between Customer's Income and Total Amount Spent"</span>,
                height=<span class="hljs-number">500</span>,
                color_discrete_sequence = px.colors.qualitative.G10[<span class="hljs-number">1</span>:])
fig.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/10/newplot--14-.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Analysis of relationship between customer's income and total amount spent.</em></p>
<p>We can see from the above analysis that as <code>Income</code> increases, so does <code>TotalAmountSpent</code>. So from the analysis we can postulate that <code>Income</code> is one of the key factors that determines how much a customer might spend.</p>
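<p>The scatter plot can be backed up with a number. The sketch below uses made-up values and shows one plausible way to build a trimmed dataframe like <code>df_cut</code> by dropping income outliers with the 1.5 × IQR rule, then computes the Pearson correlation. The trimming rule is an assumption, and the notebook may prepare <code>df_cut</code> differently.</p>

```python
import pandas as pd

# Made-up values; the last row is an obvious income outlier.
df = pd.DataFrame({
    "Income": [20000, 35000, 50000, 65000, 80000, 600000],
    "TotalAmountSpent": [50, 200, 600, 1100, 1600, 60],
})

# Keep rows whose Income lies within 1.5 * IQR of the quartiles.
q1, q3 = df["Income"].quantile([0.25, 0.75])
spread = q3 - q1
df_cut = df[df["Income"].between(q1 - 1.5 * spread, q3 + 1.5 * spread)]

print(len(df), len(df_cut))
# Pearson correlation between income and spending after trimming.
print(df_cut["Income"].corr(df_cut["TotalAmountSpent"]))
```

<p>A correlation close to 1 matches the upward trend in the scatter plot, while a single extreme income can distort both the plot and the coefficient if it is left in.</p>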
<h3 id="heading-multivariate-analysis">Multivariate Analysis</h3>
<p>After you've completed univariate (analysis of single feature) and bivariate (analysis of two features) analysis, the last phase of EDA is to perform Multivariate Analysis.</p>
<p>Multivariate Analysis consists of understanding the relationship between more than two variables.</p>
<p>In our project, one of the multivariate analyses we'll do is to understand the relationship between <code>Income</code>, <code>TotalAmountSpent</code>, and the customer's <code>Education</code>.</p>
<pre><code class="lang-python">fig = px.scatter(
    data_frame=df_cut,
    x = <span class="hljs-string">"Income"</span>,
    y= <span class="hljs-string">"TotalAmountSpent"</span>,
    title = <span class="hljs-string">"Relationship between Income VS Total Amount Spent Based on Education"</span>,
    color = <span class="hljs-string">"Education"</span>,
    height=<span class="hljs-number">500</span>
)
fig.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/10/newplot--15-.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Analysis of relationship between income, total amount spent, and education.</em></p>
<p>We can see from the analysis that customers with an Undergraduate education level generally spend less than customers with higher levels of education. A likely reason is that undergraduate customers typically earn less than other customers, which affects their spending habits.</p>
<h2 id="heading-how-to-build-the-segmentation-model">How to Build the Segmentation Model</h2>
<p>After we've finished our analysis, the next step is to create the model that will segment the customers. <code>KMeans</code> is the model we'll use. It is a popular segmentation model that is also quite effective.</p>
<p>The <code>KMeans</code> model is an unsupervised machine learning model that works by splitting N observations into K clusters. Each observation is assigned to the cluster whose mean, commonly referred to as the centroid, is closest to it.</p>
<p>When you fit the features into the model and specify the number of clusters or segments you want, <code>KMeans</code> will output the cluster label to which each observation in the feature belongs.</p>
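<p>To make the idea concrete, here is a minimal NumPy sketch of the basic k-means loop (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The function name <code>kmeans</code> and the toy points are made up for illustration. Scikit-learn's <code>KMeans</code> adds smarter initialization and stopping rules on top of this.</p>

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Basic k-means loop: assign to nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated blobs of made-up points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

<p>The first three points end up in one cluster and the last three in the other. Which cluster gets label 0 and which gets label 1 is arbitrary.</p>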
<p>Let's talk about the features you might want to fit into a <code>KMeans</code> model. There are no limits to the number of features you can use to build a customer segmentation model, but in my opinion, fewer is better. This is because you will be able to grasp and interpret the outcomes of each segment more easily and clearly with fewer features.</p>
<p>In our scenario, we will first construct the <code>KMeans</code> model with two features and then build the final model with three features. But, before we get started, let's go over the <code>KMeans</code> assumptions, which are as follows:</p>
<ul>
<li><p>The features must be numerical.</p>
</li>
<li><p>The features you're fitting into <code>KMeans</code> should be roughly symmetric, ideally close to normally distributed. This is because <code>KMeans</code> relies on cluster means and distances, so it is affected by outliers (values that deviate a lot from the others). As a result, any skewed feature should be transformed to reduce the skew. Fortunately, we can use NumPy's logarithm function <code>np.log()</code> for that.</p>
</li>
<li><p>The features must also be on the same scale. For this, we'll use the Scikit-learn <code>StandardScaler()</code> class.</p>
</li>
</ul>
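<p>The last two points can be checked numerically. The sketch below generates made-up, right-skewed, strictly positive values as a stand-in for a feature like <code>Income</code>, then prints the skewness before and after the log transform and the mean and standard deviation after scaling. The lognormal parameters and seed are arbitrary.</p>

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up, right-skewed, strictly positive data standing in for Income.
rng = np.random.default_rng(0)
data = pd.DataFrame({"Income": rng.lognormal(mean=10, sigma=1, size=1000)})

print(data["Income"].skew())      # strongly positive for right-skewed data
data_log = np.log(data)           # the log is only defined for positive values
print(data_log["Income"].skew())  # much closer to 0 after the transform

# Standardize to zero mean and unit variance.
scaled = StandardScaler().fit_transform(data_log)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

<p>After both steps the feature is roughly symmetric, centered near 0, and has a standard deviation of about 1, which is the shape <code>KMeans</code> handles best.</p>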
<p>We'll design our <code>KMeans</code> model now that we've grasped the main concept. So, for our first model, we'll use the <code>Income</code> and <code>TotalAmountSpent</code> features.</p>
<p>To begin, because the <code>Income</code> feature has missing values, we will fill them with the median value.</p>
<pre><code class="lang-python">df[<span class="hljs-string">"Income"</span>] = df[<span class="hljs-string">"Income"</span>].fillna(df[<span class="hljs-string">"Income"</span>].median())
</code></pre>
<p>After that, we'll assign the features we want to work with, <code>Income</code> and <code>TotalAmountSpent</code>, to a variable called <code>data</code>.</p>
<pre><code class="lang-python">data = df[[<span class="hljs-string">"Income"</span>, <span class="hljs-string">"TotalAmountSpent"</span>]]
</code></pre>
<p>Once that's done we will log-transform the features and save the result into a variable called <code>df_log</code>. Keep in mind that <code>np.log()</code> is only defined for strictly positive values.</p>
<pre><code class="lang-python">df_log = np.log(data)
</code></pre>
<p>Then we will scale the result using Scikit-learn <code>StandardScaler()</code>:</p>
<pre><code class="lang-python">std_scaler = StandardScaler()
df_scaled = std_scaler.fit_transform(df_log)
</code></pre>
<p>Once that's done we can then build the model. We'll set two parameters on the <code>KMeans</code> model, <code>n_clusters</code> and <code>random_state</code>, where:</p>
<ul>
<li><p><code>n_clusters</code> represents the number of clusters or segments to be derived from <code>KMeans</code>.</p>
</li>
<li><p><code>random_state</code> fixes the random initialization so the results are reproducible.</p>
</li>
</ul>
<p>So, in a business setting, you might know the number of clusters you want to segment customers into ahead of time. But if not, you will need to experiment with different numbers of clusters to find the optimal one.</p>
<p>Since we're not in a business setting, we will experiment with different numbers of clusters.</p>
<p>The elbow method is the strategy we'll use to select the best number of clusters. It works by plotting the error (the model's inertia) for each number of clusters and looking for the spot where the curve bends like an elbow. The ideal number of clusters is the one at that elbow.</p>
<p>Here's the code that will help us achieve that:</p>
<pre><code class="lang-python">errors = []
<span class="hljs-keyword">for</span> k <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>):
    model = KMeans(n_clusters=k, random_state=<span class="hljs-number">42</span>)
    model.fit(df_scaled)
    <span class="hljs-comment"># inertia_ is the sum of squared distances to the nearest centroid</span>
    errors.append(model.inertia_)

plt.title(<span class="hljs-string">'The Elbow Method'</span>)
plt.xlabel(<span class="hljs-string">'k'</span>); plt.ylabel(<span class="hljs-string">'Error of Cluster'</span>)
sns.pointplot(x=list(range(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>)), y=errors)
plt.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/10/Elbow.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Let's summarize what the above code does. We specified the numbers of clusters to experiment with, which are in <code>range(1, 11)</code>. Then we fit a model for each number of clusters and added its error to the list we created earlier.</p>
<p>Following that, we plot the error for each number of clusters. The diagram shows that the elbow occurs at three clusters, so three is the best value for our model. As a result, we will build the <code>KMeans</code> model using three clusters.</p>
<pre><code class="lang-python">model = KMeans(n_clusters = <span class="hljs-number">3</span>, random_state=<span class="hljs-number">42</span>)
model.fit(df_scaled)
</code></pre>
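<p>The error that the elbow method plots is the model's <code>inertia_</code>: the sum of squared distances between each point and the centre of the cluster it belongs to. As a rough sketch of that idea, here is a pure-Python version on made-up one-dimensional data. The helper name <code>within_cluster_error</code> and the sample points are assumptions for illustration, not part of the original code:</p>
<pre><code class="lang-python">def within_cluster_error(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        centre = sum(members) / len(members)
        total += sum((p - centre) ** 2 for p in members)
    return total

# Hypothetical 1-D data: two tight groups around 1.5 and 10.5
points = [1.0, 2.0, 10.0, 11.0]

one_cluster = within_cluster_error(points, [0, 0, 0, 0])
two_clusters = within_cluster_error(points, [0, 0, 1, 1])

print(one_cluster)   # 82.0
print(two_clusters)  # 1.0
</code></pre>
<p>The error drops sharply once the grouping matches the structure in the data. Adding more clusters after that point usually reduces the error only a little, and that flattening is the elbow.</p>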
<p>Now we've built our model. The next thing will be to assign the cluster label to each observation. We will assign the labels to the original features that we didn't process, that is, the <code>Income</code> and <code>TotalAmountSpent</code> columns we stored in the variable <code>data</code>.</p>
<pre><code class="lang-python">data = data.assign(ClusterLabel = model.labels_)
</code></pre>
<h3 id="heading-how-to-interpret-the-cluster-result">How to Interpret the Cluster Result</h3>
<p>Now that we've built the model, the next thing will be to interpret the result from each cluster.</p>
<p>There are numerous ways you can summarize the results of your clusters depending on what you want to achieve. The most common summary uses a measure of central tendency, such as the mean, median, or mode.</p>
<p>In our case we will use the median, because the original features have outliers and the mean is very sensitive to outliers.</p>
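<p>To see why outliers matter here, consider a small sketch using Python's built-in <code>statistics</code> module. The income figures are made up for illustration:</p>
<pre><code class="lang-python">from statistics import mean, median

# Hypothetical incomes; the last value is an outlier
incomes = [30000, 32000, 35000, 38000, 500000]

print(mean(incomes))    # 127000
print(median(incomes))  # 35000
</code></pre>
<p>A single extreme value pulls the mean far above what most customers earn, while the median stays at a typical value.</p>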
<p>So we will group the data by cluster label and find the median of <code>Income</code> and <code>TotalAmountSpent</code>. We can use the Pandas <code>groupby</code> method for that.</p>
<pre><code class="lang-python">data.groupby(<span class="hljs-string">"ClusterLabel"</span>)[[<span class="hljs-string">"Income"</span>, <span class="hljs-string">"TotalAmountSpent"</span>]].median()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/10/image-265.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can see that there is a trend within the clusters:</p>
<ul>
<li><p>Cluster 0 translates to customers who earn less and spend less.</p>
</li>
<li><p>Cluster 1 represents customers who earn more and spend more.</p>
</li>
<li><p>Cluster 2 represents customers who earn a moderate amount and spend a moderate amount.</p>
</li>
</ul>
<p>We can also visualize the relationship by entering the following code:</p>
<pre><code class="lang-python">fig = px.scatter(
    data_frame=data,
    x = <span class="hljs-string">"Income"</span>,
    y= <span class="hljs-string">"TotalAmountSpent"</span>,
    title = <span class="hljs-string">"Relationship between Income VS Total Amount Spent"</span>,
    color = <span class="hljs-string">"ClusterLabel"</span>,
    height=<span class="hljs-number">500</span>
)
fig.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/10/newplot--10-.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Analysis of relationship between income and total amount spent</em></p>
<p>Now, in the same way we built the previous model, we will build a KMeans model using 3 features (the elbow method also indicates that 3 clusters is optimal here).</p>
<pre><code class="lang-python">data = df[[<span class="hljs-string">"Age"</span>, <span class="hljs-string">"Income"</span>, <span class="hljs-string">"TotalAmountSpent"</span>]]
df_log = np.log(data)
std_scaler = StandardScaler()
df_scaled = std_scaler.fit_transform(df_log)
</code></pre>
<pre><code class="lang-python">model = KMeans(n_clusters=<span class="hljs-number">3</span>, random_state=<span class="hljs-number">42</span>)
model.fit(df_scaled)

data = data.assign(ClusterLabel= model.labels_)

result = data.groupby(<span class="hljs-string">"ClusterLabel"</span>).agg({<span class="hljs-string">"Age"</span>:<span class="hljs-string">"mean"</span>, <span class="hljs-string">"Income"</span>:<span class="hljs-string">"median"</span>, <span class="hljs-string">"TotalAmountSpent"</span>:<span class="hljs-string">"median"</span>}).round()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/10/image-249.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can see from the above summary that:</p>
<ul>
<li><p>Cluster 0 depicts young customers that earn a lot and also spend a lot.</p>
</li>
<li><p>Cluster 1 translates to older customers that earn a lot and also spend a lot.</p>
</li>
<li><p>Cluster 2 depicts young customers that earn less and also spend less.</p>
</li>
</ul>
<p>We can also visualize our result by typing the following code:</p>
<pre><code class="lang-python">fig = px.scatter_3d(data_frame=data, x=<span class="hljs-string">"Income"</span>, 
                    y=<span class="hljs-string">"TotalAmountSpent"</span>, z=<span class="hljs-string">"Age"</span>, color=<span class="hljs-string">"ClusterLabel"</span>, height=<span class="hljs-number">550</span>,
                   title = <span class="hljs-string">"Visualizing Cluster Result Using 3 Features"</span>)
fig.show()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/10/newplot--17-.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Cluster results using three features</em></p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this tutorial, you learnt how to build a customer segmentation model. There are a lot of features we didn't touch on in this article, so I suggest that you experiment with them and create customer segmentation models using different features.</p>
<p>I hope you learn more from doing that. Thank you for reading the article. Happy Coding!</p>
<p>The link to the full code can be found below. And <a target="_blank" href="https://www.freecodecamp.org/news/how-to-build-and-train-k-nearest-neighbors-ml-models-in-python/">here's an article on K-Means Clustering if you want to learn more</a>.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/ibrahim-ogunbiyi/Customer-Segmentation">https://github.com/ibrahim-ogunbiyi/Customer-Segmentation</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How the Python Lambda Function Works – Explained with Examples ]]>
                </title>
                <description>
                    <![CDATA[ One of the beautiful things about Python is that it is generally one of the most intuitive programming languages out there. Still, certain concepts can be difficult to grasp and comprehend. The lambda function is one of them. I've been there. When I ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/python-lambda-function-explained/</link>
                <guid isPermaLink="false">66d45f3d4a7504b7409c340d</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Tue, 25 Oct 2022 20:37:45 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/10/pexels-pixabay-45246--1-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>One of the beautiful things about Python is that it is generally one of the most intuitive programming languages out there. Still, certain concepts can be difficult to grasp and comprehend. The lambda function is one of them.</p>
<p>I've been there. When I first started learning Python, I skipped the lambda function because it wasn't clear to me. But with time, I began to understand it. So don't worry – if you're struggling with it, too, I've got you covered.</p>
<p>This tutorial will teach you what a lambda function is, when to use it, and some common use cases for it. Without further ado, let's get started.</p>
<h2 id="heading-what-is-a-lambda-function">What is a Lambda Function?</h2>
<p>Lambda functions are similar to user-defined functions but without a name. They're commonly referred to as anonymous functions.</p>
<p>Lambda functions are handy whenever you want to create a function that contains only a simple expression, that is, one that fits on a single line. They're also useful when you want to use the function only once.</p>
<h2 id="heading-how-to-define-a-lambda-function">How to Define a Lambda Function</h2>
<p>You can define a lambda function like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">lambda</span> argument(s) : expression
</code></pre>
<ol>
<li><p><code>lambda</code> is a keyword in Python for defining the anonymous function.</p>
</li>
<li><p><code>argument(s)</code> is a placeholder, that is, a variable that holds the value you want to pass into the function expression. A lambda function can have multiple arguments depending on what you want to achieve.</p>
</li>
<li><p><code>expression</code> is the code you want to execute in the lambda function.</p>
</li>
</ol>
<p>Notice that the anonymous function does not have a return keyword. This is because the anonymous function will automatically return the result of the expression in the function once it is executed.</p>
<p>Let's look at an example of a lambda function to see how it works. We'll compare it to a regular user-defined function.</p>
<p>Assume I want to write a function that returns twice the number I pass it. We can define a user-defined function as follows:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">f</span>(<span class="hljs-params">x</span>):</span>
  <span class="hljs-keyword">return</span> x * <span class="hljs-number">2</span>

f(<span class="hljs-number">3</span>)
&gt;&gt; <span class="hljs-number">6</span>
</code></pre>
<p>Now for a lambda function. We'll create it like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">lambda</span> x: x * <span class="hljs-number">2</span>
</code></pre>
<p>As I explained above, the lambda function does not have a return keyword. As a result, it will return the result of the expression on its own. The x in it also serves as a placeholder for the value to be passed into the expression. You can change it to whatever you want.</p>
<p>Now if you want to call a lambda function, you will use an approach known as immediately invoking the function. That looks like this:</p>
<pre><code class="lang-python">(<span class="hljs-keyword">lambda</span> x : x * <span class="hljs-number">2</span>)(<span class="hljs-number">3</span>)

&gt;&gt; <span class="hljs-number">6</span>
</code></pre>
<p>The reason for this is that the lambda function does not have a name you can call (it's anonymous), so you need to wrap the entire function in parentheses before you call it.</p>
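<p>The same pattern works for a lambda with more than one argument. As a quick sketch (the names <code>x</code> and <code>y</code> are arbitrary placeholders), the arguments are separated by commas just as in a regular function:</p>
<pre><code class="lang-python"># A lambda that takes two arguments and returns their sum,
# called immediately with the values 2 and 3
total = (lambda x, y: x + y)(2, 3)
print(total)  # 5
</code></pre>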
<h2 id="heading-when-should-you-use-a-lambda-function">When Should You Use a Lambda Function?</h2>
<p>You should use a lambda function for simple expressions, that is, ones that do not need multi-line structures such as if-else blocks, for-loops, and so on.</p>
<p>So, for example, if you want to create a function with a for-loop, you should use a user-defined function.</p>
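<p>As an illustration, a task that needs a loop and a running total reads more naturally as a regular function. The function name <code>sum_of_squares</code> is just an example:</p>
<pre><code class="lang-python">def sum_of_squares(numbers):
    """Add up the square of every number in the iterable."""
    total = 0
    for n in numbers:
        total += n ** 2
    return total

print(sum_of_squares([1, 2, 3]))  # 14
</code></pre>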
<h2 id="heading-common-use-cases-for-lambda-functions">Common Use Cases for Lambda Functions</h2>
<h3 id="heading-how-to-use-a-lambda-function-with-iterables">How to Use a Lambda Function with Iterables</h3>
<p>An iterable is essentially anything that consists of a series of values, such as characters, numbers, and so on.</p>
<p>In Python, iterables include strings, lists, dictionaries, ranges, tuples, and so on. When working with iterables, you can use lambda functions in conjunction with two common functions: <code>filter()</code> and <code>map()</code>.</p>
<h4 id="heading-filter"><code>filter()</code></h4>
<p>When you want to focus on specific values in an iterable, you can use the filter function. The following is the syntax of a filter function:</p>
<pre><code class="lang-python">filter(function, iterable)
</code></pre>
<p>As you can see, a filter function requires another function that contains the expression or operations that will be performed on the iterable.</p>
<p>For example, say I have a list such as <code>[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]</code>. Now let's say that I’m only interested in those values in that list that have a remainder of 0 when divided by 2. I can make use of <code>filter()</code> and a lambda function.</p>
<p>First, I will use a lambda function to create the expression I need, like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">lambda</span> x: x % <span class="hljs-number">2</span> == <span class="hljs-number">0</span>
</code></pre>
<p>Then I will insert it into the filter function like this:</p>
<pre><code class="lang-python">list1 = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>, <span class="hljs-number">10</span>]
filter(<span class="hljs-keyword">lambda</span> x: x % <span class="hljs-number">2</span> == <span class="hljs-number">0</span>, list1)

&gt;&gt; &lt;filter at <span class="hljs-number">0x1e3f212ad60</span>&gt; <span class="hljs-comment"># The result is always a filter object, so I need to convert it to a list using list()</span>

list(filter(<span class="hljs-keyword">lambda</span> x: x % <span class="hljs-number">2</span> == <span class="hljs-number">0</span>, list1))
&gt;&gt; [<span class="hljs-number">2</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>, <span class="hljs-number">10</span>]
</code></pre>
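<p>Because <code>filter()</code> accepts any iterable, the same approach works on a string. Here is a small sketch that keeps only the vowels of a made-up word:</p>
<pre><code class="lang-python">word = "freecodecamp"

# Keep a character only if it is a vowel
vowels = list(filter(lambda ch: ch in "aeiou", word))
print(vowels)  # ['e', 'e', 'o', 'e', 'a']
</code></pre>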
<h4 id="heading-map"><code>map()</code></h4>
<p>You use the <code>map()</code> function whenever you want to modify every value in an iterable.</p>
<pre><code class="lang-python">map(function, iterable)
</code></pre>
<p>For example, let's say I want to raise all values in the below list to the power of 2. I can easily do that using the lambda and map functions like this:</p>
<pre><code class="lang-python">list1 = [<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>]

list(map(<span class="hljs-keyword">lambda</span> x: pow(x, <span class="hljs-number">2</span>), list1))
&gt;&gt; [<span class="hljs-number">4</span>, <span class="hljs-number">9</span>, <span class="hljs-number">16</span>, <span class="hljs-number">25</span>]
</code></pre>
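<p><code>map()</code> can also take more than one iterable, in which case the lambda receives one value from each. A minimal sketch with two made-up lists:</p>
<pre><code class="lang-python">prices = [10, 20, 30]
quantities = [1, 2, 3]

# Multiply each price by the quantity at the same position
totals = list(map(lambda p, q: p * q, prices, quantities))
print(totals)  # [10, 40, 90]
</code></pre>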
<h3 id="heading-pandas-series">Pandas Series</h3>
<p>Another place you'll use lambda functions is in data science, when working with Pandas data frames. A series is a single column of a data frame, and you can transform all of the values in a series using a lambda function.</p>
<p>For example, if I have a data frame with the following columns and want to convert the values in the name column to lower case, I can do so using the Pandas apply function and a Python lambda function like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

df = pd.DataFrame(
    {<span class="hljs-string">"name"</span>: [<span class="hljs-string">"IBRAHIM"</span>, <span class="hljs-string">"SEGUN"</span>, <span class="hljs-string">"YUSUF"</span>, <span class="hljs-string">"DARE"</span>, <span class="hljs-string">"BOLA"</span>, <span class="hljs-string">"SOKUNBI"</span>],
     <span class="hljs-string">"score"</span>: [<span class="hljs-number">50</span>, <span class="hljs-number">32</span>, <span class="hljs-number">45</span>, <span class="hljs-number">45</span>, <span class="hljs-number">23</span>, <span class="hljs-number">45</span>]
    }
)
</code></pre>
<p><img src="https://user-images.githubusercontent.com/73393430/188447505-9ae1baa2-9225-4834-a630-c32b9d1a29f3.png" alt="image" width="1057" height="257" loading="lazy"></p>
<pre><code class="lang-python">df[<span class="hljs-string">"lower_name"</span>] = df[<span class="hljs-string">"name"</span>].apply(<span class="hljs-keyword">lambda</span> x: x.lower())
</code></pre>
<p>The apply function will apply the lambda function to each element of the series. The lambda function will then return a value for each element based on the expression you passed to it. In our case, the expression converts each element to lowercase.</p>
<p><img src="https://user-images.githubusercontent.com/73393430/188447749-a483bbad-a91f-40df-b008-5695efe05073.png" alt="image" width="951" height="276" loading="lazy"></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In this tutorial you learnt the basics of the lambda function and how you can commonly apply it. Thank you for taking your time to read this.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Top Python Concepts to Know Before Learning Data Science ]]>
                </title>
                <description>
                    <![CDATA[ If you're interested in learning data science, you've likely heard the buzzword "Python". It's a popular programming language often used in data science. But Python is a general-purpose programming language. This means that it's not limited to data ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/top-python-concepts-for-data-science/</link>
                <guid isPermaLink="false">66d45f4038f2dc3808b790a9</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Wed, 24 Aug 2022 17:53:30 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/08/python-data-science-concepts.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you're interested in learning data science, you've likely heard the buzzword <strong>"Python"</strong>. It's a popular programming language often used in data science.</p>
<p>But Python is a general-purpose programming language. This means that it's not limited to data science alone. You can use it to develop web and mobile applications too, among other things.</p>
<p>So, when learning Python for data science, one of the most common mistakes beginners make is learning it "incorrectly" — that is, not learning Python in preparation for Data Science. This can result in a loss of time and effort.</p>
<p>In this article, we'll go through the top Python concepts you should know before delving into data science. Now relax and follow along because this will be an exciting journey.</p>
<p>To have a quick overview of what the journey is going to be all about, here's what we'll cover:</p>
<ul>
<li><p><a class="post-section-overview" href="#heading-integers-and-floating-point-numbers-in-python">Integers and Floating-Point Numbers in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-strings-in-python">Strings in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-boolean-values-in-python">Boolean values in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-arithmetic-operators-in-python">Arithmetic operators in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-comparison-operator-in-python">Comparison Operator in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-logical-operators-in-python">Logical Operators in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-membership-operator-in-python">Membership Operator in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-f-string-formatting-in-python">F-string formatting in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-lists-in-python">Lists in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-tuples-in-python">Tuples in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-dictionaries-in-python">Dictionaries in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-zip-function-in-python"><code>Zip()</code> Function in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-enumerate-function-in-python"><code>Enumerate()</code> Function in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-counter-function-in-python"><code>Counter()</code> Function in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-if-else-statements-in-python">If-else Statements in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-range-function-in-python"><code>Range()</code> Function in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-list-comprehension-in-python">List Comprehension in Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-user-defined-functions-in-python">User Defined Functions in Python</a></p>
</li>
</ul>
<h2 id="heading-top-python-concepts-to-know-for-data-science">Top Python Concepts to Know for Data Science</h2>
<h3 id="heading-why-these-concepts-are-important-to-know">Why These Concepts Are Important to Know</h3>
<p>To put it bluntly, these concepts are what you will need to kickstart your data science journey when you want to use Python as your language for data science. You will be working with them in your day-to-day work as a data scientist, so it's good to have a firm grip on how they work.</p>
<h3 id="heading-integers-and-floating-point-numbers-in-python">Integers and Floating-Point Numbers in Python</h3>
<p>Numbers are one of the most fundamental concepts in data science. And Python contains representations (data types) for the various types of numbers that can exist. These are mostly classified into:</p>
<ul>
<li><p>Integers: these are whole numbers that can be positive, negative, or zero. Examples include 200, -100, 67, and so forth.</p>
</li>
<li><p>Floating-point numbers: these are decimal values that are either positive or negative. Examples include 200.65, -14.34, 53.0002, and so on.</p>
</li>
</ul>
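<p>You can check which of the two types a number has with the built-in <code>type()</code> function. A short sketch with made-up values:</p>
<pre><code class="lang-python">whole = 200
decimal = 200.65

print(type(whole).__name__)    # int
print(type(decimal).__name__)  # float

# Mixing an integer with a float produces a float
result = whole + 0.5
print(result)                 # 200.5
print(type(result).__name__)  # float
</code></pre>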
<h3 id="heading-strings-in-python">Strings in Python</h3>
<p>In Python, a string is a sequence of characters (letters, digits, symbols, and spaces) enclosed in single or double quotation marks.</p>
<p>An example is <code>"FreeCodeCamp has a lot of rich resources"</code>.</p>
<p>Python has a lot of methods that you can use to manipulate strings. For example if you wish to convert a string from uppercase to lowercase, you can use the <code>.lower()</code> method in Python as shown below.</p>
<pre><code class="lang-python">string = <span class="hljs-string">"FREECODECAMP IS COOL"</span>
print(string.lower())
<span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-string">'freecodecamp is cool'</span>
</code></pre>
<p>You often work with strings in Data Science to create or manipulate any textual data in your dataset.</p>
<p>To learn more about strings and their methods, <a target="_blank" href="https://www.freecodecamp.org/news/python-string-manipulation-handbook/">check out this helpful handbook</a>.</p>
<h3 id="heading-boolean-values-in-python">Boolean values in Python</h3>
<p>Boolean values are also known as binary values because they can take only one of two values: <code>True</code> or <code>False</code>. In arithmetic, they behave like the numbers <code>1</code> and <code>0</code>.</p>
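<p>A brief sketch of how Boolean values behave, including their link to <code>1</code> and <code>0</code> (the variable names are made up):</p>
<pre><code class="lang-python">is_active = True
print(type(is_active).__name__)  # bool

# True and False compare equal to 1 and 0
print(True == 1)   # True
print(False == 0)  # True

# So they can be summed, which is handy for counting matches
flags = [True, False, True, True]
print(sum(flags))  # 3
</code></pre>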
<h3 id="heading-arithmetic-operators-in-python">Arithmetic operators in Python</h3>
<p>You use arithmetic operators to perform mathematical operations on two numerical operands or values. They include the following:</p>
<ul>
<li><p>The plus symbol <code>+</code> represents addition.</p>
</li>
<li><p>The dash symbol <code>-</code> represents subtraction.</p>
</li>
<li><p>The asterisk symbol <code>*</code> represents multiplication.</p>
</li>
<li><p>The slash symbol <code>/</code> represents division.</p>
</li>
<li><p>The percentage symbol <code>%</code> is <a target="_blank" href="https://www.freecodecamp.org/news/the-python-modulo-operator-what-does-the-symbol-mean-in-python-solved/">used to express the modulus</a>.</p>
</li>
<li><p>The double asterisk symbol <code>**</code> represents an exponent.</p>
</li>
<li><p>The double slash symbol <code>//</code> represents floor division.</p>
</li>
</ul>
<p>The first four operators are quite straightforward because we deal with them on a daily basis. However, the following require a bit more explanation:</p>
<h4 id="heading-what-is-the-modulus-operator">What is the modulus operator?</h4>
<p>The modulus operator (<code>%</code>) returns the remainder of dividing one number by another. For example, <code>8 % 3</code> will return 2, since 3 goes into 8 twice (making 6), leaving a remainder of 2.</p>
<h4 id="heading-what-is-the-exponential-operator">What is the exponential operator?</h4>
<p>You use the exponential operator <code>**</code> to raise a number to the power of another. For example, <code>2**3</code> equals 8, because 2 is used as a factor three times: <code>2*2*2 = 8</code>.</p>
<h4 id="heading-what-is-the-floor-division-operator">What is the floor division operator?</h4>
<p>You use the floor division operator <code>//</code> to divide. But unlike the regular division operator <code>/</code>, which produces a decimal number, floor division rounds the result down to the nearest whole number.</p>
<p>For example, <code>5//2</code> will result in 2, because 5 divided by 2 is 2.5 and floor division drops the fractional part. Floor division does not round to the nearest number: it always rounds down.</p>
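<p>The three operators above can be tried side by side. This sketch also includes a negative number to show that floor division rounds down rather than towards zero:</p>
<pre><code class="lang-python">print(8 % 3)    # 2    (remainder of 8 divided by 3)
print(2 ** 3)   # 8    (2 * 2 * 2)
print(5 / 2)    # 2.5  (regular division)
print(5 // 2)   # 2    (floor division)
print(-5 // 2)  # -3   (rounds down, not towards zero)
</code></pre>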
<h4 id="heading-how-to-perform-arithmetic-operations-on-a-string">How to perform arithmetic operations on a string</h4>
<p>You can also perform arithmetic operations on a string. Addition and multiplication are the two arithmetic operations that you can perform on a string.</p>
<ul>
<li>Addition operator <code>+</code>: you use the addition operator to concatenate two string operands (that is, you join two strings together). For example:</li>
</ul>
<pre><code class="lang-python"><span class="hljs-string">"Folks"</span> + <span class="hljs-string">"connect"</span> 
<span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-string">"Folksconnect"</span>
</code></pre>
<ul>
<li>Multiplication operator <code>*</code>: you use the multiplication operator to repeat a string (but note that one of the operands must be an integer). For example:</li>
</ul>
<pre><code class="lang-python"><span class="hljs-number">2</span> * <span class="hljs-string">"Folks"</span> 
<span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-string">"FolksFolks"</span>
</code></pre>
<h3 id="heading-comparison-operator-in-python">Comparison Operator in Python</h3>
<p>You use comparison operators to compare two operands. A comparison returns a Boolean value of either <code>True</code> or <code>False</code>. The comparison operators include:</p>
<ul>
<li><p>Greater than sign <code>&gt;</code></p>
</li>
<li><p>Less than sign <code>&lt;</code></p>
</li>
<li><p>Equality sign <code>==</code></p>
</li>
<li><p>Not equal sign <code>!=</code></p>
</li>
<li><p>Greater than or equals to <code>&gt;=</code></p>
</li>
<li><p>Less than or equals to <code>&lt;=</code></p>
</li>
</ul>
<p>Here are some examples: <code>2==2</code> will result in <code>True</code>. Also <code>5&gt;= 5</code> will result in <code>True</code> since 5 is also equal to 5.</p>
<h3 id="heading-logical-operators-in-python">Logical Operators in Python</h3>
<p>You use logical operators to combine conditional statements. They include <code>and</code>, <code>or</code>, and <code>not</code>.</p>
<p>For example, <code>4 &lt; 5 and 3 &gt; 2</code> will return <code>True</code>, because <code>4 &lt; 5</code> is a condition that is <code>True</code> and <code>3 &gt; 2</code> is another condition that is <code>True</code>. When both sides of <code>and</code> are <code>True</code>, the whole expression evaluates to <code>True</code>.</p>
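<p>Here is a small sketch of the three logical operators, using two made-up Boolean variables:</p>
<pre><code class="lang-python">has_income_data = True
has_age_data = False

print(has_income_data and has_age_data)  # False (both must be True)
print(has_income_data or has_age_data)   # True  (at least one is True)
print(not has_age_data)                  # True  (flips the value)
</code></pre>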
<p>Before we move on, I want to define a term that I will be using often in the rest of the article: iterables. An iterable is basically something that consists of a sequence of values, for example characters, numbers, and so on. In Python, iterables include strings, lists, dictionaries, ranges, tuples, and so on.</p>
<h3 id="heading-membership-operator-in-python">Membership Operator in Python</h3>
<p>You use the membership operators to determine whether a value belongs to a sequence/iterable. A sequence can be a string of characters, a list of numbers, or anything else.</p>
<p>The membership operators are the <code>in</code> operator and the <code>not in</code> operator.</p>
<p>For example, let's say I want to check if the character <code>b</code> is in the string <code>"what a time to be alive"</code>. I can do that by typing the following statement, and the result will be a Boolean value.</p>
<pre><code class="lang-python"><span class="hljs-string">"b"</span> <span class="hljs-keyword">in</span> <span class="hljs-string">"what a time to be alive"</span>


<span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-literal">True</span>
</code></pre>
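<p>The same operators work on other iterables too. A short sketch with a made-up list, including <code>not in</code>:</p>
<pre><code class="lang-python">languages = ["Python", "R", "SQL"]

print("Python" in languages)    # True
print("Java" in languages)      # False
print("Java" not in languages)  # True

# Membership in a string is case-sensitive
print("B" in "what a time to be alive")  # False
</code></pre>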
<p>To learn more about the operators in Python check out <a target="_blank" href="https://www.freecodecamp.org/news/basic-operators-in-python-with-examples/">these</a> <a target="_blank" href="https://www.freecodecamp.org/news/operators-in-python-how-to-use-logical-operators-in-python/">articles</a>.</p>
<h3 id="heading-f-string-formatting-in-python">F-string formatting in Python</h3>
<p>In some cases, you may want to insert a variable value within a string. Assume you don't know the value ahead of time but want it to be within a string. String formatting can help you achieve this.</p>
<p>There are several ways to format strings in Python, but we will focus on one of them: f-strings, also known as formatted string literals.</p>
<p>Let's look at an example: I have two variables, name and age, and I want to include them in a string and then print out the entire string.</p>
<pre><code class="lang-python">age = <span class="hljs-number">10</span>
name = <span class="hljs-string">"Eagle"</span>

string = <span class="hljs-string">f"There are some birds of prey such as <span class="hljs-subst">{name}</span> that are older than <span class="hljs-subst">{age}</span> years."</span>

print(string)

<span class="hljs-meta">&gt;&gt;&gt; </span>There are some birds of prey such <span class="hljs-keyword">as</span> Eagle that are older than <span class="hljs-number">10</span> years.
</code></pre>
<p>So the first thing to do is add an <code>f</code> in front of the string you wish to format. Then place each variable you want to insert inside curly braces.</p>
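<p>The curly braces can hold any expression, not just a variable name, and an optional format specifier after a colon controls how the value is displayed. A minimal sketch with made-up values:</p>
<pre><code class="lang-python">price = 19.987
quantity = 3

# An expression inside the braces, shown with 2 decimal places
message = f"Total: {price * quantity:.2f}"
print(message)  # Total: 59.96
</code></pre>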
<p>To learn more about string formatting using f-literals, check out this article from <a target="_blank" href="https://www.freecodecamp.org/news/python-f-strings-tutorial-how-to-use-f-strings-for-string-formatting/">Bala Priya that explains it</a>. Also, you can learn more about other types of string formatting <a target="_blank" href="https://www.geeksforgeeks.org/string-formatting-in-python/">here</a>.</p>
<h3 id="heading-lists-in-python">Lists in Python</h3>
<p>You use lists to store or organize data in sequential order. The items can be strings, numbers, or even other iterables such as lists.</p>
<p>A list is also mutable, which means that you can change it after you create it (for example, by adding new elements to it).</p>
<p>In Python, you can create a list with square brackets and then save it to a variable. For instance:</p>
<pre><code class="lang-python">lst_of_num = [<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">2</span>]
</code></pre>
<p>As we can see, the preceding is a list of numbers. The beauty of a list is that it allows you to have duplicate values in the list. As previously stated, you can create a list of different data types, such as a list of numbers, strings, and lists.</p>
<pre><code class="lang-python">diverse_lst = [<span class="hljs-number">4</span>, <span class="hljs-string">"Folks"</span>, [<span class="hljs-string">"2"</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>]]
</code></pre>
<p>To get to a list item or element, you use indexing. In Python, the first element of any sequence is always at index 0. In other words, a list's positions begin at 0. As an example, the elements of the <code>lst_of_num</code> variable sit at the following indexes or positions.</p>
<pre><code class="lang-python">lst_of_num = [<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">2</span>]

<span class="hljs-comment"># 2 -- index or position 0</span>
<span class="hljs-comment"># 3 -- index or position 1</span>
<span class="hljs-comment"># 4 -- index or position 2</span>
<span class="hljs-comment"># 2 -- index or position 3</span>
</code></pre>
<p>You can access a list element using the following approach:</p>
<p><code>name_of_list[index or position]</code></p>
<p>In our case, if you want to access the element at index 3 (the fourth and last element), you can do that by typing:</p>
<pre><code class="lang-python">print(lst_of_num[<span class="hljs-number">3</span>])
&gt;&gt; <span class="hljs-number">2</span>
</code></pre>
<p>You'll use lists a lot in data science. You will need them whenever you want to keep a sequence of values in one container.</p>
<p>To learn how to add, remove, or update a list, check out this helpful tutorial by Ihechikara Vincent Abba on <a target="_blank" href="https://www.freecodecamp.org/news/how-to-make-a-list-in-python-declare-lists-in-python-example/">how to make a list in Python</a>.</p>
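<p>Here is a minimal sketch of those three operations (adding, updating, and removing) on the same <code>lst_of_num</code> list. The new values are arbitrary.</p>

```python
lst_of_num = [2, 3, 4, 2]

lst_of_num.append(7)   # add to the end      -> [2, 3, 4, 2, 7]
lst_of_num[0] = 10     # update by index     -> [10, 3, 4, 2, 7]
lst_of_num.remove(2)   # remove the first 2  -> [10, 3, 4, 7]

print(lst_of_num)      # [10, 3, 4, 7]
print(lst_of_num[-1])  # a negative index counts from the end: 7
```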
<h3 id="heading-tuples-in-python">Tuples in Python</h3>
<p>A tuple is another data collection type in Python. You also use it to store and organize data in the form of a list.</p>
<p>The main difference is that a tuple is immutable, which means it cannot change after you create it (you can't add, remove, or replace elements), unlike a list.</p>
<p>In Python, you can make a tuple by using parentheses.</p>
<pre><code class="lang-python">my_tuple = (<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">5</span>) <span class="hljs-comment"># This is a tuple of numbers.</span>

<span class="hljs-comment"># A tuple can also contain different data types:</span>

diverse_tuple = (<span class="hljs-number">2</span>, <span class="hljs-string">"Golang"</span>, [<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">2</span>], (<span class="hljs-string">"day"</span>, <span class="hljs-string">"night"</span>))
</code></pre>
<p>To access elements in a tuple, you do the same thing as with a list:</p>
<pre><code class="lang-python">my_tuple[<span class="hljs-number">2</span>]
<span class="hljs-meta">&gt;&gt;&gt; </span><span class="hljs-number">5</span>
</code></pre>
<p>Tuples come in handy when you need a Python collection that should not change once it is created.</p>
<p>If you want to know more about tuples check out this <a target="_blank" href="https://www.w3schools.com/python/python_tuples.asp">article</a>. Also if you want to know more about the differences between lists and tuples, check out <a target="_blank" href="https://www.freecodecamp.org/news/python-tuple-vs-list-what-is-the-difference/">this helpful article by Dionysia Lemonaki that explains it</a>.</p>
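<p>To see immutability in action, the short sketch below tries to change a tuple element and catches the error Python raises. It also shows that "adding" to a tuple really builds a new tuple. The name <code>bigger_tuple</code> is just for illustration.</p>

```python
my_tuple = (2, 3, 5)

try:
    my_tuple[0] = 10  # item assignment is not allowed on a tuple
except TypeError as error:
    print(f"Tuples are immutable: {error}")

# Concatenation does not modify my_tuple; it creates a brand-new tuple.
bigger_tuple = my_tuple + (7,)
print(my_tuple)      # (2, 3, 5)
print(bigger_tuple)  # (2, 3, 5, 7)
```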
<h3 id="heading-dictionaries-in-python">Dictionaries in Python</h3>
<p>A dictionary is a Python collection that stores data as key-value pairs. You can create a dictionary using curly braces. Also dictionaries are mutable. For example:</p>
<pre><code class="lang-python">my_dict = {<span class="hljs-string">"names"</span>:[<span class="hljs-string">"Grace"</span>, <span class="hljs-string">"Dave"</span>, <span class="hljs-string">"Jack"</span>], <span class="hljs-string">"scores"</span>:[<span class="hljs-number">45</span>, <span class="hljs-number">56</span>, <span class="hljs-number">70</span>]}
</code></pre>
<p>The value before the colon is referred to as the key and must be an immutable (hashable) data type such as a string, integer, or tuple. The value after the colon is simply called the value and can be any data type, mutable or immutable, such as a list or another dictionary.</p>
<p>You can access a dictionary's values through keys. For example, say I want to get the name of a student from the above dictionary. I can just do that easily through the use of keys, like this:</p>
<pre><code class="lang-python">print(my_dict[<span class="hljs-string">"names"</span>])
<span class="hljs-meta">&gt;&gt;&gt; </span>[<span class="hljs-string">'Grace'</span>, <span class="hljs-string">'Dave'</span>, <span class="hljs-string">'Jack'</span>]
</code></pre>
<p>You will often need dictionaries for key-value pairs-related tasks or when you wish to transform something into a series/dataframe in Pandas (a library you will work with mostly for data manipulation).</p>
<p>To learn more about dictionaries and how to add, update, or delete from a dictionary, check out <a target="_blank" href="https://www.freecodecamp.org/news/create-a-dictionary-in-python-python-dict-methods/">this helpful tutorial by Dionysia Lemonaki that explains them</a>. Here's also a <a target="_blank" href="https://www.freecodecamp.org/news/python-dictionary-methods-dictionaries-in-python/">helpful article from Kolade Chris about dictionaries</a>.</p>
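<p>As a quick sketch of adding, updating, and deleting entries, the example below reuses <code>my_dict</code>. The <code>"ages"</code> key and its values are hypothetical and only there to have something to add and remove.</p>

```python
my_dict = {"names": ["Grace", "Dave", "Jack"], "scores": [45, 56, 70]}

my_dict["ages"] = [14, 15, 13]    # add a new key (hypothetical data)
my_dict["scores"] = [50, 56, 70]  # update the value of an existing key
del my_dict["ages"]               # delete a key and its value

# .get() returns a fallback instead of raising KeyError for a missing key.
print(my_dict.get("grades", "no such key"))  # no such key
print(list(my_dict.keys()))                  # ['names', 'scores']
```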
<h3 id="heading-zip-function-in-python"><code>zip()</code> Function in Python</h3>
<p>You use the <code>zip()</code> function to zip (combine) two or more iterables such as lists, tuples, and so on. The elements at the same position in each iterable are paired together.</p>
<p>To put it another way, the first element of the first iterable is paired with the first element of the second iterable. You typically use the zip function to merge two lists or tuples into a dictionary. Let's see how that goes.</p>
<p>Let's say I have a list that contains the name of a student and another list that contains the score of each student. Now If I want to map the name of each student to their respective score, I can do that using the zip function.</p>
<pre><code class="lang-python">name = [<span class="hljs-string">"Dave"</span>, <span class="hljs-string">"Jerry"</span>, <span class="hljs-string">"Sasha"</span>]
score = [<span class="hljs-number">43</span>, <span class="hljs-number">56</span>, <span class="hljs-number">78</span>]
result = zip(name, score)
</code></pre>
<p>Now we are almost finished. If you print the result from the above code, you will see a <code>zip</code> object, which is an iterator rather than the pairs themselves. The last thing we need to do is use the <code>dict()</code> function, which converts an iterable of pairs into a dictionary.</p>
<pre><code class="lang-python">print(dict(result))
<span class="hljs-meta">&gt;&gt;&gt; </span>{<span class="hljs-string">'Dave'</span>: <span class="hljs-number">43</span>, <span class="hljs-string">'Jerry'</span>: <span class="hljs-number">56</span>, <span class="hljs-string">'Sasha'</span>: <span class="hljs-number">78</span>}
</code></pre>
<p>You will often use the <code>zip()</code> function to join two lists into a dictionary in Data Science.</p>
<p>To learn more about <code>zip()</code> function check out this helpful tutorial by Ihechikara Vincent Abba <a target="_blank" href="https://www.freecodecamp.org/news/python-zip-zip-function-in-python/">here.</a></p>
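<p>Two things are worth trying for yourself: looping over the pairs directly, and noticing that a <code>zip</code> object can only be consumed once. The sketch below shows both. The names <code>pairs</code>, <code>first</code>, and <code>second</code> are illustrative.</p>

```python
names = ["Dave", "Jerry", "Sasha"]
scores = [43, 56, 78]

# Each step of the loop unpacks one (name, score) pair.
for name, score in zip(names, scores):
    print(f"{name} scored {score}")

# A zip object is a one-shot iterator: once consumed, it is empty.
pairs = zip(names, scores)
first = dict(pairs)   # uses up every pair
second = dict(pairs)  # nothing left to read

print(first)   # {'Dave': 43, 'Jerry': 56, 'Sasha': 78}
print(second)  # {}
```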
<h3 id="heading-enumerate-function-in-python"><code>enumerate()</code> Function in Python</h3>
<p>In Python, you use the enumerate function to assign or pair index or position values to the values in an iterable (remember, index values start at 0).</p>
<p>Once those index values are paired to the iterable values, you can decide to turn it into a dictionary where the index values will now serve as a key for the values in the iterable.</p>
<p>Let's look at an example to see how it works.</p>
<pre><code class="lang-python">lst = [<span class="hljs-string">"Free"</span>, <span class="hljs-string">"Code"</span>, <span class="hljs-string">"Camp"</span>]
result = dict(enumerate(lst))
print(result)
<span class="hljs-meta">&gt;&gt;&gt; </span>{<span class="hljs-number">0</span>: <span class="hljs-string">'Free'</span>, <span class="hljs-number">1</span>: <span class="hljs-string">'Code'</span>, <span class="hljs-number">2</span>: <span class="hljs-string">'Camp'</span>}
</code></pre>
<p>You will often use the <code>enumerate()</code> function to assign an index to each item in a list and then turn the result into a dictionary.</p>
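<p>Another common pattern, sketched below, is to use <code>enumerate()</code> directly in a for loop. The optional <code>start</code> argument changes the first index. Here it is set to 1 purely as an example.</p>

```python
lst = ["Free", "Code", "Camp"]

# enumerate yields (index, value) pairs; start=1 makes counting begin at 1.
for position, word in enumerate(lst, start=1):
    print(position, word)

numbered = dict(enumerate(lst, start=1))
print(numbered)  # {1: 'Free', 2: 'Code', 3: 'Camp'}
```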
<h3 id="heading-counter-function-in-python"><code>Counter()</code> Function in Python</h3>
<p>The <code>Counter</code> function, as the name implies, lets you count the number of times each value in an iterable occurs.</p>
<p><code>Counter</code> produces a counter object that behaves like a dictionary. To use <code>Counter</code>, we first need to import it from the <code>collections</code> module. Let's see how that works.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter
lst = [<span class="hljs-string">"Free"</span>, <span class="hljs-string">"Code"</span>, <span class="hljs-string">"Camp"</span>, <span class="hljs-string">"Code"</span>, <span class="hljs-string">"Free"</span>]
print(Counter(lst))
<span class="hljs-meta">&gt;&gt;&gt; </span>Counter({<span class="hljs-string">'Free'</span>: <span class="hljs-number">2</span>, <span class="hljs-string">'Code'</span>: <span class="hljs-number">2</span>, <span class="hljs-string">'Camp'</span>: <span class="hljs-number">1</span>})
</code></pre>
<p>You will often use the <code>Counter</code> function when performing natural language processing in data science.</p>
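<p>As a rough illustration of that kind of text-processing use, the sketch below counts the words in a made-up sentence and asks for the most frequent one. Splitting on whitespace is a simplification. Real NLP work usually needs proper tokenization.</p>

```python
from collections import Counter

# A made-up sentence, split into words on whitespace.
sentence = "the quick brown fox jumps over the lazy dog the end"
word_counts = Counter(sentence.split())

print(word_counts["the"])          # 3
print(word_counts.most_common(1))  # [('the', 3)]
print(word_counts["cat"])          # a missing key counts as 0, not an error
```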
<h3 id="heading-if-else-statements-in-python">If-else Statements in Python</h3>
<p>You use if-else statements when you want to execute a task based on a certain condition. In real life, for example, if you pass your exam, you will be promoted. But if you fail, you will have to take it again in order to be promoted.</p>
<p>This type of expression, it turns out, can also be executed in Python using the if-else statement. This is how you write an if else statement:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> condition:
    execute statement
<span class="hljs-keyword">else</span>:
    execute statement
</code></pre>
<p>In our exam example, the condition for the above expression is whether you pass or not, and the executable statement is what happens as a result: you are promoted, or you take the exam again.</p>
<p>Now what the above expression does is if the condition is evaluated to true, the executable statement inside the if block gets executed. If the condition is not true, the executable statement inside the else block gets executed.</p>
<p>Let's go over an example so we can grok what we just talked about.</p>
<p>Assume I have a list of numbers like <code>[4, 5, 6, 8, 10]</code>, and I have a variable <code>i</code> with the value <code>6</code>. Now I need to write an if-else statement that will print whether or not the <code>i</code> is in the list.</p>
<p>As you might expect, our condition will be whether or not <code>i</code> is in the list, and our executable statement will be to print a message to us. You can do this using the code provided above like this:</p>
<pre><code class="lang-python">lst = [<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>, <span class="hljs-number">10</span>]
i = <span class="hljs-number">6</span>

<span class="hljs-keyword">if</span> i <span class="hljs-keyword">in</span> lst:
    print(<span class="hljs-string">"Yes 6 is present in the list"</span>)
<span class="hljs-keyword">else</span>:
    print(<span class="hljs-string">"No 6 is not present in the list"</span>)

<span class="hljs-meta">&gt;&gt;&gt; </span>Yes 6 is present in the list
</code></pre>
<p>The expression <code>i in lst</code> is the condition, which evaluates to <code>True</code> or <code>False</code>. If <code>i</code> were not present in the list, the statement in the else block would run instead.</p>
<p>You will often need if-else statements to perform conditional operations in Data science.</p>
<p>To learn more about if-else statements, check out this article written by Dionysia Lemonaki that <a target="_blank" href="https://www.freecodecamp.org/news/python-else-if-statement-example/">explains Python if-else statements simply</a>.</p>
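<p>The exam example from the start of this section can be sketched the same way. The pass mark of 50 and the score of 64 are assumptions made up for this illustration.</p>

```python
pass_mark = 50   # hypothetical pass mark
exam_score = 64  # hypothetical score

# The condition compares the score with the pass mark.
if exam_score >= pass_mark:
    outcome = "promoted"
else:
    outcome = "retake the exam"

print(outcome)  # promoted
```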
<h3 id="heading-range-function-in-python"><code>range()</code> Function in Python</h3>
<p>The <code>range()</code> function, as the name implies, provides a sequence of integers within a specific range. It basically works like this: <code>range(start, end)</code> produces the values from <code>start</code> up to <code>end - 1</code>. That is, it will not include the end value.</p>
<p>Let's say I want a list of numbers ranging from 2 to 10. Instead of creating a list and typing out those items, I can use the range function and then convert the result to a list. For example:</p>
<pre><code class="lang-python"><span class="hljs-comment"># remember the end value is excluded, so range(2, 11) yields values from 2 to 10</span>
no_range = range(<span class="hljs-number">2</span>, <span class="hljs-number">11</span>)
print(list(no_range))
<span class="hljs-meta">&gt;&gt;&gt; </span>[<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>, <span class="hljs-number">10</span>]
</code></pre>
<p>In data science, you will often need the <code>range()</code> function when you want a long run of consecutive numbers without typing them out.</p>
<p>To learn more about range function check out this helpful tutorial from Bala Priya <a target="_blank" href="https://www.freecodecamp.org/news/python-range-function-explained-with-code-examples/">here</a>.</p>
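<p><code>range()</code> also accepts an optional third argument, the step, and can be called with a single argument. The sketch below shows a few variations. The variable names are illustrative.</p>

```python
evens = list(range(2, 11, 2))      # step of 2, end value 11 excluded
countdown = list(range(5, 0, -1))  # a negative step counts downwards
first_five = list(range(5))        # one argument: start defaults to 0

print(evens)       # [2, 4, 6, 8, 10]
print(countdown)   # [5, 4, 3, 2, 1]
print(first_five)  # [0, 1, 2, 3, 4]
```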
<h3 id="heading-for-loops-in-python">For-Loops in Python</h3>
<p>The for loop statement allows you to repeat a task a predefined number of times. The syntax for a for-loop basically looks like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> iterable:
    execute statement


<span class="hljs-comment"># i is a variable (you can change its name to anything you prefer) that acts as a</span>
<span class="hljs-comment"># placeholder for each item in the iterable (for example a dictionary, list, or string).</span>
</code></pre>
<p>Assume I have a list containing the names of thousands of students and I want to print those names. Now instead of doing it the manual way (where I access the names in the list through indexing like <code>print(names[10])</code> up to the <code>1000th</code> element), I can easily employ a for-loop since I want to perform the same task repeatedly.</p>
<p>For example:</p>
<pre><code class="lang-python">lst  = [<span class="hljs-string">"Free"</span>, <span class="hljs-string">"Code"</span>, <span class="hljs-string">"Camp"</span>, <span class="hljs-string">"is"</span>, <span class="hljs-string">"the"</span>, <span class="hljs-string">"best"</span>, <span class="hljs-string">"place"</span>, <span class="hljs-string">"to"</span>, <span class="hljs-string">"learn"</span>]
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> lst:
    print(i)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/08/image-123.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You will often need for loops in Data Science to iterate through an iterable and perform a certain task on each item.</p>
<p>We can see that the <code>i</code> variable serves as a placeholder to access each item in the list. To learn more about for-loops and all their applications check out this helpful tutorial by Kolade Chris <a target="_blank" href="https://www.freecodecamp.org/news/python-for-loop-example-how-to-write-loops-in-python/">here</a>.</p>
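<p>A for loop can also do some work on each item instead of just printing it. As a minimal sketch, the example below walks through a dictionary of made-up student scores and accumulates a total to compute the average.</p>

```python
# Hypothetical scores, used only for illustration.
scores = {"Dave": 43, "Jerry": 56, "Sasha": 78}

total = 0
# .items() yields (key, value) pairs, unpacked here into name and score.
for name, score in scores.items():
    print(f"{name}: {score}")
    total += score

average = total / len(scores)
print(average)  # 59.0
```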
<h3 id="heading-list-comprehension-in-python">List Comprehension in Python</h3>
<p>A list comprehension is a simple method of generating a new list from another iterable using specific operations.</p>
<p>Assume I have a tuple with some values and want to make a new list from it that only contains values from the tuple that can be divided by 3.</p>
<p>One method is to create an empty list and then use a for loop to iterate through all of the elements in the tuple. Inside the loop, you write an if statement for the condition you want and then append the values that match that condition to the empty list you initialized. Here's what that looks like in code:</p>
<pre><code class="lang-python">my_tuple = (<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">10</span>, <span class="hljs-number">12</span>)
my_new_lst = []
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> my_tuple:
    <span class="hljs-keyword">if</span> i % <span class="hljs-number">3</span> == <span class="hljs-number">0</span>:
        my_new_lst.append(i)
print(my_new_lst)
<span class="hljs-meta">&gt;&gt;&gt; </span>[<span class="hljs-number">3</span>, <span class="hljs-number">6</span>, <span class="hljs-number">12</span>]
</code></pre>
<p>I can also do that using list comprehension in just one line of code. Let's see how that's done:</p>
<pre><code class="lang-python">my_tuple = (<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">10</span>, <span class="hljs-number">12</span>)

my_new_lst = [i <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> my_tuple <span class="hljs-keyword">if</span> i % <span class="hljs-number">3</span> == <span class="hljs-number">0</span>]
print(my_new_lst)

&gt;&gt;&gt; [<span class="hljs-number">3</span>, <span class="hljs-number">6</span>, <span class="hljs-number">12</span>]
</code></pre>
<p>The single line that builds <code>my_new_lst</code> above is the list comprehension.</p>
<p>To begin, the <code>for</code> part iterates through the tuple, with <code>i</code> acting as a placeholder for each item in the tuple. Each <code>i</code> is then checked against the condition after <code>if</code>. If the condition evaluates to true, <code>i</code> is added to the newly created list.</p>
<p>You will often need list comprehension in Data Science when you need a simple way to create a new list from an existing list.</p>
<p>To learn more about list comprehension check out this helpful tutorial by Dionysia Lemonaki <a target="_blank" href="https://www.freecodecamp.org/news/list-comprehension-in-python-with-code-examples/">here</a>.</p>
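<p>A list comprehension can also transform each item rather than filter it. The sketch below reuses <code>my_tuple</code>. The names <code>squares</code> and <code>labels</code> are illustrative.</p>

```python
my_tuple = (2, 3, 4, 6, 10, 12)

# Transform every item: the expression before "for" is what gets stored.
squares = [i ** 2 for i in my_tuple]

# A conditional expression before "for" chooses a value for every item.
labels = ["even" if i % 2 == 0 else "odd" for i in my_tuple]

print(squares)  # [4, 9, 16, 36, 100, 144]
print(labels)   # ['even', 'odd', 'even', 'even', 'even', 'even']
```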
<h3 id="heading-user-defined-functions-in-python">User Defined Functions in Python</h3>
<p>User defined means functions you create yourself from scratch.</p>
<p>You use functions to modularize or group a large amount of code into smaller pieces. Functions are useful when you need to execute a set of code repeatedly. Instead of typing out that code again and again whenever you need it, you can easily modularize it into a function and then call the function (which is just a one-line statement) whenever you need it.</p>
<p>In Python, you create a function in the following manner.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">function_name</span>(<span class="hljs-params">parameter1, parameter2, ....</span>):</span>

    <span class="hljs-comment"># execute statement</span>

    <span class="hljs-keyword">return</span> value
</code></pre>
<ul>
<li><p><code>Parameter</code> in the function serves as a placeholder to hold any value you want to pass inside the function executable statement. You can have more than one parameter depending on what you wish to achieve.</p>
</li>
<li><p><code>Execute statement</code> means the code that you wish to execute any time you call the function.</p>
</li>
<li><p><code>return</code> is a keyword. It's not compulsory for a function to return a value. You might decide not to return anything.</p>
</li>
</ul>
<p>Let's look at an example of how to write a function. For example, suppose you want to run some Python code that asks for a person's name and age. You also want to create a conditional statement that prints a message based on the person's age.</p>
<p>Now you wish to execute this code over and over again because you want to try it out on different people. You can easily write a function that will group this code into a piece, which you can then call whenever you need it.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">print_func</span>(<span class="hljs-params">person_name, person_age</span>):</span>
    <span class="hljs-keyword">if</span> person_age &gt; <span class="hljs-number">10</span>:
        print(<span class="hljs-string">f"Hi <span class="hljs-subst">{person_name}</span> you are more than your denary age and your name contains <span class="hljs-subst">{len(person_name)}</span> characters."</span>)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">f"Hi <span class="hljs-subst">{person_name}</span> you are still in your denary age and your name contains <span class="hljs-subst">{len(person_name)}</span> characters."</span>)
</code></pre>
<p>Now let's go over what we have above. We created a function named <code>print_func</code> which requires two parameters that we want to pass into it: they are <code>person_name</code> and <code>person_age</code>.</p>
<p>The executable statement is the if-else statement we created inside the function, which prints one message if a person's age is greater than 10 and another message if it is not.</p>
<p>You can see that we make use of string formatting to print the person's name and the length of the person's name. Also we decided not to return anything since we just want to print a value to the console.</p>
<p>Now if you wish to call this function, you call it by its name and pass the arguments it requires. In our case it requires a name and an age.</p>
<pre><code class="lang-python">name = <span class="hljs-string">"Ibrahim"</span>
age = <span class="hljs-number">12</span>
print_func(name, age)

<span class="hljs-meta">&gt;&gt;&gt; </span>Hi Ibrahim you are more than your denary age <span class="hljs-keyword">and</span> your name contains <span class="hljs-number">7</span> characters.
</code></pre>
<p>You will often need functions to modularize your code in Data Science.</p>
<p>To learn more about how to create functions, check out this helpful tutorial on functions for beginners from Bala Priya <a target="_blank" href="https://www.freecodecamp.org/news/functions-in-python-a-beginners-guide/">here</a>. Also check this one from Dionysia Lemonaki on how to declare and invoke functions with params <a target="_blank" href="https://www.freecodecamp.org/news/python-function-examples-how-to-declare-and-invoke-with-parameters-2/">here</a>.</p>
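<p>The <code>print_func</code> example prints a message and returns nothing. For contrast, here is a minimal sketch of a function that returns a value you can store and reuse. The function name <code>average_score</code> is hypothetical, and it assumes the list it receives is not empty.</p>

```python
def average_score(scores):
    """Return the mean of a non-empty list of numbers."""
    return sum(scores) / len(scores)


# The returned value can be stored in a variable and used later.
class_average = average_score([45, 56, 70])
print(class_average)  # 57.0
```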
<h2 id="heading-conclusion">Conclusion</h2>
<p>We've come to the end of this long journey. You may be wondering whether you should learn advanced topics like object-oriented programming (OOP), which includes concepts like classes, before learning data science.</p>
<p>To answer your question directly, it's not necessary. The majority of your data science work will revolve around these concepts we discussed in this tutorial, and you will primarily use functions to modularize your code.</p>
<p>Still, as your knowledge grows, it's useful to learn OOP in case you need to contribute to an open source project.</p>
<p>Thank you for taking the time to read this article. I hope you learned a thing or two.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Top Evaluation Metrics for Regression Problems in Machine Learning ]]>
                </title>
                <description>
                    <![CDATA[ A regression problem is a common type of supervised learning problem in Machine Learning. The end goal is to predict quantitative values – for example, continuous values such as the price of a car, the weight of a dog, and so on. But to be sure that ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/evaluation-metrics-for-regression-problems-machine-learning/</link>
                <guid isPermaLink="false">66d45f359208fb118cc6cfc3</guid>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ metrics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ #Regression ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Mon, 01 Aug 2022 14:37:27 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/07/regression-metrics-image.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>A regression problem is a common type of supervised learning problem in Machine Learning. The end goal is to predict quantitative values – for example, continuous values such as the price of a car, the weight of a dog, and so on.</p>
<p>But to be sure that your model is doing well in its predictions, you need to evaluate the model.</p>
<p>There are some evaluation metrics that can help you determine whether the model’s predictions are accurate to a certain level of performance.</p>
<p>In this tutorial, you will learn the top evaluation metrics for regression problems, as well as when to use each of them. Without further ado let’s get started.</p>
<h2 id="heading-what-are-residuals">What are Residuals?</h2>
<p>Before we get into the top evaluation metrics, you need to understand what "residual" means when you're evaluating a regression model.</p>
<p>It is rarely possible for a model to predict the exact value of a continuous variable in a regression problem. In practice, a regression model's predictions are usually somewhat lower or higher than the actual values. As a result, you judge the model’s accuracy through its residuals.</p>
<p>Residuals are the difference between the actual and predicted values. You can think of residuals as being a distance. So, the closer the residual is to zero, the better our model performs in making its predictions.</p>
<p>Here's the formula for calculating residuals:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/residuals.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>In the above formula:</p>
<ul>
<li><p><code>ei</code> stands for the residual value.</p>
</li>
<li><p><code>yi</code> stands for the actual value.</p>
</li>
<li><p><code>y^i</code> stands for the predicted value.</p>
</li>
</ul>
<p>So say, for instance, that the actual value in the dataset is 5 and the predicted value is 8. The residual value will be 5 - 8 = -3.</p>
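<p>To make the definition concrete, the sketch below computes the residual for each point in plain Python. The <code>actual</code> and <code>predicted</code> lists are made-up numbers, and the first pair is the 5 and 8 from the example above.</p>

```python
# Hypothetical actual and predicted values.
actual = [5, 10, 12]
predicted = [8, 9, 12]

# residual = actual value minus predicted value, computed pair by pair.
residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
print(residuals)  # [-3, 1, 0]
```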
<h2 id="heading-top-evaluation-metrics-for-regression-problems">Top Evaluation Metrics for Regression Problems</h2>
<p>The top evaluation metrics you need to know for regression problems include:</p>
<h3 id="heading-r2-score">R2 Score</h3>
<p>The R2 score (pronounced R-Squared Score) is a statistical measure that tells us how well our model is making all its predictions, usually on a scale of zero to one (it can drop below zero when a model fits worse than simply predicting the mean).</p>
<p>As mentioned above, it's not ideal for a model to predict the actual values in a regression problem (as opposed to a classification problem that has discrete levels of value).</p>
<p>But we can use the R2 score to determine the accuracy of our model in terms of distance or residual. You can calculate the R2 score using the formula below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/08/image.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-when-to-use-the-r2-score">When to Use the R2 Score</h4>
<p>You can use the R2 score to get the accuracy of your model on a percentage scale, that is 0–100, just like in a classification model.</p>
<p>Let’s go over how to implement the R2 score in Python. So we have a small dataset that contains the actual values and the predictions.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/1_mzvi2wZRSVv5W0pPmod3ag.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To implement the R2 score in Python we'll leverage the Scikit-Learn evaluation metrics library.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> r2_score
score = r2_score(data[<span class="hljs-string">"Actual Value"</span>], data[<span class="hljs-string">"Preds"</span>])
print(<span class="hljs-string">"The accuracy of our model is {}%"</span>.format(round(score, <span class="hljs-number">2</span>) *<span class="hljs-number">100</span>))
</code></pre>
<p>The <code>r2_score</code> function requires two arguments: the actual values and the predicted values, which we have passed to it above. The result from the metric is this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/1_0xW0Hg0DXj5vhFJoAGC_nw-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>So we can say that our model predicted those values with 82% accuracy.</p>
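<p>If you want to see what happens under the hood, the sketch below computes the R2 score by hand using the standard definition: one minus the residual sum of squares divided by the total sum of squares. The numbers are made up for illustration and are not the dataset shown above.</p>

```python
# Hypothetical data, not the dataset from the screenshots.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.5]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: squared distance between actual and predicted.
ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
# Total sum of squares: squared distance between actual values and their mean.
ss_tot = sum((y - mean_actual) ** 2 for y in actual)

r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.9625
```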
<h3 id="heading-mean-absolute-error-mae">Mean Absolute Error (MAE)</h3>
<p>The MAE is simply defined as the sum of all the absolute distances/residuals (the absolute difference between the actual and predicted value) divided by the total number of points in the dataset.</p>
<p>It is the absolute average distance of our model prediction.</p>
<p>You can calculate the MAE using the following formula:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/1_tu6FSDz_FhQbR3UHQIaZNg.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can see that the above formula has two vertical bars, which are the absolute value symbol. Taking the absolute value makes sure that a negative residual (which occurs when the predicted value is greater than the actual value) is converted to a positive one so that it doesn’t cancel out other positive residuals.</p>
<h4 id="heading-when-to-use-mae">When to Use MAE</h4>
<p>If you want to know the model’s average absolute distance when making a prediction, you can use MAE. In other words, you want to know how close the predictions are to the actual values on average.</p>
<p>Just keep in mind that low MAE values indicate that the model is correctly predicting. Larger MAE values indicate that the model is poor at prediction.</p>
<p>Let’s now see how to implement MAE in Python. We will be working with the previous dataset we used to find the r2_score.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/1_mzvi2wZRSVv5W0pPmod3ag.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To implement the MAE in Python we'll leverage the Scikit-Learn evaluation metrics library.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_absolute_error
score = mean_absolute_error(data[<span class="hljs-string">"Actual Value"</span>], data[<span class="hljs-string">"Preds"</span>])
print(<span class="hljs-string">"The Mean Absolute Error of our Model is {}"</span>.format(round(score, <span class="hljs-number">2</span>)))
</code></pre>
<p>MAE also requires two parameters, the actual value and the predicted value.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/1_muu_mmrUYI6YFn2_LnD8Rw.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-root-mean-squared-error-rmse">Root Mean Squared Error (RMSE)</h3>
<p>Another commonly used metric is the root mean squared error, which is the square root of the average squared distance (difference between actual and predicted value).</p>
<p>RMSE is defined as the square root of the sum of the squared distances divided by the total number of points.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/0_2IuTz3Tr_dYNc6Df.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>RMSE functions similarly to MAE (that is, you use it to determine how close the prediction is to the actual value on average), but with a minor difference.</p>
<p>You use the RMSE to detect large errors, which arise when the model overestimates (that is, predicts values significantly higher than the actual values) or underestimates (that is, predicts values significantly lower than the actual values).</p>
<h4 id="heading-when-to-use-rmse">When to Use RMSE</h4>
<p>If you are concerned about large errors, RMSE is a good metric to use. If the model badly overestimated or underestimated some points, RMSE will reveal it, because each residual is squared before averaging, so large residuals contribute disproportionately to the result.</p>
<p>RMSE is a popular evaluation metric for regression problems because it not only calculates how close the prediction is to the actual value on average, but it also indicates the effect of large errors. Large errors will have an impact on the RMSE result.</p>
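<p>A small sketch makes the effect of large errors concrete. It assumes two hypothetical sets of predictions with the same total absolute error: one spreads the error evenly, the other concentrates it in a single point. The helper functions <code>mae</code> and <code>rmse</code> are written by hand here for illustration.</p>

```python
import math


def mae(actual, preds):
    """Mean of the absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, preds)) / len(actual)


def rmse(actual, preds):
    """Square root of the mean of the squared residuals."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, preds)) / len(actual)
    return math.sqrt(mse)


actual = [10, 10, 10, 10]         # hypothetical actual values
even_preds = [12, 8, 12, 8]       # every prediction is off by 2
outlier_preds = [10, 10, 10, 18]  # one prediction is off by 8

print(mae(actual, even_preds), rmse(actual, even_preds))        # 2.0 2.0
print(mae(actual, outlier_preds), rmse(actual, outlier_preds))  # 2.0 4.0
```

<p>Both sets of predictions have the same MAE, but the RMSE doubles when the error is concentrated in one large miss.</p>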
<p>Let’s take a look at how you can implement RMSE in Python.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/1_mzvi2wZRSVv5W0pPmod3ag-2.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Depending on your version, the Scikit-learn evaluation metrics library may have no dedicated RMSE function, but it does include the mean squared error function. The square root of the mean squared error is the RMSE.</p>
<p>To get the RMSE, we can use the NumPy square root function to find the square root of the mean squared error.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> mean_squared_error
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
score = np.sqrt(mean_squared_error(data[<span class="hljs-string">"Actual Value"</span>], data[<span class="hljs-string">"Preds"</span>]))
print(<span class="hljs-string">"The Root Mean Squared Error of our Model is {}"</span>.format(round(score, <span class="hljs-number">2</span>)))
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/1_URsnCspxUYxXV5vlacxcew.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can see that the RMSE value is larger than the MAE. This is a result of some large errors in the dataset.</p>
<h2 id="heading-conclusion-and-learning-more">Conclusion and Learning More</h2>
<p>In this tutorial you’ve learned some of the top evaluation metrics for regression problems that you will use on a daily basis.</p>
<p>Thank you for reading. Here are some helpful resources:</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://scikit-learn.org/stable/modules/model_evaluation.html">https://scikit-learn.org/stable/modules/model_evaluation.html</a></div>
<p><a target="_blank" href="https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d">MAE and RMSE — Which Metric is Better? | by JJ | Human in a Machine World | Medium</a></p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Convert a String to a DateTime Object in Python ]]>
                </title>
                <description>
                    <![CDATA[ When you get dates from raw data, they're typically in the form of string objects. But in this form, you can't access the date's properties like the year, month, and so on. The solution to this problem is to parse (or convert) the string object into ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-convert-a-string-to-a-datetime-object-in-python/</link>
                <guid isPermaLink="false">66d45f37264384a65d5a9542</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Tue, 19 Jul 2022 20:40:28 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/07/pexels-christina-morillo-1181359--3-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When you get dates from raw data, they're typically in the form of string objects. But in this form, you can't access the date's properties like the year, month, and so on.</p>
<p>The solution to this problem is to parse (or convert) the string object into a datetime object so Python can recognize it as a date. Then you can extract any underlying attributes you want from it.</p>
<p>This tutorial will teach you how to convert a string to a datetime object in Python. Without further ado, let's get started.</p>
<h1 id="heading-datetime-format-codes">DateTime Format Codes</h1>
<p>Before we learn how to convert strings to dates, you should understand the formatting codes of datetime objects in Python.</p>
<p>You'll rely on these codes whenever you need to convert a string to a date.</p>
<p>Here are some of the most common:</p>
<ul>
<li><p><code>%Y</code>: represents the year as a four-digit number, ranging from 0001 to 9999.</p>
</li>
<li><p><code>%m</code>: represents the month of the year, ranging from 01 to 12.</p>
</li>
<li><p><code>%d</code>: represents the day of the month, ranging from 01 to 31.</p>
</li>
<li><p><code>%H</code>: represents the hour of the day in 24-hour format, ranging from 00 to 23.</p>
</li>
<li><p><code>%I</code>: represents the hour of the day in 12-hour format, ranging from 01 to 12.</p>
</li>
<li><p><code>%M</code>: represents the minutes in an hour, ranging from 00 to 59.</p>
</li>
<li><p><code>%S</code>: represents the seconds in a minute, ranging from 00 to 59 as well.</p>
</li>
</ul>
<p>We'll stop here for date format codes, but there are many more in the Python documentation. You can click <a target="_blank" href="https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes">here</a> to see more.</p>
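<p>One quick way to get a feel for these codes is to run them in the other direction, formatting a known datetime with <code>strftime</code>. This is a small sketch; the <code>moment</code> value is an arbitrary example.</p>

```python
from datetime import datetime

moment = datetime(2022, 7, 19, 20, 40, 28)  # arbitrary example moment

codes = ["%Y", "%m", "%d", "%H", "%I", "%M", "%S"]

# Render the same moment once per format code.
rendered = {code: moment.strftime(code) for code in codes}

for code, text in rendered.items():
    print(code, text)
```

<p>Note how <code>%H</code> gives <code>20</code> while <code>%I</code> gives <code>08</code> for the same moment, and how <code>%m</code> (month) and <code>%M</code> (minute) produce different values.</p>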
<h1 id="heading-how-to-convert-a-string-to-a-datetime-object">How to Convert a String to a DateTime Object</h1>
<p>The first thing to check whenever you convert a string to a date is that the string is in the format you expect.</p>
<p>For the conversion to work reliably, the string should satisfy the following conditions.</p>
<ul>
<li><p>Firstly, the elements in the string should be clearly delimited, typically by a space, letter, or symbol such as / &amp; , % # - and so on, and the format must reproduce those separators exactly.</p>
</li>
<li><p>The element in the string to be parsed as the year, month, or day must match what its format code expects, and it must not fall outside the range of the format code. For example, the <code>%Y</code> code requires 4 digits to be passed as the year and its range is 0001 to 9999 (so 09, for example, wouldn't work; you need 2009).</p>
</li>
</ul>
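<p>The year condition can be checked with a short sketch. It assumes a made-up string with a two-digit year, which <code>%Y</code> rejects with a <code>ValueError</code>; the separate <code>%y</code> code is the one that accepts two-digit years.</p>

```python
from datetime import datetime

short_year_string = "09/08/09"  # hypothetical string with a two-digit year

# %Y expects four digits, so parsing "09" as the year fails.
try:
    datetime.strptime(short_year_string, "%Y/%m/%d")
    four_digit_year_ok = True
except ValueError:
    four_digit_year_ok = False

# %y is the format code for a two-digit year.
two_digit = datetime.strptime(short_year_string, "%y/%m/%d")

print(four_digit_year_ok)  # False
print(two_digit.year)      # 2009
```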
<p>Let's look at some examples of string-to-date conversions. First, we'll convert the string "2018/08/09" to a date.</p>
<p>We need to import the datetime library in Python to be able to achieve this. We can do that by typing the following:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime

date_string = <span class="hljs-string">"2018/08/09"</span>

format = <span class="hljs-string">"%Y/%m/%d"</span>  <span class="hljs-comment"># specify the format of the date_string</span>

date = datetime.strptime(date_string, format)
print(date)
</code></pre>
<p>Let's go over the above code again to make sure we understand what's going on.</p>
<p>The format variable declares the format of the date string to be passed to the parser (the function that will help us convert the date). We must be aware of the format ahead of time, that is before we pass it to the parser.</p>
<p>In this case, the string is in the format "2018/08/09".</p>
<p>The first element in the string represents a year, for which the format code is <code>%Y</code>. Then we have a forward slash followed by the month, for which the format code is <code>%m</code>. Then we have another forward slash, and finally the day, for which the format code is <code>%d</code>.</p>
<p>As a result, we must include the forward slash symbol in the format variable in the same way that it appears in the string. If everything is done correctly, the format should be <code>"%Y/%m/%d"</code>.</p>
<p>The method <code>datetime.strptime</code> is the parser that converts the <code>date_string</code> we pass into it into a date. It requires two arguments: the date string and the format.</p>
<p>When we print it after that, it will look like this.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-199.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Image by Author</em></p>
<p>We can decide to retrieve any attributes we want from it. For example if we wish to get the year only, we can do that by typing <code>date.year</code> and it will print out just the year.</p>
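<p>As a short sketch of what that looks like, the snippet below parses the same kind of string and reads a few attributes from the resulting object. The variable names are chosen for illustration.</p>

```python
from datetime import datetime

date = datetime.strptime("2018/08/09", "%Y/%m/%d")

# Each part of the date is available as an integer attribute.
year, month, day = date.year, date.month, date.day

print(year)   # 2018
print(month)  # 8
print(day)    # 9

# No time was given in the string, so the time parts default to zero.
print(date.hour, date.minute, date.second)  # 0 0 0
```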
<p>Now that we understand that, let’s go over one more example that is more complex than the above.</p>
<h3 id="heading-example-how-to-convert-a-string-to-a-date">Example – how to convert a string to a date</h3>
<p>We will convert this string object to a date: <code>"2018-11-15T02:18:49Z"</code>.</p>
<p>Now from the looks of it, we can see that this date string has year, month, day, hours, minutes and seconds. So all we need to do is create the proper format and the symbols in it.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime

date_string = <span class="hljs-string">"2018-11-15T02:18:49Z"</span>

format = <span class="hljs-string">"%Y-%m-%dT%H:%M:%SZ"</span>

date = datetime.strptime(date_string, format)
print(date)
</code></pre>
<p>We can see that there is nothing too complex about it. Just follow the format for each part of the date and also pass in any respective symbols or letters you find in the date string.</p>
<p>Do not get distracted by the symbols or letters in the string. If you do everything correctly and print it you should have something like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-200.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Make sure you don't confuse the format code <code>%m</code> with <code>%M</code>. The lowercase <code>%m</code> is used for months while the uppercase <code>%M</code> is used for minutes.</p>
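<p>Mixing up the two codes does not always raise an error, which makes the mistake easy to miss. The sketch below uses a made-up variation of the example string in which both the month and the minute are valid in either position, so the swapped format parses without complaint but gives the wrong date.</p>

```python
from datetime import datetime

date_string = "2018-11-15T02:08:49Z"  # hypothetical variation of the example string

# Correct: %m for the month, %M for the minutes.
correct = datetime.strptime(date_string, "%Y-%m-%dT%H:%M:%SZ")

# Swapped by mistake: 11 is read as the minutes and 08 as the month.
swapped = datetime.strptime(date_string, "%Y-%M-%dT%H:%m:%SZ")

print(correct)  # 2018-11-15 02:08:49
print(swapped)  # 2018-08-15 02:11:49
```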
<h1 id="heading-conclusion-and-learning-more">Conclusion and Learning More</h1>
<p>Now we've gotten to the end of this tutorial. You learned how to convert a string into a datetime object.</p>
<p>Once you learn the format codes, you'll be good to go. Just make sure you adhere to the principles governing which kind of string can be converted.</p>
<p>For instance, remember that the parts of the string should be clearly delimited by something, which can be a space, letter, or symbol. Also, each value in the string must fall within the range of its format code.</p>
<p>Thank you for reading.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Web Scraping with Python – How to Scrape Data from Twitter using Tweepy and Snscrape ]]>
                </title>
                <description>
                    <![CDATA[ If you are a data enthusiast, you'll likely agree that one of the richest sources of real-world data is social media. Sites like Twitter are full of data. You can use the data you can get from social media in a number of ways, like sentiment analysis... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/python-web-scraping-tutorial/</link>
                <guid isPermaLink="false">66d45f3e4a7504b7409c3411</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ web scraping ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Tue, 12 Jul 2022 17:58:29 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/07/scraping-with-python-article-image.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you are a data enthusiast, you'll likely agree that one of the richest sources of real-world data is social media. Sites like Twitter are full of data.</p>
<p>You can use the data you can get from social media in a number of ways, like sentiment analysis (analyzing people's thoughts) on a specific issue or field of interest.</p>
<p>There are several ways you can scrape (or gather) data from Twitter. And in this article, we will look at two of those ways: using Tweepy and Snscrape.</p>
<p>We will learn a method to scrape public conversations from people on a specific trending topic, as well as tweets from a particular user.</p>
<p>Now without further ado, let’s get started.</p>
<h2 id="heading-tweepy-vs-snscrape-introduction-to-our-scraping-tools">Tweepy vs Snscrape – Introduction to Our Scraping Tools</h2>
<p>Now, before we get into the implementation of each tool, let's try to grasp the differences and limits of each one.</p>
<h3 id="heading-tweepy">Tweepy</h3>
<p>Tweepy is a Python library for integrating with the Twitter API. Because Tweepy is connected with the Twitter API, you can perform complex queries in addition to scraping tweets. It enables you to take advantage of all of the Twitter API's capabilities.</p>
<p>But there are some drawbacks – like the fact that its standard API only allows you to collect tweets for up to a week (that is, Tweepy does not allow recovery of tweets beyond a week window, so historical data retrieval is not permitted).</p>
<p>Also, there are limits to how many tweets you can retrieve from a user's account. You can <a target="_blank" href="https://www.tweepy.org/">read more about Tweepy's functionalities here</a>.</p>
<h3 id="heading-snscrape">Snscrape</h3>
<p>Snscrape is another approach for scraping information from Twitter that does not require the use of an API. Snscrape allows you to scrape basic information such as a user's profile, tweet content, source, and so on.</p>
<p>Snscrape is not limited to Twitter, but can also scrape content from other prominent social media networks like Facebook, Instagram, and others.</p>
<p>Its advantages are that there are no limits to the number of tweets you can retrieve or the window of tweets (that is, the date range of tweets). So Snscrape allows you to retrieve old data.</p>
<p>But the one disadvantage is that it lacks all the other functionalities of Tweepy – still, if you only want to scrape tweets, Snscrape would be enough.</p>
<p>Now that we've clarified the distinction between the two methods, let's go over their implementation one by one.</p>
<h2 id="heading-how-to-use-tweepy-to-scrape-tweets">How to Use Tweepy to Scrape Tweets</h2>
<p>Before we begin using Tweepy, we must first make sure that our Twitter credentials are ready. With that, we can connect Tweepy to our API key and begin scraping.</p>
<p>If you do not have Twitter credentials, you can register for a Twitter developer account by going <a target="_blank" href="https://developer.twitter.com/">here</a>. You will be asked some basic questions about how you intend to use the Twitter API. After that, you can begin the implementation.</p>
<p>The first step is to install the Tweepy library on your local machine, which you can do by typing:</p>
<pre><code class="lang-bash">pip install git+https://github.com/tweepy/tweepy.git
</code></pre>
<h3 id="heading-how-to-scrape-tweets-from-a-user-on-twitter">How to Scrape Tweets from a User on Twitter</h3>
<p>Now that we’ve installed the Tweepy library, let’s scrape 100 tweets from a user called <code>john</code> on Twitter. We'll look at the full code implementation that will let us do this and discuss it in detail so we can grasp what’s going on:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tweepy
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

consumer_key = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer key </span>
consumer_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer Secret Key</span>
access_token = <span class="hljs-string">"XXXX"</span>    <span class="hljs-comment">#Your Access token key</span>
access_token_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your Access token Secret key</span>

<span class="hljs-comment">#Pass in our twitter API authentication key</span>
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

<span class="hljs-comment">#Instantiate the tweepy API</span>
api = tweepy.API(auth, wait_on_rate_limit=<span class="hljs-literal">True</span>)


username = <span class="hljs-string">"john"</span>
no_of_tweets =<span class="hljs-number">100</span>


<span class="hljs-keyword">try</span>:
    <span class="hljs-comment">#Retrieve the most recent tweets from the user</span>
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)

    <span class="hljs-comment">#Pulling Some attributes from the tweet</span>
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] <span class="hljs-keyword">for</span> tweet <span class="hljs-keyword">in</span> tweets]

    <span class="hljs-comment">#Creation of column list to rename the columns in the dataframe</span>
    columns = [<span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweet"</span>]

    <span class="hljs-comment">#Creation of Dataframe</span>
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
<span class="hljs-keyword">except</span> BaseException <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">'Status Failed On,'</span>,str(e))
</code></pre>
<p>Now let's go over each part of the code in the above block.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tweepy

consumer_key = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer key </span>
consumer_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer Secret Key</span>
access_token = <span class="hljs-string">"XXXX"</span>    <span class="hljs-comment">#Your Access token key</span>
access_token_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your Access token Secret key</span>

<span class="hljs-comment">#Pass in our twitter API authentication key</span>
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

<span class="hljs-comment">#Instantiate the tweepy API</span>
api = tweepy.API(auth, wait_on_rate_limit=<span class="hljs-literal">True</span>)
</code></pre>
<p>In the above code, we've imported the Tweepy library, then created some variables to store our Twitter credentials (the Tweepy authentication handler requires all four of them). We then pass those variables into the Tweepy authentication handler and save the result in another variable.</p>
<p>The last statement is where we instantiate the Tweepy API and pass in the required parameters.</p>
<pre><code class="lang-python">username = <span class="hljs-string">"john"</span>
no_of_tweets =<span class="hljs-number">100</span>


<span class="hljs-keyword">try</span>:
    <span class="hljs-comment">#Retrieve the most recent tweets from the user</span>
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)

    <span class="hljs-comment">#Pulling Some attributes from the tweet</span>
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] <span class="hljs-keyword">for</span> tweet <span class="hljs-keyword">in</span> tweets]

    <span class="hljs-comment">#Creation of column list to rename the columns in the dataframe</span>
    columns = [<span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweet"</span>]

    <span class="hljs-comment">#Creation of Dataframe</span>
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
<span class="hljs-keyword">except</span> BaseException <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">'Status Failed On,'</span>,str(e))
</code></pre>
<p>In the above code, we stored the name of the user (the @name in Twitter) whose tweets we want to retrieve, along with the number of tweets. We then wrapped the calls in a <code>try</code>/<code>except</code> block so that any error is caught and printed instead of crashing the script.</p>
<p>After that, <code>api.user_timeline()</code> returns a collection of the most recent tweets posted by the user we picked in the <code>screen_name</code> parameter, up to the number of tweets requested in the <code>count</code> parameter.</p>
<p>In the next line of code, we passed in some attributes we want to retrieve from each tweet and saved them into a list. To see more attributes you can retrieve from a tweet, read <a target="_blank" href="https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline">this</a>.</p>
<p>In the last chunk of code, we created a pandas dataframe (this requires <code>import pandas as pd</code>) and passed in the list we created along with the column names.</p>
<p>Note that the column names must be in the same order as the attributes in the attributes container (that is, the order in which you listed those attributes when retrieving them from each tweet).</p>
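<p>The reason the order matters is that labels are matched to values purely by position. The sketch below illustrates this with plain Python instead of pandas: it pairs labels and values with <code>zip</code>, using a hypothetical stand-in object with made-up values rather than a real tweet from the API.</p>

```python
from types import SimpleNamespace

# Hypothetical stand-in for a tweet object; the values are made up.
tweet = SimpleNamespace(
    created_at="2022-07-01",
    favorite_count=12,
    source="Twitter Web App",
    text="Hello world",
)

row = [tweet.created_at, tweet.favorite_count, tweet.source, tweet.text]

# Labels listed in the same order as the attributes in the row.
columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
labelled = dict(zip(columns, row))

# Labels listed in a different order end up attached to the wrong values.
wrong_columns = ["Tweet", "Date Created", "Number of Likes", "Source of Tweet"]
mislabelled = dict(zip(wrong_columns, row))

print(labelled["Tweet"])     # Hello world
print(mislabelled["Tweet"])  # 2022-07-01
```

<p>Nothing fails in the second case; the data is simply mislabelled, which is why it is worth double-checking the order.</p>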
<p>If you correctly followed the steps I described, you should have something like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-17.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Image by Author</em></p>
<p>Now that we are done, let's go over one more example before we move into the Snscrape implementation.</p>
<h3 id="heading-how-to-scrape-tweets-from-a-text-search">How to Scrape Tweets from a Text Search</h3>
<p>In this method, we will be retrieving tweets based on a search. You can do that like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> tweepy
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

consumer_key = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer key </span>
consumer_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your API/Consumer Secret Key</span>
access_token = <span class="hljs-string">"XXXX"</span>    <span class="hljs-comment">#Your Access token key</span>
access_token_secret = <span class="hljs-string">"XXXX"</span> <span class="hljs-comment">#Your Access token Secret key</span>

<span class="hljs-comment">#Pass in our twitter API authentication key</span>
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

<span class="hljs-comment">#Instantiate the tweepy API</span>
api = tweepy.API(auth, wait_on_rate_limit=<span class="hljs-literal">True</span>)


search_query = <span class="hljs-string">"sex for grades"</span>
no_of_tweets =<span class="hljs-number">150</span>


<span class="hljs-keyword">try</span>:
    <span class="hljs-comment">#Retrieve tweets that match the search query</span>
    tweets = api.search_tweets(q=search_query, count=no_of_tweets)

    <span class="hljs-comment">#Pulling Some attributes from the tweet</span>
    attributes_container = [[tweet.user.name, tweet.created_at, tweet.favorite_count, tweet.source,  tweet.text] <span class="hljs-keyword">for</span> tweet <span class="hljs-keyword">in</span> tweets]

    <span class="hljs-comment">#Creation of column list to rename the columns in the dataframe</span>
    columns = [<span class="hljs-string">"User"</span>, <span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweet"</span>]

    <span class="hljs-comment">#Creation of Dataframe</span>
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
<span class="hljs-keyword">except</span> BaseException <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">'Status Failed On,'</span>,str(e))
</code></pre>
<p>The above code is similar to the previous code, except that we changed the API method from <code>api.user_timeline()</code> to <code>api.search_tweets()</code>. We've also added <code>tweet.user.name</code> to the attributes container list.</p>
<p>In the code above, you can see that we chained two attributes. This is because if we only pass in <code>tweet.user</code>, it returns the whole user object. So we must also name the attribute we want to retrieve from the user object, which is <code>name</code>.</p>
<p>You can go <a target="_blank" href="https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/user">here</a> to see a list of additional attributes that you can retrieve from a user object. Now you should see something like this once you run it:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-18.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Image by Author.</em></p>
<p>Alright, that just about wraps up the Tweepy implementation. Just remember that there is a limit to the number of tweets you can retrieve, and you cannot retrieve tweets more than 7 days old using Tweepy.</p>
<h2 id="heading-how-to-use-snscrape-to-scrape-tweets">How to Use Snscrape to Scrape Tweets</h2>
<p>As I mentioned previously, Snscrape does not require Twitter credentials (API key) to access it. There is also no limit to the number of tweets you can fetch.</p>
<p>For this example, though, we'll just retrieve the same tweets as in the previous example, but using Snscrape instead.</p>
<p>To use Snscrape, we must first install its library on our PC. You can do that by typing:</p>
<pre><code class="lang-bash">pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git
</code></pre>
<h3 id="heading-how-to-scrape-tweets-from-a-user-with-snscrape">How to Scrape Tweets from a User with Snscrape</h3>
<p>Snscrape includes two methods for getting tweets from Twitter: the command line interface (CLI) and a Python Wrapper. Just keep in mind that the Python Wrapper is currently undocumented – but we can still get by with trial and error.</p>
<p>In this example, we will use the Python Wrapper because it is more intuitive than the CLI method. But if you get stuck with some code, you can always turn to the GitHub community for assistance. The contributors will be happy to help you.</p>
<p>To retrieve tweets from a particular user, we can do the following:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> snscrape.modules.twitter <span class="hljs-keyword">as</span> sntwitter
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Created a list to append all tweet attributes(data)</span>
attributes_container = []

<span class="hljs-comment"># Using TwitterSearchScraper to scrape data and append tweets to list</span>
<span class="hljs-keyword">for</span> i,tweet <span class="hljs-keyword">in</span> enumerate(sntwitter.TwitterSearchScraper(<span class="hljs-string">'from:john'</span>).get_items()):
    <span class="hljs-keyword">if</span> i &gt;= <span class="hljs-number">100</span>:
        <span class="hljs-keyword">break</span>
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

<span class="hljs-comment"># Creating a dataframe from the tweets list above </span>
tweets_df = pd.DataFrame(attributes_container, columns=[<span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweets"</span>])
</code></pre>
<p>Let's go over some of the code that you might not understand at first glance:</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> i,tweet <span class="hljs-keyword">in</span> enumerate(sntwitter.TwitterSearchScraper(<span class="hljs-string">'from:john'</span>).get_items()):
    <span class="hljs-keyword">if</span> i &gt;= <span class="hljs-number">100</span>:
        <span class="hljs-keyword">break</span>
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])


<span class="hljs-comment"># Creating a dataframe from the tweets list above </span>
tweets_df = pd.DataFrame(attributes_container, columns=[<span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweets"</span>])
</code></pre>
<p>In the above code, <code>sntwitter.TwitterSearchScraper</code> returns an object that yields tweets from the user whose name we passed into it (which is john).</p>
<p>As I mentioned earlier, Snscrape does not limit the number of tweets, so it will return as many tweets as that user has. To cap this, we wrap the call in the <code>enumerate</code> function, which iterates through the object and adds a counter, so we can stop after the most recent 100 tweets from the user.</p>
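<p>The counting pattern can be tried without touching Twitter at all. The sketch below uses a hypothetical generator, <code>fake_tweets</code>, as a stand-in for an unbounded scraper, and shows two ways to take exactly the first 100 items: the <code>enumerate</code> counter and <code>itertools.islice</code> from the standard library.</p>

```python
from itertools import islice


def fake_tweets():
    """Hypothetical stand-in for a scraper that never runs out of items."""
    n = 0
    while True:
        yield f"tweet {n}"
        n += 1


limit = 100

# Option 1: enumerate counts from 0, so stop as soon as the counter hits the limit.
collected = []
for i, tweet in enumerate(fake_tweets()):
    if i >= limit:
        break
    collected.append(tweet)

# Option 2: islice takes the first `limit` items and then stops on its own.
first_hundred = list(islice(fake_tweets(), limit))

print(len(collected), len(first_hundred))  # 100 100
```

<p>Because the counter starts at 0, a check that only breaks once the counter is greater than the limit would collect one extra item.</p>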
<p>You can see that the attributes syntax we get from each tweet looks like the one from Tweepy. These are the list of attributes that we can get from the Snscrape tweet which was curated by Martin Beck.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/Sns.Scrape.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Credit: Martin Beck</em></p>
<p>More attributes might be added, as the Snscrape library is still in development. For instance, the <code>source</code> attribute shown in the above image has been replaced with <code>sourceLabel</code>. If you pass in only <code>source</code>, it will return an object.</p>
<p>If you run the above code, you should see something like this as well:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-19.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Image by Author</em></p>
<p>Now let's do the same for scraping by search.</p>
<h3 id="heading-how-to-scrape-tweets-from-a-text-search-with-snscrape">How to Scrape Tweets from a Text Search with Snscrape</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> snscrape.modules.twitter <span class="hljs-keyword">as</span> sntwitter
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Creating list to append tweet data to</span>
attributes_container = []

<span class="hljs-comment"># Using TwitterSearchScraper to scrape data and append tweets to list</span>
<span class="hljs-keyword">for</span> i,tweet <span class="hljs-keyword">in</span> enumerate(sntwitter.TwitterSearchScraper(<span class="hljs-string">'sex for grades since:2021-07-05 until:2022-07-06'</span>).get_items()):
    <span class="hljs-keyword">if</span> i&gt;<span class="hljs-number">150</span>:
        <span class="hljs-keyword">break</span>
    attributes_container.append([tweet.user.username, tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

<span class="hljs-comment"># Creating a dataframe to load the list</span>
tweets_df = pd.DataFrame(attributes_container, columns=[<span class="hljs-string">"User"</span>, <span class="hljs-string">"Date Created"</span>, <span class="hljs-string">"Number of Likes"</span>, <span class="hljs-string">"Source of Tweet"</span>, <span class="hljs-string">"Tweet"</span>])
</code></pre>
<p>Again, you can access a lot of historical data using Snscrape. Tweepy, by contrast, is limited by the Twitter API: the standard search API only goes back 7 days, and the premium API 30 days. So we can pass the date from which we want to start the search and the date we want it to end into the <code>sntwitter.TwitterSearchScraper()</code> method.</p>
<p>What we've done in the preceding code is basically what we discussed before. The only thing to bear in mind is that <code>until</code> works similarly to the <code>range</code> function in Python (that is, it excludes the end value). So if you want to get tweets from today, you need to pass the day after today to <code>until</code>.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-21.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Image by Author.</em></p>
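<p>Because <code>until</code> excludes its end date, it can help to build the search string from the last day you actually want. The sketch below assumes the <code>since:</code>/<code>until:</code> query format used above. <code>build_query</code> is a hypothetical helper, not part of Snscrape.</p>

```python
from datetime import date, timedelta


def build_query(text, since, last_day):
    """Build a search string whose results include tweets from last_day.

    build_query is a hypothetical helper, not part of Snscrape.
    """
    until = last_day + timedelta(days=1)  # until is exclusive, so add a day
    return f"{text} since:{since.isoformat()} until:{until.isoformat()}"


query = build_query("python", date(2022, 7, 1), date(2022, 7, 5))
print(query)  # python since:2022-07-01 until:2022-07-06
```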
<p>Now you know how to scrape tweets with Snscrape, too!</p>
<h3 id="heading-when-to-use-each-approach">When to use each approach</h3>
<p>Now that we've seen how each method works, you might be wondering when to use which.</p>
<p>Well, there is no universal rule for when to utilize each method. Everything comes down to a matter of preference and your use case.</p>
<p>If you want to collect a large number of tweets without a limit on the count, you should use Snscrape. But if you want extra features that Snscrape cannot provide (like geolocation, for example), then you should use Tweepy. It is directly integrated with the Twitter API and provides complete functionality.</p>
<p>Even so, Snscrape is the most commonly used method for basic scraping.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In this article, we learned how to scrape data from Twitter in Python using Tweepy and Snscrape. But this was only a brief overview of how each approach works. You can learn more by exploring the web for additional information.</p>
<p>I've included some useful resources that you can use if you need additional information. Thank you for reading.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://github.com/JustAnotherArchivist/snscrape">https://github.com/JustAnotherArchivist/snscrape</a></div>
<p> </p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://docs.tweepy.org/en/stable/index.html">https://docs.tweepy.org/en/stable/index.html</a></div>
<p> </p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af">https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Handle Missing Data in a Dataset ]]>
                </title>
                <description>
                    <![CDATA[ Missing values are common when working with real-world datasets – not the cleaned ones available on Kaggle, for example. Missing data could result from a human factor (for example, a person deliberately failing to respond to a survey question), a pro... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-handle-missing-data-in-a-dataset/</link>
                <guid isPermaLink="false">66d45f3b3a8352b6c5a2aa77</guid>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Fri, 24 Jun 2022 21:14:52 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/06/randy-fath-G1yhU1Ej-9A-unsplash--1-.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Missing values are common when working with real-world datasets – not the cleaned ones available on Kaggle, for example.</p>
<p>Missing data could result from a human factor (for example, a person deliberately failing to respond to a survey question), a problem in electrical sensors, or other factors. And when this happens, you can lose significant information.</p>
<p>Now, there is no perfect way to handle missing values that will give you an accurate result as to what the missing value is. But there are several techniques that you can leverage that will give you decent performance.</p>
<p>In this article, we will look at how to handle missing data in the right way (the right way meaning selecting the appropriate technique for whatever scenario our data set might represent).</p>
<p>Remember that none of these methods are perfect – they still introduce some biases, such as favoring one class over another – but they are useful.</p>
<p>Before we begin, I'd like to start with a quote from George Box to back up the preceding statement:</p>
<blockquote>
<p>All models are approximations: Essentially all models are wrong but some are useful.</p>
</blockquote>
<p>Now without further ado let’s get started.</p>
<h2 id="heading-what-types-of-missing-data-are-there">What Types of Missing Data Are There?</h2>
<p>You may be wondering if missing values have types. Yes, they do – and in the real world, these missing values can be divided into three categories.</p>
<p>Understanding these categories will give you some insight into how to approach the missing value(s) in your dataset.</p>
<p>Among the categories are:</p>
<ul>
<li><p>Missing Completely at Random (MCAR).</p>
</li>
<li><p>Missing at Random (MAR).</p>
</li>
<li><p>Not Missing at Random (NMAR).</p>
</li>
</ul>
<h3 id="heading-missing-data-thats-missing-completely-at-random-mcar">Missing Data that's Missing Completely at Random (MCAR)</h3>
<p>These are data that are missing completely at random. That is, the missingness is independent from the data. There is no discernible pattern to this type of data missingness.</p>
<p>This means that you cannot predict whether the value was missing due to specific circumstances or not. They are just completely missing at random.</p>
<h3 id="heading-missing-data-thats-missing-at-random-mar">Missing Data that's Missing at Random (MAR)</h3>
<p>These data are missing at random, but not completely at random. Whether a value is missing depends on other data you have observed, not on the missing value itself.</p>
<p>Consider for instance that you built a smart watch that can track people's heart rates every hour. Then you distributed the watch to a group of individuals to wear so you can collect data for analysis.</p>
<p>After collecting the data, you discovered that some data were missing because some people were reluctant to wear the wristwatch at night. Here the missingness depends on the time of day, which you did observe, so we can conclude that it is explained by the observed data.</p>
<h3 id="heading-missing-data-thats-not-missing-at-random-nmar">Missing Data that's Not Missing at Random (NMAR)</h3>
<p>These are data that are not missing at random, and this kind of missingness is also known as non-ignorable. In other words, the missingness is determined by the variable of interest itself.</p>
<p>A common example is a survey in which students are asked how many cars they own. In this case, some students may purposefully leave the question unanswered because of what their answer would be, resulting in missing values.</p>
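<p>To make the three categories concrete, here is a small simulation sketch. It assumes a made-up smartwatch log (the column names, the 20% loss rate, the night hours, and the 85 bpm cut-off are all illustrative choices) and hides values under each mechanism.</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical smartwatch log: 10 days of hourly heart rates.
hours = np.tile(np.arange(24), 10)
df = pd.DataFrame({"hour": hours,
                   "heart_rate": rng.normal(70, 10, size=hours.size)})

# MCAR: every reading has the same 20% chance of being lost.
mcar_mask = rng.random(len(df)) < 0.2
# MAR: readings are lost at night, which depends on the observed "hour".
mar_mask = (df["hour"] >= 22) | (df["hour"] < 6)
# NMAR: readings are lost when the heart rate itself is high.
nmar_mask = df["heart_rate"] > 85

df["hr_mcar"] = df["heart_rate"].mask(mcar_mask)
df["hr_mar"] = df["heart_rate"].mask(mar_mask)
df["hr_nmar"] = df["heart_rate"].mask(nmar_mask)
print(df.isna().mean())  # share of missing values per column
```

<p>Only in the MAR column can the gaps be explained by another column you still have (<code>hour</code>). In the NMAR column, the explanation is hidden inside the values that went missing.</p>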
<h2 id="heading-how-should-you-handle-missing-data">How Should You Handle Missing Data?</h2>
<p>As mentioned earlier, none of these techniques can determine the missing value precisely, and each one introduces some bias.</p>
<p>Handling missing values falls generally into two categories. We will look at the most common in each category. The two categories are as follows:</p>
<ul>
<li><p>Deletion</p>
</li>
<li><p>Imputation</p>
</li>
</ul>
<h2 id="heading-how-to-handle-missing-data-with-deletion"><strong>How to Handle Missing Data with Deletion</strong></h2>
<p>One of the most prevalent methods for dealing with missing data is deletion. And one of the most commonly used methods in the deletion approach is list-wise deletion.</p>
<h3 id="heading-what-is-list-wise-deletion">What is List-Wise Deletion?</h3>
<p>In the list-wise deletion method, you remove a record or observation in the dataset if it contains some missing values.</p>
<p>You can perform list-wise deletion on any of the aforementioned missing value categories, but one of its disadvantages is potential information loss.</p>
<p>The general rule of thumb is to perform list-wise deletion when the number of observations with missing values exceeds the number of observations without missing values. This is because the dataset does not contain enough information to fill in the missing values, so it is better to drop those values or discard the dataset entirely.</p>
<p>You can implement list-wise deletion in Python by simply using the Pandas <code>.dropna</code> method like this:</p>
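<pre><code class="lang-javascript"><span class="hljs-comment"># axis=0 drops rows (records) with missing values; axis=1 would drop columns</span>
df.dropna(axis=<span class="hljs-number">0</span>, inplace=True)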
</code></pre>
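<p>Here is a small sketch of how the <code>axis</code> argument changes what <code>.dropna</code> removes. The toy DataFrame and its column names are made up for illustration.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with two incomplete records.
df = pd.DataFrame({
    "Distance": [2.5, np.nan, 4.1, 3.3],
    "Rooms": [2, 3, np.nan, 4],
    "Price": [100, 150, 200, 250],
})

rows_kept = df.dropna(axis=0)  # list-wise deletion: drop incomplete rows
cols_kept = df.dropna(axis=1)  # drops incomplete columns instead

print(rows_kept.shape)  # (2, 3)
print(cols_kept.shape)  # (4, 1)
```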
<h2 id="heading-how-to-handle-missing-data-with-imputation"><strong>How to Handle Missing Data with Imputation?</strong></h2>
<p>Another frequent general method for dealing with missing data is to fill in the missing value with a substituted value.</p>
<p>This methodology encompasses various methods, but we will focus on the most prevalent ones here.</p>
<h3 id="heading-prior-knowledge-of-an-ideal-number">Prior knowledge of an ideal number</h3>
<p>This method entails replacing the missing value with a specific value. To use it, you need to have domain knowledge of the dataset. You can use it to fill in MAR and MCAR values.</p>
<p>To implement it in Python, you use the <code>.fillna</code> method in Pandas like this:</p>
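<pre><code class="lang-javascript"><span class="hljs-comment"># 0 is only a placeholder; use the value your domain knowledge suggests</span>
df.fillna(value=<span class="hljs-number">0</span>, inplace=True)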
</code></pre>
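<p>When different columns call for different fill values, <code>.fillna</code> also accepts a dictionary. This is a minimal sketch. The columns and the chosen values are assumptions standing in for real domain knowledge.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: a missing "Parking" count is known to mean zero spaces.
df = pd.DataFrame({"Parking": [1, np.nan, 2], "Rooms": [3, 2, np.nan]})

# A dict fills each column with its own domain-specific value.
filled = df.fillna(value={"Parking": 0, "Rooms": 1})
print(filled)
```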
<h3 id="heading-regression-imputation">Regression imputation</h3>
<p>The regression imputation method involves creating a model that predicts a variable from one or more other variables, trained on the rows where that variable is observed. Then you use the model to fill in the missing values of that variable.</p>
<p>This technique is used for the MAR and MCAR categories when the features in the dataset depend on one another. A linear regression model is one example of a model you could use.</p>
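<p>As a rough sketch of the idea, the snippet below fits Scikit-learn's <code>LinearRegression</code> on the complete rows and predicts the missing one. The data is made up so that <code>Price</code> is exactly <code>2 * Distance + 1</code>. Real data will not be this clean.</p>

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data where Price depends on Distance (Price = 2 * Distance + 1).
df = pd.DataFrame({"Distance": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "Price": [3.0, 5.0, np.nan, 9.0, 11.0]})

known = df[df["Price"].notna()]    # rows used to train the model
missing = df[df["Price"].isna()]   # rows whose Price we will predict

model = LinearRegression().fit(known[["Distance"]], known["Price"])
df.loc[missing.index, "Price"] = model.predict(missing[["Distance"]])
print(df)
```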
<h3 id="heading-simple-imputation">Simple Imputation</h3>
<p>This method involves utilizing a numerical summary of the variable where the missing value occurred (that is using the feature or variable's central tendency summary, such as mean, median, and mode).</p>
<p>When you use this strategy to fill in the missing values, you need to evaluate the variable's distribution to determine which central tendency summary to apply.</p>
<p>You use this method in the MCAR category. And you implement it in Python using the <code>SimpleImputer</code> transformer in the Scikit-learn library.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer

<span class="hljs-comment"># Specify the strategy to be the median</span>
fea_transformer = SimpleImputer(strategy=<span class="hljs-string">"median"</span>)
values = fea_transformer.fit_transform(df[[<span class="hljs-string">"Distance"</span>]])
pd.DataFrame(values)
</code></pre>
<p><img src="https://lh5.googleusercontent.com/LA7UBjLpFXy3YUZ7RG1k9eyk8Y9jQTHEP3v1RarmMD1sHmmry0NDcZAUj1Yf7PxByjmalnxug-TvamDss85jmFiwQ8bSQ5IPPlpgVBNg2XkSUFCoyF_vKkPVUpQBT1_dva29EKKLxdyuE9IomA" alt="Image" width="1230" height="473" loading="lazy"></p>
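<p>To see why the variable's distribution matters when choosing the summary, here is a small sketch on a made-up, right-skewed column. The numbers are purely illustrative.</p>

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical right-skewed column: one very large value and one gap.
distance = np.array([[2.0], [3.0], [4.0], [np.nan], [91.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(distance)
median_filled = SimpleImputer(strategy="median").fit_transform(distance)

print(mean_filled[3, 0])    # 25.0, pulled up by the extreme value
print(median_filled[3, 0])  # 3.5, closer to the typical values
```

<p>With skewed data like this, the median is usually the safer choice. With roughly symmetric data, the mean and median will be close to each other.</p>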
<h3 id="heading-knn-imputation">KNN Imputation</h3>
<p>KNN imputation is a more refined alternative to the Simple Imputation method. It works by replacing a missing value with the mean of that feature in the nearest neighboring observations.</p>
<p>You can use KNN imputation for the MCAR or MAR categories. And to implement it in Python you use the <code>KNNImputer</code> transformer in Scikit-learn, as seen below:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> KNNImputer

<span class="hljs-comment"># Use the 3 nearest neighbors</span>
fea_transformer = KNNImputer(n_neighbors=<span class="hljs-number">3</span>)
values = fea_transformer.fit_transform(df[[<span class="hljs-string">"Distance"</span>]])
pd.DataFrame(values)
</code></pre>
<p><img src="https://lh5.googleusercontent.com/EcAOhM2hrcL1nNyLTbry-76ADhEJ8aJliuae4SEaRzNxzN031BgBNT03iMv4PjoqkaTU2TmCwMuIY_M0ZGbvZCzKvQ-8PO_1h03LjjdFMZj_ZuW9zhNwq1TKQD3WZHKcry2_MpPD6ul-ykYpFg" alt="Image" width="1213" height="459" loading="lazy"></p>
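<p>Neighbors are found by comparing the features that are present, so KNN imputation is most informative when you pass more than one column. The sketch below uses a made-up two-column DataFrame in which the gap can be filled from the rows with the closest <code>Distance</code>.</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: the third row is missing its Price.
df = pd.DataFrame({"Distance": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "Price": [10.0, 20.0, np.nan, 40.0, 50.0]})

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# The two nearest rows by Distance are 2.0 and 4.0, so Price = (20 + 40) / 2.
print(filled.loc[2, "Price"])  # 30.0
```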
<h3 id="heading-how-to-use-learning-algorithms">How to Use Learning Algorithms</h3>
<p>The final strategy we'll mention in this post is using machine learning algorithms to handle missing data.</p>
<p>Some learning algorithms can be fit directly on a dataset that contains missing values. The algorithm then searches for patterns in the dataset and uses them to handle the missing values. Such algorithms include XGBoost, Gradient Boosting, and others. But further discussion is out of the scope of this article.</p>
<h2 id="heading-conclusion-and-learning-more">Conclusion and Learning More</h2>
<p>In this article, we've covered some of the most prevalent techniques you'd use on a daily basis to handle missing data.</p>
<p>But the learning does not end here. There are several other techniques available to assist us in filling our dataset, but the key is to grasp the underlying mechanisms in those techniques so that we can manage missing values properly. Thanks for reading.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Statistics for Beginners – Top Stats Concepts to Know Before Getting into Data Science ]]>
                </title>
                <description>
                    <![CDATA[ You've probably heard that statistics is the gateway to data science and that the data science map starts with stats. Perhaps you've also heard from others that you have to learn statistics before learning data science. But then you ponder, "Since I'... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/top-statistics-concepts-to-know-before-getting-into-data-science/</link>
                <guid isPermaLink="false">66d45f4247a8245f78752a62</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ statistics ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Fri, 10 Jun 2022 16:33:29 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/06/who-s-denilo-3ECPkzvwlBs-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You've probably heard that statistics is the gateway to data science and that the data science map starts with stats.</p>
<p>Perhaps you've also heard from others that you have to learn statistics before learning data science. But then you ponder, "Since I'm not from a technical background like science, technology, engineering, or math (STEM), do I need to learn everything in statistics before getting into data science?" And those same people will tell you "Yes! You have to learn statistics."</p>
<p>Well, here's my answer: you don't need to learn all of statistics before beginning data science (though you do need to learn some fundamentals).</p>
<p>You can also learn as you go instead of wasting time learning statistics first before data science (that is, as you advance in your knowledge of data science, you can always learn more statistics concepts).</p>
<p>That being said, it is helpful to know statistics basics before jumping into data science. You can indeed say that stats is the gateway to data science because it will help you to have some intuition about your data and how to work with it.</p>
<p>In this article, we'll look at the top statistical concepts you need to know before diving into data science. I'll make it as simple as possible even if you don't come from a technical background. I can tell you're excited and ready to dive into the realm of data science. Let's get started.</p>
<h1 id="heading-what-is-statistics">What is Statistics?</h1>
<p>According to economist and sampling technique pioneer Arthur Lyon Bowley, Statistics is:</p>
<blockquote>
<p>"numerical statements of facts in any department of inquiry placed in relation to each other."</p>
</blockquote>
<p>That basically means that statistics helps us comprehend our data and also helps us convey the results in that data to others.</p>
<p>Statistical methods (that is, the techniques employed in dealing with data in statistics) are classified into two types:</p>
<ol>
<li><p>Descriptive Statistics</p>
</li>
<li><p>Inferential Statistics</p>
</li>
</ol>
<p><strong>Descriptive Statistics</strong> is a discipline of statistics that assists us in summarizing data through numerical values or graphical visualization.</p>
<p>Descriptive statistics helps us identify and understand some key properties in our data. It includes concepts such as central tendency, dispersion, boxplots, histograms, and so on, which we'll discuss later in the article.</p>
<p><strong>Inferential Statistics,</strong> on the other hand, is a branch of statistics that helps us make decisions or predictions based on the data that we have gathered.</p>
<p>Inferential statistics is a significantly more advanced topic because it requires a deep understanding of descriptive statistics. It includes concepts such as hypothesis, probability, and so forth.</p>
<h1 id="heading-top-statistical-concepts-to-know-before-learning-data-science">Top Statistical Concepts to Know Before Learning Data Science</h1>
<p>Since you're now familiar with the definition of statistics, let's have a look at some of the statistics concepts that'll help guide you when you dive into the realm of data science.</p>
<p>Among the most fundamental concepts are:</p>
<h2 id="heading-what-is-a-subject">What is a Subject?</h2>
<p>This is the specific thing we wish to observe. It could be a person, an animal, or something else. It is also known as an observation.</p>
<h2 id="heading-what-is-a-population">What is a Population?</h2>
<p>Population refers to the entire set of subjects in which we are interested (that is, that we want to observe). Assume you wish to count the number of females in a specific country: the population is then everyone living in that country.</p>
<h2 id="heading-what-is-a-sample">What is a Sample?</h2>
<p>In reality, observing an entire population is rarely practical (because it can be very expensive and time-consuming).</p>
<p>Consider the following scenario: you wish to observe every female in the world. This type of observation can be costly to carry out. However, in statistics, we have something called a sample, which is a portion/subset of the population that you want to study. We can then make a decision about the full population using the sample (this is inferential statistics).</p>
<h2 id="heading-what-are-parameters">What are Parameters?</h2>
<p>This is a property/summary of a population. Consider the following scenario: you are observing the entire country and you discover that 90% of the inhabitants are males while 10% are females. The numerical values, 90%, and 10% are a numerical summary (that is, descriptive statistics) of the entire population. As a result, the summary is known as the population parameter.</p>
<h2 id="heading-what-is-a-statistic">What is a Statistic?</h2>
<p>On the other hand, a statistic (not to be confused with statistics, the field of study) is a property of a sample. As stated in the preceding example, instead of working with the full population, we usually work with samples, so a numerical summary of a sample is referred to as a statistic.</p>
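<p>The difference between a parameter and a statistic can be sketched in a few lines of Python. The population below is made up to match the 90%/10% example, and the sample size of 200 is an arbitrary choice.</p>

```python
import random

random.seed(42)  # fixed seed so the sample is repeatable

# Hypothetical population of 10,000 inhabitants: 90% male, 10% female.
population = ["M"] * 9000 + ["F"] * 1000

parameter = population.count("F") / len(population)  # describes the population
sample = random.sample(population, 200)              # a subset we can afford
statistic = sample.count("F") / len(sample)          # describes the sample

print(parameter)  # 0.1
print(statistic)  # close to 0.1, but usually not identical
```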
<p>Hopefully you now have a decent understanding of what population, sample, statistic, and parameters are. Let's take a look at another concept with which we are all too familiar: <strong>"Data"</strong>.</p>
<p><strong>Data</strong>, as the term implies, represents factual information. That is, it conveys a message to us. It can, however, be divided into two categories:</p>
<ol>
<li><p>Quantitative data.</p>
</li>
<li><p>Qualitative data.</p>
</li>
</ol>
<h2 id="heading-what-is-quantitative-data">What is Quantitative Data?</h2>
<p>This is also known as numerical data. It is data whose values are numbers that can be counted or measured. Quantitative data can be further classified into two types:</p>
<p><strong>Quantitative discrete data:</strong> These are numerical data that can be counted but cannot be measured. Counting the number of shoes in a shoe store is a common example.</p>
<p><strong>Quantitative continuous data:</strong> This is a type of numerical data that is based on measurement. For example, measuring the weight of a glass cylinder is continuous, not discrete.</p>
<h2 id="heading-what-is-qualitative-data">What is Qualitative Data?</h2>
<p>These are sorts of data that represent categories or groups of data. They are also known as categorical data. They are usually written in text. They can be characteristics, names, or anything else.</p>
<p>Common examples are people's names, dog breeds, and so on. However, there are some data that appear to be numerical but are actually encoded categorical data.</p>
<p>For example, suppose you wanted to group a certain group of people based on their age and discovered that the lowest and highest ages are 10 and 60, respectively. You then divided the ages into 5 categories (10-20, 21-30, 31-40, 41-50, 51-60) and assigned numerical values to each of those categories where 1 represents 10-20, 2 represents 21-30, and so on.</p>
<p>In this situation, the numerical values will be handled as categorical data rather than quantitative data. As your data science career progresses, you will learn how to work with categorical data.</p>
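<p>The age-group encoding described above can be sketched with the Pandas <code>cut</code> function. The ages are made up, and the lowest bin edge is set to 9 so that an age of 10 falls into the first group.</p>

```python
import pandas as pd

ages = pd.Series([10, 18, 25, 33, 47, 60])  # hypothetical ages

# Bin edges are right-inclusive: (9, 20], (20, 30], (30, 40], (40, 50], (50, 60]
groups = pd.cut(ages, bins=[9, 20, 30, 40, 50, 60], labels=[1, 2, 3, 4, 5])
print(groups.tolist())  # [1, 1, 2, 3, 4, 5]
```

<p>The result has the <code>category</code> dtype, so Pandas treats 1 to 5 as labels rather than as quantities.</p>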
<p>Now you know the categories of data. In statistics, both quantitative and qualitative data are described using levels of measurement. There are 4 levels of measurement:</p>
<ol>
<li><p>Nominal scale data</p>
</li>
<li><p>Ordinal Scale data</p>
</li>
<li><p>Interval Scale data</p>
</li>
<li><p>Ratio Scale data</p>
</li>
</ol>
<p>Qualitative data can be measured using:</p>
<p><strong>Nominal scale data:</strong> These are the type of categorical data that do not have an ordered sense. That is, they cannot be ordered.</p>
<p>Each piece of data represents a single unit. An example of such categorical data is color. It makes no sense to rank blue over yellow. When working with nominal data, each data point must be handled as a separate unit.</p>
<p><strong>Ordinal Scale data:</strong> Ordinal Scale data consists of ordered categorical data. When data is ranked, there is a sense of order in it. A survey response such as excellent, good, satisfactory, and unsatisfactory is an example of this. It makes sense to rank excellent above good.</p>
<p>Quantitative data can be measured using:</p>
<p><strong>Interval Scale Data:</strong> These are numerical data that can be ordered and whose differences are meaningful. The readings on a temperature scale are an example of interval data.</p>
<p>For example, you can measure the difference between 4 and 10 degrees Celsius, and 10 degrees is higher than 4 degrees. However, interval scale data has two limitations:</p>
<ol>
<li><p>It does not have a true zero point (that is, zero does not mean "no temperature", and you can have a temperature value below zero)</p>
</li>
<li><p>You can't meaningfully compute ratios: for example, it makes no sense to claim that 80 degrees Celsius is 4 times as hot as 20 degrees Celsius.</p>
</li>
</ol>
<p><strong>Ratio Scale data</strong>: These are numerical data that have the features of interval scale data (that is, they can be ordered and measured), but without its limitations (they have a true zero point, and you can find the ratio between values).</p>
<p>A grade score of 20, 68, 90, or 80 is an example. We can order it, measure it, and find the ratio between the values. It makes sense to say the score of 80 is 4 times the score of 20.</p>
<p>Now that we've covered the fundamentals of data, let's look at how the first category of statistics (descriptive statistics) can be applied to data.</p>
<p>As previously stated, descriptive statistics involves summarizing data either numerically or graphically. Let's take a look at some of the most typical numerical and graphical summaries you'll encounter when dealing with data on a regular basis.</p>
<h2 id="heading-mean-vs-median-vs-mode-what-is-the-difference">Mean vs Median vs Mode – What is the Difference?</h2>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/06/Visualisation_mode_median_mean.svg" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Mean, Median, and Mode explained through illustration. Mode is the high point, Median is the half way point, and Mean is the average.</em></p>
<h3 id="heading-what-is-a-mean">What is a Mean?</h3>
<p>When we have a set of numerical data like this (4, 5, 6, 7, 10), each value in the set of data is referred to as a data point. We might want to find the data's average value.</p>
<p>So mean is essentially the average of a set of data and is calculated as the sum of all the data points divided by the total number of data points.</p>
<p>In our above data set, the sum is 32 and the total number of data points is 5. So the average, that is the mean, is 32 / 5 = 6.4.</p>
<p>Mean is only used on numerical data. Finding the average of categorical data is not meaningful.</p>
<h3 id="heading-what-is-a-median">What is a Median?</h3>
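<p>Also, given a group of values, we may want to find the value in the center. The median is the middle value once the data is sorted (with an even number of values, it is the average of the two middle values). For (4, 5, 6, 7, 10), the median is 6. Median is also used on numerical data only.</p>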
<h3 id="heading-what-is-a-mode">What is a Mode?</h3>
<p>This is the value with the highest frequency (that is a value that has the highest number of occurrences). The mode can be used for numerical or categorical data.</p>
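<p>All three summaries are available in Python's built-in <code>statistics</code> module. Here is a short sketch using the data set from above. The list of colors is made up to show the mode on categorical data.</p>

```python
import statistics

data = [4, 5, 6, 7, 10]
print(statistics.mean(data))    # 6.4
print(statistics.median(data))  # 6

# The mode also works on categorical data (hypothetical colors).
colors = ["blue", "red", "blue", "green"]
print(statistics.mode(colors))  # blue
```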
<h2 id="heading-what-is-an-outlier">What is an Outlier?</h2>
<p>Outliers are data points that differ from other data points and, when present, can lead us to incorrect conclusions. Here's a typical example of how outliers are harmful.</p>
<p>Consider the following scenario: you have a machine that counts how many customers enter your supermarket every day, and the readings for a given week are (20, 23, 26, 27, 302). We can see that the number 302 is an outlier because it deviates significantly from the other data points.</p>
<p>Outliers could have resulted from a sudden change, machine faults, or other circumstances. However, when they are present, they can lead us to make incorrect decisions. For example, if you want to find the average number of customers who visit your supermarket, the value 302 may lead you to an incorrect result. The mean of the preceding values is 79.6, far above the 20 to 27 customers seen on a typical day.</p>
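<p>A quick check with the <code>statistics</code> module shows how much the single outlier shifts the mean, while the median barely reacts:</p>

```python
import statistics

visits = [20, 23, 26, 27, 302]

print(statistics.mean(visits))       # 79.6
print(statistics.median(visits))     # 26
print(statistics.mean(visits[:-1]))  # 24 without the outlier
```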
<h2 id="heading-what-is-a-standard-deviation">What is a Standard Deviation?</h2>
<p>A Standard Deviation is a summary value that indicates how far our data points typically deviate from the mean. It is used to determine the spread of our data.</p>
<p>The closer the standard deviation is to zero, the closer our data points are to one another.</p>
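<p>As a small illustration, the two made-up lists below share the same mean but have very different spreads. The sketch uses <code>statistics.pstdev</code>, which treats the list as a whole population.</p>

```python
import statistics

tight = [24, 25, 26]   # hypothetical readings close together
spread = [5, 25, 45]   # same mean, far apart

print(statistics.mean(tight), statistics.mean(spread))  # 25 25
print(statistics.pstdev(tight))   # about 0.82
print(statistics.pstdev(spread))  # about 16.33
```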
<p>The standard deviation is an extremely valuable summary that can also hint that we have some outliers in our dataset. Here's how it works:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/06/1920px-Standard_deviation_diagram.svg.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>A chart of a Normal Distribution, with the number of standard deviations listed on the x axis.</em></p>
<p>In the above chart, we see a Normal Distribution. 34.1% + 34.1% = 68.2% of all observations are within one standard deviation, or 1σ (pronounced one Sigma).</p>
<p>A further 13.6% + 13.6% = 27.2% of observations fall between one and two standard deviations from the mean, so about 95.4% of all observations are within two standard deviations, or 2σ. And so on.</p>
<p>And yes, if you've heard of Six Sigma, that is a concept in engineering where six standard deviations' worth of possibilities are accounted for in the quality assurance process. Meaning you are accounting for all but the most extreme outliers. 99.99966% of all possibilities, to be exact.</p>
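<p>You can reproduce the percentages from the chart with <code>statistics.NormalDist</code>. This is a minimal sketch, and <code>share_within</code> is a helper name made up for the example. The results differ from the chart in the last digit only because the chart rounds each slice separately.</p>

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```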
<p>Now that we've grasped some numerical summaries, let's take a look at some common graphical summaries.</p>
<h2 id="heading-what-is-a-bar-chart">What is a Bar Chart?</h2>
<p>A bar chart is a type of data visualization used for categorical data. You use it to graphically show the frequency of categorical data (that is the number of times a categorical data point occurs). Here's an example:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/06/download-1.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-what-is-a-histogram">What is a Histogram?</h2>
<p>A histogram is similar to a bar chart in that the height of each bar shows a frequency. The difference is that it works with numerical data, which it groups into bins or ranges.</p>
<p>It is a very efficient visualization tool because it helps you see the distribution of your numerical data. You can read more about histograms <a target="_blank" href="https://www.cuemath.com/data/histograms/">here</a>.</p>
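<p>The binning step can be sketched in a few lines of plain Python. The ages, the bin width of 10, and the <code>bin_label</code> helper are all assumptions made for this example.</p>

```python
from collections import Counter

# Hypothetical ages (illustrative values only)
ages = [21, 25, 27, 32, 34, 35, 38, 41, 47, 52]
bin_width = 10


def bin_label(value, width):
    """Return the label of the bin that a value falls into, e.g. '20-29'."""
    start = (value // width) * width
    return f"{start}-{start + width - 1}"


# Count how many values land in each bin
counts = Counter(bin_label(age, bin_width) for age in ages)

# Print a simple text histogram: one '#' per observation
for label in sorted(counts):
    print(label, "#" * counts[label])
```

<p>Each row of the printout plays the role of one bar in the histogram: the longer the row, the more values fall in that range.</p>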
<h2 id="heading-what-is-a-boxplot">What is a Boxplot?</h2>
<p>Another excellent visualization that helps you visualize the distribution of your data is the boxplot.</p>
<p>A boxplot allows you to see, for example, whether there are any outliers in your dataset. It is built from five values: the minimum, 25th percentile, 50th percentile, 75th percentile, and maximum. A boxplot looks as follows:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/06/OutliersAnomalies--1-.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><em>Image by Ibrahim Ogunbiyi</em></p>
<p>So let’s go over what we have in the above diagram:</p>
<p><strong>Minimum</strong>: In a boxplot, the minimum is not necessarily the smallest value in our dataset. It is a lower boundary calculated using the formula (Q1 - 1.5 * IQR) where:</p>
<ul>
<li><p>Q1 – is the 25th percentile</p>
</li>
<li><p>IQR – is the Interquartile Range (which is the difference between the 75th percentile and the 25th percentile).</p>
</li>
</ul>
<p>The minimum helps us detect data points that are far below the other observed values.</p>
<p>For instance, assume our data points are [345, 402, 295, 386, 10]. We can see that the value 10 is an outlier because it is far below the other observations.</p>
<p><strong>The 25th percentile</strong> is a value that tells us that 25% of our data points are below that value and 75% of our data points are above that value. The 25th percentile is also known as the first quartile.</p>
<p><strong>The 50th percentile</strong> is a value that indicates that 50% of our data points are below that value and the remaining 50% are above that value. It is also known as the second quartile.</p>
<p><strong>The 75th percentile</strong> is a value that tells us that 75% of our data points are below that value and the remaining 25% are above it. It is also known as the third quartile.</p>
<p><strong>Maximum:</strong> Like the minimum, the maximum in a boxplot is not necessarily the highest value in the dataset. It is an upper boundary calculated using the formula (Q3 + 1.5 * IQR) where:</p>
<ul>
<li><p>Q3 – is the 75th percentile</p>
</li>
<li><p>IQR – is the Interquartile Range (which is the difference between the 75th percentile and the 25th percentile).</p>
</li>
</ul>
<p>The maximum helps us detect data points that are far above the other observed values.</p>
<p>For instance, assume our data points are [645, 40, 25, 38, 42]. We can see that the value 645 is an outlier because it is far above the other observations.</p>
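<p>The two formulas above can be sketched with Python's built-in <code>statistics</code> module. The function names <code>boxplot_fences</code> and <code>find_outliers</code> are made up for this example, and the quartiles are computed with the "inclusive" method, which is one of several common conventions, so other tools may give slightly different boundaries.</p>

```python
from statistics import quantiles


def boxplot_fences(values):
    """Return the (minimum, maximum) boundaries used by a boxplot."""
    # Q1, median, and Q3 using the "inclusive" quartile convention
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Return the values that fall outside the boxplot boundaries."""
    low, high = boxplot_fences(values)
    return [v for v in values if v < low or v > high]


low_example = [345, 402, 295, 386, 10]
print(boxplot_fences(low_example))  # (158.5, 522.5)
print(find_outliers(low_example))   # [10]

high_example = [645, 40, 25, 38, 42]
print(boxplot_fences(high_example))  # (32.0, 48.0)
print(find_outliers(high_example))   # [645, 25]
```

<p>Notice that in the second list the rule flags 25 as well as 645. The other values sit so close together that the boundaries are very narrow, which is a reminder that the 1.5 * IQR rule is a guideline rather than a verdict.</p>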
<p>We've seen some graphical summaries of what we'll be dealing with on a daily basis. Let's look at the final topic we will discuss in this article:</p>
<h2 id="heading-what-is-the-association-between-quantitative-variables">What is the Association Between Quantitative Variables?</h2>
<p>A <strong>variable</strong> is any set of values (alphabetical or numerical, but typically alphabetical) that represents a collection of observations. It is sometimes referred to as a column in a table.</p>
<p>Two variables are said to be associated if a specific value of one variable is most likely to occur with a specific value of another variable.</p>
<p>To study the association between two quantitative variables (often referred to as correlation), we calculate it using the Karl Pearson formula, and the result is between -1 and +1.</p>
<p>If the correlation value approaches 1, it indicates that the two variables are positively correlated (that is, as one variable increases, the other variable increases as well). If the value approaches -1, it indicates that the variables are negatively correlated (that is, as one variable increases, the other variable decreases). Finally, if the correlation coefficient is 0, there is no linear correlation between the variables.</p>
<p>You can read more about correlation and the Karl Pearson formula <a target="_blank" href="https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/">here</a>.</p>
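<p>As a rough sketch, the Pearson coefficient can be computed by hand in plain Python. The function name <code>pearson_r</code> and the temperature and sales figures are assumptions made for this example.</p>

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical data: sales rise in step with temperature
temperature = [14, 16, 19, 22, 25]
ice_cream_sales = [28, 32, 38, 44, 50]
# Hypothetical data: heating use falls as temperature rises
heating_hours = [11, 9, 6, 3, 0]

print(pearson_r(temperature, ice_cream_sales))  # close to +1
print(pearson_r(temperature, heating_hours))    # close to -1
```

<p>Both made-up relationships are perfectly linear, so the results sit at the two extremes of the range. Real data usually lands somewhere in between.</p>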
<h2 id="heading-what-is-a-scatter-plot">What is a Scatter Plot?</h2>
<p>We can represent the correlation between quantitative variables in a graphical summary by using a plot called a scatter plot.</p>
<p>A scatter plot looks like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/06/scatter-ice-cream1.svg" alt="Image" width="600" height="400" loading="lazy"></p>
<p><a target="_blank" href="https://www.mathsisfun.com/data/scatter-xy-plots.html"><em>Scatter (XY) Plots (mathsisfun.com)</em></a></p>
<p>To learn about scatter plots you can read more <a target="_blank" href="https://byjus.com/maths/scatter-plot/#:~:text=Scatter%20plots%20are%20the%20graphs,plotted%20on%20the%20Y%2Daxis.">here</a>.</p>
<h1 id="heading-conclusion-and-learning-more">Conclusion and Learning More</h1>
<p>In this tutorial, we've explored some fundamental statistics concepts that will help you work more efficiently with your data.</p>
<p>But the learning does not stop here – there are a few fundamental topics that you must be familiar with. Because this is only the beginning, you can delve deeper by consulting online resources or textbooks.</p>
<p>Thank you very much for reading, and please share the article so that beginners who want to go into data science can learn as well.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Deploy a Machine Learning Model as a Web App Using Gradio ]]>
                </title>
                <description>
                    <![CDATA[ You've built your Machine Learning model with 99% accuracy and now you are ecstatic. You are like yaaaaaaaaay! My model performed well. Then you paused and you were like – now what? Well first, you might have thought of uploading your code to GitHub ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-deploy-your-machine-learning-model-as-a-web-app-using-gradio/</link>
                <guid isPermaLink="false">66d45f393a8352b6c5a2aa6f</guid>
                
                    <category>
                        <![CDATA[ deployment ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Web Applications ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ibrahim Ogunbiyi ]]>
                </dc:creator>
                <pubDate>Wed, 01 Jun 2022 15:14:51 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/05/deploy-ml-models-article.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You've built your Machine Learning model with 99% accuracy and now you are ecstatic. You are like yaaaaaaaaay! My model performed well.</p>
<p>Then you paused and you were like – now what?</p>
<p>Well first, you might have thought of uploading your code to GitHub and showing people your Jupyter notebook file. It comprises those gorgeous-looking visualizations you created using Seaborn, those extremely powerful ensemble models, and how they are able to pass their evaluation metrics and so on.</p>
<p>But then you noticed that no one is interacting with it.</p>
<p>Well, my friend, why not try deploying the model as a web app so that non-techies can interact with the model, too? Because only programmers like you will likely understand that first approach.</p>
<p>There are several methods for deploying your model, but we will focus on one of them in this article: using Gradio. I can tell you're excited. Well, relax and enjoy, because this is going to be an exciting ride.</p>
<h1 id="heading-prerequisites">Prerequisites</h1>
<p>Before beginning this journey, I assume you have the following knowledge:</p>
<ol>
<li><p>You know how to create a user-defined function in Python</p>
</li>
<li><p>You can build and fit an ML model</p>
</li>
<li><p>Your environment is all set up</p>
</li>
</ol>
<h1 id="heading-what-is-gradio">What is Gradio?</h1>
<p><a target="_blank" href="https://gradio.app/">Gradio</a> is a free and open-source Python library that allows you to develop an easy-to-use customizable component demo for your machine learning model that anyone can use anywhere.</p>
<p>Gradio integrates with the most popular Python libraries, including Scikit-learn, PyTorch, NumPy, seaborn, pandas, TensorFlow, and others.</p>
<p>One of its advantages is that it allows you to interact with the web app you are currently developing in your Jupyter or Colab notebook. It has a lot of unique features that can help you construct a web app that users can interact with.</p>
<h1 id="heading-how-to-install-gradio">How to Install Gradio</h1>
<p>To use Gradio, we must first install the library on our local PC. Open your Conda PowerShell or terminal and run the following command. If you are using Google Colab, run the same command in a notebook cell with a leading <code>!</code>:</p>
<pre><code class="lang-bash">pip install gradio
</code></pre>
<p>We now have Gradio installed on our local PC. Let's go through some of the fundamentals of Gradio so we can become acquainted with the library.</p>
<p>To begin, we must import the library into our notebook or IDE, whichever you are using. We can do this by typing the following command:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> gradio <span class="hljs-keyword">as</span> gr
</code></pre>
<h1 id="heading-how-to-create-your-first-web-app">How to Create Your First Web App</h1>
<p>In this tutorial, we'll create an example greeting app to familiarize ourselves with the fundamentals of Gradio.</p>
<p>To do so, we'll need to write a greeting function, because Gradio works with user-defined Python functions. Our greeting function looks like this:</p>
<pre><code class="lang-py"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">greet_user</span>(<span class="hljs-params">name</span>):</span>
    <span class="hljs-keyword">return</span> <span class="hljs-string">"Hello "</span> + name + <span class="hljs-string">" Welcome to Gradio!😎"</span>
</code></pre>
<p>We now need to deploy the Python function on Gradio so that it can act as a web app. To do this, we type:</p>
<pre><code class="lang-py">app =  gr.Interface(fn = greet_user, inputs=<span class="hljs-string">"text"</span>, outputs=<span class="hljs-string">"text"</span>)
app.launch()
</code></pre>
<p>Let’s walk through what is going on in the above code before we run it.</p>
<p><code>gr.Interface</code>: This class is the bedrock of almost everything in Gradio. It builds the user interface that displays all the components shown on the web page.</p>
<p>The <code>fn</code> parameter: This is the Python function you created and want to provide to Gradio.</p>
<p>The <code>inputs</code> parameter: These are the components whose values are passed into your function, such as text, images, numbers, audio, and so on. In our case, the function requires text, so we passed <code>"text"</code> to the <code>inputs</code> parameter.</p>
<p>The <code>outputs</code> parameter: These are the components that display what your function returns. Because the function in this example returns text, we pass the text component to the <code>outputs</code> parameter.</p>
<p><code>app.launch()</code> is used to launch the app. You should see something like this when you run the above code:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/Gradio-pro.png" alt="alt_text" width="600" height="400" loading="lazy"></p>
<p>Once the Gradio interface comes up, just type your name and hit submit. It then displays the result of the function we created above. Now that we are done with that, let’s go over one more thing in Gradio before we learn how to deploy our model.</p>
<p>We will create a Gradio app that accepts two inputs and provides one output. This app asks for your name and a value, and then outputs your name along with the square of the value you entered. To do that, type the code below:</p>
<pre><code class="lang-py"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">return_multiple</span>(<span class="hljs-params">name, number</span>):</span>
    result = <span class="hljs-string">"Hi {}! 😎. The square of {} is {}"</span>.format(name, number, round(number**<span class="hljs-number">2</span>, <span class="hljs-number">2</span>))
    <span class="hljs-keyword">return</span> result

app = gr.Interface(fn = return_multiple, inputs=[<span class="hljs-string">"text"</span>, gr.Slider(<span class="hljs-number">0</span>, <span class="hljs-number">50</span>)], outputs=<span class="hljs-string">"text"</span>)
app.launch()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/gradio-2.png" alt="alt_text" width="600" height="400" loading="lazy"></p>
<p>Now that we’ve done that let’s quickly go through some of the things we did here that you might not be familiar with.</p>
<p>The <code>inputs</code> parameter: Here we passed a list containing two components, the text box and the slider. The slider is another Gradio component. It returns a numeric value as you slide across a given range. We used it because the function we created expects a text value and a number.</p>
<p>We have to order the components in the <code>inputs</code> list the same way the parameters are ordered in the function we created above. That is, the text comes first, then the number. The output we expect is a string, so we just did some formatting in the function.</p>
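<p>A quick way to check the ordering is to call the function yourself with the values in the same order as the <code>inputs</code> list. The sketch below uses no Gradio at all. The function is a trimmed-down version of the one above, and the sample name and number are hypothetical.</p>

```python
def return_multiple(name, number):
    # Same idea as the function above: greet the user and square the number
    return "Hi {}! The square of {} is {}".format(name, number, round(number ** 2, 2))


# Gradio hands over one value per input component, in list order:
# first the text box value, then the slider value.
values_in_input_order = ["Ada", 4.5]

message = return_multiple(*values_in_input_order)
print(message)  # Hi Ada! The square of 4.5 is 20.25
```

<p>If the two values were swapped, the name would be squared instead of the number and Python would raise a <code>TypeError</code>, which is the kind of mistake that is easier to catch here than inside the running web app.</p>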
<p>Now that we’ve familiarized ourselves with some of the basics of Gradio, let’s create a model that we will deploy.</p>
<h1 id="heading-how-to-deploy-a-machine-learning-model-on-gradio">How to Deploy a Machine Learning Model on Gradio</h1>
<p>In this section, I will use a classification model that I've previously trained and saved in a pickle file.</p>
<p>When you create a model that takes a long time to train, the most effective approach to deal with it is to save it in a pickle file once it is finished training so that you don't have to go through the stress of training the model again.</p>
<p>If you want to save a model as a pickle file, let me show you how you can do that. First import the pickle library and then type the code below. Let’s say I just want to fit a model like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pickle
</code></pre>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier

<span class="hljs-comment"># X_train and y_train are your own training features and labels</span>
clf = RandomForestClassifier(random_state=<span class="hljs-number">42</span>)
clf.fit(X_train, y_train)

<span class="hljs-comment"># Once the model is fitted, save it. Remember to change the file name.</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"filename.pkl"</span>, <span class="hljs-string">"wb"</span>) <span class="hljs-keyword">as</span> f:
    pickle.dump(clf, f)
</code></pre>
<p>Now if you wish to load it you can type the following code as well:</p>
<pre><code class="lang-py"><span class="hljs-keyword">with</span> open(<span class="hljs-string">"filename.pkl"</span>, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
    clf  = pickle.load(f)
</code></pre>
<p>Now that we’ve understood that, let’s create a function that we will be able to pass into Gradio so that it can make the predictions.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">make_prediction</span>(<span class="hljs-params">age, employment_status, bank_name, account_balance</span>):</span>
    <span class="hljs-comment"># Load the saved model from the pickle file</span>
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">"filename.pkl"</span>, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
        clf = pickle.load(f)
    <span class="hljs-comment"># predict returns one prediction per input row, so we read the first one</span>
    preds = clf.predict([[age, employment_status, bank_name, account_balance]])
    <span class="hljs-keyword">if</span> preds[<span class="hljs-number">0</span>] == <span class="hljs-number">1</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-string">"You are eligible for the loan"</span>
    <span class="hljs-keyword">return</span> <span class="hljs-string">"You are not eligible for the loan"</span>

<span class="hljs-comment">#Create the input component for Gradio since we are expecting 4 inputs</span>

age_input = gr.Number(label = <span class="hljs-string">"Enter the Age of the Individual"</span>)
employment_input = gr.Number(label= <span class="hljs-string">"Enter Employment Status {1:For Employed, 2: For Unemployed}"</span>)
bank_input = gr.Textbox(label = <span class="hljs-string">"Enter Bank Name"</span>)
account_input = gr.Number(label = <span class="hljs-string">"Enter your account Balance:"</span>)
<span class="hljs-comment"># We create the output</span>
output = gr.Textbox()


app = gr.Interface(fn = make_prediction, inputs=[age_input, employment_input, bank_input, account_input], outputs=output)
app.launch()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/05/bank2.png" alt="alt_text" width="600" height="400" loading="lazy"></p>
<p>So let’s unwrap what we have above:</p>
<p>We'll start at the point where we created the input components. You can create the components inside the <code>gr.Interface</code> call, but in the code above I built them outside of <code>gr.Interface</code> and then passed the variables in.</p>
<p>If you want a component that receives numbers, use <code>gr.Number</code>. For the output, I created a <code>gr.Textbox</code>, but you could also pass the string <code>"text"</code> as we did in our first app (<code>"text"</code> is shorthand for a text box if you don't want to declare the component explicitly).</p>
<p>Also I used the label parameter in each component so that the user will know what to do. We are already familiar with the other code mentioned above. And now that we've done that our model is deployed. 🎉🎉😎🥳🥳.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Thank you for reading this tutorial. We covered a lot in this article. Just remember that learning Gradio does not stop here – you can check out more on their <a target="_blank" href="https://gradio.app/">website</a>. They have pretty intuitive documentation on how you can create your web app.</p>
<p>Thanks once again for reading. If you enjoyed this article, you can support me by following me on <a target="_blank" href="https://www.linkedin.com/in/ibrahimogunbiyi/">LinkedIn</a> or <a target="_blank" href="https://twitter.com/Comejoinfolks">Twitter</a>. Gracias, and happy deployment😀</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
