Ibrahim Ogunbiyi - freeCodeCamp.org

What is Semantic Matching? How to Find Words in a Document Using NLP

Ibrahim Ogunbiyi — Thu, 09 Jan 2025 19:27:37 +0000

Have you ever found yourself searching a document for a specific word or phrase just to discover that the term you're looking for isn't there? It can be frustrating, right?

Sometimes, even though you might not see the exact term you’re looking for, the document might contain similar words or phrases that have the same meaning or context but don’t have the exact same form (such as differences in spelling).

Traditional NLP search approaches have relied on using exact forms to search for words or phrases in a particular document. But this fails at finding words based on semantic or contextual meaning.

To solve this, semantic matching comes into play. It’s an advanced way of searching that takes advantage of traditional search methods while also focusing more on locating or matching words or phrases based on their meaning or context (rather than solely on their exact form).

In this article, you will learn how to perform semantic matching using NLP. Without further ado, let’s get started.

Requirements

To make sure that you can reproduce the experiment in this tutorial, you’ll need to have a few things.

First, you’ll need to have Python 3.x (preferably Python 3.10) installed on your PC. You’ll also need some libraries, which you can install using the Pip package manager.

You should also have basic knowledge of NLP such as text preprocessing and text representation techniques. You can learn more here.

You can also fork the repo which contains all the code in this article so you can follow along.

To install everything using Pip, type the following command:

# install the required libraries with pip
pip install pypdf2 keybert sentence-transformers

Problem Definition

Suppose you’re a data scientist on a curriculum development team, and you want to know whether a particular concept (a word or phrase), say birth control, is taught in a curriculum that’s stored in a PDF document.

One way to do this is to open the PDF in a PDF viewer and use Ctrl + F (find) to check whether the phrase birth control appears in it.

You could also do it programmatically, as shown below:

# import library
import PyPDF2

# use PdfReader from PyPDF2 to read the PDF content
pdf_reader = PyPDF2.PdfReader("Relationships_Education_RSE_and_Health_Education.pdf")

# join all the content in the pdf pages together and lowercase the letters
pdf_document = " ".join([page.extract_text().lower() for page in pdf_reader.pages])

# check if the string 'birth control' is in the document [Returns False]
"birth control" in pdf_document

Below is the output of the above code:

False

As shown above, you can see that both the programmatic way of searching and the pdf tool say that the phrase “birth control” doesn't exist in the pdf document.

Well, this might be true, but because this is a traditional way of NLP searching (that matches word for word in exact form) let’s not fully trust it. As I explained earlier, some words might be in different forms or have a different spelling, but they might mean the same thing contextually or semantically.

So how do we solve this issue? This is where semantic matching comes into play.

What is Semantic Matching?

Semantic Matching is a technique used to determine if two elements have the same meaning. An element can be a word, phrase, sentence, document, or even a corpus. It refers to matching elements based on meaning or context and not just matching based on exact form.

In order to perform semantic matching in NLP, there are certain things you need to know and do. Let’s go through them now:

What is Word Embedding?

Word embedding is an advanced text representation technique used to represent words in a lower-dimensional vector representation. This vector representation captures inter-word semantic and syntactic information. This means that words that have similar meanings – even though they might be spelled differently – will have close to similar vector representations.

What does Lower-Dimensional Vector representation mean?

In NLP, traditional ways of representing text in a way machines can understand (that is, numerical vector representations) are Bag of Words, Term-Frequency and Inverse Document Frequency (TF-IDF), and One-hot encoding. But these techniques usually generate high dimensions (usually the size of the vocabulary) for a particular word representation and are sparse (meaning there will be lots of zeros).

So, for example, if a word is represented as a numerical vector and the document or corpus it belongs to has a vocabulary of 10,000 words, that word's vector would have 10,000 dimensions (which is high).
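To make the dimensionality point concrete, here is a minimal sketch of one-hot encoding in plain Python. The five-word vocabulary and the `one_hot` helper are made up for illustration. With a real 10,000-word vocabulary, each vector would have 10,000 entries, all but one of them zero.

```python
# Toy vocabulary (hypothetical); a real corpus would have thousands of words.
vocabulary = ["cat", "control", "man", "school", "woman"]

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's position and 0 everywhere else."""
    return [1 if term == word else 0 for term in vocab]

man_vector = one_hot("man", vocabulary)
woman_vector = one_hot("woman", vocabulary)

print(man_vector)    # [0, 0, 1, 0, 0]
print(woman_vector)  # [0, 0, 0, 0, 1]

# The vector length always equals the vocabulary size.
print(len(man_vector) == len(vocabulary))  # True

# Different words never share a non-zero position, so their dot product is
# always 0: one-hot vectors say nothing about how close two meanings are.
dot_product = sum(a * b for a, b in zip(man_vector, woman_vector))
print(dot_product)  # 0
```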

The disadvantages of these techniques are high dimensionality, sparsity, and an inability to capture semantic information. So, advancements in NLP led to the development of word embedding techniques, which create lower-dimensional (dense) vector representations of words and can capture inter-word semantic information.

Word embedding is a cornerstone of modern NLP and language technology, serving as the foundation for advanced language representation models such as GPT (Generative Pre-trained Transformer).

There is also sentence embedding that represents sentences in a lower-dimension vector representation.

How do we measure if two vectors are similar?

This is where cosine similarity comes into play. Cosine similarity is a mathematical technique that we use to know how similar two vectors are to each other.

Mathematically, cosine similarity ranges from -1 to 1, but for text embeddings the value usually falls between 0 and 1. A value close to 1 means that the two vectors are highly similar.
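As a rough illustration of the calculation itself: cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements that formula with the standard library only. The `cosine_similarity` helper and the 3-dimensional vectors are hypothetical and are not taken from any embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, for illustration only.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]     # same direction as v1
v3 = [3.0, 0.0, -1.0]    # perpendicular to v1 (dot product is 3 + 0 - 3 = 0)
v4 = [-1.0, -2.0, -3.0]  # opposite direction to v1

print(round(cosine_similarity(v1, v2), 4))  # 1.0
print(round(cosine_similarity(v1, v3), 4))  # 0.0
print(round(cosine_similarity(v1, v4), 4))  # -1.0
```

Only the direction of the vectors matters here, not their length, which is why v1 and v2 score 1.0 even though v2 is twice as long.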

For example, to understand how cosine similarity works, let’s create a word embedding vector representation for three words: Man, Woman, and Cat. Then we’ll use cosine similarity to figure out which vectors are similar.

Based on our own instincts, we know that Man should be closer to Woman than Cat. So, let’s use NLP to help us validate this.

Thanks to advancements in NLP, there are numerous models we can use to create word embeddings, which you can find on the Hugging Face repository. In this article, we are going to use the all-mpnet-base-v2 model from the SentenceTransformer library. According to SentenceTransformer, it provides the best quality performance in terms of sentence embedding, and you can also use it to create word embeddings.

The code below lets us validate our claim using NLP. First, we initialize SentenceTransformer with all-mpnet-base-v2 and use the encode method to get the embedding of each word. Then we use the cos_sim function, from the sentence_transformers.util module, to determine which vectors are similar.

# import library
from sentence_transformers import SentenceTransformer # sentence transformer
from sentence_transformers.util import cos_sim # cosine similarity

# initialize sentence transformer with the 'all-mpnet-base-v2' model
model = SentenceTransformer("all-mpnet-base-v2")

# get the embedding vector of the man, woman, and cat words.
man_vector = model.encode("man")
woman_vector = model.encode("woman")
cat_vector = model.encode("cat")

# get the similarity between man and woman
similarity = cos_sim(man_vector, woman_vector)

# get the similarity between man and cat
cat_similarity = cos_sim(man_vector, cat_vector)

print("The Similarity between Man vector and Woman Vector:", similarity, "\n")

print("The Similarity between Man vector and Cat Vector:", cat_similarity)

// Result

The Similarity between Man vector and Woman Vector: tensor([[0.3501]]) 

The Similarity between Man vector and Cat Vector: tensor([[0.2553]])

As you can see, the similarity score between man and woman (0.35) is higher than that of man and cat (0.26). This shows the beauty of word embedding and cosine similarity together.

Now let’s get back to our business.

How to Perform Semantic Matching on a PDF Document

Now we are going to use semantic matching to look for a word or phrase in the document that matches the birth control phrase.

How to Get Words from the PDF using KeyBERT

Word embedding generates embeddings for individual words. Our PDF document contains a large volume of textual components, including digits, special characters, symbols, stopwords, and the actual words we want to match. So, to save time on preprocessing, we are going to utilize KeyBERT. This is a library that allows us to get meaningful keywords (words or phrases) from a particular document in a minimal way.

Keep in mind that by default, KeyBERT extracts single keywords – but we can also tell it to extract phrases with two or more words. We’ll use it here to extract single-word and 2-word phrases. Below is the implementation of using KeyBERT to extract keywords from our document:

from keybert import KeyBERT
# initialize model
keybert_model =  KeyBERT()

# extract all keywords (single word and 2 word phrase) from the pdf
all_keywords = keybert_model.extract_keywords(docs=pdf_document, top_n=-1, keyphrase_ngram_range=(1, 2))
# print length of keywords extracted                                             
print(len(all_keywords))
# show the first 5 keywords
print(all_keywords[:5])

The above code imports KeyBERT from the keybert library. It then initializes KeyBERT and extracts all keywords (that is, single words and 2-word phrases) from the document. The next line prints the number of keywords extracted. Lastly, the code prints the first five keywords out of all the keywords extracted from the PDF.

Below is the output of the above code:

8669
[('education guidance', 0.5954),
 ('schools guidance', 0.5542),
 ('education policies', 0.5405),
 ('sex education', 0.5228),
 ('education safeguarding', 0.5001)]

As you can see above, KeyBERT extracted 8,669 keywords from the PDF. Also, the KeyBERT model usually returns the keywords extracted along with a score of each word. We don’t need the score, so we will only extract each keyword from the tuple it is enclosed in.

# remove score from each keyword

all_keywords = [keyword[0] for keyword in all_keywords]
all_keywords[:5]

Below is the output of the above code:

['education guidance',
 'schools guidance',
 'education policies',
 'sex education',
 'education safeguarding']

Embedding of the Birth Control Phrase and the Keywords Extracted from the PDF

Now that we’ve extracted these keywords from the document, the next step is to get the embedding of our phrase and the keywords from the document.

The below code lets us do this:

# initialize sentence transformer with the 'all-mpnet-base-v2' model
model = SentenceTransformer("all-mpnet-base-v2")

# get the embedding of the 'birth control' phrase
birth_control_embedding = model.encode("birth control")

# get the embedding of all the keywords in the document
keywords_embedding =  model.encode(all_keywords)

Cosine Similarity of Birth Control Phrase and Keywords in PDF

After getting the embedding of the phrase and the keywords, the next step is to get the similarity score of the phrase and the keywords. This will help us know which keyword in the document is highly similar to the phrase.

The below code allows us to get the cosine similarity of the phrase and the keywords’ embedding vector.

# calculate the cosine similarity of the birth control word and each word in the document
cosine_similarity_result = cos_sim(birth_control_embedding, keywords_embedding)
# print the shape (equal to the number of keywords)
print(cosine_similarity_result.shape)
# show the similarity scores (the tensor has a single row, so [:5] returns that whole row)
print(cosine_similarity_result[:5])

Below is the output of the above code:

torch.Size([1, 2034])
tensor([[0.2166, 0.1977, 0.0998,  ..., 0.1634, 0.1082, 0.2194]])

Now we have a similarity score between the phrase and each keyword, and the resulting tensor has one entry per keyword, as its shape shows. We can use the argmax() method to get the index of the highest score in the tensor, and then use that index to look up the corresponding keyword in the all_keywords list. The code below does this:

# return the index of the highest similarity score
index = cosine_similarity_result.argmax()
print(index)

Below is the output of the above code. It tells us that the keyword with the highest similarity to the Birth Control phrase is at index 1490.

tensor(1490)

Now, let’s look at the keyword at index 1490 in the all_keywords variable.

# print the keyword at index 1490 
print(all_keywords[index])

Below is the output of the above code:

contraceptive

After examining it, we found that "contraceptive" was the word with the highest similarity, which makes sense because "birth control" and "contraceptive" mean the same thing. This demonstrates the elegance of semantic matching in finding similar words.

Let’s Also Explore Top 5 Keywords in the PDF that Match with the Phrase “Birth Control”

Let’s explore the 5 top keywords with the highest similarity score to “birth control” to see what the result would look like.

To do that, we can use the topk() method to get the top 5 indices. Then we can loop through these indices to get the actual keywords:

# extract the top 5 indices
top_5_indices = cosine_similarity_result.topk(5)[1].tolist()[0]

print(top_5_indices)

Below is the result of the above code:

[1490, 1972, 871, 1199, 1944]

# get top 5 keywords
top_5_keywords = [all_keywords[index] for index in top_5_indices]
print(top_5_keywords)

Below is the output of the above code:

['contraceptive', 'contraception', 'contraceptive choices', 'range contraceptive', 'cover contraception']

There, we can see that the top five results relate to contraception and contraceptives. This demonstrates that semantic matching is an effective way to find related elements in a document.
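The ranking step can also be sketched without PyTorch, which may help clarify what cos_sim and topk() are doing together. The `top_matches` helper and the 2-dimensional "embeddings" below are made up for illustration. With real data you would pass in the vectors returned by model.encode().

```python
import math

def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_matches(query_vector, keyword_vectors, keywords, k=5):
    """Return the k keywords most similar to the query as (keyword, score) pairs."""
    scored = [
        (keyword, cosine_similarity(query_vector, vector))
        for keyword, vector in zip(keywords, keyword_vectors)
    ]
    # Sort by score, highest first, and keep the first k entries.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up 2-dimensional vectors standing in for real embeddings.
keywords = ["contraception", "education policies", "school uniform"]
keyword_vectors = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
query_vector = [1.0, 0.0]

results = top_matches(query_vector, keyword_vectors, keywords, k=2)
print([keyword for keyword, _ in results])  # ['contraception', 'education policies']
```

Returning the score next to each keyword makes it easy to apply a cut-off later, for example to ignore matches whose similarity is too low to be meaningful.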

Wrapping Up

In this article, you learned what semantic matching is and its advantages compared to traditional NLP search methods. You also encountered concepts such as word embeddings and cosine similarity and learned how they help us perform semantic matching. Then we implemented semantic matching by finding a phrase in a document.

Thank you for reading this article, and I will see you in the next one.

Natural Language Processing Techniques for Topic Identification – Explained with Examples

Ibrahim Ogunbiyi — Thu, 25 Jan 2024 16:16:15 +0000

There's a lot of textual information available these days. It ranges from articles to social media posts and research papers. So our ability to distill meaningful insights is key. This helps us make informed decisions in a wide array of contexts.

For example, you can analyze a large volume of textual content to extract a common theme. Companies and businesses utilize this technique to understand public opinion about their brand. This lets them make informed decisions and improve their services.

The ability to extract themes from a large amount of textual data is referred to as topic identification.

In this article, you will learn how to utilize NLP techniques for topic identification, enhancing your skillset as a data scientist. So sit back, because it's gonna be an interesting journey.

What is Topic Identification?

Topic identification, simply put, is a sub-field under natural language processing. It involves the process of automatically discovering and organizing the main themes or topics present in a collection of textual data.

There are several Natural Language Processing (NLP) techniques you can use to identify themes in text, from simple ones to more algorithm-based techniques. In this article, we will look at the common NLP techniques used for topic identification and discuss each of them in more detail below.

I recently tweeted about the essence of NLP. It really is purely statistics, because there are different manipulations you can do to ensure that numbers serve as representations for text (since computers don't understand text).

Requirements for this Project

In order for you to be able to follow along and get hands-on practical experience while learning, you should have Python 3.x installed on your machine.

We'll also use the following libraries: Gensim, Scikit-Learn, and NLTK. You can install them using the Pip package installer with the following command:

pip install gensim nltk scikit-learn

Techniques Used in NLP for Topic Identification

There are various techniques you can use for topic identification. In this article, you will learn about some common NLP techniques that work quite well, from simple and effective methods to more advanced ones.

Bag of Words

Bag of Words (BoW) is a common representation used in NLP for textual data. You can use it to count the frequency at which each word occurs in a document.

BoW, in the context of topic identification, is based on the assumption that the more frequently a word occurs in a document, the more important it is. Then you can use those more common words to infer what the document is all about.

Bag of words is the simplest technique used to identify topics in NLP. While Bag of Words is simple and efficient, it is highly affected by stop words, which are common words in text data (like "the," "and," "is," and so on).

But once you remove stop words from the text and apply basic text preprocessing (such as normalization), BoW can still prove effective in identifying some of the main topics.

Let's look at how you can use BoW to identify the topic below.

How to Implement Bag of Words in Python

A bit of background about the example article we'll use here: I got it from the BBC, and it's titled "US lifts ban on imports of latest Apple watch." The article discusses the lifted ban on Apple's latest watches, Ultra 2 and Series 9.

Now let's go over how to implement the bag of words in Python. I'll break this code block up into sections and explain each part as I go to make it a bit easier to digest.

#import necessary libraries
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

article = "Apple's latest smart watches can resume being sold in the US after the tech company filed an emergency appeal with authorities.\
Sales of the Series 9 and Ultra 2 watches had been halted in the US over a patent row.\
The US's trade body had barred imports and sales of Apple watches with technology for reading blood-oxygen level.\
Device maker Masimo had accused Apple of poaching its staff and technology. \
It comes after the White House declined to overturn a ban on sales and imports of the Series 9 and Ultra 2 watches which came into effect this week.\
Apple had said it strongly disagrees with the ruling.\
The iPhone maker made an emergency request to the US Court of Appeals, which proved successful in getting the ban lifted."

In the above code, we're importing the necessary libraries that we'll use to implement the BoW.

We'll use the Counter class to count the frequency of each word, and the word_tokenize function to split the document into individual word tokens so they can be counted. Lastly, the stopwords corpus provides the list of stop words that we'll remove from the document.


# Initialize english stopwords
english_stopwords = stopwords.words("english")

#convert article to tokens
tokens = word_tokenize(article)

#extract alpha words and convert to lowercase
alpha_lower_tokens = [word.lower() for word in tokens if word.isalpha()]

#remove stopwords
alpha_no_stopwords = [word for word in alpha_lower_tokens if word not in english_stopwords]

#Count word
BoW = Counter(alpha_no_stopwords)

#3 Most common words
BoW.most_common(3)

In the above code, the first line loads the list of English stop words. The second line tokenizes the article string into individual words. The third line keeps only the alphabetic tokens and converts them to lowercase, and the fourth line removes the stop words. The last two lines count the frequency of each word and select the three most common words.

Below is the output of the BoW model:

[('watches', 4), ('us', 4), ('apple', 3)]

From this, we can infer that the article is about "Apple's watches in the US". As you can see, despite the simple reasoning behind the bag of words, it is still possible to infer a bit of knowledge about the article.
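If you want to try the idea without installing NLTK or downloading its data, a rough standard-library version might look like the sketch below. The `bag_of_words` helper and the short `STOP_WORDS` set are assumptions made for this example. The regular expression is also a cruder tokenizer than word_tokenize, so counts on real text can differ from the NLTK version.

```python
import re
from collections import Counter

# Small hand-picked stop word list (an assumption for this sketch);
# NLTK's English list is much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "on", "had", "with", "to", "over"}

def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)

sample = ("Sales of the Series 9 and Ultra 2 watches had been halted in the US "
          "over a patent row. The trade body had barred imports and sales of "
          "Apple watches.")

bow = bag_of_words(sample)
print(bow["sales"], bow["watches"], bow["apple"])  # 2 2 1
print("the" in bow)                                # False
```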

Latent Dirichlet Allocation

Latent Dirichlet Allocation, or LDA for short, is a popular probabilistic model used in NLP and machine learning for topic modeling (using algorithms to identify topics). It is based on the assumption that documents are mixtures of topics, and topics are mixtures of words.

Simply put, LDA is an NLP technique used to identify the topic to which a document belongs based on the words contained in the document.

LDA operates on the bag-of-words representation of documents, where each document is represented as a vector of word frequencies. You can implement LDA using the Gensim library in Python (which is an open source library used for topic modelling and document similarity analysis).

Steps for implementing LDA include:

Import Libraries: First step is to import the necessary libraries you will be utilizing.
Data Preparation: Convert raw data to a document format then tokenize, remove stop words, and optionally perform stemming or lemmatization.
Create Dictionary and Corpus: Build a dictionary with unique word IDs. Then form a bag of words corpus representing document-word frequency.
Train LDA Model: Use the document-word frequency and dictionary to train the LDA model, setting the desired number of topics.
Print Topics: Explore and print the discovered topics.

# Import the necessary libraries
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

article = "Apple's latest smart watches can resume being sold in the US after the tech company filed an emergency appeal with authorities. \
Sales of the Series 9 and Ultra 2 watches had been halted in the US over a patent row. \
The US's trade body had barred imports and sales of Apple watches with technology for reading blood-oxygen level. \
Device maker Masimo had accused Apple of poaching its staff and technology. \
It comes after the White House declined to overturn a ban on sales and imports of the Series 9 and Ultra 2 watches which came into effect this week. \
Apple had said it strongly disagrees with the ruling. \
The iPhone maker made an emergency request to the US Court of Appeals, which proved successful in getting the ban lifted."

The above lines of code include the necessary libraries that we'll use to implement the LDA.

The first line imports the Dictionary class. The second line imports the LDA model, and the third line imports sent_tokenize, which we'll use to split the article into documents (one per sentence), and word_tokenize, which will tokenize each document into individual words. Lastly, we import the stopwords corpus.

# load the English stop words
english_stopwords = stopwords.words("english")

# convert article to documents
documents = sent_tokenize(article)

# tokenize and normalize the document
tokenized_words = [word_tokenize(doc.lower()) for doc in documents]

# remove stop words and keep only alphanumeric tokens (words and numbers)
cleaned_token = [[word for word in sentence if word not in english_stopwords and word.isalnum()]
                 for sentence in tokenized_words]

# create a dictionary
dictionary = Dictionary(cleaned_token)

# Create a corpus from the document
corpus = [dictionary.doc2bow(text) for text in cleaned_token]

The above lines of code include the preprocessing steps that will be performed on the article, including converting the article to a document, normalizing, and tokenizing the document into individual words.

The next part removes stopwords from the text and then extracts words and numbers from the document. After that, we create a dictionary, which is a map between each word and its numerical identifier. The last line of code then creates a corpus of the document.
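To see what the dictionary and the bag-of-words corpus contain, here is a plain-Python sketch of the same idea. The `build_dictionary` and `doc_to_bow` helpers are hypothetical stand-ins written for this example, not Gensim's implementation, and Gensim may assign the integer IDs in a different order.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign an integer ID to each unique token, in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def doc_to_bow(doc, token_to_id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token_to_id[token] for token in doc if token in token_to_id)
    return sorted(counts.items())

# Two tiny, made-up documents that are already tokenized and cleaned.
docs = [["apple", "watches", "us"], ["watches", "ban", "watches"]]
token_to_id = build_dictionary(docs)
corpus = [doc_to_bow(doc, token_to_id) for doc in docs]

print(token_to_id)  # {'apple': 0, 'watches': 1, 'us': 2, 'ban': 3}
print(corpus)       # [[(0, 1), (1, 1), (2, 1)], [(1, 2), (3, 1)]]
```

Each document becomes a list of (word ID, count) pairs, which is the document-word frequency representation that the LDA model is trained on.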

# Build the LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3)

# Print the topics
print("Identified Topics:")
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx + 1}: {topic}")

The above code trains the model on the corpus and then prints the three topics identified in the article.

Below is the output of the LDA Model:

Identified Topics:
Topic 1: 0.045*"9" + 0.045*"ultra" + 0.044*"sales" + 0.044*"2" + 0.043*"series" + 0.043*"watches" + 0.029*"apple" + 0.028*"ruling" + 0.028*"disagrees" + 0.028*"said"
Topic 2: 0.051*"maker" + 0.035*"ban" + 0.035*"us" + 0.031*"emergency" + 0.031*"made" + 0.031*"successful" + 0.031*"court" + 0.031*"lifted" + 0.031*"request" + 0.031*"proved"
Topic 3: 0.055*"apple" + 0.054*"us" + 0.054*"watches" + 0.031*"sales" + 0.031*"technology" + 0.031*"imports" + 0.031*"authorities" + 0.031*"barred" + 0.031*"appeal" + 0.031*"filed"

The LDA technique shows some improvement compared to the BoW method. We can obtain a bit more information: the article is about a ban related to Apple Ultra and Series watches in the US.

Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF), just like LDA, is another topic modeling technique that uncovers latent topics in a collection of documents.

But instead of relying on BoW, it relies on the Term Frequency-Inverse Document Frequency (TF-IDF) representation to capture and retrieve hidden themes or topics from the documents.

By incorporating TF-IDF information, NMF is able to weigh the importance of terms, thereby identifying more hidden patterns. You can perform NMF using the Scikit-learn library.
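To show how TF-IDF weighs terms, the sketch below computes the textbook version, term frequency multiplied by log(N / document frequency), using only the standard library. The `tf_idf` helper and the three tiny documents are made up for illustration. Scikit-learn's TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights

# Three tiny, made-up documents that are already tokenized.
docs = [
    ["apple", "watches", "ban"],
    ["apple", "masimo", "accused"],
    ["apple", "watches", "sales"],
]
weights = tf_idf(docs)

# "apple" appears in every document, so log(3 / 3) = 0 and its weight is 0.
print(weights[1]["apple"])  # 0.0
# "masimo" appears in only one document, so it outweighs "apple" there.
print(weights[1]["masimo"] > weights[1]["apple"])  # True
```

This is why a TF-IDF-based method can surface rarer but more distinctive terms, while very common terms are pushed toward zero.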

Steps for performing NMF

Import necessary libraries
Data Preparation: Convert the text into documents, then perform the necessary data preparation, such as removing stop words. The TF-IDF vectorizer in Scikit-learn has an argument (stop_words) that does this.
Convert the documents to a TF-IDF matrix using the TF-IDF vectorizer in Scikit-learn
Apply the NMF function on the TF-IDF matrix and specify the number of topics you want and the number of words in each topic
Lastly, interpret your result.

# import the necessary libraries
from nltk import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

article = "Apple's latest smart watches can resume being sold in the US after the tech company filed an emergency appeal with authorities. \
Sales of the Series 9 and Ultra 2 watches had been halted in the US over a patent row. \
The US's trade body had barred imports and sales of Apple watches with technology for reading blood-oxygen level. \
Device maker Masimo had accused Apple of poaching its staff and technology. \
It comes after the White House declined to overturn a ban on sales and imports of the Series 9 and Ultra 2 watches which came into effect this week. \
Apple had said it strongly disagrees with the ruling. \
The iPhone maker made an emergency request to the US Court of Appeals, which proved successful in getting the ban lifted."

The above code contains the libraries that we'll use to implement NMF and the article itself.

# convert article to documents
documents = sent_tokenize(article)

# Create a TF-IDF vectorizer and build the TF-IDF matrix
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)

# Apply NMF
num_topics = 5  # Set the number of topics you want to identify
nmf_model = NMF(n_components=num_topics, init='random', random_state=42)
nmf_matrix = nmf_model.fit_transform(tfidf)

The above code converts the article into documents. Then it creates a Term-Frequency Inverse Document Frequency matrix of the article document. The last three lines of code then define the number of topics and create the topics from the document matrix using the NMF.
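The listing above stops before printing the topics. One possible way to read the top words out of a fitted model is sketched below. The `top_words_per_topic` helper is hypothetical, and the small weight matrix stands in for the model's components_ attribute so that the example runs on its own.

```python
def top_words_per_topic(components, feature_names, n_words=10):
    """For each topic (a row of term weights), return its n highest-weighted words."""
    topics = []
    for weights in components:
        # Indices of the terms, sorted from highest to lowest weight.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([feature_names[i] for i in ranked[:n_words]])
    return topics

# Made-up 2-topic, 4-term weight matrix standing in for nmf_model.components_.
feature_names = ["apple", "ban", "masimo", "watches"]
components = [
    [0.1, 0.0, 0.9, 0.2],
    [0.3, 0.8, 0.0, 0.5],
]

topics = top_words_per_topic(components, feature_names, n_words=2)
for number, words in enumerate(topics, start=1):
    print(f"Topic #{number}: {', '.join(words)}")
# Topic #1: masimo, watches
# Topic #2: ban, watches
```

With a fitted Scikit-learn model, you would pass nmf_model.components_ as the first argument and the names returned by the fitted vectorizer's get_feature_names_out() method as the second.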

Below is the output of the NMF Model:

Topic #1: ultra, series, sales, watches, row, halted, patent, white, house, effect
Topic #2: lifted, court, iphone, getting, request, successful, proved, appeals, ban, maker
Topic #3: disagrees, strongly, ruling, said, apple, body, blood, level, trade, oxygen
Topic #4: filed, resume, appeal, latest, tech, authorities, sold, smart, company, emergency
Topic #5: technology, apple, accused, masimo, device, staff, poaching, maker, trade, level

You can see that NMF reveals more insights concerning the themes of the document. For example, you can tell that another company called Masimo is accusing Apple of a patent infringement in their Ultra series watches.

How to Choose Which Technique to Use?

I recommend experimenting with all the approaches in order to gain different perspectives concerning the contents of your document.

Bag of Words and LDA are based on how frequently words occur, making these techniques useful for inferring the biggest/most general themes about the document.

On the other hand, when using NMF, which is based on TF-IDF, less frequent words can be used to infer additional topics and provide a different perspective on the document.

For example, NMF was able to identify key terms like "Masimo" and "accused," whereas LDA was not able to do this. So depending on your needs, go ahead and experiment with all the approaches to see which one is able to yield better results.

Conclusion

In this article, you've learned about topic identification and how you can use it to extract themes or topics from a large document.

We covered several techniques you can use to identify topics, including simple ones like BoW and more advanced ones like LDA and NMF.

Happy learning, and see you in the next one.

How to Write Common Date Functions in SQL with Examples

Ibrahim Ogunbiyi — Mon, 13 Mar 2023 16:49:25 +0000

When querying data from a database, you will frequently encounter the date datatype. Depending on what you want to achieve, you may need to extract subset information from the date column, perform some operation, and so on.

SQL provides a variety of date functions that can assist you with your task. In this tutorial, we will look at various common date functions in SQL and some examples to show how they work. Without further ado let's get started.

Note: There are numerous SQL flavors available, and the functions for completing a specific task may differ between flavors. This tutorial will concentrate on three of the most popular SQL flavors: PostgreSQL, MySQL, and SQL Server. We will start with PostgreSQL functions and then present the variants of the other flavors if they differ from PostgreSQL.

Date Data types

Date data types are built-in SQL data types that you use to store date values. Most database management systems display a date in the format YYYY-MM-DD (for example 2022-01-01), and a timestamp, which adds the time of day, in the format YYYY-MM-DD HH:MM:SS (for example 2022-01-01 10:08:56).

Before we get started, here is the table we'll use to demonstrate the functions covered in this article. You can create it using the following query. Note that the SQL flavor used here is PostgreSQL.

DROP TABLE IF EXISTS student;

CREATE TABLE student (
  student_id SERIAL PRIMARY KEY,
  student_name VARCHAR(30),
  admitted_date DATE
);

INSERT INTO student VALUES (11, 'Ibrahim', '2012-10-01');
INSERT INTO student VALUES (7, 'Taiwo', '2013-12-01');
INSERT INTO student VALUES (9, 'Nurain', '2012-11-21');
INSERT INTO student VALUES (8, 'Joel', '2012-10-31');
INSERT INTO student VALUES (10, 'Mustapha', '2015-11-01');
INSERT INTO student VALUES (5, 'Muritadoh', '2011-09-01');
INSERT INTO student VALUES (2, 'Yusuf', '2022-05-03');
INSERT INTO student VALUES (3, 'Habeebah', '2012-11-01');
INSERT INTO student VALUES (1, 'Tomiwa', '2013-04-01');
INSERT INTO student VALUES (4, 'Gbadebo', '2008-10-01');
INSERT INTO student VALUES (12, 'Tolu', '2009-11-21');


SELECT * FROM student;

Common SQL Date Functions

Let's look at the common date functions you will work with on a daily basis.

How to use the `Now()` function

You use the NOW() function to return the current timestamp (date + time) of the computer system that hosts the database management system. In PostgreSQL, the result also includes the time zone, as shown below.

SELECT NOW();

MySQL uses the same function as PostgreSQL to get the current timestamp: Now(). But in SQL Server, you use the function CURRENT_TIMESTAMP.

How to use the `current_date` function

This function, as the name implies, gets the current date of the computer system on which the SQL database is running. When retrieving the current date in PostgreSQL, you do not need to use parentheses, as you can see below:

SELECT current_date;

In MySQL, you use the [CURDATE](https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html#function_curdate)() function to get the current date. In SQL Server, [GETDATE](https://learn.microsoft.com/en-us/sql/t-sql/functions/getdate-transact-sql?view=sql-server-ver16)() returns the current date and time, so you cast the result to keep only the date: CAST(GETDATE() AS DATE).

How to use the `Extract()` or `Date_Part()` functions

You use the Extract or date part functions to extract a certain part or unit of a date or date column.

Let's start with the Extract function. Its syntax looks like this:

EXTRACT(unit FROM date/date_column)

The unit part of the Extract function is a unit you can extract from a date, such as DAY, WEEK, YEAR, QUARTER, and so on. Click here to see the list of units that you can extract from a date or date column in SQL.

Say, for instance, that you wish to extract the year each student was admitted from the admitted_date column of the student table we created earlier. You can achieve that using the EXTRACT() function as shown below.

SELECT 
    *,
    EXTRACT(YEAR FROM admitted_date) As "Year of Admission"
FROM student;

The EXTRACT() function is available in PostgreSQL and MySQL, and it works similarly in both. SQL Server does not have EXTRACT(). Instead, it has DATEPART(), and PostgreSQL has a closely related function called DATE_PART(). Let's look at how these two functions work.

The PostgreSQL function differs from the SQL Server one in that its name has an underscore between date and part. You also need to pass the unit in single quotes, as shown below:

SELECT DATE_PART('Year', admitted_date)
FROM student;

For SQLServer there won't be any underscore between the date and part, and the unit will not be enclosed in single quotes. For example the above result can be generated in SQLServer as shown below.

SELECT DATEPART(YEAR, admitted_date)
FROM student;
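
If you want to sanity-check what EXTRACT() or DATE_PART() should return without a database at hand, here is a minimal sketch in Python using only the standard datetime module. The dictionary below is an assumption made for illustration (it copies three rows of the student table by hand), and the quarter formula is the usual calendar-quarter calculation rather than anything taken from a specific database.

```python
from datetime import date

# Three rows copied by hand from the student table (illustrative only)
admitted_dates = {
    "Ibrahim": date(2012, 10, 1),
    "Taiwo": date(2013, 12, 1),
    "Yusuf": date(2022, 5, 3),
}

# Counterpart of EXTRACT(YEAR FROM admitted_date)
years = {name: d.year for name, d in admitted_dates.items()}

# Counterpart of EXTRACT(QUARTER FROM admitted_date): months 1-3 -> 1, 4-6 -> 2, ...
quarters = {name: (d.month - 1) // 3 + 1 for name, d in admitted_dates.items()}

print(years)     # {'Ibrahim': 2012, 'Taiwo': 2013, 'Yusuf': 2022}
print(quarters)  # {'Ibrahim': 4, 'Taiwo': 4, 'Yusuf': 2}
```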

How to add intervals or parts to dates

Intervals are units that you can add to a date – for example a days interval, time interval, and so on.

For example, say you want to add 1 day interval to all the dates in a particular table. In PostgreSQL there is no dedicated function you can use to add an interval to a particular date. Instead, you can do this using arithmetic operations.

The syntax for achieving that is shown below:

SELECT date/date_column + INTERVAL '# unit'

Where # is an integer such as 3, 4, and so on, and unit can be Days, Year, and so on. Click here for a list of units that can be passed as an interval.

Say, for instance, that you want to add an interval of 3 days to the admitted_date column in the student table. You can do this in PostgreSQL using the following query:

SELECT 
    *,
    admitted_date + INTERVAL '3 Days' AS "3_daysadded"
FROM student;

Now that you've seen how to add intervals to dates in PostgreSQL, let's see how it is done in MySQL and SQLServer. In MySQL and SQLServer there are functions that you can use to add intervals to dates.

In MySQL, the function is called DATE_ADD() and the syntax is shown below:

DATE_ADD(date/date_column, INTERVAL value unit)

For example, you can get the above table using MySQL by typing the following code:

SELECT *,
    DATE_ADD(admitted_date, INTERVAL 3 DAY) AS "3_daysadded"
FROM student;

In SQLServer, the function you use is similar to the one in MySQL but with a small difference. The syntax for the function used is shown below:

DATEADD (datepart/unit , number , date/date_column)

You can replicate the above table in SQLServer like this:

SELECT *,
    DATEADD (day , 3 , admitted_date) AS "3_daysadded"
FROM student;

How to subtract intervals from dates

Subtracting intervals from dates in PostgreSQL works like adding intervals, except that the operator changes from plus to minus. For example, say you want to subtract 3 days from the admitted_date column. You can do this using the below code:

SELECT 
    *,
    admitted_date - INTERVAL '3 Days' AS "3_dayssubtracted"
FROM student;

In MySQL, you use the DATE_SUB() function to subtract intervals from a date. You can replicate the above table in MySQL using the following query:

SELECT *,
    DATE_SUB(admitted_date, INTERVAL 3 DAY) AS "3_dayssubtracted"
FROM student;

In SQLServer, you still use the DATEADD function, but instead of specifying a positive value in the function parameter, you use a negative value. It looks like this:

SELECT *,
    DATEADD (day , -3 , admitted_date) AS "3_dayssubtracted"
FROM student;
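
To check the expected results of adding or subtracting a 3-day interval by hand, here is a small sketch using Python's standard datetime module. It is only a cross-check of the date arithmetic, not a substitute for the SQL functions above, and the single date used is Joel's admitted_date copied from the student table.

```python
from datetime import date, timedelta

admitted = date(2012, 10, 31)   # Joel's admitted_date from the student table
three_days = timedelta(days=3)  # counterpart of INTERVAL '3 Days'

added = admitted + three_days       # crosses the month boundary
subtracted = admitted - three_days

print(added)       # 2012-11-03
print(subtracted)  # 2012-10-28
```

Note that in PostgreSQL, adding an interval to a DATE returns a timestamp, so the database output also shows a time part of 00:00:00.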

How to subtract two dates

PostgreSQL also has no dedicated function for subtracting two dates. But you can use arithmetic operators to achieve your desired result.

SELECT '2012-10-31'::date -'2012-05-01'::date AS days;

In MySQL, there is a function called [DATEDIFF](https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html#function_datediff)() that you can use to achieve this. SQL Server also has a function named [DATEDIFF](https://learn.microsoft.com/en-us/sql/t-sql/functions/datediff-transact-sql?view=sql-server-ver16)(), although it takes the unit as an extra first argument. Click here to learn more about it.
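
As a quick cross-check of the PostgreSQL subtraction above, the same two dates can be subtracted in Python with the standard datetime module. This is only a sketch to confirm the expected number of days.

```python
from datetime import date

# Same dates as in the PostgreSQL query above
days = (date(2012, 10, 31) - date(2012, 5, 1)).days

print(days)  # 183
```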

Conclusion

In this tutorial you've learned some common date functions you will use when working with dates in SQL.

You learned how to get the current timestamp, get the current date, extract parts from a date, and how to add or subtract dates. Also you learned how each date function differs across different SQL flavors.

Thank you for reading. You can check out the resources below to learn more about date functions across the three SQL flavors discussed in this article.

How to Use Window Functions in SQL – with Example Queries

Ibrahim Ogunbiyi — Thu, 09 Feb 2023 21:47:41 +0000

Window functions are an advanced type of function in SQL. They let you work with observations more easily.

Window functions give you access to features like advanced analytics and data manipulation without the need to write complex queries.

In this lesson you will learn about what window functions are and how they work. Without further ado let's get started.

What is a Window Function?

Before learning exactly what a window function is, let's define the meaning of a term that will appear frequently in this article: result set.

In SQL, a result set is the data or result that is returned from a query. That is, it's the result (table) of running the code of a select statement.

For you to understand what a window function is, let's break the words down into pieces.

What exactly is a window in SQL?

A window is basically a set of rows or observations in a table or result set. In a table you may have more than one window depending on how you specify the query – you will learn about this shortly. A window is defined using the OVER() clause in SQL.

You will learn how to determine the number of windows in a result set later in this article.

What is a Function?

Functions are predefined in SQL and you use them to perform operations on data. They let you do things like aggregating data, formatting strings, extracting dates, and so on.

So window functions are SQL functions that enable us to perform operations on a window – that is, a set of records.

The interesting thing about window functions is that with them you can specify the windows you want to apply the function on. For example, we can partition the full result set into various groups/windows.

Before we go into the syntax of Window functions, let's have a look at the categories of window functions.

Different Types of Window Functions

There are a lot of window functions that exist in SQL but they are primarily categorized into 3 different types:

Aggregate window functions
Value window functions
Ranking window functions

Aggregate window functions are used to perform operations on sets of rows in a window(s). They include SUM(), MAX(), COUNT(), and others.

Rank window functions are used to rank rows in a window(s). They include RANK(), DENSE_RANK(), ROW_NUMBER(), and others.

Value window functions return a value taken from another row in the window (for example the previous, next, or first row) rather than a summary of all the rows, which is what distinguishes them from aggregate functions. They include LAG(), LEAD(), FIRST_VALUE(), and others. We will see their usefulness later in the article.

Sample Table

In this tutorial you will be working with a table called student_score which contains data such as student_id, student_name, dep_name and score.

You can create the table using the following code:

DROP TABLE IF EXISTS student_score;

CREATE TABLE student_score (
  student_id SERIAL PRIMARY KEY,
  student_name VARCHAR(30),
  dep_name VARCHAR(40),
  score INT
);

INSERT INTO student_score VALUES (11, 'Ibrahim', 'Computer Science', 80);
INSERT INTO student_score VALUES (7, 'Taiwo', 'Microbiology', 76);
INSERT INTO student_score VALUES (9, 'Nurain', 'Biochemistry', 80);
INSERT INTO student_score VALUES (8, 'Joel', 'Computer Science', 90);
INSERT INTO student_score VALUES (10, 'Mustapha', 'Industrial Chemistry', 78);
INSERT INTO student_score VALUES (5, 'Muritadoh', 'Biochemistry', 85);
INSERT INTO student_score VALUES (2, 'Yusuf', 'Biochemistry', 70);
INSERT INTO student_score VALUES (3, 'Habeebah', 'Microbiology', 80);
INSERT INTO student_score VALUES (1, 'Tomiwa', 'Microbiology', 65);
INSERT INTO student_score VALUES (4, 'Gbadebo', 'Computer Science', 80);
INSERT INTO student_score VALUES (12, 'Tolu', 'Computer Science', 67);

Syntax for Window Functions

In a simple expression, a window function looks like this:

function(expression|column) OVER(
    [ PARTITION BY expr_list optional]
    [ ORDER BY order_list optional]
)

Let's go over the syntax piece by piece:

function(expression|column) is the window function such as SUM() or RANK().

OVER() specifies that the function before it is a window function not an ordinary one. So when the SQL engine sees the over clause it will know that the function before the over clause is a window function.

The OVER() clause has some parameters which are optional depending on what you want to achieve. The first one being PARTITION BY.

The PARTITION BY divides the result set into different partitions/windows. For example if you specify the PARTITION BY clause by a column(s) then the result-set will be divided into different windows of the value of that column(s).

The expr_list in the PARTITION BY clause is:

expression | column_name [, expr_list ]

Which means that the PARTITION BY can take an expression, a column, or more than one expression or column separated by commas. For example PARTITION BY column1, column2.

The next parameter ORDER BY is used to sort the observations in a window. The ORDER BY clause takes order_list which is:

expression | column_name [ ASC | DESC ]
[ NULLS FIRST | NULLS LAST ][, order_list ]

where order_list can be an expression or column name. You can also specify the sort order (either ascending or descending) and whether null values are sorted first or last. The ORDER BY can also take several expressions or column names.

As stated earlier, the OVER() clause is used to specify the window in a result set. One thing to note is that if no parameter is specified in the OVER() clause, the whole result set is treated as a single window.

You use the PARTITION BY parameter to split the result set into windows, and the ORDER BY parameter to sort the rows inside each window. Let's go over an example.

How to Use a Window Function – Example

Let's go over an example of how to use a window function. Say for instance you want to compare the minimum score and maximum score from all the records in the table we created earlier. You can do that using a window function as shown below.

Remember that not specifying a partition clause in the OVER clause makes the window span the entire dataset.

SELECT 
    *,
    MAX(score) OVER() AS maximum_score,
    MIN(score) OVER() AS minimum_score

FROM student_score;

As you can see, we have the minimum and maximum score across the entire dataset.

Table showing result of window function

Also, note that the above query can be also achieved using subqueries like this:

SELECT *,
    (SELECT MAX(score) FROM student_score) AS maximum_score,
    (SELECT MIN(score) FROM student_score) AS minimum_score
FROM student_score;

As you can see, the window function is easier to comprehend compared to the subquery method which looks a bit more advanced.
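
If you'd like to confirm that the two queries really return the same result, here is a minimal sketch that runs both against an in-memory database using Python's built-in sqlite3 module. It assumes SQLite 3.25 or newer (the first release with window functions). SQLite is not one of the flavors used in this article, so the column types in the CREATE TABLE statement are adapted for it.

```python
import sqlite3

# Same rows as the student_score table above
ROWS = [
    (11, "Ibrahim", "Computer Science", 80), (7, "Taiwo", "Microbiology", 76),
    (9, "Nurain", "Biochemistry", 80), (8, "Joel", "Computer Science", 90),
    (10, "Mustapha", "Industrial Chemistry", 78), (5, "Muritadoh", "Biochemistry", 85),
    (2, "Yusuf", "Biochemistry", 70), (3, "Habeebah", "Microbiology", 80),
    (1, "Tomiwa", "Microbiology", 65), (4, "Gbadebo", "Computer Science", 80),
    (12, "Tolu", "Computer Science", 67),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_score ("
    "student_id INTEGER PRIMARY KEY, student_name TEXT, dep_name TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO student_score VALUES (?, ?, ?, ?)", ROWS)

window_sql = """
    SELECT student_id,
           MAX(score) OVER () AS maximum_score,
           MIN(score) OVER () AS minimum_score
    FROM student_score
    ORDER BY student_id
"""
subquery_sql = """
    SELECT student_id,
           (SELECT MAX(score) FROM student_score) AS maximum_score,
           (SELECT MIN(score) FROM student_score) AS minimum_score
    FROM student_score
    ORDER BY student_id
"""

window_rows = conn.execute(window_sql).fetchall()
subquery_rows = conn.execute(subquery_sql).fetchall()

print(window_rows[0])                # (1, 90, 65)
print(window_rows == subquery_rows)  # True
```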

How to Use a Window Function with `PARTITION BY`

Say, for instance, that you want to split the dataset into different partitions. Then you want to compare each record in each partition with an aggregate value or a calculated value of each partition. You can specify the PARTITION BY clause in the OVER clause.

For example, say you want to compare the maximum score and average score in each department with the individual score. You can do this by specifying the PARTITION BY clause in the OVER statement and also use it with the aggregate function you want to use to achieve your desired result.

SELECT 
    *,
    MAX(score)OVER(PARTITION BY dep_name) AS dep_maximum_score,
    ROUND(AVG(score)OVER(PARTITION BY dep_name), 2) AS dep_average_score
FROM student_score;

You can see that the PARTITION BY clause specified in the OVER() clause split the result set into 4 different partitions. This is because there are 4 different departments in the dep_name column (which are Biochemistry, Computer Science, Industrial Chemistry, and Microbiology).

Now after the PARTITION BY clause, you can then calculate the aggregate function for each record in the different departments.

You can see from the above image that the aggregate functions MAX() and AVG() are calculated for each partition.
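
To see the per-department values without relying on a screenshot, here is a hedged sketch that runs the same PARTITION BY query through Python's built-in sqlite3 module. It assumes SQLite 3.25 or newer, which is not one of the flavors covered in this article, so treat it as a cross-check rather than as PostgreSQL output.

```python
import sqlite3

# Same rows as the student_score table above
ROWS = [
    (11, "Ibrahim", "Computer Science", 80), (7, "Taiwo", "Microbiology", 76),
    (9, "Nurain", "Biochemistry", 80), (8, "Joel", "Computer Science", 90),
    (10, "Mustapha", "Industrial Chemistry", 78), (5, "Muritadoh", "Biochemistry", 85),
    (2, "Yusuf", "Biochemistry", 70), (3, "Habeebah", "Microbiology", 80),
    (1, "Tomiwa", "Microbiology", 65), (4, "Gbadebo", "Computer Science", 80),
    (12, "Tolu", "Computer Science", 67),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_score ("
    "student_id INTEGER PRIMARY KEY, student_name TEXT, dep_name TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO student_score VALUES (?, ?, ?, ?)", ROWS)

sql = """
    SELECT student_name, dep_name, score,
           MAX(score) OVER (PARTITION BY dep_name) AS dep_maximum_score,
           ROUND(AVG(score) OVER (PARTITION BY dep_name), 2) AS dep_average_score
    FROM student_score
"""
rows = conn.execute(sql).fetchall()

# Every row in a partition carries the same aggregate values,
# so collapsing by department shows one (max, average) pair per window.
per_department = {dep: (dep_max, dep_avg) for _, dep, _, dep_max, dep_avg in rows}
for dep, stats in sorted(per_department.items()):
    print(dep, stats)
```

The query still returns all 11 rows. Only the aggregate columns repeat within each of the 4 departments.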

Other Examples of Window Functions

Let's go over some of the common window functions you will work with in SQL.

How to Use the `ROW_NUMBER` Function

You use ROW_NUMBER() to assign serial numbers to records in a window. Say we want to assign serial numbers to the records in a partition. For example, we want to add row numbers to the dataset based on their names in alphabetical order. You can do that using the following code:

SELECT
    *,
    ROW_NUMBER() OVER(ORDER BY student_name) AS name_serial_number
FROM student_score;

As you can see from the above image, the student_name with the smallest value (that is, the one that falls earliest in the alphabet) is Gbadebo since it starts with G. Then 1 is added as its row number which is followed by the name that begins with H, and so on.
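
Here is a minimal sketch of the same ROW_NUMBER() query run through Python's built-in sqlite3 module, in case you want to reproduce the numbering yourself. It assumes SQLite 3.25 or newer and SQLite's default text ordering, which matches alphabetical order here because every name starts with a capital letter.

```python
import sqlite3

# Same rows as the student_score table above
ROWS = [
    (11, "Ibrahim", "Computer Science", 80), (7, "Taiwo", "Microbiology", 76),
    (9, "Nurain", "Biochemistry", 80), (8, "Joel", "Computer Science", 90),
    (10, "Mustapha", "Industrial Chemistry", 78), (5, "Muritadoh", "Biochemistry", 85),
    (2, "Yusuf", "Biochemistry", 70), (3, "Habeebah", "Microbiology", 80),
    (1, "Tomiwa", "Microbiology", 65), (4, "Gbadebo", "Computer Science", 80),
    (12, "Tolu", "Computer Science", 67),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_score ("
    "student_id INTEGER PRIMARY KEY, student_name TEXT, dep_name TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO student_score VALUES (?, ?, ?, ?)", ROWS)

sql = """
    SELECT student_name,
           ROW_NUMBER() OVER (ORDER BY student_name) AS name_serial_number
    FROM student_score
    ORDER BY name_serial_number
"""
numbered = conn.execute(sql).fetchall()

for name, serial in numbered[:3]:
    print(serial, name)
# 1 Gbadebo
# 2 Habeebah
# 3 Ibrahim
```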

How to Use the `RANK` Function

RANK(), as the name implies, lets you rank observations in a window but with gaps. Let's see what this means:

SELECT
    *,
    RANK()OVER(PARTITION BY dep_name ORDER BY score DESC)    
FROM student_score;

As you can see in the above code, the result set was partitioned into different windows based on the department column. Then we used the ORDER BY clause to sort the student records based on their score in descending order in each partition. After that, we applied the RANK function.

Now concerning the gaps: two records in the Computer Science department have the same score (80), so both are ranked 2 (instead of one being ranked 2 and the other 3). Because two rows share rank 2, RANK() skips rank 3 and gives the next record a rank of 4. That skipped number is the gap.

You can avoid this scenario using another window function called DENSE_RANK that ranks observations in a window without these gaps.

How to Use the `DENSE_RANK` Function

DENSE_RANK is similar to RANK except that it ranks observations in a window without gaps.

SELECT
    *,
    DENSE_RANK()OVER(PARTITION BY dep_name ORDER BY score DESC)    
FROM student_score;

As you can see in the output above, when using DENSE_RANK, the next rank number (which is 3) was assigned to Tolu (unlike when using RANK which assigned Tolu a rank of 4, skipping 3 because of the tie).
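
To compare the two functions side by side, here is a hedged sketch that computes RANK() and DENSE_RANK() in one query for the Computer Science department, using Python's built-in sqlite3 module. It assumes SQLite 3.25 or newer. The aliases rnk and dense_rnk are names chosen for this illustration.

```python
import sqlite3

# Same rows as the student_score table above
ROWS = [
    (11, "Ibrahim", "Computer Science", 80), (7, "Taiwo", "Microbiology", 76),
    (9, "Nurain", "Biochemistry", 80), (8, "Joel", "Computer Science", 90),
    (10, "Mustapha", "Industrial Chemistry", 78), (5, "Muritadoh", "Biochemistry", 85),
    (2, "Yusuf", "Biochemistry", 70), (3, "Habeebah", "Microbiology", 80),
    (1, "Tomiwa", "Microbiology", 65), (4, "Gbadebo", "Computer Science", 80),
    (12, "Tolu", "Computer Science", 67),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_score ("
    "student_id INTEGER PRIMARY KEY, student_name TEXT, dep_name TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO student_score VALUES (?, ?, ?, ?)", ROWS)

sql = """
    SELECT student_name, score,
           RANK() OVER (PARTITION BY dep_name ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY dep_name ORDER BY score DESC) AS dense_rnk
    FROM student_score
    WHERE dep_name = 'Computer Science'
    ORDER BY score DESC, student_name
"""
ranked = conn.execute(sql).fetchall()

for row in ranked:
    print(row)
# ('Joel', 90, 1, 1)
# ('Gbadebo', 80, 2, 2)
# ('Ibrahim', 80, 2, 2)
# ('Tolu', 67, 4, 3)
```

The tie at 80 gets rank 2 from both functions. The difference only shows on the row after the tie.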

How to Use the `LAG` Function

LAG returns the value from a row at a given offset before the current row within a window. By default the offset is 1, so it returns the value from the row immediately before the current row.

You typically use LAG when you want to compare the value of a previous row with the current row. It's commonly applied in time-series analysis. For example:

SELECT
    *,
    LAG(score) OVER(PARTITION BY dep_name ORDER BY score)    
FROM student_score;

As shown in the first partition, the first record in the biochemistry partition (Yusuf's) does not have a previous value (that is, no record comes before it) so that's why null was returned. Then moving to the next record – Nurain's – it has a previous record, so it returns the previous value which is 70.
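
The Biochemistry partition can be checked with a short sketch that runs the LAG() query through Python's built-in sqlite3 module. It assumes SQLite 3.25 or newer, and the alias previous_score is a name chosen for this illustration. SQL NULL comes back as None in Python.

```python
import sqlite3

# Same rows as the student_score table above
ROWS = [
    (11, "Ibrahim", "Computer Science", 80), (7, "Taiwo", "Microbiology", 76),
    (9, "Nurain", "Biochemistry", 80), (8, "Joel", "Computer Science", 90),
    (10, "Mustapha", "Industrial Chemistry", 78), (5, "Muritadoh", "Biochemistry", 85),
    (2, "Yusuf", "Biochemistry", 70), (3, "Habeebah", "Microbiology", 80),
    (1, "Tomiwa", "Microbiology", 65), (4, "Gbadebo", "Computer Science", 80),
    (12, "Tolu", "Computer Science", 67),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_score ("
    "student_id INTEGER PRIMARY KEY, student_name TEXT, dep_name TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO student_score VALUES (?, ?, ?, ?)", ROWS)

sql = """
    SELECT student_name, score,
           LAG(score) OVER (PARTITION BY dep_name ORDER BY score) AS previous_score
    FROM student_score
    WHERE dep_name = 'Biochemistry'
    ORDER BY score
"""
lagged = conn.execute(sql).fetchall()

for row in lagged:
    print(row)
# ('Yusuf', 70, None)
# ('Nurain', 80, 70)
# ('Muritadoh', 85, 80)
```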

How to Use the Frame Clause in `ORDER BY`

Now you've learned some common window functions you might work with on a daily basis. So let's move on to learning another key concept related to the ORDER BY clause called the frame clause.

A frame clause, as the name implies, provides the frame (that is, the set of rows in a window) on which the function is to be applied. You use it to specify which rows before or after the current row are included in the calculation along with the current row (the SQL engine processes rows one after the other).

Now before we look into how to specify a frame clause, let's look at some of the frame clause's assumptions:

First, a frame clause does not apply to ranking functions. The ranking function only ranks the observation in the window based on the ORDER BY clause.
When using an aggregate window function, you don't have to include the ORDER BY clause. But when you do, it's best practice to specify the frame clause as well so that you get accurate results. For example, say you want to use an aggregate window function and also order the observations in that window by a column: in that case, specify a frame clause. If you are not ordering the observations in the window, you don't need one.

You can specify a frame clause using two things – ROWS and RANGE. But in this part you will learn how to use the ROWS keyword since it is commonly used to specify a frame clause. The RANGE keyword is beyond the scope of this article.

The ROWS clause defines the frame in terms of physical row offsets from the current row. That is, it is used to specify the rows that will be used in conjunction with the current row for calculation.

For example the following frame clause ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING defines a frame that includes the current row, 1 row preceding it and 1 row following it.

Let's look at the keywords that you can use in conjunction with the ROWS clause:

N PRECEDING specifies the N rows before the current row that will be included in the calculation. For example, 3 PRECEDING means the 3 rows preceding the current row.
N FOLLOWING works like N PRECEDING, except in the opposite direction: it specifies the number of rows after the current row.
UNBOUNDED PRECEDING means all rows before the current row.
UNBOUNDED FOLLOWING means all rows after the current row.
CURRENT ROW is used to specify the current row.

For example, let's look at the below frame clause:

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW will use up to 2 rows before the current row, along with the current row, for the calculation.
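
To make the idea of a frame concrete, here is a small pure-Python sketch that lists which values fall inside a ROWS frame for each row of an already sorted list. The helper name rows_frame is hypothetical and exists only for this illustration. It mimics the behavior described above and is not how a database implements frames.

```python
def rows_frame(values, preceding, following=0):
    """Return the frame (a list of values) seen by each row.

    Mimics ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING
    over a list that is already in ORDER BY order.
    """
    frames = []
    for i in range(len(values)):
        start = max(0, i - preceding)              # frame cannot start before the first row
        end = min(len(values), i + following + 1)  # or run past the last row
        frames.append(values[start:end])
    return frames


scores = [65, 70, 80, 80, 85]

# ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
frames = rows_frame(scores, preceding=2)
print(frames)
# [[65], [65, 70], [65, 70, 80], [70, 80, 80], [80, 80, 85]]

# Applying SUM() to each frame gives a moving sum
moving_sums = [sum(frame) for frame in frames]
print(moving_sums)  # [65, 135, 215, 230, 245]
```

The first two rows have fewer than 2 rows before them, which is why their frames are shorter.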

Frame clause example

Let's look at an example. Say for instance you want to get the cumulative sum of all the student scores. You can do that by using a frame clause.

So first, to be able to do this, you need to first know the types of keywords you will specify in the frame clause.

Since you want to sum up all rows before the current row and the current row itself, you can use the UNBOUNDED PRECEDING keyword together with CURRENT ROW. Remember that UNBOUNDED PRECEDING covers all rows before the current row, and CURRENT ROW adds the current row itself.

So the code to achieve that task is shown below:

SELECT
    *,
    SUM(score)OVER(ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cummulative_sum
FROM student_score

Let's break down the window function code:

SUM(score)OVER(ORDER BY student_id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cummulative_sum

Firstly in the OVER() clause, we sort the entire window – which is the whole dataset – using the student id.

Then we specify the frame clause, which is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means that all rows before the current row, plus the current row itself, will be used for the calculation.

The result is shown in the below image:

The first row in the dataset does not have any row before it. But since we specified CURRENT ROW as the end of the frame, the SQL engine still includes that row's own score, so the sum equals 65.

Then moving to the second row. It has 1 row before it. So the SQL engine sums the score of the first row 65 with the current row which is 70. That is why the result is 135...and so on down the table.
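
Here is a hedged sketch that reproduces the running total with Python's built-in sqlite3 module, so you can follow the numbers row by row. It assumes SQLite 3.25 or newer, and the alias running_total is a name chosen for this illustration.

```python
import sqlite3

# Same rows as the student_score table above
ROWS = [
    (11, "Ibrahim", "Computer Science", 80), (7, "Taiwo", "Microbiology", 76),
    (9, "Nurain", "Biochemistry", 80), (8, "Joel", "Computer Science", 90),
    (10, "Mustapha", "Industrial Chemistry", 78), (5, "Muritadoh", "Biochemistry", 85),
    (2, "Yusuf", "Biochemistry", 70), (3, "Habeebah", "Microbiology", 80),
    (1, "Tomiwa", "Microbiology", 65), (4, "Gbadebo", "Computer Science", 80),
    (12, "Tolu", "Computer Science", 67),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_score ("
    "student_id INTEGER PRIMARY KEY, student_name TEXT, dep_name TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO student_score VALUES (?, ?, ?, ?)", ROWS)

sql = """
    SELECT student_id, score,
           SUM(score) OVER (
               ORDER BY student_id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM student_score
    ORDER BY student_id
"""
running = [total for _, _, total in conn.execute(sql)]

print(running)
# [65, 135, 215, 295, 380, 456, 546, 626, 704, 784, 851]
```

The last value, 851, is the sum of all 11 scores.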

When to Use a Window Function

You've learned what window functions are in this tutorial. Some practical cases where you can use them are:

When you want to compare an aggregate value in a window with individual records in that window.
When you want to do things like ranking, percentile, cumulative sum or running total, moving average, and so on.

Conclusion

In this tutorial, you've learned what window functions are, and you've also looked at some of the clauses you can add in window functions. One example is the PARTITION BY clause, which divides the result set into separate partitions or windows.

You also learned how to utilize the ORDER BY clause to order observations in a window and you saw various common examples of window functions.

Finally, you learned another advanced clause that you can use with window functions, the frame clause, which allows you to access more features of a window.

Thank you for reading all the way to the end. You can use the tutorial listed below to learn about more SQL window functions.

https://www.postgresql.org/docs/current/functions-window.html

What is Stratified Random Sampling? Definition and Python Example

Ibrahim Ogunbiyi — Tue, 15 Nov 2022 16:33:52 +0000

When we wish to conduct an experiment on a population – for example, the entire population of a country – it is not always practical or realistic to include every subject (citizen) in the experiment.

Instead, we rely on a sample, which is a subset of the population, and then draw conclusions about the population based on the sample's results.

Now, drawing a sample from a population is known as sampling, and the manner in which the sample is drawn (the sampling technique) is essential to the result.

There are a lot of sampling techniques out there, but in this tutorial we will look at one of them called stratified random sampling and how it works. Without further ado, let's get started.

What is Stratified Random Sampling?

Before we go into the details of stratified random sampling, let's break the term down into bits so we can grasp it better. Let's start with stratified.

In the context of sampling, stratified means splitting the population into smaller groups or strata based on a characteristic. To put it another way, you divide a population into groups based on their features.

Random sampling entails randomly selecting subjects (entities) from a population. Each subject has an equal probability of being chosen from the population to form a sample (subpopulation) of the overall population.

Stratified random sampling, therefore, is a sampling approach in which the population is separated into groups or strata based on a particular characteristic. Then subjects from each stratum (the singular of strata) are randomly sampled.

You divide the population into groups based on a characteristic and then choose a subject or entity at random from each group.

Types of Stratified Random Sampling

Stratified sampling is divided into two categories, which are:

Proportionate stratified random sampling.
Disproportionate stratified random sampling.

Proportionate stratified random sampling is a type of sampling in which the size of the random sample obtained from each stratum is proportionate to the size of the entire stratum's population.

In other words, each stratum makes up the same share of the sample as it does of the whole population. Consider the following example:

import pandas as pd

# Toy dataset of ten students
students = {
    "Name": ["Ibrahim", "Ganiyat", "Joel", "Elijah", "Yusuf", "Nurain",
             "Dayo", "David", "Olu", "Tobi"],
    "ID": ['001', '002', '003', '004', '005', '006', '007', '008', '009', '010'],
    "Grade": ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'A', 'B', 'A'],
    "Category": [1, 2, 2, 1, 3, 3, 1, 2, 3, 3]
}
df = pd.DataFrame(students)

The above dataframe contains students' names, IDs, grades, and categories. Assume we wish to stratify students based on their grade characteristics and sample 60% of students from each group. That means we will have three strata in the above dataframe, because we have three different grades.

We can sample it by typing the following:

df_sample = df.groupby("Grade", group_keys=False).apply(lambda x:x.sample(frac=0.6))

Now what we did above is to group the dataframe into different strata using the groupby() method. Then we passed in the Grade feature. For each group (stratum) we randomly sampled out 0.6(60%) of observation from it.

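Now if we look at the grade proportions for df_sample and df, we will see that they are approximately the same. With only ten rows, each stratum's sample size is rounded to a whole number, so the proportions cannot match exactly.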
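
To see where those proportions come from, here is a dependency-free sketch of proportionate stratified sampling using only Python's standard library. The function name stratified_sample is hypothetical, and the sketch assumes each stratum's sample size is round(frac * stratum size). It illustrates the idea and is not a description of how pandas implements groupby sampling.

```python
import random
from collections import Counter, defaultdict

names = ["Ibrahim", "Ganiyat", "Joel", "Elijah", "Yusuf",
         "Nurain", "Dayo", "David", "Olu", "Tobi"]
grades = ["A", "B", "C", "A", "B", "C", "A", "A", "B", "A"]
students = list(zip(names, grades))


def stratified_sample(records, frac, rng):
    """Sample the same fraction from every stratum (record[1] is the grade)."""
    strata = defaultdict(list)
    for record in records:
        strata[record[1]].append(record)
    sample = []
    for grade in sorted(strata):
        members = strata[grade]
        # Assumption: round to the nearest whole number of subjects
        sample.extend(rng.sample(members, round(frac * len(members))))
    return sample


rng = random.Random(42)  # fixed seed so the run is repeatable
sample = stratified_sample(students, 0.6, rng)

population_counts = Counter(grades)
sample_counts = Counter(grade for _, grade in sample)

print(dict(sorted(population_counts.items())))  # {'A': 5, 'B': 3, 'C': 2}
print(dict(sorted(sample_counts.items())))      # {'A': 3, 'B': 2, 'C': 1}
```

Grade A is exactly half of both the population (5 of 10) and the sample (3 of 6), while B and C differ slightly because 1.8 and 1.2 have to be rounded.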

Disproportionate stratified random sampling, on the other hand, involves randomly selecting subjects from each stratum without regard for proportion. In other words, sampling is done based on a specified number. Let's look at an example.

df.groupby('Grade', group_keys=False).apply(lambda x: x.sample(n=2))

In this code, you can see that we only specified the actual number of samples we want to achieve.

Most of the time, you'll use proportionate stratified sampling. Disproportionate sampling requires more expert knowledge.

Applications of Stratified Random Sampling

1. Sampling Based on Shared Characteristic:

When one or more subjects in an experiment share characteristics, it suggests they are members of the same group (one subject can only be in a particular group).

For example, suppose 50 students take a test, and the grade range for the examination is merely A-E. So we can have students who are in the same grade group, for example, students who received an A (and it is impossible for a student to have two grades). As a result, they share the same characteristic or feature, which is grade.

So when you want to sample subjects based on shared characteristics, you should use stratified random sampling. This ensures that a member of a specific group will be included.

This is because stratified random sampling differs from simple random sampling, which is also a sampling technique. Simple random sampling randomly samples the population without regard to any characteristic (that is, each subject of the population has an equal chance of being picked).

As a result, simple random sampling cannot guarantee that a certain member of a particular group will be included in the sample.

Let's have a look at an example to see what we're talking about. Let's say we want to sample out 60% of students using both stratified and simple random sampling.

We can see the result for stratified random sampling below:

df.groupby('Grade', group_keys=False).apply(lambda x: x.sample(frac=0.6))

And this is the result of simple random sampling:

df.sample(frac= 0.6)

In the run shown here, students with C grades are not included in the sample. This can happen because in simple random sampling every observation has an equal chance of being chosen regardless of its group, so nothing guarantees that each group is represented.

In stratified random sampling, on the other hand, we consider all the groups we want to sample and then randomly sample from each group.
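
Because a single random draw can look different every time, here is a hedged standard-library sketch that repeats both kinds of sampling many times and counts how often at least one grade is missing. The helper names and the number of trials are assumptions made for this illustration, and the stratified helper rounds each stratum's sample size to the nearest whole number.

```python
import random
from collections import defaultdict

# Grades of the ten students in the dataframe above
grades = ["A", "B", "C", "A", "B", "C", "A", "A", "B", "A"]


def simple_sample(labels, k, rng):
    """Simple random sampling: pick k subjects, ignoring their grade."""
    return rng.sample(labels, k)


def stratified_sample(labels, frac, rng):
    """Stratified sampling: pick the same fraction from every grade."""
    strata = defaultdict(list)
    for label in labels:
        strata[label].append(label)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(frac * len(members))))
    return sample


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 1000
all_grades = set(grades)

missing_simple = sum(
    set(simple_sample(grades, 6, rng)) != all_grades for _ in range(trials)
)
missing_stratified = sum(
    set(stratified_sample(grades, 0.6, rng)) != all_grades for _ in range(trials)
)

print("simple random samples missing a grade:", missing_simple)
print("stratified samples missing a grade:", missing_stratified)
```

The stratified count is always 0 because at least one subject is drawn from every grade, while the simple random count is well above 0 over 1000 trials.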

2. Imbalanced Dataset:

An imbalanced dataset is one in which the class labels in the target variable of a classification problem are not evenly represented. In other words, one class has a higher count than the other, resulting in an imbalance.

In machine learning, stratified sampling is also used to obtain the same sample proportion for a train and test set if there is an imbalance in the dataset.

For example, a chronic disease dataset has an imbalanced label as shown below. You can click here to download the dataset.

df = pd.read_csv("kidney_disease.csv")
df.head()

If we check the proportions of the label feature, which is classification, we can see that it is imbalanced.

Now let's say we want to split the train and test set using simple random sampling. We are not guaranteed to get the same proportion for the train and test set as the population proportion.

from sklearn.model_selection import train_test_split
X = df.drop(columns = ["classification"])
y = df["classification"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

We can see that the label proportion for both y_train and y_test is not the same as the population proportion. To achieve the same proportion we can make use of the stratify parameter in train_test_split as shown below:

from sklearn.model_selection import train_test_split
X = df.drop(columns = ["classification"])
y = df["classification"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y)

The above code shows that the dataset was stratified on the label. So with that we will achieve the same proportion as the population proportion.
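
To show what stratifying a split on the label means, here is a minimal standard-library sketch. The function stratified_split and the 70/30 label list are hypothetical and made up for illustration. They are not taken from the kidney disease dataset, and this is not how scikit-learn implements the stratify parameter internally.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_size, rng):
    """Split row indices so each class keeps the same share in train and test."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                     # random order within the class
        cut = round(train_size * len(indices))   # same fraction taken from every class
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx


# Hypothetical imbalanced labels: 70% positive, 30% negative
labels = ["positive"] * 70 + ["negative"] * 30

rng = random.Random(7)  # fixed seed so the run is repeatable
train_idx, test_idx = stratified_split(labels, train_size=0.8, rng=rng)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)

print(dict(train_counts))  # {'negative': 24, 'positive': 56}
print(dict(test_counts))   # {'negative': 6, 'positive': 14}
```

Both the train set (56 of 80) and the test set (14 of 20) keep the 70% share of the positive class.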

Conclusion

In this tutorial, we looked at stratified sampling and how you can use it in statistics and machine learning. We also looked at the types of stratified sampling.

Thank you for your time.

How to Perform Customer Segmentation in Python – Machine Learning Tutorial

Ibrahim Ogunbiyi — Wed, 02 Nov 2022 18:56:39 +0000

Before I get into what this post is all about, I'd like to share the motivation that prompted me to write it.

I'm writing this article because I recall the first time I learned about customer segmentation or clustering. I didn't fully grasp what I was doing back then.

All I remembered was dumping all the features into KMeans and voilà – I'd developed a customer segmentation. I didn't understand the model's attributes for each segment.

For that reason, I'm sharing how I've come to understand customer segmentation so that, hopefully, you can benefit from it.

In this tutorial, you will learn how to build an effective customer segmentation as well as how to perform effective Exploratory Data Analysis (EDA). These are the ingredients that will make your customer segmentation result delicious to eat 😋. Without further ado let's get started.

What is Customer Segmentation?

We've been talking about customer segmentation since the beginning of the article – but you might not know what it means.

Note that it is important to try and understand this theoretical part before we move into coding part of the tutorial. This foundation will help you build the segmentation model effectively.

Ok, back to defining what segmentation is:

Segmentation means grouping entities together based on similar properties. Entities could be customers, products, and so on.

Customer segmentation, in particular, means grouping customers together based on similar features or properties.

One thing to note when grouping customers based on properties: the properties you choose must be relevant to the criteria on which you want to group them.

For example, assume you want to categorize customers depending on what they buy. In this scenario, the customer's gender attribute may not be optimal or relevant for segmentation.

Knowing how to select appropriate attributes for customer segmentation is crucial.

Let's look at the different types of Customer Segmentation:

Demographic Segmentation.
Behavioral Segmentation.
Geographic Segmentation.
Psychographic Segmentation.
Technographic Segmentation.
Needs-based Segmentation.
Value-based Segmentation.

The most typical types of customer segmentation you will work on are Demographic and Behavioral segmentation.

Demographic Segmentation is the process of grouping customers based on their demography – that is, grouping customers based on their age, income, education, marital status, and so on.

Behavioral Segmentation means grouping customers based on their behavior. For example, how frequently they purchase, the total amount they spend on goods, when they last bought a product, and so on.

To learn more about other types of Customer Segmentation, you can read this article.

Criteria for Customer Segmentation

When grouping customers, you should select relevant features that are tailored to what you want to segment them on. But in some circumstances, it makes sense to combine features from several types of customer segmentation to generate another type of segmentation.

For example, you can combine features from demographic and behavioral segmentation to create a new segmentation. That is precisely what you will learn in this article – we will build a customer segmentation using demographic features and behavioral features.

Now enough talking – let's get down to business.

Understanding the Business Problem.

The business problem is to segment customers based on their personalities (demographic) and the amount they spend on products (behavioral). This will help the company gain a better understanding of their customers' personalities and habits.

Tools We'll Use for this Project

Of course we're using Python to build our project – but these are the tools and libraries that we will also be using to help us out.

Jupyter environment (Jupyter Lab or Jupyter notebook) – for experimenting with our project.
Pandas – for loading data as a dataframe and wrangling the data.
Numpy and Scipy – for performing some basic mathematical computations.
Scikit-Learn – for building our Customer Segmentation Model.
Seaborn, Matplotlib and Plotly Express – for data visualization.

If you don't have some or any of these libraries, you can check out their official documentation online to see how to install them.

Dataset We'll Use for this Project

The dataset we'll use in this project comes from Kaggle. You can go here to download it.

Here's a little information about the dataset:

To put it simply, the dataset contains the demographics of customers and their behavior as it relates to the company. The features of the dataset are:

Customer Personality Analysis Features

People	Promotion	Product	Place
Year Birth	NumberDealPurchase	MntWines	NumWebPurchases
Title	AcceptedCmp1	MntFruits	NumCatalogPurchases
Education	AcceptedCmp2	MntMeatProducts	NumStorePurchases
Marital_Status	AcceptedCmp3	MntFishProducts	NumWebVisitsMonth
Income	AcceptedCmp4	MntSweetProducts
Kidhome	AcceptedCmp5	MntGoldProds
Teenhome	Response
Dt_customer, Recency, and Complain

To get the most out of this tutorial, you can download the entire Jupyter notebook beforehand so you can follow along easily. You can go here to fork the repo.

Exploratory Data Analysis (EDA)

As you might know, EDA is the key to performing well as a data analyst or data scientist. It gives you first-hand information about the whole dataset, and it helps you understand all the relationships between the features in your dataset.

We will perform the three phases of EDA in this tutorial which are:

Univariate Analysis.
Bivariate Analysis.
Multivariate Analysis

Firstly we need to import all the necessary libraries we will use in this project. We also need to load the dataset into a dataframe so we can see all the features that are present in it.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
from scipy.stats import iqr
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans


df = pd.read_csv("data/marketing_campaign.csv", sep="\t")
df.head()

To begin, there are many features in the dataset – but because we want to focus on customer demographics and behavior, we will only perform EDA on features related to those categories.

Keep in mind that the EDA conducted in this article is simply a subset of the one in the Jupyter Notebook. I did it this way to keep the article from becoming too bulky. To find the entire EDA in the notebook, fork the repo by clicking this link.

Age, income, marital status, education, total children, and amount spent on products are the attributes that belong to this category.

First, since the segmentation is based on the total amount customers have spent, we'll add up the amounts spent on each product category:

df["TotalAmountSpent"] = df["MntFishProducts"] + df["MntFruits"] + df["MntGoldProds"] + df["MntSweetProducts"] + df["MntMeatProducts"] + df["MntWines"]
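The age plot in the next section uses an Age column, which the snippets in this article don't create. Here is a minimal sketch of one way to derive it, assuming the dataset's Year_Birth column and an arbitrary reference year. Both the toy data and REFERENCE_YEAR are assumptions for illustration and may differ from what the notebook does:

```python
import pandas as pd

# Toy stand-in for the real dataset; only Year_Birth matters here.
toy = pd.DataFrame({"Year_Birth": [1957, 1984, 1996]})

# Hypothetical reference year; the notebook may use a different one.
REFERENCE_YEAR = 2022

# Age is simply the reference year minus the year of birth.
toy["Age"] = REFERENCE_YEAR - toy["Year_Birth"]
print(toy)
```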

After that's done we can now begin our EDA. An effective EDA always has three stages, as I mentioned above. Again, they are as follows:

Univariate Analysis
Bivariate Analysis.
Multivariate Analysis.

Univariate analysis

Univariate analysis entails evaluating a single feature in order to get insights about it. So, the initial step in performing EDA is to undertake univariate analysis, which includes evaluating descriptive or summary statistics about the feature.

For example you might check a feature distribution, proportion of a feature, and so on.

In our case, we will check the distribution of customers' ages in the dataset. We can do that by typing the following:

sns.histplot(data=df, x="Age", bins = list(range(10, 150, 10)))
plt.title("Distribution of Customer's Age")

We can see from the plot above that most of the customers fall in the 40-60 age range.

Bivariate Analysis

After you've performed univariate analysis on all your features of interest, the next step is to perform bivariate analysis. This involves comparing two attributes at the same time.

For example, bivariate analysis might entail determining the correlation between two features.

In our case, some of the bivariate analysis we'll perform in the project include observing the average total spent across different client age groups, determining a correlation between customer income and total amount spent, and so on, as shown below.

For example, in our case we want to check the relationship between a Customer's Income and TotalAmountSpent. We can do that by typing the following:

fig = px.scatter(data_frame=df_cut, x="Income",
                 y="TotalAmountSpent",
                 title="Relationship Between Customer's Income and Total Amount Spent",
                height=500,
                color_discrete_sequence = px.colors.qualitative.G10[1:])
fig.show()

Analysis of relationship between customer's income and total amount spent.

We can see from the above analysis that as Income increases, so does TotalAmountSpent. So we can postulate that Income is one of the key factors that determines how much a customer might spend.
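To put a number on a relationship like this, you can compute a correlation coefficient. The sketch below uses made-up values (not the real dataset) and the Pandas Series.corr method, which computes the Pearson correlation by default:

```python
import pandas as pd

# Made-up numbers purely for illustration.
toy = pd.DataFrame({
    "Income": [20000, 35000, 50000, 65000, 80000],
    "TotalAmountSpent": [50, 200, 450, 900, 1400],
})

# Pearson correlation: values near +1 mean the two features rise together.
corr = toy["Income"].corr(toy["TotalAmountSpent"])
print(round(corr, 2))
```

A value close to 1 supports the upward trend you see in the scatter plot, but it doesn't prove that one feature causes the other.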

Multivariate Analysis

After you've completed univariate (analysis of single feature) and bivariate (analysis of two features) analysis, the last phase of EDA is to perform Multivariate Analysis.

Multivariate Analysis consists of understanding the relationship between three or more variables.

In our project, one of the multivariate analyses we'll do is to understand the relationship between Income, TotalAmountSpent, and Customer's Education.

fig = px.scatter(
    data_frame=df_cut,
    x = "Income",
    y= "TotalAmountSpent",
    title = "Relationship between Income VS Total Amount Spent Based on Education",
    color = "Education",
    height=500
)
fig.show()

Analysis of relationship between income, total amount spent, and education.

We can see from the analysis that customers with an Undergraduate education level generally spend less than customers with higher levels of education. This is likely because undergraduate customers tend to earn less than other customers, which affects their spending habits.

How to Build the Segmentation Model

After we've finished our analysis, the next step is to create the model that will segment the customers. KMeans is the model we'll use. It is a popular segmentation model that is also quite effective.

The KMeans model is an unsupervised machine learning model that works by splitting N observations into K clusters. Each observation is assigned to the cluster whose mean, commonly referred to as the centroid, is closest to it.

When you fit the features into the model and specify the number of clusters or segments you want, KMeans will output the cluster label to which each observation in the feature belongs.
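To make the idea concrete, here is a minimal NumPy sketch of the two steps KMeans repeats: assign each point to its nearest centroid, then move each centroid to the mean of its points. The data and the starting centroids are made up, and this is a simplified illustration rather than Scikit-learn's actual implementation:

```python
import numpy as np

# Six made-up 2-D points forming two obvious groups.
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])

# Start from two arbitrary centroids (here: the first and fourth points).
centroids = points[[0, 3]].copy()

for _ in range(5):
    # Assignment step: distance from every point to every centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points.
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(labels)
print(centroids)
```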

Let's talk about the features you might want to fit into a KMeans model. There are no limits to the number of features you can use to build a Customer segmentation model – but in my opinion, fewer's better. This is because you will be able to grasp and interpret the outcomes of each segment more easily and clearly with fewer features.

In our scenario, we will first construct the KMeans model with two features and then build the final model with three features. But, before we get started, let's go over the KMeans assumptions, which are as follows:

The features must be numerical.
The features you're fitting into KMeans should be roughly normally distributed, or at least not heavily skewed. This is because KMeans relies on average distances, so it is affected by outliers (values that deviate a lot from the others). As a result, any skewed feature should be transformed to bring it closer to a normal distribution. Fortunately, we can use NumPy's logarithm function np.log() for that.
The features must also be on the same scale. For this, we'll use the Scikit-learn StandardScaler() class.
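The last two assumptions can be sketched with NumPy alone. The incomes below are made up, and the standardization is written out by hand; it uses the population standard deviation, which is the same formula StandardScaler applies by default:

```python
import numpy as np

# Made-up, right-skewed incomes with one very large value.
income = np.array([20000.0, 25000.0, 30000.0, 40000.0, 600000.0])

# The log transform compresses large values, shrinking the outlier's pull.
income_log = np.log(income)

# Standardize: subtract the mean and divide by the (population) standard deviation.
income_scaled = (income_log - income_log.mean()) / income_log.std()

print(income_scaled.round(2))
```

After this step the feature has a mean of 0 and a standard deviation of 1, so no single feature dominates the distance calculation just because of its units.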

We'll design our KMeans model now that we've grasped the main concept. So, for our first model, we'll use the Income and TotalAmountSpent features.

To begin, because the Income feature has missing values, we will fill them with the median.

df["Income"] = df["Income"].fillna(df["Income"].median())

After that, we'll assign the features we want to work with, Income and TotalAmountSpent, to a variable called data.

data = df[["Income", "TotalAmountSpent"]]

Once that's done, we will log-transform the features and save the result into a variable called df_log.

df_log = np.log(data)

Then we will scale the result using Scikit-learn StandardScaler():

std_scaler = StandardScaler()
df_scaled = std_scaler.fit_transform(df_log)

Once that's done we can build the model. We'll set two KMeans parameters, n_clusters and random_state, where:

n_clusters represents the number of clusters or segments to be derived from KMeans.
random_state makes the results reproducible.

So, in a business setting, you might know the number of clusters you want to segment customers into ahead of time. But if not, you will need to experiment with different numbers of clusters to find the optimal one.

Since we're not in a business setting, we will experiment with different numbers of clusters.

The elbow method is the strategy we'll use to select the best number of clusters. It works by plotting the error for each number of clusters and looking for the spot where the curve bends to form an elbow. The ideal number of clusters is the one at that elbow.

Here's the code that will help us achieve that:

errors = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(df_scaled)
    # inertia_ is the sum of squared distances to the nearest centroid
    errors.append(model.inertia_)

plt.title('The Elbow Method')
plt.xlabel('k'); plt.ylabel('Error of Cluster')
sns.pointplot(x=list(range(1, 11)), y=errors)
plt.show()

Let's summarize what the above code does. We specified the numbers of clusters to experiment with, which are in range(1, 11). Then we fit a model for each number of clusters and appended its error to the list we created earlier.
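The error collected from model.inertia_ is the sum of squared distances between each observation and its nearest centroid. Here is a small sketch of that calculation by hand, using made-up points and fixed, hypothetical centroids:

```python
import numpy as np

# Four made-up points and two fixed centroids.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared distance from each point to each centroid.
sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each point contributes its squared distance to the closest centroid.
inertia = sq_dist.min(axis=1).sum()
print(inertia)  # 4.0
```

Adding more clusters can only keep this number the same or lower it, which is why we look for the elbow instead of simply picking the smallest error.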

Following that, we plot the error for each number of clusters. The diagram shows that the elbow forms at three clusters. So three is the best value for our model. As a result, we will build the KMeans model utilizing three clusters.

model = KMeans(n_clusters = 3, random_state=42)
model.fit(df_scaled)

Now we've built our model. The next step is to assign the cluster label to each observation. We'll attach the labels to the original, unprocessed features, that is, the data variable that holds Income and TotalAmountSpent.

data = data.assign(ClusterLabel = model.labels_)

How to Interpret the Cluster Result

Now that we've built the model, the next thing will be to interpret the result from each cluster.

There are numerous ways you can summarize the results of your clusters depending on what you want to achieve. The most common summary uses measures of central tendency, which include the mean, median, and mode.

For our case we will make use of median. We're using median because the original features have outliers and the mean is very sensitive to outliers.
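A quick sketch with made-up incomes shows why. A single extreme value drags the mean far away from the typical customer, while the median barely moves:

```python
from statistics import mean, median

# Made-up incomes; the last value is an outlier.
incomes = [30000, 32000, 35000, 37000, 40000, 500000]

print(mean(incomes))    # pulled upward by the outlier
print(median(incomes))  # stays close to the typical income
```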

So we will aggregate the cluster labels and find the median for Income and TotalAmountSpent. We can make use of Pandas groupby method for that.

data.groupby("ClusterLabel")[["Income", "TotalAmountSpent"]].median()

We can see that there is a trend within the clusters:

Cluster 0 translates to customers who earn less and spend less.
Cluster 1 represents customers that earn more and spend more.
Cluster 2 represents customers that earn a moderate amount and spend a moderate amount.

We can also visualize the relationship by entering the following code:

fig = px.scatter(
    data_frame=data,
    x = "Income",
    y= "TotalAmountSpent",
    title = "Relationship between Income VS Total Amount Spent",
    color = "ClusterLabel",
    height=500
)
fig.show()

Analysis of relationship between income and total amount spent

Now in the same way we built the previous model, we will build the KMeans model using 3 features (the elbow method also shows that 3 clusters is optimal).

data = df[["Age", "Income", "TotalAmountSpent"]]
df_log = np.log(data)
std_scaler = StandardScaler()
df_scaled = std_scaler.fit_transform(df_log)

model = KMeans(n_clusters=3, random_state=42)
model.fit(df_scaled)

data = data.assign(ClusterLabel= model.labels_)

result = data.groupby("ClusterLabel").agg({"Age":"mean", "Income":"median", "TotalAmountSpent":"median"}).round()
result

We can see from the above summary that:

Cluster 0 depicts young customers that earn a lot and also spend a lot.
Cluster 1 translates to older customers that earn a lot and also spend a lot.
Cluster 2 depicts young customers that earn less and also spend less.

We can also visualize our result by typing the following code:

fig = px.scatter_3d(data_frame=data, x="Income", 
                    y="TotalAmountSpent", z="Age", color="ClusterLabel", height=550,
                   title = "Visualizing Cluster Result Using 3 Features")
fig.show()

Cluster results using three features

Conclusion

In this tutorial, you learnt how to build a customer segmentation model. There are a lot of features we didn't touch on in this article. But I suggest that you experiment with it and create customer segmentation models using different features.

I hope you learn more from doing that. Thank you for reading the article. Happy Coding!

The link to the full code can be found below. And here's an article on K-Means Clustering if you want to learn more.

https://github.com/ibrahim-ogunbiyi/Customer-Segmentation

How the Python Lambda Function Works – Explained with Examples

Ibrahim Ogunbiyi — Tue, 25 Oct 2022 20:37:45 +0000

One of the beautiful things about Python is that it is generally one of the most intuitive programming languages out there. Still, certain concepts can be difficult to grasp and comprehend. The lambda function is one of them.

I've been there. When I first started learning Python, I skipped the lambda function because it wasn't clear to me. But with time, I began to understand it. So don't worry – if you're struggling with it, too, I've got you covered.

This tutorial will teach you what a lambda function is, when to use it, and we'll go over some common use cases where the lambda function is commonly applied. Without further ado let's get started.

What is a Lambda Function?

Lambda functions are similar to user-defined functions but without a name. They're commonly referred to as anonymous functions.

Lambda functions are useful whenever you want to create a function that contains only a simple expression, that is, one that usually fits on a single line. They're also useful when you want to use the function only once.

How to Define a Lambda Function

You can define a lambda function like this:

lambda argument(s) : expression

lambda is a keyword in Python for defining the anonymous function.
argument(s) is a placeholder, that is, a variable that holds the value you want to pass into the function expression. A lambda function can have multiple arguments depending on what you want to achieve.
expression is the code you want to execute in the lambda function.

Notice that the anonymous function does not have a return keyword. This is because the anonymous function will automatically return the result of the expression in the function once it is executed.

Let's look at an example of a lambda function to see how it works. We'll compare it to a regular user-defined function.

Assume I want to write a function that returns twice the number I pass it. We can define a user-defined function as follows:

def f(x):
  return x * 2

f(3)
>> 6

Now for a lambda function. We'll create it like this:

lambda x: x * 2

As I explained above, the lambda function does not have a return keyword. As a result, it will return the result of the expression on its own. The x in it also serves as a placeholder for the value to be passed into the expression. You can change it to whatever you want.

Now if you want to call a lambda function, you will use an approach known as immediately invoking the function. That looks like this:

(lambda x : x * 2)(3)

>> 6

The reason for this is that since the lambda function does not have a name you can invoke (it's anonymous), you need to enclose the entire statement when you want to call it.
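The same pattern works when the lambda takes more than one argument. A small sketch (the variable name total is just for illustration):

```python
# Immediately invoke a lambda that takes two arguments.
total = (lambda x, y: x + y)(2, 3)
print(total)  # 5
```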

When Should You Use a Lambda Function?

You should use the lambda function to create simple expressions. For example, expressions that do not include complex structures such as if-else, for-loops, and so on.

So, for example, if you want to create a function with a for-loop, you should use a user-defined function.
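As a rough illustration, here is the kind of function that reads better as a regular user-defined function because it needs a loop. The name sum_of_squares is hypothetical:

```python
def sum_of_squares(numbers):
    """Add up the square of every number in the list."""
    total = 0
    for n in numbers:
        total += n ** 2
    return total

print(sum_of_squares([1, 2, 3]))  # 14
```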

Common Use Cases for Lambda Functions

How to Use a Lambda Function with Iterables

An iterable is essentially anything that consists of a series of values, such as characters, numbers, and so on.

In Python, iterables include strings, lists, dictionaries, ranges, tuples, and so on. When working with iterables, you can use lambda functions in conjunction with two common functions: filter() and map().

`Filter()`

When you want to focus on specific values in an iterable, you can use the filter function. The following is the syntax of a filter function:

filter(function, iterable)

As you can see, a filter function requires another function that contains the expression or operations that will be performed on the iterable.

For example, say I have a list such as [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Now let's say that I’m only interested in those values in that list that have a remainder of 0 when divided by 2. I can make use of filter() and a lambda function.

Firstly I will use the lambda function to create the expression I want to derive like this:

lambda x: x % 2 == 0

Then I will insert it into the filter function like this:

list1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
filter(lambda x: x % 2 == 0, list1)

>> <filter at 0x1e3f212ad60> # The result is always a filter object, so I need to convert it to a list using list()

list(filter(lambda x: x % 2 == 0, list1))
>> [2, 4, 6, 8, 10]

`Map()`

You use the map() function whenever you want to modify every value in an iterable.

map(function, iterable)

For example, let's say I want to raise all values in the below list to the power of 2. I can easily do that using the lambda and map functions like this:

list1 = [2, 3, 4, 5]

list(map(lambda x: pow(x, 2), list1))
>> [4, 9, 16, 25]

Pandas Series

Another place you'll use lambda functions is in data science, when working with Pandas data frames. A series is a data frame column. You can manipulate all of the values in a series by using a lambda function.

For example, if I have a data frame with the following columns and want to convert the values in the name column to lower case, I can do so using the Pandas apply function and a Python lambda function like this:

import pandas as pd

df = pd.DataFrame(
    {"name": ["IBRAHIM", "SEGUN", "YUSUF", "DARE", "BOLA", "SOKUNBI"],
     "score": [50, 32, 45, 45, 23, 45]
    }
)

df["lower_name"] = df["name"].apply(lambda x: x.lower())

The apply function will pass each element of the series to the lambda function. The lambda function will then return a value for each element based on the expression you gave it. In our case, the expression lowercases each element.

Conclusion

In this tutorial you learnt the basics of the lambda function and how you can commonly apply it. Thank you for taking your time to read this.

Top Python Concepts to Know Before Learning Data Science

Ibrahim Ogunbiyi — Wed, 24 Aug 2022 17:53:30 +0000

If you're interested in learning data science, you've likely heard the buzzword "Python". It's a popular programming language often used in data science.

But Python is a general-purpose programming language. This means that it's not limited to data science alone. You can use it to develop web and mobile applications too, among other things.

So, when learning Python for data science, one of the most common mistakes beginners make is learning it "incorrectly" — that is, not learning Python in preparation for Data Science. This can result in a loss of time and effort.

In this article, we'll go through the top Python concepts you should know before delving into data science. Now relax and follow along because this will be an exciting journey.

To have a quick overview of what the journey is going to be all about, here's what we'll cover:

Integers and Floating-Point Numbers in Python
Strings in Python
Boolean values in Python
Arithmetic operators in Python
Comparison Operator in Python
Logical Operators in Python
Membership Operator in Python
F-string formatting in Python
Lists in Python
Tuples in Python
Dictionaries in Python
Zip() Function in Python
Enumerate() Function in Python
Counter() Function in Python
If-else Statements in Python
Range() Function in Python
List Comprehension in Python
User Defined Functions in Python

Top Python Concepts to Know for Data Science

Why these concepts are important to Know

To put it bluntly, these concepts are what you will need to kickstart your data science journey when you want to use Python as your language for data science. You will be working with them in your day-to-day work as a data scientist, so it's good to have a firm grip on how they work.

Integers and Floating-Point Numbers in Python

Numbers are one of the most fundamental concepts in data science. And Python contains representations (data types) for the various types of numbers that can exist. These are mostly classified into:

Integers: these are whole numbers that are either positive or negative in Python. Examples include 200, -100, 67, and so forth.
Floating-point numbers: these are decimal values that are either positive or negative. Examples include 200.65, -14.34, 53.0002, and so on.
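You can check which of the two types a value has with the built-in type() function. A short sketch with illustrative variable names:

```python
whole = 200
decimal = 200.65

print(type(whole))    # <class 'int'>
print(type(decimal))  # <class 'float'>

# Mixing an integer with a float produces a float.
mixed = whole + decimal
print(type(mixed))    # <class 'float'>
```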

Strings in Python

In Python, strings contain alphanumeric values that are usually enclosed in single or double quotation marks.

An example includes "FreeCodeCamp has a lot of rich resources".

Python has a lot of methods that you can use to manipulate strings. For example if you wish to convert a string from uppercase to lowercase, you can use the .lower() method in Python as shown below.

string = "FREECODECAMP IS COOL"
print(string.lower())
>>> freecodecamp is cool

You often work with strings in Data Science to create or manipulate any textual data in your dataset.

To learn more about strings and their methods, check out this helpful handbook.

Boolean values in Python

Boolean values are also known as binary values. They can take only one of two values: True or False, which correspond to 1 and 0.
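A short sketch of how that plays out in practice. The list passed is made up for illustration:

```python
print(True == 1)   # True
print(False == 0)  # True

# Because True counts as 1, summing a list of booleans counts the True values.
passed = [True, False, True, True]
n_passed = sum(passed)
print(n_passed)    # 3
```

This trick is handy in data science for counting how many rows satisfy a condition.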

Arithmetic operators in Python

You use arithmetic operators to perform mathematical operations on two numerical operands or values. They include the following:

The plus symbol + represents addition.
The dash symbol - represents subtraction.
The asterisk symbol * represents multiplication.
The slash symbol / represents division.
The percentage symbol % is used to express the modulus.
The double asterisk symbol ** represents an exponent.
The double slash symbol // represents floor division.

The first four operators are quite straightforward because we deal with them on a daily basis. However, the following require a bit more explanation:

What is the modulus operator?

The modulus operator (%) returns the remainder when performed on two separate numbers. For example, 8 % 3 will return 2 since 3 can only go in 8 twice, leaving a remainder of 2.

What is the exponential operator?

You use the exponential operator ** to raise a number to the power of another. For example, 2**3 equals 8, because 2 appears as a factor three times: 2*2*2 = 8

What is the floor division operator?

You use the floor division // operator to divide. But unlike the regular division operator /, which produces a decimal number, floor division returns the result rounded down to the nearest whole number.

For example, 5//2 will result in 2 (because 2 goes into 5 two whole times). Floor division does not round to the nearest number: it always rounds down.
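Here are the three operators side by side, next to regular division for comparison. The variable names are just for illustration:

```python
remainder = 8 % 3      # 2, the remainder after dividing 8 by 3
power = 2 ** 3         # 8, which is 2 * 2 * 2
floored = 5 // 2       # 2, the result rounded down
true_div = 5 / 2       # 2.5, regular division keeps the decimal part

# Floor division always rounds down, so negative results move away from zero.
neg_floored = -5 // 2  # -3

print(remainder, power, floored, true_div, neg_floored)
```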

How to perform arithmetic operations on a string

You can also perform arithmetic operations on a string. Addition and multiplication are the two arithmetic operations that you can perform on a string.

Addition operator +: you use the addition operator to concatenate two string operands (that is, you join two strings together). For example:

"Folks" + "connect" 
>>> "Folksconnect"

Multiplication operator *: you use the multiplication operator to repeat a string (but note that one of the operands must be a number). For example:

2 * "Folks" 
>>> "FolksFolks"

Comparison Operator in Python

You use comparison operators to compare two operands. When you apply a comparison operator to two operands, it returns a Boolean value of either True or False. The comparison operators include:

Greater than sign >
Less than sign <
Equality sign ==
Not equal sign !=
Greater than or equals to >=
Less than or equals to <=

Here are some examples: 2==2 will result in True. Also 5>= 5 will result in True since 5 is also equal to 5.

Logical Operators in Python

You use logical operators to combine conditional statements. They include and, or, and not.

For example, 4<5 and 3>2 will return True, because 4<5 is a condition which is True and 3>2 is another condition which is also True. So, according to the logic gate, True and True results in True.
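All three logical operators in one short sketch (the variable names are illustrative):

```python
both = 4 < 5 and 3 > 2    # True and True  -> True
either = 4 > 5 or 3 > 2   # False or True  -> True
negated = not 4 < 5       # not True       -> False

print(both, either, negated)
```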

Before we move on, I want to define a term that I will be using mostly in the rest of the article – iterables. An iterable is basically something that consists of a sequence of values, for example characters, numbers and so on. Iterables include strings, lists, dictionaries, ranges, tuples, and so on in Python.

Membership Operator in Python

You use the membership operators to determine whether a value belongs in a sequence/iterable. A sequence can be a string of characters, a list of numbers, or anything else.

The membership operators are the in operator and the not in operator.

For example let's say I want to check if the character b is in the string "What a time to be alive" – I can do that by typing the following statement and the result from it will be a Boolean value.

"b" in "what a time to be alive"


>>> True
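The not in operator works the same way but flips the answer, and both operators work on other iterables such as lists. A brief sketch with illustrative names:

```python
sentence = "what a time to be alive"
z_missing = "z" not in sentence   # True, there is no "z" in the sentence
print(z_missing)

numbers = [2, 3, 4]
has_three = 3 in numbers          # True, 3 is an element of the list
print(has_three)
```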

To learn more about the operators in Python check out these articles.

F-string formatting in Python

In some cases, you may want to insert a variable value within a string. Assume you don't know the value ahead of time but want it to be within a string. String formatting can help you achieve this.

There are several ways to format strings in Python, but we will focus on one of them: the f-literal format.

Let's look at an example: I have two variables, name and age, and I want to include them in a string and then print out the entire string.

age = 10
name = "Eagle"

string = f"There are some birds of prey such as {name} that are older than {age} years."

print(string)

>>> There are some birds of prey such as Eagle that are older than 10 years.

So the first thing to do is add an f to the front of the string you wish to format using the f-literal. Also, the variable you wish to format must be inside curly braces.

To learn more about string formatting using f-literals, check out this article from Bala Priya that explains it. Also, you can learn more about other types of string formatting here.

Lists in Python

You use lists to store or organize data in a sequential order. This data can be a string, numbers, or iterables like a list.

A list is also mutable, which means that it can expand and change after you declare it (you can add new elements to it).

In Python, you can create a list with square brackets and then save it to a variable. For instance:

lst_of_num = [2, 3, 4, 2]

As we can see, the preceding is a list of numbers. The beauty of a list is that it allows you to have duplicate values in the list. As previously stated, you can create a list of different data types, such as a list of numbers, strings, and lists.

diverse_lst = [4, "Folks", ["2", 4, 6, 7]]

To get to a list item or element, you use indexing. In Python, the first element of a sequence such as a list is always at index 0. In other words, a list's positions begin at 0. As an example, the elements of the lst_of_num variable sit at the following indexes or positions.

lst_of_num = [2, 3, 4, 2]

2 -- index or position 0
3 -- index or position 1
4 -- index or position 2
2 -- index or position 3

You can access a list element using the following approach:

name_of_list[index or position]

For our case, if you want to access the element at index 3 (the fourth element), you can do that by typing:

print(lst_of_num[3])
>> 2

You'll use lists a lot in data science. You will need them whenever you wish to keep a sequence of values in a container.
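Because a list is mutable, you can change it after creating it. A minimal sketch using the same list as above:

```python
lst_of_num = [2, 3, 4, 2]

lst_of_num.append(10)  # add a new element at the end
lst_of_num[0] = 7      # replace the element at index 0

print(lst_of_num)      # [7, 3, 4, 2, 10]
```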

To learn how to add, remove, or update a list, check out this helpful tutorial by Ihechikara Vincent Abba on how to make a list in Python.

Tuples in Python

A tuple is another data collection type in Python. You also use it to store and organize data in the form of a list.

The main difference is that it is immutable, which means it cannot expand or change (you can't add new elements to it) like a list.

In Python, you can make a tuple by using parentheses.

my_tuple = (2, 3, 5) # This is a tuple of numbers.

Also a tuple can contain different data types:

diverse_tuple = (2, "Golang", [4, 5, 2], ("day", "night"))

To access elements in a tuple, you do the same thing as with a list:

my_tuple[2]
>>> 5

Tuples come in handy when you need a Python collection that won't have new elements added to it once it is created.
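You can see the immutability for yourself by trying to change an element. In this sketch the assignment fails with a TypeError and the tuple is left untouched:

```python
my_tuple = (2, 3, 5)

try:
    my_tuple[0] = 10  # tuples don't support item assignment
except TypeError as error:
    message = str(error)
    print(message)

print(my_tuple)  # (2, 3, 5)
```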

If you want to know more about tuples check out this article. Also if you want to know more about the differences between lists and tuples, check out this helpful article by Dionysia Lemonaki that explains it.

Dictionaries in Python

A dictionary is a Python collection that stores data as key-value pairs. You can create a dictionary using curly braces. Also dictionaries are mutable. For example:

my_dict = {"names":["Grace", "Dave", "Jack"], "scores":[45, 56, 70]}

The value before the colon is referred to as the key, and it must be an immutable (hashable) datatype such as a string, integer, or tuple. The value after the colon is just called a value, and it can be of any datatype, mutable or immutable, like lists, dictionaries, and so on.

You can access a dictionary's values through keys. For example, say I want to get the name of a student from the above dictionary. I can just do that easily through the use of keys, like this:

print(my_dict["names"])
>>> ["Grace", "Dave", "Jack"]

You will often need dictionaries for key-value pairs-related tasks or when you wish to transform something into a series/dataframe in Pandas (a library you will work with mostly for data manipulation).
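
Here is a rough sketch of the rule about keys, using the same my_dict. The ("term", 1) key and the "ages" lookup are made-up examples. A tuple works as a key, a list does not, and .get() lets you ask for a key that may be missing without raising an error.

```python
my_dict = {"names": ["Grace", "Dave", "Jack"], "scores": [45, 56, 70]}

# A tuple is hashable, so it works as a key
my_dict[("term", 1)] = "first term results"

try:
    my_dict[["term", 2]] = "second term results"  # a list is not hashable
except TypeError as error:
    print("Invalid key:", error)

# .get() returns a default instead of raising KeyError for a missing key
ages = my_dict.get("ages", [])
print(ages)
```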

To learn more about dictionaries and how to add, update, or delete from a dictionary, check out this helpful tutorial by Dionysia Lemonaki that explains them. Here's also a helpful article from Kolade Chris about dictionaries.

`zip()` Function in Python

You use the zip function to zip (combine) two or more iterables such as lists, tuples, dictionaries, and so on. The elements of each iterable are paired together by position.

To put it another way, the first element of the first iterable is paired with the first element of the second iterable. You typically use the zip function to merge two lists or tuples into a dictionary. Let's see how that goes.

Let's say I have a list that contains the name of a student and another list that contains the score of each student. Now If I want to map the name of each student to their respective score, I can do that using the zip function.

name = ["Dave", "Jerry", "Sasha"]
score = [43, 56, 78]
result = zip(name, score)

We're almost finished. If you print the result from the above code, you'll only see a zip object (an iterator), not the pairs themselves. The last thing we need to do is pass it to the dict function, which converts an iterable of pairs into a dictionary.

print(dict(result))
>>> {'Dave': 43, 'Jerry': 56, 'Sasha': 78}

You will often use the zip() function to join lists into a dictionary in Data Science.
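
Two details are easy to miss when you use zip, so here is a short sketch of both. The shortened score list is an assumption made for illustration. zip stops at the shortest iterable, and the object it returns can only be consumed once.

```python
name = ["Dave", "Jerry", "Sasha"]
score = [43, 56]  # one score is missing on purpose

result = zip(name, score)
first_pass = dict(result)   # pairs stop at the shortest iterable
second_pass = dict(result)  # the iterator is already used up

print(first_pass)
print(second_pass)
```

If you need the pairs more than once, convert the result to a list or dictionary right away and reuse that instead.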

To learn more about zip() function check out this helpful tutorial by Ihechikara Vincent Abba here.

`enumerate()` Function in Python

In Python, you use the enumerate function to assign or pair index or position values to the values in an iterable (remember, index values start at 0).

Once those index values are paired to the iterable values, you can decide to turn it into a dictionary where the index values will now serve as a key for the values in the iterable.

Let's look at an example to see how it works.

lst = ["Free", "Code", "Camp"]
result = dict(enumerate(lst))
print(result)
>>> {0: 'Free', 1: 'Code', 2: 'Camp'}

You will often use the enumerate() function to assign an index to each item in a list and then turn the result into a dictionary.
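
You can also use enumerate directly in a loop. The sketch below is only an illustration: the numbered list and the start=1 argument are not part of the example above. start tells enumerate which number to begin counting from.

```python
lst = ["Free", "Code", "Camp"]

numbered = []
for position, word in enumerate(lst, start=1):
    # position begins at 1 here because of start=1
    numbered.append(f"{position}. {word}")

print(numbered)
```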

`Counter()` Function in Python

Counter, as the name implies, lets you count the number of times each value in an iterable occurs.

Counter produces a Counter object, which behaves like a dictionary. To use Counter, we need to import it from the collections module. Let's see how that works.

from collections import Counter
lst = ["Free", "Code", "Camp", "Code", "Free"]
print(Counter(lst))
>>> Counter({'Free': 2, 'Code': 2, 'Camp': 1})

You will often use the Counter function when performing natural language processing in data science.
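
As a rough idea of how this looks in a text-processing task, here is a small sketch that counts the words in a sentence. The sentence is made up for illustration. most_common(n) returns the n most frequent items, and asking for a word that never appeared gives 0 instead of an error.

```python
from collections import Counter

sentence = "free code camp free code free"
word_counts = Counter(sentence.split())

top_two = word_counts.most_common(2)   # the two most frequent words
missing = word_counts["python"]        # a word that never appears counts as 0

print(top_two)
print(missing)
```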

If-else Statements in Python

You use if-else statements when you want to execute a task based on a certain condition. In real life, for example, if you pass your exam, you will be promoted. But if you fail, you will have to take it again in order to be promoted.

This type of expression, it turns out, can also be executed in Python using the if-else statement. This is how you write an if else statement:

if condition:
    execute statement
else:
    execute statement

In our exam example, the condition for the above expression is whether you pass or not, and the executable statements are being promoted (if you pass) or retaking the exam (if you fail).

Now what the above expression does is if the condition is evaluated to true, the executable statement inside the if block gets executed. If the condition is not true, the executable statement inside the else block gets executed.

Let's go over an example so we can grok what we just talked about.

Assume I have a list of numbers like [4, 5, 6, 8, 10], and I have a variable i with the value 6. Now I need to write an if-else statement that will print whether or not the i is in the list.

As you might expect, our condition will be whether or not i is in the list, and our executable statement will be to print a message to us. You can do this using the code provided above like this:

lst = [4, 5, 6, 8, 10]
i = 6

if i in lst:
    print("Yes 6 is present in the list")
else:
    print("No 6 is not present in the list")

>>> Yes 6 is present in the list

The i in lst expression is the condition, and it evaluates to True or False. If i were not present in the list, the statement in the else block would be executed instead.
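
The exam example from earlier can be sketched the same way. The score and pass_mark values here are hypothetical and only serve to show the two branches.

```python
score = 65
pass_mark = 50  # hypothetical pass mark

if score >= pass_mark:
    outcome = "promoted"          # runs when the condition is True
else:
    outcome = "retake the exam"   # runs when the condition is False

print(outcome)
```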

You will often need if-else statements to perform conditional operations in Data science.

To learn more about if-else statements, check out this article written by Dionysia Lemonaki that explains Python if-else statements simply.

`range()` Function in Python

The range function, as the name implies, provides a sequence of values within a specific range. You call it as range(start, end), and it produces the values from start up to end - 1. That is, it does not include the end value.

So, let's say I want a list of numbers ranging from 2 to 10. I can easily do that with the range function and then convert the result to a list, instead of typing out each item by hand. For example:

# remember the end value is excluded, so range(2, 11) gives the values from 2 to 10
no_range = range(2, 11)
print(list(no_range))
>>> [2, 3, 4, 5, 6, 7, 8, 9, 10]

You will often need the range() function when you need to get a list of numbers with a long range in data science.
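
range also accepts an optional third argument, the step, and it can be called with just an end value. Neither form is covered above, so treat this as a small supplementary sketch.

```python
even_numbers = list(range(2, 11, 2))   # start at 2, stop before 11, step by 2
countdown = list(range(5, 0, -1))      # a negative step counts down
from_zero = list(range(4))             # with one argument, start defaults to 0

print(even_numbers)
print(countdown)
print(from_zero)
```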

To learn more about range function check out this helpful tutorial from Bala Priya here.

For-Loops in Python

The for loop statement allows you to repeat a task a predefined number of times. The syntax for a for-loop basically looks like this:

for i in iterable:
    execute statement


Here, i is a variable (you can name it anything you prefer) that acts as a placeholder for each item in the iterable (for example a dictionary, list, or string) as the loop runs.

Assume I have a list containing the names of thousands of students and I want to print those names. Now instead of doing it the manual way (where I access the names in the list through indexing like print(names[10]) up to the 1000th element), I can easily employ a for-loop since I want to perform the same task repeatedly.

For example:

lst  = ["Free", "Code", "Camp", "is", "the", "best", "place", "to", "learn"]
for i in lst:
    print(i)

You will often need for loops in Data Science to iterate through an iterable and perform some certain task.

We can see that the i variable serves as a placeholder to access each item in the list. To learn more about for-loops and all their applications check out this helpful tutorial by Kolade Chris here.
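
A loop body can do more than print. As a minimal sketch using the same list, the loop below adds up the number of characters across all the words. The total_characters name is just an illustrative choice.

```python
lst = ["Free", "Code", "Camp", "is", "the", "best", "place", "to", "learn"]

total_characters = 0
for word in lst:
    total_characters += len(word)  # add the length of the current word

print(total_characters)
```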

List Comprehension in Python

A list comprehension is a simple method of generating a new list from another iterable using specific operations.

Assume I have a tuple with some values and want to make a new list from it that only contains values from the tuple that can be divided by 3.

One method is to create an empty list and then use a for loop to iterate through all of the elements in the tuple. Inside the loop, you write an if statement for the condition you want and then append the values that match that condition to the empty list you initialized. Here's what that looks like in code:

my_tuple = (2, 3, 4, 6, 10, 12)
my_new_lst = []
for i in my_tuple:
    if i % 3 == 0:
        my_new_lst.append(i)
print(my_new_lst)
>>> [3, 6, 12]

I can also do that using list comprehension in just one line of code. Let's see how that's done:

my_tuple = (2, 3, 4, 6, 10, 12)

my_new_lst = [i for i in my_tuple if i % 3 == 0]
print(my_new_lst)

>>>[3, 6, 12]

Let's break down what that one line of code does.

To begin, the for part iterates through the tuple, with i acting as a placeholder for each item. Each i is then checked against the condition after if. If the condition evaluates to true, i is added to the newly created list.

You will often need list comprehension in Data Science when you need a simple way to create a new list from an existing list.
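
A list comprehension can also transform each item, not just filter it. The two lines below are a small sketch using the same my_tuple. The squares_of_multiples and labels names are illustrative.

```python
my_tuple = (2, 3, 4, 6, 10, 12)

# Filter and transform: square only the values divisible by 3
squares_of_multiples = [i ** 2 for i in my_tuple if i % 3 == 0]

# Transform every value: an if-else placed before the for keeps all items
labels = ["even" if i % 2 == 0 else "odd" for i in my_tuple]

print(squares_of_multiples)
print(labels)
```

Note the position of the condition: an if after the for filters items out, while an if-else before the for chooses what to put in the list for every item.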

To learn more about list comprehension check out this helpful tutorial by Dionysia Lemonaki here.

User Defined Functions in Python

User defined means functions you create yourself from scratch.

You use functions to modularize or group a large amount of code into smaller pieces. Functions are useful when you need to execute a set of code repeatedly. Instead of typing out that code again and again whenever you need it, you can easily modularize it into a function and then call the function (which is just a one-line statement) whenever you need it.

In Python, you create a function in the following manner.

def function_name(parameter1, parameter2, ...):
    # execute statement
    return value

Parameter in the function serves as a placeholder to hold any value you want to pass inside the function executable statement. You can have more than one parameter depending on what you wish to achieve.
Execute statement means the code that you wish to execute any time you call the function.
return is a keyword. It's not compulsory for a function to return a value. You might decide not to return anything.

Let's look at an example of how to write a function. For example, suppose you want to run some Python code that asks for a person's name and age. You also want to create a conditional statement that prints a message based on the person's age.

Now you wish to execute this code over and over again because you want to try it out on different people. You can easily write a function that will group this code into a piece, which you can then call whenever you need it.

def print_func(person_name, person_age):
    if person_age > 10:
        print(f"Hi {person_name} you are more than your denary age and your name contains {len(person_name)} characters.")
    else:
        print(f"Hi {person_name} you are still in your denary age and your name contains {len(person_name)} characters.")

Now let's go over what we have above. We created a function named print_func which requires two parameters that we want to pass into it: they are person_name and person_age.

Also, the executable statement is the if-else statement we created inside it, which will print out one message if a person's age is greater than 10 and another message if it is not.

You can see that we make use of string formatting to print the person's name and the length of the person's name. Also we decided not to return anything since we just want to print a value to the console.

Now if you wish to call this function, you will call it with its name and the parameters it requires. In our case it requires name and age.

name = "Ibrahim"
age = 12
print_func(name, age)

>>> Hi Ibrahim you are more than your denary age and your name contains 7 characters.

You will often need functions to modularize your code in Data Science.
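
Since print_func does not return anything, here is a sketch of a variant that returns the message instead of printing it. The describe_person name is hypothetical. Returning the value lets you store it, test it, or print it later.

```python
def describe_person(person_name, person_age):
    """Return the greeting instead of printing it (illustrative variant)."""
    if person_age > 10:
        status = "more than your denary age"
    else:
        status = "still in your denary age"
    return f"Hi {person_name} you are {status} and your name contains {len(person_name)} characters."

message = describe_person("Ibrahim", 12)
print(message)
```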

To learn more about how to create function check out this helpful tutorial on functions for beginners from Bala Priya here. Also check this one from Dionysia Lemonaki on how to declare and invoke functions with params here.

Conclusion

We've come to the end of this long journey. You may be wondering whether you should learn advanced topics like object-oriented programming (OOP), which includes concepts like classes, before learning data science.

To answer your question directly, it's not necessary. The majority of your data science work will revolve around these concepts we discussed in this tutorial, and you will primarily use functions to modularize your code.

Still, as your knowledge grows, it's useful to learn OOP in case you need to contribute to an open source project.

Thank you for taking the time to read this article. I hope you learned a thing or two.

For instance you have to remember that the string must be separated with something which can either be a space, letter, or symbol. Also, the string range must not be greater or smaller than the range of the format code.

Thank you for reading.

Web Scraping with Python – How to Scrape Data from Twitter using Tweepy and Snscrape

Ibrahim Ogunbiyi — Tue, 12 Jul 2022 17:58:29 +0000

If you are a data enthusiast, you'll likely agree that one of the richest sources of real-world data is social media. Sites like Twitter are full of data.

You can use the data you can get from social media in a number of ways, like sentiment analysis (analyzing people's thoughts) on a specific issue or field of interest.

There are several ways you can scrape (or gather) data from Twitter. And in this article, we will look at two of those ways: using Tweepy and Snscrape.

We will learn a method to scrape public conversations from people on a specific trending topic, as well as tweets from a particular user.

Now without further ado, let’s get started.

Tweepy vs Snscrape – Introduction to Our Scraping Tools

Now, before we get into the implementation of each tool, let's try to grasp the differences and limits of each one.

Tweepy

Tweepy is a Python library for integrating with the Twitter API. Because Tweepy is connected with the Twitter API, you can perform complex queries in addition to scraping tweets. It enables you to take advantage of all of the Twitter API's capabilities.

But there are some drawbacks – like the fact that its standard API only allows you to collect tweets for up to a week (that is, Tweepy does not allow recovery of tweets beyond a week window, so historical data retrieval is not permitted).

Also, there are limits to how many tweets you can retrieve from a user's account. You can read more about Tweepy's functionalities here.

Snscrape

Snscrape is another approach for scraping information from Twitter that does not require the use of an API. Snscrape allows you to scrape basic information such as a user's profile, tweet content, source, and so on.

Snscrape is not limited to Twitter, but can also scrape content from other prominent social media networks like Facebook, Instagram, and others.

Its advantages are that there are no limits to the number of tweets you can retrieve or the window of tweets (that is, the date range of tweets). So Snscrape allows you to retrieve old data.

But the one disadvantage is that it lacks all the other functionalities of Tweepy – still, if you only want to scrape tweets, Snscrape would be enough.

Now that we've clarified the distinction between the two methods, let's go over their implementation one by one.

How to Use Tweepy to Scrape Tweets

Before we begin using Tweepy, we must first make sure that our Twitter credentials are ready. With that, we can connect Tweepy to our API key and begin scraping.

If you do not have Twitter credentials, you can register for a Twitter developer account by going here. You will be asked some basic questions about how you intend to use the Twitter API. After that, you can begin the implementation.

The first step is to install the Tweepy library on your local machine, which you can do by typing:

pip install git+https://github.com/tweepy/tweepy.git

How to Scrape Tweets from a User on Twitter

Now that we’ve installed the Tweepy library, let’s scrape 100 tweets from a user called john on Twitter. We'll look at the full code implementation that will let us do this and discuss it in detail so we can grasp what’s going on:

import tweepy
import pandas as pd

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


username = "john"
no_of_tweets =100


try:
    #The number of tweets we want to retrieved from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)

    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]

    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))

Now let's go over each part of the code in the above block.

import tweepy

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)

In the above code, we imported the Tweepy library, then created four variables that store our Twitter credentials (the Tweepy authentication handler requires all four). We then pass those variables into the Tweepy authentication handler and save the result in another variable.

The last statement is where we instantiated the Tweepy API and passed in the required parameters.

username = "john"
no_of_tweets =100


try:
    #The number of tweets we want to retrieved from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)

    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]

    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))

In the above code, we set the name of the user (the @name in Twitter) whose tweets we want to retrieve, along with the number of tweets. We then wrapped the rest in a try/except block so that any error is caught and printed.

After that, api.user_timeline() returns a collection of the most recent tweets posted by the user named in the screen_name parameter, up to the number you set in the count parameter.

In the next line of code, we passed in some attributes we want to retrieve from each tweet and saved them into a list. To see more attributes you can retrieve from a tweet, read this.

In the last chunk of code we created a dataframe and passed in the list we created along with the names of the column we created.

Note that the column names must be in the sequence of how you passed them into the attributes container (that is, how you passed those attributes in a list when you were retrieving the attributes from the tweet).

If you correctly followed the steps I described, you should have something like this:

Image by Author

Now that we are done, let's go over one more example before we move into the Snscrape implementation.

How to Scrape Tweets from a Text Search

In this method, we will be retrieving tweets based on a search. You can do that like this:

import tweepy
import pandas as pd

consumer_key = "XXXX" #Your API/Consumer key 
consumer_secret = "XXXX" #Your API/Consumer Secret Key
access_token = "XXXX"    #Your Access token key
access_token_secret = "XXXX" #Your Access token Secret key

#Pass in our twitter API authentication key
auth = tweepy.OAuth1UserHandler(
    consumer_key, consumer_secret,
    access_token, access_token_secret
)

#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)


search_query = "sex for grades"
no_of_tweets =150


try:
    #The number of tweets we want to retrieved from the search
    tweets = api.search_tweets(q=search_query, count=no_of_tweets)

    #Pulling Some attributes from the tweet
    attributes_container = [[tweet.user.name, tweet.created_at, tweet.favorite_count, tweet.source,  tweet.text] for tweet in tweets]

    #Creation of column list to rename the columns in the dataframe
    columns = ["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"]

    #Creation of Dataframe
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))

The above code is similar to the previous code, except that we changed the API method from api.user_timeline() to api.search_tweets(). We've also added tweet.user.name to the attributes container list.

In the code above, you can see that we chained two attributes: tweet.user.name. This is because tweet.user on its own returns the whole user object. So we must also name the attribute we want to retrieve from that user object, which is name.

You can go here to see a list of additional attributes that you can retrieve from a user object. Now you should see something like this once you run it:

Image by Author.

Alright, that just about wraps up the Tweepy implementation. Just remember that there is a limit to the number of tweets you can retrieve, and you can not retrieve tweets more than 7 days old using Tweepy.

How to Use Snscrape to Scrape Tweets

As I mentioned previously, Snscrape does not require Twitter credentials (API key) to access it. There is also no limit to the number of tweets you can fetch.

For this example, though, we'll just retrieve the same tweets as in the previous example, but using Snscrape instead.

To use Snscrape, we must first install its library on our PC. You can do that by typing:

pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git

How to Scrape Tweets from a User with Snscrape

Snscrape includes two methods for getting tweets from Twitter: the command line interface (CLI) and a Python Wrapper. Just keep in mind that the Python Wrapper is currently undocumented – but we can still get by with trial and error.

In this example, we will use the Python Wrapper because it is more intuitive than the CLI method. But if you get stuck with some code, you can always turn to the GitHub community for assistance. The contributors will be happy to help you.

To retrieve tweets from a particular user, we can do the following:

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Created a list to append all tweet attributes(data)
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i >= 100:  # i starts at 0, so this stops after 100 tweets
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

# Creating a dataframe from the tweets list above 
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])

Let's go over some of the code that you might not understand at first glance:

for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:john').get_items()):
    if i >= 100:  # i starts at 0, so this stops after 100 tweets
        break
    attributes_container.append([tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])


# Creating a dataframe from the tweets list above 
tweets_df = pd.DataFrame(attributes_container, columns=["Date Created", "Number of Likes", "Source of Tweet", "Tweets"])

In the above code, sntwitter.TwitterSearchScraper returns an object containing the tweets from the user whose name we passed into it (which is john).

As I mentioned earlier, Snscrape does not limit the number of tweets, so it will return every tweet it can find from that user. To keep this manageable, we add the enumerate function, which iterates through the object and attaches a counter so we can stop after the most recent 100 tweets from the user.

You can see that the attribute syntax for each tweet looks like the one from Tweepy. Here is the list of attributes that we can get from a Snscrape tweet, curated by Martin Beck.

Credit: Martin Beck

More attributes might be added or renamed, as the Snscrape library is still in development. For instance, the source attribute shown in the above image has been replaced with sourceLabel. If you pass in only source, it will return an object.

If you run the above code, you should see something like this as well:

Image by Author

Now let's do the same for scraping by search.

How to Scrape Tweets from a Text Search with Snscrape

import snscrape.modules.twitter as sntwitter
import pandas as pd

# Creating list to append tweet data to
attributes_container = []

# Using TwitterSearchScraper to scrape data and append tweets to list
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('sex for grades since:2021-07-05 until:2022-07-06').get_items()):
    if i >= 150:  # i starts at 0, so this stops after 150 tweets
        break
    attributes_container.append([tweet.user.username, tweet.date, tweet.likeCount, tweet.sourceLabel, tweet.content])

# Creating a dataframe to load the list
tweets_df = pd.DataFrame(attributes_container, columns=["User", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"])

Again, you can access a lot of historical data using Snscrape (unlike Tweepy, whose standard API cannot go back more than 7 days, and whose premium API is limited to 30 days). So we can pass the date from which we want to start the search and the date we want it to end into the sntwitter.TwitterSearchScraper() method.

What we've done in the preceding code is basically what we discussed before. The only thing to bear in mind is that until works similarly to the range function in Python (that is, it excludes the end date). So if you want to get tweets from today, you need to pass the day after today to until.
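
To avoid the off-by-one mistake with until, you could build the query string with a small helper. The sketch below is only an illustration: build_search_query is a hypothetical function, not part of Snscrape, and it only prepares the text you would pass to the scraper.

```python
from datetime import date, timedelta

def build_search_query(text, start, end_inclusive):
    """Build a search string whose until date still covers end_inclusive."""
    until = end_inclusive + timedelta(days=1)  # until is exclusive, so add one day
    return f"{text} since:{start.isoformat()} until:{until.isoformat()}"

query = build_search_query("sex for grades", date(2021, 7, 5), date(2022, 7, 5))
print(query)
```

The resulting string matches the query used in the code above, with the last included day being 2022-07-05.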

Image by Author.

Now you know how to scrape tweets with Snscrape, too!

When to use each approach

Now that we've seen how each method works, you might be wondering when to use which.

Well, there is no universal rule for when to utilize each method. Everything comes down to a matter of preference and your use case.

If you want to acquire an endless number of tweets, you should use Snscrape. But if you want to use extra features that Snscrape cannot provide (like geolocation, for example), then you should definitely use Tweepy. It is directly integrated with the Twitter API and provides complete functionality.

Even so, Snscrape is the most commonly used method for basic scraping.

Conclusion

In this article, we learned how to scrape data from Twitter in Python using Tweepy and Snscrape. But this was only a brief overview of how each approach works. You can learn more by exploring the web for additional information.

I've included some useful resources that you can use if you need additional information. Thank you for reading.

https://github.com/JustAnotherArchivist/snscrape

https://docs.tweepy.org/en/stable/index.html

https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af

How to Handle Missing Data in a Dataset

Ibrahim Ogunbiyi — Fri, 24 Jun 2022 21:14:52 +0000

Missing values are common when working with real-world datasets – not the cleaned ones available on Kaggle, for example.

Missing data could result from a human factor (for example, a person deliberately failing to respond to a survey question), a problem in electrical sensors, or other factors. And when this happens, you can lose significant information.

Now, there is no perfect way to handle missing values that will give you an accurate result as to what the missing value is. But there are several techniques that you can leverage that will give you decent performance.

In this article, we will look at how to handle missing data in the right way (the right way meaning selecting the appropriate technique for whatever scenario our data set might represent).

Remember that none of these methods are perfect – they still introduce some biases, such as favoring one class over another – but they are useful.

Before we begin, I'd like to start with a quote from George Box to back up the preceding statement:

All models are approximations: Essentially all models are wrong but some are useful.

Now without further ado let’s get started.

What Types of Missing Data Are There?

You may be wondering if missing values have types. Yes, they do – and in the real world, these missing values can be divided into three categories.

Understanding these categories will give you some insights into how to approach the missing value(s) in your dataset.

Among the categories are:

Missing Completely at Random (MCAR).
Missing at Random (MAR).
Not Missing at Random (NMAR).

Missing Data that's Missing Completely at Random (MCAR)

These are data that are missing completely at random. That is, the missingness is independent from the data. There is no discernible pattern to this type of data missingness.

This means that you cannot predict whether the value was missing due to specific circumstances or not. They are just completely missing at random.

Missing Data that's Missing at Random (MAR)

These types of data are missing at random, but not completely at random. The data's missingness is determined by the data you do observe.

Consider for instance that you built a smart watch that can track people's heart rates every hour. Then you distributed the watch to a group of individuals to wear so you can collect data for analysis.

After collecting the data, you discovered that some data were missing, which was due to some people being reluctant to wear the wristwatch at night. As a result, we can conclude that the missingness was caused by the observed data.

Missing Data that's Not Missing at Random (NMAR)

These are data that are not missing at random, and this kind of missingness is also known as non-ignorable. In other words, the missingness of the missing data is determined by the variable of interest itself.

A common example is a survey in which students are asked how many cars they own. In this case, some students may purposefully fail to complete the survey, resulting in missing values.

How Should You Handle Missing Data?

As we noted earlier, no technique can determine exactly what a missing value was, and each one introduces some bias.

Handling missing values falls generally into two categories. We will look at the most common in each category. The two categories are as follows:

Deletion
Imputation

How to Handle Missing Data with Deletion

One of the most prevalent methods for dealing with missing data is deletion. And one of the most commonly used methods in the deletion approach is list-wise deletion.

What is List-Wise Deletion?

In the list-wise deletion method, you remove a record or observation in the dataset if it contains some missing values.

You can perform list-wise deletion on any of the aforementioned missing value categories, but one of its disadvantages is potential information loss.

The general rule of thumb for when to perform list-wise deletion is when the number of observations with missing values exceeds the number of observations without missing values. This is because the dataset does not have a lot of information to feed the missing values, so it is better to drop those values or discard the dataset entirely.

You can implement list-wise deletion in Python by simply using the Pandas .dropna method like this:

df.dropna(axis=0, inplace=True)  # axis=0 drops rows (observations) that contain missing values
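
To see what list-wise deletion does to a table, here is a minimal sketch on a made-up DataFrame (the column names and values are assumptions for illustration only). It contrasts dropping rows, which is list-wise deletion, with dropping columns:

```python
import numpy as np
import pandas as pd

# Toy data (made up): two of the four rows contain a missing value.
df = pd.DataFrame({
    "Distance": [2.5, np.nan, 4.0, 3.1],
    "Rooms": [2, 3, np.nan, 4],
})

# axis=0 (the default) drops every ROW that has at least one missing value.
rows_kept = df.dropna(axis=0)

# axis=1 drops every COLUMN that has at least one missing value instead.
cols_kept = df.dropna(axis=1)

print(rows_kept.shape)  # (2, 2)
print(cols_kept.shape)  # (4, 0)
```

Here list-wise deletion keeps two complete observations, while dropping by column would throw away both features. This is the information loss mentioned above.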

How to Handle Missing Data with Imputation?

Another frequent general method for dealing with missing data is to fill in the missing value with a substituted value.

This methodology encompasses various methods, but we will focus on the most prevalent ones here.

Prior knowledge of an ideal number

This method entails replacing the missing value with a specific value. To use it, you need to have domain knowledge of the dataset. You use this to populate the MAR and MCAR values.

To implement it in Python, you use the .fillna method in Pandas like this:

df.fillna(0, inplace=True)  # 0 is only a placeholder; substitute the value your domain knowledge suggests

Regression imputation

The regression imputation method includes creating a model to predict the observed value of a variable based on another variable. Then you use the model to fill in the missing value of that variable.

This technique is used for the MAR and MCAR categories when the features in the dataset depend on one another. For example, you could use a linear regression model to predict the missing values.
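
The idea can be sketched in a few lines. This is a minimal illustration on made-up data, assuming a single predictor and a straight-line fit with NumPy's polyfit (the arrays x and y are hypothetical, not from a real dataset):

```python
import numpy as np

# Toy data (made up): y depends on x, and two y values are missing.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.0, np.nan, 8.0, np.nan, 12.0])

observed = ~np.isnan(y)

# Fit y = slope * x + intercept using only the rows where y is observed.
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)

# Fill only the missing entries with the model's predictions.
y_imputed = y.copy()
y_imputed[~observed] = slope * x[~observed] + intercept

print(y_imputed)
```

The observed values are left untouched. Only the gaps are replaced by what the fitted line predicts for their x values.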

Simple Imputation

This method involves utilizing a numerical summary of the variable where the missing value occurred (that is using the feature or variable's central tendency summary, such as mean, median, and mode).

When you use this strategy to fill in the missing values, you need to evaluate the variable's distribution to determine which central tendency summary to apply.

You use this method in the MCAR category. And you implement it in Python using the SimpleImputer transformer in the Scikit-learn library.

import pandas as pd
from sklearn.impute import SimpleImputer
# Use the column's median as the fill value
fea_transformer = SimpleImputer(strategy="median")
values = fea_transformer.fit_transform(df[["Distance"]])
pd.DataFrame(values)

KNN Imputation

KNN imputation is a more refined alternative to the Simple Imputation method. It operates by replacing a missing value with the mean of that feature's values in the nearest neighboring observations.

You can use KNN imputation for the MCAR or MAR categories. And to implement it in Python you use the KNNImputer transformer in Scikit-learn, as seen below:

import pandas as pd
from sklearn.impute import KNNImputer
# Use the 3 nearest neighbors
fea_transformer = KNNImputer(n_neighbors=3)
values = fea_transformer.fit_transform(df[["Distance"]])
pd.DataFrame(values)
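
Neighbors are found using the other features of each observation, so KNN imputation is most informative when the imputer sees more than one column. Here is a minimal sketch on a made-up two-column array (the values are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (made up): column 0 is fully observed, column 1 has one gap.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
    [5.0, 50.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest rows (judged by column 0) are [2, 20] and [4, 40],
# so the gap is filled with the mean of 20 and 40.
print(X_filled[2, 1])
```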

How to Use Learning Algorithms

The final strategy we'll mention in this post is using machine learning algorithms to handle missing data.

Some learning algorithms allow us to fit a dataset that still contains missing values. The algorithm then searches for patterns in the dataset and uses them to decide how to treat the missing values. Such algorithms include XGBoost, Gradient Boosting, and others. But further discussion is out of the scope of this article.

Conclusion and Learning More

In this article, we've covered some of the most prevalent techniques you'd use on a daily basis to handle missing data.

But the learning does not end here. There are several other techniques available to assist us in filling our dataset, but the key is to grasp the underlying mechanisms in those techniques so that we can manage missing values properly. Thanks for reading.

Statistics for Beginners – Top Stats Concepts to Know Before Getting into Data Science

Ibrahim Ogunbiyi — Fri, 10 Jun 2022 16:33:29 +0000

You've probably heard that statistics is the gateway to data science and that the data science map starts with stats.

Perhaps you've also heard from others that you have to learn statistics before learning data science. But then you ponder, "Since I'm not from a technical background like science, technology, engineering, or math (STEM), do I need to learn everything in statistics before getting into data science?" And those same people will tell you "Yes! You have to learn statistics."

Well, here's my answer: you don't need to learn all of statistics before beginning data science (though you do need to learn some fundamentals).

You can also learn as you go instead of wasting time learning statistics first before data science (that is, as you advance in your knowledge of data science, you can always learn more statistics concepts).

That being said, it is helpful to know statistics basics before jumping into data science. You can indeed say that stats is the gateway to data science because it will help you to have some intuition about your data and how to work with it.

In this article, we'll look at the top statistical concepts you need to know before diving into data science. I'll make it as simple as possible even if you don't come from a technical background. I can tell you're excited and ready to dive into the realm of data science. Let's get started.

What is Statistics?

According to economist and sampling technique pioneer Arthur Lyon Bowley, Statistics is:

"numerical statements of facts in any department of inquiry placed in relation to each other."

That basically means that statistics helps us comprehend our data and also helps us convey the results in that data to others.

Statistical methods (that is, the techniques employed in dealing with data in statistics) are classified into two types:

Descriptive Statistics
Inferential Statistics

Descriptive Statistics is a discipline of statistics that assists us in summarizing data through numerical values or graphical visualization.

Descriptive statistics helps us identify and understand some key properties in our data. It includes concepts such as central tendency, dispersion, boxplots, histograms, and so on, which we'll discuss later in the article.

Inferential Statistics, on the other hand, is a branch of statistics that helps us make decisions or predictions based on the data that we have gathered.

Inferential statistics is a significantly more advanced topic because it requires a deep understanding of descriptive statistics. It includes concepts such as hypothesis, probability, and so forth.

Top Statistical Concepts to Know Before Learning Data Science

Since you're now familiar with the definition of statistics, let's have a look at some of the statistical concepts that'll help guide you when you dive into the realm of data science.

Among the most fundamental concepts are:

What is a Subject?

This is the specific thing we wish to observe. It could be a person, an animal, or something else. It is also known as observation.

What is a Population?

Population refers to the entire set of subjects in which we are interested (that is, that we want to observe). Assume you wish to count the number of females in a specific country. The population is then everyone living in that country.

What is a Sample?

In reality, observing an entire population is rarely practical (because it can be very expensive and time-consuming).

Consider the following scenario: you wish to observe every female in the world. This type of observation can be costly to carry out. However, in statistics, we have something called a sample, which is a portion/subset of the population that you want to study. We can then draw conclusions (inferential statistics) about the full population using the sample.

What are Parameters?

This is a property/summary of a population. Consider the following scenario: you are observing the entire country and you discover that 90% of the inhabitants are males while 10% are females. The numerical values, 90%, and 10% are a numerical summary (that is, descriptive statistics) of the entire population. As a result, the summary is known as the population parameter.

What is a Statistic?

On the other hand, a statistic (not to be confused with statistics, the field) is a property of a sample. As stated in the preceding example, instead of working with the full population, we work with samples, so the numerical summary is referred to as the statistic of the sample.
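
The difference between a parameter and a statistic can be sketched with a simulated population. Everything here is hypothetical: the population size, the 10% share of females (borrowed from the example above), and the sample size of 200 are assumptions for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 people: 1 = female, 0 = male.
population = [1] * 1000 + [0] * 9000

# Parameter: a summary computed on the whole population.
parameter = sum(population) / len(population)

# Statistic: the same summary computed on a random sample of 200 subjects.
sample = random.sample(population, k=200)
statistic = sum(sample) / len(sample)

print(f"population proportion (parameter): {parameter:.3f}")
print(f"sample proportion (statistic):     {statistic:.3f}")
```

The statistic will usually be close to the parameter without matching it exactly, which is why inferential statistics is needed to reason from one to the other.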

Hopefully you now have a decent understanding of what population, sample, statistic, and parameters are. Let's take a look at another concept with which we are all too familiar: "Data".

Data, as the term implies, represents factual information. That is, it conveys a message to us. It can, however, be divided into two categories:

Quantitative data.
Qualitative data.

What is Quantitative Data?

This is also known as numerical data. It is data made up of numerical values that can be counted or measured. Quantitative data can be further classified into two types:

Quantitative discrete data: These are numerical data that can be counted but cannot be measured. Counting the number of shoes in a shoe store is a common example.

Quantitative continuous data: This is a type of numerical data that is based on measurement. For example, measuring the weight of a glass cylinder is continuous, not discrete.

What is Qualitative Data?

These are sorts of data that represent categories or groups of data. They are also known as categorical data. They are usually written in text. They can be characteristics, names, or anything else.

A common example is a person's name, dog breeds, and so on. However, there are some data that appear to be numerical data but are encoded as categorical data.

For example, suppose you wanted to group a certain group of people based on their age and discovered that the lowest and highest ages are 10 and 60, respectively. You then divided the ages into 5 categories (10-20, 21-30, 31-40, 41-50, 51-60) and assigned numerical values to each of those categories where 1 represents 10-20, 2 represents 21-30, and so on.

In this situation, the numerical values will be handled as categorical data rather than quantitative data. As your data science career progresses, you will learn how to work with categorical data.
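
As a small sketch of the age example above, the function below maps an age to its category code. The function name and the error handling are assumptions made for illustration:

```python
# Age bands from the example above, each mapped to a numeric code.
AGE_BANDS = [(10, 20), (21, 30), (31, 40), (41, 50), (51, 60)]

def age_to_category(age):
    """Return the 1-based category code for an age between 10 and 60."""
    for code, (low, high) in enumerate(AGE_BANDS, start=1):
        if low <= age <= high:
            return code
    raise ValueError(f"age {age} is outside the 10-60 range")

ages = [10, 25, 38, 47, 60]
codes = [age_to_category(a) for a in ages]
print(codes)  # [1, 2, 3, 4, 5]
```

The codes are labels for groups, not measurements, so averaging them would not tell you anything about the average age.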

Now you know the categories of data. Quantitative and qualitative data can be treated in statistics using these levels of measurement. Data in statistics can be classified into 4 levels of measurement which are:

Nominal scale data
Ordinal Scale data
Interval Scale data
Ratio Scale data

Qualitative data can be measured using:

Nominal scale data: These are the type of categorical data that do not have an ordered sense. That is, they cannot be ordered.

Each piece of data represents a single unit. An example of such categorical data includes color. It is not very ideal to rank blue over yellow. When working with nominal data, each data point must be handled as a separate unit.

Ordinal Scale data: Ordinal Scale data consists of ordered categorical data. When data is ranked, there is a sense of order in it. A survey response such as excellent, good, satisfactory, and unsatisfactory is an example of this. It makes sense to rank excellent above good.

Quantitative data can be measured using:

Interval Scale Data: These are numerical data with ordering and can be measured (for example find the difference between the data). The readings on a temperature scale are an example of interval data.

For example, you can measure the difference between 4 and 10 degrees Celsius, and 10 degrees is higher than 4 degrees. However, interval scale data has two limitations:

It does not have a true zero point (that is, zero does not mean "no temperature", and you can have a temperature value below zero)
You can't meaningfully take ratios: for example, it makes no sense to claim that 80 degrees Celsius is 4 times as hot as 20 degrees Celsius.

Ratio Scale data: These are numerical data that have the features of interval scale data (that is, they can be ordered and measured), but without its limitations (they have a true zero point, and you can find the ratio between values).

A grade score of 20, 68, 90, or 80 is an example. We can order it, measure it, and find the ratio between the values. It makes sense to say the score of 80 is 4 times better than the score of 20.
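
To make the interval-versus-ratio distinction concrete, here is a short sketch using temperature. The Kelvin scale is not discussed in this article. It is brought in here only as an example of a temperature scale with a true zero, and the helper function name is hypothetical.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin, a temperature scale with a true zero."""
    return celsius + 273.15

low_c, high_c = 20.0, 80.0

# Differences are meaningful on an interval scale.
difference = high_c - low_c

# A naive ratio on the Celsius scale suggests "four times as hot" ...
naive_ratio = high_c / low_c

# ... but on a scale with a true zero the ratio is far smaller.
true_ratio = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

print(difference)            # 60.0
print(naive_ratio)           # 4.0
print(round(true_ratio, 2))  # 1.2
```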

Now that we've covered the fundamentals of data, let's look at how the first category of statistics (descriptive statistics) can be applied to data.

As previously stated, descriptive statistics involves summarizing data either numerically or graphically. Let's take a look at some of the most typical numerical and graphical summaries you'll encounter when dealing with data on a regular basis.

Mean vs Median vs Mode – What is the Difference?

Mean, Median, and Mode explained through illustration. Mode is the high point, Median is the half way point, and Mean is the average.

What is a Mean?

When we have a set of numerical data like this (4, 5, 6, 7, 10), each value in the set of data is referred to as a data point. We might want to find the data's average value.

So mean is essentially the average of a set of data and is calculated as the sum of all the data points divided by the total number of data points.

In the data set above, the sum is 32 and the total number of data points is 5. So the average, that is the mean, is 6.4.

Mean is only used on numerical data. Finding the average of categorical data is not meaningful.

What is a Median?

Also, given a group of values, we may want to discover the value in the center. The median is used to compute the value in the middle. Median also is used on numerical data only.

What is a Mode?

This is the value with the highest frequency (that is a value that has the highest number of occurrences). The mode can be used for numerical or categorical data.
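
All three summaries are available in Python's built-in statistics module. The sketch below uses the data set from the mean example above, plus a made-up list of colors to show the mode on categorical data:

```python
import statistics

data = [4, 5, 6, 7, 10]
mean_value = statistics.mean(data)
median_value = statistics.median(data)
print(mean_value)    # 6.4
print(median_value)  # 6

# Mode also works on categorical data (this list is made up).
colors = ["blue", "red", "blue", "yellow", "blue", "red"]
mode_value = statistics.mode(colors)
print(mode_value)    # blue
```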

What is an Outlier?

Outliers are data points that differ from other data points and, when present, can lead us to incorrect conclusions. Here's a typical example of how outliers are harmful.

Consider the following scenario: you have a machine that counts how many customers enter your supermarket every day, and the readings are thus for a given week (20, 23, 26, 27, 302). We can see that the number 302 is an outlier because it deviates significantly from the other data points.

Outliers can result from a sudden change, machine faults, or other circumstances. When they are present, they can lead us to incorrect decisions. For example, if you want to find the average number of customers who visit your supermarket, the value 302 will distort the result: the mean of the preceding values is 79.6, far above what a typical day looks like.
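
You can check the effect of the outlier on the supermarket readings with a few lines of Python. This sketch compares the mean with and without the value 302, and also shows the median, which is much less sensitive to a single extreme value:

```python
import statistics

visits = [20, 23, 26, 27, 302]

mean_with_outlier = statistics.mean(visits)
median_with_outlier = statistics.median(visits)

# Same readings with the last value (the outlier) left out.
mean_without_outlier = statistics.mean(visits[:-1])

print(f"mean with outlier:    {mean_with_outlier:.1f}")     # 79.6
print(f"median with outlier:  {median_with_outlier:.1f}")   # 26.0
print(f"mean without outlier: {mean_without_outlier:.1f}")  # 24.0
```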

What is a Standard Deviation?

A Standard Deviation is a summary value that indicates how far our data point deviates from the mean. It is used to determine the spread of our data.

The closer the standard deviation is to zero, the closer our data points are to one another.

The standard deviation is an extremely valuable summary because an unusually large value can signal that we have some outliers in our dataset. Here's how it works:

A chart of a Normal Distribution, with the number of standard deviations listed on the x axis.

In the above chart, we see a Normal Distribution. 34.1% + 34.1% = 68.2% of all observations are within one standard deviation, or 1σ (pronounced one Sigma).

A further 13.6% + 13.6% = 27.2% of observations lie between one and two standard deviations from the mean, so about 95.4% fall within two standard deviations, or 2σ. And so on.

And yes, if you've heard of Six Sigma, that is a concept in engineering where six standard deviations' worth of possibilities are accounted for in the quality assurance process. Meaning you are accounting for all but the most extreme outliers. 99.99966% of all possibilities, to be exact.
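
The percentages above can be reproduced with the NormalDist class from Python's statistics module. This is a minimal sketch. The small data set reuses the values from the mean example, and the population standard deviation (pstdev) is one of two common variants:

```python
import statistics
from statistics import NormalDist

# Spread of the small data set used in the mean example.
data = [4, 5, 6, 7, 10]
spread = statistics.pstdev(data)
print(round(spread, 2))

# Share of a normal distribution within 1 and 2 standard deviations.
standard_normal = NormalDist(mu=0, sigma=1)
within_1 = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2 = standard_normal.cdf(2) - standard_normal.cdf(-2)
between_1_and_2 = within_2 - within_1

print(f"within 1 sigma:        {within_1:.1%}")
print(f"between 1 and 2 sigma: {between_1_and_2:.1%}")
print(f"within 2 sigma:        {within_2:.1%}")
```

The results differ from the chart's figures only by rounding.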

Now that we've grasped some numerical summaries, let's take a look at some common graphical summaries.

What is a Bar Chart?

A bar chart is a type of data visualization used for categorical data. You use it to graphically show the frequency of categorical data (that is the number of times a categorical data point occurs). Here's an example:

What is a Histogram?

A histogram is similar to a bar chart in that the height of each bar shows a frequency, but it is used for numerical data and groups the data points into bins or ranges.

It is a very efficient visualization tool because it helps you visualize the distribution of your numerical data. You can read more here to learn more about histograms.

What is a Boxplot?

Another excellent visualization that helps you visualize the distribution of your data is the boxplot.

A boxplot, for example, allows you to visually observe if there are any outliers in your data collection. It includes terms such as minimum, 25th percentile, 50th percentile, 75th percentile, and maximum. A Boxplot looks as follows:

Image by Ibrahim Ogunbiyi

So let’s go over what we have in the above diagram:

Minimum: The minimum value does not imply the smallest value in our dataset. It is calculated using this formula ( Q1 -1.5*IQR) where:

Q1 – implies The 25th percentile
IQR – implies the Interquartile range (which is the difference between the 75th percentile and the 25th percentile).

The minimum helps us detect data points that are far below the other observed values.

For instance, assume our data points are spread like this: [345, 402, 295, 386, 10]. We can see that the value 10 is an outlier because it is far below the other observations.

The 25th percentile is a value that tells us that 25% of our data points are below that value and 75% of our data points are above that value. The 25th percentile is also known as the first quartile.

The 50th percentile is a value that indicates that 50% of our data points are below that value and the remaining 50% are above that value. It is also known as the second quartile.

The 75th percentile is a value that tells us that 75 percent of our data points are below that value and the remaining 25 percent are above it. It is also known as the third quartile.

Maximum: Also like the minimum, the maximum does not imply the highest value in the dataset. It is calculated using the formula (Q3 + 1.5*IQR) where:

Q3 – implies the 75th percentile
IQR implies Interquartile Range (which is the difference between the 75th percentile and the 25th percentile).

The maximum helps us detect data points that are far above the other observed values.

For instance, assume our data points are spread like this: [645, 40, 25, 38, 42]. We can see that the value 645 is an outlier because it is far above the other observations.
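
Here is a minimal sketch of the two formulas, applied to the supermarket readings from the outlier section. It assumes the "inclusive" quartile method of statistics.quantiles. Other quartile conventions exist and can give slightly different cut-offs, especially on small data sets.

```python
import statistics

visits = [20, 23, 26, 27, 302]

# Quartiles: Q1 (25th), Q2 (50th, the median), Q3 (75th percentile).
q1, q2, q3 = statistics.quantiles(visits, n=4, method="inclusive")
iqr = q3 - q1

# Boxplot "minimum" and "maximum" as defined above.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything outside the two limits is flagged as an outlier.
outliers = [v for v in visits if v < lower_limit or v > upper_limit]

print(q1, q3, iqr)               # 23.0 27.0 4.0
print(lower_limit, upper_limit)  # 17.0 33.0
print(outliers)                  # [302]
```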

We've seen some graphical summaries of what we'll be dealing with on a daily basis. Let's look at the final topic we will discuss in this article:

What is the Association Between Quantitative Variables?

A variable is any characteristic (recorded as text or numbers) that is measured across a collection of observations. It is sometimes referred to as a column in a table.

Two variables are said to be associated if a specific value of one variable is most likely to occur with a specific value of another variable.

To study the association between two quantitative variables (often referred to as correlation), we calculate it using the Karl Pearson formula, and the result is between -1 and +1.

If the correlation value approaches 1, it indicates that the two variables are positively correlated (that is, as one variable increases the other variable increases as well). If the value approaches -1, it indicates that the variables are negatively correlated (that is, as one variable increases, the other variable decreases). Finally, if the correlation coefficient is 0, there is no linear correlation between the variables.

You can read more here to know more about correlation and the Karl Pearson formula.
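
For reference, the Pearson coefficient can be computed by hand in a few lines. This is a minimal sketch, assuming two equal-length lists of numbers. The function name and the sample data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from each mean.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]   # tends to rise with hours studied
hours_of_tv = [9, 7, 6, 4, 2]       # tends to fall with hours studied

r_positive = pearson_r(hours_studied, exam_score)
r_negative = pearson_r(hours_studied, hours_of_tv)
print(round(r_positive, 3))
print(round(r_negative, 3))
```

The first result is close to +1 and the second is close to -1, matching the interpretation above.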

What is a Scatter Plot?

We can represent the correlation between quantitative variables in a graphical summary by using a plot called a scatter plot.

A scatter plot looks like this:

Scatter (XY) Plots (mathsisfun.com)

To learn about scatter plots you can read more here.

Conclusion and Learning More

In this tutorial, we've explored some fundamental statistics concepts that will help you work more efficiently with your data.

But the learning does not stop here – there are a few fundamental topics that you must be familiar with. Because this is only the beginning, you can delve deeper by consulting online resources or textbooks.

Thank you very much for reading, and please share the article so that beginners who want to go into data science can learn as well.

How to Deploy a Machine Learning Model as a Web App Using Gradio

Ibrahim Ogunbiyi — Wed, 01 Jun 2022 15:14:51 +0000

You've built your Machine Learning model with 99% accuracy and now you are ecstatic. You are like yaaaaaaaaay! My model performed well.

Then you paused and you were like – now what?

Well first, you might have thought of uploading your code to GitHub and showing people your Jupyter notebook file. It comprises those gorgeous-looking visualizations you created using Seaborn, those extremely powerful ensemble models, and how they are able to pass their evaluation metrics and so on.

But then you noticed that no one is interacting with it.

Well, my friend, why not try deploying the model as a web app so that non-techies can interact with the model, too? Because only programmers like you will likely understand that first approach.

There are several methods for deploying your model, but we will focus on one of them in this article: using Gradio. I can tell you're excited. Well, relax and enjoy, because this is going to be an exciting ride.

Prerequisites

Before beginning this journey, I assume you have the following knowledge:

You know how to create a user-defined function in Python
You can build and fit an ML model
Your environment is all set up

What is Gradio?

Gradio is a free and open-source Python library that allows you to develop an easy-to-use customizable component demo for your machine learning model that anyone can use anywhere.

Gradio integrates with the most popular Python libraries, including Scikit-learn, PyTorch, NumPy, seaborn, pandas, TensorFlow, and others.

One of its advantages is that it allows you to interact with the web app you are currently developing in your Jupyter or Colab notebook. It has a lot of unique features that can help you construct a web app that users can interact with.

How to Install Gradio

To use Gradio, we must first install its library on our local PC. So go to your Conda PowerShell or terminal and run the following command. If you are using Google Colab you can also type the following:

pip install gradio

We now have Gradio installed on our local PC. Let's go through some of the fundamentals of Gradio so we can become acquainted with the library.

To begin, we must import the library into our notebook or IDE, whichever you are using. We can do this by typing the following command:

import gradio as gr

How to Create Your First Web App

In this tutorial, we'll create an example greeting app to familiarize ourselves with the fundamentals of Gradio.

To do so, we'll need to write a greeting function because Gradio works with Python user defined functions. As a result, our greeting function looks like this:

def greet_user(name):
    return "Hello " + name + " Welcome to Gradio!😎"

We now need to deploy the Python function on Gradio so that it can act as a web app. To do this, we type:

app =  gr.Interface(fn = greet_user, inputs="text", outputs="text")
app.launch()

Let’s walk through what is going on in the above code before we run it.

gr.Interface: This class serves as the bedrock of anything in Gradio. It is the user interface that displays all the components that will be shown on the web.

The parameter fn: This is the Python function you created and want to provide to Gradio.

The inputs parameter: These are the components that you wish to pass into the function that you created, such as words, images, numbers, audio, and so on. In our case, the function we created requires text, so we passed "text" to the inputs parameter.

The outputs parameter: This is a parameter that allows you to display the component on the interface that you want to see. Because the function we created in this example needs to display text, we supply the text component to the outputs parameter.

app.launch is used to launch the app. You should have something like this when you run the above code:

Once the Gradio interface comes up, just type your name and hit submit. Then it outputs the result of the function we created above. Now that we are done with that, let’s go over one more thing in Gradio before we learn how to deploy our model.

We will create a Gradio app that accepts two inputs and provides one output. This app asks for your name and a value, and then outputs your name along with the square of the value you entered. To do that, type the code below:

def return_multiple(name, number):
    result = "Hi {}! 😎. The square of {} is {}".format(name, number, round(number**2, 2))
    return result

app = gr.Interface(fn = return_multiple, inputs=["text", gr.Slider(0, 50)], outputs="text")
app.launch()

Now that we’ve done that let’s quickly go through some of the things we did here that you might not be familiar with.

Inputs parameter: In the inputs parameter we created a list that contains two components, the text and the slider. The slider is one of Gradio's components, and it returns a numeric value when you slide across a given range. We used this because the function we created expects a text and a value.

We have to order the components in the inputs parameter the same way the parameters are ordered in the function we created above. That is, text first and then the number. The output we are expecting is a string. We just did some formatting in the above function.

Now that we’ve familiarized ourselves with some of the basics of Gradio, let’s create a model that we will deploy.

How to Deploy a Machine Learning Model on Gradio

In this section, I will use a classification model that I've previously trained and saved in a pickle file.

When you create a model that takes a long time to train, the most effective approach to deal with it is to save it in a pickle file once it is finished training so that you don't have to go through the stress of training the model again.

If you want to save a model as a pickle file, let me show you how you can do that. First import the pickle library and then type the code below. Let’s say I just want to fit a model like this:

import pickle

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train) 

# Once the model is fitted, save it like this (remember to change the file name)
with open("filename.pkl", "wb") as f:
    pickle.dump(clf, f)

Now if you wish to load it you can type the following code as well:

with open("filename.pkl", "rb") as f:
    clf  = pickle.load(f)
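
If you want to convince yourself that saving and loading gives back the same object, here is a minimal round-trip sketch. It uses a plain dictionary as a stand-in for a fitted model and a temporary folder for the file. Both are assumptions for illustration, and any picklable Python object behaves the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model (an assumption for illustration only).
model_stand_in = {"name": "demo-model", "threshold": 0.5}

# Write to a temporary folder so the sketch does not clutter your project.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save: "wb" opens the file for writing in binary mode.
with open(path, "wb") as f:
    pickle.dump(model_stand_in, f)

# Load: "rb" opens the file for reading in binary mode.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stand_in)  # True
```

Only load pickle files that you created or that come from a source you trust, because unpickling can execute arbitrary code.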

Now that we’ve understood that, let’s create a function that we will be able to pass into Gradio so that it can make the predictions.

def make_prediction(age, employment_status, bank_name, account_balance):
    # Load the saved model (remember to change the file name)
    with open("filename.pkl", "rb") as f:
        clf = pickle.load(f)
    # predict expects a 2D input: one row containing the four features
    preds = clf.predict([[age, employment_status, bank_name, account_balance]])
    if preds[0] == 1:
        return "You are eligible for the loan"
    return "You are not eligible for the loan"

#Create the input component for Gradio since we are expecting 4 inputs

age_input = gr.Number(label = "Enter the Age of the Individual")
employment_input = gr.Number(label= "Enter Employment Status {1:For Employed, 2: For Unemployed}")
bank_input = gr.Textbox(label = "Enter Bank Name")
account_input = gr.Number(label = "Enter your account Balance:")
# We create the output
output = gr.Textbox()


app = gr.Interface(fn = make_prediction, inputs=[age_input, employment_input, bank_input, account_input], outputs=output)
app.launch()

So let’s unwrap what we have above:

We'll start at the point where we created the input components. You can choose to create the components inside gr.Interface, but in the code above, I built them outside of gr.Interface and then passed the variables into gr.Interface.

So, if you want to make a component that receives numbers, use gr.Number. For the output, instead of the output variable I created, you can pass "text" as we did earlier in our first app (the "text" string is shorthand for a textbox if you don't want to declare the component explicitly).

Also I used the label parameter in each component so that the user will know what to do. We are already familiar with the other code mentioned above. And now that we've done that our model is deployed. 🎉🎉😎🥳🥳.

Conclusion

Thank you for reading this tutorial. We covered a lot in this article. Just remember that learning Gradio does not stop here – you can check out more on their website. They have pretty intuitive documentation on how you can create your web app.

Thanks once again for reading. If you enjoyed this article, you can support me by following me on LinkedIn or Twitter. Gracias, and happy deployment😀

Ibrahim Ogunbiyi - freeCodeCamp.org

What is Semantic Matching? How to Find Words in a Document Using NLP

Requirements

Problem Definition

What is Semantic Matching?

What is Word Embedding?

What does Lower-Dimensional Vector representation mean?

How do we measure if two vectors are similar?

How to Perform Semantic Matching on a PDF Document

How to Get Words from the PDF using KeyBERT

Embedding of the Birth Control Phrase and the Keywords Extracted from the PDF

Cosine Similarity of Birth Control Phrase and Keywords in PDF

Let’s Also Explore Top 5 Keywords in the PDF that Match with the Phrase “Birth Control”

Wrapping Up

References

Natural Language Processing Techniques for Topic Identification – Explained with Examples

What is Topic Identification?

Requirements for this Project

Techniques Used in NLP for Topic Identification

Bag of Words

How to implement of Bag of Words in Python

Latent Dirichlet Allocation

Non-Negative Matrix Factorization

Steps for performing NMF

How to Choose Which Technique to Use?

Conclusion

How to Write Common Date Functions in SQL with Examples

Date Data types

Common SQL Date Functions

How to use the Now() function

How to use the current_date function

How to use the Extract() or Date_Part() functions

How to add intervals or parts to dates

How to subtract intervals from dates

How to subtract two dates

Conclusion

How to Use Window Functions in SQL – with Example Queries

What is a Window Function?

What exactly is a window in SQL?

What is a Function?

Different Types of Window Functions

Sample Table

Syntax for Window Functions

How to Use a Window Function – Example

How to Use a Window Function with PARTITION BY

Other Examples of Window Functions

How to Use the ROW_NUMBER Function

How to Use the RANK Function

How to Use the DENSE_RANK Function

How to Use the LAG Function

How to Use the Frame Clause in ORDER BY

Frame clause example

When to Use a Window Function

Conclusion

What is Stratified Random Sampling? Definition and Python Example

What is Stratified Random Sampling?

Types of Stratified Random Sampling

Applications of Stratified Random Sampling

1. Sampling Based on Shared Characteristic:

2. Imbalanced Dataset:

Conclusion

How to Perform Customer Segmentation in Python – Machine Learning Tutorial

What is Customer Segmentation?

Criteria for Customer Segmentation

Understanding the Business Problem.

Tools We'll Use for this Project

Dataset We'll Use for this Project

Customer Personality Analysis Features

Exploratory Data Analysis (EDA)

Univariate analysis

Bivariate Analysis

Multivariate Analysis

How to Build the Segmentation Model

How to Interpret the Cluster Result

Conclusion

How the Python Lambda Function Works – Explained with Examples

What is a Lambda Function?

How to Define a Lambda Function

When Should You Use a Lambda Function?

Common Use Cases for Lambda Functions

`Filter()`

`Map()`

`Zip()` Function in Python

`Enumerate()` Function in Python

`Counter()` Function in Python

`Range()` Function in Python