Mene-Ejegi Ogbemi - freeCodeCamp.org

What is Overfitting in Machine Learning?

Mene-Ejegi Ogbemi — Mon, 16 Oct 2023 20:21:22 +0000

Have you ever performed some task without really thinking about the process involved? For example, making coffee, tying your shoes, or walking through your neighborhood.

In these types of activities, you've done these things so many times that you've mastered the process. You can be thinking about something unrelated, yet you perform these activities all the same. This phenomenon is called procedural memory in psychology.

We have this kind of thing with machine learning models as well, but it's not as positive as it is with humans. This is known as overfitting in machine learning.

What is Overfitting?

In overfitting, a model becomes so good at our training data that it has mastered every pattern, including noise. This makes the model perform well with training data but poorly with test or validation data.

The illustration below depicts how an optimal model fits into the data compared to overfitting.

In the graph, we have our features on the x-axis. In datasets, features are data that can be used to predict an outcome. The output variable is the outcome based on those features. The blue dots represent the data points where the features determine output variables.

In the optimal graph, our model tries to find the generalized trend. But in our overfitted chart, the model tries to pass through every data point, resulting in an irregular, wiggly curve.

An example of a case study would be to predict if a customer would default on a bank loan. Assuming we have a dataset of 100,000 customers containing features such as demographics, income, loan amount, credit history, employment record, and default status, we split our data into training and test data.

Our training dataset contains 80,000 customers, while our test dataset contains 20,000 customers. On the training dataset, we observe that our model has 97% accuracy, but on the test dataset, we only get 50% accuracy. This large gap shows that we have an overfitting problem.

Why is overfitting a problem? Because it produces incorrect predictions on new data. The purpose of machine learning models is to make predictions that help business decision-making. We waste time and resources when our model makes incorrect predictions.

Imagine predicting that a customer will pay back a loan, and the customer defaults. Not just one customer but thousands of customers. This can cause a crisis for any financial institution.

Causes of Overfitting

Noisy data

Noise in data often appears as errors, fluctuations, or outliers in the data. This can be caused by data entry errors, data aging, data transmission errors, and so on.

Too much noise in data can cause the model to think these are valid data points. Fitting the noise pattern in the training dataset will cause poor performance on the new dataset.

For example, let's say that we are building a machine-learning model to classify images of cats and dogs. But some of the images in the dataset are blurry or poorly lit. While the model may perform well on the training data, it might struggle on the test data because it may have learned patterns from the blurry images that don't carry over to new images.

In the picture above, you can see that we have some blurry images that cannot be reliably labelled as cat or dog. In these instances, the model could learn patterns from these images alongside the relevant features. Removing these images can reduce overfitting.

Insufficient training data

There will be fewer patterns and noises to analyze if we don't have sufficient training data. This means that the machine can only learn a little about our data.

Using our previous example, if our training data contains fewer images of dogs but many more of cats, the model learns so much about cats that when we feed the system an image of a dog, it will likely give a wrong output.

Overly complex model

In a complex model, there are many parameters capable of capturing patterns and relationships in training data. As a result, our model can fit the training data more closely.

But this can pose a problem, since the model can start capturing noise, fluctuations, or outliers. Let's look at a decision tree model, how it works, and how overfitting can happen when it becomes too complex.

A decision tree model works by repeatedly splitting the data on its most informative features, with each split forming a node. This creates a tree-like structure.

To make a prediction, it starts from the root node and follows the branches down, testing one feature at each node, until it gets to a leaf node. The prediction is then made based on the value associated with that leaf node.

Let's look at a simple tree diagram of how a decision tree can predict if a customer is likely to default on a loan based on certain features.

Tree diagram showing whether a customer is likely to default on a loan

This model starts by creating a parent node which is credit score. Depending on whether the credit score for the applicant is high or low, it goes down to the next node, which is either debt to income ratio or employment status. Then it makes the final prediction as to whether the customer is likely to default or not.

A decision tree can become overly complex when it creates too many nodes, making it too detailed or specific to the training data.
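To make the idea of nodes and branches concrete, here is a minimal sketch that trains a very shallow tree and prints its rules as text. The data is synthetic, and the feature names (credit_score, debt_to_income) and the labelling rule are assumptions made up for illustration. They are not taken from the loan dataset used later in this article.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical synthetic applicants (not the article's dataset)
rng = np.random.default_rng(0)
credit_score = rng.integers(300, 850, size=200)
debt_to_income = rng.uniform(0.0, 0.8, size=200)
X_demo = np.column_stack([credit_score, debt_to_income])

# Hypothetical labelling rule: low credit score or high debt ratio -> default (1)
y_demo = ((credit_score < 600) | (debt_to_income > 0.5)).astype(int)

# A shallow tree: at most two levels of questions before reaching a leaf
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(X_demo, y_demo)

# Print the learned rules from the root node down to the leaves
rules = export_text(shallow_tree, feature_names=["credit_score", "debt_to_income"])
print(rules)
```

Each indented line in the printed output is one node, so a tree with many more levels would produce a much longer and more specific set of rules.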

Let's see a sample machine learning program that predicts whether a customer will default on a loan or not using a decision tree model. For brevity, I won't be showing the cleaning process and visualization. I'll just focus on the required functions and how overfitting can happen with a decision tree model.

The link to the complete repository containing cleaning and visualization can be found here, and you can get the dataset here.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

%matplotlib inline

#Importing our dataset
train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')

#Combine both training and test data
df = pd.concat([train, test], axis=0)
df.head()

#View dataset
train.head()

# Copy required features to a variable df_
df_ = train[['Gender',
'Married',
'Education',
'Self_Employed',
'Dependents',
'ApplicantIncome',
'CoapplicantIncome',
'LoanAmount',
'Loan_Amount_Term',
'Property_Area',
'Credit_History']]

### Duplicate a copy of df into X
X = df_.copy()

### label encode for Y
y = train['Loan_Status'].map({'N':0,'Y':1}).astype(int)

### train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train
clf = DecisionTreeClassifier() #change model here
clf.fit(X_train, y_train)

# predict
predictions_clf = clf.predict(X_test)

#Print Accuracy
print('Model Accuracy:', accuracy_score(y_test, predictions_clf))

To understand this better, I'll explain what each block does:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

The first block is the import section. This is where we import all our dependencies.

NumPy is a Python library used for scientific computing.
Pandas is a library for data analysis and manipulation.
Matplotlib and Seaborn are for data visualization.
accuracy_score is a function to calculate the accuracy of our model.
train_test_split is used to split our dataset into training and test data.
LabelEncoder encodes categorical labels as numeric values.
tree is the scikit-learn module that contains decision tree models, including DecisionTreeClassifier, which we use to build our classifier.
metrics is the scikit-learn module with functions that help us evaluate our models.

#Importing our dataset
train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')

This block loads our datasets. Our train and test datasets have been downloaded from the public repository, so we load them separately.

#Combine both training and test data
df = pd.concat([train, test], axis=0)
df.head()

To work with both datasets, we need to combine them into one dataset. The concat function combines both datasets. We use df.head() to preview the first five rows of the dataset, as shown below.

Screenshot of our dataset

# Copy required features to a variable df_
df_ = train[['Gender',
'Married',
'Education',
'Self_Employed',
'Dependents',
'ApplicantIncome',
'CoapplicantIncome',
'LoanAmount',
'Loan_Amount_Term',
'Property_Area',
'Credit_History']]

### Duplicate a copy of df into X
X = df_.copy()

To start working with our features, we created a variable df_ to store all the features needed for prediction. We duplicated this into the variable X to create a copy to work with.

### label encode for Y
y = train['Loan_Status'].map({'N':0,'Y':1}).astype(int)

To work with our outcome variable, we needed to convert it from a categorical value to an integer value. This also makes it easy for our model to understand. All values of N were converted to 0, while Y was converted to 1.

### train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We use our train_test_split to split our data into training and test data. The test_size = 0.2 means we are using 20% of the data for testing and 80% for training.

# train
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

We assigned DecisionTreeClassifier() to the variable clf, which we then fit to our training data. DecisionTreeClassifier() has an optional argument named max_depth. The number assigned to max_depth sets the maximum depth the tree is allowed to grow to. By default it is None, so the tree keeps splitting until its leaves are pure or too small to split further. We'll use this argument to cause overfitting in another section below.

# predict
predictions_clf = clf.predict(X_test)

In the code snippet above, clf.predict is used to predict the data in X_test.

print('Model Accuracy:', accuracy_score(y_test, predictions_clf))

The model accuracy was printed using the accuracy_score function, which you can see in the screenshot below:

Model accuracy - almost 70%

Now that we've seen how a decision tree works and even run a machine learning model to predict if a customer will default or not, let's see how to cause and diagnose overfitting by modifying the code using the max_depth argument.

How to Diagnose Overfitting

Visualizations

Using visualizations can help us detect overfitting by providing insights into the behavior of our model.

Common visualization methods include plotting data points for the model's prediction, visualizing feature distributions, or creating plots of decision boundaries.

To visualize the overfitting for our loan application above, I tweaked the code to loop over different max_depth values ranging from 1 to 24. For each value, the training and test accuracies are calculated and stored in lists.

#Creating a list to store accuracy values
train_accuracies = []
test_accuracies = []

#Loop
for depth in range(1, 25):
  tree_model = DecisionTreeClassifier(max_depth = depth)
  tree_model.fit(X_train, y_train)

  train_predictions = tree_model.predict(X_train)
  test_predictions = tree_model.predict(X_test)

  #calculate training and test accuracy
  train_accuracy = metrics.accuracy_score(y_train, train_predictions)

  test_accuracy = metrics.accuracy_score(y_test, test_predictions)

  #Append accuracies
  train_accuracies.append(train_accuracy)
  test_accuracies.append(test_accuracy)

The difference here is that we are creating two variables – train_accuracies and test_accuracies – to store the accuracy values. Using these variables, we can use the code below to generate a plot that shows the changes between these variables as the max_depth value changes.

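#Creating our plot
depths = range(1, 25)  #the max_depth values used in the loop above
plt.figure(figsize = (10, 5))
sns.set_style("whitegrid")
#Plot against depths so the x-axis shows max_depth rather than the list index
plt.plot(depths, train_accuracies, label= "train accuracy")
plt.plot(depths, test_accuracies, label="test accuracy")
plt.legend(loc = "upper left")
plt.xticks(range(0, 26, 5))
plt.xlabel("max_depth", size = 20)
plt.ylabel("accuracy", size = 20)
plt.show()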

This is how the plot looks:

Train accuracy vs test accuracy

You'll notice that as the max_depth values on the x-axis increase, the training accuracy keeps improving until it reaches a perfect score. Meanwhile, the test accuracy decreases from 0.78 to 0.70. This is a classic example of overfitting as the model becomes too complex.

Training and validation accuracy gap

The gap between training and validation accuracy is a good way to tell whether overfitting has occurred. When a model overfits, its accuracy on the training data is much higher than its accuracy on the validation data.

As a rough guide, look out for a gap of more than 5%. A larger gap is often an indicator of overfitting. For example, our visualization above shows that when our max_depth value was at 20, our training accuracy was at 100% while our test accuracy was 70%.
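The gap check can also be done in a few lines of code. Below is a minimal sketch that uses synthetic data from scikit-learn's make_classification as a stand-in for the loan dataset (an assumption, so the numbers will differ from the plot above). The 0.05 threshold is the rough guide mentioned in the text, not a strict rule.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with about 10% noisy labels (flip_y); not the loan dataset
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# With no max_depth, the tree grows until it fits the training data almost perfectly
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = deep_tree.score(X_tr, y_tr)  # accuracy on data the model has seen
test_acc = deep_tree.score(X_te, y_te)   # accuracy on held-out data
gap = train_acc - test_acc

print(f"train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
if gap > 0.05:  # rough rule-of-thumb threshold
    print("Large gap: possible overfitting")
```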

How to Prevent Overfitting

Collect more training data

As discussed above, insufficient training data can cause overfitting as the model cannot capture the relevant patterns and intricacies represented in the data.

Machine learning models often need thousands or even millions of records for training, depending on the problem. With this, there will be enough patterns to capture. You can identify outliers or noise more easily if you've done proper cleaning on the dataset using relevant techniques.
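One way to judge whether more data is likely to help is a learning curve, which tracks training and validation accuracy as the training set grows. Here is a minimal sketch using scikit-learn's learning_curve function on synthetic data. The dataset and the max_depth=5 setting are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own features and labels
X_lc, y_lc = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=0
)

# Train on 10%, 32.5%, 55%, 77.5% and 100% of each training fold (5-fold CV)
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X_lc,
    y_lc,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Average the scores over the 5 folds for each training-set size
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(n):5d} samples: train={tr:.3f} validation={va:.3f}")
```

If the validation score is still rising at the largest training size, collecting more data is a reasonable next step.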

Use regularization techniques

Regularization techniques involve simplifying models by penalizing less influential features. These penalties are embedded in the model's loss function.

Regularization techniques for the decision tree model above include pruning, cost complexity pruning, and others.

Pruning is a technique that involves removing unnecessary branches from the decision tree, or preventing them from growing in the first place. For example, we can set a minimum number of customers on a leaf, such as 20. This prevents the tree from making decisions based on a very small group of customers.

Cost complexity pruning involves removing branches from the tree that add complexity without improving accuracy enough to justify it. This controls the trade-off between tree complexity and accuracy.
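In scikit-learn, both ideas map onto arguments of DecisionTreeClassifier: min_samples_leaf sets the minimum number of samples on a leaf, and ccp_alpha controls cost complexity pruning. The sketch below compares them against an unrestricted tree on synthetic data. The dataset and the ccp_alpha value of 0.01 are assumptions picked for illustration, and good values need to be tuned for your own data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with about 10% noisy labels; not the loan dataset
X_p, y_p = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_p, y_p, test_size=0.2, random_state=0)

models = {
    "unrestricted": DecisionTreeClassifier(random_state=0),
    # At least 20 samples must end up on every leaf
    "min_samples_leaf=20": DecisionTreeClassifier(min_samples_leaf=20, random_state=0),
    # Larger ccp_alpha prunes more aggressively (0.01 is an illustrative value)
    "ccp_alpha=0.01": DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),
}

leaf_counts = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    leaf_counts[name] = model.get_n_leaves()
    print(
        f"{name:>20}: leaves={leaf_counts[name]:3d} "
        f"train={model.score(X_tr, y_tr):.2f} test={model.score(X_te, y_te):.2f}"
    )
```

The regularized trees end up with far fewer leaves than the unrestricted one, which is what makes them less able to memorize noise.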

Ensembling

Ensembling entails combining several machine learning models to contribute their strengths and unique perspectives to make a prediction.

Ensembling leverages the wisdom of the crowd to make more accurate predictions on unseen data, which improves generalization and reduces the risk of overfitting.

Popular ensemble methods include bagging, boosting, and stacking, which have been successful in a wide range of machine-learning tasks.

Diagram showing how ensembling works

The diagram above shows how the ensembling method combines various machine learning models for making predictions. Each model is trained independently on its respective subset of data. The predictions of the individual models are then combined, for example by majority vote or by taking the mean, to make a final prediction.
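As one concrete example, a random forest is a bagging-style ensemble of decision trees: each tree is trained on a random sample of the training data, and their predictions are combined. Here is a minimal sketch comparing a single tree with a forest of 100 trees on synthetic data. The dataset and the number of trees are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with about 10% noisy labels; not the loan dataset
X_e, y_e = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_e, y_e, test_size=0.2, random_state=0)

# One unrestricted tree versus an ensemble of 100 trees
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_test_acc = single_tree.score(X_te, y_te)
forest_test_acc = forest.score(X_te, y_te)

print(f"single tree test accuracy:   {tree_test_acc:.2f}")
print(f"random forest test accuracy: {forest_test_acc:.2f}")
```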

Conclusion

Overfitting happens when a model fits training data too closely, resulting in great training performance but poor generalization. Overfitting can be problematic as it yields incorrect predictions.

This can be caused by a lack of training data, an overly complex model, or noisy data. Diagnosis involves assessing the training-validation accuracy gap, using visualizations to scrutinize model behavior, and so on.

Prevention strategies include collecting more training data, using regularization techniques, and employing ensemble methods. These approaches ensure models generalize well and make accurate predictions for informed decisions.

Thank you for reading! Please follow me on LinkedIn where I also post more data related content.

References:

Siddhardhan. "Overfitting in Machine Learning | Causes for Overfitting and its Prevention" [Video]. Retrieved from https://www.youtube.com/watch?v=gy8kXdd6K-o
Udacity. "Ensemble Learners" [Video]. Retrieved from https://www.youtube.com/watch?v=Un9zObFjBH0
White Board Machine Learning. "Overfitting in Decision Trees" [Video]. Retrieved from https://www.youtube.com/watch?v=eU4X-dL8nYo

What Is Hypothesis Testing? Types and Python Code Example

Mene-Ejegi Ogbemi — Fri, 22 Sep 2023 00:41:23 +0000

Curiosity has always been a part of human nature. Since the beginning of time, it has been one of the most important tools for birthing civilizations. Still, our curiosity grows: it tests and expands our limits. Humanity has explored land, water, and air. We've built underwater habitats where we could live for weeks. We've sent spacecraft to other planets.

These things were possible because humans asked questions and searched until they found answers. However, for us to get these answers, a proven method must be used and followed through to validate our results. Historically, philosophers assumed the earth was flat and you would fall off when you reached the edge. While philosophers like Aristotle argued that the earth was spherical based on the formation of the stars, they could not prove it at the time.

This is because they didn't have adequate resources to explore space or mathematically prove Earth's shape. It was a Greek mathematician named Eratosthenes who calculated the earth's circumference with incredible precision. He used scientific methods to show that the Earth was not flat. Since then, other methods have been used to prove the Earth's spherical shape.

When there are questions or statements that are yet to be tested and confirmed based on some scientific method, they are called hypotheses. Basically, we have two types of hypotheses: null and alternate.

A null hypothesis is one's default belief or argument about a subject matter. In the case of the earth's shape, the null hypothesis was that the earth was flat.

An alternate hypothesis is a belief or argument a person might try to establish. Aristotle and Eratosthenes argued that the earth was spherical.

Other examples of alternate hypotheses include:

The weather may have an impact on a person's mood.
More people wear suits on Mondays compared to other days of the week.
Children are more likely to be brilliant if both parents are in academia, and so on.

What is Hypothesis Testing?

Hypothesis testing is the act of testing whether a hypothesis or inference is true. When an alternate hypothesis is introduced, we test it against the null hypothesis to know which is correct. Let's use a plant experiment by a 12-year-old student to see how this works.

The hypothesis is that a plant will grow taller when given a certain type of fertilizer. The student takes two samples of the same plant, fertilizes one, and leaves the other unfertilized. He measures the plants' height every few days and records the results in a table.

After a week or two, he compares the final height of both plants to see which grew taller. If the plant given fertilizer grew taller, the hypothesis is supported. If not, the hypothesis is not supported. This simple experiment shows how to form a hypothesis, test it experimentally, and analyze the results.

In hypothesis testing, there are two types of error: Type I and Type II.

When we reject the null hypothesis in a case where it is correct, we've committed a Type I error. Type II errors occur when we fail to reject the null hypothesis when it is incorrect.

In our plant experiment above, suppose the fertilizer truly has no effect on growth. If the student still concludes that fertilizer helps, perhaps because the fertilized plant happened to grow slightly taller by chance, he has committed a Type I error.

However, if the fertilizer truly helps and the student concludes that both plants grow the same or that the one without fertilizer grows taller, he has committed a Type II error because he has failed to reject a null hypothesis that is false.
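A small simulation can make Type I errors more tangible. In the sketch below, both groups of plants are drawn from the same distribution, so the null hypothesis is true by construction, and we count how often a t-test still rejects it at a 0.05 threshold. The plant heights (mean 30, standard deviation 5) and group size of 20 are made-up values for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05          # threshold for rejecting the null hypothesis
n_experiments = 2000  # number of simulated experiments
false_rejections = 0

for _ in range(n_experiments):
    # Hypothetical heights: both groups come from the SAME distribution,
    # so fertilizer has no real effect and the null hypothesis is true
    control = rng.normal(loc=30.0, scale=5.0, size=20)
    fertilized = rng.normal(loc=30.0, scale=5.0, size=20)

    _, p = stats.ttest_ind(fertilized, control)
    if p < alpha:
        false_rejections += 1  # rejecting a true null is a Type I error

type_i_rate = false_rejections / n_experiments
print(f"Type I error rate: {type_i_rate:.3f}")
```

The rate comes out close to the chosen threshold, which is the sense in which the threshold controls how often we are willing to make a Type I error.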

What are the Steps in Hypothesis Testing?

The following steps explain how we can test a hypothesis:

Step #1 - Define the Null and Alternative Hypotheses

Before making any test, we must first define what we are testing and what the default assumption is about the subject. In this article, we'll be testing if the average weight of 10-year-old children is more than 32kg.

Our null hypothesis is that 10-year-old children weigh 32 kg on average. Our alternate hypothesis is that the average weight is more than 32 kg. H0 denotes the null hypothesis, while H1 denotes the alternate hypothesis. Writing μ for the average weight:

H0: μ = 32

H1: μ > 32

Step #2 - Choose a Significance Level

The significance level is the threshold we use to decide whether a test result counts as statistically significant. It gives credibility to our hypothesis test by ensuring we are not just relying on luck but have enough evidence to support our claims. We usually set our significance level before conducting our tests. The value we compare against the significance level is known as the p-value.

A lower p-value means that there is stronger evidence against the null hypothesis. A significance level of 0.05 is widely used in most fields of science. The p-value is not the probability that the null hypothesis is true. It is the probability of seeing a result at least as extreme as ours if the null hypothesis were true, so it helps us judge whether our result could plausibly be due to chance. For our test, our significance level will be 0.05.

Step #3 - Collect Data and Calculate a Test Statistic

You can obtain your data from online data stores or conduct your research directly. Data can be scraped or researched online. The methodology might depend on the research you are trying to conduct.

We can calculate our test statistic using any of the appropriate hypothesis tests. This can be a T-test, Z-test, Chi-squared test, and so on. There are several hypothesis tests, each suiting different purposes and research questions. In this article, we'll use the T-test to test our hypothesis, but I'll explain the Z-test and Chi-squared test too.

The T-test is used to compare means, either the means of two sets of data or the mean of one set against a fixed value, when we don't know the population standard deviation. It's a parametric test, meaning it makes assumptions about the distribution of the data. These assumptions include that the data is normally distributed and, in the standard two-sample version, that the variances of the two groups are equal. In a more simple and practical sense, imagine that we have test scores in a class for males and females, but we don't know how spread out these scores are in the wider population. We can use a t-test to see if there's a real difference between the two averages.

The Z-test is used to compare means when the population standard deviation is known. It is also a parametric test and assumes that the data is normally distributed, but because the standard deviations are known in advance, it does not need the variances of the two groups to be equal. In our class test example, if we already know how spread out the scores are in both groups, we can use the z-test to see if there's a difference in the average scores.

The Chi-squared test is used to test for a relationship between two or more categorical variables. It is a non-parametric test, meaning it does not assume that the data follows a particular distribution such as the normal distribution. It can be used to test a variety of hypotheses, including whether two or more groups have equal proportions.
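To give a feel for how two of these tests look in code, here is a minimal sketch using SciPy. The class scores and the pass/fail counts are made-up numbers for illustration. The two-sample t-test uses stats.ttest_ind, and the chi-squared test uses stats.chi2_contingency on a table of counts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical test scores for two groups of 40 students each
scores_a = rng.normal(loc=70, scale=10, size=40)
scores_b = rng.normal(loc=75, scale=10, size=40)

# Two-sample t-test: is there a difference between the two group means?
t_stat, t_p = stats.ttest_ind(scores_a, scores_b)
print(f"t-test: statistic={t_stat:.3f}, p-value={t_p:.4f}")

# Hypothetical counts: rows are two groups, columns are pass / fail
table = np.array([[30, 10],
                  [20, 20]])

# Chi-squared test of independence: do both groups have the same pass rate?
chi2, chi_p, dof, expected = stats.chi2_contingency(table)
print(f"chi-squared: statistic={chi2:.3f}, p-value={chi_p:.4f}, dof={dof}")
print("expected counts if pass rate were equal:")
print(expected)
```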

Step #4 - Decide on the Null Hypothesis Based on the Test Statistic and Significance Level

After conducting our test and calculating the test statistic and its p-value, we can compare the p-value to the predetermined significance level. If the p-value is less than the significance level, we reject the null hypothesis, indicating that there is sufficient evidence to support our alternative hypothesis.

On the other hand, if the p-value is greater than or equal to the significance level, we fail to reject the null hypothesis, signifying that we do not have enough statistical evidence to conclude in favor of the alternative hypothesis.

Step #5 - Interpret the Results

Depending on the decision made in the previous step, we can interpret the result in the context of our study and the practical implications. For our case study, we can interpret whether we have significant evidence to support our claim that the average weight of 10 year old children is more than 32kg or not.

For our test, we are generating random dummy data for the weight of the children. We'll use a t-test to evaluate whether our hypothesis is correct or not.

import numpy as np
import scipy.stats as stats

# Create a dummy dataset of 10 year old children's weight
data = np.random.randint(20, 40, 100)

# Define the null hypothesis
H0 = "The average weight of 10 year old children is 32kg."

# Define the alternative hypothesis
H1 = "The average weight of 10 year old children is more than 32kg."

# Calculate the test statistic
t_stat, p_value = stats.ttest_1samp(data, 32)

# Print the results
print("Test statistic:", t_stat)
print("p-value:", p_value)

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

For a better understanding, let's look at what each block of code does.

import numpy as np
import scipy.stats as stats

The first block is the import statement, where we import numpy and scipy.stats. NumPy is a Python library used for scientific computing. It has a large library of functions for working with arrays. SciPy is a library for mathematical and scientific functions. It has a stats module for performing statistical functions, and that's what we'll be using for our t-test.

# Create a dummy dataset of 10 year old children's weight
data = np.random.randint(20, 40, 100)

The weights of the children were generated at random since we aren't working with an actual dataset. The random module within the Numpy library provides a function for generating random numbers, which is randint.

The randint function takes three arguments. The first (20) is the lower bound of the random numbers to be generated. The second (40) is the upper bound, which is exclusive, so the largest value generated is 39. The third (100) specifies the number of random integers to generate. That is, we are generating random weight values for 100 children. In real circumstances, these weight samples would have been obtained by taking the weight of the required number of children needed for the test.

# Define the null hypothesis
H0 = "The average weight of 10 year old children is 32kg."

# Define the alternative hypothesis
H1 = "The average weight of 10 year old children is more than 32kg."

Using the code above, we declared our null and alternate hypotheses stating the average weight of a 10-year-old in both cases.

# Calculate the test statistic
t_stat, p_value = stats.ttest_1samp(data, 32)

t_stat and p_value are the variables in which we'll store the results of our function. stats.ttest_1samp is the function that calculates our test. It takes two arguments: the first is the data variable that stores the array of weights (or a real-world dataset, if we were using one), and the second (32) is the value against which we'll test the mean of those weights.


# Print the results
print("Test statistic:", t_stat)
print("p-value:", p_value)

The code above prints the values of t_stat and p_value.

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

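Lastly, we evaluated our p_value against our significance level, which is 0.05. If our p_value is less than 0.05, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis. Below is the output of this program. Our null hypothesis was rejected. Note that ttest_1samp runs a two-sided test by default, and the negative test statistic means the sample mean is below 32 kg. So this result is evidence that the average weight differs from 32 kg, not that it is more than 32 kg.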

Test statistic: -5.114430435590074
p-value: 1.541000376540265e-06
Reject the null hypothesis.
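Since our alternate hypothesis is one-sided (the average weight is more than 32 kg), a one-sided test matches it more closely. Here is a minimal sketch, assuming SciPy 1.6 or later, where ttest_1samp accepts an alternative argument. The data is again random dummy data, generated here with a fixed seed so the result is repeatable.

```python
import numpy as np
from scipy import stats

# Dummy weights between 20 and 39 kg, with a fixed seed for repeatability
rng = np.random.default_rng(42)
weights = rng.integers(20, 40, size=100)

# One-sided test: H0 is mean = 32, H1 is mean > 32
t_stat, p_greater = stats.ttest_1samp(weights, 32, alternative="greater")

print("Sample mean:", weights.mean())
print("Test statistic:", t_stat)
print("One-sided p-value:", p_greater)

if p_greater < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```

Because the dummy weights average around 29.5 kg, the one-sided test gives no evidence that the mean is above 32 kg, even though the two-sided test finds that the mean differs from 32 kg.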

Conclusion

In this article, we discussed the importance of hypothesis testing. We highlighted how science has advanced human knowledge and civilization through formulating and testing hypotheses.

We discussed Type I and Type II errors in hypothesis testing and how they underscore the importance of careful consideration and analysis in scientific inquiry. It reinforces the idea that conclusions should be drawn based on thorough statistical analysis rather than assumptions or biases.

We also generated a sample dataset using the relevant Python libraries and used the needed functions to calculate and test our alternate hypothesis.

Thank you for reading! Please follow me on LinkedIn where I also post more data related content.

Data Visualization with Matplotlib – a Step by Step Guide

Mene-Ejegi Ogbemi — Mon, 24 Apr 2023 18:32:32 +0000

SEE is a beautiful Apple TV series that depicts a dystopia where humans have lost their sight. Hundreds of years later, it was considered a myth that people could ever see.

Jason Momoa is one of the leads and plays Baba Voss, an elite warrior. Baba Voss's wife gives birth to sighted twins, and years later, during battle, Baba Voss sometimes needs the aid of the sighted children. They help him understand the terrain better, even with his battlefield mastery. We could say his children helped him visualize things.

In ancient times, before digital devices, data visualization as we know it today did not exist. Still, early humans understood the need for visualization, so they had resources like maps, hieroglyphs, rock art, and so on. Eyewitnesses typically drew their paths and other relevant information on stones, wood, or scrolls.

Like Baba Voss's kids, these resources make it easier for humans to have a visual perspective on things or environments.

So what does visualization actually mean in this context? We can define visualization as "any technique for creating images, diagrams, or animations to communicate a message." (source)

In this article, we'll explore what data visualization is and how you can use the data visualization tool Matplotlib to explore and analyze data. You'll learn how to use it to create charts that help business owners and stakeholders get more insight about data and make informed decisions.

What is Data Visualization?

Data visualization refers to the integration of data and visual elements like images, charts, diagrams, and so on to communicate messages to different stakeholders.

These stakeholders can be users, team members, managers, or top executive members of an organization.

Data in this context refers to input gathered from the organization's database or obtained from external sources, like public databases or private organizations that provide access through their APIs.

We'll work with an employee layoff dataset which contains details of employees that have been laid off in different industries from 2020 to 2022. The columns in the dataset include the names of companies, locations, industries, total laid off, percentage laid off, date, countries, and other relevant columns.

Below is a snapshot of the data frame:

What is Matplotlib?

Matplotlib is a popular Python library for displaying data and creating static, animated, and interactive plots. It lets you draw appealing and informative graphics like line plots, scatter plots, histograms, and bar charts.

Matplotlib is highly customizable and flexible, which makes it a preferred choice for data analysts and scientists working in fields such as finance, science, engineering, and social sciences.

In this article, I'll show you how to create a bar chart, a pie chart, and a line plot to explain how you can do data visualization using Matplotlib.

The first thing you need to do is import Matplotlib and other relevant libraries like Pandas and NumPy, along with their submodules.

#Imports packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.dates as mdates
from matplotlib.ticker import MaxNLocator

In the code above, we import the Pandas package, which we'll use to analyze and manipulate our data. We also import Matplotlib's Pyplot module, which we'll use for data visualization.

We'll use the NumPy package imported in the third line for numerical computations. We'll also work with Matplotlib's dates module to format dates when plotting our chart. The last import is MaxNLocator from the ticker module, which controls how many ticks appear on a plot axis. With these modules, you can analyze, manipulate, compute, and visualize your data.
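The snippets in the rest of this article assume that a DataFrame named df_layoffs has already been loaded. The article does not show that step, so the following is only a minimal sketch with made-up rows. The column names company, industry, total_laid_off, and date are taken from the code used later, and the values are purely illustrative, not the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only; these are not figures from the real dataset.
df_layoffs = pd.DataFrame(
    {
        "company": ["Alpha", "Beta", "Alpha", "Gamma"],
        "industry": ["Retail", "Finance", "Retail", "Food"],
        "total_laid_off": [120, 80, 30, 50],
        "date": ["2022-01-15", "2022-03-01", "2020-06-10", "2022-03-01"],
    }
)

# Quick checks before plotting: column types and the first few rows.
print(df_layoffs.dtypes)
print(df_layoffs.head())
```

With the real data you would typically load a file instead, for example with pd.read_csv. The file name depends on where you saved the dataset.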

How to Create a Bar Chart

Bar charts are useful for comparing categories. If you want to compare different entities by quantity, a bar chart is an excellent way to visualize it. In the layoff dataset, we'll compare companies by the number of staff they laid off.

# Sum layoffs per company and keep the ten largest totals
plt.figure(figsize=(8, 6))
company_val = df_layoffs.groupby('company')['total_laid_off'].sum().sort_values(ascending=False).head(10)
company_val.plot(label="", kind='bar')
plt.show()

The code above is one way to create a bar chart. It shows the top 10 companies with the highest number of layoffs.

We first set the size of the figure to 8 inches by 6 inches. Then, we group the rows of the DataFrame by company and sum the number of employees laid off by each one. We then sort the totals in descending order and select the 10 companies with the most layoffs. Finally, we create our bar chart from the selected data. The last line (plt.show()) displays the graph, which is shown below.

From the chart above, you will notice that Meta and Amazon had the highest number of laid-off staff, while Twitter had the fewest layoffs among the ten companies shown.
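The bar chart snippet above does not set a title or a y-axis label. As a sketch of one way to add them, the version below draws on an Axes object created with plt.subplots. The data is a small made-up Series standing in for the grouped result, and the label text is an assumption rather than something taken from the original chart.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up totals standing in for the groupby(...).sum() result; not real figures.
company_totals = pd.Series(
    {"Alpha": 150, "Beta": 80, "Gamma": 50}, name="total_laid_off"
).sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(8, 6))
company_totals.plot(kind="bar", ax=ax)  # draw the bars on the Axes we created

# Label the chart so it can be read without the surrounding text.
ax.set_title("Companies with the Most Layoffs")
ax.set_xlabel("Company")
ax.set_ylabel("Total Laid Off")
fig.tight_layout()
```

Call plt.show() afterwards, as in the earlier snippet, to display the figure.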

How to Create a Pie Chart

A pie chart represents a whole, with each slice sized according to the share that one category contributes. The industry column is a good fit for a pie chart. We'll see which industries had the most and fewest layoffs.

# Group the data by industry and sum the total laid off employees
industry_val = df_layoffs.groupby('industry')['total_laid_off'].sum().sort_values(ascending=False).head()

# create the pie chart and display the labels and values inside the pie
plt.figure(figsize=(8, 6))
plt.pie(industry_val, labels=industry_val.index, autopct='%1.1f%%')
plt.title('Laid Off Employees by Industry')
plt.show()

First, the code groups the data by industry and sums up the total number of laid-off employees for each industry. It then sorts the industries in descending order based on the total number of laid-off employees and selects the top five using the head() function, which returns five rows by default.

Next, we create a pie chart to visualize the data. The size of each slice in the pie represents the proportion of laid-off employees in that industry. The pie chart labels show the names of the industries. The percentage values inside the slices show the proportion of laid-off employees in that industry. The chart is titled "Laid Off Employees by Industry."

Finally, the pie chart is displayed using the plt.show() function. As with the bar chart, the plt.figure(figsize=(8, 6)) call sets the chart size to 8 inches wide and 6 inches tall.

The chart above shows the proportion of layoffs across the selected industries. The transportation and consumer sectors were the most affected, followed by the retail, finance, and food industries.
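One detail worth noting: the percentages produced by autopct are computed relative to the values passed to plt.pie. After head(), the slices therefore describe shares of the selected industries only, not of all layoffs in the dataset. The arithmetic can be sketched as follows, using made-up totals.

```python
import pandas as pd

# Made-up totals per industry for illustration; not figures from the real dataset.
industry_totals = pd.Series(
    {"Retail": 400, "Finance": 300, "Food": 200, "Travel": 60, "Media": 40}
)

top = industry_totals.sort_values(ascending=False).head(3)

# Share of each slice as plt.pie would draw it: value / sum of the plotted values.
share_of_plotted = (top / top.sum() * 100).round(1)

# Share of the same industries relative to every industry in the data.
share_of_all = (top / industry_totals.sum() * 100).round(1)

print(share_of_plotted)
print(share_of_all)
```

The two results differ whenever head() drops some industries, so it helps to say in the chart title or caption that only the top industries are shown.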

How to Create a Line Chart

Line charts show changes over time for an entity. With our dataset, a line chart could be used to show the trend of layoffs over the past year or two. This depends on what you are trying to communicate, but we'll work with a one year analysis.

# convert date column to datetime object
df_layoffs['date'] = pd.to_datetime(df_layoffs['date'])

# select data for one-year duration starting from January 1st, 2022
start_date = pd.Timestamp('2022-01-01')
end_date = start_date + pd.DateOffset(years=1)
df_one_year = df_layoffs.loc[(df_layoffs['date'] >= start_date) & (df_layoffs['date'] < end_date)]

# plot the selected data
df_date = df_one_year.groupby('date')['total_laid_off'].sum()
plt.figure(figsize=(10, 4))
plt.plot(df_date.index, df_date.values)
plt.xlabel('Date')
plt.ylabel('Total Laid Off')
plt.title('Laid Off Trend for 2022')
plt.xticks(rotation=45)
# set the format of the x-axis labels to show Month-Year
date_fmt = mdates.DateFormatter('%b-%Y')
plt.gca().xaxis.set_major_formatter(date_fmt)

# Use MaxNLocator to reduce the number of xticks
locator = MaxNLocator(nbins=10)
plt.gca().xaxis.set_major_locator(locator)

plt.show()

Compared to the bar chart and pie chart, this code is more involved, so let's walk through it step by step.

The first line of the code converts the 'date' column of the DataFrame (df_layoffs) to pandas datetime values so that the dates can be compared and filtered easily.

# convert date column to datetime object
df_layoffs['date'] = pd.to_datetime(df_layoffs['date'])

Next, we select the data for a one-year duration starting on January 1st, 2022. The start date is defined as a Timestamp object, and the end date is set to one year after the start date using pd.DateOffset. The loc indexer is then used to filter the DataFrame rows, keeping only those dated on or after the start date and before the end date.

# select data for one-year duration starting from January 1st, 2022
start_date = pd.Timestamp('2022-01-01')
end_date = start_date + pd.DateOffset(years=1)
df_one_year = df_layoffs.loc[(df_layoffs['date'] >= start_date) & (df_layoffs['date'] < end_date)]
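The filter above is a half-open interval: the start date is included and the end date is excluded. A small sketch with made-up dates shows which rows are kept.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-12-31", "2022-01-01", "2022-12-31", "2023-01-01"]
        ),
        "total_laid_off": [10, 20, 30, 40],
    }
)

start_date = pd.Timestamp("2022-01-01")
end_date = start_date + pd.DateOffset(years=1)  # one year later: 2023-01-01

# >= keeps the first day of 2022; < drops the first day of 2023.
mask = (df["date"] >= start_date) & (df["date"] < end_date)
df_one_year = df.loc[mask]
print(df_one_year)
```

Only the two rows dated within 2022 remain, which is why the strict less-than comparison on end_date matters.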

After that, we group the selected data by date and calculate the total number of layoffs on each date using the groupby and sum functions. The result is a pandas Series indexed by date, which we store in df_date.

# plot the selected data
df_date = df_one_year.groupby('date')['total_laid_off'].sum()
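Several companies can report layoffs on the same day, which is why the grouping step matters. The sketch below, with made-up rows, shows that rows sharing a date collapse into one total and that the result of this groupby is a Series indexed by date.

```python
import pandas as pd

# Made-up rows: two companies report layoffs on the same day.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-03-01", "2022-03-01", "2022-04-15"]),
        "total_laid_off": [80, 50, 30],
    }
)

daily_totals = df.groupby("date")["total_laid_off"].sum()

print(type(daily_totals).__name__)  # Series
print(daily_totals)
```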

Then, we create a plot of the layoff trend for 2022 using Matplotlib. The figure size is set to 10 inches wide by 4 inches tall using the figure function.

plt.figure(figsize=(10, 4))

The plot function draws the line, with the dates on the x-axis and the total number of layoffs on the y-axis. The xlabel function labels the x-axis as 'Date,' and the ylabel function labels the y-axis as 'Total Laid Off.'

plt.plot(df_date.index, df_date.values)
plt.xlabel('Date')
plt.ylabel('Total Laid Off')

The plot title is set to 'Laid Off Trend for 2022' using the title function.

plt.title('Laid Off Trend for 2022')

The x-axis labels are rotated by 45 degrees using the xticks function to avoid overcrowding.

plt.xticks(rotation=45)

The format of the x-axis labels is set to Month-Year (for example, Jan-2022) using the DateFormatter class.

# set the format of the x-axis labels to show Month-Year
date_fmt = mdates.DateFormatter('%b-%Y')
plt.gca().xaxis.set_major_formatter(date_fmt)
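To see what the '%b-%Y' pattern produces, a formatter can be called directly on one of Matplotlib's internal date numbers. This is only a quick check, and the sample date is arbitrary.

```python
from datetime import datetime

import matplotlib.dates as mdates

date_fmt = mdates.DateFormatter("%b-%Y")

# Matplotlib stores dates on an axis as floating-point day counts.
sample = mdates.date2num(datetime(2022, 1, 15))

label = date_fmt(sample)
print(label)  # Jan-2022 (the month abbreviation depends on the locale)
```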

Finally, the number of x-axis ticks is limited using MaxNLocator. With nbins=10, the axis is divided into at most 10 intervals, which keeps the labels from crowding each other.

# Use MaxNLocator to reduce the number of xticks
locator = MaxNLocator(nbins=10)
plt.gca().xaxis.set_major_locator(locator)
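Strictly speaking, nbins is the maximum number of intervals, so the locator may place up to nbins + 1 ticks. The sketch below asks a locator for tick positions across 2022 without drawing anything. The date range is chosen to match the example above.

```python
from datetime import datetime

import matplotlib.dates as mdates
from matplotlib.ticker import MaxNLocator

locator = MaxNLocator(nbins=10)

# Axis limits expressed as Matplotlib date numbers (days since its epoch).
vmin = mdates.date2num(datetime(2022, 1, 1))
vmax = mdates.date2num(datetime(2023, 1, 1))

ticks = locator.tick_values(vmin, vmax)
visible = [t for t in ticks if vmin <= t <= vmax]

# Ticks fall on round day counts, not necessarily on the first of a month.
for t in visible:
    print(mdates.num2date(t).strftime("%d-%b-%Y"))
```

If month-aligned ticks are preferred, mdates.MonthLocator is an alternative worth looking at.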

The plot is then displayed using the show function.

plt.show()

The chart above shows layoff trends and patterns for 2022.

You can also compare how an entity performed over different periods of time. The next example compares employee layoffs in 2020 and 2022.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.ticker import MaxNLocator

# convert date column to datetime object
df_layoffs['date'] = pd.to_datetime(df_layoffs['date'])

# filter data to only include 2020 and 2022 (copy so a column can be added safely)
df_filtered = df_layoffs[(df_layoffs['date'].dt.year == 2020) | (df_layoffs['date'].dt.year == 2022)].copy()

# group data by year and calculate total layoffs
df_filtered['year'] = df_filtered['date'].dt.year
df_yearly = df_filtered.groupby(['year', 'date'])['total_laid_off'].sum().reset_index()

# create subplots and plot the data for each year in separate charts
fig, axs = plt.subplots(ncols=2, figsize=(14, 8))
for i, year in enumerate(df_yearly['year'].unique()):
    df_year = df_yearly.loc[df_yearly['year'] == year]
    axs[i].plot(df_year['date'], df_year['total_laid_off'])
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Total Laid Off')
    axs[i].set_title(f'Laid Off Trend for {year}')
    axs[i].xaxis.set_major_formatter(mdates.DateFormatter('%b-%Y'))
    axs[i].tick_params(axis='x', rotation=45)
    locator = MaxNLocator(nbins=10)
    axs[i].xaxis.set_major_locator(locator)

# set y-axis limit to 0-14000 for each subplot
for ax in axs:
    ax.set_ylim([0, 14000])

plt.show()

Let's review the different components of the code above.

The 'date' column in the DataFrame is converted to a datetime object.

# convert date column to datetime object
df_layoffs['date'] = pd.to_datetime(df_layoffs['date'])

Next, the code filters the data to only include layoffs from the years 2020 and 2022. It then groups the filtered data by year and date and calculates the total number of layoffs for each date.

# filter data to only include 2020 and 2022 (copy so a column can be added safely)
df_filtered = df_layoffs[(df_layoffs['date'].dt.year == 2020) | (df_layoffs['date'].dt.year == 2022)].copy()

# group data by year and calculate total layoffs
df_filtered['year'] = df_filtered['date'].dt.year
df_yearly = df_filtered.groupby(['year', 'date'])['total_laid_off'].sum().reset_index()
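When a new column is added to a filtered DataFrame, some pandas versions emit a SettingWithCopyWarning unless the filtered result is an explicit copy. The sketch below, using made-up rows, shows the pattern with .copy(), and uses isin as a shorter way to express the two-year condition.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-05-01", "2021-07-01", "2022-02-01"]),
        "total_laid_off": [100, 200, 300],
    }
)

# Keep only 2020 and 2022, and copy so the new column is added to an
# independent DataFrame rather than to a slice of df.
df_filtered = df[df["date"].dt.year.isin([2020, 2022])].copy()
df_filtered["year"] = df_filtered["date"].dt.year

print(df_filtered)
```

The original DataFrame is left untouched, so it can still be reused for other charts.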

We then create two subplots and plot the total number of layoffs for each year in separate charts. We set the x-axis labels to the date format of 'MMM-YYYY' (for example, Jan-2022) and rotate them by 45 degrees. We also set the y-axis label to 'Total Laid Off' and the chart title to 'Laid Off Trend for {year}' (for example, Laid Off Trend for 2020). In the full listing, both subplots also get a y-axis range of 0 to 14000 so the two years can be compared on the same scale. Finally, we show the charts using the plt.show() command.

# create subplots and plot the data for each year in separate charts
fig, axs = plt.subplots(ncols=2, figsize=(14, 8))
for i, year in enumerate(df_yearly['year'].unique()):
    df_year = df_yearly.loc[df_yearly['year'] == year]
    axs[i].plot(df_year['date'], df_year['total_laid_off'])
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Total Laid Off')
    axs[i].set_title(f'Laid Off Trend for {year}')
    axs[i].xaxis.set_major_formatter(mdates.DateFormatter('%b-%Y'))
    axs[i].tick_params(axis='x', rotation=45)
    locator = MaxNLocator(nbins=10)
    axs[i].xaxis.set_major_locator(locator)

plt.show()
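The full listing earlier also fixes both subplots to a y-axis range of 0 to 14000. An alternative sketch, assuming you would rather not hard-code the limit, is to let the subplots share a y-axis with sharey=True. The numbers below are made up.

```python
import matplotlib.pyplot as plt

# Made-up yearly series for illustration only.
data = {
    2020: [100, 400, 250],
    2022: [900, 1500, 1200],
}

# sharey=True makes both Axes use the same y-axis limits.
fig, axs = plt.subplots(ncols=2, figsize=(14, 8), sharey=True)
for ax, (year, values) in zip(axs, data.items()):
    ax.plot(range(len(values)), values)
    ax.set_title(f"Laid Off Trend for {year}")
    ax.set_xlabel("Observation")
axs[0].set_ylabel("Total Laid Off")
```

A fixed limit keeps the scale identical across separate figures, while a shared axis adapts to whatever data is plotted. Which one fits better depends on whether the charts need to stay comparable over time.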

Overall, the code filters, groups, and visualizes data on company layoffs, focusing specifically on trends for 2020 and 2022. You can see the result in the chart below:

Conclusion

We started by discussing what visualization is and how data visualization is significant in transforming raw numbers into insight and business sense.

Then we used the popular Python library Matplotlib, which is a tool for data visualization, to create bar charts, pie charts, and line charts. There are also other use cases not covered in this article, like histograms, scatter plots, box plots, and so on.

By using these visualizations, we can make sense of our data and take actions that wouldn't be possible by looking at raw numbers. Data visualization can help us achieve better outcomes in other areas such as finance, science, engineering, etc. For further study, you can check the official matplotlib documentation here.

Thank you for reading! Please follow me on LinkedIn where I also post more data related content.