Matplotlib - freeCodeCamp.org

How to Get Started with Matplotlib – With Code Examples and Visualizations

Oyedele Tioluwani — Mon, 07 Oct 2024 23:15:31 +0000

Data visualization is a key step in data analysis because it helps you notice features, trends, and patterns that may not be obvious in raw data. Matplotlib is one of the most widely used plotting libraries for Python, and it lets you create static, animated, and interactive graphics.

This guide explores Matplotlib's capabilities, focusing on solving specific data visualization problems and offering practical examples to apply to your projects.

Here’s what we are going to cover in this article:

Importance of Data Visualization in Data Analysis
Brief Overview of Matplotlib
Getting Started with Matplotlib
Advanced Plot Customizations
Interactive Plotting and Animation
- Interactive Features in Matplotlib
- How to Create Animations
How to Optimize Plots for Large Datasets
- Efficient Plotting Techniques for Large Datasets
- Statistical Data Visualization
Common Visualization Pitfalls and How to Avoid Them
Conclusion

Importance of Data Visualization in Data Analysis

Suppose you are working with the sales data of a large chain of stores. The raw data may contain hundreds or thousands of rows, with columns such as product category, sales region, and monthly revenue. In that form, the data is complex and hard for anyone to interpret.

However, by visualizing the data, you can get a broad view of what is happening, such as which product category is succeeding or which region is lagging.

Data visualization is the process of putting data into forms that are easier to understand and analyze for decision-making. Matplotlib is particularly effective here for data scientists and analysts because of the wide range of plot types and customizations it offers.

Brief Overview of Matplotlib

Matplotlib, now one of the most popular plotting libraries in the Python ecosystem, was started by John Hunter in 2003. With it, you can create static, animated, and interactive plots, which makes it an indispensable tool for scientists, engineers, and data analysts.

Some common problems that Matplotlib can help solve include:

Visualizing large datasets to identify patterns and outliers.
Designing complex, publication-quality graphics for academic articles.
Combining data gathered from different sources into interactive and informative illustrations.
Customizing plots so that the information they portray is clear.

Getting Started with Matplotlib

Installation and Setup

Before we dive into creating plots, let's get Matplotlib installed and set up. You can install Matplotlib using pip or conda:

pip install matplotlib

Alternatively, if you're using Anaconda:

conda install matplotlib

To verify the installation:

import matplotlib
print(matplotlib.__version__)

How to Create Your First Plot

Let’s start with a common problem: suppose you have a dataset that records the daily temperature for a given month, and you want to study how the temperature varies.

Here’s how you can create a simple line plot to visualize this trend:

import matplotlib.pyplot as plt
import numpy as np

# Simulating daily temperature data for the 31 days of August
days = np.arange(1, 32)
temperature = np.random.normal(loc=25, scale=5, size=len(days))

plt.plot(days, temperature, marker='o')
plt.title('Daily Temperatures in August')
plt.xlabel('Day')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.show()

We used np.arange to construct the sequence of days.
np.random.normal simulates temperature data with a mean (loc) of 25 degrees Celsius and a standard deviation (scale) of 5 degrees Celsius.
plt.plot creates a line plot with markers for each day.
Titles and labels were added to make the plot informative.

Exploring Different Types of Plots

Matplotlib supports various plot types, each suited to specific data visualization problems.

Line Plots

Line plots are ideal for visualizing trends over time or continuous data. For example, tracking the monthly sales of a product:

months = np.arange(1,13)
sales = np.random.randint(2000, 4000, size=len(months))
plt.plot(months, sales, color='red', linestyle='--', marker='o')
plt.title("Monthly Sales of Product ")
plt.xlabel("Month")
plt.ylabel("Sales (Units)")
plt.grid(True)
plt.show()

Scatter Plots

Scatter plots show the relationship between two variables by plotting each observation as a point. For instance, visualizing the relationship between advertisement spending and sales:

ad_spend = np.random.randint(50, 1000, size=50)
sales = ad_spend * np.random.uniform(0.8, 1.2, size=50)

plt.scatter(ad_spend, sales, color='blue')
plt.title("Advertisement Spending vs. Sales")
plt.xlabel("Ad Spend (USD)")
plt.ylabel("Sales (Units)")
plt.show()

Bar Charts

Bar charts are effective for comparing categorical data. For example, visualizing the total revenue generated by several product groupings:

groupings = ['Musical Instruments', 'Furniture', 'Clothing', 'Food']
revenue = [50000, 30000, 20000, 40000]

plt.bar(groupings, revenue, color='green')
plt.title("Revenue by Product Grouping")
plt.xlabel("Group")
plt.ylabel("Revenue (EURO)")
plt.show()

Histograms

Histograms show the distribution of numerical data by counting how many values fall into each bin. For example, visualizing the distribution of customer ages in a survey:

ages = np.random.randint(18, 65, size=2000)

plt.hist(ages, bins=10, color='purple', edgecolor='black')
plt.title("Age Distribution of Survey Participants")
plt.xlabel("Age")
plt.ylabel("Number of Participants")
plt.show()

Pie Charts

Pie charts are used to display the percentages of data in graphical format. For example, visualizing the market share of different companies:

companies = ['Company W', 'Company X', 'Company Y', 'Company Z']
market_share = [40, 30, 20, 10]

plt.pie(market_share, labels=companies, autopct='%1.1f%%', colors=['blue', 'orange', 'green', 'red'])
plt.title("Market Share by Company")
plt.show()

Advanced Plot Customizations

How to Work with Multiple Plots

In some situations, you’ll need to compare multiple datasets in a single figure, for example, sales trends across different regions. You can do this with subplots:

regions = ['North', 'South', 'East', 'West']
sales_data = np.random.randint(500, 5000, size=(4, 12))

fig, axs = plt.subplots(2, 2, figsize=(10, 8))
fig.suptitle('Monthly Sales by Region')

for i, region in enumerate(regions):
    ax = axs[i // 2, i % 2]
    ax.plot(months, sales_data[i], marker='o')
    ax.set_title(region)
    ax.set_xlabel("Month")
    ax.set_ylabel("Sales (Units)")

plt.tight_layout()
plt.show()
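
The index arithmetic in the loop above (axs[i // 2, i % 2]) can also be avoided by flattening the grid of axes. The following is a minimal sketch of that idea, not a replacement for the example: it assumes the same four regions, uses a seeded random generator, and adds sharey=True (an assumption not present in the original) so that all four panels use the same y-axis range.

```python
import matplotlib.pyplot as plt
import numpy as np

regions = ['North', 'South', 'East', 'West']
months = np.arange(1, 13)
rng = np.random.default_rng(0)  # seeded so the figure is reproducible
sales_data = rng.integers(500, 5000, size=(4, 12))

# sharey=True gives all four panels the same y-axis range
fig, axs = plt.subplots(2, 2, figsize=(10, 8), sharey=True)
fig.suptitle('Monthly Sales by Region')

# axs.flat walks the 2x2 grid row by row, so no index arithmetic is needed
for ax, region, sales in zip(axs.flat, regions, sales_data):
    ax.plot(months, sales, marker='o')
    ax.set_title(region)
    ax.set_xlabel("Month")
    ax.set_ylabel("Sales (Units)")

fig.tight_layout()
# plt.show()  # uncomment to display the figure
```

A shared y-axis makes the regions directly comparable, at the cost of flattening panels whose values are much smaller than the rest.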

How to Enhance Plot Aesthetics

Matplotlib lets you control nearly every aspect of a plot's appearance, so you can make it both informative and aesthetically pleasing.

Here’s an example:

plt.plot(days, temperature, color='orange', marker='x', linestyle='-')
plt.title("Daily Temperatures in August", fontsize=16)
plt.xlabel("Day", fontsize=12)
plt.ylabel("Temperature (°C)", fontsize=12)
plt.grid(True)
plt.legend(['Temperature'], loc='upper right')
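# Locate the coldest day in the data instead of hard-coding its position
coldest = np.argmin(temperature)
plt.annotate('Coldest Day', xy=(days[coldest], temperature[coldest]),
             xytext=(days[coldest] + 2, temperature[coldest] - 2),
             arrowprops=dict(facecolor='black', arrowstyle='->'))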
plt.show()

The code sets the line color, marker, and line style, adds a title and axis labels with explicit font sizes, turns on the grid, adds a legend, and annotates the coldest day with an arrow. These improvements make the plot cleaner and more informative, so its message comes across clearly and professionally.

How to Save and Export Plots

Once you've created a plot, you might need to save it in a specific format for a report or presentation. Here is an example of how to save a plot:

plt.plot(days, temperature)
plt.title("Daily Temperatures in August")
plt.xlabel("Day")
plt.ylabel("Temperature (°C)")

# Saving the plot
plt.savefig("daily_temperatures_august.png", dpi=300, bbox_inches='tight')
plt.savefig("daily_temperatures_august.pdf", format='pdf', bbox_inches='tight')

The dpi parameter controls the resolution of the saved plot, and bbox_inches='tight' ensures that the plot is saved without extra whitespace around it.
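
To make the effect of dpi concrete, here is a small sketch that saves the same figure at two resolutions and reads back the pixel dimensions. The helper saved_png_size is a hypothetical name introduced only for this illustration, and the figure is written to an in-memory buffer rather than to disk. bbox_inches='tight' is left out on purpose, since trimming the whitespace would change the final dimensions.

```python
import io
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))  # figure size in inches
ax.plot([1, 2, 3], [2, 4, 3])

def saved_png_size(figure, dpi):
    """Save the figure to an in-memory PNG and return (width, height) in pixels."""
    buffer = io.BytesIO()
    figure.savefig(buffer, format='png', dpi=dpi)
    buffer.seek(0)
    image = plt.imread(buffer, format='png')
    return image.shape[1], image.shape[0]

low = saved_png_size(fig, dpi=100)
high = saved_png_size(fig, dpi=300)
print(low, high)
```

The pixel size is the figure size in inches multiplied by dpi, so tripling dpi triples both the width and the height of the saved image.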

Interactive Plotting and Animation

Interactive Features in Matplotlib

You can also make your plots interactive. For example, instead of viewing an entire plot, you might zoom in on a region of interest, or have the plot respond to user input. The example below reacts to mouse clicks:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.cos(x)

fig, ax = plt.subplots()
ax.plot(x, y)

def on_click(event):
    # This function is called when the plot is clicked
    print(f"The Coordinates were clicked at: ({event.xdata}, {event.ydata})")

fig.canvas.mpl_connect('button_press_event', on_click)
plt.show()

The code generates a cosine wave plot and registers a click event handler named on_click. When you click anywhere on the plot, the handler prints the coordinates of the click to the Python console.
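
One detail worth knowing: when a click lands outside the plotting area, the event's xdata and ydata are None and its inaxes attribute is None. The sketch below is one possible way to handle that case. The helper describe_click is a hypothetical name used only here, and it is kept separate from the handler so the message logic can be checked without opening a window.

```python
import matplotlib.pyplot as plt
import numpy as np

def describe_click(event):
    """Return a message for a mouse event, or None if it was outside the axes."""
    if event.inaxes is None:  # xdata and ydata are None in this case
        return None
    return f"Clicked at x={event.xdata:.2f}, y={event.ydata:.2f}"

def on_click(event):
    # Only print when the click landed inside the plotting area
    message = describe_click(event)
    if message is not None:
        print(message)

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.cos(x))

# mpl_connect returns an id that can later be passed to mpl_disconnect
connection_id = fig.canvas.mpl_connect('button_press_event', on_click)
# plt.show()  # uncomment to display the figure and click on it
```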

How to Create Animations

Animations are handy for showing how things evolve over time, for instance, the movement of a stock price or the course of a disease:

import matplotlib.animation as animation

fig, ax = plt.subplots()
line, = ax.plot(x, y)

def update(frame):
    line.set_ydata(np.cos(x + frame / 10))
    return line,

ani = animation.FuncAnimation(fig, update, frames=range(100), blit=True)
plt.show()

The code produces an animated cosine wave that appears to travel to the left over time, because each frame increases the phase offset added to x. Such animations are useful whenever the data should be shown changing over time.

How to Optimize Plots for Large Datasets

With large datasets, performance becomes important. Plotting a very large number of points can be slow and use a lot of memory. Here are some techniques that keep your plots responsive.

Efficient Plotting Techniques for Large Datasets

Downsampling

Downsampling means plotting only a subset of the original data points.

import matplotlib.pyplot as plt
import numpy as np

# Generate large dataset
x_huge = np.linspace(0, 100, 10000)
y_huge = np.sin(x_huge) + np.random.normal(0, 0.1, size=x_huge.shape)

# Downsample the data
x_downsampled = x_huge[::10]
y_downsampled = y_huge[::10]

plt.plot(x_downsampled, y_downsampled)
plt.title("Downsampled Plot")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

Here we keep every 10th point, which cuts the number of points to plot by a factor of 10. This lightens the rendering load while preserving the general structure of the data.
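
The step of 10 above is fixed by hand. As a rough sketch, the step can instead be derived from a target number of points. The function name downsample and its max_points parameter are hypothetical, introduced only for this illustration.

```python
import math
import numpy as np

def downsample(x, y, max_points):
    """Keep every step-th sample so that at most max_points samples remain."""
    step = max(1, math.ceil(len(x) / max_points))
    return x[::step], y[::step]

x_huge = np.linspace(0, 100, 10000)
y_huge = np.sin(x_huge)

x_small, y_small = downsample(x_huge, y_huge, max_points=500)
print(len(x_small))
```

Keep in mind that plain striding can skip over narrow spikes, so it is best suited to data that changes smoothly relative to the step size.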

Data Aggregation

Data aggregation groups the data into bins and replaces each bin with a summary value, such as its mean.

import matplotlib.pyplot as plt
import numpy as np

# Generate large dataset
x_huge = np.linspace(0, 100, 10000)
y_huge = np.sin(x_huge) + np.random.normal(0, 0.1, size=x_huge.shape)

# Aggregate the data into bins
bins = np.linspace(0, 100, 100)
y_aggregated = [np.mean(y_huge[(x_huge >= bins[i]) & (x_huge < bins[i+1])]) for i in range(len(bins)-1)]

plt.plot(bins[:-1], y_aggregated)
plt.title("Aggregated Plot")
plt.xlabel("X")
plt.ylabel("Average Y")
plt.show()

This process reduces the number of data points needed to represent the data distribution, making the plot easier to read and interpret while still capturing the overall trend of the original data.
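
The list comprehension above builds one boolean mask per bin, which gets slower as the number of bins grows. When the x values are evenly spaced and the number of samples divides evenly into blocks, the same kind of averaging can be sketched with a reshape. This is an alternative illustration under those assumptions, not the method used above, and its bin edges differ slightly from the ones produced by np.linspace(0, 100, 100).

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the result is reproducible
x_huge = np.linspace(0, 100, 10000)
y_huge = np.sin(x_huge) + rng.normal(0, 0.1, size=x_huge.shape)

block = 100  # samples per bin; 10000 / 100 gives 100 bins

# Each row holds one block of consecutive samples; averaging along axis 1
# collapses every block to a single value.
x_binned = x_huge.reshape(-1, block).mean(axis=1)
y_binned = y_huge.reshape(-1, block).mean(axis=1)
print(x_binned.shape, y_binned.shape)
```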

Statistical Data Visualization

Statistical plots are useful for summarizing and understanding large datasets, some of which include the following:

Box Plots

A box plot displays the distribution of the data based on a five-number summary: minimum, first quartile, median, third quartile, and maximum.

import matplotlib.pyplot as plt
import numpy as np

# Generate random data
data = np.random.randn(1000)
plt.boxplot(data)
plt.title("Box Plot")
plt.ylabel("Values")
plt.show()

Box plots are especially useful for spotting outliers and for comparing the spread and symmetry of different variables.
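
The numbers behind a box plot can be computed directly with NumPy. The sketch below does this for a small, hand-picked array. It also computes the 1.5 × IQR fences, because with Matplotlib's default settings the whiskers stop at the most extreme data points inside those fences, and anything beyond them is drawn as an individual outlier point.

```python
import numpy as np

data = np.arange(1, 10)  # the values 1, 2, ..., 9

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Points outside these fences are drawn as outliers by default
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(data.min(), q1, median, q3, data.max())
print(lower_fence, upper_fence, outliers)
```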

Violin Plots

A violin plot combines a box plot with a density plot to give a more detailed picture of how the values are distributed.

import matplotlib.pyplot as plt
import numpy as np

# Generate random data
data = np.random.randn(1000)
plt.violinplot(data)
plt.title("Violin Plot")
plt.ylabel("Values")
plt.show()

Violin plots are useful when you need to show the full shape of a distribution.
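
By default, violinplot draws the density shape and the extremes but no median marker. A minimal sketch of turning the median on, using the showmedians and showextrema parameters and a seeded random sample:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded so the figure is reproducible
data = rng.standard_normal(1000)

fig, ax = plt.subplots()
# showmedians adds a horizontal line at the median of the data
parts = ax.violinplot(data, showmedians=True, showextrema=True)
ax.set_title("Violin Plot with Median")
ax.set_ylabel("Values")
# plt.show()  # uncomment to display the figure
```

The returned dictionary (called parts here) holds the drawn artists, which can be restyled afterwards if needed.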

Common Visualization Pitfalls and How to Avoid Them

Overplotting

Overplotting occurs when many observations are drawn on top of one another, which makes the figure messy and obscures individual points and patterns. This is particularly common in scatter plots or line plots with large datasets.

import matplotlib.pyplot as plt
import numpy as np

# Generate large dataset
x = np.random.rand(10000)
y = np.random.rand(10000)

# Plot without transparency (over-plotting)
plt.scatter(x, y)
plt.title("Scatter Plot with Over-plotting")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

# Plot with transparency to reduce over-plotting
plt.scatter(x, y, alpha=0.1)  # Set alpha for transparency
plt.title("Scatter Plot with Reduced Over-plotting")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

In the first plot, without transparency, the data points overlap significantly, making it hard to identify any patterns or density areas. In the second plot, transparency (alpha=0.1) is applied to the data points, allowing denser regions to become more apparent while reducing clutter. This technique makes it easier to interpret the plot's data distribution.
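
Transparency is one remedy. Another common option, shown here as an alternative sketch rather than as part of the example above, is to bin the points into a 2D histogram so that color encodes how many points fall in each cell. The bin count of 40 is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded so the figure is reproducible
x = rng.random(10000)
y = rng.random(10000)

fig, ax = plt.subplots()
# hist2d returns the counts, the bin edges, and the image that was drawn
counts, x_edges, y_edges, image = ax.hist2d(x, y, bins=40, cmap='viridis')
fig.colorbar(image, ax=ax, label='Points per bin')
ax.set_title("2D Histogram of the Same Data")
ax.set_xlabel("X")
ax.set_ylabel("Y")
# plt.show()  # uncomment to display the figure
```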

Misleading Scales and Axes

The choice of scales and axes can change the overall impression a plot gives. Misleading scales distort the picture an analyst gets of the data and can lead to wrong conclusions.

import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.arange(10)
y1 = np.random.randint(50, 100, size=10)
y2 = y1 + np.random.randint(-5, 5, size=10)

# Plot with truncated y-axis
plt.plot(x, y1, label='Data 1')
plt.plot(x, y2, label='Data 2')
plt.ylim(90, 100)  # Truncated y-axis
plt.title("Plot with Truncated Y-Axis")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()

# Plot with full y-axis
plt.plot(x, y1, label='Data 1')
plt.plot(x, y2, label='Data 2')
plt.title("Plot with Full Y-Axis")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()

In the first plot, the y-axis is truncated to the range 90 to 100, so most of the data falls outside the visible area and small differences look exaggerated. This makes the graph misleading. The second plot leaves the y-axis limits to Matplotlib, so the axis covers the full range of the data and gives a more accurate representation.
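
When the absolute size of the values matters, one option is to anchor the y-axis at zero while leaving the upper limit alone. The sketch below shows this with the Axes interface. Whether a zero baseline is appropriate depends on the data, so treat it as one possible choice rather than a rule.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded so the figure is reproducible
x = np.arange(10)
y1 = rng.integers(50, 100, size=10)
y2 = y1 + rng.integers(-5, 5, size=10)

fig, ax = plt.subplots()
ax.plot(x, y1, label='Data 1')
ax.plot(x, y2, label='Data 2')

# Anchor the axis at zero; the upper limit keeps the value
# Matplotlib chose from the data plotted so far.
ax.set_ylim(bottom=0)
ax.set_title("Plot with Y-Axis Starting at Zero")
ax.legend()
# plt.show()  # uncomment to display the figure
```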

Color Misuse

Color is a common weak point in data visualization because it is easy to choose or use colors poorly. Typical problems are low contrast, colors that a color-blind person cannot tell apart, and color differences that suggest importance where there is none.

import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Plot with non-colorblind-friendly palette
plt.plot(x, y1, color='red', label='sin(x)')
plt.plot(x, y2, color='green', label='cos(x)')
plt.title("Plot with Non-Colorblind-Friendly Colors")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()

# Plot with colorblind-friendly palette
plt.plot(x, y1, color='#0072B2', label='sin(x)')  # Blue
plt.plot(x, y2, color='#D55E00', label='cos(x)')  # Vermillion (reddish orange)
plt.title("Plot with Colorblind-Friendly Colors")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()

The first plot uses red and green, which are notoriously difficult to distinguish for users with red-green color blindness. The second plot uses a colorblind-friendly palette so that everyone can read the plot without being confused by the colors.
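
Instead of picking hex codes by hand, you can also rely on a bundled style sheet and add a second visual cue that does not depend on color at all. The sketch below assumes the 'tableau-colorblind10' style that ships with Matplotlib and uses different line styles as that extra cue.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# The style is applied only inside the with-block
with plt.style.context('tableau-colorblind10'):
    fig, ax = plt.subplots()
    # Different line styles keep the curves distinguishable without color
    sin_line, = ax.plot(x, np.sin(x), linestyle='-', label='sin(x)')
    cos_line, = ax.plot(x, np.cos(x), linestyle='--', label='cos(x)')
    ax.set_title("Colorblind-Friendly Style with Distinct Line Styles")
    ax.legend()
# plt.show()  # uncomment to display the figure
```

Combining color with line style (or marker shape) also keeps the plot readable when it is printed in grayscale.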

Misleading Use of 3D Plots

3D plots can be visually appealing, but they often add unnecessary complexity and can be misleading if not used appropriately. They are most effective when the third dimension genuinely adds value to the visualization, such as when displaying multivariate data. Even then, 3D plots make it harder to compare values within the plot.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Generate data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))

# 3D plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')
plt.title("3D Plot")
plt.show()

# 2D contour plot
plt.contourf(X, Y, Z, cmap='viridis')
plt.colorbar(label='Z value')
plt.title("2D Contour Plot")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

The 3D plot shows the data in three dimensions, but perspective makes it hard to judge the exact height differences between regions. The 2D contour plot instead uses color to encode the Z values, which makes it easier and more accurate to compare areas of the graph. More often than not, a 2D plot is a better representation and easier to understand than its 3D counterpart.

Misleading Use of Area Charts

Area charts can effectively show trends over time or how a whole divides into parts. However, they can be confusing if the areas overlap or if it is unclear how the values accumulate.

import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.arange(0, 10, 1)
y1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y2 = np.array([1, 3, 2, 5, 4, 6, 5, 7, 6, 8])

# Overlapping area chart (potentially misleading)
plt.fill_between(x, y1, color='skyblue', alpha=0.5)
plt.fill_between(x, y2, color='orange', alpha=0.5)
plt.title("Misleading Overlapping Area Chart")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

# Improved area chart with non-overlapping areas
plt.fill_between(x, y1, color='skyblue', alpha=0.5)
plt.fill_between(x, y1 + y2, y1, color='orange', alpha=0.5)
plt.title("Improved Stacked Area Chart")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

In the first area chart, the areas overlap, which can create confusion about the contribution of each category to the whole. The second plot improves clarity by stacking the areas on top of each other without overlap, clearly showing the cumulative nature of the data.
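
Matplotlib also has a dedicated stackplot function that does the cumulative stacking for you, so the baseline of each layer does not have to be computed by hand. A minimal sketch with the same two series follows; the labels 'Series 1' and 'Series 2' are placeholder names chosen for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
y1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y2 = np.array([1, 3, 2, 5, 4, 6, 5, 7, 6, 8])

fig, ax = plt.subplots()
# stackplot draws y1 from zero and y2 on top of y1
layers = ax.stackplot(x, y1, y2, labels=['Series 1', 'Series 2'], alpha=0.5)
ax.set_title("Stacked Area Chart with stackplot")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.legend(loc='upper left')
# plt.show()  # uncomment to display the figure
```

A legend matters here because the reader needs to know which layer is which in order to read each contribution correctly.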

Conclusion

With Matplotlib, you have many features for solving specific visualization problems in data analysis. You can use it for everything from simple line plots to large datasets and animated plots.

In this guide, we have explored the important aspects of Matplotlib and tied them to real problems that you may face in your day-to-day programming work.

We also included detailed examples to support these applications. In whatever capacity you engage with the data, whether as a data scientist, engineer, or analyst, Matplotlib enables you to tell your data’s narrative in the best way possible.

Data Visualization with Matplotlib – a Step by Step Guide

Mene-Ejegi Ogbemi — Mon, 24 Apr 2023 18:32:32 +0000

SEE is a beautiful Apple TV series that depicts a dystopia where humans have lost their sight. Hundreds of years later, it is considered a myth that people could ever see.

Jason Momoa is one of the leads and plays Baba Voss, an elite warrior. Baba Voss's wife gives birth to sighted twins, and years later, during battle, Baba Voss sometimes needs the aid of the sighted children. They help him understand the terrain better, even with his battlefield mastery. We could say his children help him visualize things.

In ancient times, before digital devices, data visualization was also a myth. Early humans understood the need for visualization, so they created resources like maps, hieroglyphs, rock art, and so on. Eyewitnesses typically drew their paths and other relevant information on stones, wood, or scrolls.

Like Baba Voss's kids, these resources made it easier for humans to have a visual perspective on things or environments.

So what does visualization actually mean in this context? We can define visualization as "any technique for creating images, diagrams, or animations to communicate a message." (source)

In this article, we'll explore what data visualization is and how you can use the data visualization tool Matplotlib to explore and analyze data. You'll learn how to use it to create charts that help business owners and stakeholders get more insight about data and make informed decisions.

What is Data Visualization?

Data visualization refers to the integration of data and visual elements like images, charts, diagrams, and so on to communicate messages to different stakeholders.

These stakeholders can be users, team members, managers, or top executive members of an organization.

Data in this context refers to the inputs gathered from the organization's database or obtained from external sources, such as public databases or private organizations that provide access through their APIs.

We'll work with an employee layoff dataset which contains details of employees that have been laid off in different industries from 2020 to 2022. The columns in the dataset include the names of companies, locations, industries, total laid off, percentage laid off, date, countries, and other relevant columns.

Below is a snapshot of the data frame:

What is Matplotlib?

Matplotlib is a popular Python library for displaying data and creating static, animated, and interactive plots. This library lets you draw appealing and informative graphics like line plots, scatter plots, histograms, and bar charts.

Matplotlib is highly customizable and flexible, which makes it a preferred choice for data analysts and scientists working in fields such as finance, science, engineering, and social sciences.

In this article, I'll show you how to create a bar chart, a pie chart, and a line plot to explain how you can do data visualization using Matplotlib.

The first thing you need to do is import Matplotlib and other relevant libraries, like pandas and NumPy, along with the submodules we'll use.

#Imports packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.dates as mdates
from matplotlib.ticker import MaxNLocator

In the code above, we import the pandas package, which we use to analyze and manipulate our data. We also import Matplotlib's pyplot module, which we'll use for data visualization.

We'll use the NumPy package imported in the third line for numerical computations. We'll also work with the matplotlib.dates module to format dates when plotting our chart. The last import comes from the ticker module, which controls tick placement on plot axes. With these modules, you can analyze, manipulate, compute, and visualize your data.
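
The snippets in the following sections use a DataFrame named df_layoffs, but the loading step is not shown. As a stand-in for experimenting, here is a tiny DataFrame with the column names the later code relies on (company, industry, total_laid_off, and date). Every row is made up for illustration and does not come from the real layoff dataset.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the real dataset.
df_layoffs = pd.DataFrame({
    'company': ['Company A', 'Company B', 'Company A', 'Company C'],
    'industry': ['Retail', 'Finance', 'Retail', 'Food'],
    'total_laid_off': [120, 80, 30, 45],
    'date': ['2022-01-15', '2022-03-02', '2020-06-10', '2022-11-20'],
})

# Total layoffs per company, largest first
per_company = (
    df_layoffs.groupby('company')['total_laid_off']
    .sum()
    .sort_values(ascending=False)
)
print(per_company)
```

With a real CSV export of the dataset, the DataFrame would typically be created with pd.read_csv instead.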

How to Create a Bar Chart

Bar charts help you with categorical values. That is, if you want to compare different entities on quantity, a bar chart is an excellent way to visualize it. In the layoff dataset, we'll compare different companies that laid off employees according to the number of staff laid off.

plt.figure(figsize=(8, 6))
# Sum layoffs per company, sort from highest to lowest, and keep the top 10
company_totals = df_layoffs.groupby('company')['total_laid_off'].sum().sort_values(ascending=False).head(10)
company_totals.plot(label="", kind='bar')
plt.show()

The code above is one way to create a bar chart. It shows the top 10 companies with the highest number of layoffs.

We first set the size of the figure to 8 inches by 6 inches. Then, we group the data in the dataframe by company and sum the employees laid off by each one. We then sort in descending order and select the top 10 with the highest layoffs. Finally, we create our bar chart using the selected data. The last line (plt.show()) displays the graph which is shown below.

From the chart above, you will notice that Meta and Amazon had the highest numbers of laid-off staff, while Twitter had the fewest layoffs among these ten companies.

How to Create a Pie Chart

A pie chart represents a whole as a circle, with each slice sized according to the share of its category. The industry column is a good fit for a pie chart. We'll see which industries had the most and the fewest layoffs.

# Group the data by industry and sum the total laid off employees
industry_val = df_layoffs.groupby('industry')['total_laid_off'].sum().sort_values(ascending=False).head()

# create the pie chart and display the labels and values inside the pie
plt.figure(figsize=(8, 6))
plt.pie(industry_val, labels=industry_val.index, autopct='%1.1f%%')
plt.title('Laid Off Employees by Industry')
plt.show()

First, the code groups the data by industry and sums up the total number of laid-off employees for each industry. It then sorts the industries in descending order based on the total number of laid-off employees and selects the top five using the head() function, which returns five rows by default.

Next, we create a pie chart to visualize the data. The size of each slice in the pie represents the proportion of laid-off employees in that industry. The pie chart labels show the names of the industries. The percentage values inside the slices show the proportion of laid-off employees in that industry. The chart is titled "Laid Off Employees by Industry."

Finally, the pie chart is displayed using the plt.show() function. Like we did in the bar chart, the plt.figure(figsize=(8, 6)) function sets the chart size to be 8 inches wide and 6 inches tall.

The chart above shows how layoffs are distributed across the five industries with the most layoffs. The transportation and consumer sectors are the most affected, followed by the retail, finance, and food industries.

How to Create a Line Chart
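
Because head() keeps only the five largest industries, the percentages in the pie are shares of those five, not of all layoffs. If you want the slices to add up to the full total, one option is to fold the remaining industries into an "Other" slice. The sketch below uses made-up totals and placeholder industry names (A to G) purely for illustration.

```python
import pandas as pd

# Made-up totals per industry, for illustration only
totals = pd.Series(
    {'A': 50, 'B': 40, 'C': 30, 'D': 20, 'E': 10, 'F': 6, 'G': 4}
).sort_values(ascending=False)

top = totals.head(5)

# Fold every remaining industry into one slice so the pie covers the full total
other = pd.Series({'Other': totals.iloc[5:].sum()})
with_other = pd.concat([top, other])

shares = with_other / with_other.sum() * 100
print(shares.round(1))
```

The resulting Series can be passed to plt.pie in the same way as industry_val, with its index as the labels.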

Line charts show changes over time for an entity. With our dataset, a line chart could be used to show the trend of layoffs over the past year or two. This depends on what you are trying to communicate, but we'll work with a one year analysis.

# convert date column to datetime object
df_layoffs['date'] = pd.to_datetime(df_layoffs['date'])

# select data for one-year duration starting from January 1st, 2022
start_date = pd.Timestamp('2022-01-01')
end_date = start_date + pd.DateOffset(years=1)
df_one_year = df_layoffs.loc[(df_layoffs['date'] >= start_date) & (df_layoffs['date'] < end_date)]

# plot the selected data
df_date = df_one_year.groupby('date')['total_laid_off'].sum()
plt.figure(figsize=(10, 4))
plt.plot(df_date.index, df_date.values)
plt.xlabel('Date')
plt.ylabel('Total Laid Off')
plt.title('Laid Off Trend for 2022')
plt.xticks(rotation=45)
# set the format of the x-axis labels to show Month-Year
date_fmt = mdates.DateFormatter('%b-%Y')
plt.gca().xaxis.set_major_formatter(date_fmt)

# Use MaxNLocator to reduce the number of xticks
locator = MaxNLocator(nbins=10)
plt.gca().xaxis.set_major_locator(locator)

plt.show()

Compared to the bar chart and pie chart, this code is more involved. Here is an explanation, step by step:

The first line of the code converts the 'date' column of the DataFrame (df_layoffs) into a DateTime object so that the dates can be handled easily.

# convert date column to datetime object
df_layoffs['date'] = pd.to_datetime(df_layoffs['date'])

Next, we select the data for a one-year duration starting on January 1st, 2022. The start date is defined as a Timestamp object, and the end date is set as one year from the start date using the pd.DateOffset function. The loc function is then used to filter the DataFrame rows, selecting only those that fall within this one-year duration. Remember we are working with a year's data.

# select data for one-year duration starting from January 1st, 2022
start_date = pd.Timestamp('2022-01-01')
end_date = start_date + pd.DateOffset(years=1)
df_one_year = df_layoffs.loc[(df_layoffs['date'] >= start_date) & (df_layoffs['date'] < end_date)]

After that, we group the selected data by date and calculate the total number of layoffs on each date using the groupby and sum functions. The result is stored in a new Series called df_date.

# plot the selected data
df_date = df_one_year.groupby('date')['total_laid_off'].sum()

Then, we create a plot of the layoff trend for 2022 using the Matplotlib library. The figure size is set to 10 by 4 inches using the figure function.

plt.figure(figsize=(10, 4))

The x-axis represents the date, and the y-axis represents the total number of layoffs. The xlabel function labels the x-axis as 'Date,' and the ylabel function labels the y-axis as 'Total Laid Off.'

plt.plot(df_date.index, df_date.values)
plt.xlabel('Date')
plt.ylabel('Total Laid Off')

The plot title is set to 'Laid Off Trend for 2022' using the title function.

plt.title('Laid Off Trend for 2022')

The x-axis labels are rotated by 45 degrees using the xticks function to avoid overcrowding.

plt.xticks(rotation=45)

The format of the x-axis labels is set to show the Month-Year format using the DateFormatter function.

# set the format of the x-axis labels to show Month-Year
date_fmt = mdates.DateFormatter('%b-%Y')
plt.gca().xaxis.set_major_formatter(date_fmt)

Finally, the number of x-axis ticks is reduced using MaxNLocator, which here limits the axis to at most 10 intervals between ticks.

# Use MaxNLocator to reduce the number of xticks
locator = MaxNLocator(nbins=10)
plt.gca().xaxis.set_major_locator(locator)

The plot is then displayed using the show function.

plt.show()

The chart above shows layoff trends and patterns for 2022.
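
Grouping by individual dates can give a jagged line when layoffs are announced on scattered days. If a smoother monthly view is more useful, the daily rows can be summed per calendar month before plotting. The following sketch shows that idea on made-up rows. It is an alternative to the daily grouping above, not what the article's chart uses.

```python
import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    'date': pd.to_datetime(['2022-01-05', '2022-01-20', '2022-02-03', '2022-02-28']),
    'total_laid_off': [100, 50, 30, 20],
})

# Group by calendar month instead of by individual day
monthly = df.groupby(df['date'].dt.to_period('M'))['total_laid_off'].sum()
print(monthly)
```

To plot the result with plt.plot, the period index can be converted back to timestamps with monthly.index.to_timestamp().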

You can also analyze how well an entity performed over different periods of time. The second chart shows an analysis of employee layoffs in 2020 versus 2022.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.ticker import MaxNLocator

# convert date column to datetime object
df_layoffs['date'] = pd.to_datetime(df_layoffs['date'])

# filter data to only include 2020 and 2022
df_filtered = df_layoffs[(df_layoffs['date'].dt.year == 2020) | (df_layoffs['date'].dt.year == 2022)].copy()

# group data by year and calculate total layoffs
df_filtered['year'] = df_filtered['date'].dt.year
df_yearly = df_filtered.groupby(['year', 'date'])['total_laid_off'].sum().reset_index()

# create subplots and plot the data for each year in separate charts
fig, axs = plt.subplots(ncols=2, figsize=(14, 8))
for i, year in enumerate(df_yearly['year'].unique()):
    df_year = df_yearly.loc[df_yearly['year'] == year]
    axs[i].plot(df_year['date'], df_year['total_laid_off'])
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Total Laid Off')
    axs[i].set_title(f'Laid Off Trend for {year}')
    axs[i].xaxis.set_major_formatter(mdates.DateFormatter('%b-%Y'))
    axs[i].tick_params(axis='x', rotation=45)
    locator = MaxNLocator(nbins=10)
    axs[i].xaxis.set_major_locator(locator)

# set y-axis limit to 0-14000 for each subplot
for ax in axs:
    ax.set_ylim([0, 14000])

plt.show()

Let's review the different components of the code above.

The 'date' column in the DataFrame is converted to a datetime object.

# convert date column to datetime object
df_layoffs['date'] = pd.to_datetime(df_layoffs['date'])

Next, the code filters the data to only include layoffs from the years 2020 and 2022. It then groups the filtered data by year and date and calculates the total number of layoffs for each date.

# filter data to only include 2020 and 2022
df_filtered = df_layoffs[(df_layoffs['date'].dt.year == 2020) | (df_layoffs['date'].dt.year == 2022)].copy()

# group data by year and calculate total layoffs
df_filtered['year'] = df_filtered['date'].dt.year
df_yearly = df_filtered.groupby(['year', 'date'])['total_laid_off'].sum().reset_index()

We then create two subplots and plot the total number of layoffs for each year in separate charts. We set the x-axis labels to the date format of 'MMM-YYYY' (for example, Jan-2022) and rotate them by 45 degrees. We also set the y-axis label to 'Total Laid Off' and the chart title to 'Laid Off Trend for {year}' (for example, Laid Off Trend for 2020). Finally, we show the charts using the plt.show() command.

# create subplots and plot the data for each year in separate charts
fig, axs = plt.subplots(ncols=2, figsize=(14, 8))
for i, year in enumerate(df_yearly['year'].unique()):
    df_year = df_yearly.loc[df_yearly['year'] == year]
    axs[i].plot(df_year['date'], df_year['total_laid_off'])
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Total Laid Off')
    axs[i].set_title(f'Laid Off Trend for {year}')
    axs[i].xaxis.set_major_formatter(mdates.DateFormatter('%b-%Y'))
    axs[i].tick_params(axis='x', rotation=45)
    locator = MaxNLocator(nbins=10)
    axs[i].xaxis.set_major_locator(locator)

plt.show()

Overall, the code filters, groups, and visualizes data related to company layoffs, focusing specifically on trends for 2020 and 2022. You can see the result in the chart below:

Conclusion

We started by discussing what visualization is and how data visualization is significant in transforming raw numbers into insight and business sense.

Then we used the popular Python library Matplotlib, which is a tool for data visualization, to create bar charts, pie charts, and line charts. There are also other use cases not covered in this article, like histograms, scatter plots, box plots, and so on.

By using these visualizations, we can make sense of our data and take actions that wouldn't be possible by looking at raw numbers alone. Data visualization can also help us achieve better outcomes in areas such as finance, science, and engineering. For further study, you can check the official Matplotlib documentation here.

Thank you for reading! Please follow me on LinkedIn where I also post more data related content.

Conda Remove Package - How To Remove Matplotlib in Anaconda

Ihechikara Abba — Wed, 12 Apr 2023 12:21:34 +0000

You can use Conda to create and manage different environments and their packages. It is mostly used for data science and machine learning projects.

In this article, you'll learn how to remove a package from an environment using built-in Conda commands.

You'll learn the following:

How to create an environment.
How to install packages in an environment.
How to remove/delete an environment's package.

Let's get started!

How To Create an Environment in Conda

You can use the conda create -n environment-name command to create a new environment in Conda.

Here's an example:

conda create -n package-tutorial

The command above creates an environment called package-tutorial.

You can activate or switch to the package-tutorial environment using the conda activate environment-name command. That is:

conda activate package-tutorial

How To Install Packages in a Conda Environment

In the last section, we created and activated an environment called package-tutorial.

In this section, you'll see how to install a package in that environment. Let's install Matplotlib.

You can install a package using the conda install package-name command.

Here's one of the commands for installing Matplotlib in Conda:

conda install -c conda-forge matplotlib

The installation might take a while to download and extract the package.

Once the installation is complete, use the conda list command to verify that the package has been installed in your environment.

How To Remove a Package in Conda

You can remove a package in the current environment by running the conda remove package-name command.

In our case, we want to remove Matplotlib from the current environment (package-tutorial environment):

conda remove matplotlib

The command above removes Matplotlib from the current environment. When you run the conda list command, Matplotlib will no longer be listed as a package.
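Besides conda list, you can check from inside Python whether a package is still importable in the active environment. The sketch below uses the standard library's importlib.util.find_spec; the helper name is_installed is a hypothetical name chosen for illustration and is not part of Conda.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# Expected to print False after `conda remove matplotlib`,
# and True while the package is still installed
print("matplotlib installed:", is_installed("matplotlib"))
```

Run it with the Python interpreter of the package-tutorial environment, since each environment has its own set of packages.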

Summary

In this article, we talked about packages in Conda. They can be installed in Conda environments.

We saw how to create and activate a Conda environment. We also saw how to install and remove packages in Conda.

Happy coding!

Matplotlib Marker - How To Create a Marker in Matplotlib

Ihechikara Abba — Tue, 14 Mar 2023 15:05:00 +0000

In this article, you'll learn how to use markers in Matplotlib to indicate specific points in a plot.

The marker parameter can be used to create "markers" in a plot. You can specify the shape of the marker by passing a value to the parameter.

Here's what a normal Matplotlib plot looks like:

import matplotlib.pyplot as plt
import numpy as np

x = [2,4,6,8]
y = [1,3,9,7]

plt.plot(x,y)
plt.show()

a matplotlib plot without a marker

Here's a plot with a marker:

import matplotlib.pyplot as plt
import numpy as np

x = [2,4,6,8]
y = [1,3,9,7]

plt.plot(x,y, marker = 'o')
plt.show()

a matplotlib plot with an "o" marker

As can be seen in the image above, every data point in the plot is denoted by a marker that looks like a circle.

We're able to do that by setting the value of the marker parameter to "o": plt.plot(x,y, marker = 'o').

List of Matplotlib Markers

Here is a list (from the Matplotlib documentation) of marker values that can be assigned to the marker parameter:

Marker	Description
"."	point
","	pixel
"o"	circle
"v"	triangle_down
"^"	triangle_up
"<"	triangle_left
">"	triangle_right
"1"	tri_down
"2"	tri_up
"3"	tri_left
"4"	tri_right
"8"	octagon
"s"	square
"p"	pentagon
"P"	plus (filled)
"h"	hexagon1
"H"	hexagon2
"+"	plus
"*"	star
"x"	x
"X"	x (filled)
"D"	diamond
"d"	thin_diamond
"_"	hline
"|"	vline
0	tickleft
1	tickright
2	tickup
3	tickdown
4	caretleft
5	caretright
6	caretup
7	caretdown
8	caretleft (centered at base)
9	caretright (centered at base)
10	caretup (centered at base)
11	caretdown (centered at base)

The list above shows the different values you can use to change the style of a marker in a plot.
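To compare a few of these values side by side, here is a small sketch that draws the same data four times with different markers. The vertical offsets, the markersize value, and the choice of markers are assumptions made for illustration; the Agg backend line is only there so the sketch also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

x = [2, 4, 6, 8]
y = [1, 3, 9, 7]

fig, ax = plt.subplots()
lines = []
# Shift each line upward a little so the markers do not overlap
for offset, marker in enumerate(["o", "s", "^", "D"]):
    shifted = [value + 2 * offset for value in y]
    (line,) = ax.plot(x, shifted, marker=marker, markersize=8,
                      label=f"marker={marker!r}")
    lines.append(line)

ax.legend()
# plt.show()  # uncomment in an interactive session to display the figure
```

Each line keeps its own marker, so you can mix styles within one plot.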

Summary

In this article, we talked about markers in Matplotlib. They can be used to mark/indicate specific points in a plot.

We saw some code examples showing the application of the marker parameter.

Lastly, we saw a list of marker values that can be used to change the style of a marker.

Happy coding!

How To Change Legend Font Size in Matplotlib

Ihechikara Abba — Tue, 14 Mar 2023 15:04:13 +0000

You can modify different properties of a plot — color, size, label, title and so on — when working with Matplotlib.

In this article, you'll learn what a legend is in Matplotlib, and how to use some of its parameters to make your plots more readable.

You'll then learn how to change the font size of a Matplotlib legend using:

The fontsize parameter.
The prop parameter.

What Is a Legend in Matplotlib?

A legend is a box on a graph that describes the elements that make up the graph. In Matplotlib, you create one with the legend() function.

Consider the graph below:

import matplotlib.pyplot as plt

# create a plot
x = [1, 4, 6, 8]
y = [2, 5, 6, 2]

plt.plot(x, y)

plt.legend(["Data"], loc="upper right")

plt.show()

matplotlib graph with a legend

In the graph above, we described the plot using a legend. A description of "Data" was assigned to the legend, and was placed in the upper right corner of the graph using the upper right value of the loc parameter.

With the legend function, you can assign different descriptions to each line of a graph.

Here's an example:

import matplotlib.pyplot as plt

age = [1, 4, 6, 8]
number = [4, 5, 6, 2, 1]

plt.plot(age)
plt.plot(number)

plt.legend(["age", "number"], loc ="upper right")

plt.show()

two line graph with different legend descriptions

In the graph above, we've used the legend function to describe each line in the plot.

This makes it easier for anyone viewing the graph to know that the blue line denotes age while the orange line denotes number in the graph.
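An alternative to passing a list of names to legend() is to attach a label to each line when you plot it, so the description stays next to the data it describes. This is a minimal sketch using the same age and number lists; the variable legend_labels is a hypothetical name used only to inspect the result.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

age = [1, 4, 6, 8]
number = [4, 5, 6, 2, 1]

fig, ax = plt.subplots()
ax.plot(age, label="age")        # the label is picked up by the legend
ax.plot(number, label="number")

legend = ax.legend(loc="upper right")
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
# plt.show()  # uncomment in an interactive session to display the figure
```

With labels set this way, reordering the plot calls cannot silently mismatch a line and its description.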

You can change the position of the legend using the following values of the loc parameter:

best
upper right
upper left
lower left
lower right
right
center left
center right
lower center
upper center
center

How To Change Legend Font Size in Matplotlib Using the `fontsize` Parameter

You can change the font size of a Matplotlib legend by specifying a font size value for the fontsize parameter.

Here's what the default legend font size looks like:

import matplotlib.pyplot as plt

age = [1, 4, 6, 8]
number = [4, 5, 6, 2, 1]

plt.plot(age)
plt.plot(number)

plt.legend(["age", "number"], loc ="upper right")

plt.show()

matplotlib graph with default legend font size

Here's another code example with the fontsize parameter included:

import matplotlib.pyplot as plt

age = [1, 4, 6, 8]
number = [4, 5, 6, 2, 1]

plt.plot(age)
plt.plot(number)

plt.legend(["age", "number"], fontsize="20", loc ="upper left")

plt.show()

Here's what the legend would look like:

matplotlib legend size using fontsize parameter

We assigned a font size of 20 to the fontsize parameter to get the legend size in the image above: fontsize="20".

You'll also notice that the legend was placed in the upper left corner of the graph using the loc parameter.

How To Change Legend Font Size in Matplotlib Using the `prop` Parameter

Another way of changing the font size of a legend is by using the legend function's prop parameter.

Here's how to use it:

import matplotlib.pyplot as plt

age = [1, 4, 6, 8]
number = [4, 5, 6, 2, 1]

plt.plot(age)
plt.plot(number)

plt.legend(["age", "number"], prop = { "size": 20 }, loc ="upper left")

plt.show()

Using the prop parameter, we specified a font size of 20: prop = { "size": 20 }.

Here's the output:

matplotlib legend size using prop parameter
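If you want to confirm that both parameters lead to the same result, you can read the font size back from the legend text. The sketch below is only an illustration: the two-subplot layout and the variable names are assumptions, and the Agg backend line is there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

age = [1, 4, 6, 8]

fig, (ax_left, ax_right) = plt.subplots(ncols=2)
ax_left.plot(age, label="age")
ax_right.plot(age, label="age")

# Same size requested in two different ways
legend_fontsize = ax_left.legend(fontsize=20)
legend_prop = ax_right.legend(prop={"size": 20})

size_fontsize = legend_fontsize.get_texts()[0].get_fontsize()
size_prop = legend_prop.get_texts()[0].get_fontsize()
print(size_fontsize, size_prop)
```

The prop parameter is the more general of the two, since the same dictionary can also carry other font properties such as the family or weight.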

Summary

In this article, we talked about the legend function in Matplotlib. It can be used to describe the elements that make up a graph.

We first saw what a legend is in Matplotlib, and some examples to show its basic usage and parameters.

We then saw how to use the fontsize and prop parameters to change the font size of a Matplotlib legend.

Happy coding!

Matplotlib Add Color – How To Change Line Color in Matplotlib

Ihechikara Abba — Mon, 13 Mar 2023 21:55:25 +0000

Matplotlib is a Python library used for data visualization, and creating interactive plots and graphs.

In this article, you'll learn how to add colors to your Matplotlib plots using parameter values provided by the Matplotlib plot() function.

You'll learn how to change the color of a plot using:

Color names.
Color abbreviations.
RGB/RGBA values.
Hex values.

Let's get started!

How To Change Line Color in Matplotlib

By default, the first line in a Matplotlib plot is blue. That is:

import matplotlib.pyplot as plt

x = [5,10,15,20]
y = [10,20,30,40]

plt.plot(x,y)
plt.show()

To change the color of a plot, simply add a color parameter to the plot function and specify the value of the color.

Here are some examples:

How To Change Line Color in Matplotlib Example #1

In this example, we'll change the color of the plot using a color name.

import matplotlib.pyplot as plt

x = [5,10,15,20]
y = [10,20,30,40]

plt.plot(x,y, color='red')
plt.show()

In the example above, we assigned a value of 'red' to the color parameter: color='red'.

How To Change Line Color in Matplotlib Example #2

You can make use of abbreviations when specifying the color to be used for the plot. That is:

'b' = blue
'g' = green
'r' = red
'c' = cyan
'm' = magenta
'y' = yellow
'k' = black
'w' = white

Here's a code example:

import matplotlib.pyplot as plt

x = [5,10,15,20]
y = [10,20,30,40]

plt.plot(x,y, color='m')
plt.show()

How To Change Line Color in Matplotlib Example #3

You can also make use of RGB and RGBA (red, green, blue, alpha), and hex values.

Here's an example that creates a plot with a yellow color using RGB:

import matplotlib.pyplot as plt

x = [5,10,15,20]
y = [10,20,30,40]

plt.plot(x,y, color=(1.0, 0.92, 0.23))
plt.show()

Here's another example that uses a hex value to create a green plot:

import matplotlib.pyplot as plt

x = [5,10,15,20]
y = [10,20,30,40]

plt.plot(x,y, color='#00FF00')
plt.show()
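The examples above cover RGB and hex values; an RGBA value adds a fourth number for opacity. The sketch below plots a half-transparent red line and uses the to_rgba and to_hex helpers from matplotlib.colors to convert between the formats. The specific colors are assumptions picked for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex, to_rgba

x = [5, 10, 15, 20]
y = [10, 20, 30, 40]

fig, ax = plt.subplots()
# RGBA tuple: full red, no green, no blue, 50% opacity
(line,) = ax.plot(x, y, color=(1.0, 0.0, 0.0, 0.5))

green_rgba = to_rgba("#00FF00")
yellow_hex = to_hex((1.0, 0.92, 0.23))
print(green_rgba)   # (0.0, 1.0, 0.0, 1.0)
print(yellow_hex)   # #ffeb3b
# plt.show()  # uncomment in an interactive session to display the figure
```

Converting a color this way is handy when a design tool gives you one format and your code uses another.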

Summary

In this article, we talked about how to change the color of plots in Matplotlib.

We saw examples that showed how to use color names, abbreviations, RGB/RGBA values, and hex values to change the color of a plot in Matplotlib.

Happy coding!

Matplotlib Figure Size – How to Change Plot Size in Python with plt.figsize()

Ihechikara Abba — Thu, 12 Jan 2023 15:29:17 +0000

When creating plots using Matplotlib, you get a default figure size of 6.4 for the width and 4.8 for the height (in inches).

In this article, you'll learn how to change the plot size using the following:

The figsize parameter.
The set_figwidth() method.
The set_figheight() method.
The rcParams parameter.

Let's get started!

How to Change Plot Size in Matplotlib with `figsize`

As stated in the previous section, the default parameters (in inches) for Matplotlib plots are 6.4 wide and 4.8 high. Here's a code example:

import matplotlib.pyplot as plt

x = [2,4,6,8]
y = [10,3,20,4]

plt.plot(x,y)

plt.show()

In the code above, we first imported matplotlib. We then created two lists — x and y — with values to be plotted.

Using plt.plot(), we plotted list x on the x-axis and list y on the y-axis: plt.plot(x,y).

Lastly, plt.show() displays the plot. Here's what the plot would look like with the default figure size parameters:

matplotlib plot with default figure size parameters

We can change the size of the plot above using the figsize parameter of the figure() function.

The figsize parameter takes a tuple of two values: the width and the height, both in inches.

Here's what the syntax looks like:

figure(figsize=(WIDTH_SIZE,HEIGHT_SIZE))

Here's a code example:

import matplotlib.pyplot as plt

x = [2,4,6,8]
y = [10,3,20,4]

plt.figure(figsize=(10,6))
plt.plot(x,y)

plt.show()

We've added one new line of code: plt.figure(figsize=(10,6)). This will modify/change the width and height of the plot.

Here's what the plot would look like:

matplotlib plot with modified figure size
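To check that the size was applied, you can read it back from the figure object. This is a small sketch; the variable names are assumptions, and the Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

x = [2, 4, 6, 8]
y = [10, 3, 20, 4]

fig = plt.figure(figsize=(10, 6))
plt.plot(x, y)

# get_size_inches() returns the width and height in inches
width, height = fig.get_size_inches()
print(width, height)

# The size in pixels depends on the figure's dots-per-inch setting
width_px = width * fig.dpi
print(width_px)
```

Because the size is given in inches, the pixel dimensions of a saved image also depend on the dpi used when saving.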

How to Change Plot Width in Matplotlib with `set_figwidth()`

You can use the set_figwidth() method to change the width of a plot.

We'll pass in the value the width should be changed to as a parameter to the method.

This method will not change the default or preset value of the plot's height.

Here's a code example:

import matplotlib.pyplot as plt

x = [2,4,6,8]
y = [10,3,20,4]

plt.figure().set_figwidth(15)
plt.plot(x,y)

plt.show()

Using the set_figwidth() method, we set the width of the plot to 15. Here's what the plot would look like:

matplotlib plot with modified width

How to Change Plot Height in Matplotlib with `set_figheight()`

You can use the set_figheight() method to change the height of a plot.

This method will not change the default or preset value of the plot's width.

Here's a code example:

import matplotlib.pyplot as plt

x = [2,4,6,8]
y = [10,3,20,4]

plt.figure().set_figheight(2)
plt.plot(x,y)

plt.show()

Using set_figheight() in the example above, we set the plot's height to 2. Here's what the plot would look like:

matplotlib plot with modified height

How to Change Default Plot Size in Matplotlib with `rcParams`

You can override the default plot size in Matplotlib using the rcParams parameter.

This is useful when you want all your plots to follow a particular size. This means you don't have to change the size of every plot you create.

Here's an example with two plots:

import matplotlib.pyplot as plt

x = [2,4,6,8]
y = [10,3,20,4]

plt.rcParams['figure.figsize'] = [4, 4]
plt.plot(x,y)

plt.show()

a = [5,10,15,20]
b = [10,20,30,40]

plt.plot(a,b)

plt.show()

Using the figure.figsize parameter, we set the default width and height to 4: plt.rcParams['figure.figsize'] = [4, 4]. These parameters will change the default width and height of the two plots.

Here are the plots:

matplotlib plot with modified default size
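Assigning to plt.rcParams changes the default for the rest of the session. If you only want the new default for a few plots, one option is the plt.rc_context() context manager, which restores the previous settings when the block ends. The sketch below assumes nothing else has changed the defaults; the figure names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

# Inside the block, new figures default to 4 x 4 inches
with plt.rc_context({"figure.figsize": (4, 4)}):
    fig_small = plt.figure()
    plt.plot([2, 4, 6, 8], [10, 3, 20, 4])

# Outside the block, the earlier default applies again
fig_default = plt.figure()
plt.plot([5, 10, 15, 20], [10, 20, 30, 40])

print(fig_small.get_size_inches())
print(fig_default.get_size_inches())
```

To undo a permanent change made through plt.rcParams, you can call plt.rcdefaults(), which resets the settings to Matplotlib's built-in defaults.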

Summary

In this article, we talked about the different ways you can change the size of a plot in Matplotlib.

We saw code examples and visual representation of the plots. This helped us understand how each method can be used to change the size of a plot.

We discussed the following methods used in changing the plot size in Matplotlib:

The figsize parameter can be used when you want to change the default size of a specific plot.
The set_figwidth() method can be used to change only the width of a plot.
The set_figheight() method can be used to change only the height of a plot.
The rcParams parameter can be used when you want to override the default plot size for all your plots. Unlike the figsize parameter, which targets a specific plot, the rcParams parameter targets all the plots in a project.

Happy coding!

What is Data Analysis? How to Visualize Data with Python, Numpy, Pandas, Matplotlib & Seaborn Tutorial

freeCodeCamp — Thu, 24 Jun 2021 00:11:01 +0000

By Aakash NS

Data Analysis is the process of exploring, investigating, and gathering insights from data using statistical measures and visualizations.

The objective of data analysis is to develop an understanding of data by uncovering trends, relationships, and patterns.

Data analysis is both a science and an art. On the one hand, it requires that you know statistics, visualization techniques, and data analysis tools like Numpy, Pandas, and Seaborn.

On the other hand, it requires that you ask interesting questions to guide the investigation, and then interpret the numbers and figures to generate useful insights.

This tutorial on data analysis covers the following topics:

What is Numerical Computation? Python and Numpy for Beginners
How to Analyze Tabular Data using Python and Pandas
Data Visualization using Python, Matplotlib, and Seaborn

What is Numerical Computation? Python and Numpy for Beginners

_Source: Elegant Scipy_

You can follow along with the tutorial and run the code here: https://jovian.ai/aakashns/python-numerical-computing-with-numpy

This section covers the following topics:

How to work with numerical data in Python
How to turn Python lists into Numpy arrays
Multi-dimensional Numpy arrays and their benefits
Array operations, broadcasting, indexing, and slicing
How to work with CSV data files using Numpy

How to Work with Numerical Data in Python

The "data" in Data Analysis typically refers to numerical data, like stock prices, sales figures, sensor measurements, sports scores, database tables, and so on.

The Numpy library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why and how to use Numpy to work with numerical data.

Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples.

A simple approach to do this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in millimeters), and average relative humidity (in percentage) as a linear equation.

yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity

We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity.

This equation is an approximation, since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights w1, w2, and w3. Here's an example set of values:

w1, w2, w3 = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:

To begin, we can define some variables to record climate data for a region.

kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

We can now substitute these variables into the linear equation to predict the yield of apples.

kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples
# 56.8

print("The expected yield of apples in Kanto region is {} tons per hectare.".format(kanto_yield_apples))
# The expected yield of apples in Kanto region is 56.8 tons per hectare.

To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, that is, a list of numbers.

kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

The three numbers in each vector represent the temperature, rainfall, and humidity data, respectively.

We can also represent the set of weights used in the formula as a vector.

weights = [w1, w2, w3]

We can now write a function crop_yield to calculate the yield of apples (or any other crop) given the climate data and the respective weights.

def crop_yield(region, weights):
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

crop_yield(kanto, weights)
# 56.8

crop_yield(johto, weights)
# 76.9

crop_yield(unova, weights)
# 74.9
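With the function in place, the same calculation can be applied to every region in one pass. The sketch below gathers the five vectors in a dictionary (the name regions is an assumption for illustration) and rounds each result to one decimal place to hide floating-point noise.

```python
w1, w2, w3 = 0.3, 0.2, 0.5
weights = [w1, w2, w3]

# Temperature, rainfall, and humidity for each region
regions = {
    "kanto": [73, 67, 43],
    "johto": [91, 88, 64],
    "hoenn": [87, 134, 58],
    "sinnoh": [102, 43, 37],
    "unova": [69, 96, 70],
}


def crop_yield(region, weights):
    """Weighted sum of the climate values for one region."""
    return sum(x * w for x, w in zip(region, weights))


yields = {name: round(crop_yield(data, weights), 1)
          for name, data in regions.items()}
print(yields)
# {'kanto': 56.8, 'johto': 76.9, 'hoenn': 81.9, 'sinnoh': 57.7, 'unova': 74.9}
```

This works, but every new region still costs one Python-level loop, which is the overhead Numpy removes.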

How to Turn Python Lists into Numpy Arrays

The calculation performed by the crop_yield function (element-wise multiplication of two vectors and taking a sum of the results) is also called the dot product. Learn more about dot products here.

The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.

Let's install the Numpy library using the pip package manager.

!pip install numpy --upgrade --quiet

Next, let's import the numpy module. It's common practice to import numpy with the alias np.

import numpy as np

We can now use the np.array function to create Numpy arrays.

kanto = np.array([73, 67, 43])

kanto
# array([73, 67, 43])

weights = np.array([w1, w2, w3])

weights
# array([0.3, 0.2, 0.5])

Numpy arrays have the type ndarray.

type(kanto)
# numpy.ndarray

type(weights)
# numpy.ndarray

Just like lists, Numpy arrays support the indexing notation [].

weights[0]
# 0.3

kanto[2]
# 43

How to Operate on Numpy arrays

We can now compute the dot product of the two vectors using the np.dot function.

np.dot(kanto, weights)
# 56.8

We can achieve the same result with low-level operations supported by Numpy arrays: performing an element-wise multiplication and calculating the resulting numbers' sum.

(kanto * weights).sum()
# 56.8

The * operator performs an element-wise multiplication of two arrays if they have the same size. The sum method calculates the sum of numbers in an array.

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

arr1 * arr2
# array([ 4, 10, 18])

arr2.sum()
# 15
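As a quick check that these approaches agree, the following sketch computes the dot product of arr1 and arr2 in three ways: with np.dot, with element-wise multiplication followed by sum, and with the @ operator. The variable names via_dot, via_sum, and via_matmul are assumptions for illustration.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

via_dot = np.dot(arr1, arr2)       # built-in dot product
via_sum = (arr1 * arr2).sum()      # multiply element-wise, then add up
via_matmul = arr1 @ arr2           # @ on two 1D arrays is also a dot product

print(via_dot, via_sum, via_matmul)
# 32 32 32
```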

What are the Benefits of Using Numpy Arrays?

Numpy arrays offer the following benefits over Python lists for operating on numerical data:

They're easy to use: You can write small, concise, and intuitive mathematical expressions like (kanto * weights).sum() rather than using loops and custom functions like crop_yield.
Performance: Numpy operations and functions are implemented internally in C, which makes them much faster than using Python statements and loops that are interpreted at runtime.

Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.

# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

# CPU times: user 300 ms, sys: 3.26 ms, total: 303 ms
# Wall time: 302 ms
# 833332333333500000

%%time
np.dot(arr1_np, arr2_np)

# CPU times: user 2.11 ms, sys: 951 µs, total: 3.07 ms
# Wall time: 1.58 ms
# 833332333333500000

As you can see, using np.dot is more than 100 times faster than using a for loop in this run (about 300 ms versus about 2 ms of wall time). This makes Numpy especially useful while working with really large datasets with tens of thousands or millions of data points.
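The %%time lines above are Jupyter cell magics, so they only work inside a notebook. For a plain Python script, a comparable measurement can be sketched with the standard timeit module. The array length n and the helper name loop_dot are assumptions; the arrays are smaller than in the example above so the sketch finishes quickly, and the int64 dtype is set explicitly so the products cannot overflow on platforms with a 32-bit default integer.

```python
import timeit

import numpy as np

n = 100_000  # smaller than one million so the sketch runs quickly

# Python lists
arr1 = list(range(n))
arr2 = list(range(n, 2 * n))

# Numpy arrays with an explicit 64-bit integer type
arr1_np = np.array(arr1, dtype=np.int64)
arr2_np = np.array(arr2, dtype=np.int64)


def loop_dot(a, b):
    """Dot product computed with a plain Python loop."""
    result = 0
    for x1, x2 in zip(a, b):
        result += x1 * x2
    return result


loop_result = loop_dot(arr1, arr2)
numpy_result = int(np.dot(arr1_np, arr2_np))

# Time five runs of each approach
loop_time = timeit.timeit(lambda: loop_dot(arr1, arr2), number=5)
numpy_time = timeit.timeit(lambda: np.dot(arr1_np, arr2_np), number=5)

print(f"Python loop: {loop_time:.4f} s for 5 runs")
print(f"np.dot:      {numpy_time:.4f} s for 5 runs")
```

The exact timings depend on your machine, so treat the ratio as a rough indication rather than a fixed number.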

Multi-Dimensional Numpy Arrays

We can now go one step further and represent the climate data for all the regions using a single 2-dimensional Numpy array.

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

climate_data
# array([[ 73,  67,  43],
#        [ 91,  88,  64],
#        [ 87, 134,  58],
#        [102,  43,  37],
#        [ 69,  96,  70]])

If you've taken a linear algebra class in high school, you may recognize the above 2-d array as a matrix with five rows and three columns. Each row represents one region, and the columns represent temperature, rainfall, and humidity, respectively.

Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the .shape property of an array.

_Source: Elegant Scipy_

# 2D array (matrix)
climate_data.shape
# (5, 3)

weights
# array([0.3, 0.2, 0.5])

# 1D array (vector)
weights.shape
# (3,)

# 3D array 
arr3 = np.array([
    [[11, 12, 13], 
     [13, 14, 15]], 
    [[15, 16, 17], 
     [17, 18, 19.5]]])

arr3.shape
# (2, 2, 3)

All the elements in a Numpy array have the same data type. You can check the data type of an array using the .dtype property.

weights.dtype
# dtype('float64')

climate_data.dtype
# dtype('int64')

If an array contains even a single floating point number, all the other elements are also converted to floats.

arr3.dtype
# dtype('float64')
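The conversion rule can be seen on a tiny example, along with the astype method for converting explicitly. This is a minimal sketch; the array names are assumptions for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])        # all integers, so an integer dtype
mixed = np.array([1, 2, 3.5])     # one float turns every element into a float

print(ints.dtype)
print(mixed.dtype)                # float64

# Explicit conversions with astype
as_float = ints.astype(np.float64)
truncated = mixed.astype(np.int64)   # drops the fractional part: 3.5 becomes 3

print(as_float)
print(truncated)
```

Note that converting floats to integers truncates rather than rounds, so use np.round first if you need rounding.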

We can now compute the predicted yields of apples in all the regions, using a single matrix multiplication between climate_data (a 5x3 matrix) and weights (a vector of length 3). Here's what it looks like visually:

You can learn about matrices and matrix multiplication by watching the first 3-4 videos of this YouTube playlist.

We can use the np.matmul function or the @ operator to perform matrix multiplication.

np.matmul(climate_data, weights)
# array([56.8, 76.9, 81.9, 57.7, 74.9])

climate_data @ weights
# array([56.8, 76.9, 81.9, 57.7, 74.9])
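One way to convince yourself of what the matrix multiplication does is to compare it with a dot product computed one row at a time. In this sketch, by_matmul and by_rows are hypothetical names; np.allclose is used for the comparison because the values are floats.

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# (5, 3) matrix times a length-3 vector gives a length-5 vector
by_matmul = climate_data @ weights

# The same numbers, computed one region (row) at a time
by_rows = np.array([np.dot(row, weights) for row in climate_data])

print(by_matmul.shape)                  # (5,)
print(np.allclose(by_matmul, by_rows))  # True
```

The inner dimensions must match: the 3 columns of climate_data line up with the 3 entries of weights.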

How to Work with CSV Data Files

Numpy also provides helper functions for reading from and writing to files. Let's download a file, climate.txt, which contains 10,000 climate measurements (temperature, rainfall, and humidity) in the following format:

temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
84.00,63.00,38.00
66.00,50.00,52.00
41.00,94.00,77.00
91.00,57.00,96.00
49.00,96.00,99.00
67.00,20.00,28.00
...

This format of storing data is known as comma-separated values or CSV.

CSVs: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)

To read this file into a numpy array, we can use the genfromtxt function.

import urllib.request

urllib.request.urlretrieve(
    'https://hub.jovian.ml/wp-content/uploads/2020/08/climate.csv', 
    'climate.txt')

climate_data = np.genfromtxt('climate.txt', delimiter=',', skip_header=1)

climate_data
# array([[25., 76., 99.],
#        [39., 65., 70.],
#        [59., 45., 77.],
#        ...,
#        [99., 62., 58.],
#        [70., 71., 91.],
#        [92., 39., 76.]])

climate_data.shape
# (10000, 3)
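If you'd like to try genfromtxt without downloading anything, it also accepts a file-like object. The sketch below wraps the first few rows shown earlier in an io.StringIO object; the names csv_text and sample are assumptions for illustration.

```python
import io

import numpy as np

# First three records of the file, preceded by the header row
csv_text = """temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
"""

# skip_header=1 ignores the column names; delimiter=',' splits the fields
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(sample)
print(sample.shape)  # (3, 3)
```

The result is always a float array here, because genfromtxt uses a float dtype by default.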

We can now perform a matrix multiplication using the @ operator to predict the yield of apples for the entire dataset using a given set of weights.

weights = np.array([0.3, 0.2, 0.5])

yields = climate_data @ weights
yields
# array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])

yields.shape
# (10000,)

Let's add the yields to climate_data as a fourth column using the np.concatenate function.

climate_results = np.concatenate((climate_data, yields.reshape(10000, 1)), axis=1)

climate_results
# array([[25. , 76. , 99. , 72.2],
#        [39. , 65. , 70. , 59.7],
#        [59. , 45. , 77. , 65.2],
#        ...,
#        [99. , 62. , 58. , 71.1],
#        [70. , 71. , 91. , 80.7],
#        [92. , 39. , 76. , 73.4]])

There are a couple of subtleties here:

Since we wish to add new columns, we pass the argument axis=1 to np.concatenate. The axis argument specifies the dimension for concatenation.
The arrays should have the same number of dimensions, and the same length along each dimension except the one used for concatenation. We use the reshape method to change the shape of yields from (10000,) to (10000, 1).

Here's a visual explanation of np.concatenate along axis=1 (can you guess what axis=0 results in?):

Source: w3resource.com

The best way to understand what a Numpy function does is to experiment with it and read the documentation to learn about its arguments and return values. Try experimenting with np.concatenate and np.reshape in your own notebook.
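As a starting point for that kind of experiment, here is a small sketch that joins two 2 x 2 arrays along each axis and then appends a reshaped 1D array as a new column. The arrays a, b, and col are made up for illustration; reshape(-1, 1) asks Numpy to infer the number of rows.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# axis=1 places the arrays side by side (more columns)
side_by_side = np.concatenate((a, b), axis=1)
print(side_by_side.shape)  # (2, 4)

# axis=0 stacks them on top of each other (more rows)
stacked = np.concatenate((a, b), axis=0)
print(stacked.shape)       # (4, 2)

# A 1D array must become a column before it can be added with axis=1
col = np.array([9, 10]).reshape(-1, 1)
with_col = np.concatenate((a, col), axis=1)
print(with_col)
# [[ 1  2  9]
#  [ 3  4 10]]
```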

Let's write the final results from our computation above back to a file using the np.savetxt function.

np.savetxt('climate_results.txt', 
           climate_results, 
           fmt='%.2f', 
           delimiter=',',
           header='temperature,rainfall,humidity,yield_apples', 
           comments='')

The results are written back in the CSV format to the file climate_results.txt.

temperature,rainfall,humidity,yield_apples
25.00,76.00,99.00,72.20
39.00,65.00,70.00,59.70
59.00,45.00,77.00,65.20
84.00,63.00,38.00,56.80
...

Numpy provides hundreds of functions for performing operations on arrays. Here are some commonly used functions:

Mathematics: np.sum, np.exp, np.round, arithmetic operators
Array manipulation: np.reshape, np.stack, np.concatenate, np.split
Linear Algebra: np.matmul, np.dot, np.transpose, np.linalg.eigvals
Statistics: np.mean, np.median, np.std, np.max
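To give a feel for a few of these functions, the sketch below applies some of them to the five-region climate array from earlier. The small diagonal matrix used for the eigenvalue call is an assumption chosen so the result is easy to verify by eye; note that eigenvalue routines live in the np.linalg submodule.

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

# axis=0 aggregates down each column (temperature, rainfall, humidity)
col_means = np.mean(climate_data, axis=0)
col_max = np.max(climate_data, axis=0)
median_rainfall = np.median(climate_data[:, 1])

print(col_means)        # [84.4 85.6 54.4]
print(col_max)          # [102 134  70]
print(median_rainfall)  # 88.0

# Eigenvalues of a diagonal matrix are its diagonal entries
eigenvalues = np.linalg.eigvals(np.array([[2.0, 0.0],
                                          [0.0, 3.0]]))
print(eigenvalues)
```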

So how do you find the function you need? The easiest way to find the right function for a specific operation or use-case is to do a web search. For instance, searching for "How to join numpy arrays" leads to this tutorial on array concatenation.

You can find a full list of array functions here.

Numpy Arithmetic Operations, Broadcasting, and Comparison

Numpy arrays support arithmetic operators like +, -, *, etc. You can perform an arithmetic operation with a single number (also called a scalar) or with another array of the same shape.

Operators make it easy to write mathematical expressions with multi-dimensional arrays.

arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

arr3 = np.array([[11, 12, 13, 14], 
                 [15, 16, 17, 18], 
                 [19, 11, 12, 13]])

# Adding a scalar
arr2 + 3

# array([[ 4,  5,  6,  7],
#        [ 8,  9, 10, 11],
#        [12,  4,  5,  6]])

# Element-wise subtraction
arr3 - arr2

# array([[10, 10, 10, 10],
#        [10, 10, 10, 10],
#        [10, 10, 10, 10]])

# Division by scalar
arr2 / 2

# array([[0.5, 1. , 1.5, 2. ],
#        [2.5, 3. , 3.5, 4. ],
#        [4.5, 0.5, 1. , 1.5]])

# Element-wise multiplication
arr2 * arr3

# array([[ 11,  24,  39,  56],
#        [ 75,  96, 119, 144],
#        [171,  11,  24,  39]])

# Modulus with scalar
arr2 % 4

# array([[1, 2, 3, 0],
#        [1, 2, 3, 0],
#        [1, 1, 2, 3]])

Numpy Array Broadcasting

Numpy arrays also support broadcasting, allowing arithmetic operations between two arrays with different numbers of dimensions but compatible shapes. Let's look at an example to see how it works.

arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])               
arr2.shape
# (3, 4)

arr4 = np.array([4, 5, 6, 7])
arr4.shape
# (4,)

arr2 + arr4
# array([[ 5,  7,  9, 11],
#        [ 9, 11, 13, 15],
#        [13,  6,  8, 10]])

When the expression arr2 + arr4 is evaluated, arr4 (which has the shape (4,)) is replicated three times to match the shape (3, 4) of arr2. Numpy performs the replication without actually creating three copies of the smaller array, thus improving performance and using less memory.

Source: Python Data Science Handbook

Broadcasting only works if one of the arrays can be replicated to match the other array's shape.

arr5 = np.array([7, 8])
arr5.shape
# (2,)

arr2 + arr5
# ValueError: operands could not be broadcast together with shapes (3,4) (2,)

In the above example, even if arr5 is replicated three times, it will not match the shape of arr2. So arr2 + arr5 cannot be evaluated successfully. Learn more about broadcasting here.
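To make the compatibility rule concrete, here is a minimal sketch. Numpy compares the two shapes starting from the last dimension, and two dimensions are compatible when they are equal or when one of them is 1. The names row and col are made up for this illustration.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])

# Shape (4,) lines up with the last dimension of (3, 4): one value per column
row = np.array([10, 20, 30, 40])
print((arr2 + row).shape)

# Shape (3, 1) has a length-1 dimension that is stretched across the 4 columns
col = np.array([[100], [200], [300]])
print(arr2 + col)

# Shape (2,) satisfies neither condition, so Numpy raises a ValueError
try:
    arr2 + np.array([7, 8])
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print(err)
```

Reshaping a one-dimensional array into a column, as with col above, is a common way to apply one value per row instead of one value per column.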

Numpy Array Comparison

Numpy arrays also support comparison operations like ==, !=, > and so on. The result is an array of booleans.

arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

arr1 == arr2
# array([[False,  True,  True],
#        [False, False,  True]])

arr1 != arr2
# array([[ True, False, False],
#        [ True,  True, False]])

arr1 >= arr2
# array([[False,  True,  True],
#        [ True,  True,  True]])

arr1 < arr2
# array([[ True, False, False],
#        [False, False, False]])

Array comparison is frequently used to count the number of equal elements in two arrays using the sum method. Remember that True evaluates to 1 and False evaluates to 0 when you use booleans in arithmetic operations.

(arr1 == arr2).sum()
# 3
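The boolean array produced by a comparison is useful beyond counting. The sketch below reuses arr1 and arr2 from above; the variable names such as equal_mask are only illustrative. It shows the mean, any, and all methods, and how a boolean array can select elements.

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

equal_mask = arr1 == arr2

# Count the matches, and compute the fraction of positions that match
num_equal = equal_mask.sum()
fraction_equal = equal_mask.mean()

# any/all collapse the boolean array to a single True or False
some_equal = equal_mask.any()
all_equal = equal_mask.all()

# A boolean array can also be used to pick out the matching elements
matching_values = arr1[equal_mask]

print(num_equal, fraction_equal, some_equal, all_equal, matching_values)
```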

Numpy Array Indexing and Slicing

Numpy extends Python's list indexing notation using [] to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.

arr3 = np.array([
    [[11, 12, 13, 14], 
     [13, 14, 15, 19]], 

    [[15, 16, 17, 21], 
     [63, 92, 36, 18]], 

    [[98, 32, 81, 23],      
     [17, 18, 19.5, 43]]])

arr3.shape
# (3, 2, 4)

# Single element
arr3[1, 1, 2]

# 36.0

# Subarray using ranges
arr3[1:, 0:1, :2]

# array([[[15., 16.]],
# 
#        [[98., 32.]]])

# Mixing indices and ranges
arr3[1:, 1, 3]

# array([18., 43.])

arr3[1:, 1, :3]
# array([[63. , 92. , 36. ],
#        [17. , 18. , 19.5]])

# Using fewer indices
arr3[1]

# array([[15., 16., 17., 21.],
#        [63., 92., 36., 18.]])

arr3[:2, 1]
# array([[13., 14., 15., 19.],
#        [63., 92., 36., 18.]])

# Using too many indices
arr3[1,3,2,1]

# IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed

The notation and its results can seem confusing at first, so take your time to experiment and become comfortable with it.
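One way to build intuition is to look at the shape of each result: a single index removes a dimension, while a range keeps it. The sketch below uses a made-up array created with np.arange instead of the values above, so that the shapes are the only thing to focus on.

```python
import numpy as np

arr = np.arange(24).reshape(3, 2, 4)   # hypothetical 3 x 2 x 4 array

single = arr[1, 1, 2]        # three indices: one element
mixed = arr[1:, 1, :3]       # the index in the middle removes that dimension
ranges = arr[1:, 0:1, :2]    # ranges keep all three dimensions
partial = arr[1]             # missing indices select everything

print(single.shape, mixed.shape, ranges.shape, partial.shape)
```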

Try out some examples of array indexing and slicing, with different combinations of indices and ranges. Here are some more examples demonstrated visually:

_Source: Scipy Lectures_

How to Create Numpy Arrays – Other Methods

Numpy also provides some handy functions to create arrays of desired shapes with fixed or random values. Check out the official documentation or use the help function to learn more.

# All zeros
np.zeros((3, 2))

# array([[0., 0.],
#        [0., 0.],
#        [0., 0.]])

# All ones
np.ones([2, 2, 3])

# array([[[1., 1., 1.],
#         [1., 1., 1.]],
#
#        [[1., 1., 1.],
#         [1., 1., 1.]]])

# Identity matrix
np.eye(3)

# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]])

# Random vector
np.random.rand(5)

# array([0.92929562, 0.11301864, 0.64213555, 0.8600434 , 0.53738656])

# Random matrix
np.random.randn(2, 3) # rand vs. randn - what's the difference?

# array([[ 0.09906435, -1.64668094,  0.08073528],
#        [ 0.1437016 ,  0.80715712,  1.27285476]])

# Fixed value
np.full([2, 3], 42)

# array([[42, 42, 42],
#        [42, 42, 42]])

# Range with start, end and step
np.arange(10, 90, 3)

# array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58,
#        61, 64, 67, 70, 73, 76, 79, 82, 85, 88])

# Equally spaced numbers in a range
np.linspace(3, 27, 9)

# array([ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.])
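Two pairs of these functions are easy to mix up, so here is a small sketch comparing them. It assumes nothing beyond the functions shown above; the variable names are illustrative.

```python
import numpy as np

# arange takes a step and stops before the end value
stepped = np.arange(0, 10, 2.5)
# linspace takes a count and includes the end value by default
spaced = np.linspace(0, 10, 5)
print(stepped, spaced)

# rand draws from a uniform distribution over [0, 1)
uniform = np.random.rand(10000)
# randn draws from a normal distribution with mean 0 and standard deviation 1
normal = np.random.randn(10000)

print(uniform.min(), uniform.max())   # always within [0, 1)
print(normal.min(), normal.max())     # spread on both sides of 0
```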

Exercises

Try the following exercises to become familiar with Numpy arrays and practice your skills:

Assignment on Numpy array functions: https://jovian.ml/aakashns/numpy-array-operations
(Optional) 100 numpy exercises: https://jovian.ml/aakashns/100-numpy-exercises

Summary and Further Reading

With this, we complete our discussion of numerical computing with Numpy. We've covered the following topics in this part of the tutorial:

How to go from Python lists to Numpy arrays
How to operate on Numpy arrays
The benefits of using Numpy arrays over lists
Multi-dimensional Numpy arrays
How to work with CSV data files
Arithmetic operations and broadcasting
Array indexing and slicing
Other ways of creating Numpy arrays

Check out the following resources for learning more about Numpy:

Review Questions to Check Your Comprehension

Try answering the following questions to test your understanding of the topics covered in this notebook:

What is a vector?
How do you represent vectors using a Python list? Give an example.
What is a dot product of two vectors?
Write a function to compute the dot product of two vectors.
What is Numpy?
How do you install Numpy?
How do you import the numpy module?
What does it mean to import a module with an alias? Give an example.
What is the commonly used alias for numpy?
What is a Numpy array?
How do you create a Numpy array? Give an example.
What is the type of Numpy arrays?
How do you access the elements of a Numpy array?
How do you compute the dot product of two vectors using Numpy?
What happens if you try to compute the dot product of two vectors which have different sizes?
How do you compute the element-wise product of two Numpy arrays?
How do you compute the sum of all the elements in a Numpy array?
What are the benefits of using Numpy arrays over Python lists for operating on numerical data?
Why do Numpy array operations have better performance compared to Python functions and loops?
Illustrate the performance difference between Numpy array operations and Python loops using an example.
What are multi-dimensional Numpy arrays?
Illustrate how you'd create Numpy arrays with 2, 3, and 4 dimensions.
How do you inspect the number of dimensions and the length along each dimension in a Numpy array?
Can the elements of a Numpy array have different data types?
How do you check the data types of the elements of a Numpy array?
What is the data type of a Numpy array?
What is the difference between a matrix and a 2D Numpy array?
How do you perform matrix multiplication using Numpy?
What is the @ operator used for in Numpy?
What is the CSV file format?
How do you read data from a CSV file using Numpy?
How do you concatenate two Numpy arrays?
What is the purpose of the axis argument of np.concatenate?
When are two Numpy arrays compatible for concatenation?
Give an example of two Numpy arrays that can be concatenated.
Give an example of two Numpy arrays that cannot be concatenated.
What is the purpose of the np.reshape function?
What does it mean to “reshape” a Numpy array?
How do you write a numpy array into a CSV file?
Give some examples of Numpy functions for performing mathematical operations.
Give some examples of Numpy functions for performing array manipulation.
Give some examples of Numpy functions for performing linear algebra.
Give some examples of Numpy functions for performing statistical operations.
How do you find the right Numpy function for a specific operation or use case?
Where can you see a list of all the Numpy array functions and operations?
What are the arithmetic operators supported by Numpy arrays? Illustrate with examples.
What is array broadcasting? How is it useful? Illustrate with an example.
Give some examples of arrays that are compatible for broadcasting.
Give some examples of arrays that are not compatible for broadcasting.
What are the comparison operators supported by Numpy arrays? Illustrate with examples.
How do you access a specific subarray or slice from a Numpy array?
Illustrate array indexing and slicing in multi-dimensional Numpy arrays with some examples.
How do you create a Numpy array with a given shape containing all zeros?
How do you create a Numpy array with a given shape containing all ones?
How do you create an identity matrix of a given shape?
How do you create a random vector of a given length?
How do you create a Numpy array with a given shape with a fixed value for each element?
How do you create a Numpy array with a given shape containing randomly initialized elements?
What is the difference between np.random.rand and np.random.randn? Illustrate with examples.
What is the difference between np.arange and np.linspace? Illustrate with examples.

You are ready to move on to the next section of this tutorial.

How to Analyze Tabular Data using Python and Pandas

Follow along and run the code here: https://jovian.ai/aakashns/python-pandas-data-analysis.

This section covers the following topics:

How to read a CSV file into a Pandas data frame
How to retrieve data from Pandas data frames
How to query, sort, and analyze data
How to merge, group, and aggregate data
How to extract useful information from dates
Basic plotting using line and bar charts
How to write data frames to CSV files

How to Read a CSV File Using Pandas

Pandas is a popular Python library used for working with tabular data (similar to the data stored in a spreadsheet). It provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON, SQL, and more.

Let's download a file italy-covid-daywise.csv which contains day-wise Covid-19 data for Italy in the following format:

date,new_cases,new_deaths,new_tests
2020-04-21,2256.0,454.0,28095.0
2020-04-22,2729.0,534.0,44248.0
2020-04-23,3370.0,437.0,37083.0
2020-04-24,2646.0,464.0,95273.0
2020-04-25,3021.0,420.0,38676.0
2020-04-26,2357.0,415.0,24113.0
2020-04-27,2324.0,260.0,26678.0
2020-04-28,1739.0,333.0,37554.0
...

This format of storing data is known as comma-separated values or CSV. Here's a reminder in case you need a definition of what the CSV format is:

CSVs: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)

We'll download this file using the urlretrieve function from the urllib.request module.

from urllib.request import urlretrieve

urlretrieve('https://hub.jovian.ml/wp-content/uploads/2020/09/italy-covid-daywise.csv', 'italy-covid-daywise.csv')

To read the file, we can use the read_csv function from Pandas. First, let's install the Pandas library.

!pip install pandas --upgrade --quiet

We can now import the pandas module. As a convention, it is imported with the alias pd.

import pandas as pd

covid_df = pd.read_csv('italy-covid-daywise.csv')

Data from the file is read and stored in a DataFrame object – one of the core data structures in Pandas for storing and working with tabular data. We typically use the _df suffix in the variable names for dataframes.

type(covid_df)
# pandas.core.frame.DataFrame

covid_df

Here's what we can tell by looking at the dataframe:

The file provides three day-wise counts for COVID-19 in Italy
The metrics reported are new cases, deaths, and tests
Data is provided for 248 days: from Dec 31, 2019, to Sep 3, 2020

Keep in mind that these are officially reported numbers. The actual number of cases and deaths may be higher, as not all cases are diagnosed.

We can view some basic information about the data frame using the .info method.

covid_df.info()

It appears that each column contains values of a specific data type. You can view statistical information for numerical columns (mean, standard deviation, minimum/maximum values, and the number of non-empty values) using the .describe method.

covid_df.describe()

The columns property contains the list of columns within the data frame.

covid_df.columns
# Index(['date', 'new_cases', 'new_deaths', 'new_tests'], dtype='object')

You can also retrieve the number of rows and columns in the data frame using the .shape property.

covid_df.shape
# (248, 4)

Here's a summary of the functions and methods we've looked at so far:

pd.read_csv – Read data from a CSV file into a Pandas DataFrame object
.info() – View basic information about rows, columns, and data types
.describe() – View statistical information about numeric columns
.columns – Get the list of column names
.shape – Get the number of rows and columns as a tuple
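If you want to try these functions without downloading anything, read_csv also accepts a file-like object. The sketch below wraps three rows from the sample shown earlier in io.StringIO; the names csv_text and sample_df are just for illustration.

```python
import io
import pandas as pd

# A few rows copied from the sample above, held in memory instead of on disk
csv_text = """date,new_cases,new_deaths,new_tests
2020-04-21,2256.0,454.0,28095.0
2020-04-22,2729.0,534.0,44248.0
2020-04-23,3370.0,437.0,37083.0
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)
print(list(sample_df.columns))
sample_df.info()
print(sample_df.describe())
```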

How to Retrieve Data from a Data Frame in Pandas

The first thing you might want to do is retrieve data from this data frame, like the counts of a specific day or the list of values in a particular column.

To do this, you should understand the internal representation of data in a data frame. Conceptually, you can think of a dataframe as a dictionary of lists: keys are column names, and values are lists/arrays containing data for the respective columns.

# Pandas format is similar to this
covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests': [53541, 42583, 54395, None, None]
}

Representing data in the above format has a few benefits:

All values in a column typically have the same type of value, so it's more efficient to store them in a single array.
Retrieving the values for a particular row simply requires extracting the elements at a given index from each column array.
The representation is more compact (column names are recorded only once) compared to other formats that use a dictionary for each row of data (see the example below).

# Pandas format is not similar to this
covid_data_list = [
    {'date': '2020-08-30', 'new_cases': 1444, 'new_deaths': 1, 'new_tests': 53541},
    {'date': '2020-08-31', 'new_cases': 1365, 'new_deaths': 4, 'new_tests': 42583},
    {'date': '2020-09-01', 'new_cases': 996, 'new_deaths': 6, 'new_tests': 54395},
    {'date': '2020-09-02', 'new_cases': 975, 'new_deaths': 8 },
    {'date': '2020-09-03', 'new_cases': 1326, 'new_deaths': 6},
]
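Both layouts can be passed to the pd.DataFrame constructor, which is a quick way to check that they describe the same table. The sketch below uses shortened versions of the two structures above (three rows, three columns), so treat it as an illustration and not as the full dataset.

```python
import pandas as pd

# Column-oriented: one list per column
covid_data_dict = {
    'date':      ['2020-08-30', '2020-08-31', '2020-09-01'],
    'new_cases': [1444, 1365, 996],
    'new_tests': [53541, 42583, None],
}

# Row-oriented: one dictionary per row (the last row has no 'new_tests' key)
covid_data_list = [
    {'date': '2020-08-30', 'new_cases': 1444, 'new_tests': 53541},
    {'date': '2020-08-31', 'new_cases': 1365, 'new_tests': 42583},
    {'date': '2020-09-01', 'new_cases': 996},
]

df_from_dict = pd.DataFrame(covid_data_dict)
df_from_list = pd.DataFrame(covid_data_list)

# Both end up with the same columns; the missing value is marked as missing
print(df_from_dict)
print(df_from_list)
```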

With the dictionary of lists analogy in mind, you can now guess how to retrieve data from a data frame. For example, we can get a list of values from a specific column using the [] indexing notation.

covid_data_dict['new_cases']
# [1444, 1365, 996, 975, 1326]

covid_df['new_cases']
# 0         0.0
# 1         0.0
# 2         0.0
# 3         0.0
# 4         0.0
#         ...  
# 243    1444.0
# 244    1365.0
# 245     996.0
# 246     975.0
# 247    1326.0
# Name: new_cases, Length: 248, dtype: float64

Each column is represented using a data structure called Series, which is essentially a numpy array with some extra methods and properties.

type(covid_df['new_cases'])
# pandas.core.series.Series

Like arrays, you can retrieve a specific value from a series using the indexing notation [].

covid_df['new_cases'][246]
# 975.0

covid_df['new_tests'][240]
# 57640.0

Pandas also provides the .at method to retrieve the element at a specific row & column directly.

covid_df.at[246, 'new_cases']
# 975.0

covid_df.at[240, 'new_tests']
# 57640.0

Instead of using the indexing notation [], Pandas also allows accessing columns as properties of the dataframe using the . notation. However, this method only works for columns whose names do not contain spaces or special characters.

covid_df.new_cases
# 0         0.0
# 1         0.0
# 2         0.0
# 3         0.0
# 4         0.0
#         ...  
# 243    1444.0
# 244    1365.0
# 245     996.0
# 246     975.0
# 247    1326.0
# Name: new_cases, Length: 248, dtype: float64

Further, you can also pass a list of columns within the indexing notation [] to access a subset of the data frame with just the given columns.

cases_df = covid_df[['date', 'new_cases']]
cases_df

The new data frame cases_df contains just the selected columns. Depending on the operation and the Pandas version, a data frame derived from another one may share memory with the original (a "view") or hold its own copy of the data. Selecting a list of columns generally produces a copy, so you should not rely on changes made to cases_df showing up in covid_df (older versions of Pandas may emit a SettingWithCopyWarning when you try).

Where it can, Pandas avoids copying data, which keeps operations fast even on data frames with thousands or millions of rows.

When you need a data frame that is guaranteed to be independent of the original, you can use the copy method.

covid_df_copy = covid_df.copy()

The data within covid_df_copy is completely separate from covid_df, and changing values inside one of them will not affect the other.
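Here is a minimal check of that behavior on a tiny, made-up data frame (the names original_df and copied_df are hypothetical):

```python
import pandas as pd

original_df = pd.DataFrame({'new_cases': [1444.0, 1365.0, 996.0]})
copied_df = original_df.copy()

# Overwrite a value in the copy only
copied_df.at[0, 'new_cases'] = 0.0

print(original_df.at[0, 'new_cases'])   # the original keeps its value
print(copied_df.at[0, 'new_cases'])     # the copy holds the new value
```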

To access a specific row of data, Pandas provides the .loc method.

covid_df

covid_df.loc[243]
# date          2020-08-30
# new_cases         1444.0
# new_deaths           1.0
# new_tests        53541.0
# Name: 243, dtype: object

Each retrieved row is also a Series object.

type(covid_df.loc[243])
# pandas.core.series.Series

We can use the .head and .tail methods to view the first or last few rows of data.

covid_df.head(5)

covid_df.tail(4)

Notice above that while the first few values in the new_cases and new_deaths columns are 0, the corresponding values within the new_tests column are NaN. That is because the CSV file does not contain any data for the new_tests column for specific dates (you can verify this by looking into the file). These values may be missing or unknown.

covid_df.at[0, 'new_tests']
# nan

type(covid_df.at[0, 'new_tests'])
# numpy.float64

The distinction between 0 and NaN is subtle but important. In this dataset, NaN indicates that daily test numbers were not reported on those dates, not that zero tests were conducted. Italy started reporting daily tests on Apr 19, 2020. They'd already conducted 935,310 tests before Apr 19.

We can find the first index that doesn't contain a NaN value using a column's first_valid_index method.

covid_df.new_tests.first_valid_index()
# 111

Let's look at a few rows before and after this index to verify that the values change from NaN to actual numbers. We can do this by passing a range to loc.

covid_df.loc[108:113]
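Note that a range passed to loc is based on index labels and includes both endpoints, unlike slicing a Python list. Pandas also provides iloc, which is based on positions and excludes the end. The sketch below uses a small data frame with made-up values and labels to show the difference.

```python
import pandas as pd

nan = float('nan')
tests_df = pd.DataFrame({'new_tests': [nan, nan, nan, 28095.0, 44248.0]},
                        index=[108, 109, 110, 111, 112])   # hypothetical rows

by_label = tests_df.loc[109:111]     # labels 109, 110 and 111 (end included)
by_position = tests_df.iloc[1:3]     # positions 1 and 2 (end excluded)

print(by_label)
print(by_position)
```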

We can use the .sample method to retrieve a random sample of rows from the data frame.

covid_df.sample(10)

Notice that even though we have taken a random sample, each row's original index is preserved. This is a useful property of data frames.

Here's a summary of the functions and methods we looked at in this section:

covid_df['new_cases'] – Retrieving columns as a Series using the column name
new_cases[243] – Retrieving values from a Series using an index
covid_df.at[243, 'new_cases'] – Retrieving a single value from a data frame
covid_df.copy() – Creating a deep copy of a data frame
covid_df.loc[243] – Retrieving a row or range of rows of data from the data frame
head, tail, and sample – Retrieving multiple rows of data from the data frame
covid_df.new_tests.first_valid_index – Finding the first non-empty index in a series

How to Analyze Data from Data Frames in Pandas

Let's try to answer some questions about our data.

Q: What is the total number of reported cases and deaths related to Covid-19 in Italy?

Similar to Numpy arrays, a Pandas series supports the sum method to answer these questions.

total_cases = covid_df.new_cases.sum()
total_deaths = covid_df.new_deaths.sum()

print('The number of reported cases is {} and the number of reported deaths is {}.'.format(int(total_cases), int(total_deaths)))
# The number of reported cases is 271515 and the number of reported deaths is 35497.

Q: What is the overall death rate (ratio of reported deaths to reported cases)?

death_rate = covid_df.new_deaths.sum() / covid_df.new_cases.sum()

print("The overall reported death rate in Italy is {:.2f} %.".format(death_rate*100))
# The overall reported death rate in Italy is 13.07 %.

Q: What is the overall number of tests conducted? A total of 935,310 tests were conducted before daily test numbers were reported.

initial_tests = 935310
total_tests = initial_tests + covid_df.new_tests.sum()

total_tests
# 5214766.0
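The sum above works even though new_tests contains NaN values, because sum skips missing values by default. A short sketch with made-up numbers:

```python
import pandas as pd

nan = float('nan')
new_tests = pd.Series([nan, nan, 28095.0, 44248.0])   # hypothetical values

total = new_tests.sum()                     # NaN entries are skipped
strict_total = new_tests.sum(skipna=False)  # NaN if any entry is missing
reported_days = new_tests.count()           # number of non-missing entries

print(total, strict_total, reported_days)
```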

Q: What fraction of tests returned a positive result?

positive_rate = total_cases / total_tests

print('{:.2f}% of tests in Italy led to a positive diagnosis.'.format(positive_rate*100))
# 5.21% of tests in Italy led to a positive diagnosis.

Try asking and answering some more questions about the data.

How to Query and Sort Rows in Pandas

Let's say we only want to look at the days which had more than 1,000 reported cases. We can use a boolean expression to check which rows satisfy this criterion.

high_new_cases = covid_df.new_cases > 1000

high_new_cases
# 0      False
# 1      False
# 2      False
# 3      False
# 4      False
#        ...  
# 243     True
# 244     True
# 245    False
# 246    False
# 247     True
# Name: new_cases, Length: 248, dtype: bool

The boolean expression returns a series containing True and False boolean values. You can use this series to select a subset of rows from the original dataframe, corresponding to the True values in the series.

covid_df[high_new_cases]

We can also write this more succinctly on a single line by passing the boolean expression directly as the index to the data frame.

high_cases_df = covid_df[covid_df.new_cases > 1000]

high_cases_df

The data frame contains 72 rows, but only the first & last five rows are displayed by default with Jupyter for brevity. We can change some display options to view all the rows.

from IPython.display import display
with pd.option_context('display.max_rows', 100):
    display(covid_df[covid_df.new_cases > 1000])

This is just part of the data frame. Check out the rest here.

We can also formulate more complex queries that involve multiple columns. As an example, let's try to determine the days when the ratio of cases reported to tests conducted is higher than the overall positive_rate.

positive_rate
# 0.05206657403227681

high_ratio_df = covid_df[covid_df.new_cases / covid_df.new_tests > positive_rate]

high_ratio_df
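Two details are worth knowing when writing such queries: several conditions can be combined with & (and), | (or), and ~ (not), and a comparison involving NaN evaluates to False. The sketch below uses the five rows from the dictionary example earlier; the threshold of 0.02 and the name busy_days are arbitrary choices for illustration.

```python
import pandas as pd

nan = float('nan')
df = pd.DataFrame({
    'new_cases':  [1444.0, 1365.0, 996.0, 975.0, 1326.0],
    'new_deaths': [1.0, 4.0, 6.0, 8.0, 6.0],
    'new_tests':  [53541.0, 42583.0, 54395.0, nan, nan],
})

# Each condition needs its own parentheses when combined with & or |
busy_days = df[(df.new_cases > 1000) & (df.new_deaths < 5)]
print(busy_days)

# Rows without a test count produce NaN ratios, which never pass the comparison
ratio = df.new_cases / df.new_tests
high_ratio = df[ratio > 0.02]
print(high_ratio)
```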

The result of performing an operation on two columns is a new series.

covid_df.new_cases / covid_df.new_tests
# 0           NaN
# 1           NaN
# 2           NaN
# 3           NaN
# 4           NaN
#          ...   
# 243    0.026970
# 244    0.032055
# 245    0.018311
# 246         NaN
# 247         NaN
# Length: 248, dtype: float64

We can use this series to add a new column to the data frame.

covid_df['positive_rate'] = covid_df.new_cases / covid_df.new_tests

covid_df

However, keep in mind that sometimes it takes a few days to get the results for a test, so we can't compare the number of new cases with the number of tests conducted on the same day. Any inference based on this positive_rate column is likely to be incorrect.

It's essential to watch out for such subtle relationships that are often not conveyed within the CSV file and require some external context. It's always a good idea to read through the documentation provided with the dataset or ask for more information.

For now, let's remove the positive_rate column using the drop method.

covid_df.drop(columns=['positive_rate'], inplace=True)

Can you figure out the purpose of the inplace argument?

How to Sort Rows Using Column Values in Pandas

You can also sort the rows by a specific column using .sort_values. Let's sort to identify the days with the highest number of cases, then chain it with the head method to list just the first ten results.

covid_df.sort_values('new_cases', ascending=False).head(10)

It looks like the last two weeks of March had the highest number of daily cases. Let's compare this to the days where the highest number of deaths were recorded.

covid_df.sort_values('new_deaths', ascending=False).head(10)

It appears that daily deaths hit a peak just about a week after the peak in daily new cases.

Let's also look at the days with the smallest number of cases. We might expect to see the first few days of the year on this list.

covid_df.sort_values('new_cases').head(10)

It seems like the count of new cases on Jun 20, 2020, was -148, a negative number! Not something we might have expected, but that's the nature of real-world data. It could be a data entry error, or the government may have issued a correction to account for miscounting in the past.

Can you dig through news articles online and figure out why the number was negative?

Let's look at some days before and after Jun 20, 2020.

covid_df.loc[169:175]

For now, let's assume this was indeed a data entry error. We can use one of the following approaches for dealing with the missing or faulty value:

Replace it with 0
Replace it with the average of the entire column
Replace it with the average of the values on the previous and next date
Discard the row entirely

Which approach you pick requires some context about the data and the problem. In this case, since we are dealing with data ordered by date, we can go ahead with the third approach.

You can use the .at method to modify a specific value within the dataframe.

covid_df.at[172, 'new_cases'] = (covid_df.at[171, 'new_cases'] + covid_df.at[173, 'new_cases'])/2
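If a column had several faulty values, fixing each one by hand with .at would get tedious. One possible alternative, sketched below, is to mark negative counts as missing with mask and then fill the gaps with interpolate, which by default draws a straight line between the neighboring values. The neighboring values here are made up for the example.

```python
import pandas as pd

# Hypothetical values around the faulty entry at index 172
new_cases = pd.Series([264.0, -148.0, 331.0], index=[171, 172, 173])

# Treat negative counts as missing, then fill each gap from its neighbors
cleaned = new_cases.mask(new_cases < 0).interpolate()

print(cleaned)
```

For a single missing value between two valid ones, this gives the same result as averaging the previous and next values.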

Here's a summary of the functions and methods we looked at in this section:

covid_df.new_cases.sum() – Computing the sum of values in a column or series
covid_df[covid_df.new_cases > 1000] – Querying a subset of rows satisfying the chosen criteria using boolean expressions
df['pos_rate'] = df.new_cases/df.new_tests – Adding new columns by combining data from existing columns
covid_df.drop(columns=['positive_rate']) – Removing one or more columns from the data frame
sort_values – Sorting the rows of a data frame using column values
covid_df.at[172, 'new_cases'] = ... – Replacing a value within the data frame

How to Work with Dates in Pandas

While we've looked at overall numbers for the cases, tests, positive rate, and more, it would also be useful to study these numbers on a month-by-month basis.

The date column might come in handy here, as Pandas provides many utilities for working with dates.

covid_df.date
# 0      2019-12-31
# 1      2020-01-01
# 2      2020-01-02
# 3      2020-01-03
# 4      2020-01-04
#           ...    
# 243    2020-08-30
# 244    2020-08-31
# 245    2020-09-01
# 246    2020-09-02
# 247    2020-09-03
# Name: date, Length: 248, dtype: object

The data type of date is currently object, so Pandas does not know that this column is a date. We can convert it into a datetime column using the pd.to_datetime function.

covid_df['date'] = pd.to_datetime(covid_df.date)

covid_df['date']
# 0     2019-12-31
# 1     2020-01-01
# 2     2020-01-02
# 3     2020-01-03
# 4     2020-01-04
#          ...    
# 243   2020-08-30
# 244   2020-08-31
# 245   2020-09-01
# 246   2020-09-02
# 247   2020-09-03
# Name: date, Length: 248, dtype: datetime64[ns]

You can see that it now has the datatype datetime64. We can now extract different parts of the date into separate columns, using the DatetimeIndex class (view docs).

covid_df['year'] = pd.DatetimeIndex(covid_df.date).year
covid_df['month'] = pd.DatetimeIndex(covid_df.date).month
covid_df['day'] = pd.DatetimeIndex(covid_df.date).day
covid_df['weekday'] = pd.DatetimeIndex(covid_df.date).weekday

covid_df
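As an aside, a datetime column also exposes the same parts through the .dt accessor, which some people find more readable than wrapping the column in DatetimeIndex. A sketch on three dates from the dataset (the data frame name dates_df is hypothetical):

```python
import pandas as pd

dates_df = pd.DataFrame({'date': ['2020-08-30', '2020-08-31', '2020-09-01']})
dates_df['date'] = pd.to_datetime(dates_df.date)

dates_df['year'] = dates_df.date.dt.year
dates_df['month'] = dates_df.date.dt.month
dates_df['day'] = dates_df.date.dt.day
dates_df['weekday'] = dates_df.date.dt.weekday   # Monday is 0, Sunday is 6

print(dates_df)
```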

Let's check the overall metrics for May. We can query the rows for May, choose a subset of columns, and use the sum method to aggregate each selected column's values.

# Query the rows for May
covid_df_may = covid_df[covid_df.month == 5]

# Extract the subset of columns to be aggregated
covid_df_may_metrics = covid_df_may[['new_cases', 'new_deaths', 'new_tests']]

# Get the column-wise sum
covid_may_totals = covid_df_may_metrics.sum()

covid_may_totals
# new_cases       29073.0
# new_deaths       5658.0
# new_tests     1078720.0
# dtype: float64

type(covid_may_totals)
# pandas.core.series.Series

We can also combine the above operations into a single statement.

covid_df[covid_df.month == 5][['new_cases', 'new_deaths', 'new_tests']].sum()
# new_cases       29073.0
# new_deaths       5658.0
# new_tests     1078720.0
# dtype: float64

As another example, let's check if the number of cases reported on Sundays is higher than the average number of cases reported every day. This time, we might want to aggregate columns using the .mean method.

# Overall average
covid_df.new_cases.mean()

# 1096.6149193548388

# Average for Sundays
covid_df[covid_df.weekday == 6].new_cases.mean()

# 1247.2571428571428

It seems like more cases were reported on Sundays compared to other days.

Try asking and answering some more date-related questions about the data.

How to Group and Aggregate Data in Pandas

As a next step, we might want to summarize the day-wise data and create a new dataframe with month-wise data. We can use the groupby function to create a group for each month, select the columns we wish to aggregate, and aggregate them using the sum method.

covid_month_df = covid_df.groupby('month')[['new_cases', 'new_deaths', 'new_tests']].sum()

covid_month_df

The result is a new data frame that uses unique values from the column passed to groupby as the index. Grouping and aggregation is a powerful method for progressively summarizing data into smaller data frames.

Instead of aggregating by sum, you can also aggregate by other measures like mean. Let's compute the average number of daily new cases, deaths, and tests for each month.

covid_month_mean_df = covid_df.groupby('month')[['new_cases', 'new_deaths', 'new_tests']].mean()

covid_month_mean_df
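When you want different measures side by side, the agg method accepts named aggregations of the form new_column=(source_column, function). The sketch below runs on five made-up rows; the output column names are arbitrary.

```python
import pandas as pd

daily_df = pd.DataFrame({
    'month':      [8, 8, 9, 9, 9],
    'new_cases':  [1444.0, 1365.0, 996.0, 975.0, 1326.0],
    'new_deaths': [1.0, 4.0, 6.0, 8.0, 6.0],
})

# One row per month, one column per requested measure
summary_df = daily_df.groupby('month').agg(
    total_cases=('new_cases', 'sum'),
    mean_cases=('new_cases', 'mean'),
    max_deaths=('new_deaths', 'max'),
)

print(summary_df)
```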

Apart from grouping, another form of aggregation is the running or cumulative sum of cases, tests, or deaths up to each row's date. We can use the cumsum method to compute the cumulative sum of a column as a new series.

Let's add three new columns: total_cases, total_deaths, and total_tests.

covid_df['total_cases'] = covid_df.new_cases.cumsum()
covid_df['total_deaths'] = covid_df.new_deaths.cumsum()
covid_df['total_tests'] = covid_df.new_tests.cumsum() + initial_tests

We've also included the initial test count in total_tests to account for tests conducted before daily reporting was started.

covid_df

Notice how the NaN values in the total_tests column remain unaffected.
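To see what cumsum does with missing values, here is a tiny sketch with made-up numbers. Positions that hold NaN stay NaN, while the running total carries on past them.

```python
import pandas as pd

nan = float('nan')
new_tests = pd.Series([nan, 100.0, 250.0, nan, 50.0])   # hypothetical values

running_total = new_tests.cumsum()

print(running_total)
```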

How to Merge Data from Multiple Sources in Pandas

To determine other metrics like tests per million, cases per million, and so on, we require some more information about the country, namely its population.

Let's download another file locations.csv that contains health-related information for many countries, including Italy.

urlretrieve('https://gist.githubusercontent.com/aakashns/8684589ef4f266116cdce023377fc9c8/raw/99ce3826b2a9d1e6d0bde7e9e559fc8b6e9ac88b/locations.csv', 'locations.csv')

locations_df = pd.read_csv('locations.csv')
locations_df

locations_df[locations_df.location == "Italy"]

We can merge this data into our existing data frame by adding more columns. However, to merge two data frames, we need at least one common column. Let's insert a location column in the covid_df dataframe with all values set to "Italy".

covid_df['location'] = "Italy"

covid_df

We can now add the columns from locations_df into covid_df using the .merge method.

merged_df = covid_df.merge(locations_df, on="location")

merged_df

Check out the full data frame here.

The location data for Italy is appended to each row within covid_df. If the covid_df data frame contained data for multiple locations, then the respective country's location data would be appended for each row.
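By default, merge keeps only the rows whose key appears in both data frames (an inner join). The sketch below uses two small, made-up data frames, including a fictional location and an illustrative population figure, to show how the how argument changes the result.

```python
import pandas as pd

daily_df = pd.DataFrame({
    'location':  ['Italy', 'Italy', 'Atlantis'],   # 'Atlantis' is fictional
    'new_cases': [975.0, 1326.0, 10.0],
})

# Hypothetical lookup table; the population figure is illustrative only
places_df = pd.DataFrame({
    'location':   ['Italy'],
    'population': [60000000.0],
})

inner_df = daily_df.merge(places_df, on='location')              # how='inner' is the default
left_df = daily_df.merge(places_df, on='location', how='left')   # keep every row of daily_df

print(inner_df)
print(left_df)
```

With how='left', rows without a match are kept and their new columns are filled with NaN.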

We can now calculate metrics like cases per million, deaths per million, and tests per million.

merged_df['cases_per_million'] = merged_df.total_cases * 1e6 / merged_df.population
merged_df['deaths_per_million'] = merged_df.total_deaths * 1e6 / merged_df.population
merged_df['tests_per_million'] = merged_df.total_tests * 1e6 / merged_df.population

merged_df

Check out the full data frame here.

How to Write Data Back to Files in Pandas

After completing your analysis and adding new columns, you should write the results back to a file. Otherwise, the data will be lost when the Jupyter notebook shuts down.

Before writing to file, let's first create a data frame containing just the columns we wish to record.

result_df = merged_df[['date',
                       'new_cases', 
                       'total_cases', 
                       'new_deaths', 
                       'total_deaths', 
                       'new_tests', 
                       'total_tests', 
                       'cases_per_million', 
                       'deaths_per_million', 
                       'tests_per_million']]

result_df

To write the data from the data frame into a file, we can use the to_csv function.

result_df.to_csv('results.csv', index=None)

The to_csv function also includes an additional column for storing the index of the dataframe by default. We pass index=None to turn off this behavior. You can now verify that the results.csv is created and contains data from the data frame in CSV format:

date,new_cases,total_cases,new_deaths,total_deaths,new_tests,total_tests,cases_per_million,deaths_per_million,tests_per_million
2020-02-27,78.0,400.0,1.0,12.0,,,6.61574439992122,0.1984723319976366,
2020-02-28,250.0,650.0,5.0,17.0,,,10.750584649871982,0.28116913699665186,
2020-02-29,238.0,888.0,4.0,21.0,,,14.686952567825108,0.34732658099586405,
2020-03-01,240.0,1128.0,8.0,29.0,,,18.656399207777838,0.47964146899428844,
2020-03-02,561.0,1689.0,6.0,35.0,,,27.93498072866735,0.5788776349931067,
2020-03-03,347.0,2036.0,17.0,52.0,,,33.67413899559901,0.8600467719897585,
...
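The effect of the index argument is easier to see on a tiny example. The sketch below writes a two-row data frame (made-up values) to in-memory text buffers instead of real files, once with the default settings and once with the index turned off. It uses index=False, which is the spelling shown in the pandas documentation.

```python
import io
import pandas as pd

# A tiny data frame with a non-default index (made-up values)
df = pd.DataFrame({"new_cases": [78, 250]}, index=[10, 11])

# Default behavior: the index is written as an extra first column
with_index = io.StringIO()
df.to_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)

print(with_index.getvalue())
print(without_index.getvalue())
```

The first output starts with an unnamed column holding 10 and 11, while the second contains only the new_cases column.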

Bonus: Basic Plotting with Pandas

We generally use a library like matplotlib or seaborn to plot graphs within a Jupyter notebook. However, Pandas dataframes and series provide a handy .plot method for quick and easy plotting.

Let's plot a line graph showing how the number of daily cases varies over time.

result_df.new_cases.plot();

While this plot shows the overall trend, it's hard to tell where the peak occurred, as there are no dates on the X-axis. We can use the date column as the index for the data frame to address this issue.

result_df.set_index('date', inplace=True)

result_df

Notice that the index of a data frame doesn't have to be numeric. Using the date as the index also allows us to get the data for a specific date using .loc.

result_df.loc['2020-09-01']
# new_cases             9.960000e+02
# total_cases           2.696595e+05
# new_deaths            6.000000e+00
# total_deaths          3.548300e+04
# new_tests             5.439500e+04
# total_tests           5.214766e+06
# cases_per_million     4.459996e+03
# deaths_per_million    5.868661e+02
# tests_per_million     8.624890e+04
# Name: 2020-09-01 00:00:00, dtype: float64
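Here is a minimal, self-contained sketch of the same idea on a three-row data frame (the values are made up). It also shows that a date index supports label slices, and it calls .copy() on the column selection so that the new data frame is clearly independent of the one it was taken from. The names df, by_date, row, and window are illustrative.

```python
import pandas as pd

# Made-up daily numbers with a proper datetime column
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-09-01", "2020-09-02", "2020-09-03"]),
    "new_cases": [996, 975, 1326],
})

# Select the columns we need, copy them, and use the date as the index
by_date = df[["date", "new_cases"]].copy().set_index("date")

# Look up a single day by its label
row = by_date.loc["2020-09-02"]

# Label slices include both end points
window = by_date.loc["2020-09-02":"2020-09-03"]

print(row)
print(window)
```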

Let's plot the new cases and new deaths per day as line graphs.

result_df.new_cases.plot()
result_df.new_deaths.plot();

We can also compare the total cases vs. total deaths.

result_df.total_cases.plot()
result_df.total_deaths.plot();

Let's see how the death rate and positive testing rates vary over time.

death_rate = result_df.total_deaths / result_df.total_cases

death_rate.plot(title='Death Rate');

positive_rates = result_df.total_cases / result_df.total_tests

positive_rates.plot(title='Positive Rate');

Finally, let's plot some month-wise data using a bar chart to visualize the trend at a higher level.

covid_month_df.new_cases.plot(kind='bar');

covid_month_df.new_tests.plot(kind='bar');

Pandas Exercises

Try the following exercises to become familiar with Pandas dataframes and practice your skills:

Summary and Further Reading

We've covered the following topics in this tutorial:

How to read a CSV file into a Pandas data frame
How to retrieve data from Pandas data frames
How to query, sort, and analyze data
How to merge, group, and aggregate data
How to extract useful information from dates
Basic plotting using line and bar charts
How to write data frames to CSV files

Check out the following resources to learn more about Pandas:

Review Questions to Check Your Comprehension

Try answering the following questions to test your understanding of the topics covered in this notebook:

What is Pandas? What makes it useful?
How do you install the Pandas library?
How do you import the pandas module?
What is the common alias used while importing the pandas module?
How do you read a CSV file using Pandas? Give an example.
What are some other file formats you can read using Pandas? Illustrate with examples.
What are Pandas dataframes?
How are Pandas dataframes different from Numpy arrays?
How do you find the number of rows and columns in a dataframe?
How do you get the list of columns in a dataframe?
What is the purpose of the describe method of a dataframe?
How are the info and describe dataframe methods different?
Is a Pandas dataframe conceptually similar to a list of dictionaries or a dictionary of lists? Explain with an example.
What is a Pandas Series? How is it different from a Numpy array?
How do you access a column from a dataframe?
How do you access a row from a dataframe?
How do you access an element at a specific row and column of a dataframe?
How do you create a subset of a dataframe with a specific set of columns?
How do you create a subset of a dataframe with a specific range of rows?
Does changing a value within a dataframe affect other dataframes created using a subset of the rows or columns? Why is it so?
How do you create a copy of a dataframe?
Why should you avoid creating too many copies of a dataframe?
How do you view the first few rows of a dataframe?
How do you view the last few rows of a dataframe?
How do you view a random selection of rows of a dataframe?
What is the "index" in a dataframe? How is it useful?
What does a NaN value in a Pandas dataframe represent?
How is NaN different from 0?
How do you identify the first non-empty row in a Pandas series or column?
What is the difference between df.loc and df.at?
Where can you find a full list of methods supported by Pandas DataFrame and Series objects?
How do you find the sum of numbers in a column of a dataframe?
How do you find the mean of numbers in a column of a dataframe?
How do you find the number of non-empty numbers in a column of a dataframe?
What is the result obtained by using a Pandas column in a boolean expression? Illustrate with an example.
How do you select a subset of rows where a specific column's value meets a given condition? Illustrate with an example.
What is the result of the expression df[df.new_cases > 100] ?
How do you display all the rows of a pandas dataframe in a Jupyter cell output?
What is the result obtained when you perform an arithmetic operation between two columns of a dataframe? Illustrate with an example.
How do you add a new column to a dataframe by combining values from two existing columns? Illustrate with an example.
How do you remove a column from a dataframe? Illustrate with an example.
What is the purpose of the inplace argument in dataframe methods?
How do you sort the rows of a dataframe based on the values in a particular column?
How do you sort a pandas dataframe using values from multiple columns?
How do you specify whether to sort by ascending or descending order while sorting a Pandas dataframe?
How do you change a specific value within a dataframe?
How do you convert a dataframe column to the datetime data type?
What are the benefits of using the datetime data type instead of object?
How do you extract different parts of a date column like the year, month, day, weekday, and so on into separate columns? Illustrate with an example.
How do you aggregate multiple columns of a dataframe together?
What is the purpose of the groupby method of a dataframe? Illustrate with an example.
What are the different ways in which you can aggregate the groups created by groupby?
What do you mean by a running or cumulative sum?
How do you create a new column containing the running or cumulative sum of another column?
What are other cumulative measures supported by Pandas dataframes?
What does it mean to merge two dataframes? Give an example.
How do you specify the columns that should be used for merging two dataframes?
How do you write data from a Pandas dataframe into a CSV file? Give an example.
What are some other file formats you can write to from a Pandas dataframe? Illustrate with examples.
How do you create a line plot showing the values within a column of a dataframe?
How do you convert a column of a dataframe into its index?
Can the index of a dataframe be non-numeric?
What are the benefits of using a non-numeric index in a dataframe? Illustrate with an example.
How do you create a bar plot showing the values within a column of a dataframe?
What are some other types of plots supported by Pandas dataframes and series?

You are ready to move on to the next section of the tutorial.

Data Visualization using Python, Matplotlib, and Seaborn

Notebook link: https://jovian.ai/aakashns/python-matplotlib-data-visualization

Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers.

Visualizing data is an essential part of data analysis and machine learning. We'll use Python libraries Matplotlib and Seaborn to learn and apply some popular data visualization techniques. We'll use the words chart, plot, and graph interchangeably in this tutorial.

To begin, let's install and import the libraries. We'll use the matplotlib.pyplot module for basic plots like line and bar charts. It is often imported with the alias plt. We'll use the seaborn module for more advanced plots. It is commonly imported with the alias sns.

!pip install matplotlib seaborn --upgrade --quiet

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Notice that we also include the special command %matplotlib inline to ensure that our plots are shown and embedded within the Jupyter notebook itself. Without this command, plots may sometimes show up in pop-up windows.

How to Create a Line Chart in Python

The line chart is one of the simplest and most widely used data visualization techniques. A line chart displays information as a series of data points or markers connected by straight lines.

You can customize the shape, size, color, and other aesthetic elements of the lines and markers for better visual clarity.

Here's a Python list showing the yield of apples (tons per hectare) over six years in an imaginary country called Kanto.

yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

We can visualize how the yield of apples changes over time using a line chart. To draw a line chart, we can use the plt.plot function.

plt.plot(yield_apples)

Calling the plt.plot function draws the line chart as expected. It also returns a list of the plots drawn (Line2D objects), which is shown in the cell output. We can include a semicolon (;) at the end of the last statement in the cell to avoid showing the output and display just the graph.

plt.plot(yield_apples);
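To make the return value concrete, the sketch below stores the result of plt.plot in a variable (called lines here, an illustrative name) and inspects it. It also confirms that when only the y-values are given, the x-values default to the list indices.

```python
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# plt.plot returns a list with one Line2D object per line drawn
lines = plt.plot(yield_apples)
print(type(lines), len(lines))
print(isinstance(lines[0], Line2D))

# With only y-values given, the x-values are the indices of the list
print(list(lines[0].get_xdata()))
```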

Let's enhance this plot step-by-step to make it more informative and beautiful.

How to Customize the X-axis in Matplotlib

The X-axis of the plot currently shows list element indices 0 to 5. The plot would be more informative if we could display the year for which we're plotting the data. We can do this by passing two arguments to plt.plot.

years = [2010, 2011, 2012, 2013, 2014, 2015]
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

plt.plot(years, yield_apples)

Axis Labels in Matplotlib

We can add labels to the axes to show what each axis represents using the plt.xlabel and plt.ylabel methods.

plt.plot(years, yield_apples)
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)');

How to Plot Multiple Lines in Matplotlib

You can invoke the plt.plot function once for each line to plot multiple lines in the same graph. Let's compare the yields of apples vs. oranges in Kanto.

years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896, ]

plt.plot(years, apples)
plt.plot(years, oranges)
plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)');

Chart Title and Legend in Matplotlib

To differentiate between multiple lines, we can include a legend within the graph using the plt.legend function. We can also set a title for the chart using the plt.title function.

plt.plot(years, apples)
plt.plot(years, oranges)

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);
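When you pass a list of names to the legend, the names are matched to the lines in the order the lines were drawn. An alternative, sketched below, is to attach a label to each line as it is plotted and then call legend without arguments. This version uses the axes object returned by plt.subplots rather than the plt functions, which is a stylistic choice for the example and not a requirement.

```python
import matplotlib.pyplot as plt

years = range(2000, 2012)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934, 0.936, 0.937, 0.9375, 0.9372, 0.939]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908, 0.907, 0.904, 0.901, 0.898, 0.9, 0.896]

fig, ax = plt.subplots()

# Each line carries its own label, so the order of plotting no longer matters
ax.plot(years, apples, label="Apples")
ax.plot(years, oranges, label="Oranges")

ax.set_xlabel("Year")
ax.set_ylabel("Yield (tons per hectare)")
ax.set_title("Crop Yields in Kanto")

# With no arguments, legend() collects the labels set above
legend = ax.legend()
```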

How to Use Line Markers in Matplotlib

We can also show markers for the data points on each line using the marker argument of plt.plot.

Matplotlib provides many different markers like a circle, cross, square, diamond, and more. You can find the full list of marker types here: https://matplotlib.org/3.1.1/api/markers_api.html .

plt.plot(years, apples, marker='o')
plt.plot(years, oranges, marker='x')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

How to Style Lines and Markers in Matplotlib

The plt.plot function supports many arguments for styling lines and markers:

color or c – Set the color of the line (supported colors)
linestyle or ls – Choose between a solid or dashed line
linewidth or lw – Set the width of a line
markersize or ms – Set the size of markers
markeredgecolor or mec – Set the edge color for markers
markeredgewidth or mew – Set the edge width for markers
markerfacecolor or mfc – Set the fill color for markers
alpha – Opacity of the plot

Check out the documentation for plt.plot to learn more: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot .

plt.plot(years, apples, marker='s', c='b', ls='-', lw=2, ms=8, mew=2, mec='navy')
plt.plot(years, oranges, marker='o', c='r', ls='--', lw=3, ms=10, alpha=.5)

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

The fmt argument provides a shorthand for specifying the marker shape, line style, and line color. You can provide it as the third argument to plt.plot.

fmt = '[marker][line][color]'

plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);
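A format string is just a shorter way of writing the marker, linestyle, and color keyword arguments. The sketch below draws the same made-up data twice, once each way, and prints the resulting line properties so the two can be compared. The variable names short and explicit are illustrative.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [2, 4, 8]

fig, ax = plt.subplots()

# Shorthand: square markers, dashed line, red
(short,) = ax.plot(x, y, "s--r")

# The same styling spelled out with keyword arguments
(explicit,) = ax.plot(x, y, marker="s", linestyle="--", color="r")

for line in (short, explicit):
    print(line.get_marker(), line.get_linestyle(), line.get_color())
```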

If you don't specify a line style in fmt, only markers are drawn.

plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)");

How to Change the Figure Size in Matplotlib

You can use the plt.figure function to change the size of the figure.

plt.figure(figsize=(12, 6))

plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)");
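The figsize values are the width and height in inches. The sketch below sets the size through plt.subplots, which accepts the same figsize argument, checks it with get_size_inches, and then saves the figure as a PNG. It writes to an in-memory buffer to keep the example self-contained; passing a file name such as 'oranges.png' (a hypothetical name) to savefig would write a file instead.

```python
import io
import matplotlib.pyplot as plt

# figsize is (width, height) in inches
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot([2000, 2001, 2002], [0.962, 0.941, 0.930], "or")
ax.set_title("Yield of Oranges (tons per hectare)")

print(fig.get_size_inches())

# Export the figure as a PNG; dpi controls the pixels per inch
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
print(len(buffer.getvalue()), "bytes written")
```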

How to Improve Default Styles using Seaborn

An easy way to make your charts look beautiful is to use some default styles from the Seaborn library. You can apply them globally using the sns.set_style function. You can see a full list of predefined styles here: https://seaborn.pydata.org/generated/seaborn.set_style.html .

sns.set_style("whitegrid")
plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

sns.set_style("darkgrid")

plt.plot(years, apples, 's-b')
plt.plot(years, oranges, 'o--r')

plt.xlabel('Year')
plt.ylabel('Yield (tons per hectare)')

plt.title("Crop Yields in Kanto")
plt.legend(['Apples', 'Oranges']);

plt.plot(years, oranges, 'or')
plt.title("Yield of Oranges (tons per hectare)");

You can also edit default styles directly by modifying the matplotlib.rcParams dictionary. Learn more: https://matplotlib.org/3.2.1/tutorials/introductory/customizing.html#matplotlib-rcparams .

import matplotlib

matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
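Changes made through rcParams stay in effect for the rest of the session. If you only want different defaults for one chart, one option is the plt.rc_context context manager, sketched below with made-up data. The settings apply inside the with block and are restored afterwards.

```python
import matplotlib
import matplotlib.pyplot as plt

original_size = matplotlib.rcParams["font.size"]

# Settings passed to rc_context apply only inside the with block
with plt.rc_context({"font.size": 18, "figure.figsize": (9, 5)}):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9])
    inside_size = matplotlib.rcParams["font.size"]

# Outside the block, the previous value is back
restored_size = matplotlib.rcParams["font.size"]
print(inside_size, restored_size)
```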

Scatter Plots in Matplotlib

In a scatter plot, the values of 2 variables are plotted as points on a 2-dimensional grid. Additionally, you can also use a third variable to determine the size or color of the points. Let's try out an example.

The Iris flower dataset provides sample measurements of sepals and petals for three species of flowers. The Iris dataset is included with the Seaborn library and you can load it as a Pandas data frame.

# Load data into a Pandas dataframe
flowers_df = sns.load_dataset("iris")

flowers_df

flowers_df.species.unique()
# array(['setosa', 'versicolor', 'virginica'], dtype=object)

Let's try to visualize the relationship between sepal length and sepal width. Our first instinct might be to create a line chart using plt.plot.

plt.plot(flowers_df.sepal_length, flowers_df.sepal_width);

The output is not very informative as there are too many combinations of the two properties within the dataset. There doesn't seem to be a simple relationship between them.

We can use a scatter plot to visualize how sepal length and sepal width vary using the scatterplot function from the seaborn module (imported as sns).

sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width);

How to Add Hues in Seaborn

Notice how the points in the above plot seem to form distinct clusters with some outliers. We can color the dots using the flower species as a hue. We can also make the points larger using the s argument.

sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width, hue=flowers_df.species, s=100);

Adding hues makes the plot more informative. We can immediately tell that Setosa irises have a smaller sepal length but higher sepal widths. In contrast, the opposite is true for Virginica irises.
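To get a feel for what the hue argument does, here is a rough equivalent written with plain Matplotlib. It is only a sketch: it uses a handful of rows laid out like the Iris data rather than the full dataset, and it draws one group at a time so that each species gets its own color and legend entry.

```python
import matplotlib.pyplot as plt
import pandas as pd

# A few illustrative rows in the same layout as the Iris data
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

fig, ax = plt.subplots()

# One scatter call per species; Matplotlib picks a new color for each call
for name, group in df.groupby("species"):
    ax.scatter(group.sepal_length, group.sepal_width, s=100, label=name)

ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
legend = ax.legend(title="species")
```

Seaborn's hue argument saves you from writing this loop and legend setup by hand.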

How to Customize Seaborn Figures

Since Seaborn uses Matplotlib's plotting functions internally, we can use functions like plt.figure and plt.title to modify the figure.

plt.figure(figsize=(12, 6))
plt.title('Sepal Dimensions')

sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species,
                s=100);

How to Plot Data using Pandas Data Frames with Seaborn

Seaborn has built-in support for Pandas data frames. Instead of passing each column as a series, you can provide column names and use the data argument to specify a data frame.

plt.title('Sepal Dimensions')
sns.scatterplot(x='sepal_length', 
                y='sepal_width', 
                hue='species',
                s=100,
                data=flowers_df);

Histograms in Matplotlib

A histogram represents the distribution of a variable by creating bins (intervals) along the range of values and showing vertical bars to indicate the number of observations in each bin.

For example, let's visualize the distribution of values of sepal width in the Iris dataset. We can use the plt.hist function to create a histogram.

# Load data into a Pandas dataframe
flowers_df = sns.load_dataset("iris")

flowers_df.sepal_width
# 0      3.5
# 1      3.0
# 2      3.2
# 3      3.1
# 4      3.6
#       ... 
# 145    3.0
# 146    2.5
# 147    3.0
# 148    3.4
# 149    3.0
# Name: sepal_width, Length: 150, dtype: float64

plt.title("Distribution of Sepal Width")
plt.hist(flowers_df.sepal_width);

We can immediately see that the sepal widths lie in the range 2.0 - 4.5, and around 35 values are in the range 2.9 - 3.1, which seems to be the most populous bin.

How to Control the Size and Number of Bins

We can control the number of bins or the size of each one using the bins argument.

# Specifying the number of bins
plt.hist(flowers_df.sepal_width, bins=5);

import numpy as np

# Specifying the boundaries of each bin
plt.hist(flowers_df.sepal_width, bins=np.arange(2, 5, 0.25));

# Bins of unequal sizes
plt.hist(flowers_df.sepal_width, bins=[1, 3, 4, 4.5]);
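If you want to see the numbers behind the bars, NumPy's np.histogram function accepts the same kind of bins argument and returns the counts and bin edges without drawing anything. The sketch below uses nine made-up values rather than the Iris data.

```python
import numpy as np

# Nine made-up measurements
values = np.array([2.0, 2.5, 2.9, 3.0, 3.0, 3.1, 3.4, 3.8, 4.4])

# Bins of unequal sizes: 1 to 3, 3 to 4, and 4 to 4.5
counts, edges = np.histogram(values, bins=[1, 3, 4, 4.5])
print(counts)
print(edges)

# An integer asks for that many equal-width bins across the data range
counts_equal, edges_equal = np.histogram(values, bins=5)
print(counts_equal)
print(edges_equal)
```

With the unequal bins, three values fall in the first bin, five in the second, and one in the third. A value that sits exactly on an edge, such as 3.0, is counted in the bin that starts at that edge.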

How to Manage Multiple Histograms in Matplotlib

Similar to line charts, we can draw multiple histograms in a single chart. We can reduce each histogram's opacity so that one histogram's bars don't hide the others'.

Let's draw separate histograms for each species of flowers.

setosa_df = flowers_df[flowers_df.species == 'setosa']
versicolor_df = flowers_df[flowers_df.species == 'versicolor']
virginica_df = flowers_df[flowers_df.species == 'virginica']

plt.hist(setosa_df.sepal_width, alpha=0.4, bins=np.arange(2, 5, 0.25));
plt.hist(versicolor_df.sepal_width, alpha=0.4, bins=np.arange(2, 5, 0.25));

We can also stack multiple histograms on top of one another.

plt.title('Distribution of Sepal Width')

plt.hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(2, 5, 0.25), 
         stacked=True);

plt.legend(['Setosa', 'Versicolor', 'Virginica']);

Bar Charts in Matplotlib

Bar charts are quite similar to line charts in that they show a sequence of values. However, a bar is shown for each value, rather than points connected by lines. We can use the plt.bar function to draw a bar chart.

years = range(2000, 2006)
apples = [0.35, 0.6, 0.9, 0.8, 0.65, 0.8]
oranges = [0.4, 0.8, 0.9, 0.7, 0.6, 0.8]

plt.bar(years, oranges);

Like histograms, we can stack bars on top of one another. We use the bottom argument of plt.bar to achieve this.

plt.bar(years, apples)
plt.bar(years, oranges, bottom=apples);
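The same idea extends to more than two series: each new layer starts at the sum of the layers below it. The sketch below adds a third crop, pears, whose values are made up purely for illustration. The lists are converted to NumPy arrays so that they can be added element by element.

```python
import numpy as np
import matplotlib.pyplot as plt

years = list(range(2000, 2006))
apples = np.array([0.35, 0.6, 0.9, 0.8, 0.65, 0.8])
oranges = np.array([0.4, 0.8, 0.9, 0.7, 0.6, 0.8])
pears = np.array([0.2, 0.3, 0.25, 0.4, 0.35, 0.3])  # hypothetical third crop

fig, ax = plt.subplots()
ax.bar(years, apples, label="Apples")
ax.bar(years, oranges, bottom=apples, label="Oranges")

# The third layer sits on top of the first two combined
ax.bar(years, pears, bottom=apples + oranges, label="Pears (made up)")
ax.legend()
```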

Bar Plots with Averages in Seaborn

Let's look at another sample dataset included with Seaborn called tips. The dataset contains information about the sex, time of day, total bill, and tip amount for customers visiting a restaurant over a week.

tips_df = sns.load_dataset("tips");

tips_df

We might want to draw a bar chart to visualize how the average bill amount varies across different days of the week. One way to do this would be to compute the day-wise averages and then use plt.bar (try it as an exercise).

However, since this is a very common use case, the Seaborn library provides a barplot function which can automatically compute averages.

sns.barplot(x='day', y='total_bill', data=tips_df);

The lines cutting each bar are error bars that represent the amount of variation in the values (by default, Seaborn draws a 95% confidence interval around each average). For instance, it seems like the variation in the total bill is relatively high on Fridays and low on Saturdays.
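For comparison, here is one possible way to compute the averages by hand and draw them with plain Matplotlib. It is a sketch on a tiny made-up table (the bills data frame below is not the tips dataset), and unlike the Seaborn version it draws no error bars.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up bills for three days
bills = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 15.0, 25.0, 30.0, 40.0],
})

# Average bill per day; sort=False keeps the days in order of appearance
averages = bills.groupby("day", sort=False)["total_bill"].mean()
print(averages)

fig, ax = plt.subplots()
ax.bar(averages.index, averages.values)
ax.set_xlabel("day")
ax.set_ylabel("total_bill (average)")
```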

We can also specify a hue argument to compare bar plots side-by-side based on a third feature, for example sex.

sns.barplot(x='day', y='total_bill', hue='sex', data=tips_df);

You can make the bars horizontal simply by switching the axes.

sns.barplot(x='total_bill', y='day', hue='sex', data=tips_df);

Heatmaps in Seaborn

A heatmap is used to visualize 2-dimensional data like a matrix or a table using colors. The best way to understand it is by looking at an example.

We'll use another sample dataset from Seaborn, called flights, to visualize monthly passenger footfall at an airport over 12 years.

flights_df = sns.load_dataset("flights").pivot(index="month", columns="year", values="passengers")

flights_df
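To see what pivot does, here is a minimal sketch on a four-row table laid out like the flights data (the names long_df and wide_df are illustrative). The arguments are passed by keyword as index, columns, and values, because recent pandas releases no longer accept them as positional arguments.

```python
import pandas as pd

# Long format: one row per (year, month) combination
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Wide format: one row per month, one column per year
wide_df = long_df.pivot(index="month", columns="year", values="passengers")
print(wide_df)
```

Each distinct month becomes a row label, each distinct year becomes a column, and the passenger counts fill the cells.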

flights_df is a matrix with one row for each month and one column for each year. The values show the number of passengers (in thousands) that visited the airport in a specific month of a year. We can use the sns.heatmap function to visualize the footfall at the airport.

plt.title("No. of Passengers (1000s)")
sns.heatmap(flights_df);

The brighter colors indicate a higher footfall at the airport. By looking at the graph, we can infer two things:

The footfall at the airport in any given year tends to be the highest around July and August.
The footfall at the airport in any given month tends to grow year by year.

We can also display the actual values in each block by specifying annot=True, and use the cmap argument to change the color palette.

plt.title("No. of Passengers (1000s)")
sns.heatmap(flights_df, fmt="d", annot=True, cmap='Blues');

Images in Matplotlib

We can also use Matplotlib to display images. Let's download an image from the internet.

from urllib.request import urlretrieve

urlretrieve('https://i.imgur.com/SkPbq.jpg', 'chart.jpg');

Before we can display an image, we have to read it into memory using the PIL module.

from PIL import Image

img = Image.open('chart.jpg')

An image loaded using PIL can be treated as a 3-dimensional numpy array containing pixel intensities for the red, green, and blue (RGB) channels of the image. We can convert the image into an array using np.array.

img_array = np.array(img)

img_array.shape
# (481, 640, 3)

We can display the PIL image using plt.imshow.

plt.imshow(img);

We can turn off the axes & grid lines and show a title using the relevant functions.

plt.grid(False)
plt.title('A data science meme')
plt.axis('off')
plt.imshow(img);

To display a part of the image, we can simply select a slice from the numpy array.

plt.grid(False)
plt.axis('off')
plt.imshow(img_array[125:325,105:305]);
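The slicing is easier to follow on an image whose contents we know in advance. The sketch below builds a small synthetic RGB array (it is not the downloaded image) with a red left half and a blue right half, then crops it. The first slice selects rows and the second selects columns.

```python
import numpy as np
import matplotlib.pyplot as plt

# A synthetic 60 x 80 RGB image: (height, width, channels)
height, width = 60, 80
image = np.zeros((height, width, 3), dtype=np.uint8)
image[:, : width // 2, 0] = 255   # left half: red channel at full intensity
image[:, width // 2 :, 2] = 255   # right half: blue channel at full intensity

# Crop rows 10 to 39 and columns 20 to 59
crop = image[10:40, 20:60]
print(image.shape, crop.shape)

fig, ax = plt.subplots()
ax.imshow(crop)
ax.axis("off")
```

The crop spans the boundary between the two halves, so it shows a red band next to a blue band.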

How to Plot Multiple Charts in a Grid in Matplotlib and Seaborn

Matplotlib and Seaborn also support plotting multiple charts in a grid, using plt.subplots, which returns a set of axes for plotting.

Here's a single grid showing the different types of charts we've covered in this tutorial.

fig, axes = plt.subplots(2, 3, figsize=(16, 8))

# Use the axes for plotting
axes[0,0].plot(years, apples, 's-b')
axes[0,0].plot(years, oranges, 'o--r')
axes[0,0].set_xlabel('Year')
axes[0,0].set_ylabel('Yield (tons per hectare)')
axes[0,0].legend(['Apples', 'Oranges']);
axes[0,0].set_title('Crop Yields in Kanto')


# Pass the axes into seaborn
axes[0,1].set_title('Sepal Length vs. Sepal Width')
sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species, 
                s=100, 
                ax=axes[0,1]);

# Use the axes for plotting
axes[0,2].set_title('Distribution of Sepal Width')
axes[0,2].hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(2, 5, 0.25), 
         stacked=True);

axes[0,2].legend(['Setosa', 'Versicolor', 'Virginica']);

# Pass the axes into seaborn
axes[1,0].set_title('Restaurant bills')
sns.barplot(x='day', y='total_bill', hue='sex', data=tips_df, ax=axes[1,0]);

# Pass the axes into seaborn
axes[1,1].set_title('Flight traffic')
sns.heatmap(flights_df, cmap='Blues', ax=axes[1,1]);

# Plot an image using the axes
axes[1,2].set_title('Data Science Meme')
axes[1,2].imshow(img)
axes[1,2].grid(False)
axes[1,2].set_xticks([])
axes[1,2].set_yticks([])

plt.tight_layout(pad=2);

See this page for a full list of supported functions: https://matplotlib.org/3.3.1/api/axes_api.html#the-axes-class .
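When every panel gets the same kind of chart, you don't need to index the axes one by one. The sketch below, which uses made-up data, loops over axes.flat to visit each panel of a 2 x 2 grid in row order.

```python
import matplotlib.pyplot as plt

# A 2 x 2 grid; axes is a NumPy array of Axes objects
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
print(axes.shape)

# axes.flat visits the panels row by row
for index, ax in enumerate(axes.flat):
    ax.plot([0, 1, 2], [0, index, 2 * index])
    ax.set_title(f"Panel {index}")

fig.tight_layout(pad=2)
```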

Pair Plots with Seaborn

Seaborn also provides a helper function sns.pairplot to automatically plot several different charts for pairs of features within a dataframe.

sns.pairplot(flowers_df, hue='species');

See the full output here.

sns.pairplot(tips_df, hue='sex');

Summary and Further Reading

We have covered the following topics in this tutorial:

How to create and customize line charts using Matplotlib
How to visualize relationships between two or more variables using scatter plots
How to study distributions of variables using histograms and bar charts
How to visualize two-dimensional data using heatmaps
How to display images using Matplotlib's plt.imshow
How to plot multiple Matplotlib and Seaborn charts in a grid

In this tutorial we've covered some of the fundamental concepts and popular techniques for data visualization using Matplotlib and Seaborn. Data visualization is a vast field and we've barely scratched the surface here. Check out these references to learn and discover more:

Data Visualization cheat sheet: https://jovian.ml/aakashns/dataviz-cheatsheet
Seaborn gallery: https://seaborn.pydata.org/examples/index.html
Matplotlib gallery: https://matplotlib.org/3.1.1/gallery/index.html
Matplotlib tutorial: https://github.com/rougier/matplotlib-tutorial

Review Questions to Check Your Comprehension

Try answering the following questions to test your understanding of the topics covered in this notebook:

What is data visualization?
What is Matplotlib?
What is Seaborn?
How do you install Matplotlib and Seaborn?
How do you import Matplotlib and Seaborn? What are the common aliases used while importing these modules?
What is the purpose of the magic command %matplotlib inline?
What is a line chart?
How do you plot a line chart in Python? Illustrate with an example.
How do you specify values for the X-axis of a line chart?
How do you specify labels for the axes of a chart?
How do you plot multiple line charts on the same axes?
How do you show a legend for a line chart with multiple lines?
How do you set a title for a chart?
How do you show markers on a line chart?
What are the different options for styling lines and markers in line charts? Illustrate with examples.
What is the purpose of the fmt argument to plt.plot?
Where can you see a list of all the arguments accepted by plt.plot?
How do you change the size of the figure using Matplotlib?
How do you apply the default styles from Seaborn globally for all charts?
What are the predefined styles available in Seaborn? Illustrate with examples.
What is a scatter plot?
How is a scatter plot different from a line chart?
How do you draw a scatter plot using Seaborn? Illustrate with an example.
How do you decide when to use a scatter plot vs a line chart?
How do you specify the colors for dots on a scatter plot using a categorical variable?
How do you customize the title, figure size, legend, and so on for Seaborn plots?
How do you use a Pandas dataframe with sns.scatterplot?
What is a histogram?
When should you use a histogram vs a line chart?
How do you draw a histogram using Matplotlib? Illustrate with an example.
What are "bins" in a histogram?
How do you change the sizes of bins in a histogram?
How do you change the number of bins in a histogram?
How do you show multiple histograms on the same axes?
How do you stack multiple histograms on top of one another?
What is a bar chart?
How do you draw a bar chart using Matplotlib? Illustrate with an example.
What is the difference between a bar chart and a histogram?
What is the difference between a bar chart and a line chart?
How do you stack bars on top of one another?
What is the difference between plt.bar and sns.barplot?
What do the lines cutting the bars in a Seaborn bar plot represent?
How do you show bar plots side-by-side?
How do you draw a horizontal bar plot?
What is a heat map?
What type of data is best visualized with a heat map?
What does the pivot method of a Pandas dataframe do?
How do you draw a heat map using Seaborn? Illustrate with an example.
How do you change the color scheme of a heat map?
How do you show the original values from the dataset on a heat map?
How do you download images from a URL in Python?
How do you open an image for processing in Python?
What is the purpose of the PIL module in Python?
How do you convert an image loaded using PIL into a Numpy array?
How many dimensions does a Numpy array for an image have? What does each dimension represent?
What are "color channels" in an image?
What is RGB?
How do you display an image using Matplotlib?
How do you turn off the axes and gridlines in a chart?
How do you display a portion of an image using Matplotlib?
How do you plot multiple charts in a grid using Matplotlib and Seaborn? Illustrate with examples.
What is the purpose of the plt.subplots function?
What are pair plots in Seaborn? Illustrate with an example.
How do you export a plot into a PNG image file using Matplotlib?
Where can you learn about the different types of charts you can create using Matplotlib and Seaborn?

Congratulations on making it to the end of this tutorial! You can now apply these skills to analyze real-world datasets from sources like Kaggle.

If you're pursuing a career in data science and machine learning, consider joining the Zero to Data Science Bootcamp by Jovian. It's a 20-week part-time program where you'll complete 7 courses, 12 coding assignments, and 4 real-world projects. You will also receive 6 months of career support to help you find your first data science job.

https://www.jovian.ai/zero-to-data-science-bootcamp

Python Data Analysis: How to Visualize a Kaggle Dataset with Pandas, Matplotlib, and Seaborn

freeCodeCamp — Thu, 22 Oct 2020 17:49:27 +0000

By Srijan

The Indian Premier League or IPL is a T20 cricket tournament organized annually by the Board of Control for Cricket In India (BCCI). Eight city-based franchises compete with each other over 6 weeks to find the winner.

In this article, I'm going to analyze data from the IPL's past seasons to see which teams have won the most games, how teams behave when winning a toss, who has the greatest legacy, and so on.

I have done this analysis from a historical point of view, giving an overview of what has happened in the IPL over the years. I have used tools such as Pandas, Matplotlib and Seaborn along with Python to give a visual as well as numeric representation of the data in front of us.

Pandas stands for Python Data Analysis library. It is typically used for working with tabular data (similar to the data stored in a spreadsheet). Pandas provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON, SQL and perform operations on them.

Matplotlib and Seaborn are two Python libraries that are used to produce plots. Matplotlib is generally used for plotting lines, pie charts, and bar graphs.

Seaborn provides some more advanced visualization features with less syntax and more customizations. I switch back-and-forth between them during the analysis.

Getting the Dataset
Data Preparation and Cleaning
Exploratory Analysis and Visualization
Asking and Answering Questions
Inferences From the Analysis
Conclusion

1. Getting the Dataset

I downloaded the dataset from Kaggle. You will see there are two CSV (Comma Separated Value) files, matches.csv and deliveries.csv. I chose to do my analysis on matches.csv.

To find more interesting datasets, you can look at this page.

2. Data Preparation and Cleaning

A dataset contains many columns and rows. It is always possible that certain rows have missing values or NaN for one or more columns.

It is also possible that there might be certain columns or rows that you want to discard from your analysis. You can also combine two or more datasets for an in-depth analysis.

Cleaning the data involves making corrections to that data, leaving out unnecessary columns or rows, merging datasets, and so on.

Before taking these steps, I needed to install and import the tools (libraries) to be used during the analysis. I imported the libraries with different aliases such as pd, plt and sns. I then set some basic styles for the plots.

Notice the special command %matplotlib inline. It makes sure that plots are shown and embedded within the Jupyter notebook itself. Without this command, sometimes plots may show up in pop-up windows.

Using the read_csv() method from the Pandas library, I loaded the matches.csv file.

Data from the file is read and stored in a DataFrame object - one of the core data structures in Pandas for storing and working with tabular data. I used the _df suffix in the variable names for data frames.

I used the name matches_raw_df for the data frame. This indicates that this is unprocessed data that I will clean, filter, and modify to prepare a data frame that's ready for analysis.

Using the shape property of a DataFrame object, I found that the dataset contains 756 rows and 18 columns. To find the names of those columns I used the columns property, which returns the column labels of the data frame.

To get a summary of what the data frame contains, I used info(). This gives information about columns, number of non-null values in each column, their data type, and memory usage.
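The loading and inspection steps above can be sketched as follows. This is a minimal illustration, not the original notebook: the CSV text is a hypothetical three-row stand-in for matches.csv with only five of its columns, read from memory so the snippet runs without the Kaggle file.

```python
import io

import pandas as pd

# Hypothetical stand-in for matches.csv: three made-up rows and only five
# of the columns discussed in this article.
csv_text = """id,season,winner,result,umpire3
1,2008,Kolkata Knight Riders,normal,
2,2008,Chennai Super Kings,normal,
3,2009,,no result,
"""

# With the real file this would be pd.read_csv("matches.csv").
matches_raw_df = pd.read_csv(io.StringIO(csv_text))

print(matches_raw_df.shape)           # (number of rows, number of columns)
print(list(matches_raw_df.columns))   # the column labels
matches_raw_df.info()                 # non-null counts, dtypes, memory usage
```

In the info() output, a column whose non-null count is lower than the number of rows (winner and umpire3 in this sample) is one that contains missing values.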

Almost all columns except umpire3 have no or very few null values. The presence of null values could result from a lack of information or an incorrect data entry.

An interesting thing to observe is that, although there are no null values for the result column, there are some for winner and player_of_match columns. Let's find out why.

I first accessed the result column using dot notation (matches_raw_df.result). Then I used the value_counts() method on the result column.

value_counts() returns a series which contains counts of unique values. Here, it tells us about the different values present in result and the total number for each of them.

So, out of 756 matches (rows), 4 matches ended as no result.
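As a small illustration of value_counts(), here is a sketch on a hypothetical five-row result column (the values are made up and do not come from the dataset):

```python
import pandas as pd

# Hypothetical sample of the result column, not the real data.
matches_raw_df = pd.DataFrame(
    {"result": ["normal", "normal", "tie", "no result", "normal"]}
)

# Count how many times each distinct value occurs, most frequent first.
result_counts = matches_raw_df.result.value_counts()
print(result_counts)
```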

Cricket is an outdoor sport and unlike, say, football, play isn't possible when it's raining. It is very common to have matches abandoned due to incessant raining. Therefore, we have no winners or player of the match for these 4 matches.

For this analysis, the umpire3 column isn't needed. So I removed the column using the drop() method by passing the column name and axis value. If you want to remove multiple columns, the column names are to be given in a list.

I assigned this cleaned data frame to matches_df. I used this data frame for further analysis.
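A minimal sketch of the drop() call, assuming a tiny hypothetical data frame in place of the real one:

```python
import pandas as pd

# Hypothetical two-row frame with an empty umpire3 column.
matches_raw_df = pd.DataFrame(
    {"id": [1, 2], "season": [2008, 2008], "umpire3": [None, None]}
)

# axis=1 tells drop() to look for the label among the columns, not the rows.
# drop() returns a new data frame and leaves matches_raw_df unchanged.
matches_df = matches_raw_df.drop("umpire3", axis=1)

# To drop several columns at once, pass a list of names instead,
# for example: matches_raw_df.drop(["umpire3", "season"], axis=1)
print(list(matches_df.columns))
```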

3. Exploratory Analysis and Visualization

Exploratory analysis involves performing operations on the dataset to understand the data and find patterns. It helps us make sense of the data we have.

Visualization is the graphic representation of data. It involves producing charts that communicate those patterns among the represented data to viewers.

Now, let's take a look at the data I analyzed and what I learned in the process.

Number of matches and teams

I tried to find the number of matches played in each season in the IPL from its inception to 2019.

Since I needed matches played each season, it made sense to group our data according to different seasons. Pandas has a groupby() method to achieve this, wherein I passed season as an argument.

Since an id is unique for each match (row), counting the number of ids for each season leads to what we want. I used the count() method on the id column to find the number of matches held each season. This series is assigned to the variable matches_per_season.

I then used the barplot() method from the Seaborn library to plot the series. The index of the series (the seasons) was passed as the x-values, and the corresponding counts were passed as the y-values.

I used various matplotlib.pyplot methods such as figure(), xticks() and title() to set the size of the plot, title of the plot, and so on.

figure takes a parameter, figsize, which I set to (12,6). Notice that the size was given as a tuple. To xticks(), I gave the rotation parameter a value of 75 to make it easier to read.
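The grouping and plotting steps might look roughly like this. The data is a hypothetical five-row sample, and the bars are drawn with plt.bar rather than sns.barplot so that the sketch depends only on Pandas and Matplotlib; the figure(), xticks() and title() calls are the ones described above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample: three matches in 2008 and two in 2009.
matches_df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5], "season": [2008, 2008, 2008, 2009, 2009]}
)

# Group the rows by season and count the match ids in each group.
matches_per_season = matches_df.groupby("season").id.count()

plt.figure(figsize=(12, 6))          # figure size is given as a tuple
plt.bar(matches_per_season.index.astype(str), matches_per_season.values)
plt.xticks(rotation=75)              # rotate the season labels
plt.title("Number of matches played in each season")
```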

Each season, almost 60 matches were played. However, we see a spike in the number of matches from 2011 to 2013. This is because two new franchises, the Pune Warriors and Kochi Tuskers Kerala, were introduced, increasing the number of teams to 10.

However, Kochi was removed in the very next season, while the Pune Warriors were removed in 2013, bringing the number down to 8 from 2014 onwards.

Before the start of the 2016 season, two teams, the Chennai Super Kings and Rajasthan Royals were banned for two seasons. To make up for their absence, two new teams (the Rising Pune Supergiants and Gujarat Lions) entered the competition.

When the Chennai Super Kings and Rajasthan Royals returned, these two teams were removed from the competition.

Analyzing the Toss results

One of the most significant events in any cricket match is the toss, which happens at the very start of a match. The toss winner can choose whether they want to bat first or second (fielding first).

Let's see what the trend has been amongst the teams across different seasons.

Again I grouped the rows by season and then counted the different values of the toss_decision column by using value_counts().

Since a percentage gives a clearer picture, I divided the above result by matches_per_season and multiplied it by 100. This series was assigned to toss_decision_percentage.

Here, toss_decision_percentage is a series with multi-index. If we print the index of the series using the index property, we see it is of the form (2008, 'bat'), (2008, 'field') and so on.

The series used both season and toss_decision as an index. But I only wanted the seasons to be an index. I used unstack() to achieve this.

By using the unstack() method on the series, it converted the values of toss_decision (that is, bat and field) into separate columns.

Next I used the plot() method of the Pandas data frame, which draws with Matplotlib, to represent these values as bar charts. plot() has a parameter kind which decides what type of plot to draw. The value was set to bar.
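Here is a hedged sketch of the whole toss pipeline on a hypothetical six-row sample. The division uses Series.div with level="season" to make explicit that the per-season totals are matched against the season level of the multi-index; the variable name toss_table is an assumption introduced for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical sample: four matches in 2008 and two in 2009.
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "season": [2008, 2008, 2008, 2008, 2009, 2009],
    "toss_decision": ["field", "field", "field", "bat", "bat", "field"],
})

matches_per_season = matches_df.groupby("season").id.count()

# Series indexed by (season, toss_decision) pairs.
toss_decision = matches_df.groupby("season").toss_decision.value_counts()

# Divide each count by that season's total and convert to a percentage.
toss_decision_percentage = toss_decision.div(matches_per_season, level="season") * 100

# Move the toss_decision level into columns: one row per season.
toss_table = toss_decision_percentage.unstack()
print(toss_table)

# kind="bar" draws one group of bars per season.
toss_table.plot(kind="bar", figsize=(12, 6), title="Toss decisions (%)")
```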

For 2008-2013, teams seemed to favour both batting first and second. For this period, teams chose to bat first more in 2009, 2010 and 2013. On the other hand, they chose fielding first more in 2008 and 2011. Things were even-steven in 2012.

This could be because IPL and T20 cricket in general was in its budding stages. So, teams were probably learning and trying to figure out which option would be more beneficial.

However, since 2014, teams have overwhelmingly chosen to bat second. Especially since 2016, teams have chosen to field first more than 80% of the time.

Batting first requires that the team gauge the conditions and the pitch and then set a target accordingly. Chasing is less complicated, as there is a fixed target to achieve.

Conditions have also become more batsman-friendly and the skills of the batsmen have increased tremendously (read more here).

Number of Wins

We saw how teams in the recent past have chosen to bat second more than 4 out of 5 times. Did this decision transform the results? Let's see.

For wins_batting_first, the value of win_by_wickets has to be 0. Also, the result column should have a value of normal, since tied matches also have a win margin of 0. This condition was stored as filter1.

Similarly, for wins_fielding_first, the value of win_by_runs has to be 0 and the result column should have a value of normal. This condition was stored as filter1.

In both series, I used the count() method on the winner column to count the matches won under the filtered conditions. I divided the results by matches_per_season, calculated earlier, to express them as a share of the matches played each season.

To plot these two series together, I combined them using Pandas' concat() method. I passed the two series names as a list and set the value of axis as 1. This gives us a new data frame which was stored as combined_wins_df.

Next I plotted combined_wins_df as a bar chart using plot().
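The filtering, counting and concatenation steps can be sketched as below. The data is a hypothetical five-row sample with placeholder team names, and the filter names (is_normal, batting_first_filter, fielding_first_filter) and the column labels passed via keys are assumptions made for readability, not the names used in the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical sample; "A" and "B" are placeholder team names.
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "season": [2008, 2008, 2008, 2009, 2009],
    "winner": ["A", "B", "A", "B", "B"],
    "result": ["normal", "normal", "tie", "normal", "normal"],
    "win_by_runs": [10, 0, 0, 0, 25],
    "win_by_wickets": [0, 5, 0, 7, 0],
})
matches_per_season = matches_df.groupby("season").id.count()

# A team that bats first wins by runs, so its wicket margin is 0 (and vice versa).
# Requiring result == "normal" excludes ties, where both margins are 0.
is_normal = matches_df.result == "normal"
batting_first_filter = (matches_df.win_by_wickets == 0) & is_normal
fielding_first_filter = (matches_df.win_by_runs == 0) & is_normal

wins_batting_first = (
    matches_df[batting_first_filter].groupby("season").winner.count() / matches_per_season
)
wins_fielding_first = (
    matches_df[fielding_first_filter].groupby("season").winner.count() / matches_per_season
)

# axis=1 places the two series side by side as columns of a new data frame.
combined_wins_df = pd.concat(
    [wins_batting_first, wins_fielding_first],
    axis=1,
    keys=["batting_first", "fielding_first"],
)
print(combined_wins_df)
combined_wins_df.plot(kind="bar", figsize=(12, 6))
```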

We saw earlier that for 2008-2013, teams faced a conundrum whether to bat first or field first. This is partially visible in the results as well.

The wins from batting first are very close to those from fielding first. However, there is just one season where teams batting first won more, with things being equal in 2013.

Again, since 2014, things have been in favour of teams chasing except 2015. Leaving out 2015, things have been overwhelmingly in favour of teams fielding first.

So, teams choosing to field more have been justified in their decisions.

Teams with "History"

In leagues across different sports, there is always talk about teams with "history" – teams that have played the most in the league and continue to do so. Let's find those teams in the IPL.

Now, between two teams A and B, it can be "A vs B" or "B vs A", depending on how the data entry has been done. So I decided to count the total number of different values for both the team1 and team2 columns using value_counts(). Then I added them together.

I sorted the results in descending order using the sort_values() method from Pandas. The ascending parameter was set to False.

Here, I used sns.barplot() to plot the graph.
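A minimal sketch of the counting step, on a hypothetical four-row sample with placeholder team names. It uses Series.add with fill_value=0 instead of a plain + so that a team appearing in only one of the two columns would still get a count rather than NaN; that choice is an assumption of this example.

```python
import pandas as pd

# Hypothetical fixtures; "A", "B" and "C" are placeholder team names.
matches_df = pd.DataFrame({
    "team1": ["A", "B", "A", "C"],
    "team2": ["B", "A", "C", "A"],
})

# Appearances as team1 plus appearances as team2, sorted from most to fewest.
total_matches_played = (
    matches_df.team1.value_counts()
    .add(matches_df.team2.value_counts(), fill_value=0)
    .astype(int)
    .sort_values(ascending=False)
)
print(total_matches_played)
```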

The Mumbai Indians have played the most matches. They are followed by the Royal Challengers Bangalore, Kolkata Knight Riders, Kings XI Punjab and Chennai Super Kings.

The Chennai Super Kings and Rajasthan Royals could have been higher had they not been banned.

You will see there are two teams from Delhi, the Delhi Daredevils and Delhi Capitals. This resulted from a change in ownership and then team name in 2018.

It's a similar story for the Deccan Chargers and Sunrisers Hyderabad, as the Deccan Chargers were removed from the IPL in 2013 and the Sunrisers came in their place.

Also, there are two teams with almost the same name: the Rising Pune Supergiants and Rising Pune Supergiant. They are the same team, and there was no change in ownership – it has more to do with superstitions.

In the 2016 season, the Rising Pune Supergiants finished 7th. The owners changed the captain for 2017 and also dropped the 's' from Supergiants. Well, it paid off as they finished as runner-up that season!

Teams with "Legacy"

Now, teams may have a lot of history but it's their "legacy" – how often they win – that makes them popular and attracts new and neutral fans.

To find such teams, I simply used value_counts() on the winner column. This gives us the number of matches that each team has won.

So Mumbai has the most wins. But a better metric to judge would be the win percentage. To find the win percentage, I divided most_wins by total_matches_played to find the win_percentage for each team.
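The win percentage calculation could be sketched like this, using a hypothetical five-match sample with placeholder team names. Dividing two series aligns them on their index (the team names), so each team's wins are divided by that team's own total.

```python
import pandas as pd

# Hypothetical fixtures and results; "A", "B" and "C" are placeholder teams.
matches_df = pd.DataFrame({
    "team1": ["A", "B", "A", "C", "B"],
    "team2": ["B", "A", "C", "A", "C"],
    "winner": ["A", "B", "A", "A", "C"],
})

total_matches_played = matches_df.team1.value_counts().add(
    matches_df.team2.value_counts(), fill_value=0
)
most_wins = matches_df.winner.value_counts()

# Wins divided by matches played, per team, as a percentage.
win_percentage = (most_wins / total_matches_played * 100).sort_values(ascending=False)
print(win_percentage)
```

A team with no wins at all would be missing from most_wins and would show up as NaN here, which is worth checking for on real data.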

The Rising Pune Supergiant and Delhi Capitals have the highest win percentage. This is largely because they have played fewer matches compared to most teams. That is especially true of the Rising Pune Supergiant, which technically became a new team after dropping the 's'.

The Chennai Super Kings, despite playing two fewer seasons than the Mumbai Indians, had only 9 fewer victories. They, along with the Mumbai Indians, are the only two teams in the top 5 that were also part of the IPL in 2008.

Chennai and Mumbai are the teams with the most legacy.

4. Asking and Answering Questions from the Data

We've already gained some insights about the IPL by exploring various columns of our dataset.

Let's ask some specific questions, and try to answer them using data frame operations and interesting visualizations.

Q. Who has won the IPL tournament?

Group the rows according to seasons using groupby().
Find the last match of each season, that is, the final using tail(). It returns the last n rows from a Dataframe object or series based on position.
Sort the values per season using sort_values().
Count the different winners and the times they won using value_counts() on winner.

Then I plotted the series ipl_winners using sns.barplot().
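The four steps above can be sketched as follows on a hypothetical eight-row sample with placeholder team names. It relies on the same assumption as the text: within each season the rows are in chronological order, so the last row of a season is its final. The name finals_df is introduced for this example only.

```python
import pandas as pd

# Hypothetical sample; the last row of each season is treated as the final.
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "season": [2008, 2008, 2008, 2009, 2009, 2009, 2010, 2010],
    "winner": ["A", "B", "B", "A", "B", "A", "A", "B"],
})

# tail(1) keeps the last row of every season group.
finals_df = matches_df.groupby("season").tail(1).sort_values("season")

# Count how many finals each team has won.
ipl_winners = finals_df.winner.value_counts()
print(ipl_winners)
```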

Mumbai and Chennai, our legacy teams, have won the IPL at least 3 times. The Sunrisers Hyderabad are the only team that joined the league later and won the trophy.

Q. Which are the most and least consistent teams across all seasons?

Created a data frame between different values of winner and season using pd.crosstab().
Plotted the data frame as a heatmap.

pd.crosstab() gives a simple cross-tabulation of the winner and season columns. For each different value of winner, pd.crosstab() finds its frequency for each different value in season.

Then I plotted matches_won_each_season using sns.heatmap(). I passed the data frame matches_won_each_season, with annot as True to have the values shown as well. Here, the darker color indicates more matches won.
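As a rough illustration, the cross-tabulation and an annotated heat map might look like this. The data is a hypothetical six-row sample, and the heat map is drawn with Matplotlib's imshow and text calls instead of sns.heatmap(..., annot=True), so the sketch needs only Pandas and Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; "A" and "B" are placeholder team names.
matches_df = pd.DataFrame({
    "season": [2008, 2008, 2008, 2009, 2009, 2009],
    "winner": ["A", "A", "B", "A", "B", "B"],
})

# Rows are winners, columns are seasons, cells are the number of wins.
matches_won_each_season = pd.crosstab(matches_df.winner, matches_df.season)
print(matches_won_each_season)

fig, ax = plt.subplots(figsize=(8, 4))
image = ax.imshow(matches_won_each_season.values, cmap="Blues")  # darker = more wins
ax.set_xticks(range(matches_won_each_season.shape[1]))
ax.set_xticklabels(matches_won_each_season.columns)
ax.set_yticks(range(matches_won_each_season.shape[0]))
ax.set_yticklabels(matches_won_each_season.index)

# Write each count inside its cell, which is what annot=True does in Seaborn.
for row in range(matches_won_each_season.shape[0]):
    for col in range(matches_won_each_season.shape[1]):
        ax.text(col, row, str(matches_won_each_season.iloc[row, col]),
                ha="center", va="center")
fig.colorbar(image, ax=ax)
```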

The Chennai Super Kings have been the most consistent team, winning at least 8 matches in each of the seasons they have played. This is backed up by the fact that they are the only team to reach the playoffs stage every season.

At the other end of the spectrum are 3 teams, the Delhi Daredevils, Kings XI Punjab and Rajasthan Royals. All three of them have had two seasons where they performed really well. However, they have been pretty average during the other seasons.

Q. What has been the biggest margin of victory in terms of runs in the IPL?

Filter the data frame using the required condition.
Sort the values in descending order using sort_values().
Find the biggest 10 victories in the list using the head() method. It works opposite to tail(), returning the first n rows.

I plotted the filtered data frame highest_wins_by_runs_df using sns.scatterplot(). For the x parameter I used season, and I used win_by_runs as the y parameter. I made the size of the points bigger for the top 10 victories using the s parameter.

To put emphasis on the top 10 victories, I used a different color as well as annotated those data points using plt.annotate(). The first parameter is the text of the annotation. The position of the point to be annotated is given as a tuple.
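A hedged sketch of these steps is shown below. The rows and margins are made up, the filter win_by_runs > 0 is an assumption about what "the required condition" was, and the top 3 are highlighted instead of the top 10 because the sample is tiny. plt.scatter stands in for sns.scatterplot to keep the dependencies to Pandas and Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample with placeholder team names and made-up margins.
matches_df = pd.DataFrame({
    "season": [2008, 2010, 2013, 2016, 2017, 2019],
    "winner": ["Team A", "Team B", "Team C", "Team A", "Team B", "Team C"],
    "win_by_runs": [12, 0, 85, 40, 0, 60],
})

# Assumed condition: keep only the matches that were won by runs.
highest_wins_by_runs_df = matches_df[matches_df.win_by_runs > 0].sort_values(
    "win_by_runs", ascending=False
)
top_wins_df = highest_wins_by_runs_df.head(3)   # first n rows after sorting

plt.figure(figsize=(10, 6))
plt.scatter(highest_wins_by_runs_df.season, highest_wins_by_runs_df.win_by_runs,
            s=20, color="grey")
# Larger, differently colored points for the biggest victories.
plt.scatter(top_wins_df.season, top_wins_df.win_by_runs, s=80, color="red")

# The first argument is the annotation text, the second the (x, y) position.
for row in top_wins_df.itertuples():
    plt.annotate(row.winner, (row.season, row.win_by_runs))
```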

The biggest margin of victory by runs is 146 runs. In 2017, the Mumbai Indians defeated the Delhi Daredevils by this margin. The Royal Challengers Bangalore have 3 victories amongst the top 5.

Q. Mumbai and Chennai are the two most successful teams so far. Which team leads in the head-to-head record?

Filter the data frame using the required condition to find the matches played between the two teams.
Use value_counts() on the winner column to find how many times each team has won.

I plotted the series mivcsk as a bar chart for a better visualization.
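The head-to-head filter could be written along these lines. The four rows are hypothetical and their results are invented for the example; only the team names and the series name mivcsk come from the article. The filter name head_to_head_filter is an assumption.

```python
import pandas as pd

mi = "Mumbai Indians"
csk = "Chennai Super Kings"
kkr = "Kolkata Knight Riders"

# Hypothetical rows with invented results, not the real match records.
matches_df = pd.DataFrame({
    "team1": [mi, csk, mi, kkr, csk],
    "team2": [csk, mi, kkr, csk, mi],
    "winner": [mi, mi, mi, csk, csk],
})

# A fixture can be recorded as "MI vs CSK" or "CSK vs MI", so check both orders.
head_to_head_filter = (
    ((matches_df.team1 == mi) & (matches_df.team2 == csk))
    | ((matches_df.team1 == csk) & (matches_df.team2 == mi))
)

mivcsk = matches_df[head_to_head_filter].winner.value_counts()
print(mivcsk)
```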

MI have dominated CSK and are leading the head-to-head record 17-11. We can see their dominance especially in the 2019 season, where MI defeated CSK in all 4 of their meetings, including the playoff and the final.

5. Inferences from the Analysis

We have drawn some interesting inferences and now know more about the IPL than when we started. Here's a summary of what we learned through our analysis:

Almost 60 matches are played in every IPL season amongst 8 teams.
There has been an attempt to expand the IPL to 10 teams but the 8 teams idea was brought back and has been continued since.
For the first six seasons (2008-2013), teams were figuring out whether batting first or chasing would be better after winning the toss. This could be down to the fact that the IPL and T20 cricket were both in their early stages so teams were trying different strategies.
But, since 2014, teams have preferred chasing, especially in the past 4 seasons (2016-2019) where teams have chosen to field more than 4 times out of 5. This is likely because having a set total to chase makes things simpler. This could also result from teams preferring to chase in ODIs as well.
Though teams have overwhelmingly chosen to field first, the win percentage after choosing to bat or field is not that one-sided. However, their difference is on the rise.
Mumbai Indians have played the most matches in the IPL. Due to the brief expansion, change of owners, and removal and banning of teams, there have been 15 teams who have played in the IPL.
Chennai and Mumbai are the two teams with the highest win percentage. The fact that they are the only two teams that were part of the first season as well, in the top 5, shows their dominance.
Mumbai Indians have won the IPL 4 times, the most. They are followed by Chennai at 3 and Kolkata Knight Riders at 2. Sunrisers Hyderabad, Deccan Chargers and Rajasthan Royals complete the IPL Champions list, all winning once each.
146 runs is the largest margin of victory by runs. Mumbai Indians defeated Delhi Daredevils by this margin in 2017. The largest margin for victory by wickets is 10, which has been achieved many times.
The two heavyweights, Mumbai and Chennai, have a head-to-head record in favour of Mumbai at 17-11. Mumbai have had the upper hand in the 2019 season every time they met, including the final.

6. Conclusion

In this article, we did a bunch of analysis and saw some interesting visualizations. However, this was just scratching the surface.

You can perform more interesting analysis on matches.csv as a standalone data set. But combining deliveries.csv with this dataset could lead to more in-depth analysis.

I did this data analysis and visualization as a project for the 6-week course Data Analysis with Python: Zero to Pandas. This course was conducted by Jovian.ml in partnership with freeCodeCamp.org. Check out the project here.

Also, the IPL is on right now. Go watch it and enjoy!

Matplotlib Course – Learn Python Data Visualization

freeCodeCamp — Wed, 20 May 2020 21:10:00 +0000

Learn the basics of Matplotlib in this crash course tutorial. Matplotlib is an amazing data visualization library for Python. You will also learn how to apply Matplotlib to real-world problems.

You can watch the full course here (90 minute watch).

Course Notes

Source Code

Matplotlib Pyplot Documentation

Font List

Matplotlib Style Options

Kaggle Data Link

How to Install the Libraries Needed for Matplotlib

Option 1: How to Install Matplotlib directly using pip install

Open up a terminal window and type
pip install matplotlib
pip install numpy
pip install pandas

Option 2: How to install Anaconda

Download Anaconda, which will contain all the packages we need. Here's a video tutorial that walks you through how to do this.

Again, you can watch the full course here (90 minute watch).

Matplotlib - freeCodeCamp.org

How to Get Started with Matplotlib – With Code Examples and Visualizations

Importance of Data Visualization in Data Analysis

Brief Overview of Matplotlib

Getting Started with Matplotlib

Installation and Setup

How to Create Your First Plot

Exploring Different Types of Plots

Line Plots

Scatter Plots

Bar Charts

Histograms

Pie Charts

Advanced Plot Customizations

How to Work with Multiple Plots

How to Enhance Plot Aesthetics

How to Save and Export Plots

Interactive Plotting and Animation

Interactive Features in Matplotlib

How to Create Animations

How to Optimize Plots for Large Datasets

Efficient Plotting Techniques for Large Datasets

Downsampling

Data Aggregation

Statistical Data Visualization

Box Plots

Violin Plot

Common Visualization Pitfalls and How to Avoid Them

Overplotting

Misleading Scales and Axes

Color Misuse

Misleading Use of 3D Plots

Misleading Use of Area Charts

Conclusion

Data Visualization with Matplotlib – a Step by Step Guide

What is Data Visualization?

What is Matplotlib?

How to Create a Bar Chart

How to Create a Pie Chart

How to Create a Line chart

Conclusion

Conda Remove Package - How To Remove Matplotlib in Anaconda

How To Create an Environment in Conda

How To Install Packages in a Conda Environment

How To Remove a Package in Conda

Summary

Matplotlib Marker - How To Create a Marker in Matplotlib

List of Matplotlib Markers

Summary

How To Change Legend Font Size in Matplotlib

What Is a Legend in Matplotlib?

How To Change Legend Font Size in Matplotlib Using the fontsize Parameter

How To Change Legend Font Size in Matplotlib Using the prop Parameter

Summary

Matplotlib Add Color – How To Change Line Color in Matplotlib

How To Change Line Color in Matplotlib

How To Change Line Color in Matplotlib Example #1

How To Change Line Color in Matplotlib Example #2

How To Change Line Color in Matplotlib Example #3

Summary

Matplotlib Figure Size – How to Change Plot Size in Python with plt.figsize()

How to Change Plot Size in Matplotlib with plt.figsize()

Here's what the syntax looks like:

How to Change Plot Width in Matplotlib with set_figwidth()

How to Change Plot Height in Matplotlib with set_figheight()

How to Change Default Plot Size in Matplotlib with rcParams

Summary

What is Data Analysis? How to Visualize Data with Python, Numpy, Pandas, Matplotlib & Seaborn Tutorial

What is Numerical Computation? Python and Numpy for Beginners

How to Work with Numerical Data in Python

How to Turn Python Lists into Numpy Arrays

How to Operate on Numpy arrays

What are the Benefits of Using Numpy Arrays?

Multi-Dimensional Numpy Arrays

How to Work with CSV Data Files

Numpy Arithmetic Operations, Broadcasting, and Comparison

Numpy Array Broadcasting

Numpy Array Comparison

Numpy Array Indexing and Slicing

How to Create Numpy Arrays – Other Methods
