Pandas is a fast and approachable open-source library in Python built for analyzing and manipulating data.
This library has a lot of functions and methods to expedite the data analysis process. One of my favorites is the
groupby method, mainly because it lets you get quick insights into your data by transforming, aggregating, and splitting data into various categories.
In this article, you will learn about the Pandas
groupby function, how to aggregate data, and group Pandas DataFrames with multiple columns using the
What do I need to install on my computer to follow this article?
For this article, I'll be using a Jupyter notebook. You can install Jupyter notebook and get it up and running on your computer via the official website.
After installing Juypter, create a new notebook and run
Import pandas as pd to import pandas and
Import numpy as np to import NumPy.
NumPy will let us work with multi-dimensional arrays and high-level mathematical functions. On the other hand, Pandas will allow us to manipulate our data and access the
Let's get started.
groupby in Pandas?
If you're familiar with SQL and its GROUP BY syntax, you already know how powerful it is in summarizing and categorizing data.
groupby method in Python does the same thing and is great when splitting and categorizing data into groups to analyze your data better.
Here is the syntax for Pandas
python DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=_NoDefault.no_default, squeeze=_NoDefault.no_default, observed=False, dropna=True)
Each attribute has a meaning:
by– List of the columns you want to group by.
axis– Defaults to 0. It takes 0 or 'index', 1 or 'columns'.
level– Used with MultiIndex.
as_index– SQL style grouped otput.
sort– Defaults to True. Specify whether to sort after grouping.
group_keys– add group keys or not.
squeeze– deprecated in new versions.
observed– Only use if any of the groupers are Categoricals.
dropna– Defaults to False. Use True to drop None/Nan.
Now let's see how this function works in action.
How to Load the Dataset
For this tutorial, we'll use the supermarket sales dataset from Kaggle, which you can access and download here.
After downloading the dataset, load the data into a pandas dataframe.
A DataFrame is a 2-dimensional data structure made up of rows and columns. This is very similar to your spreadsheet.
You can do that by running this code:
df = pd.read_csv(r"C:\Users\Double Arkad\Downloads\archive\supermarket_sales - Sheet1.csv")
After that, use the
df.head() method to show the first few rows of your dataset. After running
df.head(), you should get the result below. This indicates that the dataset got loaded successfully.
How to Use the
groupby Method in Pandas
Assume your employer asked you to total the number of items ordered and categorize them according to the different payment options. This will let you determine which payment method generates the most revenue.
You can answer this question with the
groupby function by simply grouping the data based on the 'Payment'.
The first column, 'Payments', is the column you want to group by. The second column, 'Quantity' is the column you'll perform an aggregate function on. Lastly, you have the aggregate function
Sum() is one of many functions you can use in a
groupby. You could also use other aggregate functions like the
Average() to find the minimum, mean, median, count, and average value in a group within your dataset.
But by using the
agg() function, you can perform two or more aggregations simultaneously.
Let's see how that works.
How to Aggregate Data Using
groupby in Pandas
Here's how to use
agg() in a
groupby function to find this supermarket's most used payment method.
There are more cash transactions done. Ewallets and credit card transactions follow in level of use.
Notice here we created a dictionary and passed the aggregate functions to be performed. This simultaneously performed two statistical computations on our data! Of course, you can add more aggregate functions in the dictionary depending on the insights you want to get.
Here is what I mean:
df.groupby(['Payment', 'Customer type'])['Quantity'].agg([np.sum, np.mean, np.max, np.min])
groupby function, we added more aggregate functions to our statistical computation to gain insight into the maximum and the minimum number of goods ordered in each payment group.
Here's how it works:
And here's the result you get:
From the output, we're counting the total number of orders placed in the store and grouping the results by each payment method.
How to Group Pandas DataFrames by Multiple Columns
You can also group multiple columns in the
groupby function. For example, we included a column below to our
groupby function called 'Customer type'.
df.groupby(['Payment', 'Customer type'])['Quantity'].sum()
Our output shows that the data was split and categorized into two groups based on the Customer type column. The output is becoming easier to analyze.
How to Aggregate Multiple Columns Using Pandas
You can also perform statistical computations on multiple columns with the
groupby function. For example, let's look at the total sales generated and quantity ordered and group our results by the "Payment" and "Customer type" columns.
Run the code:
df.groupby(['Payment', 'Customer type']) [['Quantity','Unit price']].sum()
We can see from the output that the Payment type "Ewallet" generated the most revenue, and you can move on to determine which type of Customers contributed the most revenue for the Store.
In this article, you learned about the importance of the Pandas
groupby method. You saw how the
groupby function allows you to do a lot of operations on your data, from splitting the data to applying a function like
Sum() to get more insight and add more functionality.