How to Get Started with Pandas in Python

The Pandas package in Python gives you a bunch of cool functions and features that help you manipulate data more efficiently. It also lets you perform numerous data cleaning and data preprocessing steps with very little hassle.

That's great isn't it? Here's a list of some of the most frequently used Pandas functions and tricks to help you enjoy your data science journey.

How to Remove Missing Values in DataFrame

Getting rid of missing values is one of the most common tasks in data cleaning. Missing values could be just across one row or column or across multiple rows and columns.

Depending on your application and problem domain, you can use different approaches to handle missing data – like interpolation, substituting with the mean, or simply removing the rows with missing values.

Pandas offers the dropna function which removes all rows (for axis=0) or all columns (for axis=1) where missing values are present. Some of the arguments for the dropna function are as follows:

axis which specifies if rows are to be dropped (axis=0) or if columns are to be dropped (axis=1)
subset which specifies a list of columns to consider for missing values when axis=0
inplace which specifies if changes are to be made in the existing DataFrame itself

Check out the docs linked here for more in-depth coverage.

In the example below, we're creating a small DataFrame with missing values and then discarding rows with missing values in any column.

How to Remove Duplicates in DataFrame

Another common data cleaning task is removing duplicate rows. The drop_duplicates function performs this with arguments similar to dropna such as:

subset, which specifies a subset of columns to consider for duplicate value when axis=0
inplace
keep, which specifies which duplicated values to keep. Keep can be equal to first, last, or False to drop all duplicates.

Check out the docs linked here for more detailed info.

Let's duplicate a few rows and remove them from our dataset:

How to Remove Rows with Column-specific Values

Suppose we want to keep only those rows where project type is Web or where the number of hours worked is equal to 12. Here's how we can do it.

Using this method, we can filter out rows based on certain specific column values:

How to Convert DataFrames to JSON

DataFrames are super cool optimized structures that are great to work with. And JSON is one of the most popular data formats for seamless data exchange.

Let's convert our DataFrame to JSON using to_json which requires arguments like:

orient, which specifies what should be the key and value pairs. Default is columns, so column name is the key and each column is the value.
date_format which specifies the format of the date. The default is epoch.

Look at the example below:

We can see that to_json has returned a string with the following schema:

column_0 :
{ row_index_0: column_value_0, row_index_1:column_value_1, ...}, 
column_1:
{ row_index_0: column_value_0, row_index_1:column_value_1, ...}, 
...
column_N:
{ row_index_0: column_value_0, row_index_1:column_value_1, ...}

If we want to convert each row to a dictionary, we need to specify that orient=records and parse it using the JSON module.

Convert DataFrame to JSON with orient=records

How to Count the Number of Unique Values in a Column

Let's say we want to know how many different project types exist. We can get that information using the nunique function.

Count number of unique values in a column

How to Save DataFrame as .csv File

Just one line of code is required to save the DataFrame as a csv file:

dataset.to_csv("save_as_csv.csv")

How to Save Multiple Lists as One .csv File

Suppose we have three separate lists as our data source and we want to save them together in one csv file. This just involves two steps:

converting it to a number of tuples using zip,
and then converting it to a list.

In the example below, we follow this approach to convert the three lists into one DataFrame which we can now save as a .csv file.

How to Read DataFrames in a Memory Efficient Way

Often we need to read files which are so large that they can't fit into memory. For such mammoth datasets, we use a different approach.

First, we create a TextFileReader object. Next we specify a parameter called chunksize which specifies how many rows of the file we want to read at a time, let's say 4 rows. So we read 4 rows at a time, perform some tasks on that chunk, and move on to the next 4 rows.

Small chunks are more likely to fit into memory than the entire file of thousands of rows. The following example shows how chunking works.

Read DataFrame in a memory efficient manner

Here we read the california dataset 1000 rows at a time, remove all rows where median_income is less than or equal to 3, and append these reduced chunks together to make a smaller dataset.

You can save more memory by reading only those columns which you need and specifying smaller datatypes for columns as described in detail in the docs linked here.

How to Change All Values in a DataFrame Using `apply`

Let's go back to our example of a projects DataFrame to illustrate this. We focus on the Hours_Worked column, increasing the count by 1 if it's an even number and by 2 if it's an odd number. We use a lambda function for this purpose.

Conclusion

Pandas is a powerful package which can seem daunting sometimes due to its vastness. This is why I tried to list out some of the most useful functions I've come across.

These Pandas functions will help you accelerate your data analysis endeavors. Thank you for your time and I hope you enjoyed reading this article.