python beginner - freeCodeCamp.org

How Passing by Object Reference Works in Python

Mokshita V P — Thu, 26 Mar 2026 14:23:20 +0000

If you've ever modified a variable inside a Python function and been surprised or confused by what happened to it outside the function, you're not alone. This tripped me up for a long time.

Coming from tutorials that talked about "call by value" and "call by reference," I assumed Python must follow one of those two models. It doesn't. Python does something slightly different, and once you understand it, a lot of previously confusing behavior will suddenly click.

In this article, you'll learn:

What calling by value and calling by reference mean
How other languages like C handle this
What Python actually does (passing by object reference)
How mutable and immutable types affect behavior inside functions

Call by Value and Call by Reference Explained
How It Works in C (with Examples)
What Python Does Instead
Mutable vs Immutable Types
Conclusion

Call by Value and Call by Reference Explained

Before we get to Python, let's quickly define these two terms.

Call by value means a copy of the variable is passed to the function. Whatever you do to it inside the function, the original stays unchanged.

Call by reference means the actual memory location of the variable is passed. Changes inside the function directly affect the original variable.

Many languages support one or both of these models. Python, however, uses neither – at least not in the traditional sense.

How It Works in C (with Examples)

C is a good example of a language that supports both models explicitly.

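Here's how you call by reference in C, by passing a pointer. The function changes the original variable. (Passing a plain int instead would hand the function a copy, and the original would stay at 5.)

#include <stdio.h>

void modify(int *n) {
    *n = *n + 10;
    printf("Inside function: %d\n", *n);
}

int main() {
    int x = 5;
    modify(&x);
    printf("Outside function: %d\n", x);
    return 0;
}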

Output:

Inside function: 15

Outside function: 15 ← original changed!

In C, you explicitly choose the behavior by deciding whether to pass a pointer or a plain value. Python doesn't give you that choice, but what it does instead is actually quite logical.

What Python Does Instead

Python uses a model called passing by object reference (sometimes called passing by assignment).

When you pass a variable to a function in Python, you're passing a reference to the object that variable points to, not a copy of the value, and not the variable itself.

What happens next depends entirely on whether that object is mutable (can be changed in place) or immutable (cannot be changed in place).
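
One way to check this for yourself is to compare object identities with the built-in id() function. The sketch below is only an illustration, and the function name show_identity is made up for this example:

```python
def show_identity(param, original_id):
    # `param` is a new local name, but it refers to the very same object
    # the caller passed in, so the identities match.
    return id(param) == original_id

numbers = [1, 2, 3]
same_list = show_identity(numbers, id(numbers))

x = 5
same_int = show_identity(x, id(x))

print(same_list, same_int)  # True True: no copy was made in either case
```

Notice that this holds for both the list and the integer. The difference between them only shows up once the function tries to change what it received.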

Mutable vs Immutable Types

Immutable types in Python include int, float, str, and tuple. These objects cannot be modified in place. When you "change" one inside a function, Python creates a brand new object and the original is left untouched.

def modify_number(n):
    n = n + 10
    print("Inside function:", n)

x = 5

modify_number(x)

print("Outside function:", x)

Output:

Inside function: 15

Outside function: 5 ← original unchanged

Mutable types include list, dict, and set. These can be changed in place. When you modify one inside a function, you're modifying the same object the caller is holding a reference to.

def modify_list(items):

    items.append(99)

    print("Inside function:", items)

my_list = [1, 2, 3]

modify_list(my_list)

print("Outside function:", my_list)

Output:

Inside function: [1, 2, 3, 99]

Outside function: [1, 2, 3, 99] ← original changed!

This is the key insight: Python doesn't decide behavior based on how you pass something. It decides based on what type of object you're passing and whether that object can be changed in place.
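
A related detail is worth trying out: even with a mutable object, the caller only sees a change if the function modifies the object in place. If the function simply assigns a new object to its parameter, the caller's object is left alone. Here is a small sketch (the function names rebind and mutate are made up for illustration):

```python
def rebind(items):
    # Assignment points the local name at a brand new list;
    # the caller's list is not touched.
    items = items + [99]
    return items

def mutate(items):
    # append() changes the existing list object in place,
    # so the caller sees the change.
    items.append(99)

a = [1, 2, 3]
rebind(a)
print(a)  # [1, 2, 3]

b = [1, 2, 3]
mutate(b)
print(b)  # [1, 2, 3, 99]
```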

Conclusion

Python doesn't use call by value or call by reference. It passes by object reference, where the function receives a reference to the object, and whether that object can be modified in place determines what happens next.

To recap:

Immutable types (int, str, tuple): a new object is created inside the function, original stays the same
Mutable types (list, dict, set): the original object is modified directly

Once this clicked for me, a lot of the "why is Python doing this?" moments started making sense. If you're just getting started with functions in Python, keep this in the back of your mind. It'll save you a lot of debugging headaches.

How to Use the Polars Library in Python for Data Analysis

Sara Jadhav — Wed, 10 Dec 2025 18:14:34 +0000

In this article, I’ll give you a beginner-friendly introduction to the Polars library in Python.

Polars is an open-source library, originally written in Rust, which makes data wrangling easier in Python. The syntax of Polars is very similar to Pandas, so if you’ve worked with Pandas or the PySpark library before, using Polars should be a breeze.

Polars excels at giving fast results. It’s also memory efficient and helps you optimize your code using parallelism. It also lets you convert data from and to various libraries like NumPy, Pandas, and others.

In this tutorial, we’ll be learning about the Polars Library from absolute scratch, from installing and importing the library on the system, to manipulating data in a dataset with the help of this library.

First, we’ll look at Polars’ basic functions. We’ll also be writing some practical code, which will help you apply what you’ve learned. Finally, we’ll be working with an example dataset to solidify some more key Polars concepts. Let’s dive in.

Prerequisites
Installing and Importing the Polars Library
What is a Series?
What is a DataFrame?
How to Read CSV Files with Polars
Some other Important Functions
Summary

Prerequisites

Even though this tutorial is beginner-friendly, having some basic knowledge of the following areas will help you understand this article better:

Basic Python syntax
Data structures
Ability to import libraries and knowledge of using functions and methods
Basics of NumPy and Pandas will come in handy (not necessary).

Now that you’re aware of the prerequisites, let’s get started with our tutorial.

Installing and Importing the Polars Library

To install the Polars library, you can use the following command in your terminal:

pip install polars

Now, this works if you already have the pip package manager on your system. If you’re in a conda environment, you can use this instead:

conda install -c conda-forge polars

But I strongly recommend using the pip package manager to avoid various inconveniences.

Let’s import Polars in our program. We’ll follow the same process as we use for importing other libraries in Python:

import polars as pl # pl is a conventional alias

When creating a Polars object, it’s important to know how large your data is. By default, a Polars DataFrame can hold up to 2³² rows (about 4.3 billion). To load more data than that, install the Polars library with the following command instead:

pip install polars[rt64]

If you want to use the Polars library right away without actually installing it on your system, using a Google Colab notebook is the best option. When using a Google Colab Notebook, you can directly import and start using Polars in your program. I’ll be using Google Colab Notebook for this tutorial.

What is a Series?

A series is a fundamental element of a DataFrame. It’s a 1-dimensional data structure that you can compare to a ‘list’ in Python or a ‘1-D array’ in NumPy. But the difference between a series and a 1-D array is that the former is labeled while the latter is not. Many series come together to form a DataFrame.

We can create a series with homogenous data as well as heterogenous data.

Creating a Series with Homogenous Data

In a series, the datatype of all the elements should be the same. If it’s not, an error is thrown.

The syntax to define a Polars series is as follows:

var_name = pl.Series("column_name", [values])

The following code shows an example of a homogenous series definition in Python:

import polars as pl
series_homo = pl.Series("Numbers", ['One', 'Two', 'Three', 'Four', 'Five'])
print(series_homo)

Output:

shape: (5,)
Series: 'Numbers' [str]
[
    "One"
    "Two"
    "Three"
    "Four"
    "Five"
]

In the above code, we first imported the Polars library using the pl alias to start using it throughout the code. Using aliases is a matter of choice, but pl is a conventional one (like np for NumPy and pd for Pandas). The benefit of using conventional aliases is that when you hand over the code to someone else, it’s easy for them to follow along.

Next, we used the pl.Series() function to create a Polars series object. As its first parameter, we passed the label for our series (Numbers in this case). Then we passed the values to be stored in the form of a list. Remember that the list of values that we pass acts as a single argument. Finally, we printed our series.

We can see that the output tells us about the dimensions of the Polars object as well as the datatype of the series. The shape (5,) tells us that the series holds five elements. For a DataFrame, the shape has the form (rows, columns).

We can find the datatype of a homogenous series explicitly by using the dtype attribute.

print(series_homo.dtype)

Output:

String

Creating a Series with Heterogenous Data

Heterogenous data means that the data-type of all the elements is not the same. The syntax to define a series with heterogenous data is as follows:

var_name = pl.Series("Column_name", [values], strict=False)

So you’re probably wondering, based on what I said above: how can we have a series with heterogenous data? Well, one thing to note is that a series is always homogenous irrespective of the data that is fed to it. I’ll explain below - first let’s look at this code:

import polars as pl

series_hetero = pl.Series("Numbers", [1, "Two", 3, "Four"], strict=False)
print(series_hetero)

Output:

shape: (4,)
Series: 'Numbers' [str]
[
    "1"
    "Two"
    "3"
    "Four"
]

Here, we created a series object using the pl.Series() function, labelled it, and passed the values that we want in our series.

But you’ll notice that we have provided heterogenous data (data that doesn’t have the same datatype) to the function. Usually, this throws an error. But as we have set the strict parameter as False, the function now becomes lenient with the schema of the series. (The schema is just the expected data-type of the values that are to be recorded in the series.)

If no particular schema is defined for a series that’s fed heterogenous data, pl.Series() sets the schema to pl.Utf8 (string datatype). You can see this automatic fixing of the schema in the above example. This keeps the program from raising an error, as a string can hold any characters, including digits and symbols.

Also, we can see that the datatype of all elements is the same (pl.Utf8). This means that the series is homogenous, even though we put heterogenous data in it.

If we define a schema for the series, then the Polars library converts all the records – which show a different datatype than the defined schema – to null objects. This should be clear in the following example:

import polars as pl
# defined the schema as Integer bit 32
series = pl.Series("ints", [1, -2, 3, 4, 5, 'Thirteen', 'Fourteen'], dtype=pl.Int32, strict=False)
print(series)

Output:

shape: (7,)
Series: 'ints' [i32]
[
    1
    -2
    3
    4
    5
    null
    null
]

Here, we can see that the last two entries were strings, but since we set the schema to a 32-bit integer, they were turned into null records.

So as you can see, the leniency of the program depends on whether you set the strict parameter to True or False. If we set it to True, the schema is enforced strictly, and data that doesn’t obey the schema makes the program raise an exception. On the other hand, if we set strict to False, the series still preserves its homogenous nature by turning schema-disobeying elements to null.

Now that you understand how series work, we’re ready to move on to DataFrames.

What is a DataFrame?

A DataFrame is a two-dimensional data structure that you can use to store large numbers of related parameters of the collected data. It’s also useful for analyzing that data. A DataFrame is nothing more than the collection of many series, each labelled differently to store different aspects of data.

Here’s the syntax to create a Polars DataFrame object:

var_name = pl.DataFrame({key: value pairs}, schema)

The following example shows you how to define a DataFrame object in Python:

import polars as pl
import numpy as np

schema = {"Number": pl.UInt32, "Natural Log": None, "Log Base 10": None}

df = pl.DataFrame(
    {
        "Number" : np.arange(1, 11),
        "Natural Log" : [np.log(x) for x in range(1,11)],
        'Log Base 10' : [np.log10(x) for x in range(1,11)]
        },
    schema=schema
    )
print(df)

Output:

shape: (10, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
│ ---    ┆ ---         ┆ ---         │
│ u32    ┆ f64         ┆ f64         │
╞════════╪═════════════╪═════════════╡
│ 1      ┆ 0.0         ┆ 0.0         │
│ 2      ┆ 0.693147    ┆ 0.30103     │
│ 3      ┆ 1.098612    ┆ 0.477121    │
│ 4      ┆ 1.386294    ┆ 0.60206     │
│ 5      ┆ 1.609438    ┆ 0.69897     │
│ 6      ┆ 1.791759    ┆ 0.778151    │
│ 7      ┆ 1.94591     ┆ 0.845098    │
│ 8      ┆ 2.079442    ┆ 0.90309     │
│ 9      ┆ 2.197225    ┆ 0.954243    │
│ 10     ┆ 2.302585    ┆ 1.0         │
└────────┴─────────────┴─────────────┘

Above, we created a Polars DataFrame object with the pl.DataFrame() function. In the function, we created a dictionary as an argument for passing the values of the DataFrame.

In the dictionary, each key-value pair represents a series. Each key represents the label of the series, whereas its value represents the values of the series. The values are passed in the form of a list as each key can map to only one value.

We also defined the schema for the DataFrame. Again, the schema is a dictionary, where each key-value pair corresponds to the schema of one series. Every key is the label of a series (to map the schema to the correct series) and its value is the datatype for that series. A value of None leaves the datatype for Polars to infer.

In the output, we can see that we got a nice table representing our data. The labels are neatly separated from the data and below them, their schema is also represented.

What is a Schema?

A schema is the definition of the datatype of each series. We fix a particular datatype for a series to avoid ending up with mixed data.

For example, in the above code, we set the datatype of the column Number to Unsigned Integer - 32 bit (pl.UInt32) as we don’t want to put negative integers in our NumPy logarithm function.

Now, if we want to hide the datatype (that’s written below each label), we can use the following function:

pl.Config.set_tbl_hide_column_data_types(active=True)

The Head, Tail, and Glimpse Functions

The head(), tail() and glimpse() functions are used to have a quick look at the data by reviewing certain records (rows). These are useful especially for large datasets for taking a look at the data, for example to see which columns are present, what type of data is present in each column, and so on.

The head() function prints the given number of rows (passed as the argument of the head() function) from the top of the DataFrame. If no argument is passed, it prints the first five rows of the DataFrame.

import polars as pl
import numpy as np

schema = {"Number": pl.UInt32, "Natural Log": None, "Log Base 10": None}

df = pl.DataFrame(
    {
        "Number" : np.arange(1, 11),
        "Natural Log" : [np.log(x) for x in range(1,11)],
        'Log Base 10' : [np.log10(x) for x in range(1,11)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=True)
print(df.head(3))

Output:

shape: (3, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
╞════════╪═════════════╪═════════════╡
│ 1      ┆ 0.0         ┆ 0.0         │
│ 2      ┆ 0.693147    ┆ 0.30103     │
│ 3      ┆ 1.098612    ┆ 0.477121    │
└────────┴─────────────┴─────────────┘

In this example, we have used the same DataFrame that we just created. Then we used the head() function to output the first three rows of the DataFrame. Also, you may now notice that the schema representation under column names has disappeared. This is because we used pl.Config.set_tbl_hide_column_data_types(active=True).

The glimpse() function presents the data briefly and in a horizontal manner (rows are represented as columns and columns are represented as rows) for better readability.

import polars as pl
import numpy as np

schema = {"Number": pl.UInt32, "Natural Log": None, "Log Base 10": None}

df = pl.DataFrame(
    {
        "Number" : np.arange(1, 11),
        "Natural Log" : [np.log(x) for x in range(1,11)],
        'Log Base 10' : [np.log10(x) for x in range(1,11)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=True)
print(df.glimpse())

Output:

Rows: 10
Columns: 3
$ Number       1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ Natural Log  0.0, 0.6931471805599453, 1.0986122886681098, 1.3862943611198906, 1.6094379124341003, 1.791759469228055, 1.9459101490553132, 2.0794415416798357, 2.1972245773362196, 2.302585092994046
$ Log Base 10  0.0, 0.3010299956639812, 0.47712125471966244, 0.6020599913279624, 0.6989700043360189, 0.7781512503836436, 0.8450980400142568, 0.9030899869919435, 0.9542425094393249, 1.0

None

Here, we used the glimpse() function on our previously created DataFrame df. The output shows each column on its own row, like a transposed DataFrame. None is also printed. This is because, by default, glimpse() prints the preview itself and returns None, and our print() call then prints that return value. To get the preview back as a string instead, we can set the return_as_string parameter to True. The following example shows how to do it:

import polars as pl
import numpy as np

schema = {"Number": pl.UInt32, "Natural Log": None, "Log Base 10": None}

df = pl.DataFrame(
    {
        "Number" : np.arange(1, 11),
        "Natural Log" : [np.log(x) for x in range(1,11)],
        'Log Base 10' : [np.log10(x) for x in range(1,11)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=True)
print(f'Returned as String: \n{df.glimpse(return_as_string=True)}')

Output:

Returned as String: 
Rows: 10
Columns: 3
$ Number       1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ Natural Log  0.0, 0.6931471805599453, 1.0986122886681098, 1.3862943611198906, 1.6094379124341003, 1.791759469228055, 1.9459101490553132, 2.0794415416798357, 2.1972245773362196, 2.302585092994046
$ Log Base 10  0.0, 0.3010299956639812, 0.47712125471966244, 0.6020599913279624, 0.6989700043360189, 0.7781512503836436, 0.8450980400142568, 0.9030899869919435, 0.9542425094393249, 1.0

In the above code, we can see that the DataFrame is returned as a string and None is not returned.

Finally, the tail() function outputs the given number of rows (passed as the argument of the tail() function) from the bottom of the dataset. When no argument is passed, it outputs the last 5 rows by default.

This is useful for checking if our data was completely loaded. Checking the first few records using the head() function and the last few records with the tail() function ensures that the data is correctly and totally loaded.

Also, we can check if there are any empty records at the end of the dataset. Having empty records at the end of the dataset can be fatal in some cases. For example, if you have to train an ML model on a dataset and you split the dataset statically into testing and training datasets, the empty rows at the end are going to cause an issue. So, checking our data beforehand is a best practice, and these functions help us do it.

import polars as pl
import numpy as np

schema = {"Number": pl.UInt32, "Natural Log": None, "Log Base 10": None}

df = pl.DataFrame(
    {
        "Number" : np.arange(1, 11),
        "Natural Log" : [np.log(x) for x in range(1,11)],
        'Log Base 10' : [np.log10(x) for x in range(1,11)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=True)
print(df.tail(3))

Output:

shape: (3, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
╞════════╪═════════════╪═════════════╡
│ 8      ┆ 2.079442    ┆ 0.90309     │
│ 9      ┆ 2.197225    ┆ 0.954243    │
│ 10     ┆ 2.302585    ┆ 1.0         │
└────────┴─────────────┴─────────────┘

In the above code, we used the tail() function on the dataset (that we created earlier) and passed ‘3’ as our argument. Thus our program returned the last three rows of the dataset.

The Sample Function

The sample() function returns a given number of rows picked at random from the DataFrame. The rows may come back in a different order than they appear in the original data. Random sampling helps avoid a biased view of the data.

import polars as pl
import numpy as np

schema = {"Number": pl.UInt32, "Natural Log": None, "Log Base 10": None}

df = pl.DataFrame(
    {
        "Number" : np.arange(1, 11),
        "Natural Log" : [np.log(x) for x in range(1,11)],
        'Log Base 10' : [np.log10(x) for x in range(1,11)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=True)
print(df.sample(3))

Output:

shape: (3, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
╞════════╪═════════════╪═════════════╡
│ 6      ┆ 1.791759    ┆ 0.778151    │
│ 5      ┆ 1.609438    ┆ 0.69897     │
│ 10     ┆ 2.302585    ┆ 1.0         │
└────────┴─────────────┴─────────────┘

We can see in the output that we got random rows of the data in a random order of their occurrence in the dataset (row 5 comes before row 6 in the DataFrame, yet by sampling we got row 5 after row 6.) Sampling is a good practice as it helps avoid overfitting in ML in some cases and gives us a general idea about the entire dataset.

Concatenating Two DataFrames

In a nutshell, ‘concatenating’ simply means ‘linking’. Adding or linking one dataset to another – basically, stacking one on top of another – is concatenating the two datasets.

For example, in the previous DataFrame, we had numbers from 1 to 10 and their logarithms. Now, if we want to make it 1 to 20, we have to concatenate a different dataset containing numbers 11 to 20 to the former dataset.

The following code shows how this works:

import polars as pl
import numpy as np

schema = {"Number": pl.UInt32, "Natural Log": None, "Log Base 10": None}

df = pl.DataFrame(
    {
        "Number" : np.arange(1, 11),
        "Natural Log" : [np.log(x) for x in range(1,11)],
        'Log Base 10' : [np.log10(x) for x in range(1,11)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=True)

# new dataset created for concatenation
df1 = pl.DataFrame({
    "Number" : [x for x in range(11, 21)],
    "Log Base 10" : [np.log10(x) for x in range(11,21)],
    "Natural Log" : [np.log(x) for x in range(11, 21)]
}, schema=schema)

print(pl.concat([df, df1], how='vertical')) # concatenating the two datasets

Output:

shape: (20, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
╞════════╪═════════════╪═════════════╡
│ 1      ┆ 0.0         ┆ 0.0         │
│ 2      ┆ 0.693147    ┆ 0.30103     │
│ 3      ┆ 1.098612    ┆ 0.477121    │
│ 4      ┆ 1.386294    ┆ 0.60206     │
│ 5      ┆ 1.609438    ┆ 0.69897     │
│ …      ┆ …           ┆ …           │
│ 16     ┆ 2.772589    ┆ 1.20412     │
│ 17     ┆ 2.833213    ┆ 1.230449    │
│ 18     ┆ 2.890372    ┆ 1.255273    │
│ 19     ┆ 2.944439    ┆ 1.278754    │
│ 20     ┆ 2.995732    ┆ 1.30103     │
└────────┴─────────────┴─────────────┘

In this code, we first created the DataFrame df. Then we created another DataFrame df1. Next, we used pl.concat() to concatenate the DataFrames.

The first argument that we passed is the list of the DataFrames that are to be linked. The how parameter defines the manner of concatenation. ‘Vertical’ in this context means that we are linking DataFrames vertically (adding more rows).

The important thing to note here is that schema incompatibility may raise an exception. If the DataFrames that are to be concatenated have different schemas, there will be a schema incompatibility problem. So it’s better to keep the schemas of both the datasets (that are to be concatenated) the same.

Here, we introduced a variable named schema containing the schema parameter of the DataFrame and we applied it to both the DataFrames to avoid schema incompatibility.

Also, concatenation occurs in the order of the passed arguments. For example, in the above code, df appears prior to df1, thus in the linked DataFrame, df appears first and then df1. If we had changed the sequence of values, the concatenated DataFrame would start from df1 and then df.

The following code explains that:

import polars as pl
import numpy as np

schema = {"Number": pl.UInt32, "Natural Log": None, "Log Base 10": None}

df = pl.DataFrame(
    {
        "Number" : np.arange(1, 11),
        "Natural Log" : [np.log(x) for x in range(1,11)],
        'Log Base 10' : [np.log10(x) for x in range(1,11)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=True)

# new dataset created for concatenation
df1 = pl.DataFrame({
    "Number" : [x for x in range(11, 21)],
    "Log Base 10" : [np.log10(x) for x in range(11,21)],
    "Natural Log" : [np.log(x) for x in range(11, 21)]
}, schema=schema)

print(pl.concat([df1, df], how='vertical')) # sequence changed from [df,df1] to [df1, df]

Output:

shape: (20, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
╞════════╪═════════════╪═════════════╡
│ 11     ┆ 2.397895    ┆ 1.041393    │
│ 12     ┆ 2.484907    ┆ 1.079181    │
│ 13     ┆ 2.564949    ┆ 1.113943    │
│ 14     ┆ 2.639057    ┆ 1.146128    │
│ 15     ┆ 2.70805     ┆ 1.176091    │
│ …      ┆ …           ┆ …           │
│ 6      ┆ 1.791759    ┆ 0.778151    │
│ 7      ┆ 1.94591     ┆ 0.845098    │
│ 8      ┆ 2.079442    ┆ 0.90309     │
│ 9      ┆ 2.197225    ┆ 0.954243    │
│ 10     ┆ 2.302585    ┆ 1.0         │
└────────┴─────────────┴─────────────┘

Here, we can see that df1 appears first and then df (unlike the previous example). Thus, the order of the DataFrames in the list matters.

How to Join Two DataFrames

Joining datasets and concatenating datasets are two different concepts. While concatenating means ‘linking’ two separate datasets, joining refers to combining datasets based on a shared column (a key). Polars matches rows from both datasets where the key values are the same.

In the above dataset ‘df’, we’ll add a new column by joining the dataset ‘df’ with another DataFrame.

# new dataframe
new_col = pl.DataFrame({
    "Number" : [x for x in range(1, 11)],
    "Log Base 2" : [np.log2(x) for x in range(1, 11)]
})

new_data = df.join(new_col, on="Number", how="left") # Both have one column same to map values

print(new_data.head())

Output:

shape: (5, 4)
┌────────┬─────────────┬─────────────┬────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 ┆ Log Base 2 │
╞════════╪═════════════╪═════════════╪════════════╡
│ 1      ┆ 0.0         ┆ 0.0         ┆ 0.0        │
│ 2      ┆ 0.693147    ┆ 0.30103     ┆ 1.0        │
│ 3      ┆ 1.098612    ┆ 0.477121    ┆ 1.584963   │
│ 4      ┆ 1.386294    ┆ 0.60206     ┆ 2.0        │
│ 5      ┆ 1.609438    ┆ 0.69897     ┆ 2.321928   │
└────────┴─────────────┴─────────────┴────────────┘

In this example, we called the join() function on df and passed new_col as its argument. This is why the columns of df appear before the column from new_col. The on parameter takes the name of the column on which the two datasets are joined.

Here, we first mapped the elements of the column Number and its corresponding rows and joined the DataFrames accordingly.

If we used the join() function on the new_col DataFrame, the columns of df would appear later than the column in new_col. The following code will make it clear:

# new dataframe
new_col = pl.DataFrame({
    "Number" : [x for x in range(1, 11)],
    "Log Base 2" : [np.log2(x) for x in range(1, 11)]
})

new_data = new_col.join(df, on="Number", how="left") # passed df as argument

print(new_data.head())

Output:

shape: (5, 4)
┌────────┬────────────┬─────────────┬─────────────┐
│ Number ┆ Log Base 2 ┆ Natural Log ┆ Log Base 10 │
╞════════╪════════════╪═════════════╪═════════════╡
│ 1      ┆ 0.0        ┆ 0.0         ┆ 0.0         │
│ 2      ┆ 1.0        ┆ 0.693147    ┆ 0.30103     │
│ 3      ┆ 1.584963   ┆ 1.098612    ┆ 0.477121    │
│ 4      ┆ 2.0        ┆ 1.386294    ┆ 0.60206     │
│ 5      ┆ 2.321928   ┆ 1.609438    ┆ 0.69897     │
└────────┴────────────┴─────────────┴─────────────┘

Notice that the column ‘Log Base 2’ now appears before the other columns (unlike in the previous example). So the DataFrame on which we call join() determines the column order.

How to Use the `with_columns()` Function

The with_columns() function lets us compute new columns from existing ones and returns a new DataFrame containing the original columns plus the new ones. The result is similar to what we got with join(), but we don’t need a second DataFrame.

The following example will make it clear:

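import polars as pl
import numpy as np

# same schema as in the earlier examples
schema = {"Number": pl.UInt32, "Natural Log": None, "Log Base 10": None}
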
df = pl.DataFrame(
    {
        "Number" : np.arange(1, 11),
        "Natural Log" : [np.log(x) for x in range(1,11)],
        'Log Base 10' : [np.log10(x) for x in range(1,11)]
        },
    schema=schema
    )
new_data = df.with_columns((np.log2(pl.col("Number"))).alias("Log Base 2"))

print(new_data.head())

Output:

shape: (5, 4)
┌────────┬─────────────┬─────────────┬────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 ┆ Log Base 2 │
╞════════╪═════════════╪═════════════╪════════════╡
│ 1      ┆ 0.0         ┆ 0.0         ┆ 0.0        │
│ 2      ┆ 0.693147    ┆ 0.30103     ┆ 1.0        │
│ 3      ┆ 1.098612    ┆ 0.477121    ┆ 1.584963   │
│ 4      ┆ 1.386294    ┆ 0.60206     ┆ 2.0        │
│ 5      ┆ 1.609438    ┆ 0.69897     ┆ 2.321928   │
└────────┴─────────────┴─────────────┴────────────┘

In this example, we have a DataFrame df. To add a column to it, we use the with_columns() function. Inside it, we selected the column named ‘Number’ using the pl.col() function and passed it to np.log2() to get the log base 2 value for every record. Finally, to label the new column, we used the alias() function, with the label passed to it as an argument.

Now that we know about the basics of DataFrames, let’s look at how we can work with CSV files.

How to Read CSV Files with Polars

Reading CSV files with Polars is extremely similar to how it works in Pandas. For this tutorial, I’ll be using the Titanic Dataset. Here’s the link to the dataset so you can download it. In this part of the tutorial, we’ll be mainly talking about column selection (useful in feature selection) and filtering the data.

Here’s the syntax for reading a CSV file:

var_name = pl.read_csv("path_dataset")

Example code:

import polars as pl

data = pl.read_csv("/titanic_dataset.csv")
print(data.head())

Output:

shape: (5, 12)
┌─────────────┬──────────┬────────┬─────────────────────┬───┬─────────┬─────────┬───────┬──────────┐
│ PassengerId ┆ Survived ┆ Pclass ┆ Name                ┆ … ┆ Ticket  ┆ Fare    ┆ Cabin ┆ Embarked │
╞═════════════╪══════════╪════════╪═════════════════════╪═══╪═════════╪═════════╪═══════╪══════════╡
│ 892         ┆ 0        ┆ 3      ┆ Kelly, Mr. James    ┆ … ┆ 330911  ┆ 7.8292  ┆ null  ┆ Q        │
│ 893         ┆ 1        ┆ 3      ┆ Wilkes, Mrs. James  ┆ … ┆ 363272  ┆ 7.0     ┆ null  ┆ S        │
│             ┆          ┆        ┆ (Ellen Need…        ┆   ┆         ┆         ┆       ┆          │
│ 894         ┆ 0        ┆ 2      ┆ Myles, Mr. Thomas   ┆ … ┆ 240276  ┆ 9.6875  ┆ null  ┆ Q        │
│             ┆          ┆        ┆ Francis             ┆   ┆         ┆         ┆       ┆          │
│ 895         ┆ 0        ┆ 3      ┆ Wirz, Mr. Albert    ┆ … ┆ 315154  ┆ 8.6625  ┆ null  ┆ S        │
│ 896         ┆ 1        ┆ 3      ┆ Hirvonen, Mrs.      ┆ … ┆ 3101298 ┆ 12.2875 ┆ null  ┆ S        │
│             ┆          ┆        ┆ Alexander (Helg…    ┆   ┆         ┆         ┆       ┆          │
└─────────────┴──────────┴────────┴─────────────────────┴───┴─────────┴─────────┴───────┴──────────┘

We can get summary statistics for the data by using the describe() function.

print(data.describe())

Output:

shape: (9, 13)
┌────────────┬─────────────┬──────────┬──────────┬───┬─────────────┬───────────┬───────┬──────────┐
│ statistic  ┆ PassengerId ┆ Survived ┆ Pclass   ┆ … ┆ Ticket      ┆ Fare      ┆ Cabin ┆ Embarked │
╞════════════╪═════════════╪══════════╪══════════╪═══╪═════════════╪═══════════╪═══════╪══════════╡
│ count      ┆ 418.0       ┆ 418.0    ┆ 418.0    ┆ … ┆ 418         ┆ 417.0     ┆ 91    ┆ 418      │
│ null_count ┆ 0.0         ┆ 0.0      ┆ 0.0      ┆ … ┆ 0           ┆ 1.0       ┆ 327   ┆ 0        │
│ mean       ┆ 1100.5      ┆ 0.363636 ┆ 2.26555  ┆ … ┆ null        ┆ 35.627188 ┆ null  ┆ null     │
│ std        ┆ 120.810458  ┆ 0.481622 ┆ 0.841838 ┆ … ┆ null        ┆ 55.907576 ┆ null  ┆ null     │
│ min        ┆ 892.0       ┆ 0.0      ┆ 1.0      ┆ … ┆ 110469      ┆ 0.0       ┆ A11   ┆ C        │
│ 25%        ┆ 996.0       ┆ 0.0      ┆ 1.0      ┆ … ┆ null        ┆ 7.8958    ┆ null  ┆ null     │
│ 50%        ┆ 1101.0      ┆ 0.0      ┆ 3.0      ┆ … ┆ null        ┆ 14.4542   ┆ null  ┆ null     │
│ 75%        ┆ 1205.0      ┆ 1.0      ┆ 3.0      ┆ … ┆ null        ┆ 31.5      ┆ null  ┆ null     │
│ max        ┆ 1309.0      ┆ 1.0      ┆ 3.0      ┆ … ┆ W.E.P. 5734 ┆ 512.3292  ┆ G6    ┆ S        │
└────────────┴─────────────┴──────────┴──────────┴───┴─────────────┴───────────┴───────┴──────────┘

How to Select Columns from the Dataset

Now we’re going to learn how to select certain columns from the dataset and transform those columns into a new DataFrame. This can be useful if we want to train an ML model based on only certain columns and not the entire dataset (that is, using feature selection).

Let’s first look at the code below:

new_df = data.select(
    pl.col("Survived"),
    pl.col("Name"),
    pl.col("Age"),
    pl.col("Sex")
)

print(new_df.head())

Output:

shape: (5, 4)
┌──────────┬─────────────────────────────────┬──────┬────────┐
│ Survived ┆ Name                            ┆ Age  ┆ Sex    │
╞══════════╪═════════════════════════════════╪══════╪════════╡
│ 0        ┆ Kelly, Mr. James                ┆ 34.5 ┆ male   │
│ 1        ┆ Wilkes, Mrs. James (Ellen Need… ┆ 47.0 ┆ female │
│ 0        ┆ Myles, Mr. Thomas Francis       ┆ 62.0 ┆ male   │
│ 0        ┆ Wirz, Mr. Albert                ┆ 27.0 ┆ male   │
│ 1        ┆ Hirvonen, Mrs. Alexander (Helg… ┆ 22.0 ┆ female │
└──────────┴─────────────────────────────────┴──────┴────────┘

In the code above, we selected four columns using the select() and pl.col() functions from the Titanic Dataset and transformed them into a new DataFrame called new_df.

Now, we can filter this data however we want. Let’s make a new DataFrame that keeps only the surviving passengers from the dataset:

survived_data = data.select(
    pl.col("Survived"),
    pl.col("Name"),
    pl.col("Age"),
    pl.col("Sex")
).filter(pl.col("Survived")==1)

print(survived_data.head())

Output:

shape: (5, 4)
┌──────────┬─────────────────────────────────┬──────┬────────┐
│ Survived ┆ Name                            ┆ Age  ┆ Sex    │
╞══════════╪═════════════════════════════════╪══════╪════════╡
│ 1        ┆ Wilkes, Mrs. James (Ellen Need… ┆ 47.0 ┆ female │
│ 1        ┆ Hirvonen, Mrs. Alexander (Helg… ┆ 22.0 ┆ female │
│ 1        ┆ Connolly, Miss. Kate            ┆ 30.0 ┆ female │
│ 1        ┆ Abrahim, Mrs. Joseph (Sophie H… ┆ 18.0 ┆ female │
│ 1        ┆ Snyder, Mrs. John Pillsbury (N… ┆ 23.0 ┆ female │
└──────────┴─────────────────────────────────┴──────┴────────┘

In the above code, we used the filter() function. This function helps us gather data that applies to our given condition. In the above example, we added the condition that, “Every element in the column named ‘Survived’ should be equal to 1”. Hence, we got our required data.

Some Other Important Functions

How to Print the Names of the Columns of a Dataset

You can print the names of the columns using the columns attribute. The following code shows how to use it:

print(data.columns) # data --> Titanic Dataset

Output:

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

How to Index a Dataset

Indexing a dataset means adding an index column to the existing dataset. It can prove useful in keeping track of the rows of the dataset.

We can index the dataset using the with_row_index() function. Inside this function, we can pass the argument to name this new index column. If we don’t pass any argument, the index column name is set as ‘index’ by default.

data = pl.read_csv("/titanic_dataset.csv").with_row_index('#') # naming the index column as '#'
print(data.head())

Output:

shape: (5, 13)
┌─────┬─────────────┬──────────┬────────┬───┬─────────┬─────────┬───────┬──────────┐
│ #   ┆ PassengerId ┆ Survived ┆ Pclass ┆ … ┆ Ticket  ┆ Fare    ┆ Cabin ┆ Embarked │
│ --- ┆ ---         ┆ ---      ┆ ---    ┆   ┆ ---     ┆ ---     ┆ ---   ┆ ---      │
│ u32 ┆ i64         ┆ i64      ┆ i64    ┆   ┆ str     ┆ f64     ┆ str   ┆ str      │
╞═════╪═════════════╪══════════╪════════╪═══╪═════════╪═════════╪═══════╪══════════╡
│ 0   ┆ 892         ┆ 0        ┆ 3      ┆ … ┆ 330911  ┆ 7.8292  ┆ null  ┆ Q        │
│ 1   ┆ 893         ┆ 1        ┆ 3      ┆ … ┆ 363272  ┆ 7.0     ┆ null  ┆ S        │
│ 2   ┆ 894         ┆ 0        ┆ 2      ┆ … ┆ 240276  ┆ 9.6875  ┆ null  ┆ Q        │
│ 3   ┆ 895         ┆ 0        ┆ 3      ┆ … ┆ 315154  ┆ 8.6625  ┆ null  ┆ S        │
│ 4   ┆ 896         ┆ 1        ┆ 3      ┆ … ┆ 3101298 ┆ 12.2875 ┆ null  ┆ S        │
└─────┴─────────────┴──────────┴────────┴───┴─────────┴─────────┴───────┴──────────┘

How to Rename Columns in the Dataset

Lastly, to rename columns in the Dataset, we use the rename() function.

data = pl.read_csv("/titanic_dataset.csv").with_row_index('#').rename({'PassengerId':'renamed_col'})
print(data.head())

Output:

shape: (5, 13)
┌─────┬─────────────┬──────────┬────────┬───┬─────────┬─────────┬───────┬──────────┐
│ #   ┆ renamed_col ┆ Survived ┆ Pclass ┆ … ┆ Ticket  ┆ Fare    ┆ Cabin ┆ Embarked │
│ --- ┆ ---         ┆ ---      ┆ ---    ┆   ┆ ---     ┆ ---     ┆ ---   ┆ ---      │
│ u32 ┆ i64         ┆ i64      ┆ i64    ┆   ┆ str     ┆ f64     ┆ str   ┆ str      │
╞═════╪═════════════╪══════════╪════════╪═══╪═════════╪═════════╪═══════╪══════════╡
│ 0   ┆ 892         ┆ 0        ┆ 3      ┆ … ┆ 330911  ┆ 7.8292  ┆ null  ┆ Q        │
│ 1   ┆ 893         ┆ 1        ┆ 3      ┆ … ┆ 363272  ┆ 7.0     ┆ null  ┆ S        │
│ 2   ┆ 894         ┆ 0        ┆ 2      ┆ … ┆ 240276  ┆ 9.6875  ┆ null  ┆ Q        │
│ 3   ┆ 895         ┆ 0        ┆ 3      ┆ … ┆ 315154  ┆ 8.6625  ┆ null  ┆ S        │
│ 4   ┆ 896         ┆ 1        ┆ 3      ┆ … ┆ 3101298 ┆ 12.2875 ┆ null  ┆ S        │
└─────┴─────────────┴──────────┴────────┴───┴─────────┴─────────┴───────┴──────────┘

In the above example, we renamed the column named ‘PassengerId’ to ‘renamed_col’.

Summary

Now you know how to work with the Polars Python library to analyze your data more effectively.

In this article, you learned:

What Polars is and how to install it
How to define series and DataFrames in Polars
Different functions for working with DataFrames
How to read and work with CSV files in Polars

Thanks for reading, and happy data wrangling!

How to Transform JSON Data to Match Any Schema

Nneoma Uche — Thu, 10 Jul 2025 04:23:53 +0000

Whether you’re transferring data between APIs or just preparing JSON data for import, mismatched schemas can break your workflow. Learning how to clean and normalize JSON data ensures a smooth, error-free data transfer.

This tutorial demonstrates how to clean messy JSON and export the results into a new file, based on a predefined schema. The JSON file we’ll be cleaning contains a dataset of 200 synthetic customer records.

In this tutorial, we’ll apply two methods for cleaning the input data:

With pure Python
With pandas

You can apply either of these in your code. But the pandas method is better for large, complex data sets. Let’s jump right into the process.

Prerequisites

To follow along with this tutorial, you should have a basic understanding of:

Python dictionaries, lists, and loops
JSON data structure (keys, values, and nesting)
How to read and write JSON files with Python’s json module

Add and Inspect the JSON File

Before you begin writing any code, make sure that the .json file you intend to clean is in your project directory. This makes it easy to load in your script using the file name alone.

You can now inspect the data structure by viewing the file locally or loading it in your script, with Python’s built-in json module.

Here’s how (assuming the file name is “old_customers.json”):
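
A minimal sketch of that inspection step, written so it runs without the real file: a tiny inline sample with hypothetical values stands in for old_customers.json, and the commented lines show how the real file would be loaded instead.

```python
import json

# Tiny inline stand-in for old_customers.json (hypothetical values)
sample_text = '{"customers": [{"customer_id": 1, "name": "Jane Doe"}]}'
crm_data = json.loads(sample_text)

# With the real file you would load it like this instead:
# with open("old_customers.json") as file:
#     crm_data = json.load(file)

print(type(crm_data))  # tells you whether the top level is a dict or a list
print(crm_data)        # prints the full contents
```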

This shows you whether the JSON file is structured as a dictionary or a list. It also prints out the entire file in your terminal. Mine is a dictionary that maps to a list of 200 customer entries. You should always open up the raw JSON file in your IDE to get a closer look at its structure and schema.

Define the Target Schema

If someone asks for JSON data to be cleaned, it probably means that the current schema is unsuitable for its intended purpose. At this point, you want to be clear on what the final JSON export should look like.

A JSON schema is essentially a blueprint that describes:

required fields
field names
data type for each field
standardized formats (for example, lowercase emails, trimmed whitespace, etc.)

Here’s what the old schema versus the target schema looks like:

As you can see, the goal is to delete the "customer_id" and "address" fields in each entry and rename the rest from:

"name" to "full_name"
"email" to "email_address"
"phone" to "mobile"
"membership_level" to "tier"

The output should contain four fields per entry instead of six, all renamed to fit the project requirements.
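
One way to keep this mapping explicit is to store it as data. Here is a minimal sketch, assuming a hypothetical RENAME_MAP constant and rename_fields() helper (neither is part of this tutorial's code). Fields that are not listed in the map, such as customer_id and address, are dropped.

```python
# Old field name -> target field name (taken from the list above)
RENAME_MAP = {
    "name": "full_name",
    "email": "email_address",
    "phone": "mobile",
    "membership_level": "tier",
}

def rename_fields(entry):
    """Return a new dict holding only the mapped fields, under their new names."""
    return {new: entry[old] for old, new in RENAME_MAP.items()}

# Hypothetical entry in the old schema
old_entry = {
    "customer_id": 1,
    "name": "Jane Doe",
    "email": "jane@example.com",
    "phone": "555-0100",
    "address": "1 Example Street",
    "membership_level": "gold",
}
new_entry = rename_fields(old_entry)
print(new_entry)
```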

How to Clean JSON Data with Pure Python

Let’s explore using Python’s built-in json module to align the raw data with the predefined schema.

Step 1: Import `json` and `time` modules

Importing json is necessary because we’re working with JSON files. We’ll also use the time module to track how long the data cleaning process takes.

import json
import time

Step 2: Load the file with `json.load()`

start_time = time.time()
with open('old_customers.json') as file:
    crm_data = json.load(file)

Step 3: Write a function to loop through and clean each customer entry in the dictionary

def clean_data(records):
    transformed_records = []
    for customer in records["customers"]:
        # Keep only the four target fields, under their new names
        transformed_records.append({
            "full_name": customer["name"],
            "email_address": customer["email"],
            "mobile": customer["phone"],
            "tier": customer["membership_level"],
        })
    return {"customers": transformed_records}

new_data = clean_data(crm_data)

clean_data() takes in the original data (temporarily) stored in the records variable, transforming it to match our target schema.

Since the JSON file we loaded is a dictionary containing a "customers" key, which maps to a list of customer entries, we access this key and loop through each entry in the list.

In the for loop, we rename the relevant fields and store the cleaned entries in a new list called transformed_records.

Then, we return the dictionary, with the "customers" key intact.

Step 4: Save the output in a .json file

Decide on a name for your cleaned JSON data and assign that to an output_file variable, like so:

output_file = "transformed_data.json"
with open(output_file, "w") as f:
    json.dump(new_data, f, indent=4)

You can also add a print() statement below this block to confirm that the file has been saved in your project directory.

Step 5: Time the data cleaning process

At the beginning of this process, we imported the time module to measure how long it takes to clean up JSON data using pure Python. To track the runtime, we stored the current time in a start_time variable before the cleaning function, and we’ll now include an end_time variable at the end of the script.

The difference between the end_time and start_time values gives you the total runtime in seconds.

end_time = time.time()
elapsed_time = end_time - start_time

print(f"Transformed data saved to {output_file}")
print(f"Processing data took {elapsed_time:.2f} seconds")
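
A side note on the timer itself: time.time() reads the wall clock, which can jump if the system clock is adjusted. For short measurements like this one, Python's time.perf_counter() is often a better fit. Here is a small sketch, where the summing loop is only a placeholder for the cleaning work:

```python
import time

start = time.perf_counter()
total = sum(range(100_000))  # placeholder workload, stands in for the cleaning step
elapsed = time.perf_counter() - start

print(f"Processing data took {elapsed:.4f} seconds")
```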

Here’s how long the data cleaning process took with the pure Python approach:

How to Clean JSON Data with Pandas

Now we’re going to try achieving the same results as above, using Python and a third-party library called pandas. Pandas is an open-source library used for data manipulation and analysis in Python.

To get started, you need to have the pandas library installed in your Python environment. In your terminal, run:

pip install pandas

Then follow these steps:

Step 1: Import the relevant libraries

import json
import time
import pandas as pd

Step 2: Load file and extract customer entries

Unlike the pure Python method, where we simply indexed the key name ”customers” to access the list of customer data, working with pandas requires a slightly different approach.

We must extract the list before loading it into a DataFrame because pandas expects structured data. Extracting the list of customer dictionaries upfront ensures that we isolate and clean the relevant records alone, preventing errors caused by nested or unrelated JSON data.

start_time = time.time()
with open('old_customers.json', 'r') as f:
    crm_data = json.load(f)

#Extract the list of customer entries
clients = crm_data.get("customers", [])

Step 3: Load customer entries into a DataFrame

Once you’ve got a clean list of customer dictionaries, load the list into a DataFrame and assign the DataFrame to a variable, like so:

#Load into a dataframe
df = pd.DataFrame(clients)

This creates a tabular or spreadsheet-like structure, where each row represents a customer. Loading the list into a DataFrame also allows you to access pandas’ powerful data cleaning methods like:

drop_duplicates(): removes duplicate rows or entries from a DataFrame
dropna(): drops rows with any missing or null data
fillna(value): replaces all missing or null data with a specified value
drop(columns): drops unused columns explicitly
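
To see these methods in action, here is a short sketch on a hand-made DataFrame. The rows are hypothetical and exist only to trigger each method. None of these calls change the DataFrame in place, so each result is assigned to a new variable.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration only
rows = [
    {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"},
    {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"},  # exact duplicate
    {"name": "Grace", "email": None, "phone": "555-0101"},             # missing email
]
df = pd.DataFrame(rows)

deduplicated = df.drop_duplicates()              # drops the repeated Ada row
complete_only = deduplicated.dropna()            # drops Grace (missing email)
filled = deduplicated.fillna("N/A")              # keeps Grace, email becomes "N/A"
without_phone = filled.drop(columns=["phone"])   # removes the phone column

print(without_phone)
```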

Step 4: Write a custom function to rename relevant fields

At this point, we need a function that takes in a single customer entry – a row – and returns a cleaned version that fits the target schema (“full_name”, “email_address”, “mobile” and “tier”).

The function should also handle missing data by setting default values like ”Unknown” or ”N/A” when a field is absent.
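
A minimal sketch of such a function follows. The name transform_fields() comes from this tutorial, but the body is an assumption: it reads each field with .get(), which works for both a pandas row (a Series) and a plain dictionary, and the pairing of defaults with fields is an arbitrary choice.

```python
def transform_fields(row):
    """Map one customer entry to the target schema, with defaults for absent fields."""
    return {
        "full_name": row.get("name", "Unknown"),
        "email_address": row.get("email", "N/A"),
        "mobile": row.get("phone", "N/A"),
        "tier": row.get("membership_level", "Unknown"),
    }

# Hypothetical entry with the phone field absent
example = {
    "customer_id": 7,
    "name": "Jane Doe",
    "email": "jane@example.com",
    "membership_level": "gold",
}
print(transform_fields(example))
```

Note that .get() only covers fields that are absent. A field that is present but empty (NaN in a DataFrame) passes through unchanged unless you call fillna() first.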

P.S: At first, I used drop(columns) to explicitly remove the “address” and “customer_id” fields. But it’s not needed in this case, as the transform_fields() function only selects and renames the required fields. Any extra columns are automatically excluded from the cleaned data.

Step 5: Apply schema transformation to all rows

We’ll use pandas' apply() method to apply our custom function to each row in the DataFrame. This creates a Series (for example, 0 → {...}, 1 → {...}, 2 → {...}), which is not JSON-friendly.

Because json.dump() can’t serialize a pandas Series, we’ll call tolist() to convert the Series to a list of dictionaries.

#Apply schema transformation to all rows
transformed_df = df.apply(transform_fields, axis=1)

#Convert series to list of dicts
transformed_data = transformed_df.tolist()

Another way to approach this is with list comprehension. Instead of using apply() at all, you can write:

transformed_data = [transform_fields(row) for row in df.to_dict(orient="records")]

orient="records" is an argument for df.to_dict that tells pandas to convert the DataFrame to a list of dictionaries, where each dictionary represents a single customer record (that is, one row).

Then the for loop iterates through every customer record on the list, calling the custom function on each row. Finally, the list comprehension ([...]) collects the cleaned rows into a new list.

Step 6: Save the output in a .json file

#Save the cleaned data
output_data = {"customers": transformed_data}
output_file = "applypandas_customer.json"
with open(output_file, "w") as f:
    json.dump(output_data, f, indent=4)

I recommend picking a different file name for your pandas output. You can inspect both files side by side to see if this output matches the result you got from cleaning with pure Python.

Step 7: Track runtime

Once again, check for the difference between start time and end time to determine the program’s execution time.

end_time = time.time()
elapsed_time = end_time - start_time

print(f"Transformed data saved to {output_file}")
print(f"Processing data took {elapsed_time:.2f} seconds")

When I used list comprehension to apply the custom function, my script’s runtime was 0.03 seconds, but with pandas’ apply() function, the total runtime dropped to 0.01 seconds.

Final output preview:

If you followed this tutorial closely, your JSON output should look like this – whether you used the pandas method or the pure Python approach:

How to Validate the Cleaned JSON

Validating your output ensures that the cleaned data follows the expected structure before being used or shared. This step helps to catch formatting errors, missing fields, and wrong data types early.

Below are the steps for validating your cleaned JSON file:

Step 1: Install and import `jsonschema`

jsonschema is a third-party validation library for Python. It helps you define the expected structure of your JSON data and automatically check if your output matches that structure.

In your terminal, run:

pip install jsonschema

Import the required libraries:

import json
from jsonschema import validate, ValidationError

validate() checks whether your JSON data matches the rules defined in your schema. If the data is valid, nothing happens. But if there’s an error – like a missing field or wrong data type – it raises a ValidationError.

Step 2: Define a schema

As you know, JSON schema changes with each file structure. If your JSON data differs from what we’ve been working with so far, learn how to create a schema here. Otherwise, the schema below defines the structure we expect for our cleaned JSON:

schema = {
    "type": "object",
    "properties": {
        "customers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "full_name": {"type": "string"},
                    "email_address": {"type": "string"},
                    "mobile": {"type": "string"},
                    "tier": {"type": "string"}
                },
                "required": ["full_name", "email_address", "mobile", "tier"]
            }
        }
    },
    "required": ["customers"]
}

The data is an object that must contain a key: "customers".
"customers" must be an array (a list), with each object representing one customer entry.
Each customer entry must have four fields–all strings:
- "full_name"
- "email_address"
- "mobile"
- "tier"
The "required" fields ensure that none of the relevant fields are missing in any customer record.
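
To make these rules concrete, here is roughly what they amount to in plain Python. This is a hand-written illustration with a hypothetical find_problems() helper, not how jsonschema works internally, and it covers only the rules listed above.

```python
REQUIRED_FIELDS = ("full_name", "email_address", "mobile", "tier")

def find_problems(data):
    """Return a list of human-readable rule violations (empty if there are none)."""
    if not isinstance(data, dict) or "customers" not in data:
        return ['top level must be an object with a "customers" key']
    if not isinstance(data["customers"], list):
        return ['"customers" must be an array']

    problems = []
    for index, entry in enumerate(data["customers"]):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: must be an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in entry:
                problems.append(f"entry {index}: missing {field}")
            elif not isinstance(entry[field], str):
                problems.append(f"entry {index}: {field} must be a string")
    return problems

# Hypothetical data: the second entry lacks "tier" and has a numeric "mobile"
sample = {"customers": [
    {"full_name": "Jane Doe", "email_address": "jane@example.com",
     "mobile": "555-0100", "tier": "gold"},
    {"full_name": "John Roe", "email_address": "john@example.com",
     "mobile": 5550101},
]}
print(find_problems(sample))
```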

Step 3: Load the cleaned JSON file

with open("transformed_data.json") as f:
    data = json.load(f)

Step 4: Validate the data

For this step, we’ll use a try...except block to end the process safely, and display a helpful message if the code raises a ValidationError.

try:
    validate(instance=data, schema=schema)
    print("JSON is valid.")
except ValidationError as e:
    print("JSON is invalid:", e.message)

Pandas vs Pure Python for Data Cleaning

From this tutorial, you can probably tell that using pure Python to clean and restructure JSON is the more straightforward approach. It is fast and ideal for handling small datasets or simple transformations.

But as data grows and becomes more complex, you might need advanced data cleaning methods that Python alone does not provide. In such cases, pandas becomes the better choice. It handles large, complex datasets effectively, providing built-in functions for handling missing data and removing duplicates.

You can study the Pandas cheatsheet to learn more data manipulation methods.

How to Create a Basic CI/CD Pipeline with Webhooks on Linux

Juan P. Romano — Tue, 28 Jan 2025 22:46:46 +0000

In the fast-paced world of software development, delivering high-quality applications quickly and reliably is crucial. This is where CI/CD (Continuous Integration and Continuous Delivery/Deployment) comes into play.

CI/CD is a set of practices and tools designed to automate and streamline the process of integrating code changes, testing them, and deploying them to production. By adopting CI/CD, your team can reduce manual errors, speed up release cycles, and ensure that your code is always in a deployable state.

In this tutorial, we’ll focus on a beginner-friendly approach to setting up a basic CI/CD pipeline using Bitbucket, a Linux server, and Python with Flask. Specifically, we’ll create an automated process that pulls the latest changes from a Bitbucket repository to your Linux server whenever there’s a push or merge to a specific branch.

This process will be powered by Bitbucket webhooks and a simple Flask-based Python server that listens for incoming webhook events and triggers the deployment.

It’s important to note that CI/CD is a vast and complex field, and this tutorial is designed to provide a foundational understanding rather than to be an exhaustive guide.

We’ll cover the basics of setting up a CI/CD pipeline using tools that are accessible to beginners. Just keep in mind that real-world CI/CD systems often involve more advanced tools and configurations, such as containerization, orchestration, and multi-stage testing environments.

By the end of this tutorial, you’ll have a working example of how to automate deployments using Bitbucket, Linux, and Python, which you can build upon as you grow more comfortable with CI/CD concepts.

Why is CI/CD Important?

CI/CD has become a cornerstone of modern software development for several reasons. First and foremost, it accelerates the development process. By automating repetitive tasks like testing and deployment, developers can focus more on writing code and less on manual processes. This leads to faster delivery of new features and bug fixes, which is especially important in competitive markets where speed can be a differentiator.

Another key benefit of CI/CD is reduced errors and improved reliability. Automated testing ensures that every code change is rigorously checked for issues before it’s integrated into the main codebase. This minimizes the risk of introducing bugs that could disrupt the application or require costly fixes later. Automated deployment pipelines also reduce the likelihood of human error during the release process, ensuring that deployments are consistent and predictable.

CI/CD also fosters better collaboration among team members. In traditional development workflows, integrating code changes from multiple developers can be a time-consuming and error-prone process. With CI/CD, code is integrated and tested frequently, often multiple times a day. This means that conflicts are detected and resolved early, and the codebase remains in a stable state. As a result, teams can work more efficiently and with greater confidence, even when multiple contributors are working on different parts of the project simultaneously.

Finally, CI/CD supports continuous improvement and innovation. By automating the deployment process, teams can release updates to production more frequently and with less risk. This enables them to gather feedback from users faster and iterate on their products more effectively.

What We’ll Cover in This Tutorial

In this tutorial, we’ll walk through the process of setting up a simple CI/CD pipeline that automates the deployment of code changes from a Bitbucket repository to a Linux server. Here’s what you’ll learn:

How to configure a Bitbucket repository to send webhook notifications whenever there’s a push or merge to a specific branch.
How to set up a Flask-based Python server on your Linux server to listen for incoming webhook events.
How to write a script that pulls the latest changes from the repository and deploys them to the server.
How to test and troubleshoot your automated deployment process.

By the end of this tutorial, you’ll have a working example of a basic CI/CD pipeline that you can customize and expand as needed. Let’s get started!

Step 1: Set Up a Webhook in Bitbucket

Before starting with the setup, let’s briefly explain what a webhook is and how it fits into our CI/CD process.

A webhook is a mechanism that allows one system to notify another system about an event in real-time. In the context of Bitbucket, a webhook can be configured to send an HTTP request (often a POST request with payload data) to a specified URL whenever a specific event occurs in your repository, such as a push to a branch or a pull request merge.

In our case, the webhook will notify our Flask-based Python server (running on your Linux server) whenever there’s a push or merge to a specific branch. This notification will trigger a script on the server to pull the latest changes from the repository and deploy them automatically. Essentially, the webhook acts as the bridge between Bitbucket and your server, enabling seamless automation of the deployment process.

Now that you understand the role of a webhook, let’s set one up in Bitbucket:

Log in to Bitbucket and navigate to your repository.
On the left-hand sidebar, click on Settings.
Under the Workflow section, find and click on Webhooks.
Click the Add webhook button.
Enter a name for your webhook (for example, "Automatic Pull").
In the URL field, provide the URL to your server where the webhook will send the request. If you’re running a Flask app locally, this would be something like http://your-server-ip/pull-repo. (For production environments, it’s highly recommended to use HTTPS to secure the communication between Bitbucket and your server.)
In the Triggers section, choose the events you want to listen to. For this example, we will select Push (and optionally, Pull Request Merged if you want to deploy after merges, too).
Save the webhook with a self-explanatory name so it’s easy to identify later.

Once the webhook is set up, Bitbucket will send a POST request to the specified URL every time the selected event occurs. In the next steps, we’ll set up a Flask server to handle these incoming requests and trigger the deployment process.

Here is what you should see when you set up the Bitbucket webhook:

Step 2: Set Up the Flask Listener on Your Linux Server

In the next step, you’ll set up a simple web server on your Linux machine that will listen for the webhook from Bitbucket. When it receives the notification, it will execute a git pull or a force pull (in case of local changes) to update the repository.

Install Flask:

To create the Flask application, first install Flask by running:

pip install flask

Create the Flask App:

Create a new Python script (for example, app_repo_pull.py) on your server and add the following code:

from flask import Flask
import subprocess

app = Flask(__name__)

@app.route('/pull-repo', methods=['POST'])
def pull_repo():
    try:
        # Fetch the latest changes from the remote repository
        subprocess.run(["git", "-C", "/path/to/your/repository", "fetch"], check=True)
        # Force reset the local branch to match the remote 'test' branch
        subprocess.run(["git", "-C", "/path/to/your/repository", "reset", "--hard", "origin/test"], check=True)  # Replace 'test' with your branch name
        return "Force pull successful", 200
    except subprocess.CalledProcessError:
        return "Failed to force pull the repository", 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Here’s what this code does:

subprocess.run(["git", "-C", "/path/to/your/repository", "fetch"]): This command fetches the latest changes from the remote repository without affecting the local working directory.
subprocess.run(["git", "-C", "/path/to/your/repository", "reset", "--hard", "origin/test"]): This command performs a hard reset, forcing the local repository to match the remote test branch. Replace test with the name of your branch.

Make sure to replace /path/to/your/repository with the actual path to your local Git repository.
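
Because the repository path and branch name each need replacing in more than one place, it can help to build both commands from two parameters. Here is a sketch, assuming a hypothetical build_force_pull_commands() helper. The path and branch below are the same placeholders used above.

```python
def build_force_pull_commands(repo_path, branch):
    """Return the two git commands (as argument lists) used for a force pull."""
    return [
        ["git", "-C", repo_path, "fetch"],
        ["git", "-C", repo_path, "reset", "--hard", f"origin/{branch}"],
    ]

commands = build_force_pull_commands("/path/to/your/repository", "test")
for command in commands:
    print(" ".join(command))
    # Inside the Flask route you would run each one with:
    # subprocess.run(command, check=True)
```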

Step 3: Expose the Flask App (Optional)

If you want the Flask app to be accessible from outside your server, you need to expose it publicly. For this, you can set up a reverse proxy with NGINX. Here's how to do that:

First, install NGINX if you don't have it already by running this command:

sudo apt-get install nginx

Next, you’ll need to configure NGINX to proxy requests to your Flask app. Open the NGINX configuration file:

sudo nano /etc/nginx/sites-available/default

Modify the configuration to include this block:

server {
    listen 80;
    server_name your-server-ip;

    location /pull-repo {
        proxy_pass http://localhost:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Now just reload NGINX to apply the changes:

sudo systemctl reload nginx

Step 4: Test the Setup

Now that everything is set up, go ahead and start the Flask app by executing this Python script:

python3 app_repo_pull.py

Now to test if everything is working:

Make a commit: Push a commit to the test branch in your Bitbucket repository. This action will trigger the webhook.
Webhook trigger: The webhook will send a POST request to your server. The Flask app will receive this request, perform a force pull from the test branch, and update the local repository.
Verify the pull: Check the log output of your Flask app or inspect the local repository to verify that the changes have been pulled and applied successfully.
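
You can also exercise the endpoint without pushing a commit by sending a POST request yourself. The sketch below only builds the request with Python's standard library. The URL is the placeholder from Step 1, the JSON body is a made-up stand-in (the listener above ignores the body), and build_test_request() is a hypothetical helper.

```python
import json
import urllib.request

def build_test_request(url):
    """Build a POST request that mimics a webhook call (body content is made up)."""
    body = json.dumps({"test": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_test_request("http://your-server-ip/pull-repo")
print(request.get_method(), request.full_url)

# To actually send it once your server address is filled in:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status, response.read().decode())
```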

Step 5: Security Considerations

When exposing a Flask app to the internet, securing your server and application is crucial to protect it from unauthorized access, data breaches, and attacks. Here are the key areas to focus on:

1. Use a Secure Server with Proper Firewall Rules

A secure server is one that is configured to minimize exposure to external threats. This involves using firewall rules, minimizing unnecessary services, and ensuring that only required ports are open for communication.

Example of a secure server setup:

Minimal software: Only install the software you need (for example, Python, Flask, NGINX) and remove unnecessary services.
Operating system updates: Ensure your server's operating system is up-to-date with the latest security patches.
Firewall configuration: Use a firewall to control incoming and outgoing traffic and limit access to your server.

For example, a basic UFW (Uncomplicated Firewall) configuration on Ubuntu might look like this:

# Allow SSH (port 22) for remote access
sudo ufw allow ssh

# Allow HTTP (port 80) and HTTPS (port 443) for web traffic
sudo ufw allow http
sudo ufw allow https

# Enable the firewall
sudo ufw enable

# Check the status of the firewall
sudo ufw status

In this case:

The firewall allows incoming SSH connections on port 22, HTTP on port 80, and HTTPS on port 443.
Any unnecessary ports or services should be blocked by default to limit exposure to attacks.

Additional Firewall Rules:

Limit access to webhook endpoint: Ideally, only allow traffic to the webhook endpoint from Bitbucket's IP addresses to prevent external access. You can set this up in your firewall or using your web server (for example, NGINX) by only accepting requests from Bitbucket's IP range.
Deny all other incoming traffic: For any service that does not need to be exposed to the internet (for example, database ports), ensure those ports are blocked.
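
If you would rather enforce this in the application, Python's ipaddress module can check whether a caller falls inside an allowed range. The networks below are reserved documentation ranges used as placeholders, not Bitbucket's real addresses, so look up the current list in Atlassian's documentation. is_allowed() is a hypothetical helper.

```python
import ipaddress

# Placeholder networks (reserved for documentation); replace with the real ranges
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_allowed(remote_addr):
    """Return True if remote_addr is a valid IP inside one of the allowed networks."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # not a valid IP address at all
    return any(address in network for network in ALLOWED_NETWORKS)

print(is_allowed("192.0.2.10"))
print(is_allowed("203.0.113.5"))
```

In a Flask route, request.remote_addr holds the caller's address. Behind the NGINX proxy from Step 3, that will be NGINX's own address, so you would read the forwarded address instead (for example, the X-Real-IP header set in that configuration).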

2. Add Authentication to the Flask App

Since your Flask app will be publicly accessible via the webhook URL, you should consider adding authentication to ensure only authorized users (such as Bitbucket's servers) can trigger the pull.

Basic Authentication Example:

You can use a simple token-based authentication to secure your webhook endpoint. Here’s an example of how to modify your Flask app to require an authentication token:

from flask import Flask, request, abort
import hmac
import subprocess

app = Flask(__name__)

# Define a secret token for webhook verification
SECRET_TOKEN = 'your-secret-token'

@app.route('/pull-repo', methods=['POST'])
def pull_repo():
    # Check if the request contains the correct token.
    # hmac.compare_digest compares in constant time, so the response time
    # does not leak how much of the token was guessed correctly.
    token = request.headers.get('X-Hub-Signature', '')
    if not hmac.compare_digest(token.encode('utf-8'), SECRET_TOKEN.encode('utf-8')):
        abort(403)  # Forbidden if the token is missing or incorrect

    try:
        # Fetch the latest commits, then reset the working tree to origin/test
        subprocess.run(["git", "-C", "/path/to/your/repository", "fetch"], check=True)
        subprocess.run(["git", "-C", "/path/to/your/repository", "reset", "--hard", "origin/test"], check=True)
        return "Force pull successful", 200
    except subprocess.CalledProcessError:
        return "Failed to force pull the repository", 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

How it works:

The X-Hub-Signature is a custom header that you add to the request when setting up the webhook in Bitbucket.
Only requests with the correct token will be allowed to trigger the pull. If the token is missing or incorrect, the request is rejected with a 403 Forbidden response.

You can also use more complex forms of authentication, such as OAuth or HMAC (Hash-based Message Authentication Code), but this simple token approach works for many cases.
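To give a feel for the HMAC option, here is a minimal sketch using only the standard library. It assumes the sender signs the raw request body with HMAC-SHA256 and sends the hex digest in a header formatted as sha256=<digest>. That header format and the verify_signature helper name are assumptions for illustration, so check your provider's webhook documentation for the exact scheme.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check a 'sha256=<hex digest>' header against the raw request body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest runs in constant time to avoid timing side channels
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.encode("utf-8")
    )


secret = "your-secret-token"
body = b'{"push": "example"}'
good_header = "sha256=" + hmac.new(
    secret.encode("utf-8"), body, hashlib.sha256
).hexdigest()

print(verify_signature(secret, body, good_header))   # True
print(verify_signature(secret, body, "sha256=bad"))  # False
```

Unlike a static token, the signature depends on the body, so a captured header cannot be reused to authorize a different payload. In a Flask view, the raw body is available from request.get_data().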

3. Use HTTPS for Secure Communication

It’s crucial to encrypt the data transmitted between your Flask app and the Bitbucket webhook, as well as any sensitive data (such as tokens or passwords) being transmitted over the network. This ensures that attackers cannot intercept or modify the data.

Why HTTPS?

Data encryption: HTTPS encrypts the communication, ensuring that sensitive data like your authentication token is not exposed to man-in-the-middle attacks.
Trust and integrity: HTTPS helps ensure that the data received by your server hasn’t been tampered with.

Using Let’s Encrypt to Secure Your Flask App with SSL:

Install Certbot (the tool for obtaining Let’s Encrypt certificates):

sudo apt-get update
sudo apt-get install certbot python3-certbot-nginx

Obtain a free SSL certificate for your domain:

sudo certbot --nginx -d your-domain.com

This command will automatically configure Nginx to use HTTPS with a free SSL certificate from Let’s Encrypt.
Ensure HTTPS is used: Make sure that your Flask app or Nginx configuration forces all traffic to use HTTPS. You can do this by setting up a redirection rule in Nginx:

server {
    listen 80;
    server_name your-domain.com;

    # Redirect HTTP to HTTPS
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name your-domain.com;

    ssl_certificate /etc/letsencrypt/live/your-domain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/your-domain.com/privkey.pem;

    # Other Nginx configuration...
}

Automatic Renewal: Let’s Encrypt certificates are valid for 90 days, so it’s important to set up automatic renewal:

sudo certbot renew --dry-run

This command tests the renewal process to make sure everything is working.

4. Logging and Monitoring

Implement logging and monitoring for your Flask app to track any unauthorized attempts, errors, or unusual activity:

Log requests: Log all incoming requests, including the IP address, request headers, and response status, so you can monitor for any suspicious activity.
Use monitoring tools: Set up tools like Prometheus, Grafana, or New Relic to monitor server performance and app health.

Wrapping Up

In this tutorial, we explored how to set up a simple, beginner-friendly CI/CD pipeline that automates deployments using Bitbucket, a Linux server, and Python with Flask. Here’s a recap of what you’ve learned:

CI/CD Fundamentals: We discussed the basics of Continuous Integration (CI) and Continuous Delivery/Deployment (CD), which are essential practices for automating the integration, testing, and deployment of code. You learned how CI/CD helps speed up development, reduce errors, and improve collaboration among developers.
Setting Up Bitbucket Webhooks: You learned how to configure a Bitbucket webhook to notify your server whenever there’s a push or merge to a specific branch. This webhook serves as a trigger to initiate the deployment process automatically.
Creating a Flask-based Webhook Listener: We showed you how to set up a Flask app on your Linux server to listen for incoming webhook requests from Bitbucket. This Flask app receives the notifications and runs the necessary Git commands to pull and deploy the latest changes.
Automating the Deployment Process: Using Python and Flask, we automated the process of pulling changes from the Bitbucket repository and performing a force pull to ensure the latest code is deployed. You also learned how to configure the server to expose the Flask app and accept requests securely.
Security Considerations: We covered critical security steps to protect your deployment process:
- Firewall Rules: We discussed configuring firewall rules to limit exposure and ensure only authorized traffic (from Bitbucket) can access your server.
- Authentication: We added token-based authentication to ensure only authorized requests can trigger deployments.
- HTTPS: We explained how to secure the communication between your server and Bitbucket using SSL certificates from Let's Encrypt.
- Logging and Monitoring: Lastly, we recommended setting up logging and monitoring to keep track of any unusual activity or errors.

Next Steps

You now have a working example of an automated deployment pipeline. While this is a basic implementation, it serves as a foundation you can build on. As you grow more comfortable with CI/CD, you can explore advanced topics like:

Multi-stage deployment pipelines
Integration with containerization tools like Docker
More complex testing and deployment strategies
Use of orchestration tools like Kubernetes for scaling

CI/CD practices are continually evolving, and by mastering the basics, you’ve set yourself up for success as you expand your skills in this area. Happy automating and thank you for reading!

You can fork the code from here.

Python’s zip() Function Explained with Simple Examples

Sahil — Thu, 10 Oct 2024 14:58:09 +0000

The zip() function in Python is a neat tool that allows you to combine multiple lists or other iterables (like tuples, sets, or even strings) into one iterable of tuples. Think of it like a zipper on a jacket that brings two sides together.

In this guide, we’ll explore the ins and outs of the zip() function with simple, practical examples that will help you understand how to use it effectively.

How Does the `zip()` Function Work?

The zip() function pairs elements from multiple iterables, like lists, based on their positions. This means that the first elements of each list will be paired, then the second, and so on. If the iterables are not the same length, zip() will stop at the end of the shortest iterable.

The syntax for zip() is pretty straightforward:

zip(*iterables)

You can pass in multiple iterables (lists, tuples, and so on), and it will combine them into tuples.

Example 1: Combining Two Lists

Let’s start with a simple case where we have two lists, and we want to combine them. Imagine you have a list of names and a corresponding list of scores, and you want to pair them up.

# Two lists to combine
names = ["Alice", "Bob", "Charlie"]
scores = [85, 90, 88]

# Using zip() to combine them
zipped = zip(names, scores)

# Convert the result to a list so we can see it
zipped_list = list(zipped)
print(zipped_list)

In this example, the zip() function takes the two lists—names and scores—and pairs them element by element. The first element from names ("Alice") is paired with the first element from scores (85), and so on. When we convert the result into a list, it looks like this:

Output:

[('Alice', 85), ('Bob', 90), ('Charlie', 88)]

This makes it easy to work with related data in a structured way.
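One common follow-up, shown here as a small sketch rather than part of the original example, is to pass the zipped pairs to dict() so that each name becomes a key and each score its value. The variable name score_by_name is only illustrative.

```python
names = ["Alice", "Bob", "Charlie"]
scores = [85, 90, 88]

# dict() accepts any iterable of (key, value) pairs, which is what zip() yields
score_by_name = dict(zip(names, scores))

print(score_by_name)         # {'Alice': 85, 'Bob': 90, 'Charlie': 88}
print(score_by_name["Bob"])  # 90
```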

Example 2: What Happens When the Lists Are Uneven?

Let’s say you have lists of different lengths. What happens then? The zip() function is smart enough to stop as soon as it reaches the end of the shortest list.

# Lists of different lengths
fruits = ["apple", "banana"]
prices = [100, 200, 150]

# Zipping them together
result = list(zip(fruits, prices))
print(result)

In this case, the fruits list has two elements, and the prices list has three. But zip() will only combine the first two elements, ignoring the extra value in prices.

Output:

[('apple', 100), ('banana', 200)]

Notice how the last value (150) in the prices list is ignored because there’s no third fruit to pair it with. The zip() function ensures that you don’t get errors when working with uneven lists, but it also means you might lose some data if your lists are not balanced.
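If silently dropping data is not what you want, the standard library offers two alternatives. The sketch below uses itertools.zip_longest() to pad the shorter list, and the strict=True argument (available from Python 3.10) to raise an error on a length mismatch. The fill value "unknown" is just a placeholder chosen for this example.

```python
import sys
from itertools import zip_longest

fruits = ["apple", "banana"]
prices = [100, 200, 150]

# Pad the shorter iterable instead of truncating the longer one
padded = list(zip_longest(fruits, prices, fillvalue="unknown"))
print(padded)  # [('apple', 100), ('banana', 200), ('unknown', 150)]

# strict=True exists only in Python 3.10 and later
if sys.version_info >= (3, 10):
    try:
        list(zip(fruits, prices, strict=True))
    except ValueError as error:
        print(f"Length mismatch: {error}")
```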

Example 3: Unzipping a Zipped Object

What if you want to reverse the zip() operation? For example, after zipping two lists together, you might want to split them back into individual lists. You can do this easily using the unpacking operator *.

# Zipped lists
cities = ["New York", "London", "Tokyo"]
populations = [8000000, 9000000, 14000000]

zipped = zip(cities, populations)

# Unzipping them
unzipped_cities, unzipped_populations = zip(*zipped)

print(unzipped_cities)
print(unzipped_populations)

Here, we first zip the cities and populations lists together. Then, using zip(*zipped), we can "unzip" the combined tuples back into two separate sequences. The * operator unpacks the zipped pairs as separate arguments to zip(), which regroups them by position. Note that the results are tuples, not lists, as the output shows.

Output:

('New York', 'London', 'Tokyo')
(8000000, 9000000, 14000000)

This shows how you can reverse the zipping process to get the original data back.
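One detail worth keeping in mind: a zip object is an iterator, so it can only be consumed once. The short sketch below shows the second pass coming back empty, and how storing the pairs in a list first lets you reuse them.

```python
cities = ["New York", "London", "Tokyo"]
populations = [8000000, 9000000, 14000000]

zipped = zip(cities, populations)
first_pass = list(zipped)   # Consumes the iterator
second_pass = list(zipped)  # Nothing left to read

print(first_pass)   # [('New York', 8000000), ('London', 9000000), ('Tokyo', 14000000)]
print(second_pass)  # []

# Store the pairs in a list if you need to use them more than once
pairs = list(zip(cities, populations))
unzipped_cities, unzipped_populations = zip(*pairs)
print(unzipped_cities)  # ('New York', 'London', 'Tokyo')
print(len(pairs))       # 3, the list is still intact
```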

Example 4: Zipping More Than Two Lists

You aren’t limited to just two lists with zip(). You can zip together as many iterables as you want. Here’s an example with three lists.

# Three lists to zip
subjects = ["Math", "English", "Science"]
grades = [88, 79, 92]
teachers = ["Mr. Smith", "Ms. Johnson", "Mrs. Lee"]

# Zipping three lists together
zipped_info = zip(subjects, grades, teachers)

# Convert to a list to see the result
print(list(zipped_info))

In this example, we are zipping three lists—subjects, grades, and teachers. The first item from each list is grouped together, then the second, and so on.

Output:

[('Math', 88, 'Mr. Smith'), ('English', 79, 'Ms. Johnson'), ('Science', 92, 'Mrs. Lee')]

This way, you can combine multiple related pieces of information into easy-to-handle tuples.

Example 5: Zipping Strings

Strings are also iterables in Python, so you can zip over them just like you would with lists. Let’s try combining two strings.

# Zipping two strings
str1 = "ABC"
str2 = "123"

# Zipping the characters together
zipped_strings = list(zip(str1, str2))
print(zipped_strings)

Here, the first character of str1 is combined with the first character of str2, and so on.

Output:

[('A', '1'), ('B', '2'), ('C', '3')]

This is especially useful if you need to process or pair characters from multiple strings together.

Example 6: Zipping Dictionaries

Although dictionaries are slightly different from lists, you can still use zip() to combine them. By default, zip() will only zip the dictionary keys. Let’s look at an example:

# Two dictionaries
dict1 = {"name": "Alice", "age": 25}
dict2 = {"name": "Bob", "age": 30}

# Zipping dictionary keys
zipped_keys = list(zip(dict1, dict2))
print(zipped_keys)

Here, zip() pairs up the keys from both dictionaries.

Output:

[('name', 'name'), ('age', 'age')]

If you want to zip the values of the dictionaries, you can do that using the .values() method:

zipped_values = list(zip(dict1.values(), dict2.values()))
print(zipped_values)

Output:

[('Alice', 'Bob'), (25, 30)]

Now you can easily combine the values of the two dictionaries.

Example 7: Using `zip()` in Loops

One of the most common uses of zip() is in loops when you want to process multiple lists at the same time. Here’s an example:

# Lists of names and scores
names = ["Alice", "Bob", "Charlie"]
scores = [85, 90, 88]

# Using zip() in a loop
for name, score in zip(names, scores):
    print(f"{name} scored {score}")

This loop iterates over both the names and scores lists simultaneously, pairing up each name with its corresponding score.

Output:

Alice scored 85
Bob scored 90
Charlie scored 88

Using zip() in loops like this makes your code cleaner and easier to read when working with related data.
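If you also need a running position, zip() combines well with enumerate(). This is a small sketch building on the same lists. Note the parentheses around (name, score), which unpack each pair that zip() produces. The lines list is only there to collect the output.

```python
names = ["Alice", "Bob", "Charlie"]
scores = [85, 90, 88]

lines = []
# enumerate() yields (index, item); each item here is a (name, score) tuple
for position, (name, score) in enumerate(zip(names, scores), start=1):
    lines.append(f"{position}. {name} scored {score}")

print("\n".join(lines))
# 1. Alice scored 85
# 2. Bob scored 90
# 3. Charlie scored 88
```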

Conclusion

The zip() function is a handy tool in Python that lets you combine multiple iterables into tuples, making it easier to work with related data. Whether you're pairing up items from lists, tuples, or strings, zip() simplifies your code and can be especially useful in loops.

With the examples in this article, you should now have a good understanding of how to use zip() in various scenarios.

If you found this explanation of Python's zip() function helpful, you might also enjoy more in-depth programming tutorials and concepts I cover on my blog.

Happy coding!

How to Build Good Coding Habits as a New Python Developer

Eleanor Hecks — Tue, 20 Aug 2024 20:46:01 +0000

When you're starting out as a new Python developer, you'll likely develop some habits, both good and bad.

Coding is something of an art form. Flexibility and customization are encouraged — and you can usually write code how you want within the context of the language.

The problem is, you're communicating with the computer publicly. You need to write your code in a way that makes sense to others.

Also, using improper syntax or not ensuring you’re writing effectively can lead to errors in your programming. Messy code makes it extremely difficult to find those errors later. Readable, clean writing is the way to go, which means forming good coding habits early on so you’re following them throughout your entire career.

Here are six tips for building good coding habits as you start out in Python.

1. Follow the PEP 8 Style Guide

Copywriters and other content writers typically use something called a style guide. A style guide sets rules about the formatting and organization of the text. It might explain whether to use the Oxford comma or when to use title caps and other structured approaches.

Python has a style guide just like this, known as PEP 8, PEP8, or PEP-8. Several skilled Python developers published the guide in 2001 to share how to write perfectly readable and consistent code.

Some tenets include:

Using proper indentation techniques.
Staying below the maximum line length of 79 characters.
Using line breaks.
Employing blank lines — double or single — for functions, class, and method definitions.
Using proper naming conventions for variables, classes, functions, and so on.

If you haven’t yet, read through the Python PEP 8 style guide and make sure you’re following the techniques.
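As a rough illustration of a few of these conventions, the sketch below uses four-space indentation, two blank lines around top-level definitions, one blank line between methods, and PEP 8 naming. The ShoppingCart class is a made-up example, not something from the guide itself.

```python
# Classes use CapWords; functions, methods, and variables use
# lowercase_with_underscores.
class ShoppingCart:
    """A tiny cart used only to illustrate naming and spacing."""

    def __init__(self):
        self.item_prices = []

    def add_item(self, price):
        self.item_prices.append(price)

    def total_price(self):
        return sum(self.item_prices)


cart = ShoppingCart()
cart.add_item(4.50)
cart.add_item(5.50)
print(cart.total_price())  # 10.0
```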

2. Use the Newest Python Version

Programming languages like Python go through many iterations during their life cycles. Old versions are typically phased out for newer releases. Generally, the newest release includes bug fixes, as well as security or performance improvements.

At a minimum, use Python 3 over Python 2, as the older version has reached end-of-life status as of January 2020. Also, when working with third-party modules, frameworks or repositories, always reference the Minimum Required Python Version. This is the oldest version of Python that is compatible with the related components.
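A script can check the running interpreter against a minimum version using sys.version_info. In this sketch, the (3, 8) minimum and the meets_minimum helper are hypothetical. Use whatever minimum your dependencies actually document.

```python
import sys

# Hypothetical minimum for illustration; use the one your dependencies state
MINIMUM_VERSION = (3, 8)


def meets_minimum(minimum=MINIMUM_VERSION):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


current = ".".join(str(part) for part in sys.version_info[:3])
if meets_minimum():
    print(f"Python {current} is new enough.")
else:
    print(f"Python {current} is older than the required minimum.")
```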

3. Always Comment Out Specific Code

In the moment as you’re writing your code, you know what you’re trying to achieve. When you read that code later, you might forget — or worse yet, if someone else is reading that code, they might find themselves perplexed. That’s what comments are for.

Every language has a way to add comments to code. The idea is to use descriptive yet succinct comments to explain what’s happening. Some developers forget to do this entirely, but if you start early and always follow the rule, you’ll be able to write code that is easy to follow.

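In Python, you use a “#” symbol at the start of a comment, and everything after it on that line is ignored. To write a multi-line comment, you can start each line with a “#”. You can also wrap the text in triple quotes (''') at the beginning and end. Strictly speaking, that creates a string literal rather than a comment, but Python evaluates it and discards it when it isn’t assigned to anything.

# This is a regular comment.

'''
This is a multi-line comment.
To explain what the code is doing.
'''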

Commenting can be a vital part of the coding process as it allows you to better remember and visualize the ideas going through your mind as you’re coding.

According to experts, handwriting your notes and then transcribing them digitally through things like commenting improves your retention by 75 percent. This means, when you discover a bug or want to make improvements later, you can more easily recall the relevant code snippets.

Inline comments can also appear in the same line as a point of code. For example:

print("Hello World. This is my first code.")  # This is how you create an inline comment

4. Use a Linter

A Python linter reviews code spacing, line length and various design qualifications like argument positioning. As a result, your code looks clean, organized and consistently written across multiple files in your project.

Bear in mind that a linter is different from an auto-formatter or beautifier, although in modern coding the same tool may handle both of these support functions. You can think of a linter as something that flags practical issues, while an auto-formatter fixes more of the styling.

Linters can analyze and identify coding errors, potential bugs, misspellings or syntax problems, but also stylistic inconsistencies, such as how you’re using indents and spacing. Auto-formatters focus on the writing or stylistic part of syntax like commas, quotes, proper line length and so on. Both are helpful, but you seldom want to code without a linter handy.

Some examples of the best Python linters include Pylint, Flake8, Ruff, Xenon and Radon, among others. The linter used in the following screenshot is Ruff, installed via VSCode.

5. Rely on Built-In Functions and Libraries

The beauty of Python and languages like it is that you’re never starting from scratch. You don’t have to write every single function or achievement yourself — instead, you can rely on built-in functions, libraries, frameworks, and repositories.

Built-in functions save you time, give you working functions, and are generally managed by a group of developers. More importantly, they boost the performance of your code and software. You can reference the official Python documentation to see built-in language functions.

Some examples include:

append(): A built-in list method that takes a single item and adds it to the end of an existing list, increasing the list’s length by one
eval(): Evaluates a string as a Python expression and returns the result
id(): Returns the unique identity of an object, an integer that stays constant for the object’s lifetime
max(): Returns the largest item in an iterable, or the largest of two or more given values
print(): Displays text or other values on the Python console
round(): Rounds a number to a given number of decimal places (to the nearest value, not always up)

Using the most common beginner’s tutorial, when you use the print() function, it looks something like this:

print("Hello world I am coding.")

That will print:

Hello world I am coding.

That built-in function will always be recognized regardless of the IDE or coding environment you’re using, which applies to all built-in functions from append() to round().
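To see the other examples from the list in action, here is a short sketch. The values are arbitrary. Note that append() is called on a list object, while the rest are called directly, and that eval() should never be given input you don’t trust.

```python
numbers = [3, 7, 5]
numbers.append(9)         # A list method: called on the list itself
print(numbers)            # [3, 7, 5, 9]

print(max(numbers))       # 9
print(eval("2 + 3"))      # 5
print(round(3.14159, 2))  # 3.14
print(round(2.5))         # 2, because ties go to the nearest even number

alias = numbers           # Both names refer to the same list object
print(id(alias) == id(numbers))  # True
```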

On the other hand, libraries are numerous and varied — they’re much larger collections of pre-written code or functions. To use or reference libraries and their functions, you merely import them into your Python script. Examples are Requests, FastAPI, Asyncio, aiohttp, Tkinter, and more.

6. Fix Code Issues as Soon as Possible

When writing code, if you notice something is awry, fix it right then and there. Don’t put it off or wait until you’re testing later. You might misplace the bug or error — and imagine if you cannot find it again. Between 23%-42% of a developer’s time is wasted due to bad code, which is valuable time you could be spending elsewhere.

Most of all, bugs and errors compound over time, so the longer you leave it, the more likely entire segments of your code will error out or stop working. Many IDEs and linters can help with this process, especially if you’re using the logging module instead of merely printing results.

Python’s logging module tracks events during runtime — when a program is running. Essentially, this allows you to identify problems or errors while testing your code. It may flag warnings pertaining to errors, debugging or code-related events, but it can also help you understand the runtime behavior of your project — all things you might overlook during the writing process.

You can see and analyze user interactions, for example, especially if external users are testing your application. Most importantly, the logging module is an audit tool that’s invaluable once you start testing or running the code you’ve written. Don’t code without it.
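As a starting point, a minimal logging setup might look like the sketch below. The safe_divide function and the log format are made up for illustration. The point is that each message carries a timestamp and severity level, and logger.exception() records the full traceback.

```python
import logging

# Configure logging once, near the start of the program
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)


def safe_divide(a, b):
    """Divide a by b, logging the failure instead of crashing."""
    logger.info("Dividing %s by %s", a, b)
    try:
        return a / b
    except ZeroDivisionError:
        # Logs at ERROR level and includes the traceback
        logger.exception("Division failed")
        return None


print(safe_divide(10, 2))  # 5.0
print(safe_divide(1, 0))   # None
```

Compared with print(), the log level can be raised or lowered in one place, so debugging output can stay in the code without cluttering normal runs.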

Practice Makes Perfect

There are many things to consider when working with Python, and it doesn’t matter how skilled or adept you are. Following Python best practices is always the way to go. But in the end, the best way to learn is always to take a hands-on approach, which means practice.

Continue using Python, even just to create simple or small projects for yourself. Practice using the habits discussed here and writing clean code. You should also read code from other developers to see how they approach the process.
