<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ dataframe - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ dataframe - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 25 May 2026 10:49:28 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/dataframe/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Use the Polars Library in Python for Data Analysis ]]>
                </title>
                <description>
                    <![CDATA[ In this article, I’ll give you a beginner-friendly introduction to the Polars library in Python. Polars is an open-source library, originally written in Rust, which makes data wrangling easier in Python. The syntax of Polars is very similar to Pandas... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-the-polars-library-in-python-for-data-analysis/</link>
                <guid isPermaLink="false">6939b88a5a4b3354fde8c07b</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python 3 ]]>
                    </category>
                
                    <category>
                        <![CDATA[ python beginner ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Polars ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Programming Blogs ]]>
                    </category>
                
                    <category>
                        <![CDATA[ dataset ]]>
                    </category>
                
                    <category>
                        <![CDATA[ dataframe ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Sara Jadhav ]]>
                </dc:creator>
                <pubDate>Wed, 10 Dec 2025 18:14:34 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765325732081/94ab547b-fdaf-41bb-ae60-ad03be31211a.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In this article, I’ll give you a beginner-friendly introduction to the Polars library in Python.</p>
<p>Polars is an open-source library, originally written in Rust, which makes data wrangling easier in Python. The syntax of Polars is very similar to Pandas, so if you’ve worked with Pandas or the PySpark library before, using Polars should be a breeze.</p>
<p>Polars excels at giving fast results. It’s also memory efficient and helps you optimize your code using parallelism. It also lets you convert data from and to various libraries like NumPy, Pandas, and others.</p>
<p>In this tutorial, we’ll be learning about the Polars Library from absolute scratch, from installing and importing the library on the system, to manipulating data in a dataset with the help of this library.</p>
<p>First, we’ll look at Polars basic functions. We’ll be also writing some practical code, which will help you apply what you’ve learned. Finally, we’ll be working with an example dataset to solidify some more key Polars concepts. Let’s dive in.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-installing-and-importing-the-polars-library">Installing and Importing the Polars Library</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-a-series">What is a Series?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-a-dataframe">What is a DataFrame?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-read-csv-files-with-polars">How to Read CSV Files with Polars</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-some-other-important-functions">Some other Important Functions</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-summary">Summary</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>Even though this tutorial is beginner-friendly, having some basic knowledge of the following areas will help you understand this article better:</p>
<ul>
<li><p>Basic Python syntax</p>
</li>
<li><p>Data structures</p>
</li>
<li><p>Ability to import libraries and knowledge of using functions and methods</p>
</li>
<li><p>Basics of NumPy and Pandas will come in handy (not necessary).</p>
</li>
</ul>
<p>Now, that you’re aware of the prior requirements to follow along, let’s get started with our tutorial.</p>
<h2 id="heading-installing-and-importing-the-polars-library">Installing and Importing the Polars Library</h2>
<p>To install the Polars library, you can use the following command in your terminal:</p>
<p><code>pip install polars</code></p>
<p>Now, this works if you already have the pip package manager on your system. If you’re on a conda environment, you can work with this:</p>
<p><code>conda install -c conda-forge polars</code></p>
<p>But I strongly recommend using the pip package manager to avoid various inconveniences.</p>
<p>Let’s import Polars in our program. We’ll follow the same process as we use for importing other libraries in Python:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl <span class="hljs-comment"># pl is a conventional alias</span>
</code></pre>
<p>While creating a Polars object with the data, it’s important to know the size of our data. Polars has the capacity to have 2³² rows in the DataFrame. To load more data, use the following command to install the Polars library:</p>
<p><code>pip install polars[rt64]</code></p>
<p>If you want to use the Polars library right away without actually installing it on your system, using a Google Colab notebook is the best option. When using a Google Colab Notebook, you can directly import and start using Polars in your program. I’ll be using Google Colab Notebook for this tutorial.</p>
<h2 id="heading-what-is-a-series">What is a Series?</h2>
<p>A series is a fundamental element of a DataFrame. It’s a 1-dimensional data-structure that you can correlate with a ‘list’ in Python or a ‘1-D array’ in NumPy. But the difference between a series and a 1-D array is that the former is labeled while the later is not. Many series come together to form a DataFrame.</p>
<p>We can create a series with homogenous data as well as heterogenous data.</p>
<h3 id="heading-creating-a-series-with-homogenous-data">Creating a Series with Homogenous Data</h3>
<p>In a series, the datatype of all the elements should be the same. If it’s not, an error is thrown.</p>
<p>The syntax to define a Polars series is as follows:</p>
<p><code>var_name = pl.Series(“column_name”, [values])</code></p>
<p>The following code shows an example of a homogenous series definition in Python:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
series_homo = pl.Series(<span class="hljs-string">"Numbers"</span>, [<span class="hljs-string">'One'</span>, <span class="hljs-string">'Two'</span>, <span class="hljs-string">'Three'</span>, <span class="hljs-string">'Four'</span>, <span class="hljs-string">'Five'</span>])
print(series_homo)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (5,)
Series: 'Numbers' [str]
[
    "One"
    "Two"
    "Three"
    "Four"
    "Five"
]
</code></pre>
<p>In the above code, we first imported the Polars library using the <code>pl</code> alias to start using it throughout the code. Using aliases is a matter of choice, but <code>pl</code> is a conventional one (like <code>np</code> for NumPy and <code>pd</code> for Pandas). The benefit of using conventional aliases is that when you hand over the code to someone else, it’s easy for them to follow along.</p>
<p>Next, we used the <code>pl.Series()</code> function to create a Polars series object. As its first parameter, we passed the label for our series (<code>Numbers</code> in this case). Then we passed the values to be stores in the form of a list. Remember that the list of values that we pass acts as a single argument. Finally, we printed our series.</p>
<p>We can see that the output tells us about the dimensions of the the Polars object as well as the datatype of the series. The shape (rows, columns) tells us about the the number of rows and columns present in the Polars object.</p>
<p>We can find the data-type of a homogenous series explicitly by using the <code>dtype</code> method.</p>
<pre><code class="lang-python">print(series_homo.dtype)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">String
</code></pre>
<h3 id="heading-creating-a-series-with-heterogenous-data">Creating a Series with Heterogenous Data</h3>
<p>Heterogenous data means that the data-type of all the elements is not the same. The syntax to define a series with heterogenous data is as follows:</p>
<p><code>var_name = pl.Series(“Column_name”, [values], strict=False)</code></p>
<p>So you’re probably wondering, based on what I said above: how can we have a series with heterogenous data? Well, one thing to note is that a series is always homogenous irrespective of the data that is fed to it. I’ll explain below - first let’s look at this code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl

series_hetero = pl.Series(<span class="hljs-string">"Numbers"</span>, [<span class="hljs-number">1</span>, <span class="hljs-string">"Two"</span>, <span class="hljs-number">3</span>, <span class="hljs-string">"Four"</span>], strict=<span class="hljs-literal">False</span>)
print(series_hetero)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (4,)
Series: 'Numbers' [str]
[
    "1"
    "Two"
    "3"
    "Four"
]
</code></pre>
<p>Here, we created a series object using the <code>pl.Series()</code> function, labelled it, and passed the values that we want in our series.</p>
<p>But you’ll notice that we have provided heterogenous data (data that doesn’t have the same datatype) to the function. Usually, this throws an error. But as we have set the <code>strict</code> parameter as False, the function now becomes lenient with the schema of the series. (The schema is just the expected data-type of the values that are to be recorded in the series.)</p>
<p>If no particular schema is defined for a series that’s fed heterogenous data, <code>pl.Series()</code> sets the schema to <code>pl.Utf8</code> (string datatype). You can see this automatic fixing of the schema in the above example. This prevents the program from bugging, as a string datatype can comprehend characters – numbers as well as symbols.</p>
<p>Also, we can see that datatype of all elements is the same (<code>pl.Utf8</code>). This means that the series is homogenous, even though we put heterogenous data in it.</p>
<p>If we define a schema for the series, then the Polars library converts all the records – which show a different datatype than the defined schema – to null objects. This should be clear in the following example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-comment"># defined the schema as Integer bit 32</span>
series = pl.Series(<span class="hljs-string">"ints"</span>, [<span class="hljs-number">1</span>, <span class="hljs-number">-2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-string">'Thirteen'</span>, <span class="hljs-string">'Fourteen'</span>], dtype=pl.Int32, strict=<span class="hljs-literal">False</span>)
print(series)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (7,)
Series: 'ints' [i32]
[
    1
    -2
    3
    4
    5
    null
    null
]
</code></pre>
<p>Here, we can see that the last two entities were ‘String’, but since we set the schema as ‘Integer’, they were reflected as null records.</p>
<p>So as you can see, the leniency of the program depends on whether you set the <code>strict</code> parameter to True of False. If we set it as True, we enforce the schema to the data strictly. Upon failing to obey the schema, the program raises an exception. On the other hand, if we set the <code>strict</code> parameter as False, the series still preserves its homogenous nature by turning schema-disobeying elements to null.</p>
<p>Now that you understand how series work, we’re ready to move on to DataFrames.</p>
<h2 id="heading-what-is-a-dataframe">What is a DataFrame?</h2>
<p>A DataFrame is a two-dimensional data structure that you can use to store large numbers of related parameters of the collected data. It’s also useful for analyzing that data. A DataFrame is nothing more than the collection of many series, each labelled differently to store different aspects of data.</p>
<p>Here’s the syntax to create a Polars DataFrame object:</p>
<p><code>var_name = pl.DataFrame({key: value pairs}, schema)</code></p>
<p>The following example shows you how to define a DataFrame object in Python:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

schema = {<span class="hljs-string">"Number"</span>: pl.UInt32, <span class="hljs-string">"Natural Log"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"Log Base 10"</span>: <span class="hljs-literal">None</span>}

df = pl.DataFrame(
    {
        <span class="hljs-string">"Number"</span> : np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
        <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)],
        <span class="hljs-string">'Log Base 10'</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)]
        },
    schema=schema
    )
print(df)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (10, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
│ ---    ┆ ---         ┆ ---         │
│ u32    ┆ f64         ┆ f64         │
╞════════╪═════════════╪═════════════╡
│ 1      ┆ 0.0         ┆ 0.0         │
│ 2      ┆ 0.693147    ┆ 0.30103     │
│ 3      ┆ 1.098612    ┆ 0.477121    │
│ 4      ┆ 1.386294    ┆ 0.60206     │
│ 5      ┆ 1.609438    ┆ 0.69897     │
│ 6      ┆ 1.791759    ┆ 0.778151    │
│ 7      ┆ 1.94591     ┆ 0.845098    │
│ 8      ┆ 2.079442    ┆ 0.90309     │
│ 9      ┆ 2.197225    ┆ 0.954243    │
│ 10     ┆ 2.302585    ┆ 1.0         │
└────────┴─────────────┴─────────────┘
</code></pre>
<p>Above, we created a Polars DataFrame object with the <code>pl.DataFrame()</code> function. In the function, we created a dictionary as an argument for passing the values of the DataFrame.</p>
<p>In the dictionary, each key-value pair represents a series. Each key represents the label of the series, whereas its value represent the values of the series. The values are passed in the form of a list as each key can map to only one value.</p>
<p>Then we defined the schema for the DataFrame. Again, the schema is a dictionary, where each key-value pair corresponds to the schema of the series. In the schema, every key represents the label of the series (to map the schema to the correct series) and its value represents the schema.</p>
<p>In the output, we can see that we got a nice table representing our data. The labels are neatly separated from the data and below them, their schema is also represented.</p>
<h3 id="heading-what-is-a-schema">What is a Schema?</h3>
<p>A schema refers to the definition of the datatype of the series. We fix a particular datatype to the homogenous series to avoid getting in mixed-data.</p>
<p>For example, in the above code, we set the datatype of the column <code>Number</code> to <code>Unsigned Integer - 32 bit (pl.UInt32)</code> as we don’t want to put negative integers in our NumPy logarithm function.</p>
<p>Now, if we want to hide the datatype (that’s written below each label), we can use the following function:</p>
<pre><code class="lang-python">pl.Config.set_tbl_hide_column_data_types(active=<span class="hljs-literal">True</span>)
</code></pre>
<h3 id="heading-the-head-tail-and-glimpse-functions">The Head, Tail, and Glimpse Functions</h3>
<p>The <code>head()</code>, <code>tail()</code> and <code>glimpse()</code> functions are used to have a quick look at the data by reviewing certain records (rows). These are useful especially for large datasets for taking a look at the data, for example to see which columns are present, what type of data is present in each column, and so on.</p>
<p>The <code>head()</code> function prints the given number of rows (passed as the argument of the <code>head()</code> function) from the top of the DataFrame. If no argument is passed, it prints the first five rows of the DataFrame.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

schema = {<span class="hljs-string">"Number"</span>: pl.UInt32, <span class="hljs-string">"Natural Log"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"Log Base 10"</span>: <span class="hljs-literal">None</span>}

df = pl.DataFrame(
    {
        <span class="hljs-string">"Number"</span> : np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
        <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)],
        <span class="hljs-string">'Log Base 10'</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=<span class="hljs-literal">True</span>)
print(df.head(<span class="hljs-number">3</span>))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (3, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
╞════════╪═════════════╪═════════════╡
│ 1      ┆ 0.0         ┆ 0.0         │
│ 2      ┆ 0.693147    ┆ 0.30103     │
│ 3      ┆ 1.098612    ┆ 0.477121    │
└────────┴─────────────┴─────────────┘
</code></pre>
<p>In this example, we have the used the same DataFrame that we just created. Then we used the <code>head()</code> function to output the first three rows of the DataFrame. Also, you may now notice that the schema representation under column names has disappeared. This is because we used <code>pl.Config.set_tbl_hide_column_data_types(active=True)</code>.</p>
<p>The <code>glimpse()</code> function presents the data briefly and in a horizontal manner (rows are represented as columns and columns are represented as rows) for better readability.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

schema = {<span class="hljs-string">"Number"</span>: pl.UInt32, <span class="hljs-string">"Natural Log"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"Log Base 10"</span>: <span class="hljs-literal">None</span>}

df = pl.DataFrame(
    {
        <span class="hljs-string">"Number"</span> : np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
        <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)],
        <span class="hljs-string">'Log Base 10'</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=<span class="hljs-literal">True</span>)
print(df.glimpse())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">Rows: 10
Columns: 3
$ Number      &lt;u32&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ Natural Log &lt;f64&gt; 0.0, 0.6931471805599453, 1.0986122886681098, 1.3862943611198906, 1.6094379124341003, 1.791759469228055, 1.9459101490553132, 2.0794415416798357, 2.1972245773362196, 2.302585092994046
$ Log Base 10 &lt;f64&gt; 0.0, 0.3010299956639812, 0.47712125471966244, 0.6020599913279624, 0.6989700043360189, 0.7781512503836436, 0.8450980400142568, 0.9030899869919435, 0.9542425094393249, 1.0

None
</code></pre>
<p>Here, we used the <code>glimpse()</code> function on our previously created DataFrame <code>df</code>. We can see the output as our transposed DataFrame. Also, <code>None</code> is returned. This is because, by default, <code>glimpse()</code> sets its <code>return_as_string</code> parameter to <code>None</code>. To change it to string, we can set the <code>return_as_string</code> parameter to True. The following example shows how to do it:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

schema = {<span class="hljs-string">"Number"</span>: pl.UInt32, <span class="hljs-string">"Natural Log"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"Log Base 10"</span>: <span class="hljs-literal">None</span>}

df = pl.DataFrame(
    {
        <span class="hljs-string">"Number"</span> : np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
        <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)],
        <span class="hljs-string">'Log Base 10'</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=<span class="hljs-literal">True</span>)
print(<span class="hljs-string">f'Returned as String: \n<span class="hljs-subst">{df.glimpse(return_as_string=<span class="hljs-literal">True</span>)}</span>'</span>)
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">Returned as String: 
Rows: 10
Columns: 3
$ Number      &lt;u32&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ Natural Log &lt;f64&gt; 0.0, 0.6931471805599453, 1.0986122886681098, 1.3862943611198906, 1.6094379124341003, 1.791759469228055, 1.9459101490553132, 2.0794415416798357, 2.1972245773362196, 2.302585092994046
$ Log Base 10 &lt;f64&gt; 0.0, 0.3010299956639812, 0.47712125471966244, 0.6020599913279624, 0.6989700043360189, 0.7781512503836436, 0.8450980400142568, 0.9030899869919435, 0.9542425094393249, 1.0
</code></pre>
<p>In the above code, we can see that the DataFrame is returned as a string and <code>None</code> is not returned.</p>
<p>Finally, the <code>tail()</code> function outputs the given number of rows (passed as the argument of the <code>tail()</code> function) from the bottom of the dataset. When no argument is passed, it outputs the last 5 rows by default.</p>
<p>This is useful for checking if our data was completely loaded. Checking the first few records using the <code>head()</code> function and the last few records with the <code>tail()</code> function ensures that the data is correctly and totally loaded.</p>
<p>Also, we can check if there are any empty records at the end of the dataset. Having empty records at the end of the dataset can be fatal in some cases. For example, if you have to train an ML model on a dataset and you split the dataset statically into testing and training datasets, the empty rows at the end are going to cause an issue. So, checking our data beforehand is a best practice, and these functions help us do it.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

schema = {<span class="hljs-string">"Number"</span>: pl.UInt32, <span class="hljs-string">"Natural Log"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"Log Base 10"</span>: <span class="hljs-literal">None</span>}

df = pl.DataFrame(
    {
        <span class="hljs-string">"Number"</span> : np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
        <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)],
        <span class="hljs-string">'Log Base 10'</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=<span class="hljs-literal">True</span>)
print(df.tail(<span class="hljs-number">3</span>))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (3, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
╞════════╪═════════════╪═════════════╡
│ 8      ┆ 2.079442    ┆ 0.90309     │
│ 9      ┆ 2.197225    ┆ 0.954243    │
│ 10     ┆ 2.302585    ┆ 1.0         │
└────────┴─────────────┴─────────────┘
</code></pre>
<p>In the above code, we used the <code>tail()</code> function on the dataset (that we created earlier) and passed ‘3’ as our argument. Thus our program returned the last three rows of the dataset.</p>
<h3 id="heading-the-sample-function">The Sample Function</h3>
<p>The <code>sample()</code> function returns a given number of random rows in random order based on their occurrence in the DataFrame. This helps to avoid biased sampling of data.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

schema = {<span class="hljs-string">"Number"</span>: pl.UInt32, <span class="hljs-string">"Natural Log"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"Log Base 10"</span>: <span class="hljs-literal">None</span>}

df = pl.DataFrame(
    {
        <span class="hljs-string">"Number"</span> : np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
        <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)],
        <span class="hljs-string">'Log Base 10'</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=<span class="hljs-literal">True</span>)
print(df.sample(<span class="hljs-number">3</span>))
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (3, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
╞════════╪═════════════╪═════════════╡
│ 6      ┆ 1.791759    ┆ 0.778151    │
│ 5      ┆ 1.609438    ┆ 0.69897     │
│ 10     ┆ 2.302585    ┆ 1.0         │
└────────┴─────────────┴─────────────┘
</code></pre>
<p>We can see in the output that we got random rows of the data in a random order of their occurrence in the dataset (row 5 comes before row 6 in the DataFrame, yet by sampling we got row 5 after row 6.) Sampling is a good practice as it helps avoid overfitting in ML in some cases and gives us a general idea about the entire dataset.</p>
<h3 id="heading-concatenating-two-dataframes">Concatenating Two DataFrames</h3>
<p>In a nutshell, ‘concatenating’ simply means ‘linking’. Adding or linking one dataset to another – basically, stacking one on top of another – is concatenating the two datasets.</p>
<p>For example, in the previous DataFrame, we had numbers from 1 to 10 and their logarithms. Now, if we want to make it 1 to 20, we have to concatenate a different dataset containing numbers 11 to 20 to the former dataset.</p>
<p>The following code shows how this works:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

schema = {<span class="hljs-string">"Number"</span>: pl.UInt32, <span class="hljs-string">"Natural Log"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"Log Base 10"</span>: <span class="hljs-literal">None</span>}

df = pl.DataFrame(
    {
        <span class="hljs-string">"Number"</span> : np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
        <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)],
        <span class="hljs-string">'Log Base 10'</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># new dataset created for concatenation</span>
df1 = pl.DataFrame({
    <span class="hljs-string">"Number"</span> : [x <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">11</span>, <span class="hljs-number">21</span>)],
    <span class="hljs-string">"Log Base 10"</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">11</span>,<span class="hljs-number">21</span>)],
    <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">11</span>, <span class="hljs-number">21</span>)]
}, schema=schema)

print(pl.concat([df, df1], how=<span class="hljs-string">'vertical'</span>)) <span class="hljs-comment"># concatenating the two datasets</span>
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (20, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
╞════════╪═════════════╪═════════════╡
│ 1      ┆ 0.0         ┆ 0.0         │
│ 2      ┆ 0.693147    ┆ 0.30103     │
│ 3      ┆ 1.098612    ┆ 0.477121    │
│ 4      ┆ 1.386294    ┆ 0.60206     │
│ 5      ┆ 1.609438    ┆ 0.69897     │
│ …      ┆ …           ┆ …           │
│ 16     ┆ 2.772589    ┆ 1.20412     │
│ 17     ┆ 2.833213    ┆ 1.230449    │
│ 18     ┆ 2.890372    ┆ 1.255273    │
│ 19     ┆ 2.944439    ┆ 1.278754    │
│ 20     ┆ 2.995732    ┆ 1.30103     │
└────────┴─────────────┴─────────────┘
</code></pre>
<p>In this code, we first created the DataFrame <code>df</code>. Then we created another DataFrame <code>df1</code>. Next, we used <code>pl.concat()</code> to concatenate the DataFrames.</p>
<p>The first argument that we passed is the list of the DataFrames that are to be linked. The <code>how</code> parameter defines the manner of concatenation. ‘Vertical’ in this context means that we are linking DataFrames vertically (adding more rows).</p>
<p>The important thing to note here is that schema incompatibility may raise an exception. If the DataFrames that are to be concatenated have different schemas, there will be a schema incompatibility problem. So it’s better to keep the schemas of both the datasets (that are to be concatenated) the same.</p>
<p>Here, we introduced a variable named <code>schema</code> containing the schema parameter of the DataFrame and we applied it to both the DataFrames to avoid schema incompatibility.</p>
<p>Also, concatenation occurs in the order of the passed arguments. For example, in the above code, <code>df</code> appears prior to <code>df1</code>, thus in the linked DataFrame, <code>df</code> appears first and then <code>df1</code>. If we had changed the sequence of values, the concatenated DataFrame would start from <code>df1</code> and then <code>df</code>.</p>
<p>The following code explains that:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

schema = {<span class="hljs-string">"Number"</span>: pl.UInt32, <span class="hljs-string">"Natural Log"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"Log Base 10"</span>: <span class="hljs-literal">None</span>}

df = pl.DataFrame(
    {
        <span class="hljs-string">"Number"</span> : np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
        <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)],
        <span class="hljs-string">'Log Base 10'</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)]
        },
    schema=schema
    )
pl.Config.set_tbl_hide_column_data_types(active=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># new dataset created for concatenation</span>
df1 = pl.DataFrame({
    <span class="hljs-string">"Number"</span> : [x <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">11</span>, <span class="hljs-number">21</span>)],
    <span class="hljs-string">"Log Base 10"</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">11</span>,<span class="hljs-number">21</span>)],
    <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">11</span>, <span class="hljs-number">21</span>)]
}, schema=schema)

print(pl.concat([df1, df], how=<span class="hljs-string">'vertical'</span>)) <span class="hljs-comment"># sequence changed from [df,df1] to [df1, df]</span>
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (20, 3)
┌────────┬─────────────┬─────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 │
╞════════╪═════════════╪═════════════╡
│ 11     ┆ 2.397895    ┆ 1.041393    │
│ 12     ┆ 2.484907    ┆ 1.079181    │
│ 13     ┆ 2.564949    ┆ 1.113943    │
│ 14     ┆ 2.639057    ┆ 1.146128    │
│ 15     ┆ 2.70805     ┆ 1.176091    │
│ …      ┆ …           ┆ …           │
│ 6      ┆ 1.791759    ┆ 0.778151    │
│ 7      ┆ 1.94591     ┆ 0.845098    │
│ 8      ┆ 2.079442    ┆ 0.90309     │
│ 9      ┆ 2.197225    ┆ 0.954243    │
│ 10     ┆ 2.302585    ┆ 1.0         │
└────────┴─────────────┴─────────────┘
</code></pre>
<p>Here, we can see that the <code>df1</code> appears first and then <code>df</code> appears (unlike the previous example). Thus, the sequence of the values matters.</p>
<h3 id="heading-how-to-join-two-dataframes">How to Join Two DataFrames</h3>
<p><strong>Joining</strong> datasets and <strong>concatenating</strong> datasets are two different concepts. While concatenating means ‘linking’ two separate datasets, <a target="_blank" href="https://www.freecodecamp.org/news/understanding-sql-joins/">joining</a> refers to combining datasets based on a shared column (a key).<br>The computer matches rows from both datasets where the key values are the same.</p>
<p>In the above dataset ‘df’, we’ll add a new column by joining the dataset ‘df’ with another DataFrame.</p>
<pre><code class="lang-python"><span class="hljs-comment"># new dataframe</span>
new_col = pl.DataFrame({
    <span class="hljs-string">"Number"</span> : [x <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>)],
    <span class="hljs-string">"Log Base 2"</span> : [np.log2(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>)]
})

new_data = df.join(new_col, on=<span class="hljs-string">"Number"</span>, how=<span class="hljs-string">"left"</span>) <span class="hljs-comment"># Both have one column same to map values</span>

print(new_data.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (5, 4)
┌────────┬─────────────┬─────────────┬────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 ┆ Log Base 2 │
╞════════╪═════════════╪═════════════╪════════════╡
│ 1      ┆ 0.0         ┆ 0.0         ┆ 0.0        │
│ 2      ┆ 0.693147    ┆ 0.30103     ┆ 1.0        │
│ 3      ┆ 1.098612    ┆ 0.477121    ┆ 1.584963   │
│ 4      ┆ 1.386294    ┆ 0.60206     ┆ 2.0        │
│ 5      ┆ 1.609438    ┆ 0.69897     ┆ 2.321928   │
└────────┴─────────────┴─────────────┴────────────┘
</code></pre>
<p>In this example, we used the join function on <code>df</code> and passed <code>new_col</code> as its argument. This is why the columns of the <code>df</code> function occur prior to the column of the <code>new_col</code> dataset. The parameter <code>on</code> should be given a column name on the basis of which the two datasets are to be joined.</p>
<p>Here, we first mapped the elements of the column <code>Number</code> and its corresponding rows and joined the DataFrames accordingly.</p>
<p>If we used the <code>join()</code> function on the <code>new_col</code> DataFrame, the columns of <code>df</code> would appear later than the column in <code>new_col</code>. The following code will make it clear:</p>
<pre><code class="lang-python"><span class="hljs-comment"># new dataframe</span>
new_col = pl.DataFrame({
    <span class="hljs-string">"Number"</span> : [x <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>)],
    <span class="hljs-string">"Log Base 2"</span> : [np.log2(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>)]
})

new_data = new_col.join(df, on=<span class="hljs-string">"Number"</span>, how=<span class="hljs-string">"left"</span>) <span class="hljs-comment"># passed df as argument</span>

print(new_data.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (5, 4)
┌────────┬────────────┬─────────────┬─────────────┐
│ Number ┆ Log Base 2 ┆ Natural Log ┆ Log Base 10 │
╞════════╪════════════╪═════════════╪═════════════╡
│ 1      ┆ 0.0        ┆ 0.0         ┆ 0.0         │
│ 2      ┆ 1.0        ┆ 0.693147    ┆ 0.30103     │
│ 3      ┆ 1.584963   ┆ 1.098612    ┆ 0.477121    │
│ 4      ┆ 2.0        ┆ 1.386294    ┆ 0.60206     │
│ 5      ┆ 2.321928   ┆ 1.609438    ┆ 0.69897     │
└────────┴────────────┴─────────────┴─────────────┘
</code></pre>
<p>You can notice that the column ‘Log Base 2’ appears prior to other columns (unlike in the previous example). Thus this change is significant.</p>
<h3 id="heading-how-to-use-the-withcolumns-function">How to Use the <code>with_columns()</code> Function</h3>
<p>The <code>with_columns()</code> function enables us to make changes to the column and print it as a new column with existing columns from the original dataset. This is similar to the <code>join()</code> function.</p>
<p>The following example will make it clear:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

df = pl.DataFrame(
    {
        <span class="hljs-string">"Number"</span> : np.arange(<span class="hljs-number">1</span>, <span class="hljs-number">11</span>),
        <span class="hljs-string">"Natural Log"</span> : [np.log(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)],
        <span class="hljs-string">'Log Base 10'</span> : [np.log10(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>,<span class="hljs-number">11</span>)]
        },
    schema=schema
    )
new_data = df.with_columns((np.log2(pl.col(<span class="hljs-string">"Number"</span>))).alias(<span class="hljs-string">"Log Base 2"</span>))

print(new_data.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (5, 4)
┌────────┬─────────────┬─────────────┬────────────┐
│ Number ┆ Natural Log ┆ Log Base 10 ┆ Log Base 2 │
╞════════╪═════════════╪═════════════╪════════════╡
│ 1      ┆ 0.0         ┆ 0.0         ┆ 0.0        │
│ 2      ┆ 0.693147    ┆ 0.30103     ┆ 1.0        │
│ 3      ┆ 1.098612    ┆ 0.477121    ┆ 1.584963   │
│ 4      ┆ 1.386294    ┆ 0.60206     ┆ 2.0        │
│ 5      ┆ 1.609438    ┆ 0.69897     ┆ 2.321928   │
└────────┴─────────────┴─────────────┴────────────┘
</code></pre>
<p>In this example, we have a DataFrame <code>df</code>. To add a column to it , we use the <code>with_columns()</code> function. In this function, we selected column named ‘Number’ using the <code>pl.col()</code> function and put it inside the <code>np.log2()</code> to get the log base 2 value for every record. Finally, to label the new column, we used the <code>alias()</code> function, with the label passed to it as an argument.</p>
<p>Now that we know about the basics of DataFrames, let’s look at how we can work with CSV files.</p>
<h2 id="heading-how-to-read-csv-files-with-polars">How to Read CSV Files with Polars</h2>
<p>Reading CSV files with Polars is extremely similar to how it works in Pandas. For this tutorial, I’ll be using the Titanic Dataset. Here’s the <a target="_blank" href="https://www.kaggle.com/datasets/yasserh/titanic-dataset?select=Titanic-Dataset.csv">link to the dataset</a> so you can download it. In this part of the tutorial, we’ll be mainly talking about column selection (useful in feature selection) and filtering the data.</p>
<p>Here’s the syntax for reading a CSV file:</p>
<p><code>var_name = pl.read_csv(“path_dataset“)</code></p>
<p>Example code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl

data = pl.read_csv(<span class="hljs-string">"/titanic_dataset.csv"</span>)
print(data.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (5, 12)
┌─────────────┬──────────┬────────┬─────────────────────┬───┬─────────┬─────────┬───────┬──────────┐
│ PassengerId ┆ Survived ┆ Pclass ┆ Name                ┆ … ┆ Ticket  ┆ Fare    ┆ Cabin ┆ Embarked │
╞═════════════╪══════════╪════════╪═════════════════════╪═══╪═════════╪═════════╪═══════╪══════════╡
│ 892         ┆ 0        ┆ 3      ┆ Kelly, Mr. James    ┆ … ┆ 330911  ┆ 7.8292  ┆ null  ┆ Q        │
│ 893         ┆ 1        ┆ 3      ┆ Wilkes, Mrs. James  ┆ … ┆ 363272  ┆ 7.0     ┆ null  ┆ S        │
│             ┆          ┆        ┆ (Ellen Need…        ┆   ┆         ┆         ┆       ┆          │
│ 894         ┆ 0        ┆ 2      ┆ Myles, Mr. Thomas   ┆ … ┆ 240276  ┆ 9.6875  ┆ null  ┆ Q        │
│             ┆          ┆        ┆ Francis             ┆   ┆         ┆         ┆       ┆          │
│ 895         ┆ 0        ┆ 3      ┆ Wirz, Mr. Albert    ┆ … ┆ 315154  ┆ 8.6625  ┆ null  ┆ S        │
│ 896         ┆ 1        ┆ 3      ┆ Hirvonen, Mrs.      ┆ … ┆ 3101298 ┆ 12.2875 ┆ null  ┆ S        │
│             ┆          ┆        ┆ Alexander (Helg…    ┆   ┆         ┆         ┆       ┆          │
└─────────────┴──────────┴────────┴─────────────────────┴───┴─────────┴─────────┴───────┴──────────┘
</code></pre>
<p>We can get the statistical analysis of the data by using the <code>describe()</code> function.</p>
<pre><code class="lang-python">print(data.describe())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (9, 13)
┌────────────┬─────────────┬──────────┬──────────┬───┬─────────────┬───────────┬───────┬──────────┐
│ statistic  ┆ PassengerId ┆ Survived ┆ Pclass   ┆ … ┆ Ticket      ┆ Fare      ┆ Cabin ┆ Embarked │
╞════════════╪═════════════╪══════════╪══════════╪═══╪═════════════╪═══════════╪═══════╪══════════╡
│ count      ┆ 418.0       ┆ 418.0    ┆ 418.0    ┆ … ┆ 418         ┆ 417.0     ┆ 91    ┆ 418      │
│ null_count ┆ 0.0         ┆ 0.0      ┆ 0.0      ┆ … ┆ 0           ┆ 1.0       ┆ 327   ┆ 0        │
│ mean       ┆ 1100.5      ┆ 0.363636 ┆ 2.26555  ┆ … ┆ null        ┆ 35.627188 ┆ null  ┆ null     │
│ std        ┆ 120.810458  ┆ 0.481622 ┆ 0.841838 ┆ … ┆ null        ┆ 55.907576 ┆ null  ┆ null     │
│ min        ┆ 892.0       ┆ 0.0      ┆ 1.0      ┆ … ┆ 110469      ┆ 0.0       ┆ A11   ┆ C        │
│ 25%        ┆ 996.0       ┆ 0.0      ┆ 1.0      ┆ … ┆ null        ┆ 7.8958    ┆ null  ┆ null     │
│ 50%        ┆ 1101.0      ┆ 0.0      ┆ 3.0      ┆ … ┆ null        ┆ 14.4542   ┆ null  ┆ null     │
│ 75%        ┆ 1205.0      ┆ 1.0      ┆ 3.0      ┆ … ┆ null        ┆ 31.5      ┆ null  ┆ null     │
│ max        ┆ 1309.0      ┆ 1.0      ┆ 3.0      ┆ … ┆ W.E.P. 5734 ┆ 512.3292  ┆ G6    ┆ S        │
└────────────┴─────────────┴──────────┴──────────┴───┴─────────────┴───────────┴───────┴──────────┘
</code></pre>
<h3 id="heading-how-to-select-columns-from-the-dataset">How to Select Columns from the Dataset</h3>
<p>Now we’re going to learn how to select certain columns from the dataset and transform those columns into a new DataFrame. This can be useful if we want to train an ML model based on only certain columns and not the entire dataset (that is, using feature selection).</p>
<p>Let’s first look at the code below:</p>
<pre><code class="lang-python">new_df = data.select(
    pl.col(<span class="hljs-string">"Survived"</span>),
    pl.col(<span class="hljs-string">"Name"</span>),
    pl.col(<span class="hljs-string">"Age"</span>),
    pl.col(<span class="hljs-string">"Sex"</span>)
)

print(new_df.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (5, 4)
┌──────────┬─────────────────────────────────┬──────┬────────┐
│ Survived ┆ Name                            ┆ Age  ┆ Sex    │
╞══════════╪═════════════════════════════════╪══════╪════════╡
│ 0        ┆ Kelly, Mr. James                ┆ 34.5 ┆ male   │
│ 1        ┆ Wilkes, Mrs. James (Ellen Need… ┆ 47.0 ┆ female │
│ 0        ┆ Myles, Mr. Thomas Francis       ┆ 62.0 ┆ male   │
│ 0        ┆ Wirz, Mr. Albert                ┆ 27.0 ┆ male   │
│ 1        ┆ Hirvonen, Mrs. Alexander (Helg… ┆ 22.0 ┆ female │
└──────────┴─────────────────────────────────┴──────┴────────┘
</code></pre>
<p>In the code above, we selected four columns using the <code>select()</code> and <code>pl.col()</code> functions from the Titanic Dataset and transformed them into a new DataFrame called <code>new_df</code>.</p>
<p>Now, we can filter this data however we want. Let’s make a new DataFrame by filtering out only surviving passengers from the dataset:</p>
<pre><code class="lang-python">survived_data = data.select(
    pl.col(<span class="hljs-string">"Survived"</span>),
    pl.col(<span class="hljs-string">"Name"</span>),
    pl.col(<span class="hljs-string">"Age"</span>),
    pl.col(<span class="hljs-string">"Sex"</span>)
).filter(pl.col(<span class="hljs-string">"Survived"</span>)==<span class="hljs-number">1</span>)

print(survived_data.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (5, 4)
┌──────────┬─────────────────────────────────┬──────┬────────┐
│ Survived ┆ Name                            ┆ Age  ┆ Sex    │
╞══════════╪═════════════════════════════════╪══════╪════════╡
│ 1        ┆ Wilkes, Mrs. James (Ellen Need… ┆ 47.0 ┆ female │
│ 1        ┆ Hirvonen, Mrs. Alexander (Helg… ┆ 22.0 ┆ female │
│ 1        ┆ Connolly, Miss. Kate            ┆ 30.0 ┆ female │
│ 1        ┆ Abrahim, Mrs. Joseph (Sophie H… ┆ 18.0 ┆ female │
│ 1        ┆ Snyder, Mrs. John Pillsbury (N… ┆ 23.0 ┆ female │
└──────────┴─────────────────────────────────┴──────┴────────┘
</code></pre>
<p>In the above code, we used the <code>filter()</code> function. This function helps us gather data that applies to our given condition. In the above example, we added the condition that, “Every element in the column named ‘Survived’ should be equal to 1”. Hence, we got our required data.</p>
<h2 id="heading-some-other-important-functions">Some Other Important Functions</h2>
<h3 id="heading-how-to-print-the-names-of-the-columns-of-a-dataset">How to Print the Names of the Columns of a Dataset</h3>
<p>You can print the names of a column using the <code>columns</code> method. The following code shows how to use the columns method:</p>
<pre><code class="lang-python">print(data.columns) <span class="hljs-comment"># data --&gt; Titanic Dataset</span>
</code></pre>
<p><strong>Output:</strong></p>
<blockquote>
<p>['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']</p>
</blockquote>
<h3 id="heading-how-to-index-a-dataset">How to Index a Dataset</h3>
<p>Indexing a dataset means adding an index column to the existing dataset. It can prove useful in keeping track of the rows of the dataset.</p>
<p>We can index the dataset using the <code>with_row_index()</code> function. Inside this function, we can pass the argument to name this new index column. If we don’t pass any argument, the index column name is set as ‘index’ by default.</p>
<pre><code class="lang-python">data = pl.read_csv(<span class="hljs-string">"/titanic_dataset.csv"</span>).with_row_index(<span class="hljs-string">'#'</span>) <span class="hljs-comment"># naming the index column as '#'</span>
print(data.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (5, 13)
┌─────┬─────────────┬──────────┬────────┬───┬─────────┬─────────┬───────┬──────────┐
│ #   ┆ PassengerId ┆ Survived ┆ Pclass ┆ … ┆ Ticket  ┆ Fare    ┆ Cabin ┆ Embarked │
│ --- ┆ ---         ┆ ---      ┆ ---    ┆   ┆ ---     ┆ ---     ┆ ---   ┆ ---      │
│ u32 ┆ i64         ┆ i64      ┆ i64    ┆   ┆ str     ┆ f64     ┆ str   ┆ str      │
╞═════╪═════════════╪══════════╪════════╪═══╪═════════╪═════════╪═══════╪══════════╡
│ 0   ┆ 892         ┆ 0        ┆ 3      ┆ … ┆ 330911  ┆ 7.8292  ┆ null  ┆ Q        │
│ 1   ┆ 893         ┆ 1        ┆ 3      ┆ … ┆ 363272  ┆ 7.0     ┆ null  ┆ S        │
│ 2   ┆ 894         ┆ 0        ┆ 2      ┆ … ┆ 240276  ┆ 9.6875  ┆ null  ┆ Q        │
│ 3   ┆ 895         ┆ 0        ┆ 3      ┆ … ┆ 315154  ┆ 8.6625  ┆ null  ┆ S        │
│ 4   ┆ 896         ┆ 1        ┆ 3      ┆ … ┆ 3101298 ┆ 12.2875 ┆ null  ┆ S        │
└─────┴─────────────┴──────────┴────────┴───┴─────────┴─────────┴───────┴──────────┘
</code></pre>
<h3 id="heading-how-to-rename-columns-in-the-dataset">How to Rename Columns in the Dataset</h3>
<p>Lastly, to rename columns in the Dataset, we use the <code>rename()</code> function.</p>
<pre><code class="lang-python">data = pl.read_csv(<span class="hljs-string">"/titanic_dataset.csv"</span>).with_row_index(<span class="hljs-string">'#'</span>).rename({<span class="hljs-string">'PassengerId'</span>:<span class="hljs-string">'renamed_col'</span>})
print(data.head())
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-plaintext">shape: (5, 13)
┌─────┬─────────────┬──────────┬────────┬───┬─────────┬─────────┬───────┬──────────┐
│ #   ┆ renamed_col ┆ Survived ┆ Pclass ┆ … ┆ Ticket  ┆ Fare    ┆ Cabin ┆ Embarked │
│ --- ┆ ---         ┆ ---      ┆ ---    ┆   ┆ ---     ┆ ---     ┆ ---   ┆ ---      │
│ u32 ┆ i64         ┆ i64      ┆ i64    ┆   ┆ str     ┆ f64     ┆ str   ┆ str      │
╞═════╪═════════════╪══════════╪════════╪═══╪═════════╪═════════╪═══════╪══════════╡
│ 0   ┆ 892         ┆ 0        ┆ 3      ┆ … ┆ 330911  ┆ 7.8292  ┆ null  ┆ Q        │
│ 1   ┆ 893         ┆ 1        ┆ 3      ┆ … ┆ 363272  ┆ 7.0     ┆ null  ┆ S        │
│ 2   ┆ 894         ┆ 0        ┆ 2      ┆ … ┆ 240276  ┆ 9.6875  ┆ null  ┆ Q        │
│ 3   ┆ 895         ┆ 0        ┆ 3      ┆ … ┆ 315154  ┆ 8.6625  ┆ null  ┆ S        │
│ 4   ┆ 896         ┆ 1        ┆ 3      ┆ … ┆ 3101298 ┆ 12.2875 ┆ null  ┆ S        │
└─────┴─────────────┴──────────┴────────┴───┴─────────┴─────────┴───────┴──────────┘
</code></pre>
<p>In the above example, we renamed the column named ‘PassengerId’ to ‘renamed_col’.</p>
<h2 id="heading-summary">Summary</h2>
<p>Now you know how to work with the Polars Python library to analyze your data more effectively.</p>
<p>In this article, you learned:</p>
<ul>
<li><p>What Polars is and how to install it</p>
</li>
<li><p>How to define series and DataFrames in Polars</p>
</li>
<li><p>Different functions to deal with DataFrames.</p>
</li>
<li><p>How to read and work with CSV files in Polars</p>
</li>
</ul>
<p>Thanks for Reading, and happy data wrangling!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Dataframe to CSV – How to Save Pandas Dataframes by Exporting ]]>
                </title>
                <description>
                    <![CDATA[ By Shittu Olumide Pandas is a widely used open-source library in Python for data manipulation and analysis. It provides a range of data structures and functions for working with data, one of which is the DataFrame.  DataFrames are a powerful tool for... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/dataframe-to-csv-how-to-save-pandas-dataframes-by-exporting/</link>
                <guid isPermaLink="false">66d460f43dce891ac3a9681a</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ dataframe ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 24 Mar 2023 18:06:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/Shittu-Olumide-Dataframe-to-CSV---How-to-Save-Pandas-Dataframes-by-Exporting.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Shittu Olumide</p>
<p>Pandas is a widely used open-source library in Python for data manipulation and analysis. It provides a range of data structures and functions for working with data, one of which is the DataFrame. </p>
<p>DataFrames are a powerful tool for storing and analyzing large sets of data, but they can be challenging to work with if they are not saved or exported correctly.</p>
<p>It is common practice in data analysis to export data from Pandas DataFrames into CSV files because it can help conserve time and resources. Due to their portability and ability to be easily read by numerous applications, CSV files are a common file format for storing and distributing tabular data. </p>
<p>Regardless of whether you are a novice or an expert data analyst, this article will walk you through the process of saving Pandas DataFrames into CSV files and give you useful tips on how to do so.</p>
<h2 id="heading-how-to-save-pandas-dataframes-using-the-tocsv-method">How to Save Pandas DataFrames Using the <code>.to_csv()</code> Method</h2>
<p>The <code>.to_csv()</code> method is a built-in function in Pandas that allows you to save a Pandas DataFrame as a CSV file. This method exports the DataFrame into a comma-separated values (CSV) file, which is a simple and widely used format for storing tabular data.</p>
<p>The syntax for using the <code>.to_csv()</code> method is as follows:</p>
<pre><code class="lang-py">DataFrame.to_csv(filename, sep=<span class="hljs-string">','</span>, index=<span class="hljs-literal">False</span>, encoding=<span class="hljs-string">'utf-8'</span>)
</code></pre>
<p>Here, <code>DataFrame</code> refers to the Pandas DataFrame that we want to export, and <code>filename</code> refers to the name of the file that you want to save your data to.</p>
<p>The <code>sep</code> parameter specifies the separator that should be used to separate values in the CSV file. By default, it is set to <code>,</code> for comma-separated values. We can also set it to a different separator like <code>\t</code> for tab-separated values.</p>
<p>The <code>index</code> parameter is a boolean value that determines whether to include the index of the DataFrame in the CSV file. By default, it is set to <code>False</code>, which means the index is not included.</p>
<p>The <code>encoding</code> parameter specifies the character encoding to be used for the CSV file. By default, it is set to <code>utf-8</code>, which is a standard encoding for text files.</p>
<h3 id="heading-code-example">Code example</h3>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Create a sample dataframe</span>
Biodata = {<span class="hljs-string">'Name'</span>: [<span class="hljs-string">'John'</span>, <span class="hljs-string">'Emily'</span>, <span class="hljs-string">'Mike'</span>, <span class="hljs-string">'Lisa'</span>],
        <span class="hljs-string">'Age'</span>: [<span class="hljs-number">28</span>, <span class="hljs-number">23</span>, <span class="hljs-number">35</span>, <span class="hljs-number">31</span>],
        <span class="hljs-string">'Gender'</span>: [<span class="hljs-string">'M'</span>, <span class="hljs-string">'F'</span>, <span class="hljs-string">'M'</span>, <span class="hljs-string">'F'</span>]
        }
df = pd.DataFrame(Biodata)

<span class="hljs-comment"># Save the dataframe to a CSV file</span>
df.to_csv(<span class="hljs-string">'Biodata.csv'</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<h3 id="heading-code-explanation">Code explanation</h3>
<p>Let's break down what each part of this code does:</p>
<ul>
<li><code>import pandas as pd</code>: This imports the Pandas library and assigns it the alias <code>pd</code>, which is a commonly used convention.</li>
<li><code>Biodata = {'Name': ['John', 'Emily', 'Mike', 'Lisa'], 'Age': [28, 23, 35, 31], 'Gender': ['M', 'F', 'M', 'F']}</code>: This creates a Python dictionary with the data we want to store in the DataFrame. Each key represents a column in the DataFrame, and its corresponding value is a list of values for that column.</li>
<li><code>df = pd.DataFrame(Biodata)</code>: This creates a Pandas DataFrame from the <code>Biodata</code> dictionary.</li>
<li><code>df.to_csv('Biodata.csv', index=False)</code>: This saves the DataFrame to a CSV file named <code>Biodata.csv</code>.</li>
</ul>
<h2 id="heading-other-ways-to-save-pandas-dataframes">Other Ways to Save Pandas DataFrames</h2>
<p>There are several alternative methods to <code>.to_csv()</code> for saving Pandas DataFrames into various file formats, including:</p>
<ol>
<li><code>to_excel()</code>: This method is used to save a DataFrame as an Excel file. </li>
<li><code>to_json()</code>: This method is used to save a DataFrame as a JSON file. </li>
<li><code>to_hdf()</code>: This method is used to save a DataFrame as an HDF5 file, which is a hierarchical data format commonly used in scientific computing.</li>
<li><code>to_sql()</code>: This method is used to save a DataFrame to a SQL database. </li>
<li><code>to_pickle()</code>: This method is used to save a DataFrame as a pickled object, which is a serialized representation of the DataFrame. </li>
</ol>
<p>These alternative methods provide flexibility in choosing the file format that best suits your use case and can be particularly useful for advanced data analysis and sharing.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Thanks for reading! I hope you now understand how you can easily convert your Pandas Dataframes by exporting into a CSV file using the build-in <code>to_csv()</code> method. </p>
<p>Let's connect on <a target="_blank" href="https://www.twitter.com/Shittu_Olumide_">Twitter</a> and on <a target="_blank" href="https://www.linkedin.com/in/olumide-shittu">LinkedIn</a>. You can also subscribe to my <a target="_blank" href="https://www.youtube.com/channel/UCNhFxpk6hGt5uMCKXq0Jl8A">YouTube</a> channel.</p>
<p>Happy Coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Data Analytics with Pandas – How to Drop a List of Rows from a Pandas Dataframe ]]>
                </title>
                <description>
                    <![CDATA[ A Pandas dataframe is a two dimensional data structure which allows you to store data in rows and columns. It's very useful when you're analyzing data. When you have a list of data records in a dataframe, you may need to drop a specific list of rows ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/drop-list-of-rows-from-pandas-dataframe/</link>
                <guid isPermaLink="false">66bb8ac1c332a9c775d15b63</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ dataframe ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Vikram Aruchamy ]]>
                </dc:creator>
                <pubDate>Tue, 01 Jun 2021 20:47:43 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/05/cut_lemons--1-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>A Pandas dataframe is a two dimensional data structure which allows you to store data in rows and columns. It's very useful when you're analyzing data.</p>
<p>When you have a list of data records in a dataframe, you may need to drop a specific list of rows depending on the needs of your model and your goals when studying your analytics. </p>
<p>In this tutorial, you'll learn how to drop a list of rows from a Pandas dataframe. </p>
<p>To learn how to drop columns, you can read here about <a target="_blank" href="https://www.stackvidhya.com/drop-column-in-pandas/">How to Drop Columns in Pandas</a>. </p>
<h2 id="heading-how-to-drop-a-row-or-column-in-a-pandas-dataframe">How to Drop a Row or Column in a Pandas Dataframe</h2>
<p>To drop a row or column in a dataframe, you need to use the <code>drop()</code> method available in the dataframe. You can read more about the <code>drop()</code> method in the docs <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html">here</a>. </p>
<p><strong>Dataframe Axis</strong></p>
<ul>
<li>Rows are denoted using <code>axis=0</code></li>
<li>Columns are denoted using <code>axis=1</code></li>
</ul>
<p><strong>Dataframe Labels</strong></p>
<ul>
<li>Rows are labelled using the index number starting with 0, by default.</li>
<li>Columns are labelled using names. </li>
</ul>
<p><strong>Drop() Method Parameters</strong></p>
<ul>
<li><code>index</code> - the list of rows to be deleted</li>
<li><code>axis=0</code> - Marks the rows in the dataframe to be deleted</li>
<li><code>inplace=True</code> - Performs the drop operation in the same dataframe, rather than creating a new dataframe object during the delete operation. </li>
</ul>
<h3 id="heading-sample-pandas-dataframe">Sample Pandas DataFrame</h3>
<p>Our sample dataframe contains the columns <em>product_name</em>, <em>Unit_Price</em>, <em>No_Of_Units</em>, <em>Available_Quantity</em>, and <em>Available_Since_Date</em> columns. It also has rows with NaN values which are used to denote missing values. </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {<span class="hljs-string">"product_name"</span>:[<span class="hljs-string">"Keyboard"</span>,<span class="hljs-string">"Mouse"</span>, <span class="hljs-string">"Monitor"</span>, <span class="hljs-string">"CPU"</span>,<span class="hljs-string">"CPU"</span>, <span class="hljs-string">"Speakers"</span>,pd.NaT],
        <span class="hljs-string">"Unit_Price"</span>:[<span class="hljs-number">500</span>,<span class="hljs-number">200</span>, <span class="hljs-number">5000.235</span>, <span class="hljs-number">10000.550</span>, <span class="hljs-number">10000.550</span>, <span class="hljs-number">250.50</span>,<span class="hljs-literal">None</span>],
        <span class="hljs-string">"No_Of_Units"</span>:[<span class="hljs-number">5</span>,<span class="hljs-number">5</span>, <span class="hljs-number">10</span>, <span class="hljs-number">20</span>, <span class="hljs-number">20</span>, <span class="hljs-number">8</span>,pd.NaT],
        <span class="hljs-string">"Available_Quantity"</span>:[<span class="hljs-number">5</span>,<span class="hljs-number">6</span>,<span class="hljs-number">10</span>,<span class="hljs-string">"Not Available"</span>,<span class="hljs-string">"Not Available"</span>, pd.NaT,pd.NaT],
        <span class="hljs-string">"Available_Since_Date"</span>:[<span class="hljs-string">'11/5/2021'</span>, <span class="hljs-string">'4/23/2021'</span>, <span class="hljs-string">'08/21/2021'</span>,<span class="hljs-string">'09/18/2021'</span>,<span class="hljs-string">'09/18/2021'</span>,<span class="hljs-string">'01/05/2021'</span>,pd.NaT]
       }

df = pd.DataFrame(data)

df
</code></pre>
<p>The dataframe will look like this:</p>
<div>

<table>
  <thead>
    <tr>
      <th></th>
      <th>product_name</th>
      <th>Unit_Price</th>
      <th>No_Of_Units</th>
      <th>Available_Quantity</th>
      <th>Available_Since_Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Keyboard</td>
      <td>500.000</td>
      <td>5</td>
      <td>5</td>
      <td>11/5/2021</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Mouse</td>
      <td>200.000</td>
      <td>5</td>
      <td>6</td>
      <td>4/23/2021</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Monitor</td>
      <td>5000.235</td>
      <td>10</td>
      <td>10</td>
      <td>08/21/2021</td>
    </tr>
    <tr>
      <th>3</th>
      <td>CPU</td>
      <td>10000.550</td>
      <td>20</td>
      <td>Not Available</td>
      <td>09/18/2021</td>
    </tr>
    <tr>
      <th>4</th>
      <td>CPU</td>
      <td>10000.550</td>
      <td>20</td>
      <td>Not Available</td>
      <td>09/18/2021</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Speakers</td>
      <td>250.500</td>
      <td>8</td>
      <td>NaT</td>
      <td>01/05/2021</td>
    </tr>
    <tr>
      <th>6</th>
      <td>NaT</td>
      <td>NaN</td>
      <td>NaT</td>
      <td>NaT</td>
      <td>NaT</td>
    </tr>
  </tbody>
</table>
</div>

<p>And just like that we've created our sample dataframe. </p>
<p>After each drop operation, you'll print the dataframe by using <code>df</code> which will print the dataframe in a regular <code>HTML</code> table format. </p>
<p>You can read here about how to <a target="_blank" href="https://www.stackvidhya.com/pretty-print-dataframe/">Pretty Print a Dataframe</a> to print the dataframe in different visual formats. </p>
<p>Next, you'll learn how to drop a list of rows in different use cases. </p>
<h2 id="heading-how-to-drop-a-list-of-rows-by-index-in-pandas">How to Drop a List of Rows by Index in Pandas</h2>
<p>You can delete a list of rows from Pandas by passing the list of indices to the <code>drop()</code> method. </p>
<pre><code class="lang-python">df.drop([<span class="hljs-number">5</span>,<span class="hljs-number">6</span>], axis=<span class="hljs-number">0</span>, inplace=<span class="hljs-literal">True</span>)

df
</code></pre>
<p>In this code,</p>
<ul>
<li><code>[5,6]</code> is the index of the rows you want to delete</li>
<li><code>axis=0</code> denotes that rows should be deleted from the dataframe</li>
<li><code>inplace=True</code> performs the drop operation in the same dataframe</li>
</ul>
<p>After dropping rows with the index 5 and 6, you'll have the below data in the dataframe:</p>
<div>

<table>
  <thead>
    <tr>
      <th></th>
      <th>product_name</th>
      <th>Unit_Price</th>
      <th>No_Of_Units</th>
      <th>Available_Quantity</th>
      <th>Available_Since_Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Keyboard</td>
      <td>500.000</td>
      <td>5</td>
      <td>5</td>
      <td>11/5/2021</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Mouse</td>
      <td>200.000</td>
      <td>5</td>
      <td>6</td>
      <td>4/23/2021</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Monitor</td>
      <td>5000.235</td>
      <td>10</td>
      <td>10</td>
      <td>08/21/2021</td>
    </tr>
    <tr>
      <th>3</th>
      <td>CPU</td>
      <td>10000.550</td>
      <td>20</td>
      <td>Not Available</td>
      <td>09/18/2021</td>
    </tr>
    <tr>
      <th>4</th>
      <td>CPU</td>
      <td>10000.550</td>
      <td>20</td>
      <td>Not Available</td>
      <td>09/18/2021</td>
    </tr>
  </tbody>
</table>
</div>

<p>This is how you can delete rows with a specific index. </p>
<p>Next, you'll learn about dropping a range of indices. </p>
<h2 id="heading-how-to-drop-rows-by-index-range-in-pandas">How to Drop Rows by Index Range in Pandas</h2>
<p>You can also drop a list of rows within a specific range. </p>
<p>A range is a set of values with a lower limit and an upper limit. </p>
<p>This may be useful in cases where you want to create a sample dataset exlcuding specific ranges of data. </p>
<p>You can create a range of rows in a dataframe by using the <code>df.index()</code> method. Then you can pass this range to the <code>drop()</code> method to drop the rows as shown below. </p>
<pre><code class="lang-python">df.drop(df.index[<span class="hljs-number">2</span>:<span class="hljs-number">4</span>], inplace=<span class="hljs-literal">True</span>)

df
</code></pre>
<p>Here's what this code is doing:</p>
<ul>
<li><code>df.index[2:4]</code> generates a range of rows from 2 to 4. The lower limit of the range is inclusive and the upper limit of the range is exclusive. This means that rows 2 and 3 will be deleted and row 4 will <em>not</em> be deleted. </li>
<li><code>inplace=True</code> performs the drop operation in the same dataframe</li>
</ul>
<p>After dropping rows within the range 2-4, you'll have the below data in the dataframe:</p>
<div>

<table>
  <thead>
    <tr>
      <th></th>
      <th>product_name</th>
      <th>Unit_Price</th>
      <th>No_Of_Units</th>
      <th>Available_Quantity</th>
      <th>Available_Since_Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Keyboard</td>
      <td>500.00</td>
      <td>5</td>
      <td>5</td>
      <td>11/5/2021</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Mouse</td>
      <td>200.00</td>
      <td>5</td>
      <td>6</td>
      <td>4/23/2021</td>
    </tr>
    <tr>
      <th>4</th>
      <td>CPU</td>
      <td>10000.55</td>
      <td>20</td>
      <td>Not Available</td>
      <td>09/18/2021</td>
    </tr>
  </tbody>
</table>
</div>

<p>This is how you can drop the list of rows in the dataframe using its range. </p>
<h2 id="heading-how-to-drop-all-rows-after-an-index-in-pandas">How to Drop All Rows after an Index in Pandas</h2>
<p>You can drop all rows after a specific index by using <code>iloc[]</code>. </p>
<p>You can use <code>iloc[]</code> to select rows by using its position index. You can specify the start and end position separated by a <code>:</code>. For example, you'd use <code>2:3</code> to select rows from 2 to 3. If you want to select all the rows, you can just use <code>:</code> in <code>iloc[]</code>. </p>
<p>This may be useful in cases where you want to split the dataset for training and testing purposes. </p>
<p>Use the below snippet to select rows from 0 to the index 2. This results in dropping the rows after the index 2. </p>
<pre><code class="lang-python">df = df.iloc[:<span class="hljs-number">2</span>]

df
</code></pre>
<p>In this code, <code>:2</code> selects the rows until the index 2. </p>
<p>This is how you can drop all rows after a specific index. </p>
<p>After dropping rows after the index 2, you'll have the below data in the dataframe:</p>
<div>

<table>
  <thead>
    <tr>
      <th></th>
      <th>product_name</th>
      <th>Unit_Price</th>
      <th>No_Of_Units</th>
      <th>Available_Quantity</th>
      <th>Available_Since_Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Keyboard</td>
      <td>500.0</td>
      <td>5</td>
      <td>5</td>
      <td>11/5/2021</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Mouse</td>
      <td>200.0</td>
      <td>5</td>
      <td>6</td>
      <td>4/23/2021</td>
    </tr>
  </tbody>
</table>
</div>

<p>This is how you can drop rows after a specific index. </p>
<p>Next, you'll learn how to drop rows with conditions. </p>
<h2 id="heading-how-to-drop-rows-with-multiple-conditions-in-pandas">How to Drop Rows with Multiple Conditions in Pandas</h2>
<p>You can drop rows in the dataframe based on specific conditions. </p>
<p>For example, you can drop rows where the column value is greater than <em>X</em> and less than <em>Y</em>. </p>
<p>This may be useful in cases where you want to create a dataset that ignores columns with specific values. </p>
<p>To drop rows based on certain conditions, select the index of the rows which pass the specific condition and pass that index to the <code>drop()</code> method.  </p>
<pre><code class="lang-python">df.drop(df[(df[<span class="hljs-string">'Unit_Price'</span>] &gt;<span class="hljs-number">400</span>) &amp; (df[<span class="hljs-string">'Unit_Price'</span>] &lt; <span class="hljs-number">600</span>)].index, inplace=<span class="hljs-literal">True</span>)

df
</code></pre>
<p>In this code, </p>
<ul>
<li><code>(df['Unit_Price'] &gt;400) &amp; (df['Unit_Price'] &lt; 600)</code> is the condition to drop the rows. </li>
<li><code>df[].index</code> selects the index of rows which passes the condition. </li>
<li><code>inplace=True</code> performs the drop operation in the same dataframe rather than creating a new one.</li>
</ul>
<p>After dropping the rows with the condition which has the <code>unit_price</code> greater than 400 and less than 600, you'll have the below data in the dataframe:</p>
<div>

<table>
  <thead>
    <tr>
      <th></th>
      <th>product_name</th>
      <th>Unit_Price</th>
      <th>No_Of_Units</th>
      <th>Available_Quantity</th>
      <th>Available_Since_Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>Mouse</td>
      <td>200.0</td>
      <td>5</td>
      <td>6</td>
      <td>4/23/2021</td>
    </tr>
  </tbody>
</table>
</div>

<p>This is how you can drop rows in the dataframe using certain conditions. </p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>To summarize, in this article you've learnt what the <code>drop()</code> method is in a Pandas dataframe. You've also seen how dataframe rows and columns are labelled. And finally you've learnt how to drop rows using indices, a range of indices, and based on conditions. </p>
<p>If you liked this article, feel free to share it. </p>
<h3 id="heading-you-may-also-like">You May Also Like</h3>
<ul>
<li><a target="_blank" href="https://www.stackvidhya.com/add-column-to-dataframe/">How to Add a Column to a Dataframe in Pandas</a><ul>
<li><a target="_blank" href="https://www.stackvidhya.com/rename-column-in-pandas/">How to Rename a Column in Pandas</a></li>
</ul>
</li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Obtain Historical Weather Forecast data in CSV format using Python ]]>
                </title>
                <description>
                    <![CDATA[ By Ekapope Viriyakovithya Recently, I worked on a machine learning project related to renewable energy, which required historical weather forecast data from multiple cities. Despite intense research, I had a hard time finding the good data source. Mo... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/obtain-historical-weather-forecast-data-in-csv-format-using-python/</link>
                <guid isPermaLink="false">66d45e42d14641365a0508ac</guid>
                
                    <category>
                        <![CDATA[ worldweatheronline ]]>
                    </category>
                
                    <category>
                        <![CDATA[ api ]]>
                    </category>
                
                    <category>
                        <![CDATA[ csv ]]>
                    </category>
                
                    <category>
                        <![CDATA[ dataframe ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ weather ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 09 Jul 2019 10:00:00 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/5f9ca193740569d1a4ca4f62.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ekapope Viriyakovithya</p>
<p>Recently, I worked on a machine learning project related to renewable energy, which required <strong>historical weather forecast data from multiple cities</strong>.</p>
<p>Despite intense research, I had a hard time finding the good data source. Most websites restrict the access to only past two weeks of historical data. If you need more, you need to pay. In my case, I needed five years of data — hourly historical forecast, which can be costly.</p>
<h3 id="heading-my-requirements-are"><strong>My requirements are...</strong></h3>
<p><strong>1. Free — at least during trial period</strong></p>
<p>No need to provide credit card info.</p>
<p><strong>2. Flexible</strong></p>
<p>Flexible to change forecast interval, time periods, locations.</p>
<p><strong>3. Reproducible</strong></p>
<p>Easy to reproduce and implement in the production phase.</p>
<p>In the end, I decided to use data from <a target="_blank" href="https://www.worldweatheronline.com/developer/api/historical-weather-api.aspx">World Weather Online</a>. This took me less than two minutes to subscribe free trial premium API — without filling credit card info. (500 free requests/key/day for 60 days, as of 30-May-2019).</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*kVPI57az2iE7Kjni3AhxOw.png" alt="Image" width="859" height="245" loading="lazy"></p>
<p><a target="_blank" href="https://www.worldweatheronline.com/developer/signup.aspx"><em>https://www.worldweatheronline.com/developer/signup.aspx</em></a></p>
<p>You can try out requests in JSON or XML format <a target="_blank" href="https://www.worldweatheronline.com/developer/premium-api-explorer.aspx">here</a>. The result is nested JSON which needed a bit pre-processing work before feeding into ML models. Therefore, I wrote some <a target="_blank" href="https://github.com/ekapope/WorldWeatherOnline">scripts</a> to parse them into pandas DataFrames and save as CSV for further use.</p>
<h3 id="heading-introducing-wwo-hist-package">Introducing wwo-hist package</h3>
<p>This <a target="_blank" href="https://pypi.org/project/wwo-hist/">wwo-hist package</a> is used to retrieve and parse historical weather data from <a target="_blank" href="https://www.worldweatheronline.com/developer/api/historical-weather-api.aspx">World Weather Online</a> into pandas DataFrame and CSV file.</p>
<p><strong>Input:</strong> api_key, location_list, start_date, end_date, frequency</p>
<p><strong>Output:</strong> location_name.csv</p>
<p><strong>Output column names:</strong> date_time, maxtempC, mintempC, totalSnow_cm, sunHour, uvIndex, uvIndex, moon_illumination, moonrise, moonset, sunrise, sunset, DewPointC, FeelsLikeC, HeatIndexC, WindChillC, WindGustKmph, cloudcover, humidity, precipMM, pressure, tempC, visibility, winddirDegree, windspeedKmph</p>
<h4 id="heading-install-and-import-the-package">Install and import the package:</h4>
<pre><code class="lang-py">pip install wwo-hist
</code></pre>
<pre><code class="lang-py"><span class="hljs-comment"># import the package and function</span>
<span class="hljs-keyword">from</span> wwo_hist <span class="hljs-keyword">import</span> retrieve_hist_data

<span class="hljs-comment"># set working directory to store output csv file(s)</span>
<span class="hljs-keyword">import</span> os
os.chdir(<span class="hljs-string">".\YOUR_PATH"</span>)
</code></pre>
<p><strong>Example code:</strong></p>
<p>Specify input parameters and call <strong><em>retrieve_hist_data()*</em></strong>.* Please visit <a target="_blank" href="https://github.com/ekapope/WorldWeatherOnline">my github repo</a> for more info about parameters setup.</p>
<p>This will retrieve <strong>3-hour interval</strong> historical weather forecast data for <strong>Singapore</strong> and <strong>California</strong> from <strong>11-Dec-2018</strong> to <strong>11-Mar-2019</strong>, save output into hist_weather_data variable and <strong>CSV</strong> files.frequency = 3</p>
<pre><code class="lang-py">FREQUENCY = <span class="hljs-number">3</span>
START_DATE = <span class="hljs-string">'11-DEC-2018'</span>
END_DATE = <span class="hljs-string">'11-MAR-2019'</span>
API_KEY = <span class="hljs-string">'YOUR_API_KEY'</span>
LOCATION_LIST = [<span class="hljs-string">'singapore'</span>,<span class="hljs-string">'california'</span>]

hist_weather_data = retrieve_hist_data(API_KEY,
                                LOCATION_LIST,
                                START_DATE,
                                END_DATE,
                                FREQUENCY,
                                location_label = <span class="hljs-literal">False</span>,
                                export_csv = <span class="hljs-literal">True</span>,
                                store_df = <span class="hljs-literal">True</span>)
</code></pre>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*Ga1fGBrxkoKgfzbu" alt="Image" width="795" height="747" loading="lazy"></p>
<p><em>This is what you will see in your console.</em></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/1*0-UJ7XO4ZG76oDlVvHQNVA.png" alt="Image" width="761" height="121" loading="lazy"></p>
<p><em>Result CSV(s) exported to your working directory.</em></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/0*gdtco3Zi0Kv03uz9" alt="Image" width="1200" height="245" loading="lazy"></p>
<p><em>Check the CSV output.</em></p>
<p>There you have it! The script detailed is also <a target="_blank" href="https://github.com/ekapope/WorldWeatherOnline">documented on GitHub</a>.</p>
<hr>
<p>Thank you for reading. Please give it a try, and let me know your feedback! If you like what I did, consider following me on <a target="_blank" href="https://github.com/ekapope">GitHub</a>, <a target="_blank" href="https://medium.com/@ekapope.v">Medium</a>, and <a target="_blank" href="https://twitter.com/EkapopeV">Twitter</a> to get more articles and tutorials on your feed.</p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
