<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ pandas - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ pandas - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 31 May 2026 14:26:10 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/pandas/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ How to Transform JSON Data to Match Any Schema ]]>
                </title>
                <description>
                    <![CDATA[ Whether you’re transferring data between APIs or just preparing JSON data for import, mismatched schemas can break your workflow.  Learning how to clean and normalize JSON data ensures a smooth, error-free data transfer. This tutorial demonstrates ho... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/transform-json-data-schema/</link>
                <guid isPermaLink="false">686f40595293ca3e659585b7</guid>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ json ]]>
                    </category>
                
                    <category>
                        <![CDATA[ json-schema ]]>
                    </category>
                
                    <category>
                        <![CDATA[ python beginner ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Nneoma Uche ]]>
                </dc:creator>
                <pubDate>Thu, 10 Jul 2025 04:23:53 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1752121420492/513db316-cdc7-47ef-8f20-4911cf5d41f9.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Whether you’re transferring data between APIs or just preparing JSON data for import, mismatched schemas can break your workflow.  Learning how to clean and normalize JSON data ensures a smooth, error-free data transfer.</p>
<p>This tutorial demonstrates how to clean messy JSON and export the results into a new file, based on a predefined schema. The JSON file we’ll be cleaning contains a dataset of 200 synthetic customer records.</p>
<p>In this tutorial, we’ll apply two methods for cleaning the input data:</p>
<ul>
<li><p>With pure Python</p>
</li>
<li><p>With <code>pandas</code></p>
</li>
</ul>
<p>You can apply either of these in your code. But the <code>pandas</code> method is better for large, complex data sets. Let’s jump right into the process.</p>
<h3 id="heading-heres-what-well-cover">Here’s what we’ll cover:</h3>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-add-and-inspect-the-json-file">Add and Inspect the JSON File</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-define-the-target-schema">Define the Target Schema</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-clean-json-data-with-pure-python">How to Clean JSON Data with Pure Python</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-clean-json-data-with-pandas">How to Clean JSON Data with Pandas</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-validate-the-cleaned-json">How to Validate the Cleaned JSON</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-pandas-vs-pure-python-for-data-cleaning">Pandas vs Pure Python for Data Cleaning</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<p>To follow along with this tutorial, you should have a basic understanding of:</p>
<ul>
<li><p>Python dictionaries, lists, and loops</p>
</li>
<li><p>JSON data structure (keys, values, and nesting)</p>
</li>
<li><p>How to read and write JSON files with Python’s <code>json</code> module</p>
</li>
</ul>
<h2 id="heading-add-and-inspect-the-json-file">Add and Inspect the JSON File</h2>
<p>Before you begin writing any code, make sure that the <strong>.json</strong> file you intend to clean is in your project directory. This makes it easy to load in your script using the file name alone.</p>
<p>You can now inspect the data structure by viewing the file locally or loading it in your script, with Python’s built-in <code>json</code> module.</p>
<p>Here’s how (assuming the file name is <strong>“old_customers.json”</strong>):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752079424973/3cd77410-6fa9-483d-9a73-edbe4c035327.jpeg" alt="Code to view or print contents of the raw JSON file in terminal" class="image--center mx-auto" width="407" height="231" loading="lazy"></p>
<p>This shows you whether the JSON file is structured as a dictionary or a list. It also prints out the entire file in your terminal. Mine is a dictionary that maps to a list of 200 customer entries. You should always open up the raw JSON file in your IDE to get a closer look at its structure and schema.</p>
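<p>As a text version of that inspection step, here is a minimal sketch. The <code>summarize_json()</code> helper and the two-record sample are hypothetical, and the sample is written to a temporary directory only so the snippet runs on its own. In your project, you would pass <code>"old_customers.json"</code> instead.</p>
<pre><code class="lang-python">import json
import tempfile
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and report its top-level type and, for a dict, its keys."""
    with open(path) as file:
        data = json.load(file)
    summary = {"type": type(data).__name__}
    if isinstance(data, dict):
        summary["keys"] = list(data.keys())
    return data, summary


# Hypothetical stand-in for the real 200-record file
sample = {
    "customers": [
        {"customer_id": 1, "name": "Ada Obi", "email": "ada@example.com",
         "phone": "555-0101", "address": "1 Main St", "membership_level": "gold"},
        {"customer_id": 2, "name": "Ben Cole", "email": "ben@example.com",
         "phone": "555-0102", "address": "2 High St", "membership_level": "silver"},
    ]
}

with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "old_customers.json"
    sample_path.write_text(json.dumps(sample))
    crm_data, summary = summarize_json(sample_path)

print(summary)
print(len(crm_data["customers"]), "customer entries")
</code></pre>
<p>Printing a short summary first is easier to read than dumping all 200 records into the terminal.</p>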
<h2 id="heading-define-the-target-schema">Define the Target Schema</h2>
<p>If someone asks for JSON data to be cleaned, it probably means that the <a target="_blank" href="https://json-schema.org/understanding-json-schema/about">current schema</a> is unsuitable for its intended purpose. At this point, you want to be clear on what the final JSON export should look like.</p>
<p>A JSON schema is essentially a blueprint that describes:</p>
<ul>
<li><p>required fields</p>
</li>
<li><p>field names</p>
</li>
<li><p>data type for each field</p>
</li>
<li><p>standardized formats (for example, lowercase emails, trimmed whitespace, etc.)</p>
</li>
</ul>
<p>Here’s what the old schema versus the target schema looks like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751956173106/d5957404-57ae-4de9-b61b-90eefa0b9260.jpeg" alt="A screenshot of the old JSON Schema to be transformed" class="image--center mx-auto" width="597" height="222" loading="lazy"></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751956365336/dcf6a024-1ae6-4c95-92ae-5544ba4cbb3e.jpeg" alt="The expected JSON Schema" class="image--center mx-auto" width="460" height="186" loading="lazy"></p>
<p>As you can see, the goal is to delete the <code>"customer_id"</code> and <code>"address"</code> fields in each entry and rename the rest from:</p>
<ul>
<li><p><code>"name"</code> to <code>"full_name"</code></p>
</li>
<li><p><code>"email"</code> to <code>"email_address"</code></p>
</li>
<li><p><code>"phone"</code> to <code>"mobile"</code></p>
</li>
<li><p><code>"membership_level"</code> to <code>"tier"</code></p>
</li>
</ul>
<p>The output should contain four fields instead of six, all renamed to fit the project requirements.</p>
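<p>One way to keep this mapping in a single place is a lookup dictionary. The sketch below is illustrative: the <code>FIELD_MAP</code> name, the <code>rename_fields()</code> helper, and the sample record values are hypothetical, while the field names come from the two schemas above.</p>
<pre><code class="lang-python"># Maps each old field name to its new name; fields not listed here are dropped
FIELD_MAP = {
    "name": "full_name",
    "email": "email_address",
    "phone": "mobile",
    "membership_level": "tier",
}


def rename_fields(record):
    """Return a new dict holding only the mapped fields, under their new names."""
    return {new_key: record[old_key] for old_key, new_key in FIELD_MAP.items()}


# Hypothetical record in the old schema
old_record = {
    "customer_id": 1,
    "name": "Ada Obi",
    "email": "ada@example.com",
    "phone": "555-0101",
    "address": "1 Main St",
    "membership_level": "gold",
}

new_record = rename_fields(old_record)
print(new_record)
</code></pre>
<p>Because only the keys in <code>FIELD_MAP</code> are copied, <code>"customer_id"</code> and <code>"address"</code> fall away without an explicit delete step.</p>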
<h2 id="heading-how-to-clean-json-data-with-pure-python">How to Clean JSON Data with Pure Python</h2>
<p>Let’s explore using Python’s built-in <code>json</code> module to align the raw data with the predefined schema.</p>
<h3 id="heading-step-1-import-json-and-time-modules">Step 1: Import <code>json</code> and <code>time</code> modules</h3>
<p>We need <code>json</code> because we’re working with JSON files, and we’ll use the <code>time</code> module to track how long the data cleaning process takes.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> time
</code></pre>
<h3 id="heading-step-2-load-the-file-with-jsonload">Step 2: Load the file with <code>json.load()</code></h3>
<pre><code class="lang-python">start_time = time.time()
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'old_customers.json'</span>) <span class="hljs-keyword">as</span> file:
    crm_data = json.load(file)
</code></pre>
<h3 id="heading-step-3-write-a-function-to-loop-through-and-clean-each-customer-entry-in-the-dictionary">Step 3: Write a function to loop through and clean each customer entry in the dictionary</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">clean_data</span>(<span class="hljs-params">records</span>):</span>
    transformed_records = []
    <span class="hljs-keyword">for</span> customer <span class="hljs-keyword">in</span> records[<span class="hljs-string">"customers"</span>]:
        transformed_records.append({
                <span class="hljs-string">"full_name"</span>: customer[<span class="hljs-string">"name"</span>],
                <span class="hljs-string">"email_address"</span>: customer[<span class="hljs-string">"email"</span>],
                <span class="hljs-string">"mobile"</span>: customer[<span class="hljs-string">"phone"</span>],
                <span class="hljs-string">"tier"</span>: customer[<span class="hljs-string">"membership_level"</span>],

                })
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"customers"</span>: transformed_records}

new_data = clean_data(crm_data)
</code></pre>
<p><code>clean_data()</code> takes in the original data (<strong>temporarily</strong>) stored in the records variable, transforming it to match our target schema.</p>
<p>Since the JSON file we loaded is a dictionary containing a <code>"customers"</code> key, which maps to a list of customer entries, we access this key and loop through each entry in the list.</p>
<p>In the for loop, we rename the relevant fields and store the cleaned entries in a new list called <code>transformed_records</code>.</p>
<p>Then, we return a dictionary with the <code>"customers"</code> key intact.</p>
<h3 id="heading-step-4-save-the-output-in-a-json-file">Step 4: Save the output in a .json file</h3>
<p>Decide on a name for your cleaned JSON data and assign that to an <code>output_file</code> variable, like so:</p>
<pre><code class="lang-python">output_file = <span class="hljs-string">"transformed_data.json"</span>
<span class="hljs-keyword">with</span> open(output_file, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> f:
    json.dump(new_data, f, indent=<span class="hljs-number">4</span>)
</code></pre>
<p>You can also add a <code>print()</code> statement below this block to confirm that the file has been saved in your project directory.</p>
<h3 id="heading-step-5-time-the-data-cleaning-process">Step 5: Time the data cleaning process</h3>
<p>At the beginning of this process, we imported the time module to measure how long it takes to clean up JSON data using pure Python. To track the runtime, we stored the current time in a <code>start_time</code> variable before the cleaning function, and we’ll now include an <code>end_time</code> variable at the end of the script.</p>
<p>The difference between the <code>end_time</code> and <code>start_time</code> values gives you the total runtime in seconds.</p>
<pre><code class="lang-python">end_time = time.time()
elapsed_time = end_time - start_time

print(<span class="hljs-string">f"Transformed data saved to <span class="hljs-subst">{output_file}</span>"</span>)
print(<span class="hljs-string">f"Processing data took <span class="hljs-subst">{elapsed_time:<span class="hljs-number">.2</span>f}</span> seconds"</span>)
</code></pre>
<p>Here’s how long the data cleaning process took with the pure Python approach:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751957367537/4a33fc16-7158-427e-b715-bec10a586857.jpeg" alt="Script runtime displayed in terminal" class="image--center mx-auto" width="766" height="88" loading="lazy"></p>
<h2 id="heading-how-to-clean-json-data-with-pandas">How to Clean JSON Data with Pandas</h2>
<p>Now we’re going to try achieving the same results as above, using Python and a third-party library called <code>pandas</code>. Pandas is an open-source library used for data manipulation and analysis in Python.</p>
<p>To get started, you need to have the <code>pandas</code> library installed in your Python environment. In your terminal, run:</p>
<pre><code class="lang-python">pip install pandas
</code></pre>
<p>Then follow these steps:</p>
<h3 id="heading-step-1-import-the-relevant-libraries">Step 1: Import the relevant libraries</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
</code></pre>
<h3 id="heading-step-2-load-file-and-extract-customer-entries">Step 2: Load file and extract customer entries</h3>
<p>Unlike the pure Python method, where we simply indexed the key name <code>"customers"</code> to access the list of customer data, working with <code>pandas</code> requires a slightly different approach.</p>
<p>We must extract the list before loading it into a DataFrame because <code>pandas</code> expects structured data. Extracting the list of customer dictionaries upfront ensures that we isolate and clean the relevant records alone, preventing errors caused by nested or unrelated JSON data.</p>
<pre><code class="lang-python">start_time = time.time()
<span class="hljs-keyword">with</span> open(<span class="hljs-string">'old_customers.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
    crm_data = json.load(f)

<span class="hljs-comment">#Extract the list of customer entries</span>
clients = crm_data.get(<span class="hljs-string">"customers"</span>, [])
</code></pre>
<h3 id="heading-step-3-load-customer-entries-into-a-dataframe">Step 3: Load customer entries into a DataFrame</h3>
<p>Once you’ve got a clean list of customer dictionaries, load the list into a DataFrame and assign the DataFrame to a variable, like so:</p>
<pre><code class="lang-python"><span class="hljs-comment">#Load into a dataframe</span>
df = pd.DataFrame(clients)
</code></pre>
<p>This creates a tabular or spreadsheet-like structure, where each row represents a customer. Loading the list into a DataFrame also allows you to access <code>pandas</code>’ powerful data cleaning methods like:</p>
<ul>
<li><p><code>drop_duplicates()</code>: removes duplicate rows or entries from a DataFrame</p>
</li>
<li><p><code>dropna()</code>: drops rows with any missing or null data</p>
</li>
<li><p><code>fillna(value)</code>: replaces all missing or null data with a specified value</p>
</li>
<li><p><code>drop(columns)</code>: drops unused columns explicitly</p>
</li>
</ul>
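<p>To make these four methods concrete, here is a small sketch on a hypothetical three-row DataFrame (the names and values are made up for illustration). Each call returns a new DataFrame and leaves the original unchanged.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical sample: one duplicated row and one missing email
sample_df = pd.DataFrame([
    {"name": "Ada Obi", "email": "ada@example.com", "address": "1 Main St"},
    {"name": "Ada Obi", "email": "ada@example.com", "address": "1 Main St"},
    {"name": "Ben Cole", "email": None, "address": "2 High St"},
])

deduped = sample_df.drop_duplicates()        # drops the repeated Ada row
no_missing = deduped.dropna()                # drops Ben, whose email is missing
filled = deduped.fillna("N/A")               # keeps Ben, email becomes "N/A"
trimmed = filled.drop(columns=["address"])   # removes the unused column

print(trimmed)
</code></pre>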
<h3 id="heading-step-4-write-a-custom-function-to-rename-relevant-fields">Step 4: Write a custom function to rename relevant fields</h3>
<p>At this point, we need a function that takes in a single customer entry – a row – and returns a cleaned version that fits the target schema (<code>"full_name"</code>, <code>"email_address"</code>, <code>"mobile"</code> and <code>"tier"</code>).</p>
<p>The function should also handle missing data by setting default values like <strong>"Unknown"</strong> or <strong>"N/A"</strong> when a field is absent.</p>
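<p>A minimal sketch of such a function is shown below. Which default goes with which field is an assumption made for illustration, and the sample entry is hypothetical.</p>
<pre><code class="lang-python">def transform_fields(row):
    """Map one customer entry (a dict or a DataFrame row) onto the target schema."""
    return {
        "full_name": row.get("name", "Unknown"),
        "email_address": row.get("email", "N/A"),
        "mobile": row.get("phone", "N/A"),
        "tier": row.get("membership_level", "N/A"),
    }


# Quick check with a plain dict that has no "phone" field
example = transform_fields(
    {"name": "Ada Obi", "email": "ada@example.com", "membership_level": "gold"}
)
print(example)
</code></pre>
<p><code>get()</code> is available on both dictionaries and <code>pandas</code> rows, so the same function works for either. Note that the defaults only apply when a field is absent altogether. A field that exists but holds a missing value passes through unchanged, so you may want to call <code>fillna()</code> on the DataFrame first.</p>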
<p><strong>P.S:</strong> At first, I used <code>drop(columns)</code> to explicitly remove the <code>"address"</code> and <code>"customer_id"</code> fields. But it’s not needed in this case, as the <code>transform_fields()</code> function only selects and renames the required fields. Any extra columns are automatically excluded from the cleaned data.</p>
<h3 id="heading-step-5-apply-schema-transformation-to-all-rows">Step 5: Apply schema transformation to all rows</h3>
<p>We’ll use <code>pandas</code>' <code>apply()</code> method to apply our custom function to each row in the DataFrame. This creates a Series of dictionaries (for example, 0 → {...}, 1 → {...}, 2 → {...}), which is not JSON-friendly.</p>
<p>Because <code>json.dump()</code> can’t serialize a pandas Series, we’ll call <code>tolist()</code> to convert the Series to a list of dictionaries.</p>
<pre><code class="lang-python"><span class="hljs-comment">#Apply schema transformation to all rows</span>
transformed_df = df.apply(transform_fields, axis=<span class="hljs-number">1</span>)

<span class="hljs-comment">#Convert series to list of dicts</span>
transformed_data = transformed_df.tolist()
</code></pre>
<p>Another way to approach this is with a list comprehension. Instead of using <code>apply()</code> at all, you can write:</p>
<pre><code class="lang-python">transformed_data = [transform_fields(row) <span class="hljs-keyword">for</span> row <span class="hljs-keyword">in</span> df.to_dict(orient=<span class="hljs-string">"records"</span>)]
</code></pre>
<p><code>orient="records"</code> is an argument for <code>df.to_dict()</code> that tells pandas to convert the DataFrame to a list of dictionaries, where each dictionary represents a single customer record (that is, one row).</p>
<p>Then the <strong>for loop</strong> iterates through every customer record on the list, calling the custom function on each row. Finally, the list comprehension (<strong>[...]</strong>) collects the cleaned rows into a new list.</p>
<h3 id="heading-step-6-save-the-output-in-a-json-file">Step 6: Save the output in a .json file</h3>
<pre><code class="lang-python"><span class="hljs-comment">#Save the cleaned data</span>
output_data = {<span class="hljs-string">"customers"</span>: transformed_data}
output_file = <span class="hljs-string">"applypandas_customer.json"</span>
<span class="hljs-keyword">with</span> open(output_file, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> f:
    json.dump(output_data, f, indent=<span class="hljs-number">4</span>)
</code></pre>
<p>I recommend picking a different file name for your <code>pandas</code> output. You can inspect both files side by side to see if this output matches the result you got from cleaning with pure Python.</p>
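<p>If you would rather compare the two outputs in code than by eye, a sketch like the following can help. The <code>same_json()</code> helper is hypothetical, and the two files are recreated in a temporary directory with made-up content so the snippet runs on its own. In your project, you would pass the paths of your real output files.</p>
<pre><code class="lang-python">import json
import tempfile
from pathlib import Path


def same_json(path_a, path_b):
    """Return True when both files decode to equal Python objects."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return json.load(file_a) == json.load(file_b)


# Hypothetical stand-ins for the two output files
records = {
    "customers": [
        {"full_name": "Ada Obi", "email_address": "ada@example.com",
         "mobile": "555-0101", "tier": "gold"},
    ]
}

with tempfile.TemporaryDirectory() as folder:
    pure_python_file = Path(folder) / "transformed_data.json"
    pandas_file = Path(folder) / "applypandas_customer.json"
    pure_python_file.write_text(json.dumps(records, indent=4))
    pandas_file.write_text(json.dumps(records))
    outputs_match = same_json(pure_python_file, pandas_file)

print(outputs_match)
</code></pre>
<p>Comparing the decoded objects rather than the raw text means differences in indentation or whitespace are ignored.</p>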
<h3 id="heading-step-7-track-runtime">Step 7: Track runtime</h3>
<p>Once again, check for the difference between start time and end time to determine the program’s execution time.</p>
<pre><code class="lang-python">end_time = time.time()
elapsed_time = end_time - start_time

print(<span class="hljs-string">f"Transformed data saved to <span class="hljs-subst">{output_file}</span>"</span>)
print(<span class="hljs-string">f"Processing data took <span class="hljs-subst">{elapsed_time:<span class="hljs-number">.2</span>f}</span> seconds"</span>)
</code></pre>
<p>When I used <strong>list comprehension</strong> to apply the custom function, my script’s runtime was <strong>0.03 seconds</strong>, but with <code>pandas</code>’ <code>apply()</code> function, the total runtime dropped to <strong>0.01 seconds</strong>.</p>
<h3 id="heading-final-output-preview">Final output preview:</h3>
<p>If you followed this tutorial closely, your JSON output should look like this – whether you used the <code>pandas</code> method or the pure Python approach:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751961256627/d7b585f7-4585-4354-9fa7-a171adb31f90.jpeg" alt="The expected JSON output after schema transformation" class="image--center mx-auto" width="455" height="310" loading="lazy"></p>
<h2 id="heading-how-to-validate-the-cleaned-json">How to Validate the Cleaned JSON</h2>
<p>Validating your output ensures that the cleaned data follows the expected structure before being used or shared. This step helps to catch formatting errors, missing fields, and wrong data types early.</p>
<p>Below are the steps for validating your cleaned JSON file:</p>
<h3 id="heading-step-1-install-and-import-jsonschema">Step 1: Install and import <code>jsonschema</code></h3>
<p><code>jsonschema</code> is a third-party validation library for Python. It helps you define the expected structure of your JSON data and automatically check if your output matches that structure.</p>
<p>In your terminal, run:</p>
<pre><code class="lang-python">pip install jsonschema
</code></pre>
<p>Import the required libraries:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> jsonschema <span class="hljs-keyword">import</span> validate, ValidationError
</code></pre>
<p><code>validate()</code> checks whether your JSON data matches the rules defined in your schema. If the data is valid, nothing happens. But if there’s an error – like a missing field or wrong data type – it raises a <code>ValidationError</code>.</p>
<h3 id="heading-step-2-define-a-schema">Step 2: Define a schema</h3>
<p>As you know, JSON schema changes with each file structure. If your JSON data differs from what we’ve been working with so far, learn how to create a schema <a target="_blank" href="https://json-schema.org/learn/getting-started-step-by-step#validate-json-data-against-the-schema">here</a>. Otherwise, the schema below defines the structure we expect for our cleaned JSON:</p>
<pre><code class="lang-python">schema = {
    <span class="hljs-string">"type"</span>: <span class="hljs-string">"object"</span>,
    <span class="hljs-string">"properties"</span>: {
        <span class="hljs-string">"customers"</span>: {
            <span class="hljs-string">"type"</span>: <span class="hljs-string">"array"</span>,
            <span class="hljs-string">"items"</span>: {
                <span class="hljs-string">"type"</span>: <span class="hljs-string">"object"</span>,
                <span class="hljs-string">"properties"</span>: {
                    <span class="hljs-string">"full_name"</span>: {<span class="hljs-string">"type"</span>: <span class="hljs-string">"string"</span>},
                    <span class="hljs-string">"email_address"</span>: {<span class="hljs-string">"type"</span>: <span class="hljs-string">"string"</span>},
                    <span class="hljs-string">"mobile"</span>: {<span class="hljs-string">"type"</span>: <span class="hljs-string">"string"</span>},
                    <span class="hljs-string">"tier"</span>: {<span class="hljs-string">"type"</span>: <span class="hljs-string">"string"</span>}
                },
                <span class="hljs-string">"required"</span>: [<span class="hljs-string">"full_name"</span>, <span class="hljs-string">"email_address"</span>, <span class="hljs-string">"mobile"</span>, <span class="hljs-string">"tier"</span>]
            }
        }
    },
    <span class="hljs-string">"required"</span>: [<span class="hljs-string">"customers"</span>]
}
</code></pre>
<ul>
<li><p>The data is an object that must contain a key: <code>"customers"</code>.</p>
</li>
<li><p><code>"customers"</code> must be an <strong>array</strong> (a list), with each object representing one customer entry.</p>
</li>
<li><p>Each customer entry must have four fields–all strings:</p>
<ul>
<li><p><code>"full_name"</code></p>
</li>
<li><p><code>"email_address"</code></p>
</li>
<li><p><code>"mobile"</code></p>
</li>
<li><p><code>"tier"</code></p>
</li>
</ul>
</li>
<li><p>The <code>"required"</code> fields ensure that none of the relevant fields are missing in any customer record.</p>
</li>
</ul>
<h3 id="heading-step-3-load-the-cleaned-json-file">Step 3: Load the cleaned JSON file</h3>
<pre><code class="lang-python"><span class="hljs-keyword">with</span> open(<span class="hljs-string">"transformed_data.json"</span>) <span class="hljs-keyword">as</span> f:
    data = json.load(f)
</code></pre>
<h3 id="heading-step-4-validate-the-data">Step 4: Validate the data</h3>
<p>For this step, we’ll use a <code>try...except</code> block to end the process safely, and display a helpful message if the code raises a <code>ValidationError</code>.</p>
<pre><code class="lang-python"><span class="hljs-keyword">try</span>:
    validate(instance=data, schema=schema)
    print(<span class="hljs-string">"JSON is valid."</span>)
<span class="hljs-keyword">except</span> ValidationError <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">"JSON is invalid:"</span>, e.message)
</code></pre>
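<p><code>validate()</code> stops at a single error. If you want to see every problem at once, <code>jsonschema</code> also provides validator classes with an <code>iter_errors()</code> method. The sketch below assumes <code>Draft7Validator</code> is an acceptable choice for this schema, and the two records are hypothetical: the second one has a numeric <code>"mobile"</code> and no <code>"tier"</code>.</p>
<pre><code class="lang-python">from jsonschema import Draft7Validator

# Same structure as the schema above, repeated so the snippet runs on its own
schema = {
    "type": "object",
    "properties": {
        "customers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "full_name": {"type": "string"},
                    "email_address": {"type": "string"},
                    "mobile": {"type": "string"},
                    "tier": {"type": "string"},
                },
                "required": ["full_name", "email_address", "mobile", "tier"],
            },
        }
    },
    "required": ["customers"],
}

# Hypothetical data: the second record is deliberately invalid
data = {
    "customers": [
        {"full_name": "Ada Obi", "email_address": "ada@example.com",
         "mobile": "555-0101", "tier": "gold"},
        {"full_name": "Ben Cole", "email_address": "ben@example.com",
         "mobile": 5550102},
    ]
}

validator = Draft7Validator(schema)
errors = sorted(
    validator.iter_errors(data),
    key=lambda error: [str(part) for part in error.absolute_path],
)

for error in errors:
    location = "/".join(str(part) for part in error.absolute_path)
    print(f"{location}: {error.message}")
</code></pre>
<p>Each error carries the path to the offending record, which makes it easier to find the entry that needs fixing.</p>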
<h2 id="heading-pandas-vs-pure-python-for-data-cleaning">Pandas vs Pure Python for Data Cleaning</h2>
<p>From this tutorial, you can probably tell that using pure Python to clean and restructure JSON is the more straightforward approach. It is fast and ideal for handling small datasets or simple transformations.</p>
<p>But as data grows and becomes more complex, you might need advanced data cleaning methods that Python alone does not provide. In such cases, <code>pandas</code> becomes the better choice. It handles large, complex datasets effectively, providing built-in functions for handling missing data and removing duplicates.</p>
<p>You can study the <a target="_blank" href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Pandas cheatsheet</a> to learn more data manipulation methods.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use Pandas for Data Cleaning and Preprocessing ]]>
                </title>
                <description>
                    <![CDATA[ Steve Lohr of The New York Times said: "Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/data-cleaning-and-preprocessing-with-pandasbdvhj/</link>
                <guid isPermaLink="false">66d4608c733861e3a22a734d</guid>
                
                    <category>
                        <![CDATA[ data ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Oluwadamisi Samuel ]]>
                </dc:creator>
                <pubDate>Tue, 30 Jan 2024 14:55:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2024/01/Cream-Neutral-Minimalist-New-Business-Pitch-Deck-Presentation--1-.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Steve Lohr of The New York Times said: "Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets."</p>
<p>This statement is accurate: data preparation encompasses a series of steps that ensure the data used for data science, machine learning and analysis projects is complete, accurate, unbiased and reliable.</p>
<p>The quality of your dataset plays a pivotal role in the success of your analysis or model. As the saying goes, “garbage in, garbage out”: the reliability of your results depends heavily on the quality of your data.</p>
<p>Raw data collected from various sources is often messy, containing errors, inconsistencies, missing values and outliers. Data cleaning and preprocessing aim to identify and rectify these issues so that model building and data analysis produce accurate, reliable and meaningful results, since wrong conclusions can be costly.</p>
<p>This is where Pandas comes into play. It is a widely used tool for both data cleaning and preprocessing. In this article, we'll delve into the essential concepts of data cleaning and preprocessing using the powerful Python library, Pandas.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-prerequisites">Prerequisites</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-introduction">Introduction</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-data-cleaning">What is Data Cleaning?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-data-processing">What is Data Preprocessing?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-import-the-necessary-libraries">How to Import the Necessary Libraries</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-load-the-dataset">How to Load the Dataset</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-exploratory-data-analysis-eda">Exploratory Data Analysis (EDA)</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-handle-missing-values">How to Handle Missing Values</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-remove-duplicate-records">How to Remove Duplicate Records</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-data-types-and-conversion">Data Types and Conversion</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-encode-categorical-variables">How to Encode Categorical Variables</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-to-handle-outliers">How to Handle Outliers</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-prerequisites">Prerequisites</h2>
<ul>
<li><p>A basic understanding of Python.</p>
</li>
<li><p>Basic understanding of data cleaning.</p>
</li>
</ul>
<h2 id="heading-introduction">Introduction</h2>
<p>Pandas is a popular open-source data manipulation and analysis library for Python. It provides easy-to-use functions needed to work with structured data seamlessly.</p>
<p>Pandas also integrates well with other popular Python libraries, such as NumPy for numerical computing and Matplotlib for data visualization. This makes it a powerful asset for data-driven tasks.</p>
<p>Pandas excels in handling missing data, reshaping datasets, merging and joining multiple datasets, and performing complex operations on data, making it exceptionally useful for data cleaning and manipulation.</p>
<p>At its core, Pandas introduces two key data structures: <code>Series</code> and <code>DataFrame</code>. A <code>Series</code> is a one-dimensional array-like object that can hold any data type, while a <code>DataFrame</code> is a two-dimensional table with labeled axes (rows and columns). These structures allow users to manipulate, clean, and analyze datasets efficiently.</p>
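<p>As a quick illustration of the two structures, here is a small sketch with made-up values. It shows a standalone <code>Series</code>, a <code>DataFrame</code>, and the fact that selecting one column of a <code>DataFrame</code> gives you a <code>Series</code>.</p>
<pre><code class="lang-python">import pandas as pd

# A one-dimensional labeled array (hypothetical values)
ages = pd.Series([34, 27, 45], name="age")

# A two-dimensional table with labeled rows and columns
customers = pd.DataFrame({
    "name": ["Ada", "Ben", "Chi"],
    "age": [34, 27, 45],
})

print(ages.shape)                        # one dimension: number of values
print(customers.shape)                   # two dimensions: rows, columns
print(type(customers["age"]).__name__)   # a single column is itself a Series
</code></pre>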
<h2 id="heading-what-is-data-cleaning">What is Data Cleaning?</h2>
<p>Before we embark on our data adventure with Pandas, let's take a moment to explain the term "data cleaning." Think of it as the digital detox for your dataset, where we tidy up and prioritize accuracy above all else.</p>
<p>Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values within a dataset. It's like preparing your ingredients before cooking; you want everything in order to get the perfect analysis or visualization.</p>
<p>Why bother with data cleaning? Well, imagine trying to analyze sales trends when some entries are missing, or working with a dataset that has duplicate records throwing off your calculations. Not ideal, right?</p>
<p>In this digital detox, we use tools like Pandas to get rid of inconsistencies, straighten out errors, and let the true clarity of your data shine through.</p>
<h2 id="heading-what-is-data-processing">What is Data Preprocessing?</h2>
<p>You may be wondering, "Do data cleaning and data preprocessing mean the same thing?" The answer is no – they do not.</p>
<p>Picture this: you stumble upon an ancient treasure chest buried in the digital sands of your dataset. Data cleaning is like carefully unearthing that chest, dusting off the cobwebs, and ensuring that what's inside is authentic and reliable.</p>
<p>As for data preprocessing, you can think of it as taking that discovered treasure and preparing its contents for public display. It goes beyond cleaning; it's about transforming and optimizing the data for specific analyses or tasks.</p>
<p>Data cleaning is the initial phase of refining your dataset. It makes the data readable and usable with techniques like removing duplicates, handling missing values, and converting data types. Data preprocessing takes this refined data further with more advanced techniques, such as feature engineering, encoding categorical variables, and handling outliers, to achieve better results.</p>
<p>The goal is to turn your dataset into a refined masterpiece, ready for analysis or modeling.</p>
<h2 id="heading-how-to-import-the-necessary-libraries">How to Import the Necessary Libraries</h2>
<p>Before we embark on data cleaning and preprocessing, let's import the <code>Pandas</code> library.</p>
<p>To save time and typing, we often import Pandas as <code>pd</code>. This lets us use the shorter <code>pd.read_csv()</code> instead of <code>pandas.read_csv()</code> for reading CSV files, making our code more concise and readable.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
</code></pre>
<h2 id="heading-how-to-load-the-dataset">How to Load the Dataset</h2>
<p>Start by loading your dataset into a Pandas DataFrame.</p>
<p>In this example, we'll use a hypothetical dataset named <strong>your_dataset.csv</strong>. We will load the dataset into a variable called <code>df</code>.</p>
<pre><code class="lang-py"><span class="hljs-comment">#Replace 'your_dataset.csv' with the actual dataset name or file path</span>
df = pd.read_csv(<span class="hljs-string">'your_dataset.csv'</span>)
</code></pre>
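<p>If you do not have a CSV file at hand, the sketch below loads a small table from an in-memory string instead. The column names and values are hypothetical and only stand in for <strong>your_dataset.csv</strong>.</p>

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for 'your_dataset.csv'
csv_text = """name,age,city
Ana,25,Lagos
Ben,,Accra
Cy,47,Nairobi
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df)
```

<p>The empty field in the second row is read as a missing value (<code>NaN</code>), which is the kind of gap the later sections deal with.</p>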
<h2 id="heading-exploratory-data-analysis-eda">Exploratory Data Analysis (EDA)</h2>
<p>EDA helps you understand the structure and characteristics of your dataset. Pandas provides several methods that give quick insights into the data. You call them on the DataFrame variable, followed by a dot and the method name.</p>
<p>For example:</p>
<ul>
<li><p><code>df.head()</code> returns the first 5 rows of the dataset. You can pass a different number of rows in the parentheses.</p>
</li>
<li><p><code>df.describe()</code> gives summary statistics such as the count, mean, standard deviation, and percentiles of the numerical columns of the Series or DataFrame.</p>
</li>
<li><p><code>df.info()</code> prints the number of columns, column labels, column data types, memory usage, range index, and the number of non-null values in each column.</p>
</li>
</ul>
<p>Here's a code example below:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Display the first few rows of the dataset</span>
print(df.head())

<span class="hljs-comment">#Summary statistics</span>
print(df.describe())

<span class="hljs-comment">#Information about the dataset (info() prints its report itself and returns None)</span>
df.info()
</code></pre>
<h2 id="heading-how-to-handle-missing-values">How to Handle Missing Values</h2>
<p>If you are new to this field, missing values can be a significant source of stress: they come in different formats and can adversely impact your analysis or model.</p>
<p>Many machine learning models cannot be trained on data that has missing or <code>NaN</code> values, and missing values can skew your results during analysis. But do not fret, Pandas provides methods to handle this problem.</p>
<p>One way to do this is by removing the rows with missing values altogether. Another is to fill them in. Code snippet below:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Check for missing values</span>
print(df.isnull().sum())

<span class="hljs-comment">#Drop rows with missing values and place the result in a new variable "df_cleaned"</span>
df_cleaned = df.dropna()

<span class="hljs-comment">#Fill missing values with the mean for numerical columns and place the result in a new variable called df_filled</span>
df_filled = df.fillna(df.mean(numeric_only=<span class="hljs-literal">True</span>))
</code></pre>
<p>But if a large number of rows have missing values, dropping them discards too much data, so this method will be inadequate.</p>
<p>For numerical data, you can simply compute the mean and impute it into the rows that have missing values. Code snippet below:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Replace missing values with the mean of each numerical column</span>
df.fillna(df.mean(numeric_only=<span class="hljs-literal">True</span>), inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment">#If you want to replace missing values in a specific column, you can do it this way:</span>
<span class="hljs-comment">#Replace 'column_name' with the actual column name</span>
df[<span class="hljs-string">'column_name'</span>] = df[<span class="hljs-string">'column_name'</span>].fillna(df[<span class="hljs-string">'column_name'</span>].mean())

<span class="hljs-comment">#Now the numerical columns of df contain no missing values; NaNs have been replaced with the column mean</span>
</code></pre>
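<p>Here is a small worked sketch of these steps on made-up data. Filling a text column with its most frequent value (the mode) is an assumption added here as one common choice. It is not the only option, and the column names are hypothetical.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a text column
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 20.0],
    "brand": ["Kia", "Ford", None, "Kia"],
})

# Count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# Numeric column: fill with the column mean, here (10 + 30 + 20) / 3 = 20
df["price"] = df["price"].fillna(df["price"].mean())

# Text column: fill with the most frequent value (the mode)
df["brand"] = df["brand"].fillna(df["brand"].mode()[0])

print(df)
```

<p>The mean only makes sense for numerical columns, which is why the text column needs a different fill value.</p>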
<h2 id="heading-how-to-remove-duplicate-records">How to Remove Duplicate Records</h2>
<p>Duplicate records can distort your analysis by giving some observations more weight than they deserve, so the results no longer accurately reflect the underlying trends and patterns.</p>
<p>Pandas makes it easy to identify duplicate rows and to remove them, storing the de-duplicated result in a new variable.</p>
<p>Code snippet below:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Identify duplicates</span>
print(df.duplicated().sum())

<span class="hljs-comment">#Remove duplicates</span>
df_no_duplicates = df.drop_duplicates()
</code></pre>
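<p>To see what these calls do, here is a minimal sketch on a made-up orders table. The column names and values are hypothetical.</p>

```python
import pandas as pd

# Hypothetical orders table where the second and third rows are identical
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [9.5, 4.0, 4.0, 7.25],
})

# duplicated() marks every repeat of an earlier row as True
print(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default
df_no_duplicates = df.drop_duplicates().reset_index(drop=True)
print(df_no_duplicates)
```

<p>The <code>reset_index(drop=True)</code> call is optional. It renumbers the rows so the index has no gaps after the duplicates are removed.</p>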
<h2 id="heading-data-types-and-conversion">Data Types and Conversion</h2>
<p>Data type conversion in Pandas is a crucial aspect of data preprocessing, allowing you to ensure that your data is in the appropriate format for analysis or modeling.</p>
<p>Data from various sources is often messy, and some values may have the wrong data type. For example, numerical values may arrive as 'float' or 'string' instead of 'integer'. Mixing up these formats leads to errors and wrong results.</p>
<p>You can convert a column of type <code>int</code> to <code>float</code> with the following code:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Convert 'Column1' to float</span>
df[<span class="hljs-string">'Column1'</span>] = df[<span class="hljs-string">'Column1'</span>].astype(float)

<span class="hljs-comment">#Display updated data types</span>
print(df.dtypes)
</code></pre>
<p>You can use <code>df.dtypes</code> to print column data types.</p>
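<p>When numbers arrive as text, <code>astype(float)</code> raises an error on the first value it cannot parse. The sketch below uses <code>pd.to_numeric</code> with <code>errors="coerce"</code> as one possible alternative. The data is made up for illustration.</p>

```python
import pandas as pd

# Hypothetical column where numbers arrived as text, with one bad entry
df = pd.DataFrame({"Column1": ["1", "2", "three", "4"]})
print(df.dtypes)

# errors="coerce" turns values that cannot be parsed into NaN
df["Column1"] = pd.to_numeric(df["Column1"], errors="coerce")

print(df.dtypes)
print(df)
```

<p>The unparseable entry becomes <code>NaN</code>, so you can handle it afterwards with the missing-value techniques from the earlier section.</p>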
<h2 id="heading-how-to-encode-categorical-variables">How to Encode Categorical Variables</h2>
<p>Categorical (non-numerical) values in your dataset can be just as important as numerical ones for building a good machine learning model.</p>
<p>These could be car brand names in a cars dataset for predicting car prices. But most machine learning algorithms cannot process this data type directly, so it must be converted to numerical data before it can be used.</p>
<p>Pandas provides the <code>get_dummies</code> function, which converts categorical values into a binary numerical format: each category gets its own column that marks whether a row belongs to that category. The algorithm then treats these values as placeholders for categories, not as quantities with an order. In other words, the numbers a brand name is converted to are not interpreted as "1 is greater than 0"; both 1 and 0 simply indicate the presence or absence of a category. Code snippet is shown below:</p>
<pre><code class="lang-py"><span class="hljs-comment">#To convert categorical data from the column "Car_Brand" to numerical data</span>
df_encode = pd.get_dummies(df, columns=[<span class="hljs-string">'Car_Brand'</span>])

<span class="hljs-comment">#The categorical data is converted to binary format of Numerical data</span>
</code></pre>
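<p>As a worked sketch of the encoding, the example below builds a tiny made-up cars table and encodes it. The brands and prices are hypothetical, and <code>dtype=int</code> is passed so the new columns hold 0 and 1, since recent Pandas versions produce <code>True</code>/<code>False</code> by default.</p>

```python
import pandas as pd

# Hypothetical cars data; "Car_Brand" is the categorical column
df = pd.DataFrame({
    "Car_Brand": ["Toyota", "Ford", "Toyota", "Kia"],
    "Price": [20000, 18000, 22000, 15000],
})

# dtype=int gives 0/1 columns instead of True/False
df_encode = pd.get_dummies(df, columns=["Car_Brand"], dtype=int)

print(df_encode.columns.tolist())
print(df_encode)
```

<p>Each brand becomes its own column, and every row has exactly one of those columns set to 1.</p>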
<h2 id="heading-how-to-handle-outliers">How to Handle Outliers</h2>
<p>Outliers are data points that differ significantly from the majority of the data. They can distort statistical measures and adversely affect the performance of machine learning models.</p>
<p>They may be caused by human error or badly handled missing values, or they could be accurate data that simply does not follow the pattern of the rest of the data.</p>
<p>Several steps and methods help you identify and remove outliers:</p>
<ul>
<li><p>Remove NaN values.</p>
</li>
<li><p>Visualize the data before and after removal.</p>
</li>
<li><p>Z-score method (for normally distributed data).</p>
</li>
<li><p>IQR (interquartile range) method, which is more robust when the data is not normally distributed.</p>
</li>
</ul>
<p>The IQR is useful for identifying outliers in a dataset. According to the IQR method, values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.</p>
<p>This rule is based on the observation that, for normally distributed data, nearly all values (about 99.3%) fall within this range.</p>
<p>Here's a code snippet for the IQR method:</p>
<pre><code class="lang-py"><span class="hljs-comment">#Using the quartiles and IQR, compute the outlier bounds and keep only the rows that fall within them</span>
Q1 = df[<span class="hljs-string">"column_name"</span>].quantile(<span class="hljs-number">0.25</span>)
Q3 = df[<span class="hljs-string">"column_name"</span>].quantile(<span class="hljs-number">0.75</span>)
IQR = Q3 - Q1
lower_bound = Q1 - <span class="hljs-number">1.5</span> * IQR
upper_bound = Q3 + <span class="hljs-number">1.5</span> * IQR
df = df[df[<span class="hljs-string">"column_name"</span>].between(lower_bound, upper_bound)]
</code></pre>
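<p>The Z-score method from the list above can be sketched in a similar way. This is a minimal illustration on made-up data. The cutoff of 3 standard deviations is a common convention and an assumption here, not a fixed rule.</p>

```python
import pandas as pd

# Hypothetical measurements: nineteen values near 50 and one extreme value
values = [48, 52, 50, 49, 51, 50, 47, 53, 50, 49,
          51, 50, 48, 52, 50, 49, 51, 50, 50, 120]
df = pd.DataFrame({"column_name": values})

# Z-score: how many standard deviations each value is from the mean
col = df["column_name"]
z_scores = (col - col.mean()) / col.std()

# A cutoff of 3 is a common convention, not a fixed rule
threshold = 3
is_outlier = z_scores.abs().gt(threshold)

outliers = df[is_outlier]
df_no_outliers = df[~is_outlier]

print(outliers)
print(len(df_no_outliers))
```

<p>Because the mean and standard deviation are themselves pulled by extreme values, this approach suits roughly normal data, while the IQR method is less sensitive to them.</p>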
<h2 id="heading-conclusion">Conclusion</h2>
<p>Data cleaning and preprocessing are integral components of any data analysis, data science, or machine learning project. Pandas, with its versatile functions, facilitates these processes efficiently.</p>
<p>By following the concepts outlined in this article, you can ensure that your data is well-prepared for analysis and modeling, ultimately leading to more accurate and reliable results.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Learn Pandas & Python for Data Analysis [Full Course] ]]>
                </title>
                <description>
                    <![CDATA[ Pandas is an open source data analysis and manipulation tool. It is important to learn if you are interested in data science. We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to use Pandas through interactive... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/learn-pandas-for-data-science/</link>
                <guid isPermaLink="false">66b204d9c181ed99dbd2af33</guid>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ youtube ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Beau Carnes ]]>
                </dc:creator>
                <pubDate>Thu, 22 Jun 2023 12:50:51 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/06/pandas.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Pandas is an open source data analysis and manipulation tool. It is important to learn if you are interested in data science.</p>
<p>We just published a course on the freeCodeCamp.org YouTube channel that will teach you how to use Pandas through interactive projects. You will develop 7 projects ranging from the basics of Pandas for Data Analysis, to Data Cleaning and Data Wrangling.</p>
<p>This course targets everyone, from data science enthusiasts to professionals, aiming to refine their skills in data analysis, data cleaning, and data wrangling using Pandas and Python.</p>
<p>Santiago Basulto developed this course. He is knowledgeable and has an engaging teaching style, which makes for an enriching learning experience. Santiago is also the creator of <a target="_blank" href="https://www.datawars.io/">datawars.io</a>, a platform that offers a bunch of interactive data science projects.</p>
<h2 id="heading-a-sneak-peek-into-the-course">A Sneak Peek into the Course</h2>
<p>The course is designed to provide you with hands-on experience through real-life projects. The projects vary in complexity, catering to learners with different skill levels. It’s advisable to try resolving the projects independently and then compare your solutions with Santiago’s.</p>
<p>Below are the projects you will build.</p>
<h3 id="heading-for-beginners">For Beginners:</h3>
<p><strong>DataFrames Practice: Working with English Words</strong>: This project is an excellent entry point for beginners. You will get acquainted with the basics of Pandas DataFrames, focusing on understanding and manipulating their structures, all while working with an extensive dictionary of English words.</p>
<p><strong>Filtering and Sorting with Pokemon Data</strong>: Dive into the captivating world of Pokemon as you perform fundamental data analysis tasks like filtering and sorting. This project is not only educational but also fun, making it perfect for those just starting.</p>
<h3 id="heading-intermediate-level">Intermediate Level:</h3>
<p><strong>The Birthday Paradox in the NBA</strong>: Have you heard of the Birthday Paradox? Discover the answer to an intriguing question: How many people need to be in a room to have a 50% probability that at least two people share a birthday? Apply these insights to explore shared player birthdays within NBA teams.</p>
<p><strong>Matching Strings by Similarity using Levenshtein Distance</strong>: String handling is an integral aspect of data cleaning. This project introduces advanced techniques such as Combinatorics and the Levenshtein distance to detect irregularities in company names.</p>
<p><strong>Data Cleaning with Google Playstore Dataset</strong>: This project is a comprehensive guide to data cleaning. Learn how to identify and rectify null values, duplicate values, outliers, and more, using a dataset scraped from the Google Playstore, which is replete with irregularities.</p>
<h3 id="heading-advanced">Advanced:</h3>
<p><strong>Premier League Match Analysis</strong>: This project is designed for advanced learners. It combines data cleaning with analysis based on grouping operations, using data from the Premier League, the top-tier football league in England.</p>
<p><strong>NBA 2017 Season Analysis: Joining and Groupby Practice</strong>: This project is a test of your data wrangling skills. Learn how to merge different dataframes, clean them, and perform analysis and question/answering tasks using the 2017 NBA statistics dataset.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Data is the new oil, and learning how to refine it is a skill that is in high demand. The Pandas course on freeCodeCamp.org, developed by Santiago Basulto, is a great opportunity to enhance your data science career. You can watch the full course on the <a target="_blank" href="https://youtu.be/gtjxAH8uaP0">freeCodeCamp.org YouTube channel</a> (5-hour watch).</p>
<div class="embed-wrapper">
        <iframe width="560" height="315" src="https://www.youtube.com/embed/gtjxAH8uaP0" style="aspect-ratio: 16 / 9; width: 100%; height: auto;" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" loading="lazy"></iframe></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Pandas Count Rows – How to Get the Number of Rows in a Dataframe ]]>
                </title>
                <description>
                    <![CDATA[ Pandas is a library built on the Python programming language. You can use it to analyze and manipulate data. A dataframe is a two-dimensional data structure in Pandas that organizes data in a tabular format with rows and columns.  In this article, you'... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/pandas-count-rows-how-to-get-the-number-of-rows-in-a-dataframe/</link>
                <guid isPermaLink="false">66b0a32e6428eb897141f888</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ihechikara Abba ]]>
                </dc:creator>
                <pubDate>Fri, 19 May 2023 15:11:58 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/05/nacho-capelo-hMXuZrfmCWM-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Pandas is a library built on the Python programming language. You can use it to analyze and manipulate data.</p>
<p>A dataframe is a two-dimensional data structure in Pandas that organizes data in a tabular format with rows and columns. </p>
<p>In this article, you'll learn how to get the number of rows in a dataframe using the following: </p>
<ul>
<li>The <code>len()</code> function.</li>
<li>The <code>shape</code> attribute.</li>
<li>The <code>index</code> attribute.</li>
<li>The <code>axes</code> attribute. </li>
</ul>
<h2 id="heading-how-to-get-the-number-of-rows-in-a-dataframe-using-the-len-function">How to Get the Number of Rows in a Dataframe Using the <code>len()</code> Function</h2>
<p>You can use the <code>len()</code> function to return the length of an object. With a dataframe, the function returns the number of rows. </p>
<p>Consider the dataframe below:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {
  <span class="hljs-string">"name"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>],
  <span class="hljs-string">"age"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">10</span>, <span class="hljs-number">3</span>]
}

df = pd.DataFrame(data)
df
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>name</td><td>age</td></tr>
</thead>
<tbody>
<tr>
<td>0</td><td>John</td><td>2</td></tr>
<tr>
<td>1</td><td>Jane</td><td>10</td></tr>
<tr>
<td>2</td><td>Jade</td><td>3</td></tr>
</tbody>
</table>
</div><p>In the example above, we created a dataframe with three rows — row 0, 1, and 2. </p>
<p>You can use the <code>len()</code> function to verify the number of rows: </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {
  <span class="hljs-string">"name"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>],
  <span class="hljs-string">"age"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">10</span>, <span class="hljs-number">3</span>]
}

df = pd.DataFrame(data)
df

num_of_rows = len(df)

print(<span class="hljs-string">f"The number of rows is <span class="hljs-subst">{num_of_rows}</span>"</span>)
<span class="hljs-comment"># The number of rows is 3</span>
</code></pre>
<p>In the code above, we passed the dataframe as an argument to the <code>len()</code> function and stored the result in a variable called <code>num_of_rows</code>: </p>
<pre><code class="lang-python">num_of_rows = len(df)
</code></pre>
<p>When <code>num_of_rows</code> was printed, we got a value of 3 (the number of rows).</p>
<h2 id="heading-how-to-get-the-number-of-rows-in-a-dataframe-using-the-shape-attribute">How to Get the Number of Rows in a Dataframe Using the <code>shape</code> Attribute</h2>
<p>The <code>shape</code> attribute returns a tuple with the number of rows and columns in a dataframe.</p>
<p>Here's an example using the same dataframe as in the last section: </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {
  <span class="hljs-string">"name"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>],
  <span class="hljs-string">"age"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">10</span>, <span class="hljs-number">3</span>]
}

df = pd.DataFrame(data)
df

num_of_rows = df.shape

print(num_of_rows)
<span class="hljs-comment"># (3, 2)</span>
</code></pre>
<p>In the code above, a tuple — (3, 2) — was returned when we used the <code>shape</code> attribute on the dataframe: <code>df.shape</code>. </p>
<p>The first value, 3, is the number of rows in the dataframe while the second value, 2, is the number of columns. </p>
<p>Since we're only interested in the number of rows, we can extract just that value using its index in the tuple (remember that index numbers start at 0). That is:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {
  <span class="hljs-string">"name"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>],
  <span class="hljs-string">"age"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">10</span>, <span class="hljs-number">3</span>]
}

df = pd.DataFrame(data)
df

num_of_rows = df.shape[<span class="hljs-number">0</span>]

print(<span class="hljs-string">f"The number of rows is <span class="hljs-subst">{num_of_rows}</span>"</span>)
<span class="hljs-comment"># The number of rows is 3</span>
</code></pre>
<p>Now we're getting just the number of rows using its index in the tuple: <code>df.shape[0]</code>. </p>
<h2 id="heading-how-to-get-the-number-of-rows-in-a-dataframe-using-the-index-attribute">How to Get the Number of Rows in a Dataframe Using the <code>index</code> Attribute</h2>
<p>The <code>index</code> attribute holds the row labels of a dataframe. The number of elements in the index corresponds with the number of rows. </p>
<p>You can do this in two different ways: </p>
<ul>
<li>Using the <code>index</code> attribute's <code>size</code> property. </li>
<li>Passing the <code>index</code> attribute as an argument to the <code>len()</code> function. </li>
</ul>
<p>Here are examples to explain the methods above: </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {
  <span class="hljs-string">"name"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>],
  <span class="hljs-string">"age"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">10</span>, <span class="hljs-number">3</span>]
}

df = pd.DataFrame(data)
df

num_of_rows = df.index.size

print(<span class="hljs-string">f"The number of rows is <span class="hljs-subst">{num_of_rows}</span>"</span>)
<span class="hljs-comment"># The number of rows is 3</span>
</code></pre>
<p>In the example above, we accessed the number of rows in the dataframe using <code>df.index.size</code>.</p>
<p>Without the <code>size</code> property, you'd get a result like this: <code>RangeIndex(start=0, stop=3, step=1)</code>. </p>
<ul>
<li><code>start</code> denotes the first index number.</li>
<li><code>stop</code> denotes where the index ends (exclusive). Because the index starts at 0 and increases by 1, it equals the number of rows in the dataframe. </li>
<li><code>step</code> denotes the way indexes are incremented (indexes are increased by 1 in our case).</li>
</ul>
<p>So the <code>size</code> property is a way of specifying that you're only interested in the number of elements in the index. </p>
<p>Here's another example that uses the <code>len()</code> function: </p>
<pre><code class="lang-python">import pandas as pd

data = {
  "name": ["John", "Jane", "Jade"],
  "age": [2, 10, 3]
}

df = pd.DataFrame(data)
df

num_of_rows = len(df.index)

print(f"The number of rows is {num_of_rows}")
# The number of rows is 3
</code></pre>
<p>In the code above, we passed <code>df.index</code> as an argument to the <code>len()</code> function. This returns the number of rows in the dataframe. </p>
<p>The difference between this example and the previous one is that we're not attaching the <code>size</code> property to <code>df.index</code>. Instead, we're using <code>df.index</code> as the <code>len()</code> function's argument. </p>
<h2 id="heading-how-to-get-the-number-of-rows-in-a-dataframe-using-the-axes-attribute">How to Get the Number of Rows in a Dataframe Using the <code>axes</code> Attribute</h2>
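<p>The <code>axes</code> attribute returns a list with the row index and the column labels of the dataframe. Its first element, <code>df.axes[0]</code>, is the same value that the <code>index</code> attribute returns: <code>RangeIndex(start=0, stop=3, step=1)</code>. </p>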
<p>Similarly, you can return the number of rows using either the <code>size</code> property or the <code>len()</code> function: </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {
  <span class="hljs-string">"name"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>],
  <span class="hljs-string">"age"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">10</span>, <span class="hljs-number">3</span>]
}

df = pd.DataFrame(data)
df

num_of_rows = df.axes[<span class="hljs-number">0</span>].size

print(<span class="hljs-string">f"The number of rows is <span class="hljs-subst">{num_of_rows}</span>"</span>)
<span class="hljs-comment"># The number of rows is 3</span>
</code></pre>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {
  <span class="hljs-string">"name"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>],
  <span class="hljs-string">"age"</span>: [<span class="hljs-number">2</span>, <span class="hljs-number">10</span>, <span class="hljs-number">3</span>]
}

df = pd.DataFrame(data)
df

num_of_rows = len(df.axes[<span class="hljs-number">0</span>])

print(<span class="hljs-string">f"The number of rows is <span class="hljs-subst">{num_of_rows}</span>"</span>)
<span class="hljs-comment"># The number of rows is 3</span>
</code></pre>
<p>The logic in the two code blocks above is the same as in the last section, with <code>df.axes[0]</code> taking the place of <code>df.index</code>: </p>
<ul>
<li><code>df.axes[0].size</code> returns the number of elements/rows in the dataframe. </li>
<li><code>len(df.axes[0])</code> returns the number of rows in the dataframe. </li>
</ul>
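<p>As a quick check, the sketch below collects the approaches from this article in one place and prints the row count each one gives for the same dataframe. The <code>row_counts</code> dictionary is just a helper introduced for this comparison.</p>

```python
import pandas as pd

data = {
  "name": ["John", "Jane", "Jade"],
  "age": [2, 10, 3]
}

df = pd.DataFrame(data)

# Each expression below should give the same row count
row_counts = {
    "len(df)": len(df),
    "df.shape[0]": df.shape[0],
    "df.index.size": df.index.size,
    "len(df.index)": len(df.index),
    "df.axes[0].size": df.axes[0].size,
    "len(df.axes[0])": len(df.axes[0]),
}

for expression, count in row_counts.items():
    print(f"{expression} -> {count}")
```

<p>Since all of them agree, the choice mostly comes down to readability. <code>len(df)</code> is the shortest to write.</p>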
<h2 id="heading-summary">Summary</h2>
<p>In this article, we talked about dataframes in Pandas. They are two-dimensional data structures that organize data in rows and columns. </p>
<p>We saw different methods for getting the number of rows in a dataframe. We discussed the following methods along with code examples to show their application:</p>
<ul>
<li>The <code>len()</code> function.</li>
<li>The <code>shape</code> attribute.</li>
<li>The <code>index</code> attribute.</li>
<li>The <code>axes</code> attribute. </li>
</ul>
<p>Happy coding! You can learn more about Python on <a target="_blank" href="https://ihechikara.com/">my blog</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Python vs Pandas - Difference Between Python and Pandas ]]>
                </title>
                <description>
                    <![CDATA[ The difference between Python and Pandas is a topic that often confuses some beginners in the Python ecosystem.  In this article, you'll learn about the differences between Python and Pandas, and what they are used for. You'll start by learning what ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/difference-between-python-and-pandas/</link>
                <guid isPermaLink="false">66b0a298d7edba94d20b3baf</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ihechikara Abba ]]>
                </dc:creator>
                <pubDate>Tue, 04 Apr 2023 21:46:34 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/04/markus-winkler-IrRbSND5EUc-unsplash.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>The difference between Python and Pandas is a topic that often confuses some beginners in the Python ecosystem. </p>
<p>In this article, you'll learn about the differences between Python and Pandas, and what they are used for.</p>
<p>You'll start by learning what each technology is used for, then you'll see their differences in tabular format. </p>
<p>Let's get started!</p>
<h2 id="heading-what-is-python-used-for">What Is Python Used For?</h2>
<p>Python is a popular high-level, general-purpose programming language. Python has a simple syntax that is easy to read, write and understand.  </p>
<p>It has an active open source community. </p>
<p>Python has a variety of libraries and frameworks that can be used in developing different applications and products. </p>
<p>Here are some of the use cases for Python: </p>
<ul>
<li><strong>Web development</strong>: Python can be used to create full-stack web applications using frameworks like Django and Flask. </li>
<li><strong>Machine Learning</strong>: You can use Python libraries like <a target="_blank" href="https://www.tensorflow.org/">TensorFlow</a>, <a target="_blank" href="https://pytorch.org/">PyTorch</a>, and so on to build machine learning models.</li>
<li><strong>Game Development</strong>: <a target="_blank" href="https://www.renpy.org/">Ren'Py</a>, <a target="_blank" href="https://www.pygame.org/news">Pygame</a>, and <a target="_blank" href="https://www.panda3d.org/">Panda3D</a> are some of the Python frameworks that can be used to build cross-platform games.  </li>
<li><strong>Data Science and Analysis</strong>: There are numerous libraries and frameworks that can be used in data science and analysis like <a target="_blank" href="https://matplotlib.org/">Matplotlib</a>, <a target="_blank" href="https://numpy.org/">NumPy</a>, <a target="_blank" href="https://pandas.pydata.org/">Pandas</a>, and so on. They can also be used for scientific computing. </li>
</ul>
<p>From the use cases above, you should have an idea of the first difference between Python and Pandas — Python is a programming language while Pandas is a Python library. </p>
<h2 id="heading-what-is-pandas-used-for">What Is Pandas Used For?</h2>
<p>Pandas is an open source Python library used for manipulating and analyzing data. </p>
<p>Here are some of the use cases: </p>
<ul>
<li>Data manipulation. </li>
<li>Data analysis. </li>
<li>Data visualization. </li>
<li>Machine learning, and so on. </li>
</ul>
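<p>To make the language-versus-library distinction concrete, here is a minimal sketch that computes the same average twice: once with plain Python built-ins and once with a Pandas Series. The sample values are made up for illustration.</p>

```python
import pandas as pd

# Hypothetical sample data: three ages
ages = [18, 19, 29]

# Plain Python: compute the average with built-in functions
python_mean = sum(ages) / len(ages)

# Pandas: wrap the same values in a Series and call its mean() method
pandas_mean = pd.Series(ages).mean()

print(python_mean, pandas_mean)
```

<p>Both approaches give the same number. Pandas simply packages common data operations as ready-made methods that you call from Python code.</p>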
<h2 id="heading-what-are-the-differences-between-python-and-pandas">What Are the Differences Between Python and Pandas?</h2>
<p>Here are some of the differences between Python and Pandas:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Python</td><td>Pandas</td></tr>
</thead>
<tbody>
<tr>
<td>General-purpose programming language</td><td>Library for data manipulation and analysis</td></tr>
<tr>
<td>Defines the language syntax</td><td>Adds functions, classes, and methods that you call with regular Python syntax</td></tr>
<tr>
<td>Uses data structures like lists, dictionaries, sets, tuples</td><td>Uses data structures like Series and DataFrame</td></tr>
<tr>
<td>Requires additional libraries for data visualization</td><td>Has built-in plotting methods for data visualization, which use Matplotlib by default</td></tr>
</tbody>
</table>
</div><h2 id="heading-summary">Summary</h2>
<p>In this article, we talked about the differences between Python and Pandas. We saw what each of them is used for, and their differences in tabular format. </p>
<p>Python is a general-purpose programming language used in different fields like web development, machine learning, and so on. </p>
<p>Pandas is a Python library used mainly for data manipulation and analysis. </p>
<p>Happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Iterate Over Rows with Pandas – Loop Through a Dataframe ]]>
                </title>
                <description>
                    <![CDATA[ By Shittu Olumide This article provides a comprehensive guide on how to loop through a Pandas DataFrame in Python.  I'll start by introducing the Pandas library and DataFrame data structure. I'll explain the essential characteristics of Pandas, how t... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-iterate-over-rows-with-pandas-loop-through-a-dataframe/</link>
                <guid isPermaLink="false">66d4610751f567b42d9f84c0</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 28 Mar 2023 18:35:22 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/Shittu-Olumide-How-to-Iterate-Over-Rows-with-Pandas---Loop-Through-a-Dataframe.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Shittu Olumide</p>
<p>This article provides a comprehensive guide on how to loop through a Pandas DataFrame in Python. </p>
<p>I'll start by introducing the Pandas library and DataFrame data structure. I'll explain the essential characteristics of Pandas, how to loop through rows in a dataframe, and finally how to loop through columns in a dataframe.</p>
<h2 id="heading-what-is-pandas">What is Pandas?</h2>
<p>Pandas is a popular open-source Python library that's used for data cleaning, analysis, and manipulation. In addition to functions for carrying out operations on those datasets, it offers data structures for effectively storing and handling large and complex datasets. </p>
<p>Some of the essential characteristics of Pandas are:</p>
<ul>
<li><strong>DataFrame and Series Objects</strong>: Pandas provides two primary data structures, DataFrames and Series. A DataFrame stores two-dimensional tabular data, while a Series stores a one-dimensional labeled array, such as a single column or a time series. These data structures are highly efficient and can handle large datasets with ease.</li>
<li><strong>Data Cleaning and Preparation</strong>: Pandas provides a wide range of functions and methods for cleaning and preparing data, including handling missing values, removing duplicates, and transforming data.</li>
<li><strong>Data Analysis and Visualization</strong>: Pandas provides powerful functions for performing data analysis, including statistical functions and grouping and aggregation functions. It also integrates well with other data analysis and visualization libraries in Python, such as Matplotlib and Seaborn.</li>
<li><strong>Data Input and Output</strong>: Pandas provides functions for reading and writing data in a variety of formats, including CSV, Excel, SQL databases, and more.</li>
</ul>
<h2 id="heading-what-is-a-pandas-dataframe">What is a Pandas Dataframe?</h2>
<p>In Pandas, a dataframe is a two-dimensional labeled data structure. It is comparable to a spreadsheet or a SQL table, where data is arranged in rows and columns with a variety of data types in each column.</p>
<p>Since dataframes offer an easy way to store, manipulate, and analyze data, they are frequently used in data science and data analysis applications. Dataframes provide a number of features, including pivoting, grouping, indexing, and filtering, that make it simple to carry out complex operations on data.</p>
<h2 id="heading-how-to-loop-through-rows-in-a-dataframe">How to Loop Through Rows in a Dataframe</h2>
<p>You can loop through rows in a dataframe using the <code>iterrows()</code> method in Pandas. This method allows us to iterate over each row in a dataframe and access its values.</p>
<p>Here's an example:</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># create a dataframe</span>
data = {<span class="hljs-string">'name'</span>: [<span class="hljs-string">'Mike'</span>, <span class="hljs-string">'Doe'</span>, <span class="hljs-string">'James'</span>], <span class="hljs-string">'age'</span>: [<span class="hljs-number">18</span>, <span class="hljs-number">19</span>, <span class="hljs-number">29</span>]}
df = pd.DataFrame(data)

<span class="hljs-comment"># loop through the rows using iterrows()</span>
<span class="hljs-keyword">for</span> index, row <span class="hljs-keyword">in</span> df.iterrows():
    print(row[<span class="hljs-string">'name'</span>], row[<span class="hljs-string">'age'</span>])
</code></pre>
<p>Output:</p>
<pre><code class="lang-bash">Mike 18
Doe 19
James 29
</code></pre>
<p>In this example, we first create a dataframe with two columns, <code>name</code> and <code>age</code>. We then loop through each row in the dataframe using <code>iterrows()</code>, which yields a tuple for each row containing the row's index and a Series object that holds the values for that row.</p>
<p>Within the loop, we can access the values for each column by using the column name as an index on the row object. For example, to access the value for the <code>name</code> column, we use <code>row['name']</code>.</p>
<h2 id="heading-how-to-loop-through-columns-in-a-dataframe">How to Loop Through Columns in a Dataframe</h2>
<p>Looping through columns in a dataframe is a common task in data analysis and manipulation. It's different from the way we loop through rows, though. </p>
<p>Here's an example:</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Create a sample dataframe</span>
df = pd.DataFrame({
    <span class="hljs-string">'A'</span>: [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>],
    <span class="hljs-string">'B'</span>: [<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>],
    <span class="hljs-string">'C'</span>: [<span class="hljs-number">7</span>, <span class="hljs-number">8</span>, <span class="hljs-number">9</span>]
})

<span class="hljs-comment"># Loop through columns using a for loop</span>
<span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> df.columns:
    print(col)
</code></pre>
<p>Output:</p>
<pre><code class="lang-bash">A
B
C
</code></pre>
<p>First, we import the Pandas library using the <code>import pandas as pd</code> statement.</p>
<p>Then, we create a sample dataframe using the <code>pd.DataFrame()</code> function, which takes a dictionary of column names and values as an input.</p>
<p>Next, we loop through the columns of the dataframe using a for loop and the <code>df.columns</code> attribute, which holds the column labels as an <code>Index</code> object. Iterating over it yields each column name.</p>
<p>Inside the loop, we simply print the name of each column using the <code>print()</code> function.</p>
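<p>If you need the values of each column and not just its name, one option is the <code>items()</code> method, which yields the column name together with the column as a Series. The sketch below reuses the sample dataframe from above. The <code>column_totals</code> dictionary is only an illustrative name.</p>

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# items() yields (column name, column values as a Series) pairs
column_totals = {}
for name, values in df.items():
    column_totals[name] = int(values.sum())

print(column_totals)
```

<p>Here, each pass through the loop adds up one column, so <code>column_totals</code> ends up holding the sum of <code>A</code>, <code>B</code>, and <code>C</code>.</p>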
<h2 id="heading-use-cases-for-looping-through-a-dataframe">Use Cases for Looping Through a Dataframe</h2>
<p>Looping through a dataframe is an important technique in data analysis and manipulation, as it allows us to perform operations on each row or column of the dataframe. </p>
<p>You'll loop through dataframes in the following activities:</p>
<ul>
<li>Data Cleaning and Transformation.</li>
<li>Data Analysis.</li>
<li>Data Visualization.</li>
<li>Feature Engineering.</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>By looping through the rows in a dataframe, we can perform operations on each row, such as filtering or transforming the data. </p>
<p>But it's important to note that looping through rows in a dataframe can be slow and inefficient for large datasets. In general, it's often better to use vectorized operations, which work on whole columns at once and are optimized for performance. The <code>apply()</code> method is a convenient alternative to an explicit loop, though it still calls a Python function for each row or column.</p>
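<p>As a small illustration of that point, the sketch below adds one year to every age in two ways: with an <code>iterrows()</code> loop and with a single vectorized expression. The <code>age_next_year</code> column name is made up for this example.</p>

```python
import pandas as pd

df = pd.DataFrame({'name': ['Mike', 'Doe', 'James'], 'age': [18, 19, 29]})

# Row-by-row: build the new values in a Python loop
looped = []
for index, row in df.iterrows():
    looped.append(row['age'] + 1)

# Vectorized: one expression applied to the whole column at once
df['age_next_year'] = df['age'] + 1

print(df)
```

<p>Both versions produce the same values, but the vectorized one avoids creating a Series object for every row.</p>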
<p>Let's connect on <a target="_blank" href="https://www.twitter.com/Shittu_Olumide_">Twitter</a> and on <a target="_blank" href="https://www.linkedin.com/in/olumide-shittu">LinkedIn</a>. You can also subscribe to my <a target="_blank" href="https://www.youtube.com/channel/UCNhFxpk6hGt5uMCKXq0Jl8A">YouTube</a> channel.</p>
<p>Happy Coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Dataframe Drop Column in Pandas – How to Remove Columns from Dataframes ]]>
                </title>
                <description>
                    <![CDATA[ By Shittu Olumide In Pandas, sometimes you'll need to remove columns from a DataFrame for various reasons, such as cleaning data, reducing memory usage, or simplifying analysis. And in this article, I'll show you how to do it. I'll start by introduci... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/dataframe-drop-column-in-pandas-how-to-remove-columns-from-dataframes/</link>
                <guid isPermaLink="false">66d460f2768263422736e8b5</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 27 Mar 2023 18:03:48 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/Shittu-Olumide-Dataframe-Drop-Column-in-Pandas---How-to-Remove-Columns-from-Dataframes.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Shittu Olumide</p>
<p>In Pandas, sometimes you'll need to remove columns from a DataFrame for various reasons, such as cleaning data, reducing memory usage, or simplifying analysis. And in this article, I'll show you how to do it.</p>
<p>I'll start by introducing the <code>.drop()</code> method, which is the primary method for removing columns in Pandas. We'll go through the syntax and parameters of the <code>.drop()</code> method, including how to specify columns to remove and how to control whether the original DataFrame is modified in place or a new DataFrame is returned.</p>
<p>Next, I'll provide an example of how to use the <code>.drop()</code> method to remove columns from a DataFrame.</p>
<h2 id="heading-how-to-use-the-drop-method-in-pandas">How to Use the <code>.drop()</code> Method in Pandas</h2>
<p>The <code>.drop()</code> method is a built-in function in Pandas that allows you to remove one or more rows or columns from a DataFrame. It returns a new DataFrame with the specified rows or columns removed and does not modify the original DataFrame in place, unless you set the <code>inplace</code> parameter to <code>True</code>.</p>
<p>The syntax for using the <code>.drop()</code> method is as follows:</p>
<pre><code class="lang-py">DataFrame.drop(labels=<span class="hljs-literal">None</span>, axis=<span class="hljs-number">0</span>, index=<span class="hljs-literal">None</span>, columns=<span class="hljs-literal">None</span>, level=<span class="hljs-literal">None</span>, inplace=<span class="hljs-literal">False</span>, errors=<span class="hljs-string">'raise'</span>)
</code></pre>
<p>Here, <code>DataFrame</code> refers to the Pandas DataFrame that you want to remove rows or columns from. The parameters you can use with the <code>.drop()</code> method include:</p>
<ul>
<li><code>labels</code>: This parameter specifies the labels or indices of the rows or columns to be removed. You can pass either a single label or index or a list of labels or indices.</li>
<li><code>axis</code>: This parameter specifies whether to remove rows or columns. By default, it is set to <code>0</code>, which means rows are removed. If you want to remove columns, set it to <code>1</code>.</li>
<li><code>index</code> and <code>columns</code>: These parameters are alternatives to using <code>labels</code> with <code>axis</code>. They specify the labels of rows or columns to be removed, respectively.</li>
<li><code>level</code>: For a DataFrame with a hierarchical index (MultiIndex), this parameter specifies the level from which the labels will be removed.</li>
<li><code>inplace</code>: This parameter is a boolean value that determines whether to modify the original DataFrame in place. By default, it is set to <code>False</code>.</li>
<li><code>errors</code>: This parameter specifies how to handle labels that are not found in the DataFrame. By default, it is set to <code>raise</code>, which means that a <code>KeyError</code> is raised. The other option is <code>ignore</code>, which suppresses the error and drops only the labels that exist.</li>
</ul>
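<p>The parameters above can be sketched in a few lines. This example assumes a small made-up DataFrame, and <code>height</code> is a deliberately nonexistent column used to show what <code>errors='ignore'</code> does.</p>

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30], 'gender': ['F', 'M']})

# labels + axis=1 and columns= are two spellings of the same column drop
by_axis = df.drop('gender', axis=1)
by_columns = df.drop(columns='gender')

# errors='ignore' skips labels that do not exist instead of raising KeyError
# ('height' is a made-up column name that is not in this DataFrame)
ignored = df.drop(columns=['gender', 'height'], errors='ignore')

print(list(by_axis.columns), list(ignored.columns))
```

<p>Because <code>inplace</code> is left at its default of <code>False</code>, <code>df</code> itself still has all three columns after these calls.</p>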
<h2 id="heading-how-to-remove-a-single-column-from-a-dataframe-in-pandas">How to Remove a Single Column from a Dataframe in Pandas</h2>
<p>Let's ease into it by first learning how to remove a single column from a Dataframe before we remove multiple columns.</p>
<p>Code sample:</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># create a sample dataframe</span>
data = {<span class="hljs-string">'name'</span>: [<span class="hljs-string">'Alice'</span>, <span class="hljs-string">'Bob'</span>, <span class="hljs-string">'Charlie'</span>],
        <span class="hljs-string">'age'</span>: [<span class="hljs-number">25</span>, <span class="hljs-number">30</span>, <span class="hljs-number">35</span>],
        <span class="hljs-string">'gender'</span>: [<span class="hljs-string">'F'</span>, <span class="hljs-string">'M'</span>, <span class="hljs-string">'M'</span>]
        }
df = pd.DataFrame(data)

<span class="hljs-comment"># display the original dataframe</span>
print(<span class="hljs-string">'Original DataFrame:\n'</span>, df)

<span class="hljs-comment"># drop the 'gender' column</span>
df = df.drop(columns=[<span class="hljs-string">'gender'</span>])

<span class="hljs-comment"># display the modified dataframe</span>
print(<span class="hljs-string">'Modified DataFrame:\n'</span>, df)
</code></pre>
<p>Output:</p>
<pre><code class="lang-bash">Original DataFrame:
       name  age gender
0    Alice   25      F
1      Bob   30      M
2  Charlie   35      M

Modified DataFrame:
       name  age
0    Alice   25
1      Bob   30
2  Charlie   35
</code></pre>
<h3 id="heading-code-explanation">Code explanation:</h3>
<p>In the example above, we first created a sample DataFrame with three columns – <code>name</code>, <code>age</code>, and <code>gender</code>. We then used the <code>.drop()</code> method with the <code>columns</code> parameter to remove the <code>gender</code> column. The resulting DataFrame only contains the <code>name</code> and <code>age</code> columns.</p>
<p>It's important to note that the <code>.drop()</code> method does not modify the original DataFrame in place. Instead, it returns a new DataFrame with the specified column(s) removed. If you want to modify the original DataFrame, you need to assign the result of the <code>.drop()</code> method back to the original variable, as we did in the example above.</p>
<p>In addition to the <code>columns</code> parameter, the <code>.drop()</code> method also has a number of other optional parameters you can use to control how columns are removed. </p>
<p>For example, you can use the <code>inplace</code> parameter to modify the original DataFrame in place instead of returning a new DataFrame. You can also pass the column names through the <code>labels</code> parameter together with <code>axis=1</code> instead of using the <code>columns</code> parameter.</p>
<h2 id="heading-how-to-remove-multiple-columns-from-a-dataframe-in-pandas">How to Remove Multiple Columns from a Dataframe in Pandas</h2>
<p>In this section we will remove multiple columns from our dataframe. This approach is similar to removing a single column from the dataframe.</p>
<p>To remove two or more columns from a DataFrame using the <code>.drop()</code> method in Pandas, we can pass a list of column names to the <code>columns</code> parameter of the method.</p>
<h3 id="heading-code-sample">Code sample:</h3>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># create a sample dataframe</span>
data = {<span class="hljs-string">'name'</span>: [<span class="hljs-string">'John'</span>, <span class="hljs-string">'Mary'</span>, <span class="hljs-string">'Peter'</span>],
        <span class="hljs-string">'age'</span>: [<span class="hljs-number">30</span>, <span class="hljs-number">25</span>, <span class="hljs-number">35</span>],
        <span class="hljs-string">'gender'</span>: [<span class="hljs-string">'Male'</span>, <span class="hljs-string">'Female'</span>, <span class="hljs-string">'Male'</span>],
        <span class="hljs-string">'city'</span>: [<span class="hljs-string">'New York'</span>, <span class="hljs-string">'London'</span>, <span class="hljs-string">'Paris'</span>]}
df = pd.DataFrame(data)

<span class="hljs-comment"># remove the 'gender' and 'city' columns</span>
df.drop(columns=[<span class="hljs-string">'gender'</span>, <span class="hljs-string">'city'</span>], inplace=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># print the modified dataframe</span>
print(df)
</code></pre>
<p>Output:</p>
<pre><code class="lang-bash">    name  age
0   John   30
1   Mary   25
2  Peter   35
</code></pre>
<h3 id="heading-code-explanation-1">Code explanation:</h3>
<p>In this example, we first create a sample DataFrame with four columns – <code>name</code>, <code>age</code>, <code>gender</code>, and <code>city</code>. Then, we use the <code>.drop()</code> method to remove the <code>gender</code> and <code>city</code> columns by passing a list of their names to the <code>columns</code> parameter. Finally, we set the <code>inplace</code> parameter to <code>True</code> to modify the original DataFrame and print the modified DataFrame.</p>
<p>Note that you can also remove columns by position: slice <code>df.columns</code> to get the labels at those positions and pass them to the <code>columns</code> parameter. For example, to remove the second and third columns, you can use:</p>
<pre><code class="lang-py">df.drop(columns=df.columns[<span class="hljs-number">1</span>:<span class="hljs-number">3</span>], inplace=<span class="hljs-literal">True</span>)
</code></pre>
<p>This will remove the columns at positions 1 and 2. In the original four-column DataFrame from this example, those are the <code>age</code> and <code>gender</code> columns.</p>
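<p>Here is a self-contained sketch of dropping by position, run on a fresh copy of the four-column sample data and returning a new DataFrame instead of using <code>inplace</code>. The names <code>to_remove</code> and <code>trimmed</code> are only illustrative.</p>

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Mary', 'Peter'],
    'age': [30, 25, 35],
    'gender': ['Male', 'Female', 'Male'],
    'city': ['New York', 'London', 'Paris'],
})

# df.columns[1:3] selects the labels at positions 1 and 2 ('age', 'gender')
to_remove = df.columns[1:3]
trimmed = df.drop(columns=to_remove)

print(list(to_remove), list(trimmed.columns))
```

<p>Printing the selected labels before dropping them is a cheap way to confirm that the slice covers the columns you intended.</p>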
<h2 id="heading-conclusion">Conclusion</h2>
<p>I hope this article is a useful resource for anyone working with Pandas DataFrames who needs to remove columns efficiently and effectively.</p>
<p>Let's connect on <a target="_blank" href="https://www.twitter.com/Shittu_Olumide_">Twitter</a> and on <a target="_blank" href="https://www.linkedin.com/in/olumide-shittu">LinkedIn</a>. You can also subscribe to my <a target="_blank" href="https://www.youtube.com/channel/UCNhFxpk6hGt5uMCKXq0Jl8A">YouTube</a> channel.</p>
<p>Happy Coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Dataframe to CSV – How to Save Pandas Dataframes by Exporting ]]>
                </title>
                <description>
                    <![CDATA[ By Shittu Olumide Pandas is a widely used open-source library in Python for data manipulation and analysis. It provides a range of data structures and functions for working with data, one of which is the DataFrame.  DataFrames are a powerful tool for... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/dataframe-to-csv-how-to-save-pandas-dataframes-by-exporting/</link>
                <guid isPermaLink="false">66d460f43dce891ac3a9681a</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ dataframe ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 24 Mar 2023 18:06:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/Shittu-Olumide-Dataframe-to-CSV---How-to-Save-Pandas-Dataframes-by-Exporting.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Shittu Olumide</p>
<p>Pandas is a widely used open-source library in Python for data manipulation and analysis. It provides a range of data structures and functions for working with data, one of which is the DataFrame. </p>
<p>DataFrames are a powerful tool for storing and analyzing large sets of data, but they can be challenging to work with if they are not saved or exported correctly.</p>
<p>It is common practice in data analysis to export data from Pandas DataFrames into CSV files because it can help conserve time and resources. Due to their portability and ability to be easily read by numerous applications, CSV files are a common file format for storing and distributing tabular data. </p>
<p>Regardless of whether you are a novice or an expert data analyst, this article will walk you through the process of saving Pandas DataFrames into CSV files and give you useful tips on how to do so.</p>
<h2 id="heading-how-to-save-pandas-dataframes-using-the-tocsv-method">How to Save Pandas DataFrames Using the <code>.to_csv()</code> Method</h2>
<p>The <code>.to_csv()</code> method is a built-in function in Pandas that allows you to save a Pandas DataFrame as a CSV file. This method exports the DataFrame into a comma-separated values (CSV) file, which is a simple and widely used format for storing tabular data.</p>
<p>The syntax for using the <code>.to_csv()</code> method is as follows:</p>
<pre><code class="lang-py">DataFrame.to_csv(filename, sep=<span class="hljs-string">','</span>, index=<span class="hljs-literal">False</span>, encoding=<span class="hljs-string">'utf-8'</span>)
</code></pre>
<p>Here, <code>DataFrame</code> refers to the Pandas DataFrame that we want to export, and <code>filename</code> refers to the name of the file that you want to save your data to.</p>
<p>The <code>sep</code> parameter specifies the separator that should be used to separate values in the CSV file. By default, it is set to <code>,</code> for comma-separated values. We can also set it to a different separator like <code>\t</code> for tab-separated values.</p>
<p>The <code>index</code> parameter is a boolean value that determines whether to include the index of the DataFrame in the CSV file. By default, it is set to <code>True</code>, which means the index is written as the first column. The call above passes <code>index=False</code> to leave the index out.</p>
<p>The <code>encoding</code> parameter specifies the character encoding to be used for the CSV file. By default, it is set to <code>utf-8</code>, which is a standard encoding for text files.</p>
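<p>To see what <code>sep</code> and <code>index</code> change without creating any files, you can call <code>to_csv()</code> with no path, in which case it returns the CSV text as a string. The following is a minimal sketch with made-up data.</p>

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emily'], 'Age': [28, 23]})

# With no path given, to_csv() returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)
tab_separated = df.to_csv(sep='\t', index=False)

print(with_index)
print(without_index)
print(tab_separated)
```

<p>The first string starts with an empty header cell followed by the row numbers, while the other two contain only the <code>Name</code> and <code>Age</code> columns.</p>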
<h3 id="heading-code-example">Code example</h3>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Create a sample dataframe</span>
Biodata = {<span class="hljs-string">'Name'</span>: [<span class="hljs-string">'John'</span>, <span class="hljs-string">'Emily'</span>, <span class="hljs-string">'Mike'</span>, <span class="hljs-string">'Lisa'</span>],
        <span class="hljs-string">'Age'</span>: [<span class="hljs-number">28</span>, <span class="hljs-number">23</span>, <span class="hljs-number">35</span>, <span class="hljs-number">31</span>],
        <span class="hljs-string">'Gender'</span>: [<span class="hljs-string">'M'</span>, <span class="hljs-string">'F'</span>, <span class="hljs-string">'M'</span>, <span class="hljs-string">'F'</span>]
        }
df = pd.DataFrame(Biodata)

<span class="hljs-comment"># Save the dataframe to a CSV file</span>
df.to_csv(<span class="hljs-string">'Biodata.csv'</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<h3 id="heading-code-explanation">Code explanation</h3>
<p>Let's break down what each part of this code does:</p>
<ul>
<li><code>import pandas as pd</code>: This imports the Pandas library and assigns it the alias <code>pd</code>, which is a commonly used convention.</li>
<li><code>Biodata = {'Name': ['John', 'Emily', 'Mike', 'Lisa'], 'Age': [28, 23, 35, 31], 'Gender': ['M', 'F', 'M', 'F']}</code>: This creates a Python dictionary with the data we want to store in the DataFrame. Each key represents a column in the DataFrame, and its corresponding value is a list of values for that column.</li>
<li><code>df = pd.DataFrame(Biodata)</code>: This creates a Pandas DataFrame from the <code>Biodata</code> dictionary.</li>
<li><code>df.to_csv('Biodata.csv', index=False)</code>: This saves the DataFrame to a CSV file named <code>Biodata.csv</code>. Passing <code>index=False</code> leaves the row index out of the file.</li>
</ul>
<h2 id="heading-other-ways-to-save-pandas-dataframes">Other Ways to Save Pandas DataFrames</h2>
<p>There are several alternative methods to <code>.to_csv()</code> for saving Pandas DataFrames into various file formats, including:</p>
<ol>
<li><code>to_excel()</code>: This method is used to save a DataFrame as an Excel file. </li>
<li><code>to_json()</code>: This method is used to save a DataFrame as a JSON file. </li>
<li><code>to_hdf()</code>: This method is used to save a DataFrame as an HDF5 file, which is a hierarchical data format commonly used in scientific computing.</li>
<li><code>to_sql()</code>: This method is used to save a DataFrame to a SQL database. </li>
<li><code>to_pickle()</code>: This method is used to save a DataFrame as a pickled object, which is a serialized representation of the DataFrame. </li>
</ol>
<p>These alternative methods provide flexibility in choosing the file format that best suits your use case and can be particularly useful for advanced data analysis and sharing.</p>
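<p>As one example of these alternatives, here is a short sketch of <code>to_json()</code>. Like <code>to_csv()</code>, it returns a string when no path is given. The <code>orient='records'</code> setting is just one of several layouts, chosen here for readability.</p>

```python
import json

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emily'], 'Age': [28, 23]})

# With no path given, to_json() returns the JSON text as a string;
# orient='records' produces one object per row
json_text = df.to_json(orient='records')
records = json.loads(json_text)

print(records)
```

<p>Parsing the text back with the standard <code>json</code> module shows that each row became one dictionary.</p>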
<h2 id="heading-conclusion">Conclusion</h2>
<p>Thanks for reading! I hope you now understand how to export your Pandas DataFrames to a CSV file using the built-in <code>to_csv()</code> method. </p>
<p>Let's connect on <a target="_blank" href="https://www.twitter.com/Shittu_Olumide_">Twitter</a> and on <a target="_blank" href="https://www.linkedin.com/in/olumide-shittu">LinkedIn</a>. You can also subscribe to my <a target="_blank" href="https://www.youtube.com/channel/UCNhFxpk6hGt5uMCKXq0Jl8A">YouTube</a> channel.</p>
<p>Happy Coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ pandas.DataFrame.sort_values - How To Sort Values in Pandas ]]>
                </title>
                <description>
                    <![CDATA[ When analyzing and manipulating data using Pandas, you might want to sort the data in a certain order. This makes it easier to understand and visualize data. In this article, you'll learn how to sort data in ascending and descending order using Panda... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-sort-values-in-pandas/</link>
                <guid isPermaLink="false">66b0a2e97cd8dca6718a2243</guid>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ihechikara Abba ]]>
                </dc:creator>
                <pubDate>Mon, 13 Mar 2023 21:52:10 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/sort-in-pandas-1.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>When analyzing and manipulating data using Pandas, you might want to sort the data in a certain order. This makes it easier to understand and visualize data.</p>
<p>In this article, you'll learn how to sort data in ascending and descending order using Pandas' <code>sort_values()</code> method. </p>
<h2 id="heading-how-to-sort-values-in-pandas">How To Sort Values in Pandas</h2>
<p>You can use the <code>sort_values()</code> method to sort values in a data set. By default, the method sorts values in ascending order. </p>
<p>In this section, you'll learn how to sort data in ascending and descending order using the <code>sort_values()</code> method. </p>
<h3 id="heading-how-to-sort-values-in-ascending-order-using-pandas-sortvalues-method">How To Sort Values in Ascending Order Using Pandas <code>sort_values()</code> Method</h3>
<p>The <code>sort_values()</code> method takes in multiple parameters, as can be seen in the <a target="_blank" href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html">Pandas documentation</a>. </p>
<p>We'll focus on the <code>by</code> and <code>ascending</code> parameters. That is:</p>
<pre><code class="lang-txt">DataFrame.sort_values(by, ascending)
</code></pre>
<ul>
<li>The <code>by</code> parameter denotes the name, or list of names, of the column or index level to sort by. </li>
<li><code>ascending</code> is used to specify what order the values should be sorted in. By default, it is set to <code>True</code>. </li>
</ul>
<p>Here's an example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

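<span class="hljs-comment"># create a sample dataframe</span>
data = {<span class="hljs-string">'item'</span>: [<span class="hljs-string">'laptop'</span>, <span class="hljs-string">'monitor'</span>, <span class="hljs-string">'HDMI'</span>, <span class="hljs-string">'speaker'</span>],
        <span class="hljs-string">'cost'</span>: [<span class="hljs-number">500</span>, <span class="hljs-number">300</span>, <span class="hljs-number">700</span>, <span class="hljs-number">600</span>]
       }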

df = pd.DataFrame(data)

df
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>item</td><td>cost</td></tr>
</thead>
<tbody>
<tr>
<td>0</td><td>laptop</td><td>500</td></tr>
<tr>
<td>1</td><td>monitor</td><td>300</td></tr>
<tr>
<td>2</td><td>HDMI</td><td>700</td></tr>
<tr>
<td>3</td><td>speaker</td><td>600</td></tr>
</tbody>
</table>
</div><p>In the table above, we have different items along with the cost of each item. To sort the items in ascending order using their cost, you can do this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {<span class="hljs-string">'item'</span>: [<span class="hljs-string">'laptop'</span>, <span class="hljs-string">'monitor'</span>, <span class="hljs-string">'HDMI'</span>, <span class="hljs-string">'speaker'</span>],
        <span class="hljs-string">'cost'</span>: [<span class="hljs-number">500</span>, <span class="hljs-number">300</span>, <span class="hljs-number">700</span>, <span class="hljs-number">600</span>]
       }

df = pd.DataFrame(data)

sorted_data = df.sort_values(by=<span class="hljs-string">'cost'</span>, ascending=<span class="hljs-literal">True</span>)

sorted_data
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>item</td><td>cost</td></tr>
</thead>
<tbody>
<tr>
<td>1</td><td>monitor</td><td>300</td></tr>
<tr>
<td>0</td><td>laptop</td><td>500</td></tr>
<tr>
<td>3</td><td>speaker</td><td>600</td></tr>
<tr>
<td>2</td><td>HDMI</td><td>700</td></tr>
</tbody>
</table>
</div><p>In the code above, the <code>sort_values()</code> method was used to sort the <code>cost</code> column.</p>
<ul>
<li>Using the <code>by</code> parameter, we specified which column was to be sorted:  <code>by='cost'</code></li>
<li>Using the <code>ascending</code> parameter, we set the order of the data to be sorted: <code>ascending=True</code>. </li>
</ul>
<p>Note that the default order of the <code>sort_values()</code> method is <code>ascending=True</code>. So if you remove the <code>ascending</code> parameter, you'd still have the values sorted in ascending order. </p>
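<p>As a quick check of that default, the sketch below sorts the same sample data with and without the <code>ascending</code> argument and compares the results. The variable names <code>default_sorted</code> and <code>explicit_sorted</code> are only illustrative.</p>
<pre><code class="lang-python">import pandas as pd

df = pd.DataFrame({'item': ['laptop', 'monitor', 'HDMI', 'speaker'],
                   'cost': [500, 300, 700, 600]})

# no ascending argument: the default (ascending=True) applies
default_sorted = df.sort_values(by='cost')

# the same call with the default written out
explicit_sorted = df.sort_values(by='cost', ascending=True)

print(default_sorted.equals(explicit_sorted))  # True
print(default_sorted['cost'].tolist())         # [300, 500, 600, 700]
</code></pre>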
<h3 id="heading-how-to-sort-values-in-descending-order-using-pandas-sortvalues-method">How To Sort Values in Descending Order Using Pandas <code>sort_values()</code> Method</h3>
<p>You can sort values in descending order by simply setting the <code>ascending</code> parameter to <code>False</code>. </p>
<p>We'll work with the same data as in the last section:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

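<span class="hljs-comment"># create a sample dataframe</span>
data = {<span class="hljs-string">'item'</span>: [<span class="hljs-string">'laptop'</span>, <span class="hljs-string">'monitor'</span>, <span class="hljs-string">'HDMI'</span>, <span class="hljs-string">'speaker'</span>],
        <span class="hljs-string">'cost'</span>: [<span class="hljs-number">500</span>, <span class="hljs-number">300</span>, <span class="hljs-number">700</span>, <span class="hljs-number">600</span>]
       }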

df = pd.DataFrame(data)

df
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>item</td><td>cost</td></tr>
</thead>
<tbody>
<tr>
<td>0</td><td>laptop</td><td>500</td></tr>
<tr>
<td>1</td><td>monitor</td><td>300</td></tr>
<tr>
<td>2</td><td>HDMI</td><td>700</td></tr>
<tr>
<td>3</td><td>speaker</td><td>600</td></tr>
</tbody>
</table>
</div><p>Here's the code for sorting the <code>cost</code> column in descending order:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {<span class="hljs-string">'item'</span>: [<span class="hljs-string">'laptop'</span>, <span class="hljs-string">'monitor'</span>, <span class="hljs-string">'HDMI'</span>, <span class="hljs-string">'speaker'</span>],
        <span class="hljs-string">'cost'</span>: [<span class="hljs-number">500</span>, <span class="hljs-number">300</span>, <span class="hljs-number">700</span>, <span class="hljs-number">600</span>]
       }

df = pd.DataFrame(data)

sorted_data = df.sort_values(by=<span class="hljs-string">'cost'</span>, ascending=<span class="hljs-literal">False</span>)

sorted_data
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>item</td><td>cost</td></tr>
</thead>
<tbody>
<tr>
<td>2</td><td>HDMI</td><td>700</td></tr>
<tr>
<td>3</td><td>speaker</td><td>600</td></tr>
<tr>
<td>0</td><td>laptop</td><td>500</td></tr>
<tr>
<td>1</td><td>monitor</td><td>300</td></tr>
</tbody>
</table>
</div><p>By setting the value of the <code>ascending</code> parameter to <code>False</code>, we've sorted the data by cost in descending order. </p>
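<p>In the sorted table, each row keeps its original index label (2, 3, 0, 1). If you'd rather have a fresh 0-based index, one option is the <code>ignore_index</code> parameter of <code>sort_values()</code>. The sketch below uses it with the same sample data, and also shows that the original <code>df</code> is left untouched because <code>sort_values()</code> returns a new Dataframe.</p>
<pre><code class="lang-python">import pandas as pd

df = pd.DataFrame({'item': ['laptop', 'monitor', 'HDMI', 'speaker'],
                   'cost': [500, 300, 700, 600]})

# sort by cost, highest first, and renumber the rows from 0
sorted_data = df.sort_values(by='cost', ascending=False, ignore_index=True)

print(sorted_data['item'].tolist())  # ['HDMI', 'speaker', 'laptop', 'monitor']
print(sorted_data.index.tolist())    # [0, 1, 2, 3]

# the original dataframe keeps its original order
print(df['cost'].tolist())           # [500, 300, 700, 600]
</code></pre>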
<h2 id="heading-summary">Summary</h2>
<p>In this article, we learned about sorting values in Pandas using the <code>sort_values()</code> method.</p>
<p>We saw two code examples on how to sort data in Pandas in ascending or descending order.</p>
<p>You can use the <code>sort_values()</code> method's <code>ascending</code> parameter to sort data in ascending or descending order. </p>
<p>Happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Pandas round() Method – How To Round a Float in Pandas ]]>
                </title>
                <description>
                    <![CDATA[ You can use the Pandas library in Python to manipulate and analyze data. In most cases, it is used for manipulating and analyzing tabular data.  In this article, you'll learn how to use the Pandas round() method to round a float value to a specified ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-round-a-float-in-pandas/</link>
                <guid isPermaLink="false">66b0a2e55e73cf343a5cc012</guid>
                
                    <category>
                        <![CDATA[ float ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ihechikara Abba ]]>
                </dc:creator>
                <pubDate>Mon, 13 Mar 2023 21:49:09 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/03/round-float-in-pandas.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>You can use the Pandas library in Python to manipulate and analyze data. In most cases, it is used for manipulating and analyzing tabular data. </p>
<p>In this article, you'll learn how to use the Pandas <code>round()</code> method to round a float value to a specified number of decimal places. </p>
<p>We'll begin by looking at the method's syntax, and then see some practical code applications. </p>
<h2 id="heading-pandas-round-method-example">Pandas <code>round()</code> Method Example</h2>
<p>Here's what the syntax for the <code>round()</code> method looks like:</p>
<pre><code class="lang-txt">DataFrame.round(decimals)
</code></pre>
<p>The <strong>decimals</strong> parameter represents the number of decimal places a number should be rounded to.</p>
<p>The number of decimal places to be returned is passed in as a parameter. For example, <code>round(2)</code> rounds a number to two decimal places. </p>
<p>Here's an example to demonstrate:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {<span class="hljs-string">'cost'</span>:[<span class="hljs-number">20.5550</span>, <span class="hljs-number">21.03535</span>, <span class="hljs-number">19.67373</span>, <span class="hljs-number">18.233233</span>]}

df = pd.DataFrame(data)

df[<span class="hljs-string">'rounded_cost'</span>] = df[<span class="hljs-string">'cost'</span>].round(<span class="hljs-number">2</span>)
print(df)
</code></pre>
<p>In the code above, we have a list of numbers that fall under the <code>cost</code> column. The column had these values: [20.5550, 21.03535, 19.67373, 18.233233]. </p>
<p>Using the <code>round()</code> method, we rounded the values to 2 decimal places: <code>df['cost'].round(2)</code>. </p>
<p>The return values were stored in a column called <code>rounded_cost</code>. </p>
<p>Here's the output of the code:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>cost</td><td>rounded_cost</td></tr>
</thead>
<tbody>
<tr>
<td>0</td><td>20.555000</td><td>20.56</td></tr>
<tr>
<td>1</td><td>21.035350</td><td>21.04</td></tr>
<tr>
<td>2</td><td>19.673730</td><td>19.67</td></tr>
<tr>
<td>3</td><td>18.233233</td><td>18.23</td></tr>
</tbody>
</table>
</div><p>From the table above, you can see that the values in the <code>cost</code> column have been rounded to 2 decimal places in the <code>rounded_cost</code> column. </p>
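<p>The example above rounds a single column. <code>DataFrame.round()</code> can also round a whole Dataframe at once, and it accepts a dictionary when different columns need a different number of decimal places. Here's a minimal sketch of that, using a made-up <code>tax</code> column that is not part of the example above:</p>
<pre><code class="lang-python">import pandas as pd

# 'tax' is a hypothetical column added only for this illustration
df = pd.DataFrame({'cost': [20.126, 21.03535],
                   'tax': [1.23456, 2.98765]})

# round every column to 2 decimal places
print(df.round(2))

# round each column to its own number of decimal places
rounded = df.round({'cost': 1, 'tax': 3})

print(rounded['cost'].tolist())  # [20.1, 21.0]
print(rounded['tax'].tolist())   # [1.235, 2.988]
</code></pre>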
<h2 id="heading-summary">Summary</h2>
<p>In this article, we have learned about rounding float values with Pandas using the <code>round()</code> method. </p>
<p>We started by looking at the syntax of the <code>round()</code> method. We then saw an example using the method to round float values to a specified number of decimal places. </p>
<p>Happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use the Pandas DataFrame Groupby Method ]]>
                </title>
                <description>
                    <![CDATA[ By Faith Oyama Pandas is a fast and approachable open-source library in Python built for analyzing and manipulating data.  This library has a lot of functions and methods to expedite the data analysis process. One of my favorites is the groupby metho... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/pandas-dataframe-groupby-method/</link>
                <guid isPermaLink="false">66d45edf4a7504b7409c33df</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 25 Jan 2023 21:32:38 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/01/Pandas-Groupby---1.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Faith Oyama</p>
<p>Pandas is a fast and approachable open-source library in Python built for analyzing and manipulating data. </p>
<p>This library has a lot of functions and methods to expedite the data analysis process. One of my favorites is the <code>groupby</code> method, mainly because it lets you get quick insights into your data by transforming, aggregating, and splitting data into various categories.</p>
<p>In this article, you will learn about the Pandas <code>groupby</code> function, how to aggregate data, and group Pandas DataFrames with multiple columns using the <code>groupby</code> method.</p>
<h2 id="heading-what-do-i-need-to-install-on-my-computer-to-follow-this-article"><strong>What do I need to install on my computer to follow this article?</strong></h2>
<p>For this article, I'll be using a Jupyter notebook. You can install Jupyter notebook and get it up and running on your computer via the <a target="_blank" href="https://jupyter.org/install">official website</a>. </p>
<p>After installing Jupyter, create a new notebook and run <code>import pandas as pd</code> to import pandas and <code>import numpy as np</code> to import NumPy.</p>
<p>NumPy will let us work with multi-dimensional arrays and high-level mathematical functions. Pandas, on the other hand, will allow us to manipulate our data and gives us access to the <code>groupby</code> method through <code>df.groupby()</code>.</p>
<p>Let's get started.</p>
<p><img src="https://lh3.googleusercontent.com/wU1odIzH8x7LfrhUHRK-1xDGHWA7NC0WLazO4CCfYku1V3TQtOZvm7r6QaUYGNt4H4MwX-F3mZYq82X4eMg7ZFmSGlO-kkfun2G5-r5MR7len95hg43Qq5z97WxK1_6EC0Z2h6ADCDqIW-BqanEfH4Iou2VFN_RrvK__9cxGzk9_MgS1_bkjS0gwpnPgaQ" alt="Image" width="1157" height="211" loading="lazy">
<em>Importing the required libraries</em></p>
<h2 id="heading-what-is-groupby-in-pandas"><strong>What is <code>groupby</code> in Pandas?</strong></h2>
<p>If you're familiar with <a target="_blank" href="https://www.freecodecamp.org/news/sql-aggregate-functions-how-to-group-by-in-mysql-and-postgresql/">SQL and its GROUP BY syntax</a>, you already know how powerful it is in summarizing and categorizing data. </p>
<p>The Pandas <code>groupby</code> method in Python does the same thing and is great when splitting and categorizing data into groups to analyze your data better. </p>
<p>Here is the syntax for Pandas <code>groupby</code>:</p>
<pre><code class="lang-python">DataFrame.groupby(by=<span class="hljs-literal">None</span>, axis=<span class="hljs-number">0</span>, level=<span class="hljs-literal">None</span>, as_index=<span class="hljs-literal">True</span>, sort=<span class="hljs-literal">True</span>, group_keys=_NoDefault.no_default, squeeze=_NoDefault.no_default, observed=<span class="hljs-literal">False</span>, dropna=<span class="hljs-literal">True</span>)
</code></pre>
<p>Each parameter has a meaning:</p>
<ul>
<li><code>by</code> – List of the columns you want to group by.</li>
<li><code>axis</code> – Defaults to 0. It takes 0 or 'index', 1 or 'columns'.</li>
<li><code>level</code> – Used with MultiIndex.</li>
<li><code>as_index</code> – Defaults to True. Use False for SQL-style grouped output, where the group keys stay as regular columns.</li>
<li><code>sort</code> – Defaults to True. Specify whether to sort after grouping.</li>
<li><code>group_keys</code> – add group keys or not.</li>
<li><code>squeeze</code> – deprecated in new versions.</li>
<li><code>observed</code> – Only use if any of the groupers are Categoricals.</li>
<li><code>dropna</code> – Defaults to True, which drops groups whose key is None/NaN. Use False to keep them as their own group.</li>
</ul>
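<p>To make a couple of these parameters more concrete, here's a small sketch of <code>dropna</code> and <code>as_index</code>. It uses a tiny made-up Dataframe (the rows and the names <code>kept</code> and <code>flat</code> are assumptions for illustration, not the dataset used in the rest of this article):</p>
<pre><code class="lang-python">import pandas as pd

# hypothetical rows; one of them has no payment method recorded
df = pd.DataFrame({'Payment': ['Cash', 'Ewallet', None, 'Cash'],
                   'Quantity': [2, 5, 1, 3]})

# default dropna=True: the row with the missing key is left out
totals = df.groupby('Payment')['Quantity'].sum()
print(totals.index.tolist())  # ['Cash', 'Ewallet']
print(totals.tolist())        # [5, 5]

# dropna=False keeps a separate group for the missing key
kept = df.groupby('Payment', dropna=False)['Quantity'].sum()
print(len(kept))              # 3

# as_index=False returns the group keys as a regular column
flat = df.groupby('Payment', as_index=False)['Quantity'].sum()
print(flat.columns.tolist())  # ['Payment', 'Quantity']
</code></pre>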
<p>Now let's see how this function works in action.</p>
<h2 id="heading-how-to-load-the-dataset"><strong>How to Load the Dataset</strong></h2>
<p>For this tutorial, we'll use the supermarket sales dataset from Kaggle, which you can access and download <a target="_blank" href="https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales">here</a>.</p>
<p>After downloading the dataset, load the data into a pandas dataframe.</p>
<p>A DataFrame is a 2-dimensional data structure made up of rows and columns. This is very similar to your spreadsheet.</p>
<p>You can do that by running this code:</p>
<pre><code class="lang-python">df = pd.read_csv(<span class="hljs-string">r"C:\Users\Double Arkad\Downloads\archive\supermarket_sales - Sheet1.csv"</span>)
</code></pre>
<p>After that, use the <code>df.head()</code>  method to show the first few rows of your dataset. After running <code>df.head()</code>, you should get the result below. This indicates that the dataset got loaded successfully.</p>
<p><img src="https://lh3.googleusercontent.com/DqY0WSe2sJ-_Mh7Yx0sGKndujULCR-RxFSm1RdWCcXHrCEq3UJxC-_3ugFtStAgPeHXgrsttTWb9DtpfFz9C0OmhRGDiyjLxWMhZxY0Fls4nfw3qiNlos6DtyQ35jqyv11afGFvlwDCnFOvgVcZj-yv2aJGFmRc9OwJSzjWhE9oK37uv1SK-3UJN6hkgnQ" alt="Image" width="1157" height="590" loading="lazy">
<em>Dataset has loaded correctly</em></p>
<h2 id="heading-how-to-use-the-groupby-method-in-pandas"><strong>How to Use the <code>groupby</code> Method in Pandas</strong></h2>
<p>Assume your employer asked you to total the number of items ordered and categorize them according to the different payment options. This will let you determine which payment method accounts for the most items sold.</p>
<p>You can answer this question with the <code>groupby</code> function by simply grouping the data based on the 'Payment' column.</p>
<pre><code class="lang-python">df.groupby(<span class="hljs-string">'Payment'</span>)[<span class="hljs-string">'Quantity'</span>].sum()
</code></pre>
<p><img src="https://lh6.googleusercontent.com/UpPjAe1GL7BbIcRGEAWBz2DoY3WCckOlJ1Rs9WObvgft1D02QvXwgnoaBBSE3l7PdeKFlwOnp98YyUGBOYJ16G5c1gncSsH6JPvX3qjQGjqcOR2qEG0i63WOHI8tX0aTTZsmKgTJJ4GsBvr_wvpHzJM4S-3ft5QPP1rCNxQdjCv9sIc1SNKL0lqxHHlnDg" alt="Image" width="1151" height="153" loading="lazy">
<em>Using the <code>sum</code> function with <code>groupby</code></em></p>
<p>The first column, 'Payment', is the column you want to group by. The second column, 'Quantity', is the column you'll perform an aggregate function on. Lastly, you have the aggregate function <code>.sum()</code>.</p>
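<p>If you don't have the dataset at hand, you can try the same pattern on a few made-up rows. The <code>sales</code> Dataframe below is a hypothetical stand-in with only the two columns involved:</p>
<pre><code class="lang-python">import pandas as pd

# hypothetical rows standing in for the supermarket dataset
sales = pd.DataFrame({
    'Payment': ['Cash', 'Ewallet', 'Cash', 'Credit card', 'Ewallet'],
    'Quantity': [7, 5, 3, 8, 6],
})

# group by payment method, pick the Quantity column, then add it up
totals = sales.groupby('Payment')['Quantity'].sum()

print(totals.index.tolist())  # ['Cash', 'Credit card', 'Ewallet']
print(totals.tolist())        # [10, 8, 11]
</code></pre>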
<p><code>sum()</code> is one of many functions you can use with <code>groupby</code>. You could also use other aggregate functions like <code>min()</code>, <code>max()</code>, <code>mean()</code>, <code>median()</code>, and <code>count()</code> to find the minimum, maximum, mean (average), median, and count of the values in each group within your dataset.</p>
<p>But by using the <code>agg()</code> function, you can perform two or more aggregations simultaneously.</p>
<p>Let's see how that works.</p>
<h2 id="heading-how-to-aggregate-data-using-groupby-in-pandas"><strong>How to Aggregate Data Using <code>groupby</code> in Pandas</strong></h2>
<h3 id="heading-pandas-groupby-and-agg"><strong>Pandas <code>groupby</code> and <code>agg()</code></strong></h3>
<p>Here's how to use <code>agg()</code> in a <code>groupby</code> function to find this supermarket's most used payment method.</p>
<pre><code class="lang-python">df.groupby(<span class="hljs-string">'Payment'</span>)[<span class="hljs-string">'Quantity'</span>].agg([np.sum, np.mean])
</code></pre>
<p><img src="https://lh5.googleusercontent.com/1mljwrO9rcXq5YblmNTSwB6U5m2fijHe27GrBvoU2N2_l-d8ZsSS-d6ssm7R4xFjLPf1KU3kp2a6cFkjRcKQotMo02Dg83F6HGieAIk4_jithoFPtC_ErS3ckrwAz68DKXxU258Gbu5PHQ1Qgayvi-YzV78CuNBqRJL8WbFS6CTWFYxM6cfmrTKwDxCLHg" alt="Image" width="1126" height="209" loading="lazy">
<em>Using the <code>agg</code> function with <code>groupby</code></em></p>
<p>There are more cash transactions done. Ewallets and credit card transactions follow in level of use.</p>
<p>Notice here we passed a list of the aggregate functions to be performed to <code>agg()</code>. This simultaneously performed two statistical computations on our data! Of course, you can add more aggregate functions to the list depending on the insights you want to get.</p>
<p>Here is what I mean:</p>
<pre><code class="lang-python">df.groupby([<span class="hljs-string">'Payment'</span>, <span class="hljs-string">'Customer type'</span>])[<span class="hljs-string">'Quantity'</span>].agg([np.sum, np.mean, np.max, np.min])
</code></pre>
<p><img src="https://lh4.googleusercontent.com/-HkvhpGBKdBADDRiPX2uw4RPEAl6a1pRLSwltGU4AvYCVVNHkQHCCf4F1r5H8O8kpE10uU3L0o050fLN0uaa_j1SmBLd7n9oCzAesx8ixgr4Gu9Qaxm4MTm2GyqBoi0RmpI9kTYanVnxyyd8j60c-L9lyrQp2zYtV43LOiEDBAC_-84zrVegboZOWwW8Ow" alt="Image" width="1114" height="198" loading="lazy">
<em>Adding more aggregate functions</em></p>
<p>In the <code>groupby</code> function, we added more aggregate functions to our statistical computation to gain insight into the maximum and the minimum number of goods ordered in each payment group.</p>
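<p><code>agg()</code> also accepts the names of the aggregate functions as strings. Recent pandas releases may show a warning when NumPy functions such as <code>np.sum</code> are passed to <code>agg()</code>, so the string form can be a safer choice there. Here's a minimal sketch on made-up rows (the <code>sales</code> data is an assumption, not the real dataset):</p>
<pre><code class="lang-python">import pandas as pd

# hypothetical rows standing in for the supermarket dataset
sales = pd.DataFrame({
    'Payment': ['Cash', 'Ewallet', 'Cash', 'Credit card', 'Ewallet'],
    'Quantity': [7, 5, 3, 8, 6],
})

# pass the aggregate functions by name instead of as NumPy functions
summary = sales.groupby('Payment')['Quantity'].agg(['sum', 'mean', 'max', 'min'])

print(summary.columns.tolist())      # ['sum', 'mean', 'max', 'min']
print(summary.loc['Cash'].tolist())  # [10.0, 5.0, 7.0, 3.0]
</code></pre>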
<h3 id="heading-pandas-groupby-and-count">Pandas <code>groupby</code> and <code>count()</code></h3>
<p>Here's how it works:</p>
<pre><code class="lang-python">df.groupby(<span class="hljs-string">'Payment'</span>)[<span class="hljs-string">'Quantity'</span>].count()
</code></pre>
<p>And here's the result you get:</p>
<p><img src="https://lh5.googleusercontent.com/J28sp0mmTSkG3TSfMHQ9RRBPZJjEyEbeybfUAS12a7jjJX_oUmrbgUn0yPWWbu2jaOymOAaFossZ1an0EeXW12rFNkT9FRBeUHJ0RmqpPnOrm1FIO3XaGZxOBmdqb-571GwRyPi9q91-gjk-ieJInRLc_a9z8Eb43J05k5BvfknTlCcTOHQYKplfVqCuRA" alt="Image" width="1136" height="148" loading="lazy">
<em>Using the <code>count</code> function with <code>groupby</code></em></p>
<p>From the output, we're counting the total number of orders placed in the store and grouping the results by each payment method.</p>
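<p>One detail worth knowing is that <code>count()</code> only counts non-missing values in the selected column. If you want the number of rows in each group regardless of missing values, <code>size()</code> is an alternative. The sketch below shows the difference on made-up rows with one missing quantity:</p>
<pre><code class="lang-python">import numpy as np
import pandas as pd

# hypothetical rows; one Cash order has no quantity recorded
sales = pd.DataFrame({
    'Payment': ['Cash', 'Cash', 'Ewallet', 'Ewallet'],
    'Quantity': [7, np.nan, 5, 6],
})

grouped = sales.groupby('Payment')['Quantity']

# count() skips the missing value in the Cash group
print(grouped.count().tolist())  # [1, 2]

# size() counts every row in each group
print(grouped.size().tolist())   # [2, 2]
</code></pre>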
<h2 id="heading-how-to-group-pandas-dataframes-by-multiple-columns"><strong>How to Group Pandas DataFrames by Multiple Columns</strong></h2>
<p>You can also group by multiple columns in the <code>groupby</code> function. For example, the code below adds a second column called 'Customer type' to our <code>groupby</code> call.</p>
<pre><code class="lang-python">df.groupby([<span class="hljs-string">'Payment'</span>, <span class="hljs-string">'Customer type'</span>])[<span class="hljs-string">'Quantity'</span>].sum()
</code></pre>
<p><img src="https://lh6.googleusercontent.com/qe2y8j4A3LtUeScZGnALEu8oRzUwGZN4qjePTMhuvVqSlIs64tprnq-mBPb6v71ckMfB5aRZ88Jd948dCv7L5duyYmjI3zFNqp11muRmZzN_SmE5ZX2qENwTIh-U1yaZLKoflVH-1KcW1SRXgOGDtw6-lUCxBBX2Tfmlctrf84Z1CvyZNtFgSJ0j_Hr-nw" alt="Image" width="1139" height="198" loading="lazy">
<em>Grouping multiple columns</em></p>
<p>Our output shows that the data was split and categorized into two groups based on the Customer type column. The output is becoming easier to analyze.</p>
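<p>Grouping by two columns gives a result with a two-level index, one level per grouping column. Here's a small sketch on made-up rows that shows this, and how <code>reset_index()</code> turns the result back into a flat table. The data and the names <code>totals</code> and <code>flat</code> are assumptions for illustration:</p>
<pre><code class="lang-python">import pandas as pd

# hypothetical rows standing in for the supermarket dataset
sales = pd.DataFrame({
    'Payment': ['Cash', 'Cash', 'Ewallet', 'Ewallet', 'Cash'],
    'Customer type': ['Member', 'Normal', 'Member', 'Member', 'Member'],
    'Quantity': [7, 3, 5, 6, 2],
})

totals = sales.groupby(['Payment', 'Customer type'])['Quantity'].sum()

# each index entry is a (Payment, Customer type) pair
print(totals.index.tolist())
# [('Cash', 'Member'), ('Cash', 'Normal'), ('Ewallet', 'Member')]
print(totals.tolist())        # [9, 3, 11]

# move the group keys back into regular columns
flat = totals.reset_index()
print(flat.columns.tolist())  # ['Payment', 'Customer type', 'Quantity']
</code></pre>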
<h2 id="heading-how-to-aggregate-multiple-columns-using-pandas-groupby">How to Aggregate Multiple Columns Using Pandas <code>groupby</code></h2>
<p>You can also perform statistical computations on multiple columns with the <code>groupby</code> function. For example, let's sum the "Quantity" and "Unit price" columns and group our results by the "Payment" and "Customer type" columns.</p>
<p>Run the code:</p>
<pre><code class="lang-python">df.groupby([<span class="hljs-string">'Payment'</span>, <span class="hljs-string">'Customer type'</span>]) [[<span class="hljs-string">'Quantity'</span>,<span class="hljs-string">'Unit price'</span>]].sum()
</code></pre>
<p><img src="https://lh3.googleusercontent.com/-wKU-8DT6enZr5SQhZh_-NrC-IF7a3B4gNABHIeBHccvIne6nryDy5XXAq5zg-jDwHNKHcv78Z76eXXVW8L67ehPgAr15mr7XuLojTP_ElOsbd4fQwOsMW3KozX6XPs_52A99I5b2PEMUCF1xVHisz8n63mEkcHrVKpInehbLfdgqyOIUDX2AeF8OQQRBg" alt="Image" width="1139" height="295" loading="lazy">
<em>Using the <code>sum</code> function with multiple columns grouping</em></p>
<p>We can see from the output that the Payment type "Ewallet" generated the most revenue, and you can move on to determine which type of Customers contributed the most revenue for the Store.</p>
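<p>When the columns need different aggregate functions, you can pass a dictionary to <code>agg()</code> that maps each column to the function you want. This is only a sketch on made-up rows. The choice of summing 'Quantity' and averaging 'Unit price' is an assumption for illustration:</p>
<pre><code class="lang-python">import pandas as pd

# hypothetical rows standing in for the supermarket dataset
sales = pd.DataFrame({
    'Payment': ['Cash', 'Cash', 'Ewallet'],
    'Quantity': [7, 3, 5],
    'Unit price': [10.0, 20.0, 30.0],
})

# total quantity and average unit price for each payment method
result = sales.groupby('Payment').agg({'Quantity': 'sum', 'Unit price': 'mean'})

print(result.loc['Cash', 'Quantity'])    # 10
print(result.loc['Cash', 'Unit price'])  # 15.0
</code></pre>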
<h2 id="heading-summary"><strong>Summary</strong></h2>
<p>In this article, you learned about the importance of the Pandas <code>groupby</code> method. You saw how the <code>groupby</code> function allows you to do a lot of operations on your data, from splitting the data to applying a function like <code>sum()</code> to get more insight and add more functionality. </p>
<p>To learn more about Python and how you can use it for data analysis, I recommend this <a target="_blank" href="https://www.youtube.com/watch?v=r-uOLxNrNk8">Python for data analysis course</a> on the <a target="_blank" href="https://www.freecodecamp.org/news/learn-data-analysis-with-python-course/">freeCodeCamp YouTube channel</a>. </p>
<p>If you enjoyed reading this article and/or have questions and want to connect, you can find me on <a target="_blank" href="https://www.linkedin.com/in/faith-oyama-97b843253/">LinkedIn</a> or <a target="_blank" href="https://twitter.com/kin_kema">Twitter</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Rename a Column in Pandas – Python Pandas Dataframe Renaming Tutorial ]]>
                </title>
                <description>
                    <![CDATA[ A Pandas Dataframe is a 2-dimensional data structure that displays data in tables with rows and columns.  In this article, you'll learn how to rename columns in a Pandas Dataframe by using:  The rename() function. A List. The set_axis() function. H... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-rename-a-column-in-pandas/</link>
                <guid isPermaLink="false">66b0a2e3d7edba94d20b3bb9</guid>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Ihechikara Abba ]]>
                </dc:creator>
                <pubDate>Fri, 13 Jan 2023 18:18:13 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2023/01/how-to-rename-column-in-pandas.svg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>A Pandas Dataframe is a 2-dimensional data structure that displays data in tables with rows and columns. </p>
<p>In this article, you'll learn how to rename columns in a Pandas Dataframe by using: </p>
<ul>
<li>The <code>rename()</code> function.</li>
<li>A List.</li>
<li>The <code>set_axis()</code> function.</li>
</ul>
<h2 id="heading-how-to-rename-a-column-in-pandas-using-the-rename-function">How to Rename a Column in Pandas Using the <code>rename()</code> Function</h2>
<p>In this section, you'll see a practical example of renaming the columns of a Pandas Dataframe using the <code>rename()</code> function. </p>
<p>Let's begin by passing data into a Dataframe object: </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

students = {
    <span class="hljs-string">"firstname"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>], 
    <span class="hljs-string">"lastname"</span>: [<span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Done"</span>, <span class="hljs-string">"Do"</span>]
}

<span class="hljs-comment"># convert student names into a Dataframe</span>
df = pd.DataFrame(students)

print(df)
</code></pre>
<pre><code class="lang-txt"># Output
  firstname lastname
0      John      Doe
1      Jane     Done
2      Jade       Do
</code></pre>
<p>In the example above, we created a Python dictionary which we used to store the <code>firstname</code> and <code>lastname</code> of students. </p>
<p>We then converted the dictionary to a Dataframe by passing it as an argument to the Pandas <code>DataFrame</code> constructor: <code>pd.DataFrame(students)</code>. </p>
<p>When printed to the console, we had this table printed out:</p>
<pre><code class="lang-txt">  firstname lastname
0      John      Doe
1      Jane     Done
2      Jade       Do
</code></pre>
<p>The goal here is to rename the columns. We can do that using the <code>rename()</code> function. </p>
<h3 id="heading-heres-what-the-syntax-looks-like"><strong>Here's what the syntax looks like:</strong></h3>
<pre><code>df.rename(columns={<span class="hljs-string">"OLD_COLUMN_VALUE"</span>: <span class="hljs-string">"NEW_COLUMN_VALUE"</span>})
</code></pre><p>Let's go ahead and change the column names (<code>firstname</code> and <code>lastname</code>) in the table from lowercase to uppercase (<code>FIRSTNAME</code> and <code>LASTNAME</code>). </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

students = {
    <span class="hljs-string">"firstname"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>], 
    <span class="hljs-string">"lastname"</span>: [<span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Done"</span>, <span class="hljs-string">"Do"</span>]
}

<span class="hljs-comment"># convert student names into a Dataframe</span>
df = pd.DataFrame(students)

df.rename(columns={<span class="hljs-string">"firstname"</span>: <span class="hljs-string">"FIRSTNAME"</span>, <span class="hljs-string">"lastname"</span>: <span class="hljs-string">"LASTNAME"</span>}, inplace=<span class="hljs-literal">True</span>)

print(df)
</code></pre>
<pre><code class="lang-txt"># Output
  FIRSTNAME LASTNAME
0      John      Doe
1      Jane     Done
2      Jade       Do
</code></pre>
<p>In the code above, we specified that the columns <code>firstname</code> and <code>lastname</code> should be renamed to <code>FIRSTNAME</code> and <code>LASTNAME</code>, respectively: <code>df.rename(columns={"firstname": "FIRSTNAME", "lastname": "LASTNAME"}, inplace=True)</code></p>
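<p>You'll notice that we added the <code>inplace=True</code> parameter. This applies the new names to <code>df</code> itself. Delete the parameter and you'll see that <code>df</code> keeps its old column names, because <code>rename()</code> then returns a new Dataframe instead of changing <code>df</code>.</p>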
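<p>Here's a short sketch of what happens without <code>inplace=True</code>. The variable name <code>renamed</code> is only illustrative:</p>
<pre><code class="lang-python">import pandas as pd

df = pd.DataFrame({"firstname": ["John", "Jane", "Jade"],
                   "lastname": ["Doe", "Done", "Do"]})

# without inplace=True, rename() returns a new Dataframe
renamed = df.rename(columns={"firstname": "FIRSTNAME", "lastname": "LASTNAME"})

print(df.columns.tolist())       # ['firstname', 'lastname']
print(renamed.columns.tolist())  # ['FIRSTNAME', 'LASTNAME']
</code></pre>
<p>Assigning the result back (<code>df = df.rename(...)</code>) has the same visible effect as using <code>inplace=True</code>.</p>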
<p>You can rename the columns to whatever you want. For instance, we can use <code>SURNAME</code> instead of <code>lastname</code> by doing this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

students = {
    <span class="hljs-string">"firstname"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>], 
    <span class="hljs-string">"lastname"</span>: [<span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Done"</span>, <span class="hljs-string">"Do"</span>]
}

<span class="hljs-comment"># convert student names into a Dataframe</span>
df = pd.DataFrame(students)
df.rename(columns={<span class="hljs-string">"firstname"</span>: <span class="hljs-string">"FIRSTNAME"</span>, <span class="hljs-string">"lastname"</span>: <span class="hljs-string">"SURNAME"</span>}, inplace=<span class="hljs-literal">True</span>)

print(df)
</code></pre>
<pre><code class="lang-txt"># Output
  FIRSTNAME SURNAME
0      John     Doe
1      Jane    Done
2      Jade      Do
</code></pre>
<p>You can change just one column name, too. You are not required to change all the column names at the same time. </p>
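<p>For example, this sketch renames only <code>lastname</code> and leaves <code>firstname</code> as it is:</p>
<pre><code class="lang-python">import pandas as pd

df = pd.DataFrame({"firstname": ["John", "Jane", "Jade"],
                   "lastname": ["Doe", "Done", "Do"]})

# only the columns listed in the dictionary are renamed
renamed = df.rename(columns={"lastname": "SURNAME"})

print(renamed.columns.tolist())  # ['firstname', 'SURNAME']
</code></pre>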
<h2 id="heading-how-to-rename-a-column-in-pandas-using-a-list">How to Rename a Column in Pandas Using a List</h2>
<p>You can access the column names of a Dataframe using <code>df.columns</code>. Consider the table below:</p>
<pre><code class="lang-txt">  firstname lastname
0      John      Doe
1      Jane     Done
2      Jade       Do
</code></pre>
<p>We can print out the column names with the code below:</p>
<pre><code class="lang-python">print(df.columns)

<span class="hljs-comment"># Index(['firstname', 'lastname'], dtype='object')</span>
</code></pre>
<p>Using that, we can rename the columns of a Dataframe. Here's an example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

students = {
    <span class="hljs-string">"firstname"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>], 
    <span class="hljs-string">"lastname"</span>: [<span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Done"</span>, <span class="hljs-string">"Do"</span>]
}

<span class="hljs-comment"># convert student names into a Dataframe</span>
df = pd.DataFrame(students)
df.columns = [<span class="hljs-string">"FIRSTNAME"</span>, <span class="hljs-string">"SURNAME"</span>]

print(df)
</code></pre>
<pre><code class="lang-txt"># Output
  FIRSTNAME SURNAME
0      John     Doe
1      Jane    Done
2      Jade      Do
</code></pre>
<p>In the example above, we put the new column names in a List and assigned it to the Dataframe columns: <code>df.columns = ["FIRSTNAME", "SURNAME"]</code>. </p>
<p>This overrides the previous column names, so the list must contain one name for every column, in the same order as the existing columns. </p>
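<p>One thing to watch for with this approach is the length of the list. The sketch below, which uses a smaller illustrative Dataframe, shows what happens when the list has fewer names than there are columns.</p>

```python
import pandas as pd

df = pd.DataFrame({"firstname": ["John", "Jane"], "lastname": ["Doe", "Done"]})

try:
    # Two columns exist, but only one new name is supplied.
    df.columns = ["FIRSTNAME"]
except ValueError as error:
    mismatch_message = str(error)
    print("Rename failed:", mismatch_message)

# The failed assignment leaves the original names in place.
print(list(df.columns))
```

<p>If you only want to change some of the names, the dictionary form of <code>rename()</code> is usually the safer choice.</p>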
<h2 id="heading-how-to-rename-a-column-in-pandas-using-the-setaxis-function">How to Rename a Column in Pandas Using the <code>set_axis()</code> Function</h2>
<p>The syntax for renaming a column with the <code>set_axis()</code> function looks like this:</p>
<pre><code>df.set_axis([NEW_COLUMN_NAME,...], axis=<span class="hljs-string">"columns"</span>)
</code></pre><p>Here's a code example:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

students = {
    <span class="hljs-string">"firstname"</span>: [<span class="hljs-string">"John"</span>, <span class="hljs-string">"Jane"</span>, <span class="hljs-string">"Jade"</span>], 
    <span class="hljs-string">"lastname"</span>: [<span class="hljs-string">"Doe"</span>, <span class="hljs-string">"Done"</span>, <span class="hljs-string">"Do"</span>]
}

<span class="hljs-comment"># convert student names into a Dataframe</span>
df = pd.DataFrame(students)

df = df.set_axis([<span class="hljs-string">"FIRSTNAME"</span>, <span class="hljs-string">"SURNAME"</span>], axis=<span class="hljs-string">"columns"</span>)

print(df)
</code></pre>
<pre><code class="lang-txt"># Output
  FIRSTNAME SURNAME
0      John     Doe
1      Jane    Done
2      Jade      Do
</code></pre>
<p>Note that <code>set_axis()</code> returns a new Dataframe, so we assign the result back to <code>df</code>. The <code>inplace=True</code> parameter was deprecated for <code>set_axis()</code> in pandas 1.5 and removed in pandas 2.0, so passing it to a recent version raises a <code>TypeError</code>. </p>
<h2 id="heading-summary">Summary</h2>
<p>In this article, we talked about renaming a column in Pandas. </p>
<p>We saw different methods that can be used to rename a Pandas Dataframe column with code examples. </p>
<p>Happy coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ What is Data Analysis? How to Visualize Data with Python, Numpy, Pandas, Matplotlib & Seaborn Tutorial ]]>
                </title>
                <description>
                    <![CDATA[ By Aakash NS Data Analysis is the process of exploring, investigating, and gathering insights from data using statistical measures and visualizations.  The objective of data analysis is to develop an understanding of data by uncovering trends, relati... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/exploratory-data-analysis-with-numpy-pandas-matplotlib-seaborn/</link>
                <guid isPermaLink="false">66d45d5ab3016bf139028cff</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data visualization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Matplotlib ]]>
                    </category>
                
                    <category>
                        <![CDATA[ numpy ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 24 Jun 2021 00:11:01 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/05/blog-cover-4.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Aakash NS</p>
<p>Data Analysis is the process of exploring, investigating, and gathering insights from data using statistical measures and visualizations. </p>
<p>The objective of data analysis is to develop an understanding of data by uncovering trends, relationships, and patterns.</p>
<p>Data analysis is both a science and an art. On the one hand, it requires that you know statistics, visualization techniques, and data analysis tools like Numpy, Pandas, and Seaborn. </p>
<p>On the other hand, it requires that you ask interesting questions to guide the investigation, and then interpret the numbers and figures to generate useful insights.</p>
<p>This tutorial on data analysis covers the following topics:</p>
<ol>
<li><a class="post-section-overview" href="#heading-what-is-numerical-computation-python-and-numpy-for-beginners">What is Numerical Computation? Python and Numpy for Beginners</a></li>
<li><a class="post-section-overview" href="#heading-how-to-analyze-tabular-data-using-python-and-pandas">How to Analyze Tabular Data using Python and Pandas</a></li>
<li><a class="post-section-overview" href="#heading-data-visualization-using-python-matplotlib-and-seaborn">Data Visualization using Python, Matplotlib, and Seaborn</a></li>
</ol>
<h2 id="heading-what-is-numerical-computation-python-and-numpy-for-beginners">What is Numerical Computation? Python and Numpy for Beginners</h2>
<p><img src="https://i.imgur.com/mg8O3kd.png" alt="Image" width="1385" height="480" loading="lazy">
<em>Source: <a target="_blank" href="https://github.com/elegant-scipy/elegant-scipy/blob/master/figures/NumPy_ndarrays_v2.png">Elegant Scipy</a></em></p>
<p>You can follow along with the tutorial and run the code here: <a target="_blank" href="https://jovian.ai/aakashns/python-numerical-computing-with-numpy">https://jovian.ai/aakashns/python-numerical-computing-with-numpy</a></p>
<p>This section covers the following topics:</p>
<ul>
<li>How to work with numerical data in Python</li>
<li>How to turn Python lists into Numpy arrays</li>
<li>Multi-dimensional Numpy arrays and their benefits</li>
<li>Array operations, broadcasting, indexing, and slicing</li>
<li>How to work with CSV data files using Numpy</li>
</ul>
<h3 id="heading-how-to-work-with-numerical-data-in-python">How to Work with Numerical Data in Python</h3>
<p>The "data" in <em>Data Analysis</em> typically refers to numerical data, like stock prices, sales figures, sensor measurements, sports scores, database tables, and so on. </p>
<p>The <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org">Numpy</a> library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why and how to use Numpy to work with numerical data.</p>
<p>Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. </p>
<p>A simple approach to do this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in millimeters), and average relative humidity (in percentage) as a linear equation.</p>
<p><code>yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity</code></p>
<p>We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. </p>
<p>This equation is an approximation, since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.</p>
<p>Based on some statistical analysis of historical data, we might come up with reasonable values for the weights <code>w1</code>, <code>w2</code>, and <code>w3</code>. Here's an example set of values:</p>
<pre><code class="lang-py">w1, w2, w3 = <span class="hljs-number">0.3</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.5</span>
</code></pre>
<p>Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:</p>
<p><img src="https://i.imgur.com/TXPBiqv.png" alt="Image" width="846" height="330" loading="lazy"></p>
<p>To begin, we can define some variables to record climate data for a region.</p>
<pre><code class="lang-py">kanto_temp = <span class="hljs-number">73</span>
kanto_rainfall = <span class="hljs-number">67</span>
kanto_humidity = <span class="hljs-number">43</span>
</code></pre>
<p>We can now substitute these variables into the linear equation to predict the yield of apples.</p>
<pre><code class="lang-py">kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples
<span class="hljs-comment"># 56.8</span>

print(<span class="hljs-string">"The expected yield of apples in Kanto region is {} tons per hectare."</span>.format(kanto_yield_apples))
<span class="hljs-comment"># The expected yield of apples in Kanto region is 56.8 tons per hectare.</span>
</code></pre>
<p>To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, that is, a list of numbers.</p>
<pre><code class="lang-py">kanto = [<span class="hljs-number">73</span>, <span class="hljs-number">67</span>, <span class="hljs-number">43</span>]
johto = [<span class="hljs-number">91</span>, <span class="hljs-number">88</span>, <span class="hljs-number">64</span>]
hoenn = [<span class="hljs-number">87</span>, <span class="hljs-number">134</span>, <span class="hljs-number">58</span>]
sinnoh = [<span class="hljs-number">102</span>, <span class="hljs-number">43</span>, <span class="hljs-number">37</span>]
unova = [<span class="hljs-number">69</span>, <span class="hljs-number">96</span>, <span class="hljs-number">70</span>]
</code></pre>
<p>The three numbers in each vector represent the temperature, rainfall, and humidity data, respectively.</p>
<p>We can also represent the set of weights used in the formula as a vector.</p>
<pre><code class="lang-py">weights = [w1, w2, w3]
</code></pre>
<p>We can now write a function <code>crop_yield</code> to calculate the yield of apples (or any other crop) given the climate data and the respective weights.</p>
<pre><code class="lang-py"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">crop_yield</span>(<span class="hljs-params">region, weights</span>):</span>
    result = <span class="hljs-number">0</span>
    <span class="hljs-keyword">for</span> x, w <span class="hljs-keyword">in</span> zip(region, weights):
        result += x * w
    <span class="hljs-keyword">return</span> result

crop_yield(kanto, weights)
<span class="hljs-comment"># 56.8</span>

crop_yield(johto, weights)
<span class="hljs-comment"># 76.9</span>

crop_yield(unova, weights)
<span class="hljs-comment"># 74.9</span>
</code></pre>
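<p>To apply the same function to every region in one go, one option is to keep the vectors in a dictionary and loop over it. This is only a sketch: the <code>regions</code> and <code>predicted</code> names are illustrative, and <code>crop_yield</code> is rewritten here with the built-in <code>sum</code> so that the block runs on its own.</p>

```python
def crop_yield(region, weights):
    """Return the weighted sum of the climate readings for one region."""
    return sum(x * w for x, w in zip(region, weights))

weights = [0.3, 0.2, 0.5]

# Temperature, rainfall, and humidity for each region, as in the vectors above.
regions = {
    "kanto": [73, 67, 43],
    "johto": [91, 88, 64],
    "hoenn": [87, 134, 58],
    "sinnoh": [102, 43, 37],
    "unova": [69, 96, 70],
}

predicted = {name: crop_yield(data, weights) for name, data in regions.items()}

for name, value in predicted.items():
    print(f"{name}: {value:.1f} tons per hectare")
```

<p>This works, but every multiplication still runs inside a Python loop, which becomes slow as the amount of data grows.</p>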
<h3 id="heading-how-to-turn-python-lists-into-numpy-arrays">How to Turn Python Lists into Numpy Arrays</h3>
<p>The calculation performed by the <code>crop_yield</code> function (element-wise multiplication of two vectors, followed by a sum of the results) is also called the <em>dot product</em>. Learn more about dot products <a target="_blank" href="https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length">here</a>.</p>
<p>The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.</p>
<p>Let's install the Numpy library using the <code>pip</code> package manager.</p>
<pre><code class="lang-py">!pip install numpy --upgrade --quiet
</code></pre>
<p>Next, let's import the <code>numpy</code> module. It's common practice to import numpy with the alias <code>np</code>.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
</code></pre>
<p>We can now use the <code>np.array</code> function to create Numpy arrays.</p>
<pre><code class="lang-py">kanto = np.array([<span class="hljs-number">73</span>, <span class="hljs-number">67</span>, <span class="hljs-number">43</span>])

kanto
<span class="hljs-comment"># array([73, 67, 43])</span>

weights = np.array([w1, w2, w3])

weights
<span class="hljs-comment"># array([0.3, 0.2, 0.5])</span>
</code></pre>
<p>Numpy arrays have the type <code>ndarray</code>.</p>
<pre><code class="lang-py">type(kanto)
<span class="hljs-comment"># numpy.ndarray</span>

type(weights)
<span class="hljs-comment"># numpy.ndarray</span>
</code></pre>
<p>Just like lists, Numpy arrays support the indexing notation <code>[]</code>.</p>
<pre><code class="lang-py">weights[<span class="hljs-number">0</span>]
<span class="hljs-comment"># 0.3</span>

kanto[<span class="hljs-number">2</span>]
<span class="hljs-comment"># 43</span>
</code></pre>
<h3 id="heading-how-to-operate-on-numpy-arrays">How to Operate on Numpy arrays</h3>
<p>We can now compute the dot product of the two vectors using the <code>np.dot</code> function.</p>
<pre><code class="lang-py">np.dot(kanto, weights)
<span class="hljs-comment"># 56.8</span>
</code></pre>
<p>We can achieve the same result with low-level operations supported by Numpy arrays: performing an element-wise multiplication and calculating the resulting numbers' sum.</p>
<pre><code class="lang-py">(kanto * weights).sum()
<span class="hljs-comment"># 56.8</span>
</code></pre>
<p>The <code>*</code> operator performs an element-wise multiplication of two arrays if they have the same size. The <code>sum</code> method calculates the sum of numbers in an array.</p>
<pre><code class="lang-py">arr1 = np.array([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])
arr2 = np.array([<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>])

arr1 * arr2
<span class="hljs-comment"># array([ 4, 10, 18])</span>

arr2.sum()
<span class="hljs-comment"># 15</span>
</code></pre>
<h3 id="heading-what-are-the-benefits-of-using-numpy-arrays">What are the Benefits of Using Numpy Arrays?</h3>
<p>Numpy arrays offer the following benefits over Python lists for operating on numerical data:</p>
<ul>
<li><strong>Ease of use</strong>: You can write small, concise, and intuitive mathematical expressions like <code>(kanto * weights).sum()</code> rather than using loops and custom functions like <code>crop_yield</code>.</li>
<li><strong>Performance</strong>: Numpy operations and functions are implemented internally in compiled C code, which makes them much faster than using Python statements and loops that are interpreted at runtime.</li>
</ul>
<p>Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Python lists</span>
arr1 = list(range(<span class="hljs-number">1000000</span>))
arr2 = list(range(<span class="hljs-number">1000000</span>, <span class="hljs-number">2000000</span>))

<span class="hljs-comment"># Numpy arrays</span>
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

%%time
result = <span class="hljs-number">0</span>
<span class="hljs-keyword">for</span> x1, x2 <span class="hljs-keyword">in</span> zip(arr1, arr2):
    result += x1*x2
result

<span class="hljs-comment"># CPU times: user 300 ms, sys: 3.26 ms, total: 303 ms</span>
<span class="hljs-comment"># Wall time: 302 ms</span>
<span class="hljs-comment"># 833332333333500000</span>

%%time
np.dot(arr1_np, arr2_np)

<span class="hljs-comment"># CPU times: user 2.11 ms, sys: 951 µs, total: 3.07 ms</span>
<span class="hljs-comment"># Wall time: 1.58 ms</span>
<span class="hljs-comment"># 833332333333500000</span>
</code></pre>
<p>In this run, <code>np.dot</code> was well over 100 times faster than the <code>for</code> loop (about 1.6 ms versus 302 ms of wall time). The exact ratio depends on your machine, but the gap makes Numpy especially useful when working with really large datasets with tens of thousands or millions of data points.</p>
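<p>The <code>%%time</code> magic only works inside a Jupyter notebook cell. If you are running a plain Python script, a rough equivalent can be sketched with the standard library's <code>timeit</code> module. The function names and the smaller array size below are illustrative choices, and the timings you see will differ from machine to machine.</p>

```python
import timeit

import numpy as np

n = 100_000
list1 = list(range(n))
list2 = list(range(n, 2 * n))

# int64 is requested explicitly so the large sum does not overflow
# on platforms whose default integer type is 32 bits wide.
arr1_np = np.array(list1, dtype=np.int64)
arr2_np = np.array(list2, dtype=np.int64)

def loop_dot():
    """Dot product computed with a plain Python loop."""
    result = 0
    for x1, x2 in zip(list1, list2):
        result += x1 * x2
    return result

def numpy_dot():
    """Dot product computed by Numpy."""
    return np.dot(arr1_np, arr2_np)

# Each function is called 10 times and the total elapsed time is reported.
loop_seconds = timeit.timeit(loop_dot, number=10)
numpy_seconds = timeit.timeit(numpy_dot, number=10)

print(f"Python loop: {loop_seconds:.4f} s for 10 runs")
print(f"np.dot:      {numpy_seconds:.4f} s for 10 runs")
```

<p>Both functions return the same number, so the only difference you should observe is the running time.</p>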
<h3 id="heading-multi-dimensional-numpy-arrays">Multi-Dimensional Numpy Arrays</h3>
<p>We can now go one step further and represent the climate data for all the regions using a single 2-dimensional Numpy array.</p>
<pre><code class="lang-py">climate_data = np.array([[<span class="hljs-number">73</span>, <span class="hljs-number">67</span>, <span class="hljs-number">43</span>],
                         [<span class="hljs-number">91</span>, <span class="hljs-number">88</span>, <span class="hljs-number">64</span>],
                         [<span class="hljs-number">87</span>, <span class="hljs-number">134</span>, <span class="hljs-number">58</span>],
                         [<span class="hljs-number">102</span>, <span class="hljs-number">43</span>, <span class="hljs-number">37</span>],
                         [<span class="hljs-number">69</span>, <span class="hljs-number">96</span>, <span class="hljs-number">70</span>]])

climate_data
<span class="hljs-comment"># array([[ 73,  67,  43],</span>
<span class="hljs-comment">#        [ 91,  88,  64],</span>
<span class="hljs-comment">#        [ 87, 134,  58],</span>
<span class="hljs-comment">#        [102,  43,  37],</span>
<span class="hljs-comment">#        [ 69,  96,  70]])</span>
</code></pre>
<p>If you've taken a linear algebra class in high school, you may recognize the above 2-d array as a matrix with five rows and three columns. Each row represents one region, and the columns represent temperature, rainfall, and humidity, respectively.</p>
<p>Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the <code>.shape</code> property of an array.</p>
<p><img src="https://fgnt.github.io/python_crashkurs_doc/_images/numpy_array_t.png" alt="Image" width="1440" height="805" loading="lazy">
<em>Source: <a target="_blank" href="https://github.com/elegant-scipy/elegant-scipy/blob/master/figures/NumPy_ndarrays_v2.png">Elegant Scipy</a></em></p>
<pre><code class="lang-py"><span class="hljs-comment"># 2D array (matrix)</span>
climate_data.shape
<span class="hljs-comment"># (5, 3)</span>

weights
<span class="hljs-comment"># array([0.3, 0.2, 0.5])</span>

<span class="hljs-comment"># 1D array (vector)</span>
weights.shape
<span class="hljs-comment"># (3,)</span>

<span class="hljs-comment"># 3D array </span>
arr3 = np.array([
    [[<span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>], 
     [<span class="hljs-number">13</span>, <span class="hljs-number">14</span>, <span class="hljs-number">15</span>]], 
    [[<span class="hljs-number">15</span>, <span class="hljs-number">16</span>, <span class="hljs-number">17</span>], 
     [<span class="hljs-number">17</span>, <span class="hljs-number">18</span>, <span class="hljs-number">19.5</span>]]])

arr3.shape
<span class="hljs-comment"># (2, 2, 3)</span>
</code></pre>
<p>All the elements in a numpy array have the same data type. You can check the data type of an array using the <code>.dtype</code> property.</p>
<pre><code class="lang-py">weights.dtype
<span class="hljs-comment"># dtype('float64')</span>

climate_data.dtype
<span class="hljs-comment"># dtype('int64')</span>
</code></pre>
<p>If an array contains even a single floating point number, all the other elements are also converted to floats.</p>
<pre><code class="lang-py">arr3.dtype
<span class="hljs-comment"># dtype('float64')</span>
</code></pre>
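<p>Here is a small sketch of the same conversion on a one-dimensional array, along with <code>astype</code>, which converts an array to another data type. The array names are illustrative.</p>

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.5])

# The exact integer width depends on the platform, but the kind is always 'i'.
print(ints.dtype.kind)
# A single float in the input is enough to make the whole array float64.
print(mixed.dtype)

# astype returns a converted copy; converting floats to int drops the fractional part.
truncated = mixed.astype(int)
print(truncated)
```

<p>Because <code>astype</code> returns a copy, <code>mixed</code> itself still holds <code>3.5</code> afterwards.</p>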
<p>We can now compute the predicted yields of apples in all the regions, using a single matrix multiplication between <code>climate_data</code> (a 5x3 matrix) and <code>weights</code> (a vector of length 3). Here's what it looks like visually:</p>
<p><img src="https://i.imgur.com/LJ2WKSI.png" alt="Image" width="578" height="334" loading="lazy"></p>
<p>You can learn about matrices and matrix multiplication by watching the first 3-4 videos of <a target="_blank" href="https://www.youtube.com/watch?v=xyAuNHPsq-g&amp;list=PLFD0EB975BA0CC1E0&amp;index=1">this YouTube playlist</a>.</p>
<p>We can use the <code>np.matmul</code> function or the <code>@</code> operator to perform matrix multiplication.</p>
<pre><code class="lang-py">np.matmul(climate_data, weights)
<span class="hljs-comment"># array([56.8, 76.9, 81.9, 57.7, 74.9])</span>

climate_data @ weights
<span class="hljs-comment"># array([56.8, 76.9, 81.9, 57.7, 74.9])</span>
</code></pre>
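<p>One way to convince yourself of what the matrix multiplication does here is to compare it with a dot product computed separately for each row. The following sketch assumes the same five regions and weights as above; <code>row_by_row</code> and <code>all_at_once</code> are illustrative names.</p>

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One dot product per region, collected into a new array.
row_by_row = np.array([np.dot(row, weights) for row in climate_data])

# The same five numbers from a single matrix-vector product.
all_at_once = climate_data @ weights

print(np.allclose(row_by_row, all_at_once))
# True
```

<p>The result has one entry per row of <code>climate_data</code>, which is why its shape is <code>(5,)</code>.</p>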
<h3 id="heading-how-to-work-with-csv-data-files">How to Work with CSV Data Files</h3>
<p>Numpy also provides helper functions for reading from and writing to files. Let's download a file <code>climate.txt</code>, which contains 10,000 climate measurements (temperature, rainfall, and humidity) in the following format:</p>
<pre><code>temperature,rainfall,humidity
<span class="hljs-number">25.00</span>,<span class="hljs-number">76.00</span>,<span class="hljs-number">99.00</span>
<span class="hljs-number">39.00</span>,<span class="hljs-number">65.00</span>,<span class="hljs-number">70.00</span>
<span class="hljs-number">59.00</span>,<span class="hljs-number">45.00</span>,<span class="hljs-number">77.00</span>
<span class="hljs-number">84.00</span>,<span class="hljs-number">63.00</span>,<span class="hljs-number">38.00</span>
<span class="hljs-number">66.00</span>,<span class="hljs-number">50.00</span>,<span class="hljs-number">52.00</span>
<span class="hljs-number">41.00</span>,<span class="hljs-number">94.00</span>,<span class="hljs-number">77.00</span>
<span class="hljs-number">91.00</span>,<span class="hljs-number">57.00</span>,<span class="hljs-number">96.00</span>
<span class="hljs-number">49.00</span>,<span class="hljs-number">96.00</span>,<span class="hljs-number">99.00</span>
<span class="hljs-number">67.00</span>,<span class="hljs-number">20.00</span>,<span class="hljs-number">28.00</span>
...
</code></pre><p>This format of storing data is known as <em>comma-separated values</em> or CSV.</p>
<blockquote>
<p><strong>CSVs</strong>: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)</p>
</blockquote>
<p>To read this file into a numpy array, we can use the <code>genfromtxt</code> function.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> urllib.request

urllib.request.urlretrieve(
    <span class="hljs-string">'https://hub.jovian.ml/wp-content/uploads/2020/08/climate.csv'</span>, 
    <span class="hljs-string">'climate.txt'</span>)

climate_data = np.genfromtxt(<span class="hljs-string">'climate.txt'</span>, delimiter=<span class="hljs-string">','</span>, skip_header=<span class="hljs-number">1</span>)

climate_data
<span class="hljs-comment"># array([[25., 76., 99.],</span>
<span class="hljs-comment">#        [39., 65., 70.],</span>
<span class="hljs-comment">#        [59., 45., 77.],</span>
<span class="hljs-comment">#        ...,</span>
<span class="hljs-comment">#        [99., 62., 58.],</span>
<span class="hljs-comment">#        [70., 71., 91.],</span>
<span class="hljs-comment">#        [92., 39., 76.]])</span>

climate_data.shape
<span class="hljs-comment"># (10000, 3)</span>
</code></pre>
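<p>If you'd like to try <code>genfromtxt</code> without downloading anything, you can feed it an in-memory text buffer instead of a file name. The sketch below uses only the first three rows shown above; <code>csv_text</code> and <code>sample</code> are illustrative names.</p>

```python
import io

import numpy as np

# A tiny stand-in for climate.txt, with the same header and first three rows.
csv_text = """temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
"""

# skip_header=1 drops the line of column names before parsing the numbers.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(sample.shape)
# (3, 3)
print(sample.dtype)
# float64
```

<p>Leaving out <code>skip_header=1</code> would make Numpy try to parse the column names as numbers, which produces a row of <code>nan</code> values.</p>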
<p>We can now perform a matrix multiplication using the <code>@</code> operator to predict the yield of apples for the entire dataset using a given set of weights.</p>
<pre><code class="lang-py">weights = np.array([<span class="hljs-number">0.3</span>, <span class="hljs-number">0.2</span>, <span class="hljs-number">0.5</span>])

yields = climate_data @ weights
yields
<span class="hljs-comment"># array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])</span>

yields.shape
<span class="hljs-comment"># (10000,)</span>
</code></pre>
<p>Let's add the <code>yields</code> to <code>climate_data</code> as a fourth column using the <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org%2Fdoc%2Fstable%2Freference%2Fgenerated%2Fnumpy.concatenate.html"><code>np.concatenate</code></a> function.</p>
<pre><code class="lang-py">climate_results = np.concatenate((climate_data, yields.reshape(<span class="hljs-number">10000</span>, <span class="hljs-number">1</span>)), axis=<span class="hljs-number">1</span>)

climate_results
<span class="hljs-comment"># array([[25. , 76. , 99. , 72.2],</span>
<span class="hljs-comment">#        [39. , 65. , 70. , 59.7],</span>
<span class="hljs-comment">#        [59. , 45. , 77. , 65.2],</span>
<span class="hljs-comment">#        ...,</span>
<span class="hljs-comment">#        [99. , 62. , 58. , 71.1],</span>
<span class="hljs-comment">#        [70. , 71. , 91. , 80.7],</span>
<span class="hljs-comment">#        [92. , 39. , 76. , 73.4]])</span>
</code></pre>
<p>There are a couple of subtleties here:</p>
<ul>
<li>Since we wish to add new columns, we pass the argument <code>axis=1</code> to <code>np.concatenate</code>. The <code>axis</code> argument specifies the dimension for concatenation.</li>
<li>The arrays should have the same number of dimensions, and the same length along every dimension except the one used for concatenation. We use the <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org%2Fdoc%2Fstable%2Freference%2Fgenerated%2Fnumpy.reshape.html"><code>np.reshape</code></a> function to change the shape of <code>yields</code> from <code>(10000,)</code> to <code>(10000,1)</code>.</li>
</ul>
<p>Here's a visual explanation of <code>np.concatenate</code> along <code>axis=1</code> (can you guess what <code>axis=0</code> results in?):</p>
<p><img src="https://www.w3resource.com/w3r_images/python-numpy-image-exercise-58.png" alt="Image" width="576" height="536" loading="lazy">
<em>Source: <a target="_blank" href="https://www.w3resource.com/">w3resource.com</a></em></p>
<p>The best way to understand what a Numpy function does is to experiment with it and read the documentation to learn about its arguments and return values. Try out <code>np.concatenate</code> and <code>np.reshape</code> on a few small arrays of your own.</p>
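<p>As a starting point for that kind of experiment, here is a short sketch with two small arrays. The names <code>a</code>, <code>b</code>, and <code>column</code> are illustrative.</p>

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# axis=1 places the arrays side by side, adding columns.
side_by_side = np.concatenate((a, b), axis=1)
print(side_by_side.shape)
# (2, 4)

# axis=0 places one array below the other, adding rows.
stacked = np.concatenate((a, b), axis=0)
print(stacked.shape)
# (4, 2)

# A 1D array must be reshaped into a single column before it can be appended.
# Passing -1 asks Numpy to work out that dimension from the array's size.
column = np.array([10, 20]).reshape(-1, 1)
with_column = np.concatenate((a, column), axis=1)
print(with_column)
```

<p>Using <code>-1</code> in <code>reshape</code> saves you from hard-coding a length like <code>10000</code>.</p>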
<p>Let's write the final results from our computation above back to a file using the <code>np.savetxt</code> function.</p>
<pre><code class="lang-py">np.savetxt(<span class="hljs-string">'climate_results.txt'</span>, 
           climate_results, 
           fmt=<span class="hljs-string">'%.2f'</span>, 
           delimiter=<span class="hljs-string">','</span>,
           header=<span class="hljs-string">'temperature,rainfall,humidity,yield_apples'</span>, 
           comments=<span class="hljs-string">''</span>)
</code></pre>
<p>The results are written back in the CSV format to the file <code>climate_results.txt</code>.</p>
<pre><code>temperature,rainfall,humidity,yield_apples
<span class="hljs-number">25.00</span>,<span class="hljs-number">76.00</span>,<span class="hljs-number">99.00</span>,<span class="hljs-number">72.20</span>
<span class="hljs-number">39.00</span>,<span class="hljs-number">65.00</span>,<span class="hljs-number">70.00</span>,<span class="hljs-number">59.70</span>
<span class="hljs-number">59.00</span>,<span class="hljs-number">45.00</span>,<span class="hljs-number">77.00</span>,<span class="hljs-number">65.20</span>
<span class="hljs-number">84.00</span>,<span class="hljs-number">63.00</span>,<span class="hljs-number">38.00</span>,<span class="hljs-number">56.80</span>
...
</code></pre><p>Numpy provides hundreds of functions for performing operations on arrays. Here are some commonly used functions:</p>
<ul>
<li>Mathematics: <code>np.sum</code>, <code>np.exp</code>, <code>np.round</code>, arithmetic operators</li>
<li>Array manipulation: <code>np.reshape</code>, <code>np.stack</code>, <code>np.concatenate</code>, <code>np.split</code></li>
<li>Linear Algebra: <code>np.matmul</code>, <code>np.dot</code>, <code>np.transpose</code>, <code>np.linalg.eigvals</code></li>
<li>Statistics: <code>np.mean</code>, <code>np.median</code>, <code>np.std</code>, <code>np.max</code></li>
</ul>
<p><strong>So how do you find the function you need?</strong> The easiest way to find the right function for a specific operation or use-case is to do a web search. For instance, searching for "How to join numpy arrays" leads to <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fcmdlinetips.com%2F2018%2F04%2Fhow-to-concatenate-arrays-in-numpy%2F">this tutorial on array concatenation</a>.</p>
<p>You can find a <a target="_blank" href="https://numpy.org/doc/stable/reference/routines.html">full list of array functions here</a>.</p>
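<p>To give a feel for a few of the functions listed above, here is a brief sketch on made-up data. The values were picked so that the results are easy to check by hand.</p>

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 6.0])

print(np.mean(values))
# 4.0
print(np.median(values))
# 4.0
print(np.max(values))
# 6.0

# np.std computes the population standard deviation by default.
spread = np.std(values)
print(np.round(spread, 3))

# Eigenvalue routines live in the np.linalg submodule.
# For a diagonal matrix, the eigenvalues are the diagonal entries.
matrix = np.array([[2.0, 0.0],
                   [0.0, 3.0]])
eigenvalues = np.linalg.eigvals(matrix)
print(np.sort(eigenvalues))
```

<p>Most of these functions also accept an <code>axis</code> argument, so they can work row by row or column by column on multi-dimensional arrays.</p>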
<h3 id="heading-numpy-arithmetic-operations-broadcasting-and-comparison">Numpy Arithmetic Operations, Broadcasting, and Comparison</h3>
<p>Numpy arrays support arithmetic operators like <code>+</code>, <code>-</code>, <code>*</code>, etc. You can perform an arithmetic operation with a single number (also called a scalar) or with another array of the same shape. </p>
<p>Operators make it easy to write mathematical expressions with multi-dimensional arrays.</p>
<pre><code class="lang-py">arr2 = np.array([[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>], 
                 [<span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>], 
                 [<span class="hljs-number">9</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>]])

arr3 = np.array([[<span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>, <span class="hljs-number">14</span>], 
                 [<span class="hljs-number">15</span>, <span class="hljs-number">16</span>, <span class="hljs-number">17</span>, <span class="hljs-number">18</span>], 
                 [<span class="hljs-number">19</span>, <span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>]])

<span class="hljs-comment"># Adding a scalar</span>
arr2 + <span class="hljs-number">3</span>

<span class="hljs-comment"># array([[ 4,  5,  6,  7],</span>
<span class="hljs-comment">#        [ 8,  9, 10, 11],</span>
<span class="hljs-comment">#        [12,  4,  5,  6]])</span>

<span class="hljs-comment"># Element-wise subtraction</span>
arr3 - arr2

<span class="hljs-comment"># array([[10, 10, 10, 10],</span>
<span class="hljs-comment">#        [10, 10, 10, 10],</span>
<span class="hljs-comment">#        [10, 10, 10, 10]])</span>

<span class="hljs-comment"># Division by scalar</span>
arr2 / <span class="hljs-number">2</span>

<span class="hljs-comment"># array([[0.5, 1. , 1.5, 2. ],</span>
<span class="hljs-comment">#        [2.5, 3. , 3.5, 4. ],</span>
<span class="hljs-comment">#        [4.5, 0.5, 1. , 1.5]])</span>

<span class="hljs-comment"># Element-wise multiplication</span>
arr2 * arr3

<span class="hljs-comment"># array([[ 11,  24,  39,  56],</span>
<span class="hljs-comment">#        [ 75,  96, 119, 144],</span>
<span class="hljs-comment">#        [171,  11,  24,  39]])</span>

<span class="hljs-comment"># Modulus with scalar</span>
arr2 % <span class="hljs-number">4</span>

<span class="hljs-comment"># array([[1, 2, 3, 0],</span>
<span class="hljs-comment">#        [1, 2, 3, 0],</span>
<span class="hljs-comment">#        [1, 1, 2, 3]])</span>
</code></pre>
<h4 id="heading-numpy-array-broadcasting"><strong>Numpy Array Broadcasting</strong></h4>
<p>Numpy arrays also support <em>broadcasting</em>, allowing arithmetic operations between two arrays with different numbers of dimensions but compatible shapes. Let's look at an example to see how it works.</p>
<pre><code class="lang-py">arr2 = np.array([[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>], 
                 [<span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>, <span class="hljs-number">8</span>], 
                 [<span class="hljs-number">9</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>]])               
arr2.shape
<span class="hljs-comment"># (3, 4)</span>

arr4 = np.array([<span class="hljs-number">4</span>, <span class="hljs-number">5</span>, <span class="hljs-number">6</span>, <span class="hljs-number">7</span>])
arr4.shape
<span class="hljs-comment"># (4,)</span>

arr2 + arr4
<span class="hljs-comment"># array([[ 5,  7,  9, 11],</span>
<span class="hljs-comment">#        [ 9, 11, 13, 15],</span>
<span class="hljs-comment">#        [13,  6,  8, 10]])</span>
</code></pre>
<p>When the expression <code>arr2 + arr4</code> is evaluated, <code>arr4</code> (which has the shape <code>(4,)</code>) is replicated three times to match the shape <code>(3, 4)</code> of <code>arr2</code>. Numpy performs the replication without actually creating three copies of the smaller array, which improves performance and uses less memory.</p>
<p><img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/02.05-broadcasting.png" alt="Image" width="432" height="324" loading="lazy">
<em>Source: <a target="_blank" href="https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html">Python Data Science Handbook</a></em></p>
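<p>One way to see that no copies are made is to ask Numpy for the "stretched" version of <code>arr4</code> directly. The sketch below uses <code>np.broadcast_to</code> and <code>np.shares_memory</code>, which are not used elsewhere in this tutorial, and the name <code>stretched</code> is just an illustration.</p>
<pre><code class="lang-py">import numpy as np

arr4 = np.array([4, 5, 6, 7])

# View arr4 as if it had been replicated into 3 rows
stretched = np.broadcast_to(arr4, (3, 4))

stretched.shape
# (3, 4)

# A stride of 0 along the first axis means every "row" reads the same memory
stretched.strides[0]
# 0

# The view shares memory with arr4 instead of holding its own copy
np.shares_memory(stretched, arr4)
# True
</code></pre>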
<p>Broadcasting only works if one of the arrays can be replicated to match the other array's shape.</p>
<pre><code class="lang-py">arr5 = np.array([<span class="hljs-number">7</span>, <span class="hljs-number">8</span>])
arr5.shape
<span class="hljs-comment"># (2,)</span>

arr2 + arr5
<span class="hljs-comment"># ValueError: operands could not be broadcast together with shapes (3,4) (2,)</span>
</code></pre>
<p>In the above example, even if <code>arr5</code> is replicated three times, it will not match the shape of <code>arr2</code>. So <code>arr2 + arr5</code> cannot be evaluated successfully. <a target="_blank" href="https://numpy.org/doc/stable/user/basics.broadcasting.html">Learn more about broadcasting here</a>.</p>
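<p>As a rough rule of thumb, Numpy compares the two shapes starting from the last dimension, and each pair of lengths must either be equal or contain a <code>1</code>. The following sketch, which uses a made-up column vector <code>col</code>, shows a shape of <code>(3, 1)</code> being stretched across the four columns of <code>arr2</code>.</p>
<pre><code class="lang-py">import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])

# Shape (3, 1): one value per row of arr2
col = np.array([[10], [20], [30]])
col.shape
# (3, 1)

# The single column is replicated 4 times to match the shape (3, 4)
result = arr2 + col
result
# array([[11, 12, 13, 14],
#        [25, 26, 27, 28],
#        [39, 31, 32, 33]])
</code></pre>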
<h4 id="heading-numpy-array-comparison"><strong>Numpy Array Comparison</strong></h4>
<p>Numpy arrays also support comparison operations like <code>==</code>, <code>!=</code>, <code>&gt;</code> and so on. The result is an array of booleans.</p>
<pre><code class="lang-py">arr1 = np.array([[<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>], [<span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>]])
arr2 = np.array([[<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>], [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">5</span>]])

arr1 == arr2
<span class="hljs-comment"># array([[False,  True,  True],</span>
<span class="hljs-comment">#        [False, False,  True]])</span>

arr1 != arr2
<span class="hljs-comment"># array([[ True, False, False],</span>
<span class="hljs-comment">#        [ True,  True, False]])</span>

arr1 &gt;= arr2
<span class="hljs-comment"># array([[False,  True,  True],</span>
<span class="hljs-comment">#        [ True,  True,  True]])</span>

arr1 &lt; arr2
<span class="hljs-comment"># array([[ True, False, False],</span>
<span class="hljs-comment">#        [False, False, False]])</span>
</code></pre>
<p>Array comparison is frequently used to count the number of equal elements in two arrays using the <code>sum</code> method. Remember that <code>True</code> evaluates to <code>1</code> and <code>False</code> evaluates to <code>0</code> when you use booleans in arithmetic operations.</p>
<pre><code class="lang-py">(arr1 == arr2).sum()
<span class="hljs-comment"># 3</span>
</code></pre>
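<p>The same boolean array can answer related questions. As a small sketch (the variable names here are only for illustration), <code>mean</code> gives the fraction of matching elements and <code>np.array_equal</code> checks whether every element matches.</p>
<pre><code class="lang-py">import numpy as np

arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

matches = arr1 == arr2

# Fraction of positions where the two arrays agree (3 out of 6)
fraction_equal = matches.mean()
fraction_equal
# 0.5

# True only if the shapes and all the elements are equal
all_equal = np.array_equal(arr1, arr2)
all_equal
# False
</code></pre>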
<h3 id="heading-numpy-array-indexing-and-slicing">Numpy Array Indexing and Slicing</h3>
<p>Numpy extends Python's list indexing notation using <code>[]</code> to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.</p>
<pre><code class="lang-py">arr3 = np.array([
    [[<span class="hljs-number">11</span>, <span class="hljs-number">12</span>, <span class="hljs-number">13</span>, <span class="hljs-number">14</span>], 
     [<span class="hljs-number">13</span>, <span class="hljs-number">14</span>, <span class="hljs-number">15</span>, <span class="hljs-number">19</span>]], 

    [[<span class="hljs-number">15</span>, <span class="hljs-number">16</span>, <span class="hljs-number">17</span>, <span class="hljs-number">21</span>], 
     [<span class="hljs-number">63</span>, <span class="hljs-number">92</span>, <span class="hljs-number">36</span>, <span class="hljs-number">18</span>]], 

    [[<span class="hljs-number">98</span>, <span class="hljs-number">32</span>, <span class="hljs-number">81</span>, <span class="hljs-number">23</span>],      
     [<span class="hljs-number">17</span>, <span class="hljs-number">18</span>, <span class="hljs-number">19.5</span>, <span class="hljs-number">43</span>]]])

arr3.shape
<span class="hljs-comment"># (3, 2, 4)</span>

<span class="hljs-comment"># Single element</span>
arr3[<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>]

<span class="hljs-comment"># 36.0</span>

<span class="hljs-comment"># Subarray using ranges</span>
arr3[<span class="hljs-number">1</span>:, <span class="hljs-number">0</span>:<span class="hljs-number">1</span>, :<span class="hljs-number">2</span>]

<span class="hljs-comment"># array([[[15., 16.]],</span>
<span class="hljs-comment"># </span>
<span class="hljs-comment">#        [[98., 32.]]])</span>

<span class="hljs-comment"># Mixing indices and ranges</span>
arr3[<span class="hljs-number">1</span>:, <span class="hljs-number">1</span>, <span class="hljs-number">3</span>]

<span class="hljs-comment"># array([18., 43.])</span>

arr3[<span class="hljs-number">1</span>:, <span class="hljs-number">1</span>, :<span class="hljs-number">3</span>]
<span class="hljs-comment"># array([[63. , 92. , 36. ],</span>
<span class="hljs-comment">#        [17. , 18. , 19.5]])</span>

<span class="hljs-comment"># Using fewer indices</span>
arr3[<span class="hljs-number">1</span>]

<span class="hljs-comment"># array([[15., 16., 17., 21.],</span>
<span class="hljs-comment">#        [63., 92., 36., 18.]])</span>

arr3[:<span class="hljs-number">2</span>, <span class="hljs-number">1</span>]
<span class="hljs-comment"># array([[13., 14., 15., 19.],</span>
<span class="hljs-comment">#        [63., 92., 36., 18.]])</span>

<span class="hljs-comment"># Using too many indices</span>
arr3[<span class="hljs-number">1</span>,<span class="hljs-number">3</span>,<span class="hljs-number">2</span>,<span class="hljs-number">1</span>]

<span class="hljs-comment"># IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed</span>
</code></pre>
<p>The notation and its results can seem confusing at first, so take your time to experiment and become comfortable with it. </p>
<p>Try out some examples of array indexing and slicing on your own, with different combinations of indices and ranges. Here are some more examples demonstrated visually:</p>
<p><img src="https://scipy-lectures.org/_images/numpy_indexing.png" alt="Image" width="772" height="383" loading="lazy">
<em>Source: <a target="_blank" href="https://scipy-lectures.org/intro/numpy/array_object.html">Scipy Lectures</a></em></p>
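<p>One detail that often causes confusion is the difference between an index and a range that selects a single position. The sketch below reuses <code>arr3</code> from the example above; the names <code>kept</code> and <code>dropped</code> are only for illustration.</p>
<pre><code class="lang-py">import numpy as np

arr3 = np.array([
    [[11, 12, 13, 14],
     [13, 14, 15, 19]],

    [[15, 16, 17, 21],
     [63, 92, 36, 18]],

    [[98, 32, 81, 23],
     [17, 18, 19.5, 43]]])

# A range keeps the dimension, even when it selects a single position
kept = arr3[1:, 0:1, :2]
kept.shape
# (2, 1, 2)

# A plain index drops the dimension
dropped = arr3[1:, 0, :2]
dropped.shape
# (2, 2)

dropped
# array([[15., 16.],
#        [98., 32.]])
</code></pre>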
<h3 id="heading-how-to-create-numpy-arrays-other-methods">How to Create Numpy Arrays – Other Methods</h3>
<p>Numpy also provides some handy functions to create arrays of desired shapes with fixed or random values. Check out the <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fnumpy.org%2Fdoc%2Fstable%2Freference%2Froutines.array-creation.html">official documentation</a> or use the <code>help</code> function to learn more.</p>
<pre><code># All zeros
np.zeros((<span class="hljs-number">3</span>, <span class="hljs-number">2</span>))

# array([[<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>]])

# All ones
np.ones([<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>])

# array([[[<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>],
#         [<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>]],
#
#        [[<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>],
#         [<span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">1.</span>]]])

# Identity matrix
np.eye(<span class="hljs-number">3</span>)

# array([[<span class="hljs-number">1.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>, <span class="hljs-number">0.</span>],
#        [<span class="hljs-number">0.</span>, <span class="hljs-number">0.</span>, <span class="hljs-number">1.</span>]])

# Random vector
np.random.rand(<span class="hljs-number">5</span>)

# array([<span class="hljs-number">0.92929562</span>, <span class="hljs-number">0.11301864</span>, <span class="hljs-number">0.64213555</span>, <span class="hljs-number">0.8600434</span> , <span class="hljs-number">0.53738656</span>])

# Random matrix
np.random.randn(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>) # rand vs. randn - what's the difference?

# array([[ 0.09906435, -1.64668094,  0.08073528],
#        [ 0.1437016 ,  0.80715712,  1.27285476]])

# Fixed value
np.full([2, 3], 42)

# array([[42, 42, 42],
#        [42, 42, 42]])

# Range with start, end and step
np.arange(10, 90, 3)

# array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58,
#        61, 64, 67, 70, 73, 76, 79, 82, 85, 88])

# Equally spaced numbers in a range
np.linspace(3, 27, 9)

# array([ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.])
</code></pre><h3 id="heading-exercises">Exercises</h3>
<p>Try the following exercises to become familiar with Numpy arrays and practice your skills:</p>
<ul>
<li>Assignment on Numpy array functions: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fjovian.ml%2Faakashns%2Fnumpy-array-operations">https://jovian.ml/aakashns/numpy-array-operations</a></li>
<li>(Optional) 100 numpy exercises: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fjovian.ml%2Faakashns%2F100-numpy-exercises">https://jovian.ml/aakashns/100-numpy-exercises</a></li>
</ul>
<h3 id="heading-summary-and-further-reading">Summary and Further Reading</h3>
<p>With this, we complete our discussion of numerical computing with Numpy. We've covered the following topics in this part of the tutorial:</p>
<ul>
<li>How to go from Python lists to Numpy arrays</li>
<li>How to operate on Numpy arrays</li>
<li>The benefits of using Numpy arrays over lists</li>
<li>Multi-dimensional Numpy arrays</li>
<li>How to work with CSV data files</li>
<li>Arithmetic operations and broadcasting</li>
<li>Array indexing and slicing</li>
<li>Other ways of creating Numpy arrays</li>
</ul>
<p>Check out the following resources for learning more about Numpy:</p>
<ul>
<li><a target="_blank" href="https://numpy.org/devdocs/user/quickstart.html">Official tutorial</a></li>
<li><a target="_blank" href="https://www.freecodecamp.org/news/the-ultimate-guide-to-the-numpy-scientific-computing-library-for-python/">Numpy course on freeCodeCamp</a></li>
<li><a target="_blank" href="http://scipy-lectures.org/advanced/advanced_numpy/index.html">Advanced Numpy (exploring the internals)</a></li>
</ul>
<h3 id="heading-review-questions-to-check-your-comprehension">Review Questions to Check Your Comprehension</h3>
<p>Try answering the following questions to test your understanding of the topics covered in this notebook:</p>
<ol>
<li>What is a vector?</li>
<li>How do you represent vectors using a Python list? Give an example.</li>
<li>What is a dot product of two vectors?</li>
<li>Write a function to compute the dot product of two vectors.</li>
<li>What is Numpy?</li>
<li>How do you install Numpy?</li>
<li>How do you import the <code>numpy</code> module?</li>
<li>What does it mean to import a module with an alias? Give an example.</li>
<li>What is the commonly used alias for <code>numpy</code>?</li>
<li>What is a Numpy array?</li>
<li>How do you create a Numpy array? Give an example.</li>
<li>What is the type of Numpy arrays?</li>
<li>How do you access the elements of a Numpy array?</li>
<li>How do you compute the dot product of two vectors using Numpy?</li>
<li>What happens if you try to compute the dot product of two vectors which have different sizes?</li>
<li>How do you compute the element-wise product of two Numpy arrays?</li>
<li>How do you compute the sum of all the elements in a Numpy array?</li>
<li>What are the benefits of using Numpy arrays over Python lists for operating on numerical data?</li>
<li>Why do Numpy array operations have better performance compared to Python functions and loops?</li>
<li>Illustrate the performance difference between Numpy array operations and Python loops using an example.</li>
<li>What are multi-dimensional Numpy arrays?</li>
<li>Illustrate how you'd create Numpy arrays with 2, 3, and 4 dimensions.</li>
<li>How do you inspect the number of dimensions and the length along each dimension in a Numpy array?</li>
<li>Can the elements of a Numpy array have different data types?</li>
<li>How do you check the data types of the elements of a Numpy array?</li>
<li>What is the data type of a Numpy array?</li>
<li>What is the difference between a matrix and a 2D Numpy array?</li>
<li>How do you perform matrix multiplication using Numpy?</li>
<li>What is the <code>@</code> operator used for in Numpy?</li>
<li>What is the CSV file format?</li>
<li>How do you read data from a CSV file using Numpy?</li>
<li>How do you concatenate two Numpy arrays?</li>
<li>What is the purpose of the <code>axis</code> argument of <code>np.concatenate</code>?</li>
<li>When are two Numpy arrays compatible for concatenation?</li>
<li>Give an example of two Numpy arrays that can be concatenated.</li>
<li>Give an example of two Numpy arrays that cannot be concatenated.</li>
<li>What is the purpose of the <code>np.reshape</code> function?</li>
<li>What does it mean to “reshape” a Numpy array?</li>
<li>How do you write a numpy array into a CSV file?</li>
<li>Give some examples of Numpy functions for performing mathematical operations.</li>
<li>Give some examples of Numpy functions for performing array manipulation.</li>
<li>Give some examples of Numpy functions for performing linear algebra.</li>
<li>Give some examples of Numpy functions for performing statistical operations.</li>
<li>How do you find the right Numpy function for a specific operation or use case?</li>
<li>Where can you see a list of all the Numpy array functions and operations?</li>
<li>What are the arithmetic operators supported by Numpy arrays? Illustrate with examples.</li>
<li>What is array broadcasting? How is it useful? Illustrate with an example.</li>
<li>Give some examples of arrays that are compatible for broadcasting.</li>
<li>Give some examples of arrays that are not compatible for broadcasting.</li>
<li>What are the comparison operators supported by Numpy arrays? Illustrate with examples.</li>
<li>How do you access a specific subarray or slice from a Numpy array?</li>
<li>Illustrate array indexing and slicing in multi-dimensional Numpy arrays with some examples.</li>
<li>How do you create a Numpy array with a given shape containing all zeros?</li>
<li>How do you create a Numpy array with a given shape containing all ones?</li>
<li>How do you create an identity matrix of a given shape?</li>
<li>How do you create a random vector of a given length?</li>
<li>How do you create a Numpy array with a given shape with a fixed value for each element?</li>
<li>How do you create a Numpy array with a given shape containing randomly initialized elements?</li>
<li>What is the difference between <code>np.random.rand</code> and <code>np.random.randn</code>? Illustrate with examples.</li>
<li>What is the difference between <code>np.arange</code> and <code>np.linspace</code>? Illustrate with examples.</li>
</ol>
<p>You are ready to move on to the next section of this tutorial.</p>
<h2 id="heading-how-to-analyze-tabular-data-using-python-and-pandas">How to Analyze Tabular Data using Python and Pandas</h2>
<p><img src="https://i.imgur.com/zfxLzEv.png" alt="Image" width="3175" height="1414" loading="lazy"></p>
<p>Follow along and run the code here: <a target="_blank" href="https://jovian.ai/aakashns/python-pandas-data-analysis">https://jovian.ai/aakashns/python-pandas-data-analysis</a>.</p>
<p>This section covers the following topics:</p>
<ul>
<li>How to read a CSV file into a Pandas data frame</li>
<li>How to retrieve data from Pandas data frames</li>
<li>How to query, sort, and analyze data</li>
<li>How to merge, group, and aggregate data</li>
<li>How to extract useful information from dates</li>
<li>Basic plotting using line and bar charts</li>
<li>How to write data frames to CSV files</li>
</ul>
<h3 id="heading-how-to-read-a-csv-file-using-pandas">How to Read a CSV File Using Pandas</h3>
<p><a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fpandas.pydata.org%2F">Pandas</a> is a popular Python library used for working with tabular data (similar to the data stored in a spreadsheet). It provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON, SQL, and more. </p>
<p>Let's download a file <code>italy-covid-daywise.csv</code> which contains day-wise Covid-19 data for Italy in the following format:</p>
<pre><code>date,new_cases,new_deaths,new_tests
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-21</span>,<span class="hljs-number">2256.0</span>,<span class="hljs-number">454.0</span>,<span class="hljs-number">28095.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-22</span>,<span class="hljs-number">2729.0</span>,<span class="hljs-number">534.0</span>,<span class="hljs-number">44248.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-23</span>,<span class="hljs-number">3370.0</span>,<span class="hljs-number">437.0</span>,<span class="hljs-number">37083.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-24</span>,<span class="hljs-number">2646.0</span>,<span class="hljs-number">464.0</span>,<span class="hljs-number">95273.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-25</span>,<span class="hljs-number">3021.0</span>,<span class="hljs-number">420.0</span>,<span class="hljs-number">38676.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-26</span>,<span class="hljs-number">2357.0</span>,<span class="hljs-number">415.0</span>,<span class="hljs-number">24113.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-27</span>,<span class="hljs-number">2324.0</span>,<span class="hljs-number">260.0</span>,<span class="hljs-number">26678.0</span>
<span class="hljs-number">2020</span><span class="hljs-number">-04</span><span class="hljs-number">-28</span>,<span class="hljs-number">1739.0</span>,<span class="hljs-number">333.0</span>,<span class="hljs-number">37554.0</span>
...
</code></pre><p>This format of storing data is known as <em>comma-separated values</em> or CSV. Here's a reminder in case you need a definition of what the CSV format is:</p>
<blockquote>
<p><strong>CSVs</strong>: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)</p>
</blockquote>
<p>We'll download this file using the <code>urlretrieve</code> function from the <code>urllib.request</code> module.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> urllib.request <span class="hljs-keyword">import</span> urlretrieve

urlretrieve(<span class="hljs-string">'https://hub.jovian.ml/wp-content/uploads/2020/09/italy-covid-daywise.csv'</span>, <span class="hljs-string">'italy-covid-daywise.csv'</span>)
</code></pre>
<p>To read the file, we can use the <code>read_csv</code> function from Pandas. First, let's install the Pandas library.</p>
<pre><code class="lang-py">!pip install pandas --upgrade --quiet
</code></pre>
<p>We can now import the <code>pandas</code> module. As a convention, it is imported with the alias <code>pd</code>.</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

covid_df = pd.read_csv(<span class="hljs-string">'italy-covid-daywise.csv'</span>)
</code></pre>
<p>Data from the file is read and stored in a <code>DataFrame</code> object – one of the core data structures in Pandas for storing and working with tabular data. We typically use the <code>_df</code> suffix in the variable names for dataframes.</p>
<pre><code class="lang-py">type(covid_df)
<span class="hljs-comment"># pandas.core.frame.DataFrame</span>

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-108.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Here's what we can tell by looking at the dataframe:</p>
<ul>
<li>The file provides three day-wise counts for COVID-19 in Italy</li>
<li>The metrics reported are new cases, deaths, and tests</li>
<li>Data is provided for 248 days: from Dec 31, 2019, to Sep 3, 2020</li>
</ul>
<p>Keep in mind that these are officially reported numbers. The actual number of cases and deaths may be higher, as not all cases are diagnosed.</p>
<p>We can view some basic information about the data frame using the <code>.info</code> method.</p>
<pre><code class="lang-py">covid_df.info()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-109.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It appears that each column contains values of a specific data type. You can view statistical information for numerical columns (mean, standard deviation, minimum/maximum values, and the number of non-empty values) using the <code>.describe</code> method.</p>
<pre><code class="lang-py">covid_df.describe()
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-110.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The <code>columns</code> property contains the list of columns within the data frame.</p>
<pre><code class="lang-py">covid_df.columns
<span class="hljs-comment"># Index(['date', 'new_cases', 'new_deaths', 'new_tests'], dtype='object')</span>
</code></pre>
<p>You can also retrieve the number of rows and columns in the data frame using the <code>.shape</code> property.</p>
<pre><code class="lang-py">covid_df.shape
<span class="hljs-comment"># (248, 4)</span>
</code></pre>
<p>Here's a summary of the functions and methods we've looked at so far:</p>
<ul>
<li><code>pd.read_csv</code> – Read data from a CSV file into a Pandas <code>DataFrame</code> object</li>
<li><code>.info()</code> – View basic information about rows, columns, and data types</li>
<li><code>.describe()</code> – View statistical information about numeric columns</li>
<li><code>.columns</code> – Get the list of column names</li>
<li><code>.shape</code> – Get the number of rows and columns as a tuple</li>
</ul>
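<p>If you want to try these out without downloading anything, here is a minimal sketch that reads the first three rows of the sample shown earlier from an in-memory string. The use of <code>StringIO</code> and the names <code>sample_csv</code> and <code>sample_df</code> are assumptions made for this illustration, not part of the original notebook.</p>
<pre><code class="lang-py">from io import StringIO
import pandas as pd

# The first three rows of the sample, wrapped in a file-like object
sample_csv = StringIO(
    "date,new_cases,new_deaths,new_tests\n"
    "2020-04-21,2256.0,454.0,28095.0\n"
    "2020-04-22,2729.0,534.0,44248.0\n"
    "2020-04-23,3370.0,437.0,37083.0\n"
)

sample_df = pd.read_csv(sample_csv)

sample_df.shape
# (3, 4)

list(sample_df.columns)
# ['date', 'new_cases', 'new_deaths', 'new_tests']

# describe() covers only the numeric columns; 'date' is read as text
described_columns = list(sample_df.describe().columns)
described_columns
# ['new_cases', 'new_deaths', 'new_tests']

mean_cases = sample_df['new_cases'].mean()
mean_cases
# 2785.0
</code></pre>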
<h3 id="heading-how-to-retrieve-data-from-a-data-frame-in-pandas">How to Retrieve Data from a Data Frame in Pandas</h3>
<p>The first thing you might want to do is retrieve data from this data frame, like the counts of a specific day or the list of values in a particular column. </p>
<p>To do this, you should understand the internal representation of data in a data frame. Conceptually, you can think of a dataframe as a dictionary of lists: keys are column names, and values are lists/arrays containing data for the respective columns.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Pandas format is similar to this</span>
covid_data_dict = {
    <span class="hljs-string">'date'</span>:       [<span class="hljs-string">'2020-08-30'</span>, <span class="hljs-string">'2020-08-31'</span>, <span class="hljs-string">'2020-09-01'</span>, <span class="hljs-string">'2020-09-02'</span>, <span class="hljs-string">'2020-09-03'</span>],
    <span class="hljs-string">'new_cases'</span>:  [<span class="hljs-number">1444</span>, <span class="hljs-number">1365</span>, <span class="hljs-number">996</span>, <span class="hljs-number">975</span>, <span class="hljs-number">1326</span>],
    <span class="hljs-string">'new_deaths'</span>: [<span class="hljs-number">1</span>, <span class="hljs-number">4</span>, <span class="hljs-number">6</span>, <span class="hljs-number">8</span>, <span class="hljs-number">6</span>],
    <span class="hljs-string">'new_tests'</span>: [<span class="hljs-number">53541</span>, <span class="hljs-number">42583</span>, <span class="hljs-number">54395</span>, <span class="hljs-literal">None</span>, <span class="hljs-literal">None</span>]
}
</code></pre>
<p>Representing data in the above format has a few benefits:</p>
<ul>
<li>All values in a column typically have the same type, so it's more efficient to store them in a single array.</li>
<li>Retrieving the values for a particular row simply requires extracting the elements at a given index from each column array.</li>
<li>The representation is more compact (column names are recorded only once) compared to other formats that use a dictionary for each row of data (see the example below).</li>
</ul>
<pre><code class="lang-py"><span class="hljs-comment"># Pandas format is not similar to this</span>
covid_data_list = [
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-08-30'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">1444</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'new_tests'</span>: <span class="hljs-number">53541</span>},
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-08-31'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">1365</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">4</span>, <span class="hljs-string">'new_tests'</span>: <span class="hljs-number">42583</span>},
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-09-01'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">996</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">6</span>, <span class="hljs-string">'new_tests'</span>: <span class="hljs-number">54395</span>},
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-09-02'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">975</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">8</span> },
    {<span class="hljs-string">'date'</span>: <span class="hljs-string">'2020-09-03'</span>, <span class="hljs-string">'new_cases'</span>: <span class="hljs-number">1326</span>, <span class="hljs-string">'new_deaths'</span>: <span class="hljs-number">6</span>},
]
</code></pre>
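<p>Although Pandas stores data column by column, the <code>pd.DataFrame</code> constructor accepts either layout as input. The sketch below builds a data frame from each of the two structures shown above; the names <code>df_from_dict</code> and <code>df_from_list</code> are only for illustration. In both cases the missing test counts show up as empty values.</p>
<pre><code class="lang-py">import pandas as pd

covid_data_dict = {
    'date':       ['2020-08-30', '2020-08-31', '2020-09-01', '2020-09-02', '2020-09-03'],
    'new_cases':  [1444, 1365, 996, 975, 1326],
    'new_deaths': [1, 4, 6, 8, 6],
    'new_tests':  [53541, 42583, 54395, None, None]
}

covid_data_list = [
    {'date': '2020-08-30', 'new_cases': 1444, 'new_deaths': 1, 'new_tests': 53541},
    {'date': '2020-08-31', 'new_cases': 1365, 'new_deaths': 4, 'new_tests': 42583},
    {'date': '2020-09-01', 'new_cases': 996, 'new_deaths': 6, 'new_tests': 54395},
    {'date': '2020-09-02', 'new_cases': 975, 'new_deaths': 8},
    {'date': '2020-09-03', 'new_cases': 1326, 'new_deaths': 6},
]

df_from_dict = pd.DataFrame(covid_data_dict)
df_from_list = pd.DataFrame(covid_data_list)

df_from_dict.shape
# (5, 4)

df_from_list.shape
# (5, 4)

# Count the missing values in the new_tests column of each data frame
missing_in_dict = int(df_from_dict['new_tests'].isna().sum())
missing_in_list = int(df_from_list['new_tests'].isna().sum())
missing_in_dict, missing_in_list
# (2, 2)
</code></pre>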
<p>With the dictionary of lists analogy in mind, you can now guess how to retrieve data from a data frame. For example, we can get a list of values from a specific column using the <code>[]</code> indexing notation.</p>
<pre><code class="lang-py">covid_data_dict[<span class="hljs-string">'new_cases'</span>]
<span class="hljs-comment"># [1444, 1365, 996, 975, 1326]</span>

covid_df[<span class="hljs-string">'new_cases'</span>]
<span class="hljs-comment"># 0         0.0</span>
<span class="hljs-comment"># 1         0.0</span>
<span class="hljs-comment"># 2         0.0</span>
<span class="hljs-comment"># 3         0.0</span>
<span class="hljs-comment"># 4         0.0</span>
<span class="hljs-comment">#         ...  </span>
<span class="hljs-comment"># 243    1444.0</span>
<span class="hljs-comment"># 244    1365.0</span>
<span class="hljs-comment"># 245     996.0</span>
<span class="hljs-comment"># 246     975.0</span>
<span class="hljs-comment"># 247    1326.0</span>
<span class="hljs-comment"># Name: new_cases, Length: 248, dtype: float64</span>
</code></pre>
<p>Each column is represented using a data structure called <code>Series</code>, which is essentially a numpy array with some extra methods and properties.</p>
<pre><code class="lang-py">type(covid_df[<span class="hljs-string">'new_cases'</span>])
<span class="hljs-comment"># pandas.core.series.Series</span>
</code></pre>
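<p>To make the connection with Numpy concrete, here is a small sketch using a hand-built series holding the last five <code>new_cases</code> values. The <code>to_numpy</code> method returns the underlying values as a Numpy array, while <code>index</code> and <code>name</code> are examples of the extra properties a series carries.</p>
<pre><code class="lang-py">import numpy as np
import pandas as pd

new_cases = pd.Series([1444.0, 1365.0, 996.0, 975.0, 1326.0], name='new_cases')

# The values of the series as a plain Numpy array
values = new_cases.to_numpy()
type(values)
# numpy.ndarray

# Extra information that a Numpy array doesn't have
index_labels = list(new_cases.index)
index_labels
# [0, 1, 2, 3, 4]

new_cases.name
# 'new_cases'

# Familiar array-style methods are available too
total = new_cases.sum()
total
# 6106.0
</code></pre>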
<p>Like arrays, you can retrieve a specific value from a series using the indexing notation <code>[]</code>.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'new_cases'</span>][<span class="hljs-number">246</span>]
<span class="hljs-comment"># 975.0</span>

covid_df[<span class="hljs-string">'new_tests'</span>][<span class="hljs-number">240</span>]
<span class="hljs-comment"># 57640.0</span>
</code></pre>
<p>Pandas also provides the <code>.at</code> accessor to retrieve the element at a specific row &amp; column directly.</p>
<pre><code class="lang-py">covid_df.at[<span class="hljs-number">246</span>, <span class="hljs-string">'new_cases'</span>]
<span class="hljs-comment"># 975.0</span>

covid_df.at[<span class="hljs-number">240</span>, <span class="hljs-string">'new_tests'</span>]
<span class="hljs-comment"># 57640.0</span>
</code></pre>
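<p>Note that the number passed to <code>.at</code> is a row label, not a position. The sketch below uses a hand-built data frame (<code>tail_df</code>, a made-up name) holding the last five rows with their original labels. It also uses <code>.iat</code>, the position-based counterpart of <code>.at</code>, which is not covered in the original tutorial.</p>
<pre><code class="lang-py">import pandas as pd

tail_df = pd.DataFrame(
    {'new_cases': [1444.0, 1365.0, 996.0, 975.0, 1326.0],
     'new_deaths': [1.0, 4.0, 6.0, 8.0, 6.0]},
    index=[243, 244, 245, 246, 247],
)

# Look up by row label and column name
by_label = tail_df.at[246, 'new_cases']
by_label
# 975.0

# The same element by position: fourth row, first column
by_position = tail_df.iat[3, 0]
by_position
# 975.0
</code></pre>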
<p>Instead of using the indexing notation <code>[]</code>, Pandas also allows accessing columns as properties of the dataframe using the <code>.</code> notation. However, this method only works for columns whose names do not contain spaces or special characters and do not clash with an existing dataframe attribute or method.</p>
<pre><code class="lang-py">covid_df.new_cases
<span class="hljs-comment"># 0         0.0</span>
<span class="hljs-comment"># 1         0.0</span>
<span class="hljs-comment"># 2         0.0</span>
<span class="hljs-comment"># 3         0.0</span>
<span class="hljs-comment"># 4         0.0</span>
<span class="hljs-comment">#         ...  </span>
<span class="hljs-comment"># 243    1444.0</span>
<span class="hljs-comment"># 244    1365.0</span>
<span class="hljs-comment"># 245     996.0</span>
<span class="hljs-comment"># 246     975.0</span>
<span class="hljs-comment"># 247    1326.0</span>
<span class="hljs-comment"># Name: new_cases, Length: 248, dtype: float64</span>
</code></pre>
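<p>Here is a small sketch of where the <code>.</code> notation falls short, using a made-up data frame with one column name containing a space and another that matches the existing <code>count</code> method. The <code>[]</code> notation works in both cases.</p>
<pre><code class="lang-py">import pandas as pd

df = pd.DataFrame({'new cases': [1444, 1365], 'count': [2, 3]})

# A name with a space can only be reached using []
cases = df['new cases'].tolist()
cases
# [1444, 1365]

# [] returns the column named 'count'
counts = df['count'].tolist()
counts
# [2, 3]

# The . notation returns the DataFrame.count method instead of the column
is_method = callable(df.count)
is_method
# True
</code></pre>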
<p>Further, you can also pass a list of columns within the indexing notation <code>[]</code> to access a subset of the data frame with just the given columns.</p>
<pre><code class="lang-py">cases_df = covid_df[[<span class="hljs-string">'date'</span>, <span class="hljs-string">'new_cases'</span>]]
cases_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-111.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The new data frame <code>cases_df</code> contains just the selected columns. Whether a data frame derived from another one shares memory with the original (a "view") or holds its own data (a copy) depends on the operation and on the Pandas version. Selecting a list of columns, as we did here, generally produces a new object, so you should not rely on changes made to <code>cases_df</code> showing up in <code>covid_df</code>.</p>
<p>Pandas avoids copying data when it can, which helps keep data manipulation fast when you derive new data frames from existing ones with thousands or millions of rows.</p>
<p>When you need a data frame that is guaranteed to be independent of the original, you can use the <code>copy</code> method.</p>
<pre><code class="lang-py">covid_df_copy = covid_df.copy()
</code></pre>
<p>The data within <code>covid_df_copy</code> is completely separate from <code>covid_df</code>, and changing values inside one of them will not affect the other.</p>
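<p>Here is a quick way to convince yourself of this, using a tiny made-up data frame rather than the real dataset:</p>

```python
import pandas as pd

# Hypothetical miniature stand-in for covid_df
df = pd.DataFrame({"new_cases": [10.0, 20.0, 30.0]})

df_copy = df.copy()                # deep copy by default
df_copy.at[0, "new_cases"] = -1.0  # modify only the copy

print(df.at[0, "new_cases"])       # the original value is untouched
print(df_copy.at[0, "new_cases"])
```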
<p>To access a specific row of data by its index label, Pandas provides the <code>.loc</code> indexer.</p>
<pre><code class="lang-py">covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-112.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">covid_df.loc[<span class="hljs-number">243</span>]
<span class="hljs-comment"># date          2020-08-30</span>
<span class="hljs-comment"># new_cases         1444.0</span>
<span class="hljs-comment"># new_deaths           1.0</span>
<span class="hljs-comment"># new_tests        53541.0</span>
<span class="hljs-comment"># Name: 243, dtype: object</span>
</code></pre>
<p>Each retrieved row is also a <code>Series</code> object.</p>
<pre><code class="lang-py">type(covid_df.loc[<span class="hljs-number">243</span>])
<span class="hljs-comment"># pandas.core.series.Series</span>
</code></pre>
<p>We can use the <code>.head</code> and <code>.tail</code> methods to view the first or last few rows of data.</p>
<pre><code class="lang-py">covid_df.head(<span class="hljs-number">5</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-113.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">covid_df.tail(<span class="hljs-number">4</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-114.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice above that while the first few values in the <code>new_cases</code> and <code>new_deaths</code> columns are <code>0</code>, the corresponding values within the <code>new_tests</code> column are <code>NaN</code>. That is because the CSV file does not contain any data for the <code>new_tests</code> column for specific dates (you can verify this by looking into the file). These values may be missing or unknown.</p>
<pre><code class="lang-py">covid_df.at[<span class="hljs-number">0</span>, <span class="hljs-string">'new_tests'</span>]
<span class="hljs-comment"># nan</span>

type(covid_df.at[<span class="hljs-number">0</span>, <span class="hljs-string">'new_tests'</span>])
<span class="hljs-comment"># numpy.float64</span>
</code></pre>
<p>The distinction between <code>0</code> and <code>NaN</code> is subtle but important. In this dataset, <code>NaN</code> indicates that daily test numbers were not reported on specific dates, not that zero tests were conducted. Italy started reporting daily tests on Apr 19, 2020, and had already conducted 935,310 tests before that date.</p>
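<p>The practical consequence shows up as soon as you aggregate. The following sketch uses made-up numbers to show how missing values are counted and how treating them as <code>0</code> would change an average:</p>

```python
import pandas as pd

# Hypothetical values: the first two days have no reported test count
new_tests = pd.Series([float("nan"), float("nan"), 100.0, 250.0])

print(new_tests.isna().sum())       # number of missing entries
print(new_tests.sum())              # NaN entries are skipped by default
print(new_tests.mean())             # mean over the two reported days only
print(new_tests.fillna(0).mean())   # treating NaN as 0 halves the mean here
```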
<p>We can find the first index that doesn't contain a <code>NaN</code> value using a column's <code>first_valid_index</code> method.</p>
<pre><code class="lang-py">covid_df.new_tests.first_valid_index()
<span class="hljs-comment"># 111</span>
</code></pre>
<p>Let's look at a few rows before and after this index to verify that the values change from <code>NaN</code> to actual numbers. We can do this by passing a slice of index labels to <code>.loc</code> (both endpoints are included).</p>
<pre><code class="lang-py">covid_df.loc[<span class="hljs-number">108</span>:<span class="hljs-number">113</span>]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-115.png" alt="Image" width="600" height="400" loading="lazy"></p>
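<p>Label-based slices with <code>.loc</code> behave differently from ordinary Python slices. A small sketch with made-up values, comparing <code>.loc</code> with the position-based <code>.iloc</code>:</p>

```python
import pandas as pd

# Hypothetical values with the default 0..n-1 index
df = pd.DataFrame({"new_tests": [None, None, 7841.0, 28095.0, 44248.0]})

by_label = df.loc[1:3]      # labels 1, 2 and 3: the end label is included
by_position = df.iloc[1:3]  # positions 1 and 2: the end position is excluded

print(len(by_label), len(by_position))
```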
<p>We can use the <code>.sample</code> method to retrieve a random sample of rows from the data frame.</p>
<pre><code class="lang-py">covid_df.sample(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-116.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice that even though we have taken a random sample, each row's original index is preserved. This is a useful property of data frames.</p>
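<p>Because the sample is random, you get different rows on every run. If you need a reproducible sample, <code>sample</code> accepts a <code>random_state</code> argument. A minimal sketch on made-up data (the seed value 42 is arbitrary):</p>

```python
import pandas as pd

df = pd.DataFrame({"new_cases": range(100, 120)})  # hypothetical data, index 0..19

first = df.sample(5, random_state=42)   # random_state fixes the seed
second = df.sample(5, random_state=42)  # same seed, same rows

print(first.index.tolist())             # original index labels are preserved
print(first.equals(second))
```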
<p>Here's a summary of the functions and methods we looked at in this section:</p>
<ul>
<li><code>covid_df['new_cases']</code> – Retrieving columns as a <code>Series</code> using the column name</li>
<li><code>new_cases[243]</code> – Retrieving values from a <code>Series</code> using an index</li>
<li><code>covid_df.at[243, 'new_cases']</code> – Retrieving a single value from a data frame</li>
<li><code>covid_df.copy()</code> – Creating a deep copy of a data frame</li>
<li><code>covid_df.loc[243]</code> – Retrieving a row or range of rows of data from the data frame</li>
<li><code>head</code>, <code>tail</code>, and <code>sample</code> – Retrieving multiple rows of data from the data frame</li>
<li><code>covid_df.new_tests.first_valid_index</code> – Finding the index of the first non-missing value in a series</li>
</ul>
<h3 id="heading-how-to-analyze-data-from-data-frames-in-pandas">How to Analyze Data from Data Frames in Pandas</h3>
<p>Let's try to answer some questions about our data.</p>
<p><strong>Q: What is the total number of reported cases and deaths related to Covid-19 in Italy?</strong></p>
<p>Similar to NumPy arrays, a Pandas series supports the <code>sum</code> method, which we can use to answer these questions.</p>
<pre><code class="lang-py">total_cases = covid_df.new_cases.sum()
total_deaths = covid_df.new_deaths.sum()

print(<span class="hljs-string">'The number of reported cases is {} and the number of reported deaths is {}.'</span>.format(int(total_cases), int(total_deaths)))
<span class="hljs-comment"># The number of reported cases is 271515 and the number of reported deaths is 35497.</span>
</code></pre>
<p><strong>Q: What is the overall death rate (ratio of reported deaths to reported cases)?</strong></p>
<pre><code class="lang-py">death_rate = covid_df.new_deaths.sum() / covid_df.new_cases.sum()

print(<span class="hljs-string">"The overall reported death rate in Italy is {:.2f} %."</span>.format(death_rate*<span class="hljs-number">100</span>))
<span class="hljs-comment"># The overall reported death rate in Italy is 13.07 %.</span>
</code></pre>
<p><strong>Q: What is the overall number of tests conducted? A total of 935,310 tests were conducted before daily test numbers were reported.</strong></p>
<pre><code class="lang-py">initial_tests = <span class="hljs-number">935310</span>
total_tests = initial_tests + covid_df.new_tests.sum()

total_tests
<span class="hljs-comment"># 5214766.0</span>
</code></pre>
<p><strong>Q: What fraction of tests returned a positive result?</strong></p>
<pre><code class="lang-py">positive_rate = total_cases / total_tests

print(<span class="hljs-string">'{:.2f}% of tests in Italy led to a positive diagnosis.'</span>.format(positive_rate*<span class="hljs-number">100</span>))
<span class="hljs-comment"># 5.21% of tests in Italy led to a positive diagnosis.</span>
</code></pre>
<p>Try asking and answering some more questions about the data.</p>
<h3 id="heading-how-to-query-and-sort-rows-in-pandas">How to Query and Sort Rows in Pandas</h3>
<p>Let's say we only want to look at the days which had more than 1,000 reported cases. We can use a boolean expression to check which rows satisfy this criterion.</p>
<pre><code class="lang-py">high_new_cases = covid_df.new_cases &gt; <span class="hljs-number">1000</span>

high_new_cases
<span class="hljs-comment"># 0      False</span>
<span class="hljs-comment"># 1      False</span>
<span class="hljs-comment"># 2      False</span>
<span class="hljs-comment"># 3      False</span>
<span class="hljs-comment"># 4      False</span>
<span class="hljs-comment">#        ...  </span>
<span class="hljs-comment"># 243     True</span>
<span class="hljs-comment"># 244     True</span>
<span class="hljs-comment"># 245    False</span>
<span class="hljs-comment"># 246    False</span>
<span class="hljs-comment"># 247     True</span>
<span class="hljs-comment"># Name: new_cases, Length: 248, dtype: bool</span>
</code></pre>
<p>The boolean expression returns a series containing <code>True</code> and <code>False</code> boolean values. You can use this series to select a subset of rows from the original dataframe, corresponding to the <code>True</code> values in the series.</p>
<pre><code class="lang-py">covid_df[high_new_cases]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-117.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can also write this more succinctly on a single line by passing the boolean expression directly as the index to the data frame.</p>
<pre><code class="lang-py">high_cases_df = covid_df[covid_df.new_cases &gt; <span class="hljs-number">1000</span>]

high_cases_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-118.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The data frame contains 72 rows, but only the first &amp; last five rows are displayed by default with Jupyter for brevity. We can change some display options to view all the rows.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> IPython.display <span class="hljs-keyword">import</span> display
<span class="hljs-keyword">with</span> pd.option_context(<span class="hljs-string">'display.max_rows'</span>, <span class="hljs-number">100</span>):
    display(covid_df[covid_df.new_cases &gt; <span class="hljs-number">1000</span>])
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-119.png" alt="Image" width="600" height="400" loading="lazy">
<em>This is just part of the data frame. Check out the rest <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-pandas-data-analysis">here</a>.</em></p>
<p>We can also formulate more complex queries that involve multiple columns. As an example, let's try to determine the days when the ratio of cases reported to tests conducted is higher than the overall <code>positive_rate</code>.</p>
<pre><code class="lang-py">positive_rate
<span class="hljs-comment"># 0.05206657403227681</span>

high_ratio_df = covid_df[covid_df.new_cases / covid_df.new_tests &gt; positive_rate]

high_ratio_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-120.png" alt="Image" width="600" height="400" loading="lazy"></p>
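<p>Queries can also combine several conditions. The sketch below uses made-up numbers to show two details that are easy to miss: each condition needs its own parentheses when combined with <code>&amp;</code> or <code>|</code>, and comparisons involving <code>NaN</code> evaluate to <code>False</code>.</p>

```python
import pandas as pd

nan = float("nan")
# Hypothetical values; two rows have no test data
df = pd.DataFrame({
    "new_cases": [500.0, 1500.0, 2000.0, 1200.0],
    "new_tests": [nan, 20000.0, 50000.0, nan],
})

# Each condition needs its own parentheses because & binds tighter than >
busy_days = df[(df.new_cases > 1000) & (df.new_tests > 30000)]
print(busy_days)

# Comparisons involving NaN evaluate to False, so rows without test data drop out
ratio_mask = df.new_cases / df.new_tests > 0.05
print(ratio_mask.tolist())
```

<p>This is why days without test data never appear in a result like <code>high_ratio_df</code>.</p>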
<p>The result of performing an operation on two columns is a new series.</p>
<pre><code class="lang-py">covid_df.new_cases / covid_df.new_tests
<span class="hljs-comment"># 0           NaN</span>
<span class="hljs-comment"># 1           NaN</span>
<span class="hljs-comment"># 2           NaN</span>
<span class="hljs-comment"># 3           NaN</span>
<span class="hljs-comment"># 4           NaN</span>
<span class="hljs-comment">#          ...   </span>
<span class="hljs-comment"># 243    0.026970</span>
<span class="hljs-comment"># 244    0.032055</span>
<span class="hljs-comment"># 245    0.018311</span>
<span class="hljs-comment"># 246         NaN</span>
<span class="hljs-comment"># 247         NaN</span>
<span class="hljs-comment"># Length: 248, dtype: float64</span>
</code></pre>
<p>We can use this series to add a new column to the data frame.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'positive_rate'</span>] = covid_df.new_cases / covid_df.new_tests

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-121.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>However, keep in mind that sometimes it takes a few days to get the results for a test, so we can't compare the number of new cases with the number of tests conducted on the same day. Any inference based on this <code>positive_rate</code> column is likely to be incorrect. </p>
<p>It's essential to watch out for such subtle relationships that are often not conveyed within the CSV file and require some external context. It's always a good idea to read through the documentation provided with the dataset or ask for more information.</p>
<p>For now, let's remove the <code>positive_rate</code> column using the <code>drop</code> method.</p>
<pre><code class="lang-py">covid_df.drop(columns=[<span class="hljs-string">'positive_rate'</span>], inplace=<span class="hljs-literal">True</span>)
</code></pre>
<p>Can you figure out the purpose of the <code>inplace</code> argument?</p>
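<p>If you want to check your answer, here is a minimal sketch on a hypothetical two-column data frame that compares <code>drop</code> with and without <code>inplace=True</code>:</p>

```python
import pandas as pd

# Hypothetical two-column data frame
df = pd.DataFrame({"new_cases": [1, 2], "positive_rate": [0.1, 0.2]})

trimmed = df.drop(columns=["positive_rate"])  # returns a new data frame
print(list(df.columns))        # the original still has both columns
print(list(trimmed.columns))

result = df.drop(columns=["positive_rate"], inplace=True)  # modifies df itself
print(result)                  # None: nothing is returned
print(list(df.columns))
```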
<h4 id="heading-how-to-sort-rows-using-column-values-in-pandas"><strong>How to Sort Rows Using Column Values in Pandas</strong></h4>
<p>You can also sort the rows by a specific column using <code>.sort_values</code>. Let's sort to identify the days with the highest number of cases, then chain it with the <code>head</code> method to list just the first ten results.</p>
<pre><code class="lang-py">covid_df.sort_values(<span class="hljs-string">'new_cases'</span>, ascending=<span class="hljs-literal">False</span>).head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-122.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It looks like the last two weeks of March had the highest number of daily cases. Let's compare this to the days where the highest number of deaths were recorded.</p>
<pre><code class="lang-py">covid_df.sort_values(<span class="hljs-string">'new_deaths'</span>, ascending=<span class="hljs-literal">False</span>).head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-123.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It appears that daily deaths hit a peak just about a week after the peak in daily new cases.</p>
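<p>For "top N" questions like these, Pandas also offers <code>nlargest</code> and <code>nsmallest</code>, which avoid spelling out the sort followed by <code>head</code>. A small sketch on made-up values (chosen without ties, so that both approaches return rows in the same order):</p>

```python
import pandas as pd

# Hypothetical values with no ties, so both approaches give the same order
df = pd.DataFrame({"new_cases": [120.0, 6557.0, 975.0, 5986.0, -148.0]})

top_sorted = df.sort_values("new_cases", ascending=False).head(2)
top_direct = df.nlargest(2, "new_cases")   # two rows with the largest values
lowest = df.nsmallest(1, "new_cases")      # row with the smallest value

print(top_direct)
print(lowest)
```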
<p>Let's also look at the days with the smallest number of cases. We might expect to see the first few days of the year on this list.</p>
<pre><code class="lang-py">covid_df.sort_values(<span class="hljs-string">'new_cases'</span>).head(<span class="hljs-number">10</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-124.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>It seems like the count of new cases on Jun 20, 2020, was <code>-148</code>, a negative number! Not something we might have expected, but that's the nature of real-world data. It could be a data entry error, or the government may have issued a correction to account for miscounting in the past. </p>
<p>Can you dig through news articles online and figure out why the number was negative?</p>
<p>Let's look at some days before and after Jun 20, 2020.</p>
<pre><code class="lang-py">covid_df.loc[<span class="hljs-number">169</span>:<span class="hljs-number">175</span>]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-125.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>For now, let's assume this was indeed a data entry error. We can use one of the following approaches for dealing with the missing or faulty value:</p>
<ol>
<li>Replace it with <code>0</code></li>
<li>Replace it with the average of the entire column</li>
<li>Replace it with the average of the values on the previous and next date</li>
<li>Discard the row entirely</li>
</ol>
<p>Which approach you pick requires some context about the data and the problem. In this case, since we are dealing with data ordered by date, we can go ahead with the third approach.</p>
<p>You can use the <code>.at</code> accessor to modify a specific value within the data frame.</p>
<pre><code class="lang-py">covid_df.at[<span class="hljs-number">172</span>, <span class="hljs-string">'new_cases'</span>] = (covid_df.at[<span class="hljs-number">171</span>, <span class="hljs-string">'new_cases'</span>] + covid_df.at[<span class="hljs-number">173</span>, <span class="hljs-string">'new_cases'</span>])/<span class="hljs-number">2</span>
</code></pre>
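<p>The same repair can be expressed without hard-coding row numbers. This is an alternative sketch, not the approach used above: it marks the faulty value as missing and lets <code>interpolate</code> fill it in. With the default linear method, a single gap is filled with the average of its two neighbors. The values are made up.</p>

```python
import pandas as pd

# Hypothetical daily counts with one faulty entry at index 2
s = pd.Series([264.0, 300.0, -148.0, 200.0, 240.0])

s[s < 0] = float("nan")     # mark the faulty value as missing
repaired = s.interpolate()  # default method is linear

print(repaired.tolist())
# [264.0, 300.0, 250.0, 200.0, 240.0]
```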
<p>Here's a summary of the functions and methods we looked at in this section:</p>
<ul>
<li><code>covid_df.new_cases.sum()</code> – Computing the sum of values in a column or series</li>
<li><code>covid_df[covid_df.new_cases &gt; 1000]</code> – Querying a subset of rows satisfying the chosen criteria using boolean expressions</li>
<li><code>df['pos_rate'] = df.new_cases/df.new_tests</code> – Adding new columns by combining data from existing columns</li>
<li><code>covid_df.drop(columns=['positive_rate'])</code> – Removing one or more columns from the data frame</li>
<li><code>sort_values</code> – Sorting the rows of a data frame using column values</li>
<li><code>covid_df.at[172, 'new_cases'] = ...</code> – Replacing a value within the data frame</li>
</ul>
<h3 id="heading-how-to-work-with-dates-in-pandas">How to Work with Dates in Pandas</h3>
<p>While we've looked at overall numbers for the cases, tests, positive rate, and more, it would also be useful to study these numbers on a month-by-month basis. </p>
<p>The <code>date</code> column might come in handy here, as Pandas provides many utilities for working with dates.</p>
<pre><code class="lang-py">covid_df.date
<span class="hljs-comment"># 0      2019-12-31</span>
<span class="hljs-comment"># 1      2020-01-01</span>
<span class="hljs-comment"># 2      2020-01-02</span>
<span class="hljs-comment"># 3      2020-01-03</span>
<span class="hljs-comment"># 4      2020-01-04</span>
<span class="hljs-comment">#           ...    </span>
<span class="hljs-comment"># 243    2020-08-30</span>
<span class="hljs-comment"># 244    2020-08-31</span>
<span class="hljs-comment"># 245    2020-09-01</span>
<span class="hljs-comment"># 246    2020-09-02</span>
<span class="hljs-comment"># 247    2020-09-03</span>
<span class="hljs-comment"># Name: date, Length: 248, dtype: object</span>
</code></pre>
<p>The data type of the <code>date</code> column is currently <code>object</code>, so Pandas does not know that this column holds dates. We can convert it into a <code>datetime</code> column using the <code>pd.to_datetime</code> function.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'date'</span>] = pd.to_datetime(covid_df.date)

covid_df[<span class="hljs-string">'date'</span>]
<span class="hljs-comment"># 0     2019-12-31</span>
<span class="hljs-comment"># 1     2020-01-01</span>
<span class="hljs-comment"># 2     2020-01-02</span>
<span class="hljs-comment"># 3     2020-01-03</span>
<span class="hljs-comment"># 4     2020-01-04</span>
<span class="hljs-comment">#          ...    </span>
<span class="hljs-comment"># 243   2020-08-30</span>
<span class="hljs-comment"># 244   2020-08-31</span>
<span class="hljs-comment"># 245   2020-09-01</span>
<span class="hljs-comment"># 246   2020-09-02</span>
<span class="hljs-comment"># 247   2020-09-03</span>
<span class="hljs-comment"># Name: date, Length: 248, dtype: datetime64[ns]</span>
</code></pre>
<p>You can see that it now has the datatype <code>datetime64</code>. We can now extract different parts of the date into separate columns, using the <code>DatetimeIndex</code> class (<a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fpandas.pydata.org%2Fpandas-docs%2Fversion%2F0.23.4%2Fgenerated%2Fpandas.DatetimeIndex.html">view docs</a>).</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'year'</span>] = pd.DatetimeIndex(covid_df.date).year
covid_df[<span class="hljs-string">'month'</span>] = pd.DatetimeIndex(covid_df.date).month
covid_df[<span class="hljs-string">'day'</span>] = pd.DatetimeIndex(covid_df.date).day
covid_df[<span class="hljs-string">'weekday'</span>] = pd.DatetimeIndex(covid_df.date).weekday

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-126.png" alt="Image" width="600" height="400" loading="lazy"></p>
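<p>An equivalent and commonly used alternative is the <code>.dt</code> accessor, which is available on any series with a datetime type. The sketch below uses three dates taken from the dataset's date range and is only meant to illustrate the accessor:</p>

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2019-12-31", "2020-08-30", "2020-09-03"]))

parts = pd.DataFrame({
    "date": dates,
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "weekday": dates.dt.weekday,      # Monday is 0, Sunday is 6
    "day_name": dates.dt.day_name(),  # weekday as a readable name
})
print(parts)
```

<p>Note that weekdays are numbered from Monday (<code>0</code>) to Sunday (<code>6</code>).</p>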
<p>Let's check the overall metrics for May. We can query the rows for May, choose a subset of columns, and use the <code>sum</code> method to aggregate each selected column's values.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Query the rows for May</span>
covid_df_may = covid_df[covid_df.month == <span class="hljs-number">5</span>]

<span class="hljs-comment"># Extract the subset of columns to be aggregated</span>
covid_df_may_metrics = covid_df_may[[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]]

<span class="hljs-comment"># Get the column-wise sum</span>
covid_may_totals = covid_df_may_metrics.sum()

covid_may_totals
<span class="hljs-comment"># new_cases       29073.0</span>
<span class="hljs-comment"># new_deaths       5658.0</span>
<span class="hljs-comment"># new_tests     1078720.0</span>
<span class="hljs-comment"># dtype: float64</span>

type(covid_may_totals)
<span class="hljs-comment"># pandas.core.series.Series</span>
</code></pre>
<p>We can also combine the above operations into a single statement.</p>
<pre><code class="lang-py">covid_df[covid_df.month == <span class="hljs-number">5</span>][[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]].sum()
<span class="hljs-comment"># new_cases       29073.0</span>
<span class="hljs-comment"># new_deaths       5658.0</span>
<span class="hljs-comment"># new_tests     1078720.0</span>
<span class="hljs-comment"># dtype: float64</span>
</code></pre>
<p>As another example, let's check if the number of cases reported on Sundays is higher than the average number of cases reported every day. This time, we might want to aggregate columns using the <code>.mean</code> method.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Overall average</span>
covid_df.new_cases.mean()
<span class="hljs-comment"># 1096.6149193548388</span>

<span class="hljs-comment"># Average for Sundays</span>
covid_df[covid_df.weekday == <span class="hljs-number">6</span>].new_cases.mean()
<span class="hljs-comment"># 1247.2571428571428</span>
</code></pre>
<p>It seems that, on average, more cases were reported on Sundays than on a typical day.</p>
<p>Try asking and answering some more date-related questions about the data.</p>
<h3 id="heading-how-to-group-and-aggregate-data-in-pandas">How to Group and Aggregate Data in Pandas</h3>
<p>As a next step, we might want to summarize the day-wise data and create a new data frame with month-wise data. We can use the <code>groupby</code> method to create a group for each month, select the columns we wish to aggregate, and aggregate them using the <code>sum</code> method.</p>
<pre><code class="lang-py">covid_month_df = covid_df.groupby(<span class="hljs-string">'month'</span>)[[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]].sum()

covid_month_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-127.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The result is a new data frame that uses unique values from the column passed to <code>groupby</code> as the index. Grouping and aggregation is a powerful method for progressively summarizing data into smaller data frames.</p>
<p>Instead of aggregating by sum, you can also aggregate by other measures like mean. Let's compute the average number of daily new cases, deaths, and tests for each month.</p>
<pre><code class="lang-py">covid_month_mean_df = covid_df.groupby(<span class="hljs-string">'month'</span>)[[<span class="hljs-string">'new_cases'</span>, <span class="hljs-string">'new_deaths'</span>, <span class="hljs-string">'new_tests'</span>]].mean()

covid_month_mean_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-128.png" alt="Image" width="600" height="400" loading="lazy"></p>
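<p>If you want several different aggregates in one result, <code>agg</code> lets you name each output column and choose its source column and function. A minimal sketch on made-up data (the output column names <code>total_cases</code>, <code>mean_cases</code>, and <code>peak_deaths</code> are arbitrary choices for this example):</p>

```python
import pandas as pd

# Hypothetical daily data covering two months
df = pd.DataFrame({
    "month": [3, 3, 4, 4, 4],
    "new_cases": [100.0, 300.0, 50.0, 70.0, 60.0],
    "new_deaths": [10.0, 30.0, 5.0, 7.0, 6.0],
})

# Each keyword is an output column: (source column, aggregation function)
summary = df.groupby("month").agg(
    total_cases=("new_cases", "sum"),
    mean_cases=("new_cases", "mean"),
    peak_deaths=("new_deaths", "max"),
)
print(summary)
```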
<p>Apart from grouping, another form of aggregation is the running or cumulative sum of cases, tests, or deaths up to each row's date. We can use the <code>cumsum</code> method to compute the cumulative sum of a column as a new series. </p>
<p>Let's add three new columns: <code>total_cases</code>, <code>total_deaths</code>, and <code>total_tests</code>.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'total_cases'</span>] = covid_df.new_cases.cumsum()
covid_df[<span class="hljs-string">'total_deaths'</span>] = covid_df.new_deaths.cumsum()
covid_df[<span class="hljs-string">'total_tests'</span>] = covid_df.new_tests.cumsum() + initial_tests
</code></pre>
<p>We've also included the initial test count in <code>total_tests</code> to account for tests conducted before daily reporting started.</p>
<pre><code class="lang-py">covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-129.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice how the <code>NaN</code> values in the <code>total_tests</code> column remain unaffected.</p>
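<p>To make this behavior concrete, here is a small sketch with made-up numbers (including a hypothetical starting count of 1000). <code>cumsum</code> skips missing values when accumulating, but leaves <code>NaN</code> in the positions where the input was missing:</p>

```python
import pandas as pd

nan = float("nan")
new_tests = pd.Series([nan, nan, 100.0, nan, 250.0])  # hypothetical values
initial_tests = 1000                                   # hypothetical starting count

total_tests = new_tests.cumsum() + initial_tests
print(total_tests.tolist())
# [nan, nan, 1100.0, nan, 1350.0]
```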
<h3 id="heading-how-to-merge-data-from-multiple-sources-in-pandas">How to Merge Data from Multiple Sources in Pandas</h3>
<p>To determine other metrics like tests per million, cases per million, and so on, we require some more information about the country, namely its population. </p>
<p>Let's download another file <code>locations.csv</code> that contains health-related information for many countries, including Italy.</p>
<pre><code class="lang-py">urlretrieve(<span class="hljs-string">'https://gist.githubusercontent.com/aakashns/8684589ef4f266116cdce023377fc9c8/raw/99ce3826b2a9d1e6d0bde7e9e559fc8b6e9ac88b/locations.csv'</span>, <span class="hljs-string">'locations.csv'</span>)

locations_df = pd.read_csv(<span class="hljs-string">'locations.csv'</span>)
locations_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-130.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">locations_df[locations_df.location == <span class="hljs-string">"Italy"</span>]
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-131.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can merge this data into our existing data frame by adding more columns. However, to merge two data frames, we need at least one common column. Let's insert a <code>location</code> column in the <code>covid_df</code> dataframe with all values set to <code>"Italy"</code>.</p>
<pre><code class="lang-py">covid_df[<span class="hljs-string">'location'</span>] = <span class="hljs-string">"Italy"</span>

covid_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-132.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can now add the columns from <code>locations_df</code> into <code>covid_df</code> using the <code>.merge</code> method.</p>
<pre><code class="lang-py">merged_df = covid_df.merge(locations_df, on=<span class="hljs-string">"location"</span>)

merged_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-133.png" alt="Image" width="600" height="400" loading="lazy">
<em>Check out the full data frame <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-pandas-data-analysis">here</a>.</em></p>
<p>The location data for Italy is added to each row in <code>merged_df</code>. If the <code>covid_df</code> data frame contained data for multiple locations, each row would receive the location data for its own country.</p>
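<p>One thing to keep in mind is that <code>merge</code> performs an inner join by default, so rows without a match in the other data frame are dropped. The sketch below uses made-up data frames (including a deliberately unmatched location) to compare this with <code>how="left"</code>:</p>

```python
import pandas as pd

daily = pd.DataFrame({
    "location": ["Italy", "Italy", "Atlantis"],  # "Atlantis" has no match below
    "new_cases": [10, 20, 30],
})
locations = pd.DataFrame({
    "location": ["Italy", "France"],
    "population": [1000, 2000],                   # made-up numbers
})

inner = daily.merge(locations, on="location")             # default how="inner"
left = daily.merge(locations, on="location", how="left")  # keep every row of daily

print(len(inner), len(left))
print(left)
```

<p>With <code>how="left"</code>, the unmatched row is kept and its <code>population</code> is <code>NaN</code>.</p>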
<p>We can now calculate metrics like cases per million, deaths per million, and tests per million.</p>
<pre><code class="lang-py">merged_df[<span class="hljs-string">'cases_per_million'</span>] = merged_df.total_cases * <span class="hljs-number">1e6</span> / merged_df.population
merged_df[<span class="hljs-string">'deaths_per_million'</span>] = merged_df.total_deaths * <span class="hljs-number">1e6</span> / merged_df.population
merged_df[<span class="hljs-string">'tests_per_million'</span>] = merged_df.total_tests * <span class="hljs-number">1e6</span> / merged_df.population

merged_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-134.png" alt="Image" width="600" height="400" loading="lazy">
<em>Check out the full data frame <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-pandas-data-analysis">here</a>.</em></p>
<h3 id="heading-how-to-write-data-back-to-files-in-pandas">How to Write Data Back to Files in Pandas</h3>
<p>After completing your analysis and adding new columns, you should write the results back to a file. Otherwise, the data will be lost when the Jupyter notebook shuts down. </p>
<p>Before writing to file, let's first create a data frame containing just the columns we wish to record.</p>
<pre><code class="lang-py">result_df = merged_df[[<span class="hljs-string">'date'</span>,
                       <span class="hljs-string">'new_cases'</span>, 
                       <span class="hljs-string">'total_cases'</span>, 
                       <span class="hljs-string">'new_deaths'</span>, 
                       <span class="hljs-string">'total_deaths'</span>, 
                       <span class="hljs-string">'new_tests'</span>, 
                       <span class="hljs-string">'total_tests'</span>, 
                       <span class="hljs-string">'cases_per_million'</span>, 
                       <span class="hljs-string">'deaths_per_million'</span>, 
                       <span class="hljs-string">'tests_per_million'</span>]]

result_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-135.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To write the data from the data frame into a file, we can use the <code>to_csv</code> method.</p>
<pre><code class="lang-py">result_df.to_csv(<span class="hljs-string">'results.csv'</span>, index=<span class="hljs-literal">False</span>)
</code></pre>
<p>By default, <code>to_csv</code> also writes an additional column containing the index of the data frame. We pass <code>index=False</code> to turn off this behavior. You can now verify that <code>results.csv</code> has been created and contains data from the data frame in CSV format:</p>
<pre><code class="lang-py">date,new_cases,total_cases,new_deaths,total_deaths,new_tests,total_tests,cases_per_million,deaths_per_million,tests_per_million
<span class="hljs-number">2020</span><span class="hljs-number">-02</span><span class="hljs-number">-27</span>,<span class="hljs-number">78.0</span>,<span class="hljs-number">400.0</span>,<span class="hljs-number">1.0</span>,<span class="hljs-number">12.0</span>,,,<span class="hljs-number">6.61574439992122</span>,<span class="hljs-number">0.1984723319976366</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-02</span><span class="hljs-number">-28</span>,<span class="hljs-number">250.0</span>,<span class="hljs-number">650.0</span>,<span class="hljs-number">5.0</span>,<span class="hljs-number">17.0</span>,,,<span class="hljs-number">10.750584649871982</span>,<span class="hljs-number">0.28116913699665186</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-02</span><span class="hljs-number">-29</span>,<span class="hljs-number">238.0</span>,<span class="hljs-number">888.0</span>,<span class="hljs-number">4.0</span>,<span class="hljs-number">21.0</span>,,,<span class="hljs-number">14.686952567825108</span>,<span class="hljs-number">0.34732658099586405</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-03</span><span class="hljs-number">-01</span>,<span class="hljs-number">240.0</span>,<span class="hljs-number">1128.0</span>,<span class="hljs-number">8.0</span>,<span class="hljs-number">29.0</span>,,,<span class="hljs-number">18.656399207777838</span>,<span class="hljs-number">0.47964146899428844</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-03</span><span class="hljs-number">-02</span>,<span class="hljs-number">561.0</span>,<span class="hljs-number">1689.0</span>,<span class="hljs-number">6.0</span>,<span class="hljs-number">35.0</span>,,,<span class="hljs-number">27.93498072866735</span>,<span class="hljs-number">0.5788776349931067</span>,
<span class="hljs-number">2020</span><span class="hljs-number">-03</span><span class="hljs-number">-03</span>,<span class="hljs-number">347.0</span>,<span class="hljs-number">2036.0</span>,<span class="hljs-number">17.0</span>,<span class="hljs-number">52.0</span>,,,<span class="hljs-number">33.67413899559901</span>,<span class="hljs-number">0.8600467719897585</span>,
...
</code></pre>
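<p>To see what the index column looks like in the output, here is a minimal sketch that writes a small, hand-typed sample (a hypothetical stand-in for <code>result_df</code>) to an in-memory buffer, with and without the index. The names <code>sample_df</code>, <code>with_index</code>, and <code>without_index</code> are made up for this illustration.</p>

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for result_df
sample_df = pd.DataFrame({
    "date": ["2020-02-27", "2020-02-28"],
    "new_cases": [78.0, 250.0],
    "new_tests": [float("nan"), float("nan")],
})

# Write to in-memory buffers instead of files so the text is easy to inspect
with_index = io.StringIO()
sample_df.to_csv(with_index)                  # default: the index is written as the first column
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)  # index column turned off

print(with_index.getvalue().splitlines()[0])     # ,date,new_cases,new_tests
print(without_index.getvalue().splitlines()[0])  # date,new_cases,new_tests

# Reading the index-free text back gives the same columns
round_trip = pd.read_csv(io.StringIO(without_index.getvalue()))
print(round_trip.columns.tolist())
```

<p>Missing values are written as empty fields, which is why some rows in the CSV output above end with a trailing comma.</p>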
<h3 id="heading-bonus-basic-plotting-with-pandas">Bonus: Basic Plotting with Pandas</h3>
<p>We generally use a library like <code>matplotlib</code> or <code>seaborn</code> to plot graphs within a Jupyter notebook. However, Pandas dataframes and series provide a handy <code>.plot</code> method for quick and easy plotting.</p>
<p>Let's plot a line graph showing how the number of daily cases varies over time.</p>
<pre><code class="lang-py">result_df.new_cases.plot();
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-137.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>While this plot shows the overall trend, it's hard to tell where the peak occurred, as there are no dates on the X-axis. We can use the <code>date</code> column as the index for the data frame to address this issue.</p>
<pre><code class="lang-py">result_df.set_index(<span class="hljs-string">'date'</span>, inplace=<span class="hljs-literal">True</span>)

result_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-138.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice that the index of a data frame doesn't have to be numeric. Using the date as the index also allows us to get the data for a specific date using <code>.loc</code>.</p>
<pre><code class="lang-py">result_df.loc[<span class="hljs-string">'2020-09-01'</span>]
<span class="hljs-comment"># new_cases             9.960000e+02</span>
<span class="hljs-comment"># total_cases           2.696595e+05</span>
<span class="hljs-comment"># new_deaths            6.000000e+00</span>
<span class="hljs-comment"># total_deaths          3.548300e+04</span>
<span class="hljs-comment"># new_tests             5.439500e+04</span>
<span class="hljs-comment"># total_tests           5.214766e+06</span>
<span class="hljs-comment"># cases_per_million     4.459996e+03</span>
<span class="hljs-comment"># deaths_per_million    5.868661e+02</span>
<span class="hljs-comment"># tests_per_million     8.624890e+04</span>
<span class="hljs-comment"># Name: 2020-09-01 00:00:00, dtype: float64</span>
</code></pre>
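<p>A date index also supports label slices. The sketch below uses a small, hand-typed sample rather than the real dataset (only the first <code>new_cases</code> value comes from the output above, the others are placeholders), and the names <code>sample_df</code> and <code>indexed_df</code> are assumptions made for this example.</p>

```python
import pandas as pd

# Hypothetical three-day sample standing in for result_df
sample_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-09-01", "2020-09-02", "2020-09-03"]),
    "new_cases": [996.0, 975.0, 1326.0],
})

# set_index without inplace=True returns a new data frame
indexed_df = sample_df.set_index("date")

# A single date label returns that row as a Series
one_day = indexed_df.loc["2020-09-01"]
print(one_day["new_cases"])

# A label slice on a sorted date index includes both endpoints
two_days = indexed_df.loc["2020-09-01":"2020-09-02"]
print(len(two_days))
```

<p>Unlike positional slices, label slices with <code>.loc</code> include the end label, so the second lookup returns two rows.</p>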
<p>Let's plot the new cases and new deaths per day as line graphs.</p>
<pre><code class="lang-py">result_df.new_cases.plot()
result_df.new_deaths.plot();
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-139.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can also compare the total cases vs. total deaths.</p>
<pre><code class="lang-py">result_df.total_cases.plot()
result_df.total_deaths.plot();
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-140.png" alt="Image" width="600" height="400" loading="lazy"></p>
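<p>As an alternative to calling <code>.plot</code> once per column, you can select several columns and plot them with a single call. This is a minimal sketch on a three-row sample (values taken from the CSV output shown earlier); <code>sample_df</code> is a hypothetical stand-in for <code>result_df</code>.</p>

```python
import pandas as pd

# Three rows copied from the CSV output, indexed by date
sample_df = pd.DataFrame(
    {"total_cases": [400.0, 650.0, 888.0], "total_deaths": [12.0, 17.0, 21.0]},
    index=pd.to_datetime(["2020-02-27", "2020-02-28", "2020-02-29"]),
)

# Selecting a list of columns and calling .plot draws one line per column
# and adds a legend with the column names
ax = sample_df[["total_cases", "total_deaths"]].plot(title="Total cases vs. total deaths")
ax.set_ylabel("Count")

print(len(ax.get_lines()))
```

<p>The call returns a Matplotlib <code>Axes</code> object, so you can keep customizing the chart afterwards.</p>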
<p>Let's see how the death rate and the positive testing rate vary over time.</p>
<pre><code class="lang-py">death_rate = result_df.total_deaths / result_df.total_cases

death_rate.plot(title=<span class="hljs-string">'Death Rate'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-141.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">positive_rates = result_df.total_cases / result_df.total_tests

positive_rates.plot(title=<span class="hljs-string">'Positive Rate'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-142.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Finally, let's plot some month-wise data using a bar chart to visualize the trend at a higher level.</p>
<pre><code class="lang-py">covid_month_df.new_cases.plot(kind=<span class="hljs-string">'bar'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-143.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">covid_month_df.new_tests.plot(kind=<span class="hljs-string">'bar'</span>)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-144.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-pandas-exercises">Pandas Exercises</h3>
<p>Try the following exercises to become familiar with Pandas dataframes and practice your skills:</p>
<ul>
<li><a target="_blank" href="https://jovian.ml/aakashns/pandas-practice-assignment">Assignment on Pandas dataframes</a></li>
<li><a target="_blank" href="https://github.com/guipsamora/pandas_exercises">Additional exercises on Pandas</a></li>
<li><a target="_blank" href="https://www.kaggle.com/datasets">Try downloading and analyzing some data from Kaggle</a></li>
</ul>
<h3 id="heading-summary-and-further-reading-1">Summary and Further Reading</h3>
<p>We've covered the following topics in this tutorial:</p>
<ul>
<li>How to read a CSV file into a Pandas data frame</li>
<li>How to retrieve data from Pandas data frames</li>
<li>How to query, sort, and analyze data</li>
<li>How to merge, group, and aggregate data</li>
<li>How to extract useful information from dates</li>
<li>Basic plotting using line and bar charts</li>
<li>How to write data frames to CSV files</li>
</ul>
<p>Check out the following resources to learn more about Pandas:</p>
<ul>
<li><a target="_blank" href="https://pandas.pydata.org/docs/user_guide/index.html">User guide for Pandas</a></li>
<li><a target="_blank" href="https://www.oreilly.com/library/view/python-for-data/9781491957653/">Python for Data Analysis (book by Wes McKinney - creator of Pandas)</a></li>
</ul>
<h3 id="heading-review-questions-to-check-your-comprehension-1">Review Questions to Check Your Comprehension</h3>
<p>Try answering the following questions to test your understanding of the topics covered in this notebook:</p>
<ol>
<li>What is Pandas? What makes it useful?</li>
<li>How do you install the Pandas library?</li>
<li>How do you import the <code>pandas</code> module?</li>
<li>What is the common alias used while importing the <code>pandas</code> module?</li>
<li>How do you read a CSV file using Pandas? Give an example.</li>
<li>What are some other file formats you can read using Pandas? Illustrate with examples.</li>
<li>What are Pandas dataframes?</li>
<li>How are Pandas dataframes different from NumPy arrays?</li>
<li>How do you find the number of rows and columns in a dataframe?</li>
<li>How do you get the list of columns in a dataframe?</li>
<li>What is the purpose of the <code>describe</code> method of a dataframe?</li>
<li>How are the <code>info</code> and <code>describe</code> dataframe methods different?</li>
<li>Is a Pandas dataframe conceptually similar to a list of dictionaries or a dictionary of lists? Explain with an example.</li>
<li>What is a Pandas <code>Series</code>? How is it different from a NumPy array?</li>
<li>How do you access a column from a dataframe?</li>
<li>How do you access a row from a dataframe?</li>
<li>How do you access an element at a specific row and column of a dataframe?</li>
<li>How do you create a subset of a dataframe with a specific set of columns?</li>
<li>How do you create a subset of a dataframe with a specific range of rows?</li>
<li>Does changing a value within a dataframe affect other dataframes created using a subset of the rows or columns? Why is it so?</li>
<li>How do you create a copy of a dataframe?</li>
<li>Why should you avoid creating too many copies of a dataframe?</li>
<li>How do you view the first few rows of a dataframe?</li>
<li>How do you view the last few rows of a dataframe?</li>
<li>How do you view a random selection of rows of a dataframe?</li>
<li>What is the "index" in a dataframe? How is it useful?</li>
<li>What does a <code>NaN</code> value in a Pandas dataframe represent?</li>
<li>How is <code>NaN</code> different from <code>0</code>?</li>
<li>How do you identify the first non-empty row in a Pandas series or column?</li>
<li>What is the difference between <code>df.loc</code> and <code>df.at</code>?</li>
<li>Where can you find a full list of methods supported by Pandas <code>DataFrame</code> and <code>Series</code> objects?</li>
<li>How do you find the sum of numbers in a column of a dataframe?</li>
<li>How do you find the mean of numbers in a column of a dataframe?</li>
<li>How do you find the number of non-empty values in a column of a dataframe?</li>
<li>What is the result obtained by using a Pandas column in a boolean expression? Illustrate with an example.</li>
<li>How do you select a subset of rows where a specific column's value meets a given condition? Illustrate with an example.</li>
<li>What is the result of the expression <code>df[df.new_cases &gt; 100]</code> ?</li>
<li>How do you display all the rows of a pandas dataframe in a Jupyter cell output?</li>
<li>What is the result obtained when you perform an arithmetic operation between two columns of a dataframe? Illustrate with an example.</li>
<li>How do you add a new column to a dataframe by combining values from two existing columns? Illustrate with an example.</li>
<li>How do you remove a column from a dataframe? Illustrate with an example.</li>
<li>What is the purpose of the <code>inplace</code> argument in dataframe methods?</li>
<li>How do you sort the rows of a dataframe based on the values in a particular column?</li>
<li>How do you sort a pandas dataframe using values from multiple columns?</li>
<li>How do you specify whether to sort by ascending or descending order while sorting a Pandas dataframe?</li>
<li>How do you change a specific value within a dataframe?</li>
<li>How do you convert a dataframe column to the <code>datetime</code> data type?</li>
<li>What are the benefits of using the <code>datetime</code> data type instead of <code>object</code>?</li>
<li>How do you extract different parts of a date column like the year, month, day, weekday, and so on into separate columns? Illustrate with an example.</li>
<li>How do you aggregate multiple columns of a dataframe together?</li>
<li>What is the purpose of the <code>groupby</code> method of a dataframe? Illustrate with an example.</li>
<li>What are the different ways in which you can aggregate the groups created by <code>groupby</code>?</li>
<li>What do you mean by a running or cumulative sum?</li>
<li>How do you create a new column containing the running or cumulative sum of another column?</li>
<li>What are other cumulative measures supported by Pandas dataframes?</li>
<li>What does it mean to merge two dataframes? Give an example.</li>
<li>How do you specify the columns that should be used for merging two dataframes?</li>
<li>How do you write data from a Pandas dataframe into a CSV file? Give an example.</li>
<li>What are some other file formats you can write to from a Pandas dataframe? Illustrate with examples.</li>
<li>How do you create a line plot showing the values within a column of a dataframe?</li>
<li>How do you convert a column of a dataframe into its index?</li>
<li>Can the index of a dataframe be non-numeric?</li>
<li>What are the benefits of using a non-numeric index for a dataframe? Illustrate with an example.</li>
<li>How do you create a bar plot showing the values within a column of a dataframe?</li>
<li>What are some other types of plots supported by Pandas dataframes and series?</li>
</ol>
<p>You are ready to move on to the next section of the tutorial.</p>
<h2 id="heading-data-visualization-using-python-matplotlib-and-seaborn">Data Visualization using Python, Matplotlib, and Seaborn</h2>
<p><img src="https://i.imgur.com/9i806Rh.png" alt="Image" width="2314" height="1092" loading="lazy"></p>
<p>Notebook link: <a target="_blank" href="https://jovian.ai/aakashns/python-matplotlib-data-visualization">https://jovian.ai/aakashns/python-matplotlib-data-visualization</a></p>
<p>Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers. </p>
<p>Visualizing data is an essential part of data analysis and machine learning. We'll use the Python libraries <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org">Matplotlib</a> and <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fseaborn.pydata.org">Seaborn</a> to learn and apply some popular data visualization techniques. We'll use the words <em>chart</em>, <em>plot</em>, and <em>graph</em> interchangeably in this tutorial.</p>
<p>To begin, let's install and import the libraries. We'll use the <code>matplotlib.pyplot</code> module for basic plots like line and bar charts. It is often imported with the alias <code>plt</code>. We'll use the <code>seaborn</code> module for more advanced plots. It is commonly imported with the alias <code>sns</code>.</p>
<pre><code class="lang-py">!pip install matplotlib seaborn --upgrade --quiet

<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
%matplotlib inline
</code></pre>
<p>Notice that we also include the special command <code>%matplotlib inline</code> to ensure that our plots are shown and embedded within the Jupyter notebook itself. Without this command, plots may sometimes show up in pop-up windows.</p>
<h3 id="heading-how-to-create-a-line-chart-in-python">How to Create a Line Chart in Python</h3>
<p>The line chart is one of the simplest and most widely used data visualization techniques. A line chart displays information as a series of data points or markers connected by straight lines. </p>
<p>You can customize the shape, size, color, and other aesthetic elements of the lines and markers for better visual clarity.</p>
<p>Here's a Python list showing the yield of apples (tons per hectare) over six years in an imaginary country called Kanto.</p>
<pre><code class="lang-py">yield_apples = [<span class="hljs-number">0.895</span>, <span class="hljs-number">0.91</span>, <span class="hljs-number">0.919</span>, <span class="hljs-number">0.926</span>, <span class="hljs-number">0.929</span>, <span class="hljs-number">0.931</span>]
</code></pre>
<p>We can visualize how the yield of apples changes over time using a line chart. To draw a line chart, we can use the <code>plt.plot</code> function.</p>
<pre><code class="lang-py">plt.plot(yield_apples)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-145.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Calling the <code>plt.plot</code> function draws the line chart as expected. It also returns a list of the line objects it drew, <code>[&lt;matplotlib.lines.Line2D at 0x7ff70aa20760&gt;]</code>, which is shown in the cell output. We can add a semicolon (<code>;</code>) at the end of the last statement in the cell to suppress this output and display just the graph.</p>
<pre><code class="lang-py">plt.plot(yield_apples);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-146.png" alt="Image" width="600" height="400" loading="lazy"></p>
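<p>If you are curious about what <code>plt.plot</code> returns, the following sketch stores the result in a variable and inspects it. The variable names <code>lines</code> and <code>line</code> are made up for this example.</p>

```python
import matplotlib.pyplot as plt

yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# plt.plot returns a list with one Line2D object per line drawn
lines = plt.plot(yield_apples)
line = lines[0]

# With a single argument, the x-values default to the list indices 0 to 5
print(len(lines))
print(list(line.get_xdata()))
print(list(line.get_ydata()))
```

<p>Assigning the result to a variable also keeps it out of the cell output, just like the trailing semicolon does.</p>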
<p>Let's enhance this plot step-by-step to make it more informative and beautiful.</p>
<h4 id="heading-how-to-customize-the-x-axis-in-matplotlib"><strong>How to Customize the X-axis in Matplotlib</strong></h4>
<p>The X-axis of the plot currently shows list element indices 0 to 5. The plot would be more informative if we could display the year for which we're plotting the data. We can do this by passing two arguments to <code>plt.plot</code>: the X values followed by the Y values.</p>
<pre><code class="lang-py">years = [<span class="hljs-number">2010</span>, <span class="hljs-number">2011</span>, <span class="hljs-number">2012</span>, <span class="hljs-number">2013</span>, <span class="hljs-number">2014</span>, <span class="hljs-number">2015</span>]
yield_apples = [<span class="hljs-number">0.895</span>, <span class="hljs-number">0.91</span>, <span class="hljs-number">0.919</span>, <span class="hljs-number">0.926</span>, <span class="hljs-number">0.929</span>, <span class="hljs-number">0.931</span>]

plt.plot(years, yield_apples)
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-147.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-axis-labels-in-matplotlib"><strong>Axis Labels in Matplotlib</strong></h4>
<p>We can add labels to the axes to show what each axis represents using the <code>plt.xlabel</code> and <code>plt.ylabel</code> functions.</p>
<pre><code class="lang-py">plt.plot(years, yield_apples)
plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-148.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-plot-multiple-lines-in-matplotlib"><strong>How to Plot Multiple Lines in Matplotlib</strong></h4>
<p>You can invoke the <code>plt.plot</code> function once for each line to plot multiple lines in the same graph. Let's compare the yields of apples vs. oranges in Kanto.</p>
<pre><code class="lang-py">years = range(<span class="hljs-number">2000</span>, <span class="hljs-number">2012</span>)
apples = [<span class="hljs-number">0.895</span>, <span class="hljs-number">0.91</span>, <span class="hljs-number">0.919</span>, <span class="hljs-number">0.926</span>, <span class="hljs-number">0.929</span>, <span class="hljs-number">0.931</span>, <span class="hljs-number">0.934</span>, <span class="hljs-number">0.936</span>, <span class="hljs-number">0.937</span>, <span class="hljs-number">0.9375</span>, <span class="hljs-number">0.9372</span>, <span class="hljs-number">0.939</span>]
oranges = [<span class="hljs-number">0.962</span>, <span class="hljs-number">0.941</span>, <span class="hljs-number">0.930</span>, <span class="hljs-number">0.923</span>, <span class="hljs-number">0.918</span>, <span class="hljs-number">0.908</span>, <span class="hljs-number">0.907</span>, <span class="hljs-number">0.904</span>, <span class="hljs-number">0.901</span>, <span class="hljs-number">0.898</span>, <span class="hljs-number">0.9</span>, <span class="hljs-number">0.896</span>, ]

plt.plot(years, apples)
plt.plot(years, oranges)
plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-149.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-chart-title-and-legend-in-matplotlib"><strong>Chart Title and Legend in Matplotlib</strong></h4>
<p>To differentiate between multiple lines, we can include a legend within the graph using the <code>plt.legend</code> function. We can also set a title for the chart using the <code>plt.title</code> function.</p>
<pre><code class="lang-py">plt.plot(years, apples)
plt.plot(years, oranges)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-150.png" alt="Image" width="600" height="400" loading="lazy"></p>
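<p>Passing a list to <code>plt.legend</code> relies on the order in which the lines were drawn. An alternative sketch is to attach a <code>label</code> to each line when plotting it and then call <code>plt.legend()</code> with no arguments. The shortened six-year data below is only for illustration.</p>

```python
import matplotlib.pyplot as plt

# Shortened sample of the crop yield data, for illustration only
years = range(2000, 2006)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# Each line carries its own label, so the legend cannot get out of order
plt.plot(years, apples, label="Apples")
plt.plot(years, oranges, label="Oranges")

plt.xlabel("Year")
plt.ylabel("Yield (tons per hectare)")
plt.title("Crop Yields in Kanto")

# With no arguments, the legend is built from the line labels
legend = plt.legend()
print([text.get_text() for text in legend.get_texts()])
```

<p>This approach tends to be easier to maintain when lines are added or reordered later.</p>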
<h4 id="heading-how-to-use-line-markers-in-matplotlib"><strong>How to Use Line Markers in Matplotlib</strong></h4>
<p>We can also show markers for the data points on each line using the <code>marker</code> argument of <code>plt.plot</code>. </p>
<p>Matplotlib provides many different markers like a circle, cross, square, diamond, and more. You can find the full list of marker types here: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.1.1%2Fapi%2Fmarkers_api.html">https://matplotlib.org/3.1.1/api/markers_api.html</a> .</p>
<pre><code class="lang-py">plt.plot(years, apples, marker=<span class="hljs-string">'o'</span>)
plt.plot(years, oranges, marker=<span class="hljs-string">'x'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-151.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-style-lines-and-markers-in-matplotlib"><strong>How to Style Lines and Markers in Matplotlib</strong></h4>
<p>The <code>plt.plot</code> function supports many arguments for styling lines and markers:</p>
<ul>
<li><code>color</code> or <code>c</code> – Set the color of the line (<a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.1.0%2Fgallery%2Fcolor%2Fnamed_colors.html">supported colors</a>)</li>
<li><code>linestyle</code> or <code>ls</code> – Choose between a solid or dashed line</li>
<li><code>linewidth</code> or <code>lw</code> – Set the width of a line</li>
<li><code>markersize</code> or <code>ms</code> – Set the size of markers</li>
<li><code>markeredgecolor</code> or <code>mec</code> – Set the edge color for markers</li>
<li><code>markeredgewidth</code> or <code>mew</code> – Set the edge width for markers</li>
<li><code>markerfacecolor</code> or <code>mfc</code> – Set the fill color for markers</li>
<li><code>alpha</code> – Opacity of the plot</li>
</ul>
<p>Check out the documentation for <code>plt.plot</code> to learn more: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2Fapi%2F_as_gen%2Fmatplotlib.pyplot.plot.html%23matplotlib.pyplot.plot">https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot</a> .</p>
<pre><code class="lang-py">plt.plot(years, apples, marker=<span class="hljs-string">'s'</span>, c=<span class="hljs-string">'b'</span>, ls=<span class="hljs-string">'-'</span>, lw=<span class="hljs-number">2</span>, ms=<span class="hljs-number">8</span>, mew=<span class="hljs-number">2</span>, mec=<span class="hljs-string">'navy'</span>)
plt.plot(years, oranges, marker=<span class="hljs-string">'o'</span>, c=<span class="hljs-string">'r'</span>, ls=<span class="hljs-string">'--'</span>, lw=<span class="hljs-number">3</span>, ms=<span class="hljs-number">10</span>, alpha=<span class="hljs-number">.5</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-152.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The <code>fmt</code> argument provides a shorthand for specifying the marker shape, line style, and line color. You can provide it as the third argument to <code>plt.plot</code>.</p>
<pre><code class="lang-py">fmt = <span class="hljs-string">'[marker][line][color]'</span>

plt.plot(years, apples, <span class="hljs-string">'s-b'</span>)
plt.plot(years, oranges, <span class="hljs-string">'o--r'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-153.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>If you don't specify a line style in <code>fmt</code>, only the markers are drawn.</p>
<pre><code class="lang-py">plt.plot(years, oranges, <span class="hljs-string">'or'</span>)
plt.title(<span class="hljs-string">"Yield of Oranges (tons per hectare)"</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-154.png" alt="Image" width="600" height="400" loading="lazy"></p>
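<p>As a quick check of how a format string is interpreted, the sketch below plots the same data with and without a line style in <code>fmt</code> and then reads the resulting properties back from the line objects. The variable names and the shortened data are assumptions made for this example.</p>

```python
import matplotlib.pyplot as plt

# Shortened sample of the orange yield data, for illustration only
years = range(2000, 2006)
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# plt.plot returns a one-element list here, so we unpack the single line
(with_line,) = plt.plot(years, oranges, "o--r")   # circle marker, dashed line, red
(markers_only,) = plt.plot(years, oranges, "or")  # circle marker, red, no line style

# Read back what Matplotlib parsed from each format string
print(with_line.get_marker(), with_line.get_linestyle(), with_line.get_color())
print(markers_only.get_marker(), markers_only.get_linestyle())
```

<p>The second line reports no line style, which is why only the markers show up in the chart.</p>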
<h4 id="heading-how-to-change-the-figure-size-in-matplotlib"><strong>How to Change the Figure Size in Matplotlib</strong></h4>
<p>You can use the <code>plt.figure</code> function to change the size of the figure.</p>
<pre><code class="lang-py">plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))

plt.plot(years, oranges, <span class="hljs-string">'or'</span>)
plt.title(<span class="hljs-string">"Yield of Oranges (tons per hectare)"</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-155.png" alt="Image" width="600" height="400" loading="lazy"></p>
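<p>The <code>figsize</code> argument is a <code>(width, height)</code> pair measured in inches. Here is a small sketch that keeps a reference to the figure and reads the size back; the name <code>fig</code> and the shortened data are assumptions for this example.</p>

```python
import matplotlib.pyplot as plt

# Shortened sample of the orange yield data, for illustration only
years = range(2000, 2006)
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# figsize is (width, height) in inches; plt.figure returns the new Figure
fig = plt.figure(figsize=(12, 6))

plt.plot(years, oranges, "or")
plt.title("Yield of Oranges (tons per hectare)")

# The size can be read back from the figure object
width, height = fig.get_size_inches()
print(width, height)
```

<p>Call <code>plt.figure</code> before the plotting functions, since they draw on the figure that is current at the time.</p>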
<h4 id="heading-how-to-improve-default-styles-using-seaborn"><strong>How to Improve Default Styles using Seaborn</strong></h4>
<p>An easy way to make your charts look beautiful is to use some default styles from the Seaborn library. You can apply them globally using the <code>sns.set_style</code> function. You can see a full list of predefined styles here: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fseaborn.pydata.org%2Fgenerated%2Fseaborn.set_style.html">https://seaborn.pydata.org/generated/seaborn.set_style.html</a> .</p>
<pre><code class="lang-py">sns.set_style(<span class="hljs-string">"whitegrid"</span>)
plt.plot(years, apples, <span class="hljs-string">'s-b'</span>)
plt.plot(years, oranges, <span class="hljs-string">'o--r'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-156.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">sns.set_style(<span class="hljs-string">"darkgrid"</span>)

plt.plot(years, apples, <span class="hljs-string">'s-b'</span>)
plt.plot(years, oranges, <span class="hljs-string">'o--r'</span>)

plt.xlabel(<span class="hljs-string">'Year'</span>)
plt.ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)

plt.title(<span class="hljs-string">"Crop Yields in Kanto"</span>)
plt.legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-157.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">plt.plot(years, oranges, <span class="hljs-string">'or'</span>)
plt.title(<span class="hljs-string">"Yield of Oranges (tons per hectare)"</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-158.png" alt="Image" width="600" height="400" loading="lazy"></p>
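<p>Because <code>sns.set_style</code> changes the style globally, every later chart is affected. If you only want a style for one chart, a possible approach is the <code>sns.axes_style</code> function, which can be used in a <code>with</code> block. This is a sketch with shortened sample data, not part of the original notebook.</p>

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Shortened sample of the apple yield data, for illustration only
years = range(2000, 2006)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# Called with a style name, axes_style returns the parameters that style sets
whitegrid = sns.axes_style("whitegrid")
print(whitegrid["axes.grid"])

# Used as a context manager, the style applies only to axes created inside the block
with sns.axes_style("darkgrid"):
    plt.figure()
    plt.plot(years, apples, "s-b")
    plt.title("Crop Yields in Kanto")
```

<p>Charts created after the <code>with</code> block go back to whatever style was active before it.</p>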
<p>You can also edit default styles directly by modifying the <code>matplotlib.rcParams</code> dictionary. Learn more: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.2.1%2Ftutorials%2Fintroductory%2Fcustomizing.html%23matplotlib-rcparams">https://matplotlib.org/3.2.1/tutorials/introductory/customizing.html#matplotlib-rcparams</a> .</p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> matplotlib

matplotlib.rcParams[<span class="hljs-string">'font.size'</span>] = <span class="hljs-number">14</span>
matplotlib.rcParams[<span class="hljs-string">'figure.figsize'</span>] = (<span class="hljs-number">9</span>, <span class="hljs-number">5</span>)
matplotlib.rcParams[<span class="hljs-string">'figure.facecolor'</span>] = <span class="hljs-string">'#00000000'</span>
</code></pre>
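<p>Assigning to <code>matplotlib.rcParams</code> changes the defaults for the rest of the session. If you want the overrides to apply to a single chart only, one option is <code>matplotlib.rc_context</code>, sketched below with a few sample values.</p>

```python
import matplotlib
import matplotlib.pyplot as plt

default_size = matplotlib.rcParams["font.size"]

# rc_context applies the overrides only inside the with-block
with matplotlib.rc_context({"font.size": 14, "figure.figsize": (9, 5)}):
    fig = plt.figure()
    inside_size = matplotlib.rcParams["font.size"]
    plt.plot([2010, 2011, 2012], [0.895, 0.91, 0.919])
    plt.title("Yield of Apples (tons per hectare)")

# After the block, the previous defaults are restored
restored_size = matplotlib.rcParams["font.size"]
print(inside_size, restored_size == default_size)
```

<p>The figure created inside the block keeps its 9 by 5 inch size, while later figures use the earlier defaults again.</p>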
<h3 id="heading-scatter-plots-in-matplotlib">Scatter Plots in Matplotlib</h3>
<p>In a scatter plot, the values of 2 variables are plotted as points on a 2-dimensional grid. Additionally, you can also use a third variable to determine the size or color of the points. Let's try out an example.</p>
<p>The <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FIris_flower_data_set">Iris flower dataset</a> provides sample measurements of sepals and petals for three species of flowers. The Iris dataset is included with the Seaborn library and you can load it as a Pandas data frame.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Load data into a Pandas dataframe</span>
flowers_df = sns.load_dataset(<span class="hljs-string">"iris"</span>)

flowers_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-159.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py">flowers_df.species.unique()
<span class="hljs-comment"># array(['setosa', 'versicolor', 'virginica'], dtype=object)</span>
</code></pre>
<p>Let's try to visualize the relationship between sepal length and sepal width. Our first instinct might be to create a line chart using <code>plt.plot</code>.</p>
<pre><code class="lang-py">plt.plot(flowers_df.sepal_length, flowers_df.sepal_width);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-160.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The output is not very informative because <code>plt.plot</code> connects the points in the order they appear in the data frame, and there are too many combinations of the two properties within the dataset. There doesn't seem to be a simple relationship between them.</p>
<p>We can use a scatter plot to visualize how sepal length and sepal width vary using the <code>scatterplot</code> function from the <code>seaborn</code> module (imported as <code>sns</code>).</p>
<pre><code class="lang-py">sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-161.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-add-hues-in-matplotlib"><strong>How to Add Hues in Seaborn</strong></h4>
<p>Notice how the points in the above plot seem to form distinct clusters with some outliers. We can color the dots using the flower species as a <code>hue</code>. We can also make the points larger using the <code>s</code> argument.</p>
<pre><code class="lang-py">sns.scatterplot(x=flowers_df.sepal_length, y=flowers_df.sepal_width, hue=flowers_df.species, s=<span class="hljs-number">100</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-162.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Adding hues makes the plot more informative. We can immediately tell that Setosa irises tend to have shorter but wider sepals, while Virginica irises tend to have longer, narrower ones.</p>
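<p>To get a feel for what <code>hue</code> does, here is a rough Matplotlib-only sketch that colors points by category, drawing one scatter call per group. It is not how Seaborn is implemented, and <code>toy_df</code> with its values is a hypothetical stand-in for the Iris data.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements standing in for the Iris data
toy_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.4, 6.9, 6.3, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.3, 3.0],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

fig, ax = plt.subplots()

# One scatter call per category: each call picks up the next color in the cycle
for species_name, group in toy_df.groupby("species"):
    ax.scatter(group.sepal_length, group.sepal_width, s=100, label=species_name)

ax.legend(title="species")

# Collect the legend entries to confirm one entry per category
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```

<p>With Seaborn, the single <code>hue</code> argument replaces the loop and the manual legend.</p>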
<h4 id="heading-how-to-customize-seaborn-figures"><strong>How to Customize Seaborn Figures</strong></h4>
<p>Since Seaborn uses Matplotlib's plotting functions internally, we can use functions like <code>plt.figure</code> and <code>plt.title</code> to modify the figure.</p>
<pre><code class="lang-py">plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))
plt.title(<span class="hljs-string">'Sepal Dimensions'</span>)

sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species,
                s=<span class="hljs-number">100</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-163.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h4 id="heading-how-to-plot-data-using-pandas-data-frames-with-seaborn"><strong>How to Plot Data using Pandas Data Frames with Seaborn</strong></h4>
<p>Seaborn has built-in support for Pandas data frames. Instead of passing each column as a series, you can provide column names and use the <code>data</code> argument to specify a data frame.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">'Sepal Dimensions'</span>)
sns.scatterplot(x=<span class="hljs-string">'sepal_length'</span>, 
                y=<span class="hljs-string">'sepal_width'</span>, 
                hue=<span class="hljs-string">'species'</span>,
                s=<span class="hljs-number">100</span>,
                data=flowers_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-164.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-histograms-in-matplotlib">Histograms in Matplotlib</h3>
<p>A histogram represents the distribution of a variable by creating bins (intervals) along the range of values and showing vertical bars to indicate the number of observations in each bin.</p>
<p>For example, let's visualize the distribution of values of sepal width in the Iris dataset. We can use the <code>plt.hist</code> function to create a histogram.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Load data into a Pandas dataframe</span>
flowers_df = sns.load_dataset(<span class="hljs-string">"iris"</span>)

flowers_df.sepal_width
<span class="hljs-comment"># 0      3.5</span>
<span class="hljs-comment"># 1      3.0</span>
<span class="hljs-comment"># 2      3.2</span>
<span class="hljs-comment"># 3      3.1</span>
<span class="hljs-comment"># 4      3.6</span>
<span class="hljs-comment">#       ... </span>
<span class="hljs-comment"># 145    3.0</span>
<span class="hljs-comment"># 146    2.5</span>
<span class="hljs-comment"># 147    3.0</span>
<span class="hljs-comment"># 148    3.4</span>
<span class="hljs-comment"># 149    3.0</span>
<span class="hljs-comment"># Name: sepal_width, Length: 150, dtype: float64</span>
</code></pre>
<pre><code class="lang-py">plt.title(<span class="hljs-string">"Distribution of Sepal Width"</span>)
plt.hist(flowers_df.sepal_width);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-165.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can immediately see that the sepal widths lie in the range 2.0 - 4.5, and around 35 values are in the range 2.9 - 3.1, which seems to be the most populous bin.</p>
<h4 id="heading-how-to-control-the-size-and-number-of-bins"><strong>How to Control the Size and Number of Bins</strong></h4>
<p>We can control the number of bins or the size of each one using the <code>bins</code> argument.</p>
<pre><code class="lang-py"><span class="hljs-comment"># Specifying the number of bins</span>
plt.hist(flowers_df.sepal_width, bins=<span class="hljs-number">5</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-166.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Specifying the boundaries of each bin</span>
plt.hist(flowers_df.sepal_width, bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>));
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-167.png" alt="Image" width="600" height="400" loading="lazy"></p>
<pre><code class="lang-py"><span class="hljs-comment"># Bins of unequal sizes</span>
plt.hist(flowers_df.sepal_width, bins=[<span class="hljs-number">1</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">4.5</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-168.png" alt="Image" width="600" height="400" loading="lazy"></p>
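<p>The counting behind a histogram can be inspected directly with <code>np.histogram</code>, which accepts the same kinds of <code>bins</code> values. The sketch below uses a handful of hypothetical sepal widths (not the real Iris values) to show how an integer and an explicit list of edges are interpreted.</p>

```python
import numpy as np

# Hypothetical sepal widths, not the real Iris values
widths = np.array([2.0, 2.5, 2.9, 3.0, 3.0, 3.1, 3.4, 3.8, 4.4])

# An integer asks for that many equal-width bins between the min and max
counts_equal, edges_equal = np.histogram(widths, bins=3)
print(counts_equal)   # [2 5 2]

# A sequence gives the bin edges explicitly, so bin widths can differ
counts_custom, edges_custom = np.histogram(widths, bins=[1, 3, 4, 4.5])
print(counts_custom)  # [3 5 1]
```

<p>Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.</p>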
<h4 id="heading-how-to-manage-multiple-histograms-in-matplotlib"><strong>How to Manage Multiple Histograms in Matplotlib</strong></h4>
<p>Similar to line charts, we can draw multiple histograms in a single chart. We can reduce each histogram's opacity so that one histogram's bars don't hide the others'.</p>
<p>Let's split the data by species and draw overlapping histograms for two of them.</p>
<pre><code class="lang-py">setosa_df = flowers_df[flowers_df.species == <span class="hljs-string">'setosa'</span>]
versicolor_df = flowers_df[flowers_df.species == <span class="hljs-string">'versicolor'</span>]
virginica_df = flowers_df[flowers_df.species == <span class="hljs-string">'virginica'</span>]

plt.hist(setosa_df.sepal_width, alpha=<span class="hljs-number">0.4</span>, bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>));
plt.hist(versicolor_df.sepal_width, alpha=<span class="hljs-number">0.4</span>, bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>));
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-169.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can also stack multiple histograms on top of one another.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">'Distribution of Sepal Width'</span>)

plt.hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>), 
         stacked=<span class="hljs-literal">True</span>);

plt.legend([<span class="hljs-string">'Setosa'</span>, <span class="hljs-string">'Versicolor'</span>, <span class="hljs-string">'Virginica'</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-170.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-bar-charts-in-matplotlib">Bar Charts in Matplotlib</h3>
<p>Bar charts are quite similar to line charts in that they show a sequence of values. However, a bar is shown for each value, rather than points connected by lines. We can use the <code>plt.bar</code> function to draw a bar chart.</p>
<pre><code class="lang-py">years = range(<span class="hljs-number">2000</span>, <span class="hljs-number">2006</span>)
apples = [<span class="hljs-number">0.35</span>, <span class="hljs-number">0.6</span>, <span class="hljs-number">0.9</span>, <span class="hljs-number">0.8</span>, <span class="hljs-number">0.65</span>, <span class="hljs-number">0.8</span>]
oranges = [<span class="hljs-number">0.4</span>, <span class="hljs-number">0.8</span>, <span class="hljs-number">0.9</span>, <span class="hljs-number">0.7</span>, <span class="hljs-number">0.6</span>, <span class="hljs-number">0.8</span>]

plt.bar(years, oranges);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-171.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Like histograms, we can stack bars on top of one another. We use the <code>bottom</code> argument of <code>plt.bar</code> to achieve this.</p>
<pre><code class="lang-py">plt.bar(years, apples)
plt.bar(years, oranges, bottom=apples);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-172.png" alt="Image" width="600" height="400" loading="lazy"></p>
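<p>Stacking more than two series follows the same pattern: each new layer's <code>bottom</code> is the running total of the layers beneath it. The sketch below assumes a hypothetical third crop, <code>pears</code>, and uses NumPy arrays so the totals can be added element-wise.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

years = range(2000, 2006)
apples = np.array([0.35, 0.6, 0.9, 0.8, 0.65, 0.8])
oranges = np.array([0.4, 0.8, 0.9, 0.7, 0.6, 0.8])
pears = np.array([0.2, 0.3, 0.25, 0.4, 0.35, 0.3])  # hypothetical third crop

fig, ax = plt.subplots()
ax.bar(years, apples, label="Apples")
ax.bar(years, oranges, bottom=apples, label="Oranges")

# The third layer starts where the first two end
top_layer = ax.bar(years, pears, bottom=apples + oranges, label="Pears")
ax.legend()

# Base of the first bar in the top layer: apples[0] + oranges[0]
first_bar_base = top_layer[0].get_y()
```

<p>With plain Python lists, <code>apples + oranges</code> would concatenate the lists instead of adding them, which is why arrays are used here.</p>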
<h4 id="heading-bar-plots-with-averages-in-seaborn"><strong>Bar Plots with Averages in Seaborn</strong></h4>
<p>Let's look at another sample dataset included with Seaborn called <code>tips</code>. The dataset contains information about the sex, time of day, total bill, and tip amount for customers visiting a restaurant over a week.</p>
<pre><code class="lang-py">tips_df = sns.load_dataset(<span class="hljs-string">"tips"</span>);

tips_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-173.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We might want to draw a bar chart to visualize how the average bill amount varies across different days of the week. One way to do this would be to compute the day-wise averages and then use <code>plt.bar</code> (try it as an exercise).</p>
<p>However, since this is a very common use case, the Seaborn library provides a <code>barplot</code> function which can automatically compute averages.</p>
<pre><code class="lang-py">sns.barplot(x=<span class="hljs-string">'day'</span>, y=<span class="hljs-string">'total_bill'</span>, data=tips_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-174.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The lines cutting each bar are error bars. By default, they show a 95% confidence interval for the average, so a longer line means the estimate is less certain. For instance, the interval for the total bill is relatively wide on Fridays and narrow on Saturdays.</p>
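<p>By default, Seaborn's <code>barplot</code> draws a bootstrapped 95% confidence interval around each average. The following is only a rough NumPy sketch of that idea, not Seaborn's actual implementation, and the <code>bills</code> values are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical bill amounts for a single day
bills = np.array([12.5, 18.0, 22.4, 9.8, 30.1, 15.6, 27.3, 19.9])

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(bills, size=bills.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the resampled means gives the interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
sample_mean = bills.mean()
```

<p>Days with fewer observations tend to produce wider intervals, because each resample's mean varies more.</p>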
<p>We can also specify a <code>hue</code> argument to compare bar plots side-by-side based on a third feature, for example sex.</p>
<pre><code class="lang-py">sns.barplot(x=<span class="hljs-string">'day'</span>, y=<span class="hljs-string">'total_bill'</span>, hue=<span class="hljs-string">'sex'</span>, data=tips_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-175.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can make the bars horizontal simply by switching the axes.</p>
<pre><code class="lang-py">sns.barplot(x=<span class="hljs-string">'total_bill'</span>, y=<span class="hljs-string">'day'</span>, hue=<span class="hljs-string">'sex'</span>, data=tips_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-176.png" alt="Image" width="600" height="400" loading="lazy"></p>
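<p>For the manual approach mentioned earlier (computing the day-wise averages and passing them to <code>plt.bar</code>), a minimal sketch could look like this. The <code>toy_tips</code> frame is a hypothetical stand-in with the same column names as the tips dataset.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows shaped like the tips dataset
toy_tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 12.0, 18.0, 25.0, 35.0],
})

# Average bill per day; sort=False keeps the days in order of first appearance
day_means = toy_tips.groupby("day", sort=False)["total_bill"].mean()

fig, ax = plt.subplots()
ax.bar(list(day_means.index), day_means.values)
ax.set_ylabel("Average total bill")
```

<p>Unlike <code>sns.barplot</code>, this version draws no error bars; they would have to be computed and added separately.</p>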
<h3 id="heading-heatmaps-in-seaborn">Heatmaps in Seaborn</h3>
<p>A heatmap is used to visualize 2-dimensional data like a matrix or a table using colors. The best way to understand it is by looking at an example. </p>
<p>We'll use another sample dataset from Seaborn, called <code>flights</code>, to visualize monthly passenger footfall at an airport over 12 years.</p>
<pre><code class="lang-py">flights_df = sns.load_dataset(<span class="hljs-string">"flights"</span>).pivot(index=<span class="hljs-string">"month"</span>, columns=<span class="hljs-string">"year"</span>, values=<span class="hljs-string">"passengers"</span>)

flights_df
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-177.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p><code>flights_df</code> is a matrix with one row for each month and one column for each year. The values show the number of passengers (in thousands) that visited the airport in a specific month of a year. We can use the <code>sns.heatmap</code> function to visualize the footfall at the airport.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">"No. of Passengers (1000s)"</span>)
sns.heatmap(flights_df);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-178.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The brighter colors indicate a higher footfall at the airport. By looking at the graph, we can infer two things:</p>
<ul>
<li>The footfall at the airport in any given year tends to be the highest around July and August.</li>
<li>The footfall at the airport in any given month tends to grow year by year.</li>
</ul>
<p>We can also display the actual values in each block by specifying <code>annot=True</code>, and change the color palette using the <code>cmap</code> argument.</p>
<pre><code class="lang-py">plt.title(<span class="hljs-string">"No. of Passengers (1000s)"</span>)
sns.heatmap(flights_df, fmt=<span class="hljs-string">"d"</span>, annot=<span class="hljs-literal">True</span>, cmap=<span class="hljs-string">'Blues'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-179.png" alt="Image" width="600" height="400" loading="lazy"></p>
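<p>The reshaping step that produces the matrix for a heatmap is easier to follow on a tiny example. The sketch below applies <code>pivot</code> to a hypothetical four-row frame, <code>long_df</code>, with one row per year and month.</p>

```python
import pandas as pd

# Hypothetical long-format records: one row per (year, month) pair
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Keyword arguments are required by recent pandas versions
wide_df = long_df.pivot(index="month", columns="year", values="passengers")

# One row per month, one column per year
print(wide_df.shape)  # (2, 2)
```

<p>Here the months are plain strings, so <code>pivot</code> sorts them alphabetically. A categorical month column keeps its calendar order instead.</p>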
<h3 id="heading-images-in-matplotlib">Images in Matplotlib</h3>
<p>We can also use Matplotlib to display images. Let's download an image from the internet.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> urllib.request <span class="hljs-keyword">import</span> urlretrieve

urlretrieve(<span class="hljs-string">'https://i.imgur.com/SkPbq.jpg'</span>, <span class="hljs-string">'chart.jpg'</span>);
</code></pre>
<p>Before displaying an image, it has to be read into memory using the <code>PIL</code> module.</p>
<pre><code class="lang-py"><span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image

img = Image.open(<span class="hljs-string">'chart.jpg'</span>)
</code></pre>
<p>A color image loaded using PIL can be represented as a 3-dimensional NumPy array containing pixel intensities for the red, green &amp; blue (RGB) channels of the image. We can convert the image into an array using <code>np.array</code>.</p>
<pre><code class="lang-py">img_array = np.array(img)

img_array.shape
<span class="hljs-comment"># (481, 640, 3)</span>
</code></pre>
<p>We can display the PIL image using <code>plt.imshow</code>.</p>
<pre><code class="lang-py">plt.imshow(img);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-180.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>We can turn off the axes &amp; grid lines and show a title using the relevant functions.</p>
<pre><code class="lang-py">plt.grid(<span class="hljs-literal">False</span>)
plt.title(<span class="hljs-string">'A data science meme'</span>)
plt.axis(<span class="hljs-string">'off'</span>)
plt.imshow(img);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-181.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>To display a part of the image, we can simply select a slice from the numpy array.</p>
<pre><code class="lang-py">plt.grid(<span class="hljs-literal">False</span>)
plt.axis(<span class="hljs-string">'off'</span>)
plt.imshow(img_array[<span class="hljs-number">125</span>:<span class="hljs-number">325</span>,<span class="hljs-number">105</span>:<span class="hljs-number">305</span>]);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-182.png" alt="Image" width="600" height="400" loading="lazy"></p>
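<p>The indexing used for cropping is easier to see on a small array. This sketch builds a tiny synthetic image, <code>tiny_img</code> (a made-up array, not the downloaded file), and slices it the same way.</p>

```python
import numpy as np

# A tiny synthetic "image": 4 rows, 6 columns, 3 color channels (RGB)
tiny_img = np.zeros((4, 6, 3), dtype=np.uint8)
tiny_img[:, :, 0] = 255        # fill the red channel everywhere
tiny_img[1:3, 2:5, 2] = 128    # add some blue to a rectangular region

# Rows come first, then columns; the channel axis is left untouched
crop = tiny_img[1:3, 2:5]
print(crop.shape)         # (2, 3, 3)

# Selecting a single channel drops the last axis
red_channel = tiny_img[:, :, 0]
print(red_channel.shape)  # (4, 6)
```

<p>Because rows are the first axis, the first slice selects a vertical range of the picture and the second slice a horizontal range.</p>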
<h3 id="heading-how-to-plot-multiple-charts-in-a-grid-in-matplotlib-and-seaborn">How to Plot Multiple Charts in a Grid in Matplotlib and Seaborn</h3>
<p>Matplotlib and Seaborn also support plotting multiple charts in a grid, using <code>plt.subplots</code>, which returns a set of axes for plotting.</p>
<p>Here's a single grid showing the different types of charts we've covered in this tutorial.</p>
<pre><code class="lang-py">fig, axes = plt.subplots(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>, figsize=(<span class="hljs-number">16</span>, <span class="hljs-number">8</span>))

<span class="hljs-comment"># Use the axes for plotting</span>
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].plot(years, apples, <span class="hljs-string">'s-b'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].plot(years, oranges, <span class="hljs-string">'o--r'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_xlabel(<span class="hljs-string">'Year'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_ylabel(<span class="hljs-string">'Yield (tons per hectare)'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].legend([<span class="hljs-string">'Apples'</span>, <span class="hljs-string">'Oranges'</span>]);
axes[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">'Crop Yields in Kanto'</span>)


<span class="hljs-comment"># Pass the axes into seaborn</span>
axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">'Sepal Length vs. Sepal Width'</span>)
sns.scatterplot(x=flowers_df.sepal_length, 
                y=flowers_df.sepal_width, 
                hue=flowers_df.species, 
                s=<span class="hljs-number">100</span>, 
                ax=axes[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>]);

<span class="hljs-comment"># Use the axes for plotting</span>
axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>].set_title(<span class="hljs-string">'Distribution of Sepal Width'</span>)
axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>].hist([setosa_df.sepal_width, versicolor_df.sepal_width, virginica_df.sepal_width], 
         bins=np.arange(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, <span class="hljs-number">0.25</span>), 
         stacked=<span class="hljs-literal">True</span>);

axes[<span class="hljs-number">0</span>,<span class="hljs-number">2</span>].legend([<span class="hljs-string">'Setosa'</span>, <span class="hljs-string">'Versicolor'</span>, <span class="hljs-string">'Virginica'</span>]);

<span class="hljs-comment"># Pass the axes into seaborn</span>
axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>].set_title(<span class="hljs-string">'Restaurant bills'</span>)
sns.barplot(x=<span class="hljs-string">'day'</span>, y=<span class="hljs-string">'total_bill'</span>, hue=<span class="hljs-string">'sex'</span>, data=tips_df, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">0</span>]);

<span class="hljs-comment"># Pass the axes into seaborn</span>
axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>].set_title(<span class="hljs-string">'Flight traffic'</span>)
sns.heatmap(flights_df, cmap=<span class="hljs-string">'Blues'</span>, ax=axes[<span class="hljs-number">1</span>,<span class="hljs-number">1</span>]);

<span class="hljs-comment"># Plot an image using the axes</span>
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].set_title(<span class="hljs-string">'Data Science Meme'</span>)
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].imshow(img)
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].grid(<span class="hljs-literal">False</span>)
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].set_xticks([])
axes[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>].set_yticks([])

plt.tight_layout(pad=<span class="hljs-number">2</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-183.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>See this page for a full list of supported functions: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.3.1%2Fapi%2Faxes_api.html%23the-axes-class">https://matplotlib.org/3.3.1/api/axes_api.html#the-axes-class</a>.</p>
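<p>For a grid with more than one row and column, the <code>axes</code> value returned by <code>plt.subplots</code> is a 2-D NumPy array of Axes objects. A stripped-down sketch with empty panels (the titles and labels are placeholders) shows how to index it and loop over it.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(12, 6))

# axes is indexed as [row, column]
axes[0, 1].set_title("Row 0, column 1")

# flat iterates over every Axes, which is handy for shared settings
for i, ax in enumerate(axes.flat):
    ax.set_xlabel(f"panel {i}")

fig.tight_layout()
grid_shape = axes.shape
```

<p>Seaborn's axes-level functions accept one of these objects through their <code>ax</code> argument, as in the larger example above.</p>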
<h4 id="heading-pair-plots-with-seaborn"><strong>Pair Plots with Seaborn</strong></h4>
<p>Seaborn also provides a helper function <code>sns.pairplot</code> to automatically plot several different charts for pairs of features within a dataframe.</p>
<pre><code class="lang-py">sns.pairplot(flowers_df, hue=<span class="hljs-string">'species'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-184.png" alt="Image" width="600" height="400" loading="lazy">
<em>See the full output <a target="_blank" href="https://jovian.ai/embed?url=https://jovian.ai/aakashns/python-matplotlib-data-visualization/">here</a>.</em></p>
<pre><code class="lang-py">sns.pairplot(tips_df, hue=<span class="hljs-string">'sex'</span>);
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/10/image-185.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-summary-and-further-reading-2">Summary and Further Reading</h3>
<p>We have covered the following topics in this tutorial:</p>
<ul>
<li>How to create and customize line charts using Matplotlib</li>
<li>How to visualize relationships between two or more variables using scatter plots</li>
<li>How to study distributions of variables using histograms and bar charts</li>
<li>How to visualize two-dimensional data using heatmaps</li>
<li>How to display images using Matplotlib's <code>plt.imshow</code></li>
<li>How to plot multiple Matplotlib and Seaborn charts in a grid</li>
</ul>
<p>In this tutorial we've covered some of the fundamental concepts and popular techniques for data visualization using Matplotlib and Seaborn. Data visualization is a vast field and we've barely scratched the surface here. Check out these references to learn and discover more:</p>
<ul>
<li>Data Visualization cheat sheet: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fjovian.ml%2Faakashns%2Fdataviz-cheatsheet">https://jovian.ml/aakashns/dataviz-cheatsheet</a></li>
<li>Seaborn gallery: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fseaborn.pydata.org%2Fexamples%2Findex.html">https://seaborn.pydata.org/examples/index.html</a></li>
<li>Matplotlib gallery: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fmatplotlib.org%2F3.1.1%2Fgallery%2Findex.html">https://matplotlib.org/3.1.1/gallery/index.html</a></li>
<li>Matplotlib tutorial: <a target="_blank" href="https://jovian.ai/outlink?url=https%3A%2F%2Fgithub.com%2Frougier%2Fmatplotlib-tutorial">https://github.com/rougier/matplotlib-tutorial</a></li>
</ul>
<h3 id="heading-review-questions-to-check-your-comprehension-2">Review Questions to Check Your Comprehension</h3>
<p>Try answering the following questions to test your understanding of the topics covered in this notebook:</p>
<ol>
<li>What is data visualization?</li>
<li>What is Matplotlib?</li>
<li>What is Seaborn?</li>
<li>How do you install Matplotlib and Seaborn?</li>
<li>How do you import Matplotlib and Seaborn? What are the common aliases used while importing these modules?</li>
<li>What is the purpose of the magic command <code>%matplotlib inline</code>?</li>
<li>What is a line chart?</li>
<li>How do you plot a line chart in Python? Illustrate with an example.</li>
<li>How do you specify values for the X-axis of a line chart?</li>
<li>How do you specify labels for the axes of a chart?</li>
<li>How do you plot multiple line charts on the same axes?</li>
<li>How do you show a legend for a line chart with multiple lines?</li>
<li>How do you set a title for a chart?</li>
<li>How do you show markers on a line chart?</li>
<li>What are the different options for styling lines and markers in line charts? Illustrate with examples.</li>
<li>What is the purpose of the <code>fmt</code> argument to <code>plt.plot</code>?</li>
<li>Where can you see a list of all the arguments accepted by <code>plt.plot</code>?</li>
<li>How do you change the size of the figure using Matplotlib?</li>
<li>How do you apply the default styles from Seaborn globally for all charts?</li>
<li>What are the predefined styles available in Seaborn? Illustrate with examples.</li>
<li>What is a scatter plot?</li>
<li>How is a scatter plot different from a line chart?</li>
<li>How do you draw a scatter plot using Seaborn? Illustrate with an example.</li>
<li>How do you decide when to use a scatter plot vs a line chart?</li>
<li>How do you specify the colors for dots on a scatter plot using a categorical variable?</li>
<li>How do you customize the title, figure size, legend, and so on for Seaborn plots?</li>
<li>How do you use a Pandas dataframe with <code>sns.scatterplot</code>?</li>
<li>What is a histogram?</li>
<li>When should you use a histogram vs a line chart?</li>
<li>How do you draw a histogram using Matplotlib? Illustrate with an example.</li>
<li>What are "bins" in a histogram?</li>
<li>How do you change the sizes of bins in a histogram?</li>
<li>How do you change the number of bins in a histogram?</li>
<li>How do you show multiple histograms on the same axes?</li>
<li>How do you stack multiple histograms on top of one another?</li>
<li>What is a bar chart?</li>
<li>How do you draw a bar chart using Matplotlib? Illustrate with an example.</li>
<li>What is the difference between a bar chart and a histogram?</li>
<li>What is the difference between a bar chart and a line chart?</li>
<li>How do you stack bars on top of one another?</li>
<li>What is the difference between <code>plt.bar</code> and <code>sns.barplot</code>?</li>
<li>What do the lines cutting the bars in a Seaborn bar plot represent?</li>
<li>How do you show bar plots side-by-side?</li>
<li>How do you draw a horizontal bar plot?</li>
<li>What is a heat map?</li>
<li>What type of data is best visualized with a heat map?</li>
<li>What does the <code>pivot</code> method of a Pandas dataframe do?</li>
<li>How do you draw a heat map using Seaborn? Illustrate with an example.</li>
<li>How do you change the color scheme of a heat map?</li>
<li>How do you show the original values from the dataset on a heat map?</li>
<li>How do you download images from a URL in Python?</li>
<li>How do you open an image for processing in Python?</li>
<li>What is the purpose of the <code>PIL</code> module in Python?</li>
<li>How do you convert an image loaded using PIL into a Numpy array?</li>
<li>How many dimensions does a Numpy array for an image have? What does each dimension represent?</li>
<li>What are "color channels" in an image?</li>
<li>What is RGB?</li>
<li>How do you display an image using Matplotlib?</li>
<li>How do you turn off the axes and gridlines in a chart?</li>
<li>How do you display a portion of an image using Matplotlib?</li>
<li>How do you plot multiple charts in a grid using Matplotlib and Seaborn? Illustrate with examples.</li>
<li>What is the purpose of the <code>plt.subplots</code> function?</li>
<li>What are pair plots in Seaborn? Illustrate with an example.</li>
<li>How do you export a plot into a PNG image file using Matplotlib?</li>
<li>Where can you learn about the different types of charts you can create using Matplotlib and Seaborn?</li>
</ol>
<p>Congratulations on making it to the end of this tutorial! You can now apply these skills to analyze real-world datasets from sources like <a target="_blank" href="https://kaggle.com/datasets">Kaggle</a>.</p>
<p>If you're pursuing a career in data science and machine learning, consider joining the <a target="_blank" href="https://zerotodatascience.com">Zero to Data Science Bootcamp by Jovian</a>. It's a 20-week part-time program where you'll complete 7 courses, 12 coding assignments and 4 real-world projects. You will also receive 6 months of career support to help you find your first data science job.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.jovian.ai/zero-to-data-science-bootcamp">https://www.jovian.ai/zero-to-data-science-bootcamp</a></div>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Data Analytics with Pandas – How to Drop a List of Rows from a Pandas Dataframe ]]>
                </title>
                <description>
                    <![CDATA[ A Pandas dataframe is a two dimensional data structure which allows you to store data in rows and columns. It's very useful when you're analyzing data. When you have a list of data records in a dataframe, you may need to drop a specific list of rows ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/drop-list-of-rows-from-pandas-dataframe/</link>
                <guid isPermaLink="false">66bb8ac1c332a9c775d15b63</guid>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ dataframe ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Vikram Aruchamy ]]>
                </dc:creator>
                <pubDate>Tue, 01 Jun 2021 20:47:43 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2021/05/cut_lemons--1-.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>A Pandas dataframe is a two dimensional data structure which allows you to store data in rows and columns. It's very useful when you're analyzing data.</p>
<p>When a dataframe holds a list of data records, you may need to drop specific rows, depending on what your model needs and what you want to learn from the analysis.</p>
<p>In this tutorial, you'll learn how to drop a list of rows from a Pandas dataframe. </p>
<p>To learn how to drop columns, you can read here about <a target="_blank" href="https://www.stackvidhya.com/drop-column-in-pandas/">How to Drop Columns in Pandas</a>. </p>
<h2 id="heading-how-to-drop-a-row-or-column-in-a-pandas-dataframe">How to Drop a Row or Column in a Pandas Dataframe</h2>
<p>To drop a row or column in a dataframe, you need to use the <code>drop()</code> method available in the dataframe. You can read more about the <code>drop()</code> method in the docs <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html">here</a>. </p>
<p><strong>Dataframe Axis</strong></p>
<ul>
<li>Rows are denoted using <code>axis=0</code></li>
<li>Columns are denoted using <code>axis=1</code></li>
</ul>
<p><strong>Dataframe Labels</strong></p>
<ul>
<li>Rows are labelled using the index number starting with 0, by default.</li>
<li>Columns are labelled using names. </li>
</ul>
<p><strong>Drop() Method Parameters</strong></p>
<ul>
<li><code>index</code> - The index labels of the rows to be deleted</li>
<li><code>axis=0</code> - Indicates that the labels passed to <code>drop()</code> refer to rows</li>
<li><code>inplace=True</code> - Performs the drop operation on the same dataframe, rather than creating a new dataframe object during the delete operation.</li>
</ul>
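<p>As a minimal sketch of how these parameters fit together, the snippet below drops two rows without <code>inplace=True</code>. The small <code>items</code> dataframe is a made-up example, not the sample data used in the rest of this tutorial.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical three-row dataframe, used only for illustration
items = pd.DataFrame({"product_name": ["Keyboard", "Mouse", "Monitor"],
                      "Unit_Price": [500, 200, 5000.235]})

# Without inplace=True, drop() returns a new dataframe and leaves `items` untouched
trimmed = items.drop(index=[0, 2])

print(list(trimmed.index))  # [1]
print(len(items))           # 3
</code></pre>
<p>Assigning the result to a new name keeps the original data available, which can be handy while you're still exploring.</p>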
<h3 id="heading-sample-pandas-dataframe">Sample Pandas DataFrame</h3>
<p>Our sample dataframe contains the columns <em>product_name</em>, <em>Unit_Price</em>, <em>No_Of_Units</em>, <em>Available_Quantity</em>, and <em>Available_Since_Date</em>. It also has rows with <code>NaN</code> and <code>NaT</code> values, which are used to denote missing values. </p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

data = {<span class="hljs-string">"product_name"</span>:[<span class="hljs-string">"Keyboard"</span>,<span class="hljs-string">"Mouse"</span>, <span class="hljs-string">"Monitor"</span>, <span class="hljs-string">"CPU"</span>,<span class="hljs-string">"CPU"</span>, <span class="hljs-string">"Speakers"</span>,pd.NaT],
        <span class="hljs-string">"Unit_Price"</span>:[<span class="hljs-number">500</span>,<span class="hljs-number">200</span>, <span class="hljs-number">5000.235</span>, <span class="hljs-number">10000.550</span>, <span class="hljs-number">10000.550</span>, <span class="hljs-number">250.50</span>,<span class="hljs-literal">None</span>],
        <span class="hljs-string">"No_Of_Units"</span>:[<span class="hljs-number">5</span>,<span class="hljs-number">5</span>, <span class="hljs-number">10</span>, <span class="hljs-number">20</span>, <span class="hljs-number">20</span>, <span class="hljs-number">8</span>,pd.NaT],
        <span class="hljs-string">"Available_Quantity"</span>:[<span class="hljs-number">5</span>,<span class="hljs-number">6</span>,<span class="hljs-number">10</span>,<span class="hljs-string">"Not Available"</span>,<span class="hljs-string">"Not Available"</span>, pd.NaT,pd.NaT],
        <span class="hljs-string">"Available_Since_Date"</span>:[<span class="hljs-string">'11/5/2021'</span>, <span class="hljs-string">'4/23/2021'</span>, <span class="hljs-string">'08/21/2021'</span>,<span class="hljs-string">'09/18/2021'</span>,<span class="hljs-string">'09/18/2021'</span>,<span class="hljs-string">'01/05/2021'</span>,pd.NaT]
       }

df = pd.DataFrame(data)

df
</code></pre>
<p>The dataframe will look like this:</p>
<div>

<table>
  <thead>
    <tr>
      <th></th>
      <th>product_name</th>
      <th>Unit_Price</th>
      <th>No_Of_Units</th>
      <th>Available_Quantity</th>
      <th>Available_Since_Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Keyboard</td>
      <td>500.000</td>
      <td>5</td>
      <td>5</td>
      <td>11/5/2021</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Mouse</td>
      <td>200.000</td>
      <td>5</td>
      <td>6</td>
      <td>4/23/2021</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Monitor</td>
      <td>5000.235</td>
      <td>10</td>
      <td>10</td>
      <td>08/21/2021</td>
    </tr>
    <tr>
      <th>3</th>
      <td>CPU</td>
      <td>10000.550</td>
      <td>20</td>
      <td>Not Available</td>
      <td>09/18/2021</td>
    </tr>
    <tr>
      <th>4</th>
      <td>CPU</td>
      <td>10000.550</td>
      <td>20</td>
      <td>Not Available</td>
      <td>09/18/2021</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Speakers</td>
      <td>250.500</td>
      <td>8</td>
      <td>NaT</td>
      <td>01/05/2021</td>
    </tr>
    <tr>
      <th>6</th>
      <td>NaT</td>
      <td>NaN</td>
      <td>NaT</td>
      <td>NaT</td>
      <td>NaT</td>
    </tr>
  </tbody>
</table>
</div>

<p>And just like that we've created our sample dataframe. </p>
<p>After each drop operation, you'll display the dataframe by writing <code>df</code> as the last line of the snippet. In a Jupyter notebook, this renders the dataframe as a regular <code>HTML</code> table. </p>
<p>You can read here about how to <a target="_blank" href="https://www.stackvidhya.com/pretty-print-dataframe/">Pretty Print a Dataframe</a> to print the dataframe in different visual formats. </p>
<p>Next, you'll learn how to drop a list of rows in different use cases. </p>
<h2 id="heading-how-to-drop-a-list-of-rows-by-index-in-pandas">How to Drop a List of Rows by Index in Pandas</h2>
<p>You can delete a list of rows from a Pandas dataframe by passing the list of index labels to the <code>drop()</code> method. </p>
<pre><code class="lang-python">df.drop([<span class="hljs-number">5</span>,<span class="hljs-number">6</span>], axis=<span class="hljs-number">0</span>, inplace=<span class="hljs-literal">True</span>)

df
</code></pre>
<p>In this code,</p>
<ul>
<li><code>[5,6]</code> is the list of index labels of the rows you want to delete</li>
<li><code>axis=0</code> denotes that rows should be deleted from the dataframe</li>
<li><code>inplace=True</code> performs the drop operation in the same dataframe</li>
</ul>
<p>After dropping rows with the index 5 and 6, you'll have the below data in the dataframe:</p>
<div>

<table>
  <thead>
    <tr>
      <th></th>
      <th>product_name</th>
      <th>Unit_Price</th>
      <th>No_Of_Units</th>
      <th>Available_Quantity</th>
      <th>Available_Since_Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Keyboard</td>
      <td>500.000</td>
      <td>5</td>
      <td>5</td>
      <td>11/5/2021</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Mouse</td>
      <td>200.000</td>
      <td>5</td>
      <td>6</td>
      <td>4/23/2021</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Monitor</td>
      <td>5000.235</td>
      <td>10</td>
      <td>10</td>
      <td>08/21/2021</td>
    </tr>
    <tr>
      <th>3</th>
      <td>CPU</td>
      <td>10000.550</td>
      <td>20</td>
      <td>Not Available</td>
      <td>09/18/2021</td>
    </tr>
    <tr>
      <th>4</th>
      <td>CPU</td>
      <td>10000.550</td>
      <td>20</td>
      <td>Not Available</td>
      <td>09/18/2021</td>
    </tr>
  </tbody>
</table>
</div>

<p>This is how you can delete rows with a specific index. </p>
<p>Next, you'll learn about dropping a range of indices. </p>
<h2 id="heading-how-to-drop-rows-by-index-range-in-pandas">How to Drop Rows by Index Range in Pandas</h2>
<p>You can also drop a list of rows within a specific range. </p>
<p>A range is a set of values with a lower limit and an upper limit. </p>
<p>This may be useful in cases where you want to create a sample dataset excluding specific ranges of data. </p>
<p>You can select a range of rows in a dataframe by slicing the <code>df.index</code> attribute. Then you can pass this range to the <code>drop()</code> method to drop the rows as shown below. </p>
<pre><code class="lang-python">df.drop(df.index[<span class="hljs-number">2</span>:<span class="hljs-number">4</span>], inplace=<span class="hljs-literal">True</span>)

df
</code></pre>
<p>Here's what this code is doing:</p>
<ul>
<li><code>df.index[2:4]</code> selects the index labels at positions 2 up to, but not including, 4. The lower limit of the slice is inclusive and the upper limit is exclusive. This means that rows 2 and 3 will be deleted and row 4 will <em>not</em> be deleted. </li>
<li><code>inplace=True</code> performs the drop operation in the same dataframe</li>
</ul>
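<p>Slicing <code>df.index</code> works by position, not by label. The difference is easier to see when the index labels are not the default numbers. Here is a small sketch, assuming a made-up <code>stock</code> dataframe with string labels:</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical dataframe whose index labels are strings rather than 0, 1, 2, ...
stock = pd.DataFrame({"No_Of_Units": [5, 5, 10, 20]},
                     index=["a", "b", "c", "d"])

# stock.index[1:3] slices by position, so it picks the labels at positions 1 and 2
labels_to_drop = stock.index[1:3]
print(list(labels_to_drop))   # ['b', 'c']

remaining = stock.drop(labels_to_drop)
print(list(remaining.index))  # ['a', 'd']
</code></pre>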
<p>After dropping rows 2 and 3, you'll have the below data in the dataframe:</p>
<div>

<table>
  <thead>
    <tr>
      <th></th>
      <th>product_name</th>
      <th>Unit_Price</th>
      <th>No_Of_Units</th>
      <th>Available_Quantity</th>
      <th>Available_Since_Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Keyboard</td>
      <td>500.00</td>
      <td>5</td>
      <td>5</td>
      <td>11/5/2021</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Mouse</td>
      <td>200.00</td>
      <td>5</td>
      <td>6</td>
      <td>4/23/2021</td>
    </tr>
    <tr>
      <th>4</th>
      <td>CPU</td>
      <td>10000.55</td>
      <td>20</td>
      <td>Not Available</td>
      <td>09/18/2021</td>
    </tr>
  </tbody>
</table>
</div>

<p>This is how you can drop the list of rows in the dataframe using its range. </p>
<h2 id="heading-how-to-drop-all-rows-after-an-index-in-pandas">How to Drop All Rows after an Index in Pandas</h2>
<p>You can drop all rows after a specific index by using <code>iloc[]</code>. </p>
<p>You can use <code>iloc[]</code> to select rows by their integer position. You can specify the start and end positions separated by a <code>:</code>. The start is inclusive and the end is exclusive, so, for example, you'd use <code>2:4</code> to select the rows at positions 2 and 3. If you want to select all the rows, you can just use <code>:</code> in <code>iloc[]</code>. </p>
<p>This may be useful in cases where you want to split the dataset for training and testing purposes. </p>
<p>Use the below snippet to select the rows at positions 0 and 1. This results in dropping every row from position 2 onward. </p>
<pre><code class="lang-python">df = df.iloc[:<span class="hljs-number">2</span>]

df
</code></pre>
<p>In this code, <code>:2</code> selects the rows up to, but not including, position 2. </p>
<p>After dropping the rows from position 2 onward, you'll have the below data in the dataframe:</p>
<div>

<table>
  <thead>
    <tr>
      <th></th>
      <th>product_name</th>
      <th>Unit_Price</th>
      <th>No_Of_Units</th>
      <th>Available_Quantity</th>
      <th>Available_Since_Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Keyboard</td>
      <td>500.0</td>
      <td>5</td>
      <td>5</td>
      <td>11/5/2021</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Mouse</td>
      <td>200.0</td>
      <td>5</td>
      <td>6</td>
      <td>4/23/2021</td>
    </tr>
  </tbody>
</table>
</div>

<p>This is how you can drop rows after a specific index. </p>
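<p>Selecting with <code>iloc[]</code> and dropping with <code>drop()</code> are two routes to the same result. The sketch below compares them on a made-up <code>prices</code> dataframe (the values are only for illustration):</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical five-row dataframe
prices = pd.DataFrame({"Unit_Price": [500, 200, 5000.235, 10000.55, 250.5]})

# Keep the rows at positions 0 and 1 ...
kept_with_iloc = prices.iloc[:2]

# ... or drop every row from position 2 onward
kept_with_drop = prices.drop(prices.index[2:])

print(list(kept_with_iloc.index))  # [0, 1]
print(list(kept_with_drop.index))  # [0, 1]
</code></pre>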
<p>Next, you'll learn how to drop rows with conditions. </p>
<h2 id="heading-how-to-drop-rows-with-multiple-conditions-in-pandas">How to Drop Rows with Multiple Conditions in Pandas</h2>
<p>You can drop rows in the dataframe based on specific conditions. </p>
<p>For example, you can drop rows where the column value is greater than <em>X</em> and less than <em>Y</em>. </p>
<p>This may be useful in cases where you want to create a dataset that ignores rows with specific values. </p>
<p>To drop rows based on certain conditions, select the index of the rows which pass the specific condition and pass that index to the <code>drop()</code> method.  </p>
<pre><code class="lang-python">df.drop(df[(df[<span class="hljs-string">'Unit_Price'</span>] &gt;<span class="hljs-number">400</span>) &amp; (df[<span class="hljs-string">'Unit_Price'</span>] &lt; <span class="hljs-number">600</span>)].index, inplace=<span class="hljs-literal">True</span>)

df
</code></pre>
<p>In this code, </p>
<ul>
<li><code>(df['Unit_Price'] &gt;400) &amp; (df['Unit_Price'] &lt; 600)</code> is the condition to drop the rows. </li>
<li><code>df[].index</code> selects the index of the rows which pass the condition. </li>
<li><code>inplace=True</code> performs the drop operation in the same dataframe rather than creating a new one.</li>
</ul>
<p>After dropping the rows where <code>Unit_Price</code> is greater than 400 and less than 600, you'll have the below data in the dataframe:</p>
<div>

<table>
  <thead>
    <tr>
      <th></th>
      <th>product_name</th>
      <th>Unit_Price</th>
      <th>No_Of_Units</th>
      <th>Available_Quantity</th>
      <th>Available_Since_Date</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>Mouse</td>
      <td>200.0</td>
      <td>5</td>
      <td>6</td>
      <td>4/23/2021</td>
    </tr>
  </tbody>
</table>
</div>

<p>This is how you can drop rows in the dataframe using certain conditions. </p>
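<p>Another common pattern is to build a boolean mask and keep only the rows that do <em>not</em> match it. This is a sketch on a made-up <code>catalog</code> dataframe. Note that <code>between()</code> includes both endpoints by default, so it is not an exact replacement for the strict comparisons used in the <code>drop()</code> example.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical two-row dataframe
catalog = pd.DataFrame({"product_name": ["Keyboard", "Mouse"],
                        "Unit_Price": [500.0, 200.0]})

# True for rows whose price lies in the range 400 to 600 (endpoints included)
in_range = catalog["Unit_Price"].between(400, 600)

# ~ inverts the mask, so only rows outside the range are kept
filtered = catalog[~in_range]

print(filtered["product_name"].tolist())  # ['Mouse']
</code></pre>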
<h2 id="heading-conclusion">Conclusion</h2>
<p>To summarize, in this article you've learnt what the <code>drop()</code> method is in a Pandas dataframe. You've also seen how dataframe rows and columns are labelled. And finally you've learnt how to drop rows using indices, a range of indices, and based on conditions. </p>
<p>If you liked this article, feel free to share it. </p>
<h3 id="heading-you-may-also-like">You May Also Like</h3>
<ul>
<li><a target="_blank" href="https://www.stackvidhya.com/add-column-to-dataframe/">How to Add a Column to a Dataframe in Pandas</a></li>
<li><a target="_blank" href="https://www.stackvidhya.com/rename-column-in-pandas/">How to Rename a Column in Pandas</a></li>
</ul>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Get Started with Pandas in Python – a Beginner's Guide ]]>
                </title>
                <description>
                    <![CDATA[ By Suchandra Datta The Pandas package in Python gives you a bunch of cool functions and features that help you manipulate data more efficiently. It also lets you perform numerous data cleaning and data preprocessing steps with very little hassle.  Th... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/python-pandas-functions/</link>
                <guid isPermaLink="false">66d8526dafbaabf7a144af17</guid>
                
                    <category>
                        <![CDATA[ beginners guide ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analysis ]]>
                    </category>
                
                    <category>
                        <![CDATA[ data analytics ]]>
                    </category>
                
                    <category>
                        <![CDATA[ pandas ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Tue, 09 Mar 2021 00:48:41 +0000</pubDate>
                <media:content url="https://cdn-media-2.freecodecamp.org/w1280/6040d911a7946308b768178e.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Suchandra Datta</p>
<p>The Pandas package in Python gives you a bunch of cool functions and features that help you manipulate data more efficiently. It also lets you perform numerous data cleaning and data preprocessing steps with very little hassle. </p>
<p>That's great, isn't it? Here's a list of some of the most frequently used Pandas functions and tricks to help you enjoy your data science journey. </p>
<h2 id="heading-how-to-remove-missing-values-in-dataframe">How to Remove Missing Values in DataFrame</h2>
<p>Getting rid of missing values is one of the most common tasks in data cleaning. Missing values could be just across one row or column or across multiple rows and columns. </p>
<p>Depending on your application and problem domain, you can use different approaches to handle missing data – like interpolation, substituting with the mean, or simply removing the rows with missing values. </p>
<p>Pandas offers the <code>dropna</code> function which removes all rows (for axis=0) or all columns (for axis=1) where missing values are present. Some of the arguments for the dropna function are as follows:</p>
<ul>
<li><code>axis</code> which specifies if rows are to be dropped (axis=0) or if columns are to be dropped (axis=1)</li>
<li><code>subset</code> which specifies a list of columns to consider for missing values when axis=0</li>
<li><code>inplace</code> which specifies if changes are to be made in the existing DataFrame itself</li>
</ul>
<p>Check out the docs linked <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html">here</a> for more in-depth coverage. </p>
<p>In the example below, we're creating a small DataFrame with missing values and then discarding rows with missing values in any column.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/03/image-4.png" alt="Image" width="600" height="400" loading="lazy">
<em>Drop missing values in Pandas</em></p>
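<p>A minimal sketch of the same idea in text form. The <code>projects</code> dataframe, the <code>Project_Type</code> column name, and all values here are assumptions made for illustration and are not taken from the screenshot.</p>
<pre><code class="lang-python">import numpy as np
import pandas as pd

# Hypothetical projects dataframe; column names and values are assumptions
projects = pd.DataFrame({"Project_Type": ["Web", np.nan, "Mobile", "Web"],
                         "Hours_Worked": [12, 8, np.nan, 5]})

# Drop every row that has a missing value in any column
complete_rows = projects.dropna(axis=0)

# Only look at Hours_Worked when deciding which rows to drop
with_hours = projects.dropna(subset=["Hours_Worked"])

print(len(complete_rows), len(with_hours))  # 2 3
</code></pre>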
<h2 id="heading-how-to-remove-duplicates-in-dataframe">How to Remove Duplicates in DataFrame</h2>
<p>Another common data cleaning task is removing duplicate rows. The <code>drop_duplicates</code> function performs this with arguments similar to <code>dropna</code> such as:</p>
<ul>
<li><code>subset</code>, which specifies a subset of columns to consider when looking for duplicate values</li>
<li><code>inplace</code></li>
<li><code>keep</code>, which specifies which duplicated values to keep. It can be equal to <code>"first"</code>, <code>"last"</code>, or <code>False</code> to drop all duplicates.</li>
</ul>
<p>Check out the docs linked <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html">here</a> for more detailed info. </p>
<p>Let's duplicate a few rows and remove them from our dataset:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/03/image-5.png" alt="Image" width="600" height="400" loading="lazy">
<em>Drop duplicate values in Pandas</em></p>
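<p>Here is a rough sketch of how <code>keep</code> and <code>subset</code> change the result. The dataframe below is a made-up example with assumed column names.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical dataframe in which rows 0 and 1 are exact duplicates
projects = pd.DataFrame({"Project_Type": ["Web", "Web", "Mobile", "Web"],
                         "Hours_Worked": [12, 12, 8, 5]})

# Default keep="first": the first copy of each duplicated row survives
first_kept = projects.drop_duplicates()

# keep=False: every row that has a duplicate is removed
none_kept = projects.drop_duplicates(keep=False)

# Compare on Project_Type only and keep the last row of each type
by_type = projects.drop_duplicates(subset=["Project_Type"], keep="last")

print(list(first_kept.index))  # [0, 2, 3]
print(list(none_kept.index))   # [2, 3]
print(list(by_type.index))     # [2, 3]
</code></pre>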
<h2 id="heading-how-to-remove-rows-with-column-specific-values">How to Remove Rows with Column-specific Values</h2>
<p>Suppose we want to keep only those rows where project type is Web or where the number of hours worked is equal to 12. Here's how we can do it. </p>
<p>Using this method, we can filter out rows based on certain specific column values:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/03/image-8.png" alt="Image" width="600" height="400" loading="lazy">
<em>Remove rows with column specific values</em></p>
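<p>One way to write such a filter is with a boolean mask, sketched below. The column name <code>Project_Type</code> and the data are assumptions, since the original example is only available as an image.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical projects dataframe; column names and values are assumptions
projects = pd.DataFrame({"Project_Type": ["Web", "Mobile", "Data", "Mobile"],
                         "Hours_Worked": [5, 12, 8, 3]})

# True where the project type is Web OR the hours worked equal 12
keep = (projects["Project_Type"] == "Web") | (projects["Hours_Worked"] == 12)

# Indexing with the mask keeps only the matching rows
filtered = projects[keep]

print(list(filtered.index))  # [0, 1]
</code></pre>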
<h2 id="heading-how-to-convert-dataframes-to-json">How to Convert DataFrames to JSON</h2>
<p>DataFrames are super cool optimized structures that are great to work with. And JSON is one of the most popular data formats for seamless data exchange. </p>
<p>Let's convert our DataFrame to JSON using <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html"><code>to_json</code></a> which accepts optional arguments like:</p>
<ul>
<li><code>orient</code>, which specifies what the key and value pairs should be. For a DataFrame the default is <code>columns</code>, so each column name is a key and its value is a mapping from row index to cell value.</li>
<li><code>date_format</code> which specifies the format of the date. The default is epoch. </li>
</ul>
<p>Look at the example below:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/03/image-11.png" alt="Image" width="600" height="400" loading="lazy">
<em>Convert DataFrame to JSON</em></p>
<p>We can see that <code>to_json</code> has returned a string with the following schema:</p>
<pre><code>column_0 :
{ <span class="hljs-attr">row_index_0</span>: column_value_0, <span class="hljs-attr">row_index_1</span>:column_value_1, ...}, 
<span class="hljs-attr">column_1</span>:
{ <span class="hljs-attr">row_index_0</span>: column_value_0, <span class="hljs-attr">row_index_1</span>:column_value_1, ...}, 
...
column_N:
{ <span class="hljs-attr">row_index_0</span>: column_value_0, <span class="hljs-attr">row_index_1</span>:column_value_1, ...}
</code></pre><p>If we want to convert each row to a dictionary, we need to specify <code>orient="records"</code> and parse the resulting string using the <code>json</code> module.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/03/image-12.png" alt="Image" width="600" height="400" loading="lazy">
<em>Convert DataFrame to JSON with orient=records</em></p>
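<p>The two orientations can be compared side by side with a short sketch. The two-row dataframe below is made up, and <code>Project_Type</code> is an assumed column name.</p>
<pre><code class="lang-python">import json
import pandas as pd

# Hypothetical two-row dataframe
projects = pd.DataFrame({"Project_Type": ["Web", "Mobile"],
                         "Hours_Worked": [12, 8]})

# Default orient: one key per column, each mapping row index to value
by_column = json.loads(projects.to_json())

# orient="records": a list with one dictionary per row
by_row = json.loads(projects.to_json(orient="records"))

print(by_column)
# {'Project_Type': {'0': 'Web', '1': 'Mobile'}, 'Hours_Worked': {'0': 12, '1': 8}}
print(by_row)
# [{'Project_Type': 'Web', 'Hours_Worked': 12}, {'Project_Type': 'Mobile', 'Hours_Worked': 8}]
</code></pre>
<p>Notice that the row index becomes a string key in the default orientation, because JSON object keys are always strings.</p>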
<h2 id="heading-how-to-count-the-number-of-unique-values-in-a-column">How to Count the Number of Unique Values in a Column</h2>
<p>Let's say we want to know how many different project types exist. We can get that information using the <code>nunique</code> function.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/03/image-13.png" alt="Image" width="600" height="400" loading="lazy">
<em>Count number of unique values in a column</em></p>
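<p>In text form, the call might look like the sketch below. The data and the <code>Project_Type</code> column name are assumptions.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical projects dataframe
projects = pd.DataFrame({"Project_Type": ["Web", "Mobile", "Web", "Mobile"]})

# nunique() counts the distinct values in the column
type_count = projects["Project_Type"].nunique()

print(type_count)  # 2
</code></pre>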
<h2 id="heading-how-to-save-dataframe-as-csv-file">How to Save DataFrame as .csv File</h2>
<p>Just one line of code is required to save the DataFrame as a csv file:</p>
<pre><code>dataset.to_csv(<span class="hljs-string">"save_as_csv.csv"</span>)
</code></pre><h2 id="heading-how-to-save-multiple-lists-as-one-csv-file">How to Save Multiple Lists as One .csv File</h2>
<p>Suppose we have three separate lists as our data source and we want to save them together in one csv file. This just involves two steps:</p>
<ul>
<li>combining the lists into tuples using <code>zip</code>, </li>
<li>and then converting the result to a list.</li>
</ul>
<p>In the example below, we follow this approach to convert the three lists into one DataFrame which we can now save as a .csv file.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/03/image-15.png" alt="Image" width="600" height="400" loading="lazy">
<em>Save multiple lists as one csv file</em></p>
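<p>A minimal sketch of these two steps, using three made-up lists. The list contents and column names are assumptions. To keep the example self-contained, <code>to_csv</code> is called without a path so that it returns the CSV text. Passing a file name instead would write the file to disk.</p>
<pre><code class="lang-python">import pandas as pd

# Three hypothetical lists of equal length
names = ["Alpha", "Beta", "Gamma"]
types = ["Web", "Mobile", "Web"]
hours = [12, 8, 5]

# zip pairs up the items position by position; list() materializes the tuples
rows = list(zip(names, types, hours))

combined = pd.DataFrame(rows, columns=["Project_Name", "Project_Type", "Hours_Worked"])

# index=False leaves the row index out of the CSV
csv_text = combined.to_csv(index=False)
print(csv_text)
</code></pre>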
<h2 id="heading-how-to-read-dataframes-in-a-memory-efficient-way">How to Read DataFrames in a Memory Efficient Way</h2>
<p>Often we need to read files which are so large that they can't fit into memory. For such mammoth datasets, we use a different approach. </p>
<p>We pass a parameter called <code>chunksize</code> to <code>read_csv</code>, which specifies how many rows of the file we want to read at a time, let's say 4 rows. Instead of a DataFrame, we then get back a <code>TextFileReader</code> object that we can loop over. So we read 4 rows at a time, perform some tasks on that chunk, and move on to the next 4 rows. </p>
<p>Small chunks are more likely to fit into memory than the entire file of thousands of rows. The following example shows how chunking works.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/03/image-16.png" alt="Image" width="600" height="400" loading="lazy">
<em>Read DataFrame in a memory efficient manner</em></p>
<p>Here we read the <code>california</code> dataset 1000 rows at a time, remove all rows where <code>median_income</code> is less than or equal to 3, and append these reduced chunks together to make a smaller dataset. </p>
<p>You can save more memory by reading only those columns which you need and specifying smaller datatypes for columns as described in detail in the docs linked <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html">here</a>.</p>
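<p>The chunked approach can be sketched as follows. A tiny in-memory CSV with made-up values stands in for a large file, and the chunk size of 2 is chosen only to keep the example short.</p>
<pre><code class="lang-python">import io
import pandas as pd

# Small in-memory CSV standing in for a large file; the values are made up
csv_source = io.StringIO("median_income\n1.5\n4.2\n3.0\n6.1\n2.9\n5.0\n")

reduced_chunks = []

# With chunksize set, read_csv returns an iterator that yields one dataframe per chunk
for chunk in pd.read_csv(csv_source, chunksize=2):
    # Keep only the rows whose median_income is greater than 3
    reduced_chunks.append(chunk[chunk["median_income"].gt(3)])

# Stitch the reduced chunks back together into one smaller dataframe
reduced = pd.concat(reduced_chunks, ignore_index=True)

print(reduced["median_income"].tolist())  # [4.2, 6.1, 5.0]
</code></pre>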
<h2 id="heading-how-to-change-all-values-in-a-dataframe-using-apply">How to Change All Values in a DataFrame Using <code>apply</code></h2>
<p>Let's go back to our example of a projects DataFrame to illustrate this. We focus on the <code>Hours_Worked</code> column, increasing the count by 1 if it's an even number and by 2 if it's an odd number. We use a lambda function for this purpose.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2021/03/image-17.png" alt="Image" width="600" height="400" loading="lazy">
<em>Change all values in a DataFrame using apply</em></p>
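<p>As a rough text version of this step, the sketch below applies the even/odd rule to a made-up <code>Hours_Worked</code> column. The values are assumptions.</p>
<pre><code class="lang-python">import pandas as pd

# Hypothetical projects dataframe
projects = pd.DataFrame({"Hours_Worked": [12, 8, 5]})

# Add 1 to even values and 2 to odd values
projects["Hours_Worked"] = projects["Hours_Worked"].apply(
    lambda hours: hours + 1 if hours % 2 == 0 else hours + 2
)

print(projects["Hours_Worked"].tolist())  # [13, 9, 7]
</code></pre>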
<h2 id="heading-conclusion">Conclusion</h2>
<p>Pandas is a powerful package which can seem daunting sometimes due to its vastness. This is why I tried to list out some of the most useful functions I've come across. </p>
<p>These Pandas functions will help you accelerate your data analysis endeavors. Thank you for your time and I hope you enjoyed reading this article. </p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
